Class JsonLoaderImpl

java.lang.Object
org.apache.drill.exec.store.easy.json.loader.JsonLoaderImpl
All Implemented Interfaces:
JsonLoader, ErrorFactory
Direct Known Subclasses:
KafkaJsonLoader

public class JsonLoaderImpl extends Object implements JsonLoader, ErrorFactory
Revised JSON loader that is based on the ResultSetLoader abstraction. Uses the listener-based JsonStructureParser to walk the JSON tree in a "streaming" fashion, calling events which this class turns into vector write operations. Listeners handle options such as all text mode vs. type-specific parsing. Think of this implementation as a listener-based recursive-descent parser.

The JSON loader mechanism runs two state machines intertwined:

  1. The actual parser, represented by the JsonStructureParser, which parses each JSON object, array or scalar according to its inferred type.
  2. The type discovery machine, represented by this class, which is made complex because JSON may include long runs of nulls.

Schema Discovery

Fields are discovered on the fly. Types are inferred from the first JSON token for a field. Type inference is less than perfect: it cannot handle type changes such as first seeing 10, then 12.5, or first seeing "100", then 200.
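
The limitation can be illustrated with a minimal sketch of first-token inference. This is not Drill's code: the class `FirstTokenInference`, its `JsonType` enum, and the `accept` method are hypothetical names, and Java boxed values stand in for JSON tokens.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of first-token type inference; not a Drill class.
class FirstTokenInference {

  enum JsonType { INTEGER, FLOAT, STRING, BOOLEAN }

  // Column name -> type fixed by the first value seen for that column.
  private final Map<String, JsonType> schema = new LinkedHashMap<>();

  // Classify one scalar value, roughly as a JSON tokenizer would.
  static JsonType typeOf(Object value) {
    if (value instanceof Integer || value instanceof Long) {
      return JsonType.INTEGER;
    } else if (value instanceof Float || value instanceof Double) {
      return JsonType.FLOAT;
    } else if (value instanceof Boolean) {
      return JsonType.BOOLEAN;
    } else if (value instanceof String) {
      return JsonType.STRING;
    }
    // Nulls are deliberately not handled here; they need separate logic.
    throw new IllegalArgumentException("Unsupported value: " + value);
  }

  // Returns true if the value matches the column type.
  // The first value seen for a column defines that type.
  boolean accept(String column, Object value) {
    JsonType seen = typeOf(value);
    JsonType fixed = schema.putIfAbsent(column, seen);
    return fixed == null || fixed == seen;
  }

  JsonType typeFor(String column) {
    return schema.get(column);
  }
}
```

In this sketch, `accept("a", 10)` fixes column `a` as INTEGER, so a later `accept("a", 12.5)` reports a mismatch. The same happens for `"100"` followed by `200`.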

When a field first contains null or an empty list, "null deferral" logic adds a special state that "waits" for an actual data type to present itself. This allows the parser to handle a series of nulls, empty arrays, or arrays of nulls (when using lists) at the start of the file. If no type ever appears, the loader forces the field to "text mode", hoping that the field is scalar.
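
A rough sketch of the deferral idea follows, under stated assumptions: `DeferredNullColumn` and its `State` enum are hypothetical and much simpler than the loader's actual null markers, and only integers and text are modeled.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch of "null deferral" for one column; not Drill's API.
class DeferredNullColumn {

  enum State { DEFERRED, INTEGER, TEXT }

  private State state = State.DEFERRED;
  private int pendingNulls = 0;                  // nulls seen before a type is known
  private final List<Object> values = new ArrayList<>();

  void write(Object value) {
    if (value == null) {
      if (state == State.DEFERRED) {
        pendingNulls++;                          // keep waiting for a real type
      } else {
        values.add(null);
      }
      return;
    }
    if (state == State.DEFERRED) {
      // The first non-null value decides the type.
      resolve(value instanceof Integer ? State.INTEGER : State.TEXT);
    }
    if (state == State.INTEGER && !(value instanceof Integer)) {
      throw new IllegalStateException("Type changed after resolution: " + value);
    }
    values.add(state == State.TEXT ? String.valueOf(value) : value);
  }

  // If no type ever appeared in the batch, fall back to text.
  void endBatch() {
    if (state == State.DEFERRED) {
      resolve(State.TEXT);
    }
  }

  // Fix the type and back-fill the nulls seen while waiting.
  private void resolve(State type) {
    state = type;
    for (int i = 0; i < pendingNulls; i++) {
      values.add(null);
    }
    pendingNulls = 0;
  }

  State state() {
    return state;
  }

  List<Object> values() {
    return Collections.unmodifiableList(values);
  }
}
```

Writing two nulls and then `7` leaves the sketch in the INTEGER state with three values. Writing only nulls and then calling `endBatch()` forces TEXT.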

To slightly help the null case, if the projection list shows that a column must be an array or a map, then that information is used to guess the type of a null column.

The code includes a prototype mechanism to provide type hints for columns. At present, it is used only to handle nulls that are never "resolved" by the end of a batch. It would be much better to use the hints (or a full schema) to avoid the large amount of code needed to handle nulls.

Provided Schema

The JSON loader accepts a provided schema which removes type ambiguities. If we have the examples above (runs of nulls, or shifting types), then the provided schema specifies the vector type to create; the individual column listeners attempt to convert each JSON token type to the target vector type. The result is that, if the schema provides the correct type, the loader can ride over ambiguities in the input.
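
The conversion step might look roughly like the sketch below. It is an assumption-laden illustration: `ProvidedSchemaConversion`, `TargetType`, and `convert` are hypothetical names, and the real column listeners cover far more types and error cases.

```java
// Hypothetical sketch of converting JSON values to a schema-provided type.
// Not Drill's listener code.
class ProvidedSchemaConversion {

  enum TargetType { BIGINT, DOUBLE, VARCHAR }

  // Convert one parsed JSON scalar to the type the schema asks for.
  static Object convert(TargetType target, Object token) {
    if (token == null) {
      return null;                               // type is known, so null is unambiguous
    }
    switch (target) {
      case BIGINT:
        if (token instanceof Number) {
          // Fractional values are truncated in this sketch.
          return ((Number) token).longValue();
        }
        return Long.parseLong(token.toString());
      case DOUBLE:
        if (token instanceof Number) {
          return ((Number) token).doubleValue();
        }
        return Double.parseDouble(token.toString());
      case VARCHAR:
        return token.toString();
      default:
        throw new IllegalArgumentException("Unsupported target: " + target);
    }
  }
}
```

With a DOUBLE target, both `10` and `12.5` load as doubles. With a BIGINT target, both `"100"` and `200` load as longs.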

Comparison to Original JSON Reader

This class replaces the JsonReader class used in Drill versions 1.17 and before. Compared with the previous version, this implementation:
  • Materializes parse states as classes rather than as methods and boolean flags as in the prior version.
  • Reports errors as UserException objects, complete with context information, rather than as generic Java exceptions as in the prior version.
  • Moves parse options into a separate JsonLoaderOptions class.
  • Simplifies the iteration protocol: call readBatch() until it returns false. Errors are reported out-of-band via an exception.
  • The result set loader abstraction is perfectly happy with an empty schema. For this reason, this version (unlike the original) does not make up a dummy column if the schema would otherwise be empty.
  • Projection pushdown is handled by the ResultSetLoader rather than the JSON loader. This class always creates a vector writer, but the result set loader will return a dummy (no-op) writer for non-projected columns.
  • Like the original version, this version "free-wheels" over unprojected objects and arrays, watching only for matching brackets and ignoring all else.
  • Writes boolean values as SmallInt values, rather than as bits in the prior version.
  • This version also "free-wheels" over all unprojected values. If the user finds that they have inconsistent data in some field f, then the user can project fields except f; Drill will ignore the inconsistent values in f.
  • Because of this free-wheeling capability, this version does not need a "counting" reader; this same reader handles the case in which no fields are projected for SELECT COUNT(*) queries.
  • Runs of null values result in a "deferred null state" that patiently waits for an actual value token to appear, and only then "realizes" a parse state for that type.
  • Provides the same limited error recovery as the original version. See DRILL-4653 and DRILL-5953.
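
The "free-wheeling" behavior in the list above amounts to counting bracket depth while discarding tokens. A minimal sketch, assuming a simplified `Token` enum in place of Jackson's token types (`FreeWheelSkipper` and `skipValue` are hypothetical names):

```java
import java.util.Iterator;

// Hypothetical sketch of skipping an unprojected JSON value by matching
// brackets only. The Token enum is a simplified stand-in, not Jackson's.
class FreeWheelSkipper {

  enum Token { START_OBJECT, END_OBJECT, START_ARRAY, END_ARRAY, FIELD_NAME, SCALAR }

  // Consume exactly one complete value (scalar, object, or array) and
  // return the number of tokens consumed. Contents are never inspected.
  static int skipValue(Iterator<Token> tokens) {
    int depth = 0;
    int consumed = 0;
    while (tokens.hasNext()) {
      Token token = tokens.next();
      consumed++;
      switch (token) {
        case START_OBJECT:
        case START_ARRAY:
          depth++;
          break;
        case END_OBJECT:
        case END_ARRAY:
          depth--;
          break;
        default:
          break;                                 // names and scalars are ignored
      }
      if (depth < 0) {
        throw new IllegalStateException("Closing bracket without an opening one");
      }
      if (depth == 0) {
        return consumed;                         // the value is complete
      }
    }
    throw new IllegalStateException("Unbalanced brackets in skipped value");
  }
}
```

Because nothing inside the skipped value is type-checked, inconsistent data in an unprojected field never reaches a vector writer in this sketch.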
  • Constructor Details

  • Method Details

    • options

      public JsonLoaderOptions options()
    • parser

      public JsonStructureParser parser()
    • fieldFactory

      public FieldFactory fieldFactory()
    • listenerColumnMap

      public Map<String,Object> listenerColumnMap()
    • readBatch

      public boolean readBatch()
      Description copied from interface: JsonLoader
      Read one batch of row data.
      Specified by:
      readBatch in interface JsonLoader
      Returns:
      true if at least one record was loaded, false if EOF.
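
A caller's loop over this method could be sketched as follows. To stay self-contained, the sketch uses a hypothetical `BatchLoader` interface that mirrors only the `readBatch()` and `close()` methods documented on this page; `ReadBatchLoop`, `drain`, and `fixedBatches` are likewise illustrative names, not Drill classes.

```java
// Hypothetical sketch of the readBatch() iteration protocol.
class ReadBatchLoop {

  // Stand-in for the two JsonLoader methods used here (an assumption,
  // declared locally so the example compiles without Drill).
  interface BatchLoader {
    boolean readBatch();
    void close();
  }

  // Read until readBatch() returns false, then release resources.
  // Errors surface as unchecked exceptions, so no status codes are checked.
  static int drain(BatchLoader loader) {
    int batches = 0;
    try {
      while (loader.readBatch()) {
        batches++;                               // a real caller would harvest the batch here
      }
    } finally {
      loader.close();
    }
    return batches;
  }

  // Fake loader that yields a fixed number of batches, for demonstration.
  static BatchLoader fixedBatches(int count) {
    return new BatchLoader() {
      private int remaining = count;

      @Override
      public boolean readBatch() {
        return remaining-- > 0;
      }

      @Override
      public void close() {
        remaining = 0;
      }
    };
  }

  public static void main(String[] args) {
    System.out.println("Batches read: " + drain(fixedBatches(3)));
  }
}
```

The `finally` block reflects the note under close(): the loader releases its own resources, including the input stream, whether or not a read fails.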
    • addNullMarker

      public void addNullMarker(org.apache.drill.exec.store.easy.json.loader.JsonLoaderImpl.NullTypeMarker marker)
    • removeNullMarker

      public void removeNullMarker(org.apache.drill.exec.store.easy.json.loader.JsonLoaderImpl.NullTypeMarker marker)
    • endBatch

      protected void endBatch()
      Finish reading a batch of data. We may have pending "null" columns: a column for which we've seen only nulls, or an array that has always been empty. The batch needs to finish, and needs a type, but we still don't know the type. Since we must decide on one, we do the following: guess Varchar, and switch to text mode.

      This choice is not perfect. Switching to text mode means results will vary from run to run depending on the order in which we see empty and non-empty values for this column. Plus, since the system is distributed, the decision made here may conflict with that made in some other fragment.

      The only real solution is for the user to provide a schema.

      Bottom line: the user is responsible for not giving Drill ambiguous data that would require Drill to predict the future.

    • close

      public void close()
      Description copied from interface: JsonLoader
      Releases resources held by this class including the input stream. Does not close the result set loader passed into this instance.
      Specified by:
      close in interface JsonLoader
    • parseError

      public RuntimeException parseError(String msg, com.fasterxml.jackson.core.JsonParseException e)
      Description copied from interface: ErrorFactory
      The Jackson JSON parser failed to start on the input file.
      Specified by:
      parseError in interface ErrorFactory
    • ioException

      public RuntimeException ioException(IOException e)
      Description copied from interface: ErrorFactory
      I/O error reported from the Jackson JSON parser.
      Specified by:
      ioException in interface ErrorFactory
    • structureError

      public RuntimeException structureError(String msg)
      Description copied from interface: ErrorFactory
      General structure-level error: something very unusual occurred in the JSON that passed Jackson, but failed in the structure parser.
      Specified by:
      structureError in interface ErrorFactory
    • syntaxError

      public RuntimeException syntaxError(com.fasterxml.jackson.core.JsonParseException e)
      Description copied from interface: ErrorFactory
      The Jackson parser reported a syntax error. Will not occur if recovery is enabled.
      Specified by:
      syntaxError in interface ErrorFactory
    • typeError

      Description copied from interface: ErrorFactory
      The Jackson parser reported an error when trying to convert a value to a specific type. Should never occur since we only convert to the type that Jackson itself identified.
      Specified by:
      typeError in interface ErrorFactory
    • syntaxError

      public RuntimeException syntaxError(com.fasterxml.jackson.core.JsonToken token)
      Description copied from interface: ErrorFactory
      Received an unexpected token. Should never occur as the Jackson parser itself catches errors.
      Specified by:
      syntaxError in interface ErrorFactory
    • unrecoverableError

      public RuntimeException unrecoverableError()
      Description copied from interface: ErrorFactory
      Error recovery is on, the structure parser tried to recover, but encountered too many other errors and gave up.
      Specified by:
      unrecoverableError in interface ErrorFactory
    • typeConversionError

      public UserException typeConversionError(ColumnMetadata schema, ValueDef valueDef)
    • typeConversionError

      public UserException typeConversionError(ColumnMetadata schema, String tokenType)
    • dataConversionError

      public UserException dataConversionError(ColumnMetadata schema, String tokenType, String value)
    • nullDisallowedError

      public UserException nullDisallowedError(ColumnMetadata schema)
    • unsupportedType

      public UserException unsupportedType(ColumnMetadata schema)
    • unsupportedJsonTypeException

      public UserException unsupportedJsonTypeException(String key, ValueDef.JsonType jsonType)
    • messageParseError

      Description copied from interface: ErrorFactory
      Parser is configured to find a message tag within the JSON and a syntax error occurred when following the data path.
      Specified by:
      messageParseError in interface ErrorFactory
    • buildError

      public UserException buildError(ColumnMetadata schema, UserException.Builder builder)
    • buildError

      protected UserException buildError(UserException.Builder builder)