public class TupleParser extends ObjectParser
Accepts { name : value ... }

The structure parser maintains a map of known fields. Each time a field is parsed, the parser looks up the field in the map. If the field is not found, the parser looks ahead to find a value token, if any, and calls this class to add a new column. This class creates a column writer based either on the type given in a provided schema or on the type inferred from the JSON token.
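The lookup-then-define flow described above can be sketched as follows. This is a minimal illustration, not Drill's code: the class `FieldMapSketch`, its methods, and the use of plain strings for type names are all hypothetical, and deferral of null values is not shown.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch of a known-fields map; none of these names are Drill's. */
class FieldMapSketch {
  // Optional provided schema: field name -> type name (illustrative strings).
  private final Map<String, String> providedSchema;
  // Fields already defined for this tuple.
  private final Map<String, String> knownFields = new HashMap<>();

  FieldMapSketch(Map<String, String> providedSchema) {
    this.providedSchema = providedSchema;
  }

  /** Returns the column type for a field, defining the column on first sight. */
  String onField(String key, Object firstValue) {
    String type = knownFields.get(key);
    if (type == null) {
      // Prefer the provided schema; otherwise infer from the first value seen.
      type = providedSchema.containsKey(key)
          ? providedSchema.get(key)
          : inferType(firstValue);
      knownFields.put(key, type);
    }
    return type;
  }

  /** Maps a Java value to an illustrative type name; anything else becomes VARCHAR. */
  static String inferType(Object value) {
    if (value instanceof Boolean) return "BIT";
    if (value instanceof Integer || value instanceof Long) return "BIGINT";
    if (value instanceof Double) return "FLOAT8";
    return "VARCHAR";
  }
}
```

Once a field is in the map, later values do not change its type in this sketch; the first definition wins.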

As it turns out, most of the semantic action occurs at the tuple level: that is where fields are defined, types inferred, and projection is computed.


Much code here deals with null types, especially leading nulls, leading empty arrays, and so on. The object parser creates a parser for each value; a parser which "does the right thing" based on the data type. For example, for a Boolean, the parser recognizes true, false and null.

But what happens if the first value for a field is null? We don't know what kind of parser to create because we don't have a schema. Instead, we have to create a temporary placeholder parser that will consume nulls, waiting for a real type to show itself. Once that type appears, the null parser can replace itself with the correct form. Each vector's "fill empties" logic will back-fill the newly created vector with nulls for prior rows.
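A rough sketch of the placeholder idea, under stated assumptions: the class and member names below are hypothetical, a `List` stands in for a value vector, and the back-fill loop only imitates what the text calls the vector's "fill empties" logic.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of a null placeholder that replaces itself; not Drill's classes. */
class NullPlaceholderSketch {
  interface ValueParser { void parse(Object value); }

  /** One field: starts untyped, with a placeholder parser installed. */
  static class Field {
    String type;            // null until a non-null value appears
    List<Object> column;    // stand-in for a vector, created once the type is known
    int rowsSeen;           // rows consumed so far
    ValueParser parser = new NullParser(this);

    void accept(Object value) {
      parser.parse(value);
      rowsSeen++;
    }
  }

  /** Consumes nulls; on the first real value, swaps in a typed parser. */
  static class NullParser implements ValueParser {
    private final Field field;
    NullParser(Field field) { this.field = field; }

    @Override
    public void parse(Object value) {
      if (value == null) return;  // still no type information
      field.type = (value instanceof Boolean) ? "BIT" : "VARCHAR";
      field.column = new ArrayList<>();
      // Back-fill nulls for the rows that preceded the first typed value.
      for (int i = 0; i < field.rowsSeen; i++) field.column.add(null);
      field.parser = new TypedParser(field);
      field.parser.parse(value);
    }
  }

  /** Writes values into the column once the type is known. */
  static class TypedParser implements ValueParser {
    private final Field field;
    TypedParser(Field field) { this.field = field; }

    @Override
    public void parse(Object value) { field.column.add(value); }
  }
}
```

Feeding the values null, null, true leaves the field typed, with two back-filled nulls ahead of the first real value.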

Two null parsers are needed: one for when we see an empty list, and one for when we see only null. The one for null must morph into the one for empty lists if we see:
{a: null} {a: [ ] }

If we get all the way through the batch but have still not seen a type, then we have to guess. A prototype type system can tell us; otherwise we guess VARCHAR. (VARCHAR is the right choice for all-text mode, and it is as good a guess as any for other cases.)

Projection List Hints

To help, we consult the projection list, if any, for a column. If the projection is of the form a[0], we know the column had better be an array. Similarly, if the projection list has b.c, then b had better be an object.

Array Handling

The code here handles arrays in two ways. JSON normally uses the LIST type. But that can be expensive if lists are well-behaved. So, the code here also implements arrays using the classic REPEATED types. The repeated type option is disabled by default. It can be enabled, for efficiency, if Drill ever supports a JSON schema. If an array is well-behaved, mark that column as able to use a repeated type.

Ambiguous Types

JSON nulls are untyped. A run of nulls does not tell us what type will eventually appear. The best solution is to provide a schema. Without a schema, the code is forgiving: it defers selection of the column type until the first non-null value (or forces a type at the end of the batch).

For scalars, the pattern is: {a: null} {a: "foo"}. Type selection happens on the value "foo".

For arrays, the pattern is: {a: []} {a: ["foo"]}. Type selection happens on the first array element. Note that type selection must happen on the first element, even if that element is null (which, as we just said, is ambiguous).

If we are forced to pick a type (because we hit the end of a batch, or we see [null]), then we pick VARCHAR, as we allow any scalar to be converted to VARCHAR. This helps for a single-file query, but not if multiple fragments each make their own (inconsistent) decisions. Only a schema provides a consistent answer.
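The projection-list hints and the forced type choice described above might look roughly like this. It is a sketch only: `ResolutionHintSketch`, `Shape`, `hintFor`, and `forcedType` are hypothetical names, projection paths are treated as plain strings, and the "prototype type" is modeled as a nullable type name.

```java
/** Hypothetical sketch of projection hints and forced resolution; not Drill's API. */
class ResolutionHintSketch {
  enum Shape { ARRAY, OBJECT, UNKNOWN }

  /** Classifies a column from one projection path such as "a[0]" or "b.c". */
  static Shape hintFor(String column, String projection) {
    if (projection.startsWith(column + "[")) return Shape.ARRAY;   // a[0]: a is an array
    if (projection.startsWith(column + ".")) return Shape.OBJECT;  // b.c: b is an object
    return Shape.UNKNOWN;                                          // no structural hint
  }

  /** Picks a scalar type when forced: the prototype type if one exists, else VARCHAR. */
  static String forcedType(String prototypeType) {
    return prototypeType != null ? prototypeType : "VARCHAR";
  }
}
```

The fallback is deterministic within one reader, but, as the text notes, separate fragments can still reach different answers unless a schema is supplied.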

  • Constructor Details

  • Method Details

    • loader

      public JsonLoaderImpl loader()
    • writer

      public TupleWriter writer()
    • providedSchema

      protected TupleMetadata providedSchema()
    • fieldFactory

      protected FieldFactory fieldFactory()
    • onField

      public ElementParser onField(String key, TokenIterator tokenizer)
      Description copied from class: ObjectParser
      The structure parser has just encountered a new field for this object. This method returns a parser for the field, along with an optional listener to handle events within the field. The field typically uses a value parser created by the FieldParserFactory class. However, special cases (such as Mongo extended types) can create a custom parser.

      If the field is not projected, the method should return a dummy parser from FieldParserFactory.ignoredFieldParser(). The dummy parser will "free-wheel" over whatever values the field contains. (This is one way to avoid structure errors in a JSON file: just ignore them.) Otherwise, the parser will look ahead to guess the field type and will call one of the "add" methods, each of which should return a value listener for the field itself.

      A normal field will respond to the structure of the JSON file as it appears. The associated value listener receives events for the field value. The value listener may be asked to create additional structure, such as arrays or nested objects.

      Parse position: { ... field : ^ ? for a newly-seen field. Constructs a value parser and its listeners by looking ahead some number of tokens to "sniff" the type of the value. For example:

      • foo: <value> - Field value
      • foo: [ <value> ] - 1D array value
      • foo: [ [<value> ] ] - 2D array value
      • Etc.

      There are two cases in which no type estimation is possible:

      • foo: null
      • foo: []
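The look-ahead described above can be illustrated with a small sketch. It assumes the tokens after the field name are available as a list of strings, which is a simplification; `TypeSniffSketch` and `Sniff` are hypothetical names and do not correspond to the TokenIterator API.

```java
import java.util.List;

/** Hypothetical sketch of type "sniffing" by look-ahead; not Drill's tokenizer. */
class TypeSniffSketch {
  /** Array depth plus the first value token, or a null token if no estimate is possible. */
  static final class Sniff {
    final int dims;
    final String valueToken;

    Sniff(int dims, String valueToken) {
      this.dims = dims;
      this.valueToken = valueToken;
    }

    boolean resolved() { return valueToken != null; }
  }

  static Sniff sniff(List<String> tokens) {
    int dims = 0;
    int i = 0;
    // Each leading "[" adds one array dimension.
    while (i < tokens.size() && tokens.get(i).equals("[")) {
      dims++;
      i++;
    }
    if (i == tokens.size()) return new Sniff(dims, null);
    String token = tokens.get(i);
    // foo: [] gives no element to inspect.
    if (token.equals("]")) return new Sniff(dims, null);
    // foo: null gives no type; a null inside an array is passed through to the caller.
    if (token.equals("null") && dims == 0) return new Sniff(dims, null);
    return new Sniff(dims, token);
  }
}
```

In this sketch, [null] comes back as a one-dimensional array whose first token is null, leaving the caller to force a type as the class description explains.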
      Specified by:
      onField in class ObjectParser
      Parameters:
      key - name of the field
      tokenizer - an instance of a token iterator
      Returns:
      a parser for the newly-created field
    • resolveField

      public ElementParser resolveField(String key, TokenIterator tokenizer)
    • resolveArray

      public ElementParser resolveArray(String key, TokenIterator tokenizer)
    • forceNullResolution

      public void forceNullResolution(String key)
    • forceEmptyArrayResolution

      public void forceEmptyArrayResolution(String key)