- Direct Known Subclasses:
ResultSetLoader abstraction. Uses the listener-based
JsonStructureParser to walk the JSON tree in a "streaming"
fashion, firing events which this class turns into vector write
operations. Listeners handle options such as all-text mode
versus type-specific parsing. Think of this implementation as a
listener-based recursive-descent parser.
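The listener idea can be illustrated with a minimal sketch. Everything here is hypothetical and only mirrors the description above: `ScalarListener`, `BigIntListener`, `TextListener` and `parse` are illustration names, not Drill or Jackson APIs, and a plain `List` stands in for a value vector.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ListenerSketch {

  // Hypothetical event interface; not the actual Drill listener API.
  interface ScalarListener {
    void onInt(long value);
    void onString(String value);
    void onNull();
  }

  // Type-specific parsing: the column holds integers only.
  static final class BigIntListener implements ScalarListener {
    final List<Long> vector = new ArrayList<>();
    @Override public void onInt(long value) { vector.add(value); }
    @Override public void onString(String value) {
      throw new IllegalArgumentException("Expected an integer, got string: " + value);
    }
    @Override public void onNull() { vector.add(null); }
  }

  // All-text mode: every scalar is stored as its text form.
  static final class TextListener implements ScalarListener {
    final List<String> vector = new ArrayList<>();
    @Override public void onInt(long value) { vector.add(Long.toString(value)); }
    @Override public void onString(String value) { vector.add(value); }
    @Override public void onNull() { vector.add(null); }
  }

  // Stand-in for the streaming parser: fires one listener event per token.
  static void parse(List<Object> tokens, ScalarListener listener) {
    for (Object token : tokens) {
      if (token == null) {
        listener.onNull();
      } else if (token instanceof Long) {
        listener.onInt((Long) token);
      } else {
        listener.onString(token.toString());
      }
    }
  }

  public static void main(String[] args) {
    TextListener text = new TextListener();
    parse(Arrays.<Object>asList(1L, null, "x"), text);
    System.out.println(text.vector);
  }
}
```

Swapping the listener changes how the same token stream is written, which is the role the description assigns to listeners.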
The JSON loader mechanism runs two intertwined state machines:
- The actual parser (to parse each JSON object, array or scalar according
to its inferred type represented by the
- The type discovery machine, which is made complex because JSON may include long runs of nulls, represented by this class.
Schema Discovery
Fields are discovered on the fly. Types are inferred from the first JSON token for a field. Type inference is less than perfect: it cannot handle type changes such as first seeing 10, then 12.5, or first seeing "100", then 200.
When a field first contains null or an empty list, "null deferral" logic adds a special state that "waits" for an actual data type to present itself. This allows the parser to handle a series of nulls, empty arrays, or arrays of nulls (when using lists) at the start of the file. If no type ever appears, the loader forces the field to "text mode", hoping that the field is scalar.
To slightly help the null case, if the projection list shows that a column must be an array or a map, then that information is used to guess the type of a null column.
The code includes a prototype mechanism to provide type hints for columns. At present, it is used only to handle nulls that are never "resolved" by the end of a batch. It would be much better to use the hints (or a full schema) to avoid the huge mass of code needed to handle nulls.
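A minimal sketch of first-token inference with null deferral, as described in this section. The names (`NullDeferralSketch`, `InferredType`, `observe`, `typeOf`) are hypothetical and the type tags only approximate Drill's types. Throwing on a type change is this sketch's choice for showing the limitation; it is not a statement about how the loader reacts.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class NullDeferralSketch {

  // Hypothetical type tags; UNKNOWN plays the role of the deferred-null state.
  enum InferredType { UNKNOWN, BIGINT, FLOAT8, VARCHAR }

  private final Map<String, InferredType> columns = new LinkedHashMap<>();

  // Records one scalar for a column; a Java null stands in for a JSON null.
  void observe(String column, Object value) {
    InferredType current = columns.getOrDefault(column, InferredType.UNKNOWN);
    if (value == null) {
      columns.put(column, current); // remember the column, keep waiting for a type
      return;
    }
    InferredType seen = infer(value);
    if (current == InferredType.UNKNOWN) {
      columns.put(column, seen); // the first non-null token fixes the type
    } else if (current != seen) {
      // First-token inference cannot follow a change such as 10 then 12.5.
      throw new IllegalStateException(
          "Type change in column " + column + ": " + current + " then " + seen);
    }
  }

  InferredType typeOf(String column) {
    return columns.getOrDefault(column, InferredType.UNKNOWN);
  }

  static InferredType infer(Object value) {
    if (value instanceof Long) return InferredType.BIGINT;
    if (value instanceof Double) return InferredType.FLOAT8;
    if (value instanceof String) return InferredType.VARCHAR;
    throw new IllegalArgumentException("Unsupported value in this sketch: " + value);
  }

  public static void main(String[] args) {
    NullDeferralSketch sketch = new NullDeferralSketch();
    sketch.observe("a", null);  // deferred: no type yet
    sketch.observe("a", null);
    sketch.observe("a", 10L);   // the type is realized here
    System.out.println(sketch.typeOf("a"));
  }
}
```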
Provided Schema
The JSON loader accepts a provided schema, which removes type ambiguities. For the cases above (runs of nulls, or shifting types), the provided schema specifies the vector type to create; the individual column listeners then attempt to convert each JSON token to the target vector type. As a result, if the schema provides the correct type, the loader can ride over ambiguities in the input.
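The conversion step can be sketched as follows, assuming a hypothetical `TargetType` enum and `convert` helper (neither is a Drill API). The target type comes from the schema, so the token's own type no longer decides the column type.

```java
final class ProvidedSchemaSketch {

  // Hypothetical column types that a provided schema might name.
  enum TargetType { BIGINT, FLOAT8, VARCHAR }

  // Converts one parsed JSON scalar to the type the schema asks for.
  static Object convert(Object jsonValue, TargetType target) {
    if (jsonValue == null) {
      return null; // a null needs no inference once the type is known
    }
    switch (target) {
      case BIGINT:
        if (jsonValue instanceof Number) {
          return ((Number) jsonValue).longValue(); // note: truncates 12.5 to 12
        }
        return Long.parseLong(jsonValue.toString().trim());
      case FLOAT8:
        if (jsonValue instanceof Number) {
          return ((Number) jsonValue).doubleValue();
        }
        return Double.parseDouble(jsonValue.toString().trim());
      case VARCHAR:
        return jsonValue.toString();
      default:
        throw new IllegalArgumentException("Unknown target type: " + target);
    }
  }

  public static void main(String[] args) {
    // 10 followed by 12.5 is unambiguous once the schema says FLOAT8.
    System.out.println(convert(10L, TargetType.FLOAT8));
    System.out.println(convert(12.5, TargetType.FLOAT8));
    // "100" followed by 200 is unambiguous once the schema says BIGINT.
    System.out.println(convert("100", TargetType.BIGINT));
    System.out.println(convert(200L, TargetType.BIGINT));
  }
}
```

If the schema names a type the data cannot be converted to, the parse calls in this sketch throw `NumberFormatException`; how the loader reports such a mismatch is not covered here.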
Comparison to Original JSON Reader
This class replaces the JsonReader class used in Drill versions 1.17
and before. Compared with the previous version, this implementation:
- Materializes parse states as classes rather than as methods and boolean flags as in the prior version.
- Reports errors as
UserException objects, complete with context information, rather than as generic Java exceptions as in the prior version.
- Moves parse options into a separate
- Uses a simpler iteration protocol: call
readBatch() until it returns
false. Errors are reported out-of-band via an exception.
- The result set loader abstraction is perfectly happy with an empty schema. For this reason, this version (unlike the original) does not make up a dummy column if the schema would otherwise be empty.
- Projection pushdown is handled by the
ResultSetLoader rather than the JSON loader. This class always creates a vector writer, but the result set loader will return a dummy (no-op) writer for non-projected columns.
- Like the original version, this version "free-wheels" over unprojected objects and arrays, watching only for matching brackets and ignoring all else.
- Writes boolean values as SmallInt values, rather than as bits in the prior version.
- This version also "free-wheels" over all unprojected values. If the user finds that they have inconsistent data in some field f, then the user can project fields except f; Drill will ignore the inconsistent values in f.
- Because of this free-wheeling capability, this version does not need a
"counting" reader; this same reader handles the case in which no fields are
- Runs of null values result in a "deferred null state" that patiently waits for an actual value token to appear, and only then "realizes" a parse state for that type.
- Provides the same limited error recovery as the original version. See DRILL-4653 and DRILL-5953.
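The "free-wheeling" behavior in the list above can be sketched as a bracket-counting skip over a token stream. The `Token` enum and `skipValue` helper are hypothetical stand-ins for illustration; they are not the Jackson token types or the Drill implementation.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

final class FreeWheelSketch {

  // Hypothetical token kinds, loosely modeled on a streaming JSON parser.
  enum Token { START_OBJECT, END_OBJECT, START_ARRAY, END_ARRAY, FIELD_NAME, SCALAR }

  // Skips exactly one value (scalar, object or array) and returns the number
  // of tokens consumed. Only bracket depth is tracked; contents are ignored.
  // The caller must be positioned at the start of a value.
  static int skipValue(Iterator<Token> tokens) {
    int depth = 0;
    int consumed = 0;
    while (tokens.hasNext()) {
      Token token = tokens.next();
      consumed++;
      switch (token) {
        case START_OBJECT:
        case START_ARRAY:
          depth++;
          break;
        case END_OBJECT:
        case END_ARRAY:
          depth--;
          break;
        default:
          break; // field names and scalars are ignored
      }
      if (depth < 0) {
        throw new IllegalStateException("Closing bracket without an opening one");
      }
      if (depth == 0) {
        return consumed; // brackets balanced: the value is complete
      }
    }
    throw new IllegalStateException("Input ended inside an unprojected value");
  }

  public static void main(String[] args) {
    // Tokens for {"a": [1, {"b": 2}]} followed by one more scalar.
    List<Token> stream = Arrays.asList(
        Token.START_OBJECT, Token.FIELD_NAME, Token.START_ARRAY, Token.SCALAR,
        Token.START_OBJECT, Token.FIELD_NAME, Token.SCALAR, Token.END_OBJECT,
        Token.END_ARRAY, Token.END_OBJECT, Token.SCALAR);
    Iterator<Token> tokens = stream.iterator();
    System.out.println("Skipped tokens: " + skipValue(tokens));
    System.out.println("Next token: " + tokens.next());
  }
}
```

Because the skip never inspects scalar types, inconsistent values inside an unprojected field cannot cause a type conflict, which matches the workaround described for field f.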
Method Summary
- close(): Releases resources held by this class including the input stream.
- endBatch(): Finish reading a batch of data.
- ioException: I/O error reported from the Jackson JSON parser.
- messageParseError: The parser is configured to find a message tag within the JSON, and a syntax error occurred when following the data path.
- options()
- parseError: The Jackson JSON parser failed to start on the input file.
- boolean readBatch(): Read one batch of row data.
- removeNullMarker(org.apache.drill.exec.store.easy.json.loader.JsonLoaderImpl.NullTypeMarker marker)
- structureError: General structure-level error: something very unusual occurred in the JSON that passed Jackson, but failed in the structure parser.
- syntaxError(com.fasterxml.jackson.core.JsonParseException e): The Jackson parser reported a syntax error.
- syntaxError(com.fasterxml.jackson.core.JsonToken token): Received an unexpected token.
- typeError: The Jackson parser reported an error when trying to convert a value to a specific type.
- unrecoverableError(): Error recovery is on, the structure parser tried to recover, but encountered too many other errors and gave up.
options
public JsonLoaderOptions options()
parser
public JsonStructureParser parser()
fieldFactory
public FieldFactory fieldFactory()
readBatch
public boolean readBatch()
Description copied from interface: JsonLoader
Read one batch of row data.
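A minimal sketch of the calling pattern: call `readBatch()` until it returns false, with errors arriving as unchecked exceptions. The `Loader` interface and `FakeLoader` class are hypothetical stand-ins that mirror only the two methods documented on this page, so the loop can run without Drill.

```java
final class ReadBatchLoopSketch {

  // Hypothetical stand-in exposing the two documented methods.
  interface Loader {
    boolean readBatch();
    void close();
  }

  // Fake loader that yields a fixed number of batches, then reports end of data.
  static final class FakeLoader implements Loader {
    private int remaining;
    boolean closed;

    FakeLoader(int batches) {
      this.remaining = batches;
    }

    @Override public boolean readBatch() {
      if (remaining == 0) {
        return false; // no more data
      }
      remaining--;
      return true;
    }

    @Override public void close() {
      closed = true;
    }
  }

  // Reads every batch and always closes the loader, even if a read fails.
  static int drain(Loader loader) {
    int batches = 0;
    try {
      while (loader.readBatch()) {
        batches++; // a real caller would harvest the batch here
      }
    } finally {
      loader.close();
    }
    return batches;
  }

  public static void main(String[] args) {
    System.out.println("Batches read: " + drain(new FakeLoader(3)));
  }
}
```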
addNullMarker
public void addNullMarker
removeNullMarker
public void removeNullMarker
endBatch
protected void endBatch()
Finish reading a batch of data. We may have pending "null" columns: a column for which we've seen only nulls, or an array that has always been empty. The batch needs to finish, and needs a type, but we still don't know the type. Since we must decide on one, we guess Varchar and switch to text mode.
This choice is not perfect. Switching to text mode means results will vary from run to run depending on the order in which we see empty and non-empty values for this column. Plus, since the system is distributed, the decision made here may conflict with that made in some other fragment.
The only real solution is for the user to provide a schema.
Bottom line: the user is responsible for not giving Drill ambiguous data that would require Drill to predict the future.
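The end-of-batch fallback can be sketched as a registry of unresolved columns. The method names addNullMarker, removeNullMarker and endBatch come from this page; the `NullMarker` class, its fields, and the method bodies are assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

final class PendingNullSketch {

  // Hypothetical marker for a column that has seen only nulls so far.
  static final class NullMarker {
    final String column;
    String resolvedType; // stays null until a type is chosen

    NullMarker(String column) {
      this.column = column;
    }

    void resolve(String type) {
      this.resolvedType = type;
    }
  }

  private final List<NullMarker> pending = new ArrayList<>();

  // Registers a column whose type is still unknown.
  void addNullMarker(NullMarker marker) {
    pending.add(marker);
  }

  // Unregisters a column once a real value has revealed its type.
  void removeNullMarker(NullMarker marker) {
    pending.remove(marker);
  }

  // At the end of the batch, every still-unresolved column is guessed to be text.
  void endBatch() {
    for (NullMarker marker : new ArrayList<>(pending)) {
      marker.resolve("VARCHAR");
    }
    pending.clear();
  }

  int pendingCount() {
    return pending.size();
  }

  public static void main(String[] args) {
    PendingNullSketch loader = new PendingNullSketch();
    NullMarker a = new NullMarker("a");
    NullMarker b = new NullMarker("b");
    loader.addNullMarker(a);
    loader.addNullMarker(b);
    a.resolve("BIGINT");        // a real value arrived for column a
    loader.removeNullMarker(a);
    loader.endBatch();          // column b never saw a value
    System.out.println(a.column + " -> " + a.resolvedType);
    System.out.println(b.column + " -> " + b.resolvedType);
  }
}
```

Whether a column ends up as its real type or as text depends on whether a value arrives before the batch ends, which is the order dependence noted above.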
close
public void close()
Description copied from interface: JsonLoader
Releases resources held by this class including the input stream. Does not close the result set loader passed into this instance.
parseError
The Jackson JSON parser failed to start on the input file.
ioException
I/O error reported from the Jackson JSON parser.
structureError
General structure-level error: something very unusual occurred in the JSON that passed Jackson, but failed in the structure parser.
syntaxError
public RuntimeException syntaxError(com.fasterxml.jackson.core.JsonParseException e)
The Jackson parser reported a syntax error. Will not occur if recovery is enabled.
typeError
The Jackson parser reported an error when trying to convert a value to a specific type. Should never occur since we only convert to the type that Jackson itself identified.
syntaxError
public RuntimeException syntaxError(com.fasterxml.jackson.core.JsonToken token)
Received an unexpected token. Should never occur as the Jackson parser itself catches errors.
unrecoverableError
public RuntimeException unrecoverableError()
Error recovery is on, the structure parser tried to recover, but encountered too many other errors and gave up.
messageParseError
The parser is configured to find a message tag within the JSON, and a syntax error occurred when following the data path.