Package org.apache.drill.exec.store.easy.json.extended


Provides parsing for Mongo extended types which are generally of the form { "$type": value }. Supports both V1 and V2 names. Supports both the Canonical and Relaxed formats.

Does not support all types as some appear internal to Mongo. Supported types:

  • Array (https://docs.mongodb.com/manual/reference/mongodb-extended-json/#bson.Array)
  • Binary, translated to a Drill VARBINARY. The data must be encoded in the default Jackson Base64 format. The subType field, if present, is ignored.
  • Date, translated to a Drill TIMESTAMP. Drill's times are in the server local time. The UTC date in Mongo will be shifted to the local time zone on read.
  • Decimal (V1), translated to a Drill VARDECIMAL.
  • Decimal128 (V2), translated to a Drill VARDECIMAL, but limited to the supported DECIMAL range.
  • Document, translated to a Drill MAP. The map fields must be consistent across documents: same names and types. (This is a restriction of maps in Drill's relational data model.) Field names cannot be the same as any of the extended type names.
  • Double, translated to a Drill FLOAT8.
  • Int64, translated to a Drill BIGINT.
  • Int32, translated to a Drill INT.
  • Object ID, translated to a Drill VARCHAR.
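
The time-zone shift described for Date above can be illustrated with java.time. This is a minimal sketch, not Drill's code: the class name MongoDateShift and the example zone are hypothetical, and a real server would use its own default zone.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneId;

final class MongoDateShift {
  // Converts a UTC instant (as stored by Mongo) to the wall-clock
  // time in the given server zone, with the zone information dropped.
  static LocalDateTime toLocal(String utcIso, ZoneId serverZone) {
    return Instant.parse(utcIso).atZone(serverZone).toLocalDateTime();
  }

  public static void main(String[] args) {
    // Hypothetical server zone, chosen only for illustration.
    ZoneId zone = ZoneId.of("America/Los_Angeles");
    // Noon UTC in January is 04:00 local time in this zone (UTC-8).
    System.out.println(toLocal("2020-01-15T12:00:00Z", zone));
  }
}
```

The same UTC value read on servers in different zones therefore yields different local timestamps.
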
Unsupported types:

The unsupported types appear mostly in commands and queries rather than in data, and have no corresponding Drill type. If they appear in data, they are translated to a Drill map.

Drill defines a few "extended extended" types:

  • Date ($dateDay) - a date-only field in the form YYYY-MM-DD which maps to a Drill DATE vector.
  • Time ($time) - a time-only field in the form HH:MM:SS.SSS which maps to a Drill TIME vector.
  • Interval ($interval) - a date/time interval in ISO format which maps to a Drill INTERVAL vector.
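
The three value formats listed above can be checked with java.time parsers. This is a sketch under assumptions: the class DrillExtendedFormats is hypothetical, java.time stands in for whatever Drill uses internally, and java.time.Period covers only the date part of an ISO interval.

```java
import java.time.LocalDate;
import java.time.LocalTime;
import java.time.Period;

final class DrillExtendedFormats {
  // $dateDay values use the form YYYY-MM-DD.
  static LocalDate parseDateDay(String text) {
    return LocalDate.parse(text);
  }

  // $time values use the form HH:MM:SS.SSS.
  static LocalTime parseTime(String text) {
    return LocalTime.parse(text);
  }

  // $interval values use ISO format; Period handles only
  // year/month/day components, which is a limit of this sketch.
  static Period parseDateInterval(String text) {
    return Period.parse(text);
  }

  public static void main(String[] args) {
    System.out.println(parseDateDay("2020-04-21"));
    System.out.println(parseTime("11:22:33.123"));
    System.out.println(parseDateInterval("P1Y2M3D"));
  }
}
```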

Drill extends the extended types to allow null values in the usual way. Drill accepts normal "un-extended" JSON in the same file, but doing so can lead to ambiguities (see Ambiguities below).

Once Drill defines a field as an extended type, parsing rules are tighter than for normal "non-extended" types. For example, an extended double will not convert from a Boolean or float value.

Provided Schema

If used with a provided schema, then:
  • If the first field is in canonical format (with a type), then the extended type must agree with the provided type, or an error will occur.
  • If the first field is in relaxed format, or is null, then the provided schema will force the given type as though the data were in canonical format.
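
The two rules above amount to a small decision function. The following is only a sketch of that logic: ProvidedSchemaRule, its method, and the use of plain strings for Drill type names are hypothetical and do not reflect Drill's actual schema classes.

```java
final class ProvidedSchemaRule {
  // providedType: the column type from the provided schema.
  // extendedType: the type named by a canonical wrapper on the first
  //   value, or null if the first value is relaxed or JSON null.
  static String resolve(String providedType, String extendedType) {
    if (extendedType == null) {
      // Relaxed or null: the provided schema forces the type.
      return providedType;
    }
    if (!extendedType.equals(providedType)) {
      // Canonical format must agree with the provided schema.
      throw new IllegalStateException(
          "Extended type " + extendedType
          + " conflicts with provided type " + providedType);
    }
    return providedType;
  }

  public static void main(String[] args) {
    System.out.println(resolve("BIGINT", null));
    System.out.println(resolve("INT", "INT"));
  }
}
```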

Ambiguities

Extended JSON is subject to the same ambiguities as normal JSON. If Drill sees a field in relaxed mode before extended mode, Drill will use its normal type inference rules. Thus, if the first field presents as a: "30", Drill will infer the type as a string, even if a later field presents as a: { "$numberInt": 30 }. To avoid ambiguities, either use only the canonical format, or use a provided schema.
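
The "first value wins" behavior can be modeled in a few lines. This sketch is an illustration only: FirstValueInference is a hypothetical class, values are modeled as plain Java objects rather than parser tokens, and only two wrapper names ($numberInt and $numberLong, from the Mongo extended JSON specification) are recognized.

```java
import java.util.Collections;
import java.util.Map;

final class FirstValueInference {
  // Infers a Drill type name from the first value seen for a field.
  // A single-entry Map models a canonical wrapper such as
  // { "$numberInt": 30 }; anything else models relaxed JSON.
  static String inferType(Object first) {
    if (first instanceof Map) {
      Map<?, ?> wrapper = (Map<?, ?>) first;
      if (wrapper.containsKey("$numberInt")) return "INT";
      if (wrapper.containsKey("$numberLong")) return "BIGINT";
      return "MAP"; // an ordinary JSON object
    }
    if (first instanceof String) return "VARCHAR";
    if (first instanceof Integer || first instanceof Long) return "BIGINT";
    if (first instanceof Double) return "FLOAT8";
    if (first instanceof Boolean) return "BIT";
    throw new IllegalArgumentException("Unsupported value: " + first);
  }

  public static void main(String[] args) {
    // The quoted value "30" is seen first, so the column is a string.
    System.out.println(inferType("30"));
    // Had the canonical form come first, the column would be an INT.
    System.out.println(inferType(Collections.singletonMap("$numberInt", 30)));
  }
}
```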

Implementation

Extended types are disabled by default and must be enabled using the store.json.extended_types system/session option (ExecConstants.JSON_EXTENDED_TYPES_KEY).
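
One way to set the option for a session is through JDBC. This is a sketch, assuming the caller already holds an open Drill JDBC connection; the class EnableExtendedTypes and its methods are hypothetical helpers, and only the option name comes from this page.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

final class EnableExtendedTypes {
  // Option name as documented above.
  static final String OPTION = "store.json.extended_types";

  // Builds the ALTER SESSION statement for the option.
  static String alterSessionSql(boolean enabled) {
    return "ALTER SESSION SET `" + OPTION + "` = " + enabled;
  }

  // Applies the option on a connection supplied by the caller.
  static void apply(Connection conn, boolean enabled) throws SQLException {
    try (Statement stmt = conn.createStatement()) {
      stmt.execute(alterSessionSql(enabled));
    }
  }
}
```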

Extended types are implemented via a field factory. The field factory builds the structure needed each time the JSON structure parser sees a new field. For extended types, the field factory looks ahead to detect an extended type, specifically for the pattern { "$type":. If the pattern is found, and the name is one of the supported type names, then the factory creates a parser to accept the enhanced type in either the canonical or relaxed forms.
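
The look-ahead check can be approximated as follows. This is not the field factory's implementation, which works on the parser's token stream: the sketch matches raw text with a regular expression as a stand-in, and the class ExtendedTypeLookAhead and its name list are assumptions (Mongo names from the extended JSON specification plus the Drill-specific names listed earlier on this page).

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Optional;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class ExtendedTypeLookAhead {
  // Illustrative, not exhaustive, list of recognized type names.
  static final Set<String> SUPPORTED = new HashSet<>(Arrays.asList(
      "$numberInt", "$numberLong", "$numberDouble", "$numberDecimal",
      "$date", "$binary", "$oid", "$dateDay", "$time", "$interval"));

  // Matches the opening of a value of the form { "$type":
  private static final Pattern WRAPPER =
      Pattern.compile("^\\s*\\{\\s*\"(\\$[A-Za-z0-9]+)\"\\s*:");

  // Returns the type name if the text starts with a supported
  // extended-type wrapper, otherwise an empty Optional.
  static Optional<String> detect(String fieldValueText) {
    Matcher m = WRAPPER.matcher(fieldValueText);
    if (m.find() && SUPPORTED.contains(m.group(1))) {
      return Optional.of(m.group(1));
    }
    return Optional.empty();
  }

  public static void main(String[] args) {
    System.out.println(detect("{ \"$numberLong\": \"30\" }").isPresent());
    System.out.println(detect("{ \"a\": 1 }").isPresent());
  }
}
```

An unrecognized name such as "$foo" falls through to normal parsing in this sketch, which mirrors the statement that unsupported types are read as ordinary maps.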

Each field is represented by a Mongo-specific parser along with an associated value listener. The implementation does not reify the object structure; that structure is consumed by the field parser itself. The value listener receives value tokens as if the data were in relaxed format.

See Also:
  • MapVectorOutput for an older implementation