Interface | Description
---|---
ScanSchemaTracker | Computes the scan output schema from a variety of sources.
Class | Description
---|---
AbstractSchemaTracker | Base class for the projection-based and defined-schema-based scan schema trackers.
DynamicSchemaFilter | Projection filter based on the scan schema, which typically starts as fully dynamic, then becomes more concrete as the scan progresses.
DynamicSchemaFilter.DynamicTupleFilter | Filter for a map, represented by a TupleMetadata.
DynamicSchemaFilter.RowSchemaFilter | Filter for the top-level dynamic schema.
MutableTupleSchema | A mutable form of a tuple schema.
MutableTupleSchema.ColumnHandle | Holder for a column that allows inserting and replacing columns within the top-level project list.
ProjectedColumn | Enhanced form of a dynamic column which records all information from the project list.
ProjectionSchemaTracker | Schema tracker for the "normal" case in which the schema starts from a simple projection list of column names, optionally with a provided schema.
ScanProjectionParser | Parses the projection list into a dynamic tuple schema.
ScanProjectionParser.ProjectionParseResult |
ScanSchemaConfigBuilder | Builds the configuration given to the ScanSchemaTracker.
ScanSchemaResolver | Resolves a schema against the existing scan schema.
SchemaBasedTracker | Simple "tracker" based on a defined, fixed schema.
SchemaUtils | Set of schema utilities that don't fit well as methods on the column or tuple classes.
Enum | Description
---|---
DynamicSchemaFilter.NewColumnsMode | Describes how to handle candidate columns not currently in the scan schema, which turns out to be a surprisingly complex question.
ScanSchemaResolver.SchemaType | Indicates the source of the schema to be analyzed.
ScanSchemaTracker.ProjectionType |
The schema tracker resolves a scan schema throughout the scan lifecycle. Schema resolution comes from a variety of sources, and starts with preparing the schema for the first reader.

Internally, the schema may start as open (has a wildcard), but may transition to closed when processing a strict provided schema.

Once this class is complete, the scan can add columns only to an open schema. All such columns are inserted at the wildcard location. If the wildcard appears by itself, columns are appended. If the wildcard appears along with implicit columns, then the reader columns appear at the wildcard location, before the implicit columns.

Once we have the initial reader input schema, we can then further refine the schema as the scan progresses.
A query can contain a wildcard projection (`*`). In this case, the set of columns is driven by the reader. Each scan might drive one, two or many readers. In an ideal world, every reader would produce the same schema. In the real world, files tend to evolve: early files have three columns, later files have five. In this case some readers will produce one schema, other readers another. Much of the complexity of Drill comes from the simple fact that Drill is a SQL engine that requires a single schema for all rows, but Drill reads data sources which are free to return any schema they want.
A wildcard projection starts by accepting the schema produced by the first reader. In "classic" mode, later readers can add columns (causing a schema change to be sent downstream), but cannot change the types of existing columns. The code here supports a "no schema change" mode in which the first reader discovers the schema, which is then fixed for all subsequent readers. This mode cannot, however, prevent schema conflicts across scans running in different fragments.
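The "classic" merge rules above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the actual Drill implementation; the class, method, and type names are hypothetical, and schemas are modeled as simple name-to-type maps:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of "classic" wildcard schema evolution: a later
// reader may append new columns (a schema change flows downstream), but
// may not change the type of an existing column.
public class WildcardMergeDemo {

  /** Merge a reader's schema (name -> type) into the evolving scan schema. */
  public static Map<String, String> merge(Map<String, String> scanSchema,
                                          Map<String, String> readerSchema) {
    Map<String, String> merged = new LinkedHashMap<>(scanSchema);
    for (Map.Entry<String, String> col : readerSchema.entrySet()) {
      String existing = merged.get(col.getKey());
      if (existing == null) {
        merged.put(col.getKey(), col.getValue()); // new column: appended
      } else if (!existing.equals(col.getValue())) {
        throw new IllegalStateException(
            "Type conflict on column " + col.getKey());
      }
    }
    return merged;
  }

  public static void main(String[] args) {
    Map<String, String> scan = new LinkedHashMap<>();
    scan.put("a", "INT");
    scan.put("b", "VARCHAR");

    Map<String, String> laterReader = new LinkedHashMap<>();
    laterReader.put("a", "INT");
    laterReader.put("c", "BIGINT"); // a later file with an extra column

    System.out.println(merge(scan, laterReader)); // {a=INT, b=VARCHAR, c=BIGINT}
  }
}
```

In the "no schema change" mode, the same check would simply reject any new column rather than appending it.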
The projection list holds the columns as requested by the user in the SELECT clause of the query, in the order in which columns appear in that clause, along with additional columns implied by other columns. The planner determines which columns to project. In Drill, projection is speculative: it is a list of names which the planner hopes will appear in the data files. The reader must make up columns (the infamous nullable INT) when it turns out that no such column exists. Otherwise, the reader must figure out the data type for any column that does exist.
An explicit projection starts with the requested set of columns, then looks in the table schema to find matches. Columns not in the project list are not projected (not written to vectors). The reader columns provide the types of the projected columns, "resolving" them to a concrete type.
An explicit projection may include columns that do not exist in the source schema. In this case, we fill in null columns for unmatched projections.
The challenge in this case is that Drill cannot know the type of missing columns; Drill can only guess. If a reader in Scan 1 guesses a type, but a reader in Scan 2 reads a column with a different type, then a schema conflict will occur downstream.
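Explicit projection resolution can be sketched as follows. This is a simplified illustration, not the Drill API; the names are hypothetical and schemas are modeled as name-to-type maps, with the nullable-INT guess standing in for Drill's missing-column mechanism:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of explicit projection: requested columns found in
// the table schema take the reader's type; unmatched columns fall back
// to the "infamous" nullable INT guess described above.
public class ExplicitProjectionDemo {

  public static Map<String, String> resolve(List<String> projectList,
                                            Map<String, String> tableSchema) {
    Map<String, String> output = new LinkedHashMap<>();
    for (String name : projectList) {
      // Only requested columns are projected, in the requested order.
      String type = tableSchema.get(name);
      output.put(name, type != null ? type : "NULLABLE INT"); // Drill's guess
    }
    return output;
  }

  public static void main(String[] args) {
    Map<String, String> table = Map.of("a", "VARCHAR", "b", "FLOAT8");
    // "b" resolves against the table; "missing" becomes a null column.
    System.out.println(resolve(List.of("b", "missing"), table));
  }
}
```

Note how the guess is exactly what creates the cross-scan conflict described above: two scans can guess, or read, different types for the same name.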
A projection can include a generic map projection: `m`, where `m` is a map. In this case, we project all members of the map. That is, the map itself is open in the above sense. Note that a map can be open even if the scan schema itself is closed: if the projection list contains only `m`, the scan schema is closed, but the map is open (the reader will discover the fields that make up the map).

A projection can instead name specific map members: `m.x, m.y`. In this case, we know that the downstream Project operator will pull just those two members to the top level and discard the rest of the map. We can thus project just those two members in the scan. As a result, the map is closed in the above sense: any additional map members discovered by the reader will be unprojected.

A projection can also combine the two forms: `m, m.x`. Here, the generic projection takes precedence. If the specific projection includes qualifiers, as in `m, m.x[1]`, then that information is used to check the type of column `x`.

Drill is unique in that it uses a dynamic schema with columns and/or types "to be named later." The scan must convert the dynamic schema into a concrete schema sent downstream. This class implements some of the steps in doing so.
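The map projection rules above reduce to a small membership test. The sketch below is a deliberately simplified model, not Drill's `ProjectionFilter`; names and the string-based project list are hypothetical:

```java
import java.util.List;

// Hypothetical sketch of the map projection rules: a generic projection
// ("m") opens the whole map, a specific projection ("m.x") closes the
// map to the listed members, and in a combined projection the generic
// form takes precedence.
public class MapProjectionDemo {

  /** Is member {@code member} of map {@code map} projected? */
  public static boolean isMemberProjected(List<String> projectList,
                                          String map, String member) {
    boolean generic = projectList.contains(map);                  // "m"
    boolean specific = projectList.contains(map + "." + member);  // "m.x"
    // Generic projection wins: if "m" appears, the map is open and every
    // member is projected, regardless of any specific entries.
    return generic || specific;
  }

  public static void main(String[] args) {
    // Specific projection: the map is closed; "m.z" is unprojected.
    System.out.println(isMemberProjected(List.of("m.x", "m.y"), "m", "z")); // false
    // Generic projection: the map is open; any member is projected.
    System.out.println(isMemberProjected(List.of("m"), "m", "z"));          // true
  }
}
```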
The result of this class is a schema identical to a defined schema that a planner might produce. Since Drill is dynamic, the planner must be able to produce a dynamic schema of the form described above. If the planner has table metadata (here represented by a provided schema), then the planner could produce a concrete defined schema (all types are defined.) Or, with a lenient provided schema, the planner might produce a dynamic defined schema: one with some concrete columns, some dynamic (name-only) columns.
Drill provides a set of implicit columns: `filename`, `dir0`, etc. Implicit columns are available (at present) only for the file storage plugin, but could be added for other storage plugins. The project list can contain the names of implicit columns. If the query contains a wildcard, then the project list may also contain implicit columns: `filename, *, dir0`.
Implicit columns are known to Drill, so Drill itself can provide type information for those columns, via an external implicit column parser. That parser locates implicit columns by name, marks the columns as implicit, and takes care of populating the columns at read time. We use a column property, `IMPLICIT_COL_TYPE`, to mark a column as implicit. Later, the scan mechanism will omit such columns when preparing the reader schema.
If the planner were to provide a defined schema, then the planner would have parsed out the implicit columns, provided their types, and marked them as implicit. So, again, we see that this class produces, at scan time, the same defined schema that the planner might produce at plan time.
Because of the way we handle implicit columns, we can allow the provided schema to include them. The provided schema simply adds a column (with any name) and sets the `IMPLICIT_COL_TYPE` property to indicate which implicit column definition to use for that column. This is handy for allowing the implicit column to include partition directories as regular columns.
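The property-based marking described above can be sketched as follows. This is a simplified model under stated assumptions, not the Drill metadata API; the class, record, and property key are hypothetical:

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: column metadata carries a property map, and
// setting an IMPLICIT_COL_TYPE property marks a column as implicit.
// The reader schema is then built by omitting implicit columns, which
// the scan populates itself at read time.
public class ImplicitColumnDemo {
  public static final String IMPLICIT_COL_TYPE = "drill.implicit.type";

  public record Column(String name, Map<String, String> properties) {
    public boolean isImplicit() {
      return properties.containsKey(IMPLICIT_COL_TYPE);
    }
  }

  /** Reader schema: the scan schema minus the implicit columns. */
  public static List<String> readerSchema(List<Column> scanSchema) {
    return scanSchema.stream()
        .filter(c -> !c.isImplicit())
        .map(Column::name)
        .toList();
  }

  public static void main(String[] args) {
    List<Column> scan = List.of(
        new Column("a", Map.of()),
        // The provided schema may use any column name; the property value
        // says which implicit column definition applies.
        new Column("source_file", Map.of(IMPLICIT_COL_TYPE, "filename")),
        new Column("b", Map.of()));
    System.out.println(readerSchema(scan)); // [a, b]
  }
}
```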
We now have a parsing flow for this package. Drill has long had a source of ambiguity: what happens if the reader has a column with the same name as an implicit column? In this flow, the implicit column takes precedence: the column is marked implicit during projection parsing, and the scan omits it from the reader schema.
The `ProjectionFilter` classes handle projection. As it turns out, this class must follow (variations of) the same rules when merging the provided schema with the projection list and so on. To ensure a single implementation of the complex projection rules, this class uses a projection filter when resolving the provided schema. The devil is in the details: knowing when a map is open or closed, enforcing consistency with known information, etc.
If the projection list is a wildcard, then the wildcard expands to include all columns from the provided schema, in the order of that schema. If the schema is strict, then the scan schema becomes fixed, as if an explicit projection list were used.
If the projection list is explicit, then each column is resolved against the provided schema. If the projection list includes a column not in the provided schema, then it falls to the reader (or missing columns mechanism) to resolve that particular column.
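The wildcard-expansion rule above can be sketched briefly. This is a hypothetical illustration, not Drill's `ScanSchemaResolver`; the record and method names are invented for the example:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: a wildcard expands to all provided-schema columns,
// in provided-schema order. A strict provided schema closes the scan
// schema, as if an explicit projection list had been used; a lenient one
// leaves it open so readers may still add columns.
public class ProvidedSchemaDemo {

  public record ScanSchema(List<String> columns, boolean open) {}

  public static ScanSchema resolveWildcard(List<String> providedSchema,
                                           boolean strict) {
    return new ScanSchema(new ArrayList<>(providedSchema), !strict);
  }

  public static void main(String[] args) {
    ScanSchema lenient = resolveWildcard(List.of("a", "b"), false);
    ScanSchema closed = resolveWildcard(List.of("a", "b"), true);
    System.out.println(lenient.open()); // true: readers may add columns
    System.out.println(closed.open());  // false: the schema is now fixed
  }
}
```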
An early reader schema can work with a provided schema. In this case, the early reader schema must declare the same column type as the provided schema. This is not a large obstacle: the provided schema should have originally come from the reader (or a description of the reader) so conflicts should not occur in normal operation.
Each reader produces (via the `ResultSetLoader`) the reader's output schema: the set of columns actually read by the reader.
If the projection list contained a wildcard, then the reader output schema will determine the set of columns that replaces the wildcard. (That is, all reader columns are projected and the scan schema expands to reflect the actual columns.)
If the projection list is explicit (or made so by a strict provided schema), then the reader output schema must be a subset of the scan schema: it is an error for the reader to include extra columns, as the scan mechanism won't know what to do with those vectors. The projection mechanism (see below) integrates with the `ResultSetLoader` to project only those columns needed; the others are given to the reader as "dummy" column writers: writers that accept, but discard, their data.
Note the major difference between the early reader schema and the reader output schema. The early reader schema includes all the columns that the reader can read. The reader output schema includes only those columns that the reader actually read (as controlled by the projection filter.) For most readers (CSV, JSON, etc.), there is no early reader schema, there is only the reader output schema: the set of columns (modulo projection) that turned out to be in the data source.
The scan schema tracker tracks all three schemas together to form the scan output schema. Tracking the combined schema ensures we preserve the user's requested project ordering. The reader manager builds the vectors using the above mechanisms, then merges the vectors (very easy to do in a columnar system) to produce the output batch which matches the scan schema.
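The order-preserving merge described above can be sketched as follows. This is a simplified model, not the Drill reader manager; the names are hypothetical and vectors are represented as strings:

```java
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the output batch is assembled by walking the scan
// schema in order, taking each column's vector from the reader where it
// exists and from the implicit/missing-column mechanism otherwise, so the
// user's requested column order is preserved.
public class OutputBatchDemo {

  public static List<String> assemble(List<String> scanSchema,
                                      Map<String, String> readerVectors,
                                      Map<String, String> otherVectors) {
    return scanSchema.stream().map(name -> {
      String vector = readerVectors.getOrDefault(name, otherVectors.get(name));
      if (vector == null) {
        throw new IllegalStateException("Unresolved column: " + name);
      }
      return name + ":" + vector;
    }).toList();
  }

  public static void main(String[] args) {
    // Scan schema order came from the SELECT clause: b, a, filename.
    List<String> batch = assemble(
        List.of("b", "a", "filename"),
        Map.of("a", "INT-vector", "b", "VARCHAR-vector"),
        Map.of("filename", "implicit-vector"));
    System.out.println(batch);
    // [b:VARCHAR-vector, a:INT-vector, filename:implicit-vector]
  }
}
```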
```
                  Scan Plan
                      |
                      v
               +--------------+
               | Project List |
               |    Parser    |
               +--------------+
                      |
                      v
               +-------------+      +-------------------+
               | Scan Schema | ---> | Projection Filter |
               |   Tracker   |      +-------------------+
               +-------------+               |
                      |                      |
                      v                      v
+------+       +------------+        +------------+      +-----------+
| File | --->  |   Reader   | -----> | Result Set | ---> | Data File |
| Data |       |            |        |   Loader   | <--- |  Reader   |
+------+       +------------+        +------------+      +-----------+
                      |                      |
                      v                      |
               +------------+    Reader      |
               |   Reader   |    Schema      |
               | Lifecycle  | <--------------+
               +------------+                |
                      |                      |
                      v                      |
               +---------+      Loaded       |
               | Output  |      Vectors      |
               | Builder | <-----------------+
               +---------+
                      |
                      v
                Output Batch
```

Omitted are the details of implicit and missing columns. The scan lifecycle (not shown) orchestrates the whole process.
The result is a scan schema which can start entirely dynamic (just a wildcard or a list of column names), and which is then resolved via a series of steps (some of which involve the real work of the scanner: reading data). The bottom is the output: a fully resolved scan schema which exactly describes an output data batch.
Copyright © The Apache Software Foundation. All rights reserved.