Interface ScanSchemaTracker
All Known Implementing Classes:
AbstractSchemaTracker, ProjectionSchemaTracker, SchemaBasedTracker
The scan operator output schema can be defined or dynamic.
Defined Schema
The planner computes a defined schema from metadata, as in a typical query engine. A defined schema defines the output schema directly: the defined schema is the output schema. Drill's planner does not yet support a defined schema, but work is in progress to get there for some cases.
With a defined schema, the reader is given a fully-defined schema and
its job is to produce vectors that match the given schema. (The details
are handled by the ResultSetLoader.)
At present, since the planner does not actually provide a defined schema, we support it in this class, and verify that the defined schema, if provided, exactly matches the names in the project list in the same order.
Dynamic Schema
A dynamic schema is one defined at run time: the traditional Drill approach. A dynamic schema starts with a projection list: a list of column names without types. This class converts the projection list into a dynamic reader schema: a schema in which each column has the type LATE, which basically means "a type to be named later" by the reader.
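As a minimal sketch (hypothetical names, not Drill's actual API), the conversion of a projection list into a dynamic schema can be pictured as mapping every projected name to the LATE placeholder:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: turn a projection list of bare column names into a
// "dynamic reader schema" in which every column starts with the placeholder
// type LATE, to be resolved later by the reader.
public class DynamicSchemaSketch {
  public enum Type { LATE, INT, VARCHAR }

  // Each projected name maps to LATE: "a type to be named later".
  public static Map<String, Type> dynamicSchema(List<String> projectList) {
    Map<String, Type> schema = new LinkedHashMap<>(); // preserve SELECT order
    for (String col : projectList) {
      schema.put(col, Type.LATE);
    }
    return schema;
  }
}
```

The insertion-ordered map mirrors the rule that columns appear in the row in the order given by the SELECT list.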
Hybrid Schema
Some readers support a provided schema, a concept similar to, but distinct from, a defined schema. The provided schema gives hints about the schema. At present it is an extra: the planner neither uses nor understands it. Thus the projection list is independent of the provided schema: the two lists may be disjoint.
With a provided schema, the projection list defines the output schema. If the provided schema includes projected columns, then the provided schema for those columns flows to the output schema, just as for a defined schema. Similarly, the reader is given a defined schema for those columns.
Where a provided schema differs is that the projection list can include columns not in the provided schema; such columns act as in the dynamic case: the reader defines the column type.
Projection Types
Drill will pass in a projection list which is one of three kinds:
- SELECT *: Project all data source columns, whatever they happen to be. Create columns using names from the data source. The data source also determines the order of columns within the row.
- SELECT a, b, c, ...: Project a specific set of columns, identified by case-insensitive name. The output row uses the names from the SELECT list, but types from the data source. Columns appear in the row in the order specified by the SELECT.
- SELECT ...: Project nothing; occurs in SELECT COUNT(*) type queries. The provided projection list contains no (table) columns, though it may contain metadata columns.
- Wildcard ("*") column: indicates the place in the projection list to insert the table columns once found in the table projection plan.
- Data source columns: columns from the underlying table. The table projection planner will determine if the column exists, or must be filled in with a null column.
- The generic data source columns array: columns, or optionally specific members of the columns array such as columns[1]. (Supported only by specific readers.)
- Implicit columns: fqn, filename, filepath and suffix. These reference parts of the name of the file being scanned.
- Partition columns: dir0, dir1, ...: these reference parts of the path name of the file.
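A small classifier sketches how these kinds of projection-list entries could be told apart (hypothetical names; Drill's real parsing lives elsewhere in the scan framework):

```java
import java.util.regex.Pattern;

// Hypothetical classifier for the kinds of entries that can appear in a
// Drill projection list, per the list above. Names are case-insensitive.
public class ProjectionKind {
  public enum Kind { WILDCARD, COLUMNS_ARRAY, IMPLICIT, PARTITION, DATA_SOURCE }

  private static final Pattern COLUMNS = Pattern.compile("columns(\\[\\d+\\])?");
  private static final Pattern PARTITION = Pattern.compile("dir\\d+");

  public static Kind classify(String col) {
    String name = col.toLowerCase();
    if (name.equals("*")) {
      return Kind.WILDCARD;                 // wildcard column
    }
    if (COLUMNS.matcher(name).matches()) {
      return Kind.COLUMNS_ARRAY;            // columns or columns[n]
    }
    switch (name) {                         // file implicit columns
      case "fqn": case "filename": case "filepath": case "suffix":
        return Kind.IMPLICIT;
    }
    if (PARTITION.matcher(name).matches()) {
      return Kind.PARTITION;                // dir0, dir1, ...
    }
    return Kind.DATA_SOURCE;                // ordinary table column
  }
}
```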
Empty Schema
A special case occurs if the projection list is empty, which indicates that the query is a COUNT(*): we need only a count of columns, but none
of the values. Implementation of the count is left to the specific reader
as some can optimize this case. The output schema may include a single
dummy column. In this case, the first batch defines the schema expected
from all subsequent readers and batches.
Implicit Columns
The project list can contain implicit columns for data sources which support them. Implicit columns are disjoint from data source columns and are provided by Drill itself. This class effectively splits the projection list into a set of implicit columns and the remainder of the list, which are the reader columns.
Reader Input Schema
The various forms of schema above produce a reader input schema: the schema given to the reader. The reader input schema is the set of projected columns, minus implicit columns, along with available type information.
If the reader can produce only one type for each column, then the provided or defined schema should already specify that type, and the reader can simply ignore the reader input schema. (This feature allows this scheme to be compatible with older readers.)
However, if the reader can convert a column to multiple types, then the reader should use the reader input schema to choose a type. If the input schema is dynamic (type is LATE), then the reader chooses the column type and should choose the "most natural" type.
Reader Output Schema
The reader proceeds to read a batch of data, choosing types for dynamic columns. The reader may provide a subset of projected columns if, say, the reader reads an older file that is missing some columns or (for a dynamic schema) the user specified columns which don't actually exist.
The result is the reader output schema: a subset of the reader
input schema in which each included column has a concrete type. (The
reader may have provided extra columns. In this case, the
ResultSetLoader
will have ignored those columns, providing a
dummy column writer, and omitting non-projected columns from the reader
output schema.)
The reader output schema is provided to this class which resolves any dynamic columns to the concrete type provided by the reader. If the column was already resolved, this class ensures that the reader's column type matches the resolved type to prevent column type changes.
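The resolution step can be sketched as a merge (hypothetical types and names; Drill uses TupleMetadata and its own type system): LATE columns take the reader's type, while already-resolved columns must match or an error is raised.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of resolving the reader output schema against the scan
// schema. A dynamic (LATE) column adopts the reader's concrete type; a
// resolved column must keep its type, or the merge fails.
public class SchemaResolver {
  public static Map<String, String> applyReaderSchema(
      Map<String, String> scanSchema, Map<String, String> readerOutput) {
    Map<String, String> resolved = new LinkedHashMap<>(scanSchema);
    for (Map.Entry<String, String> col : readerOutput.entrySet()) {
      String existing = resolved.get(col.getKey());
      if (existing == null || existing.equals("LATE")) {
        resolved.put(col.getKey(), col.getValue()); // resolve dynamic column
      } else if (!existing.equals(col.getValue())) {
        // Prevent column type changes once a type is fixed.
        throw new IllegalStateException("Type conflict for " + col.getKey()
            + ": " + existing + " vs. " + col.getValue());
      }
    }
    return resolved;
  }
}
```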
Dynamic Wildcard Schema
Traditional query planners resolve the wildcard (*) in the
planner. When using a dynamic schema, Drill resolves the wildcard at
run time. In this case, the reader input schema is empty and the reader
defines the entire set of columns: names and types. This class then
replaces the wildcard with the columns from the reader.
Missing Columns
When the reader output schema is a subset of the reader input schema, we have a set of missing columns (also called "null columns"). A part of the scan framework must invent vectors for these columns. If the type is available, then that is the type used; otherwise the missing column handler must invent a type (such as the classic nullable INT historically used). If the mode is
nullable, the column is filled with nulls. If non-nullable, the column
is filled with a default value. All of this work happens outside of
this class.
The missing column handler defines its own output schema, which is resolved by this class identically to how the reader schema is resolved. The result is that all columns are now resolved to a concrete type.
Missing columns may be needed even for a wildcard if a first reader discovered 3 columns, say, but a later reader encounters only two of them.
Subsequent Readers and Schema Changes
All of the above occurs during the first batch of data. After that, the schema is fully defined: subsequent readers will encounter only a fully defined schema, which they must handle the same as if the scan was given a defined schema.
This rule works fine for an explicit project list. However, if the project list is dynamic and contains a wildcard, then the reader defines the output schema. What happens if a reader adds columns (or a second or later reader discovers new columns)? Traditionally, Drill simply adds those columns and sends an OK_NEW_SCHEMA (schema change) downstream for other operators to deal with.
This class supports the traditional approach as an option. This class also supports a more rational, strict rule: the schema is fixed after the first batch. That is, the first batch defines a schema commit point after which the scan agrees not to change the schema. In this scenario, the first batch defines a schema (and project list) given to all subsequent readers. Any new columns are ignored (with a warning in the log.)
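The strict rule can be sketched as follows (hypothetical class; the real tracker works on TupleMetadata and the scan lifecycle): before the commit point the schema grows freely, after it new columns are ignored with a warning.

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch of the strict rule: the first batch acts as a schema
// commit point; columns discovered after the commit that are not already in
// the schema are ignored rather than triggering a schema change.
public class SchemaCommit {
  private final Set<String> columns = new LinkedHashSet<>();
  private boolean committed;

  // Returns true if the column is part of the schema after this call.
  public boolean offerColumn(String name) {
    if (columns.contains(name)) {
      return true;                  // already known: no schema change
    }
    if (committed) {
      // Strict mode: warn and ignore the late-arriving column.
      System.err.println("Ignoring column discovered after schema commit: " + name);
      return false;
    }
    columns.add(name);              // first batch: schema is still open
    return true;
  }

  public void commit() { committed = true; }

  public Set<String> schema() { return columns; }
}
```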
Output Schema
All of the above contribute to the output schema: the schema sent downstream to the next operator. All of the above work is done to either:
- Pass the defined schema to the output, with the reader (and missing columns handler) producing columns that match that schema.
- Expand the dynamic schema with details provided by the reader (and missing columns handler), including the actual set of columns if the dynamic schema includes a wildcard.
Either way, the result is a schema which describes the actual vectors sent downstream.
Consumers
Information from this class is used in multiple ways:
- A project list is given to the ResultSetLoader to specify which columns to project to vectors, and which to satisfy with a dummy column reader.
- The reader, via the SchemaNegotiator, uses the reader input schema.
- The reader, via the ResultSetLoader, provides the reader output schema.
- An implicit column manager handles the various implicit and partition directory columns: identifying them, then later providing vector values.
- A missing columns handler fills in missing columns.
Design
Schema resolution is a set of layers of choices. Each level and choice is represented by a class: virtual methods pick the right path based on class type rather than using a large collection of if-statements.
Maps
Maps present a difficult challenge. Drill allows projection within maps and we wish to exploit that in the scan. For example: m.a. The
. The
column state classes provide a map class. However, the projection notation
is ambiguous: m.a could be a map m with a child column a, or it could be a DICT with a VARCHAR key.
To handle this, if we only have the project list, we use an unresolved
column state, even if the projection itself has internal structure. We
use a projection-based filter in the ResultSetLoader
to handle
the ambiguity. The projection filter, when presented with the reader's
choice of column type, will check whether that type is consistent with the projection.
If so, the reader will later present the reader output schema which we
use to resolve the projection-only unresolved column to a map column.
(Or, if the column turns out to be a DICT
, to a simple unresolved
column.)
If the scan contains a second reader, then the second reader is given a
stricter form of projection filter: one based on the actual MAP
(or DICT
) column.
If a defined or provided schema is available, then the schema tracker does have sufficient information to resolve the column directly to a map column, and the first reader will have the strict projection filter.
A user can project a map column which does not actually exist (or, at least, is not known to the first reader.) In that case, the missing column logic applies, but within the map. As a result, a second reader may encounter a type conflict if it discovers the previously-missing column, and finds that the default type conflicts with the real type.
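The ambiguity check at the heart of the projection filter can be sketched as follows (hypothetical names; the real filter inspects full column schemas inside the ResultSetLoader):

```java
// Hypothetical sketch of the projection-filter idea: a projection path such
// as "m.a" is consistent with m being a MAP (with child column a) or a DICT
// (with a VARCHAR key), but not with m being a scalar such as VARCHAR.
public class ProjectionConsistency {
  public enum Type { MAP, DICT, VARCHAR, INT }

  // True if the reader's proposed type for the outer column can satisfy a
  // two-part projection path like "m.a".
  public static boolean consistentWithChildPath(Type outerType) {
    return outerType == Type.MAP || outerType == Type.DICT;
  }
}
```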
Nested Class Summary
Method Summary
void applyEarlyReaderSchema(TupleMetadata readerSchema)
    If a reader can define a schema before reading data, apply that schema to the scan schema.
TupleMetadata applyImplicitCols()
    Indicate that implicit column parsing is complete.
void applyReaderSchema(TupleMetadata readerOutputSchema, CustomErrorContext errorContext)
    Once a reader has read a batch, the reader will have provided a type for each projected column which the reader knows about.
columnProjection(String colName)
    Return the projection for a column, if any.
CustomErrorContext errorContext()
    The scan-level error context used for errors which may occur before the first reader starts.
void expandImplicitCol(ColumnMetadata resolved, ImplicitColumnMarker marker)
    Drill defines a wildcard to expand not just reader columns, but also partition columns.
MutableTupleSchema internalSchema()
    Returns the internal scan schema.
boolean isResolved()
    Is the scan schema resolved? The schema is resolved depending on the complex lifecycle explained in the class comment.
missingColumns(TupleMetadata readerOutputSchema)
    Identifies the missing columns given a reader output schema.
TupleMetadata outputSchema()
    Returns the scan output schema, which is a somewhat complicated computation that depends on the projection type.
projectionFilter(CustomErrorContext errorContext)
    A reader is responsible for reading columns in the reader input schema.
TupleMetadata readerInputSchema()
    The schema which the reader should produce.
void resolveMissingCols(TupleMetadata missingCols)
    The missing column handler obtains the list of missing columns from missingColumns().
int schemaVersion()
    Gives the output schema version, which will start at some arbitrary positive number.
Method Details
projectionType
ScanSchemaTracker.ProjectionType projectionType()
columnProjection
Return the projection for a column, if any.
isResolved
boolean isResolved()
Is the scan schema resolved? The schema is resolved depending on the complex lifecycle explained in the class comment. Resolution occurs when the wildcard (if any) is expanded, and all explicit projection columns obtain a definite type. If schema change is disabled, the schema will not change once it is resolved. If schema change is allowed, then batches or readers may extend the schema, triggering a schema change, and so the scan schema may move from one resolved state to another.
The schema will be fully resolved after the first batch of data arrives from a reader (since the reader lifecycle will then fill in any missing columns). The schema may be resolved sooner (such as if a strict provided schema or an early reader schema is available and there are no missing columns).
- Returns:
- true if the schema is resolved, and hence the outputSchema() is available; false if the schema contains one or more dynamic columns which are not yet resolved.
schemaVersion
int schemaVersion()
Gives the output schema version, which will start at some arbitrary positive number.
If schema change is allowed, the schema version allows detecting schema changes as the scan schema moves from one resolved state to the next. Each schema will have a unique, increasing version number. A schema change has occurred if the version is newer than the previous output schema version.
- Returns:
- the schema version. The absolute number is not important, rather an increase indicates one or more columns were added at the top level or within a map at some nesting level
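The comparison downstream code performs can be sketched in two lines (hypothetical helper; in practice a consumer simply remembers the last version it saw):

```java
// Hypothetical sketch: a schema change is detected purely by comparing the
// current schema version against the last version seen; the absolute value
// is irrelevant, only an increase matters.
public class SchemaVersionCheck {
  public static boolean schemaChanged(int lastSeenVersion, int currentVersion) {
    return currentVersion > lastSeenVersion;
  }
}
```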
-
expandImplicitCol
void expandImplicitCol(ColumnMetadata resolved, ImplicitColumnMarker marker)
Drill defines a wildcard to expand not just reader columns, but also partition columns. When the implicit column handler sees that the query has a wildcard (by calling isProjectAll()), the handler then determines which partition columns are needed and calls this method to add each one.
applyImplicitCols
TupleMetadata applyImplicitCols()
Indicate that implicit column parsing is complete. Returns the implicit columns as identified by the implicit column handler, in the order of the projection list. Implicit columns do not appear in a reader input schema, and it is an error for the reader to produce such columns.
- Returns:
- a sub-schema of only implicit columns, in the order in which they appear in the output schema
-
applyEarlyReaderSchema
void applyEarlyReaderSchema(TupleMetadata readerSchema)
If a reader can define a schema before reading data, apply that schema to the scan schema. This allows the scan to report its output schema before the first batch of data if the scan schema becomes resolved after applying the early reader schema.
readerInputSchema
TupleMetadata readerInputSchema()
The schema which the reader should produce. Depending on the type of the scan (specifically, if isProjectAll() is true), the reader may produce additional columns beyond those in the reader input schema. However, for any batch, the reader, plus the missing columns handler, must produce all columns in the reader input schema. Formally:
reader input schema = output schema - implicit col schema
- Returns:
- the sub-schema which includes those columns which the reader should provide, excluding implicit columns
-
missingColumns
missingColumns(TupleMetadata readerOutputSchema)
Identifies the missing columns given a reader output schema. The reader output schema contains those columns which the reader actually produced. Formally:
missing cols = reader input schema - reader output schema
The reader output schema can contain extra, newly discovered columns. Those are ignored when computing missing columns. Thus, the subtraction is set subtraction: remove columns common to the two sets.
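The set subtraction above can be sketched directly (hypothetical helper; Drill operates on TupleMetadata rather than name sets):

```java
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical sketch of: missing cols = reader input schema - reader output
// schema. Extra columns the reader discovered appear only in the output set,
// so the subtraction simply ignores them.
public class MissingColumns {
  public static Set<String> missingCols(Set<String> readerInput, Set<String> readerOutput) {
    Set<String> missing = new LinkedHashSet<>(readerInput); // keep projection order
    missing.removeAll(readerOutput);  // remove columns common to both sets
    return missing;
  }
}
```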
-
outputSchema
TupleMetadata outputSchema()
Returns the scan output schema, which is a somewhat complicated computation that depends on the projection type. For a wildcard schema:
output schema = implicit cols U reader output schema
For an explicit projection:
output schema = projection list
where the projection list is augmented by types from the provided schema, implicit columns, or readers.
A defined schema is the output schema, so:
output schema = defined schema
- Returns:
- the complete output schema provided by the scan to downstream operators. Includes both reader and implicit columns, in the order of the projection list or, for a wildcard, in the order of the first reader
-
projectionFilter
projectionFilter(CustomErrorContext errorContext)
A reader is responsible for reading columns in the reader input schema. A reader may read additional columns. The projection filter is passed to the ResultSetLoader to determine which columns should be projected, allowing the reader to be blissfully ignorant of which columns are needed. The result set loader provides a dummy reader for unprojected columns. (A reader can, via the result set loader, find out whether a column is projected, if doing so helps reader efficiency.)
The projection filter is the first line of defense for schema conflicts. The ResultSetLoader will query the filter with a full column schema. If that schema conflicts with the scan schema for that column, this method will raise a UserException, which typically indicates a programming error, or a very odd data source in which a column changes types between batches.
- Parameters:
errorContext - the reader-specific error context to use if errors are found
- Returns:
- a filter used to decide which reader columns to project during reading
-
applyReaderSchema
void applyReaderSchema(TupleMetadata readerOutputSchema, CustomErrorContext errorContext)
Once a reader has read a batch, the reader will have provided a type for each projected column which the reader knows about. For a wildcard projection, the reader will have added all the columns that it found. This call takes the reader output schema and merges it with the current scan schema to resolve dynamic types to concrete types and to add newly discovered columns.
The process can raise an exception if the reader projects a column that it shouldn't (which is not actually possible because of the way the ResultSetLoader works). An error can also occur if the reader provides a type different from that already defined in the scan schema by a defined schema, a provided schema, or a previous reader in the same scan. In such cases, the reader is expected to have converted its input type to the specified type, which was presumably selected because the reader is capable of the required conversion.
- Parameters:
readerOutputSchema - the actual schema produced by a reader when reading a record batch
errorContext - the reader-specific error context to use if errors are found
-
resolveMissingCols
The missing column handler obtains the list of missing columns from#missingColumns()
. Depending on the scan lifecycle, some of the columns may have a type, others may be dynamic. The missing column handler chooses a type for any dynamic columns, then calls this method to tell the scan schema tracker the now-resolved column type.Note: a goal of the provided/defined schema system is to avoid the need to guess types for missing columns since doing so quite often leads to problems further downstream in the query. Ideally, the type of missing columns will be known (via the provided or defined schema) to avoid such conflicts.
-
errorContext
CustomErrorContext errorContext()
The scan-level error context used for errors which may occur before the first reader starts. The reader will provide a more detailed error context that describes what is being read.
- Returns:
- the scan-level error context
-
internalSchema
MutableTupleSchema internalSchema()
Returns the internal scan schema. Primarily for testing.
- Returns:
- the internal mutable scan schema