public interface ScanSchemaTracker
The scan operator output schema can be defined or dynamic.
With a defined schema, the reader is given a fully-defined schema and
its job is to produce vectors that match the given schema. (The details
are handled by the ResultSetLoader.)
At present, the planner does not actually provide a defined schema; this class supports one anyway, and verifies that the defined schema, if provided, exactly matches the names in the project list, in the same order.
With a dynamic schema, the type of each column starts as LATE, which basically means
"a type to be named later" by the reader.
With a provided schema, the project list defines the output schema. If the provided schema includes projected columns, then the provided types for those columns flow to the output schema, just as for a defined schema. Similarly, the reader is given a defined schema for those columns.
Where a provided schema differs is that the project list can include columns not in the provided schema; such columns act like the dynamic case: the reader defines the column type.
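The merge rule above can be sketched with a small, hedged example. The class, method, and `LATE` marker below are illustrative stand-ins, not Drill's actual `TupleMetadata` machinery, and name matching is simplified to exact (Drill's is case-insensitive):

```java
// Illustrative sketch: resolving a project list against a provided schema.
// Projected columns found in the provided schema take its type; the rest
// remain dynamic ("LATE") for the reader to define.
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ProvidedSchemaMerge {

    public static final String LATE = "LATE"; // placeholder for "type to be named later"

    public static Map<String, String> resolve(List<String> projectList,
                                              Map<String, String> providedSchema) {
        Map<String, String> output = new LinkedHashMap<>(); // preserves project-list order
        for (String col : projectList) {
            // Provided type if known, else the column stays dynamic
            output.put(col, providedSchema.getOrDefault(col, LATE));
        }
        return output;
    }

    public static void main(String[] args) {
        Map<String, String> provided = Map.of("a", "INT", "b", "VARCHAR");
        System.out.println(resolve(List.of("a", "b", "c"), provided));
        // -> {a=INT, b=VARCHAR, c=LATE}
    }
}
```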
Drill supports several forms of projection:

- `SELECT *`: Project all data source columns, whatever they happen to be. Create columns using names from the data source. The data source also determines the order of columns within the row.
- `SELECT a, b, c, ...`: Project a specific set of columns, identified by case-insensitive name. The output row uses the names from the SELECT list, but types from the data source. Columns appear in the row in the order specified by the SELECT.
- `SELECT COUNT(*)` type queries. The provided projection list contains no (table) columns, though it may contain metadata columns.

Specific readers also support metadata columns:

- `columns`, or optionally specific members of the `columns` array such as `columns[1]`. (Supported only by specific readers.)
- `fqn`, `filename`, `filepath` and `suffix`: These reference parts of the name of the file being scanned.
- `dir0`, `dir1`, ...: These reference parts of the path name of the file.

For `COUNT(*)` we need only a count of records, but none
of the column values. Implementation of the count is left to the specific reader
as some can optimize this case. The output schema may include a single
dummy column. In this case, the first batch defines the schema expected
from all subsequent readers and batches.
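The projection categories above can be sketched as a simple classifier. `ProjectionClassifier`, `classify`, and the enum constants below are hypothetical names for illustration, not Drill's actual `ScanSchemaTracker.ProjectionType` API:

```java
// Hedged sketch: classifying a project list into the categories described
// above. A wildcard anywhere in the list means "project all"; an empty
// list corresponds to SELECT COUNT(*)-style queries.
import java.util.List;

public class ProjectionClassifier {

    public enum ProjectionType { ALL, EXPLICIT, NONE }

    public static ProjectionType classify(List<String> projectList) {
        if (projectList.isEmpty()) {
            return ProjectionType.NONE;      // COUNT(*)-style: no table columns
        }
        if (projectList.contains("*")) {
            return ProjectionType.ALL;       // SELECT *
        }
        return ProjectionType.EXPLICIT;      // SELECT a, b, c, ...
    }

    public static void main(String[] args) {
        System.out.println(classify(List.of("*")));      // ALL
        System.out.println(classify(List.of("a", "b"))); // EXPLICIT
        System.out.println(classify(List.of()));         // NONE
    }
}
```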
If the reader can produce only one type for each column, then the provided or defined schema should already specify that type, and the reader can simply ignore the reader input schema. (This feature allows this scheme to be compatible with older readers.)
However, if the reader can convert a column to multiple types, then the
reader should use the reader input schema to choose a type. If the input
schema is dynamic (the type is LATE), then the reader chooses the
column type and should choose the "most natural" type.
The result is the reader output schema: a subset of the reader
input schema in which each included column has a concrete type. (The
reader may have provided extra columns. In this case, the
ResultSetLoader will have ignored those columns, providing a
dummy column writer, and omitting non-projected columns from the reader
output schema.)
The reader output schema is provided to this class which resolves any dynamic columns to the concrete type provided by the reader. If the column was already resolved, this class ensures that the reader's column type matches the resolved type to prevent column type changes.
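The resolution rule just described can be sketched as follows. Types are modeled as plain strings and the names are illustrative; this is not the actual Drill implementation, only the merge-and-verify logic in miniature:

```java
// Sketch of the resolution rule: a dynamic (LATE) column takes the
// reader's concrete type; an already-resolved column must match the
// reader's type exactly, otherwise the merge fails.
import java.util.HashMap;
import java.util.Map;

public class SchemaResolver {

    public static final String LATE = "LATE";

    // scanSchema is mutated in place, mirroring how the tracker resolves
    // its internal mutable schema as reader batches arrive.
    public static void applyReaderType(Map<String, String> scanSchema,
                                       String column, String readerType) {
        String current = scanSchema.get(column);
        if (current == null || LATE.equals(current)) {
            scanSchema.put(column, readerType);   // resolve the dynamic column
        } else if (!current.equals(readerType)) {
            throw new IllegalStateException(      // prevent column type changes
                "Type conflict for " + column + ": " + current + " vs " + readerType);
        }
    }

    public static void main(String[] args) {
        Map<String, String> scan = new HashMap<>();
        scan.put("a", LATE);
        applyReaderType(scan, "a", "VARCHAR");
        System.out.println(scan.get("a")); // VARCHAR
    }
}
```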
A special case occurs for a wildcard (`*`) in the
planner. When using a dynamic schema, Drill resolves the wildcard at
run time. In this case, the reader input schema is empty and the reader
defines the entire set of columns: names and types. This class then
replaces the wildcard with the columns from the reader.
A reader may not be able to provide all projected columns. A missing
column handler chooses a type for each missing column (such as the
nullable INT that Drill historically used.) If the mode is
nullable, the column is filled with nulls. If non-nullable, the column
is filled with a default value. All of this work happens outside of
this class.
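The null-vs-default fill rule can be illustrated with a tiny, hedged sketch. Column values are modeled as a plain list of objects; the class and method names are hypothetical, and the real work in Drill happens through value vectors, not Java lists:

```java
// Sketch of the missing-column fill: nullable columns are filled with
// nulls, non-nullable columns with a default value, one entry per row.
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class MissingColumnFiller {

    public static List<Object> fill(boolean nullable, Object defaultValue, int rowCount) {
        // nCopies permits a null element, so both cases share one code path
        return new ArrayList<>(Collections.nCopies(rowCount, nullable ? null : defaultValue));
    }

    public static void main(String[] args) {
        System.out.println(fill(true, 0, 3));   // [null, null, null]
        System.out.println(fill(false, 0, 3));  // [0, 0, 0]
    }
}
```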
The missing column handler defines its own output schema, which is resolved by this class identically to how the reader schema is resolved. The result is that all columns are now resolved to a concrete type.
Missing columns may be needed even for a wildcard if a first reader discovers, say, three columns, but a later reader encounters only two of them.
This rule works fine for an explicit project list. However, if the
project list is dynamic, and contains a wildcard, then the reader
defines the output schema. What happens if a reader adds columns
(or a second or later reader discovers new columns)? Traditionally,
Drill simply adds those columns and sends an OK_NEW_SCHEMA
(schema change) downstream for other operators to deal with.
This class supports the traditional approach as an option. This class also supports a more rational, strict rule: the schema is fixed after the first batch. That is, the first batch defines a schema commit point after which the scan agrees not to change the schema. In this scenario, the first batch defines a schema (and project list) given to all subsequent readers. Any new columns are ignored (with a warning in the log.)
Either way, the result is a schema which describes the actual vectors sent downstream.
The reader input schema is given to the ResultSetLoader to specify which
columns to project to vectors, and which to satisfy with a dummy column
writer. The ResultSetLoader provides the reader output schema.
Maps present a difficult case. The project list can reference a map
member, such as m.a. The
column state classes provide a map class. However, the projection notation
is ambiguous: m.a could be a map `m` with a child column
'a'. Or, it could be a DICT with a VARCHAR key.
To handle this, if we only have the project list, we use an unresolved
column state, even if the projection itself has internal structure. We
use a projection-based filter in the ResultSetLoader to handle
the ambiguity. The projection filter, when presented with the reader's
choice for column type, will check if that type is consistent with the projection.
If so, the reader will later present the reader output schema, which we
use to resolve the projection-only unresolved column to a map column.
(Or, if the column turns out to be a DICT, to a simple unresolved
column.)
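The consistency check just described can be sketched as below. The class, enum, and method are hypothetical simplifications: given a dotted projection path like `m.a`, a reader-declared MAP or DICT is consistent, while a scalar is not:

```java
// Hedged sketch of the projection-filter ambiguity check for paths such
// as "m.a": the path is satisfiable by a MAP (member access) or a DICT
// (key lookup), but not by a scalar column.
public class MapProjectionFilter {

    public enum Kind { MAP, DICT, SCALAR }

    // hasChildPath is true for projections like "m.a", false for plain "m".
    public static boolean isConsistent(Kind readerKind, boolean hasChildPath) {
        if (!hasChildPath) {
            return true;                       // plain "m" accepts any kind
        }
        return readerKind == Kind.MAP || readerKind == Kind.DICT;
    }

    public static void main(String[] args) {
        System.out.println(isConsistent(Kind.MAP, true));    // true
        System.out.println(isConsistent(Kind.SCALAR, true)); // false
        System.out.println(isConsistent(Kind.SCALAR, false)); // true
    }
}
```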
If the scan contains a second reader, then the second reader is given a
stricter form of projection filter: one based on the actual MAP
(or DICT) column.
If a defined or provided schema is available, then the schema tracker does have sufficient information to resolve the column directly to a map column, and the first reader will have the strict projection filter.
A user can project a map column which does not actually exist (or, at least, is not known to the first reader.) In that case, the missing column logic applies, but within the map. As a result, a second reader may encounter a type conflict if it discovers the previously-missing column, and finds that the default type conflicts with the real type.
See also ImplicitColumnExplorer, the class from which this class
evolved.
Nested Class Summary

Modifier and Type | Interface and Description
---|---
static class | ScanSchemaTracker.ProjectionType
Method Summary

Modifier and Type | Method and Description
---|---
void | `applyEarlyReaderSchema(TupleMetadata readerSchema)` If a reader can define a schema before reading data, apply that schema to the scan schema.
TupleMetadata | `applyImplicitCols()` Indicate that implicit column parsing is complete.
void | `applyReaderSchema(TupleMetadata readerOutputSchema, CustomErrorContext errorContext)` Once a reader has read a batch, the reader will have provided a type for each projected column which the reader knows about.
ProjectedColumn | `columnProjection(String colName)` Return the projection for a column, if any.
CustomErrorContext | `errorContext()` The scan-level error context used for errors which may occur before the first reader starts.
void | `expandImplicitCol(ColumnMetadata resolved, ImplicitColumnMarker marker)` Drill defines a wildcard to expand not just reader columns, but also partition columns.
MutableTupleSchema | `internalSchema()` Returns the internal scan schema.
boolean | `isResolved()` Is the scan schema resolved? The schema is resolved depending on the complex lifecycle explained in the class comment.
TupleMetadata | `missingColumns(TupleMetadata readerOutputSchema)` Identifies the missing columns given a reader output schema.
TupleMetadata | `outputSchema()` Returns the scan output schema, a somewhat complicated computation that depends on the projection type.
ProjectionFilter | `projectionFilter(CustomErrorContext errorContext)` A reader is responsible for reading columns in the reader input schema.
ScanSchemaTracker.ProjectionType | `projectionType()`
TupleMetadata | `readerInputSchema()` The schema which the reader should produce.
void | `resolveMissingCols(TupleMetadata missingCols)` The missing column handler obtains the list of missing columns from `missingColumns()`.
int | `schemaVersion()` Gives the output schema version, which will start at some arbitrary positive number.

Method Detail
ScanSchemaTracker.ProjectionType projectionType()
ProjectedColumn columnProjection(String colName)
boolean isResolved()
The schema will be fully resolved after the first batch of data arrives from a reader (since the reader lifecycle will then fill in any missing columns.) The schema may be resolved sooner, such as when a strict provided schema or an early reader schema is available and there are no missing columns.
Returns: true if the output schema (see outputSchema()) is available, false if the schema
contains one or more dynamic columns which are not yet resolved.

int schemaVersion()
If schema change is allowed, the schema version allows detecting schema changes as the scan schema moves from one resolved state to the next. Each schema will have a unique, increasing version number. A schema change has occurred if the version is newer than the previous output schema version.
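The version comparison a downstream consumer needs is a one-liner; the tracker itself assigns the increasing versions. The sketch below uses illustrative names, not a real Drill class:

```java
// Minimal sketch of version-based schema-change detection: a change has
// occurred whenever the current version is newer than the last one seen.
public class SchemaChangeDetector {

    private int lastSeenVersion = -1; // no schema seen yet

    // Returns true the first time, and whenever the tracker has moved to
    // a newer resolved schema since the last check.
    public boolean schemaChanged(int currentVersion) {
        boolean changed = currentVersion > lastSeenVersion;
        lastSeenVersion = Math.max(lastSeenVersion, currentVersion);
        return changed;
    }

    public static void main(String[] args) {
        SchemaChangeDetector d = new SchemaChangeDetector();
        System.out.println(d.schemaChanged(1)); // true: first schema
        System.out.println(d.schemaChanged(1)); // false: unchanged
        System.out.println(d.schemaChanged(2)); // true: schema evolved
    }
}
```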
void expandImplicitCol(ColumnMetadata resolved, ImplicitColumnMarker marker)
If the projection is a wildcard (see isProjectAll()), the handler
then determines which partition columns are needed and calls this
method to add each one.

TupleMetadata applyImplicitCols()
void applyEarlyReaderSchema(TupleMetadata readerSchema)
TupleMetadata readerInputSchema()
If the scan projects all columns (isProjectAll() is true),
the reader may produce additional columns beyond those in the
reader input schema. However, for any batch, the reader, plus the
missing columns handler, must produce all columns in the reader input
schema.
Formally:
reader input schema = output schema - implicit col schema
TupleMetadata missingColumns(TupleMetadata readerOutputSchema)
Formally:
missing cols = reader input schema - reader output schema
The reader output schema can contain extra, newly discovered columns. Those are ignored when computing missing columns. Thus, the subtraction is set subtraction: remove columns common to the two sets.
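The set subtraction above can be sketched with ordered maps of column name to type (illustrative stand-ins for Drill's TupleMetadata): missing columns are the reader input columns absent from the reader output, and extra reader columns fall out of the computation naturally:

```java
// Sketch of: missing cols = reader input schema - reader output schema,
// as set subtraction on column names. Extra columns in the reader output
// are simply ignored.
import java.util.LinkedHashMap;
import java.util.Map;

public class MissingColumns {

    public static Map<String, String> missing(Map<String, String> readerInput,
                                              Map<String, String> readerOutput) {
        Map<String, String> result = new LinkedHashMap<>(readerInput);
        result.keySet().removeAll(readerOutput.keySet()); // set subtraction
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> input = new LinkedHashMap<>();
        input.put("a", "INT");
        input.put("b", "VARCHAR");
        Map<String, String> output = Map.of("a", "INT", "c", "FLOAT8"); // c is extra
        System.out.println(missing(input, output)); // {b=VARCHAR}
    }
}
```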
TupleMetadata outputSchema()
For a wildcard schema:
output schema = implicit cols U reader output schema
For an explicit projection:
output schema = projection list
Where the projection list is augmented by types from the
provided schema, implicit columns or readers.
A defined schema is the output schema, so:
output schema = defined schema
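The wildcard-case union above can be sketched the same way. The placement of implicit columns after reader columns here is purely for illustration, and the sketch assumes implicit names do not collide with reader names:

```java
// Sketch of the wildcard case: output schema = implicit cols U reader
// output schema, modeled as ordered name-to-type maps.
import java.util.LinkedHashMap;
import java.util.Map;

public class OutputSchema {

    public static Map<String, String> wildcardOutput(Map<String, String> implicitCols,
                                                     Map<String, String> readerOutput) {
        Map<String, String> result = new LinkedHashMap<>(readerOutput);
        result.putAll(implicitCols); // union, assuming no name collisions
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> reader = new LinkedHashMap<>(Map.of("a", "INT"));
        Map<String, String> implicit = Map.of("filename", "VARCHAR");
        System.out.println(wildcardOutput(implicit, reader)); // {a=INT, filename=VARCHAR}
    }
}
```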
ProjectionFilter projectionFilter(CustomErrorContext errorContext)
The projection filter is given to the ResultSetLoader to determine which columns should be projected,
allowing the reader to be blissfully ignorant of which columns are needed.
The result set loader provides a dummy writer for unprojected columns.
(A reader can, via the result set loader, find out if a column is projected, if
doing so helps reader efficiency.)
The projection filter is the first line of defense for schema conflicts.
The ResultSetLoader will query the filter with a full column
schema. If that schema conflicts with the scan schema for that column,
this method will raise a UserException, which typically indicates
a programming error, or a very odd data source in which a column changes
types between batches.
Parameters: errorContext - the reader-specific error context to use if
errors are found

void applyReaderSchema(TupleMetadata readerOutputSchema, CustomErrorContext errorContext)
The process can raise an exception if the reader projects a column that
it shouldn't (which is not actually possible because of the way the
ResultSetLoader
works.) An error can also occur if the reader
provides a type different than that already defined in the scan schema
by a defined schema, a provided schema, or a previous reader in the same
scan. In such cases, the reader is expected to have converted its input
type to the specified type, which was presumably selected because the
reader is capable of the required conversion.
Parameters: readerOutputSchema - the actual schema produced by a reader when
reading a record batch; errorContext - the reader-specific error context to use if
errors are found

void resolveMissingCols(TupleMetadata missingCols)
The missing column handler obtains the list of missing columns from
missingColumns(). Depending on the scan lifecycle, some of the
columns may have a type, others may be dynamic. The missing column handler
chooses a type for any dynamic columns, then calls this method to tell
the scan schema tracker the now-resolved column type.
Note: a goal of the provided/defined schema system is to avoid the need to guess types for missing columns since doing so quite often leads to problems further downstream in the query. Ideally, the type of missing columns will be known (via the provided or defined schema) to avoid such conflicts.
CustomErrorContext errorContext()
MutableTupleSchema internalSchema()