Defines the projection, vector continuity and other operations for
a set of one or more readers. Separates the core reader protocol from
the logic of working with batches.
Schema Evolution
Drill discovers schema on the fly. The scan operator hosts multiple readers.
In general, each reader may have a distinct schema, though the user typically
arranges data in a way that scanned files have a common schema (else SQL
is the wrong tool for analysis.) Still, subtle changes can occur: file A
is an old version without a new column c, while file B includes the column.
And so on.
The scan operator works to ensure schema continuity as much as
possible, smoothing out "soft" schema changes that are simply artifacts of
reading a collection of files. Only "hard" changes (true changes) are
passed downstream.
First, let us define three types of schema change:
- Trivial: a change in vectors underlying a column. Column a may
be a required integer, but if the actual value vector changes, some
operators treat this as a schema change.
- Soft: a change in column order, but not the type or existence of
columns. Soft changes cause problems for operators that index columns by
position, since column positions change.
- Hard: addition or deletion of a column, or change of column
data type.
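To make the distinction concrete, here is a minimal sketch that classifies the
change between two batch schemas. The ColumnSchema record and the classify()
method are purely illustrative stand-ins, not Drill's actual schema or metadata
classes:

    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical, simplified column descriptor: name, type, and the identity
    // of the value vector that backs the column.
    record ColumnSchema(String name, String type, Object vector) {}

    enum SchemaChange { NONE, TRIVIAL, SOFT, HARD }

    class SchemaChangeClassifier {
      static SchemaChange classify(List<ColumnSchema> before, List<ColumnSchema> after) {
        // Hard: a column was added, removed, or changed type.
        Map<String, String> beforeTypes = new HashMap<>();
        before.forEach(c -> beforeTypes.put(c.name(), c.type()));
        Map<String, String> afterTypes = new HashMap<>();
        after.forEach(c -> afterTypes.put(c.name(), c.type()));
        if (!beforeTypes.equals(afterTypes)) {
          return SchemaChange.HARD;
        }
        // Soft: same columns and types, but in a different order.
        for (int i = 0; i < before.size(); i++) {
          if (!before.get(i).name().equals(after.get(i).name())) {
            return SchemaChange.SOFT;
          }
        }
        // Trivial: identical schema, but at least one column got a new vector.
        for (int i = 0; i < before.size(); i++) {
          if (before.get(i).vector() != after.get(i).vector()) {
            return SchemaChange.TRIVIAL;
          }
        }
        return SchemaChange.NONE;
      }
    }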
The goal of this package is to smooth out trivial and soft schema changes,
and avoid hard schema changes where possible. To do so, we must understand
the sources of schema change.
- Trivial schema changes due to multiple readers. In general, each reader
should be independent of other readers, since each file is independent of
other files. Since readers are independent, they should have control over
the schema (and vectors) used to read the file. However, distinct vectors
trigger a trivial schema change.
- Schema changes due to readers that discover data schema as the read
progresses. The start of a file might have two columns, a and b. A second
batch might discover column c. Then, when reading a second file, the process
might repeat. In general, if we have already seen column c in the first file,
there is no need to trigger a schema change upon discovering it in the
second. (Though there are subtle issues, such as handling required types.)
- Schema changes due to columns that appear in one file, but not another.
This is a variation of the above issue, but occurs across files, as
explained above.
- Schema changes due to guessing wrong about the type of a missing
column. A query might request columns a, b and c. The first file has only columns
a and b, forcing the scan operator to "make up" a column c. Typically Drill
uses a nullable INT. But a later reader might find a column c and discover
that it is actually a VARCHAR. (See the sketch after this list.)
- Actual schema changes in the data: a file might contain a run of numbers,
only to switch to strings later. File A might have columns a and b, while a
second file adds column c. (In general, columns are not removed; they are only
added or change type.)
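The last two cases interact: the scan must invent a type for a column missing
from the current file, and a later file may contradict that guess. Below is a
minimal sketch of one resolution strategy, under the assumption (the class and
method names are hypothetical) that the scan remembers the types seen by
earlier readers and otherwise falls back to Drill's traditional nullable INT
guess:

    import java.util.Map;
    import java.util.Optional;

    // Hypothetical helper: pick a type for a column that the current file lacks.
    class MissingColumnTypeGuesser {

      // Types observed for each column by earlier readers in this scan.
      private final Map<String, String> previouslySeenTypes;

      MissingColumnTypeGuesser(Map<String, String> previouslySeenTypes) {
        this.previouslySeenTypes = previouslySeenTypes;
      }

      String typeFor(String missingColumn) {
        // Prefer a type that an earlier reader actually produced; otherwise
        // fall back to the default guess of a nullable (OPTIONAL) INT.
        return Optional.ofNullable(previouslySeenTypes.get(missingColumn))
            .orElse("OPTIONAL INT");
      }
    }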
To avoid soft schema changes, this scan
operator uses a variety of schema smoothing "levels":
- Level 0: anything that the reader might do, such as a JDBC data source
requesting only the required columns.
- Level 1: for queries with an explicit select (SELECT a, b, c...), the
result set loader will filter out unwanted columns. So, if file A also
includes column d, and file B adds d and f, the result set loader will
project out the unneeded columns, avoiding schema change (and unnecessary
vector writes).
- Level 2: for multiple readers, or readers with evolving schema, a
buffering layer fills in missing columns using the type already seen,
if possible.
- Level 3: soft changes are avoided by projecting table columns (and
metadata columns) into the order requested by an explicit select.
- Level 4: the scan operator itself monitors the resulting schema,
watching for changes that cancel out. For example, each reader builds
its own schema. If the two files have an identical schema, then the resulting
schemas are identical and no schema change need be advertised downstream.
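As a concrete illustration of Level 4, a tracker need only advertise a new
schema when the batch schema actually differs from the last one sent
downstream. The sketch below shows the idea only; it is not the actual
OperatorRecordBatch.SchemaTracker implementation, and the string-based schema
representation is purely illustrative:

    import java.util.List;
    import java.util.Objects;

    // Hypothetical tracker: bump the schema version only on a real change.
    class SchemaVersionTracker {

      private List<String> currentSchema;   // e.g. ["a:INT", "b:VARCHAR"]
      private int version;

      // Returns true if downstream must be notified of a new schema.
      boolean offer(List<String> batchSchema) {
        if (Objects.equals(batchSchema, currentSchema)) {
          return false;      // identical schemas from two readers cancel out
        }
        currentSchema = List.copyOf(batchSchema);
        version++;
        return true;
      }

      int version() { return version; }
    }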
Keep the above in mind when looking at individual classes. Mapping of
levels to classes:
- Level 1: LogicalTupleSet in the ResultSetLoader class.
- Level 2: ProjectionPlanner, ScanProjection and ScanProjector.
- Level 3: ScanProjector.
- Level 4: OperatorRecordBatch.SchemaTracker.
Selection List Processing
A key challenge in the scan operator is the mapping of table schemas to the
schema returned by the scan operator. We recognize three distinct kinds of
selection:
- Wildcard: SELECT *
- Generic: SELECT columns, where "columns" is an array of column
values. Supported only by readers that return text columns.
- Explicit: SELECT a, b, c, ... where "a", "b", "c" and so on are
the expected names of table columns
Since Drill uses schema-on-read, we are not sure of the table schema until
we actually ask the reader to open the table (file or other data source.)
Then, a table can be "early schema" (known at open time) or "late schema"
(known only after reading data.)
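The three selection forms can be recognized directly from the projection list.
The sketch below is a simplified classification; Drill's actual projection
parsing handles additional cases, and the class and enum names here are made up
for illustration:

    import java.util.List;

    // Hypothetical classification of the SELECT list.
    enum SelectionKind { WILDCARD, COLUMNS_ARRAY, EXPLICIT }

    class SelectionClassifier {
      static SelectionKind classify(List<String> selectList) {
        if (selectList.size() == 1 && selectList.get(0).equals("*")) {
          return SelectionKind.WILDCARD;        // SELECT *
        }
        if (selectList.size() == 1 && selectList.get(0).equalsIgnoreCase("columns")) {
          return SelectionKind.COLUMNS_ARRAY;   // SELECT columns (text readers only)
        }
        return SelectionKind.EXPLICIT;          // SELECT a, b, c, ...
      }
    }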
A selection list goes through three distinct phases to result in a final
schema of the batch returned downstream.
- Query selection planning: resolves column names to metadata (AKA implicit
or partition) columns, to "*", to the special "columns" column, or to a list
of expected table columns.
- File selection planning: determines the values of metadata columns based
on file and/or directory names.
- Table selection planning: determines the fully resolved output schema by
resolving the wildcard or selection list against the actual table columns.
A late-schema table may do table selection planning multiple times: once
each time the schema changes.
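A rough sketch of the final step, table selection planning, for an explicit
select list. The ResolvedColumn forms and the resolve() method below are
illustrative only, not the real projection classes, and details such as
case-insensitive name matching are omitted:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    // Hypothetical resolved-column forms produced by table selection planning.
    sealed interface ResolvedColumn permits TableColumn, MetadataColumn, NullColumn {}
    record TableColumn(String name, int tableIndex) implements ResolvedColumn {}
    record MetadataColumn(String name, String value) implements ResolvedColumn {}
    record NullColumn(String name) implements ResolvedColumn {}

    class TableSelectionPlanner {
      // selectList:   names from the explicit SELECT, in output order
      // tableColumns: actual columns the reader discovered, in table order
      // metadata:     implicit/partition values for this file (filename, dir0, ...)
      static List<ResolvedColumn> resolve(List<String> selectList,
                                          List<String> tableColumns,
                                          Map<String, String> metadata) {
        List<ResolvedColumn> output = new ArrayList<>();
        for (String name : selectList) {
          int tableIndex = tableColumns.indexOf(name);
          if (tableIndex >= 0) {
            output.add(new TableColumn(name, tableIndex));
          } else if (metadata.containsKey(name)) {
            output.add(new MetadataColumn(name, metadata.get(name)));
          } else {
            output.add(new NullColumn(name));  // filled with nulls in the batch
          }
        }
        return output;
      }
    }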
Output Batch Construction
The batches produced by the scan have up to four distinct kinds of columns:
- File metadata (filename, fqn, etc.)
- Directory (partition) metadata (dir0, dir1, etc.)
- Table columns (or the special "columns" column)
- Null columns (expected table columns that turned out to not match any
actual table column). To avoid errors, Drill returns these as columns filled
with nulls.
In practice, the scan operator uses three distinct result set loaders to create
these columns:
- Table loader: for the table itself. This loader uses a selection layer to
write only the data needed for the output batch, skipping unused columns.
- Metadata loader: to create the file and directory metadata columns.
- Null loader: to populate the null columns.
The scan operator merges the three sets of columns to produce the output
batch. The merge step also performs projection: it places columns into the
output batch in the order specified by the original SELECT list (or table order,
if the original SELECT had a wildcard.) Fortunately, this just involves
moving around pointers to vectors; no actual data is moved during projection.
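A sketch of that merge, assuming each loader exposes its columns as a map from
name to value vector; the names and the use of Object as a vector placeholder
are illustrative, not the actual projection API:

    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical merge: gather vectors from the three loaders and arrange
    // them in the order of the original SELECT. Only references move, never row data.
    class OutputBatchBuilder {

      static Map<String, Object> merge(List<String> outputOrder,
                                       Map<String, Object> tableVectors,
                                       Map<String, Object> metadataVectors,
                                       Map<String, Object> nullVectors) {
        Map<String, Object> outputBatch = new LinkedHashMap<>();
        for (String name : outputOrder) {
          Object vector = tableVectors.getOrDefault(name,
              metadataVectors.getOrDefault(name, nullVectors.get(name)));
          outputBatch.put(name, vector);   // a pointer move, not a data copy
        }
        return outputBatch;
      }
    }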
Class Structure
Some of the key classes here include:
RowBatchReader: an extremely simple interface for reading data.
We would like many developers to create new plugins and readers. The simplified
interface pushes all complexity into the scan framework, leaving the reader to
just read.
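The exact methods have varied across versions of the framework; the sketch
below shows only the general shape of the contract, not the real RowBatchReader
signatures:

    // Sketch of the reader life cycle; names are illustrative, not the actual API.
    interface SimpleBatchReader {
      boolean open();   // open the file or other data source; false means nothing to read
      boolean next();   // read the next batch of rows; false means end of data
      void close();     // release resources, even if open() or next() failed
    }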
ShimBatchReader: an implementation of RowBatchReader that adapts the
simplified API to the additional structure needed to work with the result set
loader. (The base interface is agnostic about how rows are read.)
SchemaNegotiator: an interface that allows a batch reader to
"negotiate" a schema with the scan framework. The scan framework knows the
columns that are to be projected. The reader knows what columns it can offer.
The schema negotiator works out how to combine the two. It expresses the result
as a result set loader. Column writers are defined for all columns that the
reader wants to read, but only the materialized (projected) columns have actual
vectors behind them. The non-projected columns are "free-wheeling" dummy
writers.
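The exchange reduces to a simple pattern: the reader declares what it can
offer, the framework hands back writers, and the reader writes every column it
knows about while the framework quietly discards unprojected values. The
stand-in interfaces below only mimic that shape; they are not the actual
SchemaNegotiator or ResultSetLoader signatures:

    // Stand-in interfaces that mimic the shape of the negotiation; the real
    // SchemaNegotiator and ResultSetLoader APIs live in the Drill code base.
    interface Negotiator {
      void declareColumn(String name, String type);  // reader offers a column
      RowWriter build();                             // framework returns the writers
    }

    interface RowWriter {
      void start();                                  // begin a new row
      void setValue(String column, Object value);    // dummy (no-op) write if not projected
      void save();                                   // commit the row
    }

    // A reader that offers columns a and b; only projected columns get real vectors.
    class ExampleReader {
      void readBatch(Negotiator negotiator) {
        negotiator.declareColumn("a", "INT");
        negotiator.declareColumn("b", "VARCHAR");
        RowWriter writer = negotiator.build();

        writer.start();
        writer.setValue("a", 10);    // backed by a value vector if "a" is projected
        writer.setValue("b", "x");   // silently discarded if "b" is not projected
        writer.save();
      }
    }

Because unprojected columns are written through dummy writers, reader code
stays the same whether or not a given column is projected.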
And, yes, sorry for the terminology. File "readers" read from files, but
use column "writers" to write to value vectors.