
Package org.apache.drill.exec.physical.impl.scan.framework

Defines the projection, vector continuity and other operations for a set of one or more readers.


Separates the core reader protocol from the logic of working with batches.

Schema Evolution

Drill discovers schema on the fly. The scan operator hosts multiple readers. In general, each reader may have a distinct schema, though the user typically arranges data so that the scanned files share a common schema (else SQL is the wrong tool for analysis). Still, subtle changes can occur: file A is an old version without a new column c, while file B includes the column; and so on.

The scan operator works to ensure schema continuity as much as possible, smoothing out "soft" schema changes that are simply artifacts of reading a collection of files. Only "hard" changes (true changes) are passed downstream.
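The smoothing rule above can be sketched in a few lines. This is a hypothetical illustration, not Drill's actual API: schemas are modeled as ordered lists of column names, and a reader schema that is a subset of the prior scan schema counts as a "soft" change that the scan absorbs (the missing columns become null columns), while anything else is a "hard" change that must go downstream.

```java
import java.util.List;

// Hypothetical sketch (not Drill's actual API) of soft vs. hard schema
// change handling: reuse the prior schema when the new reader offers a
// subset of it; otherwise pass the new schema downstream.
public class SchemaSmoothingSketch {

  // A scan-level schema is modeled here as an ordered list of column names.
  public static List<String> smooth(List<String> priorSchema,
                                    List<String> readerSchema) {
    if (priorSchema.containsAll(readerSchema)) {
      return priorSchema;   // soft change: smooth it away, fill gaps with nulls
    }
    return readerSchema;    // hard change: the new schema must go downstream
  }

  public static void main(String[] args) {
    List<String> fileA = List.of("a", "b", "c"); // newer file: has column c
    List<String> fileB = List.of("a", "b");      // older file: lacks c

    // Reading A then B: B is a subset, so the soft change is smoothed away.
    System.out.println(smooth(fileA, fileB));    // prints [a, b, c]

    // Reading B then A: A adds column c, a hard change; the new schema wins.
    System.out.println(smooth(fileB, fileA));    // prints [a, b, c]
  }
}
```

Real smoothing must also consider column types and modes, but the subset test captures the core idea: downstream operators see a stable schema across the "old file, new file" boundary.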

First, let us define three types of schema change: trivial changes (which require no action at all), soft changes (artifacts of reading a collection of files, such as a column present in one file but missing in another), and hard changes (true changes to the data itself).

The goal of this package is to smooth out trivial and soft schema changes, and to avoid hard schema changes where possible. To do so, we must understand the sources of schema change. To avoid soft schema changes, this scan operator applies schema smoothing at a number of distinct levels. Keep these levels in mind when looking at individual classes; each level maps to one or more of the classes described below.

Selection List Processing

A key challenge in the scan operator is mapping table schemas to the schema returned by the scan operator. We recognize three distinct kinds of selection: a wildcard ("*"), the special "columns" column, and an explicit list of expected table columns. Since Drill uses schema-on-read, we are not sure of the table schema until we actually ask the reader to open the table (file or other data source). A table can then be "early schema" (known at open time) or "late schema" (known only after reading data).
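The three kinds of selection can be classified mechanically. The sketch below is illustrative (the enum and method names are not Drill's API); it shows the decision order: a wildcard anywhere in the list wins, a lone "columns" reference selects the special column, and anything else is treated as an explicit column list.

```java
import java.util.List;

// Hypothetical sketch (illustrative names, not Drill's API) of classifying
// a selection list into the three kinds described above.
public class SelectionKindSketch {

  public enum SelectionKind { WILDCARD, COLUMNS_ARRAY, EXPLICIT_LIST }

  public static SelectionKind classify(List<String> selection) {
    if (selection.contains("*")) {
      return SelectionKind.WILDCARD;        // SELECT * ...
    }
    if (selection.size() == 1 && selection.get(0).equalsIgnoreCase("columns")) {
      return SelectionKind.COLUMNS_ARRAY;   // the special "columns" column
    }
    return SelectionKind.EXPLICIT_LIST;     // SELECT a, b, dir0 ...
  }

  public static void main(String[] args) {
    System.out.println(classify(List.of("*")));        // WILDCARD
    System.out.println(classify(List.of("columns")));  // COLUMNS_ARRAY
    System.out.println(classify(List.of("a", "b")));   // EXPLICIT_LIST
  }
}
```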

A selection list goes through three distinct phases to produce the final schema of the batch returned downstream.

  1. Query selection planning: resolves column names to metadata (AKA implicit or partition) columns, to "*", to the special "columns" column, or to a list of expected table columns.
  2. File selection planning: determines the values of metadata columns based on file and/or directory names.
  3. Table selection planning: determines the fully resolved output schema by resolving the wildcard or selection list against the actual table columns. A late-schema table may do table selection planning multiple times: once each time the schema changes.
Output Batch Construction

The batches produced by the scan have up to four distinct kinds of columns:
• File metadata columns (filename, fqn, etc.)
• Directory (partition) metadata columns (dir0, dir1, etc.)
• Table columns (or the special "columns" column)
• Null columns (expected table columns that turned out not to match any actual table column; to avoid errors, Drill returns these as columns filled with nulls)

In practice, the scan operator uses three distinct result set loaders to create these columns:
• Table loader: for the table itself. This loader uses a selection layer to write only the data needed for the output batch, skipping unused columns.
• Metadata loader: to create the file and directory metadata columns.
• Null loader: to populate the null columns.

The scan operator merges the three sets of columns to produce the output batch. The merge step also performs projection: it places columns into the output batch in the order specified by the original SELECT list (or in table order, if the original SELECT used a wildcard). Fortunately, this just involves moving around pointers to vectors; no actual data is moved during projection.
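The merge step can be sketched as a simple reference shuffle. In this hypothetical illustration (not Drill's actual API), each loader's output is modeled as a map from column name to a stand-in "vector" object; projection arranges the existing references in SELECT-list order and copies no data.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the merge step: combine the outputs of the three
// loaders into one batch, ordered by the SELECT list. Only references move.
public class OutputMergeSketch {

  public static List<Object> merge(List<String> selectOrder,
                                   Map<String, Object> tableCols,
                                   Map<String, Object> metadataCols,
                                   Map<String, Object> nullCols) {
    List<Object> output = new ArrayList<>();
    for (String col : selectOrder) {
      Object vector = tableCols.getOrDefault(col,
                      metadataCols.getOrDefault(col, nullCols.get(col)));
      output.add(vector);  // moves a vector reference only; no data copied
    }
    return output;
  }

  public static void main(String[] args) {
    Object aVec = new Object(), fnVec = new Object(), cVec = new Object();
    List<Object> out = merge(
        List.of("a", "filename", "c"),
        Map.of("a", aVec),          // from the table loader
        Map.of("filename", fnVec),  // from the metadata loader
        Map.of("c", cVec));         // from the null loader
    // The output batch holds the very same vector objects, reordered.
    System.out.println(out.get(0) == aVec && out.get(1) == fnVec && out.get(2) == cVec);
  }
}
```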

Class Structure

Some of the key classes here include:
• RowBatchReader: an extremely simple interface for reading data. We would like many developers to create new plugins and readers. The simplified interface pushes all complexity into the scan framework, leaving the reader to just read.
• ShimBatchReader: an implementation of the above that adapts the simplified API to the additional structure needed to work with the result set loader. (The base interface is agnostic about how rows are read.)
• SchemaNegotiator: an interface that allows a batch reader to "negotiate" a schema with the scan framework. The scan framework knows the columns that are to be projected. The reader knows what columns it can offer. The schema negotiator works out how to combine the two, and expresses the result as a result set loader. Column writers are defined for all columns that the reader wants to read, but only the materialized (projected) columns have actual vectors behind them. The non-projected columns are free-wheeling "dummy" writers.
• And, yes, sorry for the terminology: file "readers" read from files, but use column "writers" to write to value vectors.
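The negotiation handshake can be sketched as follows. This is a hypothetical, simplified illustration; the interface and method names are illustrative, not Drill's actual SchemaNegotiator API. The framework knows which columns are projected, the reader declares the columns it can offer, and every declared column receives a writer, but only projected columns are backed by a real vector; the rest are free-wheeling dummies that discard their input.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of schema negotiation: real writers for projected
// columns, "dummy" writers (same API, no backing vector) for the rest.
public class NegotiationSketch {

  public interface ColumnWriter {
    void write(Object value);
    boolean isProjected();
  }

  public static Map<String, ColumnWriter> negotiate(Set<String> projected,
                                                    List<String> offered,
                                                    Map<String, List<Object>> vectors) {
    Map<String, ColumnWriter> writers = new HashMap<>();
    for (String col : offered) {
      if (projected.contains(col)) {
        List<Object> vector = vectors.computeIfAbsent(col, k -> new ArrayList<>());
        writers.put(col, new ColumnWriter() {
          public void write(Object value) { vector.add(value); }
          public boolean isProjected() { return true; }
        });
      } else {
        writers.put(col, new ColumnWriter() {  // free-wheeling dummy writer
          public void write(Object value) { }  // silently discards the value
          public boolean isProjected() { return false; }
        });
      }
    }
    return writers;
  }

  public static void main(String[] args) {
    Map<String, List<Object>> vectors = new HashMap<>();
    Map<String, ColumnWriter> writers =
        negotiate(Set.of("a"), List.of("a", "b"), vectors);
    writers.get("a").write(1);   // lands in a real vector
    writers.get("b").write(2);   // discarded: column b is not projected
    System.out.println(vectors); // prints {a=[1]}
  }
}
```

The payoff of this design is that the reader's code path is identical whether or not a column is projected; projection becomes invisible to reader authors.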

Copyright © The Apache Software Foundation. All rights reserved.