
Package org.apache.drill.exec.physical.impl.scan.v3.schema

Provides run-time semantic analysis of the projection list for the scan operator.


The project list can include table columns and a variety of special columns. Requested columns can exist in the table, or may be "missing" with null values applied. The code here prepares a run-time projection plan based on the actual table schema.

Resolves the scan schema throughout the scan lifecycle. Schema resolution comes from a variety of sources: the project list, an optional provided schema, an optional early reader schema, and the reader output schema. Resolution starts with preparing the schema for the first reader.

The result is a defined schema, which may include table columns, implicit columns, and null-filled "missing" columns. The schema itself can take one of two forms: open (contains a wildcard) or closed (no wildcard).

Internally, the schema may start as open (has a wildcard), but may transition to closed when processing a strict provided schema.

Once this class is complete, the scan can add columns only to an open schema. All such columns are inserted at the wildcard location. If the wildcard appears by itself, columns are appended. If the wildcard appears along with implicit columns, then the reader columns appear at the wildcard location, before the implicit columns.
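The insertion rule above can be sketched in simplified form. This is illustrative only, not Drill's actual API; the class and method names here are hypothetical, and columns are modeled as plain strings:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the wildcard insertion rule: reader columns are
// inserted at the wildcard location, ahead of any implicit columns that
// follow the wildcard in the project list.
public class WildcardInsertSketch {
  public static final String WILDCARD = "*";

  public static List<String> expandWildcard(List<String> projectList,
                                            List<String> readerCols) {
    List<String> result = new ArrayList<>();
    for (String col : projectList) {
      if (WILDCARD.equals(col)) {
        // Reader columns replace the wildcard in place.
        result.addAll(readerCols);
      } else {
        // Implicit (or other) columns keep their requested position.
        result.add(col);
      }
    }
    return result;
  }
}
```

With a bare wildcard, reader columns are simply appended; with `filename, *, dir0`, the reader columns land between `filename` and `dir0`.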

Once we have the initial reader input schema, we can then further refine the schema with:

Implicit (Wildcard) Projection

A query can contain a wildcard (*). In this case, the set of columns is driven by the reader. Each scan might drive one, two or many readers. In an ideal world, every reader would produce the same schema. In the real world, files tend to evolve: early files have three columns, later files have five. In this case some readers will produce one schema, other readers another. Much of the complexity of Drill comes from the simple fact that Drill is a SQL engine, which requires a single schema for all rows, yet Drill reads data sources which are free to return any schema they want.

A wildcard projection starts by accepting the schema produced by the first reader. In "classic" mode, later readers can add columns (causing a schema change to be sent downstream), but cannot change the types of existing columns. The code here supports a "no schema change" mode in which the first reader discovers the schema, which is then fixed for all subsequent readers. This mode cannot, however, prevent schema conflicts across scans running in different fragments.
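The "classic" merge rule can be sketched as follows. This is an illustrative stand-in for the package's schema-tracking logic, not Drill's API; types are modeled as plain strings:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the "classic" wildcard-mode merge rule: a later
// reader may add new columns (a schema change), but may not change the
// type of a column already in the scan schema.
public class WildcardMergeSketch {

  public static Map<String, String> merge(Map<String, String> scanSchema,
                                          Map<String, String> readerSchema) {
    Map<String, String> merged = new LinkedHashMap<>(scanSchema);
    for (Map.Entry<String, String> col : readerSchema.entrySet()) {
      String knownType = merged.get(col.getKey());
      if (knownType == null) {
        // New column: append it; downstream sees a schema change.
        merged.put(col.getKey(), col.getValue());
      } else if (!knownType.equals(col.getValue())) {
        // Existing column with a different type: a hard conflict.
        throw new IllegalStateException(
            "Type conflict for column " + col.getKey());
      }
    }
    return merged;
  }
}
```

In "no schema change" mode, the new-column branch would also raise an error once the first reader has fixed the schema.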

Explicit Projection

Explicit projection provides the list of columns, but not their types. Example: SELECT a, b, c.

The projection list holds the columns as requested by the user in the SELECT clause of the query, in the order in which columns appear in that clause, along with additional columns implied by other columns. The planner determines which columns to project. In Drill, projection is speculative: it is a list of names which the planner hopes will appear in the data files. The reader must make up columns (the infamous nullable INT) when it turns out that no such column exists. Otherwise, the reader must figure out the data type for any column that does exist.

An explicit projection starts with the requested set of columns, then looks in the table schema to find matches. Columns not in the project list are not projected (not written to vectors). The reader columns provide the types of the projected columns, "resolving" them to a concrete type.

An explicit projection may include columns that do not exist in the source schema. In this case, we fill in null columns for unmatched projections.
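Explicit-projection resolution can be sketched as follows. This is a simplified, hypothetical illustration (not Drill's column metadata classes), with types modeled as strings:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of explicit-projection resolution: each requested
// column takes its type from the table schema when present; otherwise it
// becomes a "missing" column with Drill's classic guess, nullable INT.
public class ExplicitProjectionSketch {

  public static Map<String, String> resolve(List<String> projectList,
                                            Map<String, String> tableSchema) {
    Map<String, String> output = new LinkedHashMap<>();
    for (String col : projectList) {
      // Columns in the project list but not in the table are "missing".
      output.put(col, tableSchema.getOrDefault(col, "NULLABLE INT"));
    }
    return output;
  }
}
```

Note that the output preserves the project-list order, not the table-schema order, and that table columns absent from the project list never appear at all.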

The challenge in this case is that Drill cannot know the type of missing columns; Drill can only guess. If a reader in Scan 1 guesses a type, but a reader in Scan 2 reads a column with a different type, then a schema conflict will occur downstream.

Maps

Maps introduce a large amount of additional complexity. A map can appear in the project list either as a whole map (m) or as a set of specific map members (m.a, m.b).

Schema Definition

This resolver is the first step in the scan schema process. The result is a (typically dynamic) defined schema. To understand this concept, it helps to compare Drill with other query engines. In most engines, the planner is responsible for working out the scan schema from table metadata, from the project list and so on. The scan is given a fully-defined schema which it must use.

Drill is unique in that it uses a dynamic schema with columns and/or types "to be named later." The scan must convert the dynamic schema into a concrete schema sent downstream. This class implements some of the steps in doing so.

The result of this class is a schema identical to a defined schema that a planner might produce. Since Drill is dynamic, the planner must be able to produce a dynamic schema of the form described above. If the planner has table metadata (here represented by a provided schema), then the planner could produce a concrete defined schema (all types are defined.) Or, with a lenient provided schema, the planner might produce a dynamic defined schema: one with some concrete columns, some dynamic (name-only) columns.

Implicit Columns

This class handles one additional source of schema information: implicit columns: those defined by Drill itself. Examples include filename, dir0, etc. Implicit columns are available (at present) only for the file storage plugin, but could be added for other storage plugins. The project list can contain the names of implicit columns. If the query contains a wildcard, then the project list may also contain implicit columns: filename, *, dir0.

Implicit columns are known to Drill, so Drill itself can provide type information for those columns via an external implicit column parser. That parser locates implicit columns by name, marks the columns as implicit, and takes care of populating the columns at read time. We use a column property, IMPLICIT_COL_TYPE, to mark a column as implicit. Later, the scan mechanism will omit such columns when preparing the reader schema.
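The marking-and-omission mechanism can be sketched as follows. Only the property name IMPLICIT_COL_TYPE comes from this package; the Column class and property value below are hypothetical stand-ins for Drill's column metadata:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: a column carrying the IMPLICIT_COL_TYPE property is
// treated as implicit and omitted from the schema given to the reader.
public class ImplicitColumnSketch {
  public static final String IMPLICIT_COL_TYPE = "drill.implicit-type";

  public static class Column {
    public final String name;
    public final Map<String, String> properties = new HashMap<>();
    public Column(String name) { this.name = name; }

    public boolean isImplicit() {
      return properties.containsKey(IMPLICIT_COL_TYPE);
    }
  }

  // The reader schema omits implicit columns; the implicit column
  // manager populates them instead.
  public static List<String> readerSchema(List<Column> scanSchema) {
    List<String> result = new ArrayList<>();
    for (Column col : scanSchema) {
      if (!col.isImplicit()) {
        result.add(col.name);
      }
    }
    return result;
  }
}
```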

If the planner were to provide a defined schema, then the planner would have parsed out the implicit columns, provided their types, and marked them as implicit. So, again, we see that this class produces, at scan time, the same defined schema that the planner might produce at plan time.

Because of the way we handle implicit columns, we can allow the provided schema to include them. The provided schema simply adds a column (with any name) and sets the IMPLICIT_COL_TYPE property to indicate which implicit column definition to use for that column. This is handy for, say, exposing partition directories as regular columns.

We now have a parsing flow for this package: parse the project list, let the implicit column parser locate and mark implicit columns, then resolve the remaining columns against the provided schema and, later, the reader schema.

Drill has long had a source of ambiguity: what happens if the reader has a column with the same name as an implicit column? In this flow, the ambiguity is resolved in favor of the implicit column: the column is marked as implicit and omitted from the reader schema, so the reader's version of that column is not projected.

Projection

In prior versions of the scan operator, projection tended to be quite simple: just check if a name appears in the project list. As we've seen from the above, projection is actually quite complex: it must reuse type information where available, handle open and closed top-level and map schemas, avoid projecting columns with the same name as implicit columns, and so on.

The ProjectionFilter classes handle projection. As it turns out, this class must follow (variations of) the same rules when merging the provided schema with the projection list and so on. To ensure a single implementation of the complex projection rules, this class uses a projection filter when resolving the provided schema. The devil is in the details, knowing when a map is open or closed, enforcing consistency with known information, etc.

Provided Schema

With the advent of provided schema in Drill 1.16, the query plan can provide not just column names (dynamic columns) but also the data type (concrete columns.) In this case, the scan schema can resolve projected columns against the provided schema, rather than waiting for the reader schema. Readers can use the provided schema to choose a column type when the choice is ambiguous, or multiple choices are possible.

If the projection list is a wildcard, then the wildcard expands to include all columns from the provided schema, in the order of that schema. If the schema is strict, then the scan schema becomes fixed, as if an explicit projection list were used.

If the projection list is explicit, then each column is resolved against the provided schema. If the projection list includes a column not in the provided schema, then it falls to the reader (or missing columns mechanism) to resolve that particular column.
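Resolving an explicit project list against a provided schema can be sketched as follows. This is a hypothetical illustration, not Drill's API; a null type stands in for a still-dynamic column:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: columns found in the provided schema become
// concrete; unmatched columns stay dynamic (modeled here as a null type)
// for the reader or missing-column mechanism to resolve later.
public class ProvidedSchemaSketch {

  public static Map<String, String> resolve(List<String> projectList,
                                            Map<String, String> providedSchema) {
    Map<String, String> scanSchema = new LinkedHashMap<>();
    for (String col : projectList) {
      scanSchema.put(col, providedSchema.get(col)); // null = still dynamic
    }
    return scanSchema;
  }
}
```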

Early Reader Schema

Some readers can declare their schema before reading data. For example, a JDBC query gets back a row schema during the initial prepare step. In this case, the reader is said to be early schema. The reader indicates an early schema via its schema negotiator. The framework then uses this schema to resolve the dynamic columns in the scan schema. If all columns are resolved this way, then the scan can declare its own schema before reading any data.

An early reader schema can work with a provided schema. In this case, the early reader schema must declare the same column type as the provided schema. This is not a large obstacle: the provided schema should have originally come from the reader (or a description of the reader) so conflicts should not occur in normal operation.

Reader Output Schema

Once a reader loads a batch of data, it provides (via the ResultSetLoader) the reader's output schema: the set of columns actually read by the reader.

If the projection list contained a wildcard, then the reader output schema will determine the set of columns that replaces the wildcard. (That is, all reader columns are projected and the scan schema expands to reflect the actual columns.)

If the projection list is explicit (or made so by a strict provided schema), then the reader output schema must be a subset of the scan schema: it is an error for the reader to include extra columns as the scan mechanism won't know what to do with those vectors. The projection mechanism (see below) integrates with the ResultSetLoader to project only those columns needed; the others are given to the reader as "dummy" column writers: writers that accept, but discard their data.
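The "dummy writer" idea can be sketched as follows. The interfaces here are simplified, hypothetical stand-ins for the ResultSetLoader's column writers:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: projected columns get a real writer backed by a
// vector; unprojected columns get a dummy writer that accepts values but
// writes nothing.
public class DummyWriterSketch {

  public interface ColumnWriter { void setInt(int value); }

  public static class VectorWriter implements ColumnWriter {
    public final List<Integer> vector = new ArrayList<>();
    public void setInt(int value) { vector.add(value); }
  }

  public static class DummyWriter implements ColumnWriter {
    public void setInt(int value) { } // accept, but discard
  }

  public static ColumnWriter writerFor(String col, Set<String> projected) {
    return projected.contains(col) ? new VectorWriter() : new DummyWriter();
  }
}
```

The reader writes to every column the same way; only the projection decision determines whether the data actually lands in a vector.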

Note the major difference between the early reader schema and the reader output schema. The early reader schema includes all the columns that the reader can read. The reader output schema includes only those columns that the reader actually read (as controlled by the projection filter.) For most readers (CSV, JSON, etc.), there is no early reader schema, there is only the reader output schema: the set of columns (modulo projection) that turned out to be in the data source.

Projection

To handle all of these cases, projection is driven by the (evolving) scan schema. In fact, the schema mechanism uses the same projection implementation when applying the provided schema and the early reader schema.

Assembling the Output Schema and Batch

The scan output schema consists of up to three parts: the reader schema, missing columns, and implicit columns. Distinct mechanisms build each kind of schema. The reader builds the vectors for the reader schema. A missing-column handler builds the missing columns (using provided or inferred types and values.) An implicit column manager fills in the implicit columns based on file information.

The scan schema tracker tracks all three schemas together to form the scan output schema. Tracking the combined schema ensures we preserve the user's requested project ordering. The reader manager builds the vectors using the above mechanisms, then merges the vectors (very easy to do in a columnar system) to produce the output batch which matches the scan schema.
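The merge step can be sketched as follows. This is a hypothetical illustration of the ordering guarantee, not Drill's vector code; vectors are modeled as strings:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: each column's vector comes from whichever mechanism
// built it (reader, implicit column manager, or missing-column handler),
// merged in scan-schema order so the user's requested order is preserved.
public class OutputAssemblySketch {

  public static List<String> assemble(List<String> scanSchema,
                                      Map<String, String> readerVectors,
                                      Map<String, String> implicitVectors,
                                      Map<String, String> missingVectors) {
    List<String> batch = new ArrayList<>();
    for (String col : scanSchema) {
      if (readerVectors.containsKey(col)) {
        batch.add(readerVectors.get(col));
      } else if (implicitVectors.containsKey(col)) {
        batch.add(implicitVectors.get(col));
      } else {
        batch.add(missingVectors.get(col)); // null-filled missing column
      }
    }
    return batch;
  }
}
```

Because the scan schema drives the loop, a project list such as a, filename, b yields an output batch in exactly that order, regardless of which mechanism built each vector.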

Architecture Overview

                   Scan Plan
                       |
                       v
               +--------------+
               | Project List |
               |    Parser    |
               +--------------+
                       |
                       v
                +-------------+
                | Scan Schema |     +-------------------+
                |   Tracker   | --->| Projection Filter |
                +-------------+     +-------------------+
                       |                  |
                       v                  v
  +------+      +------------+     +------------+      +-----------+
  | File | ---> |   Reader   |---->| Result Set | ---> | Data File |
  | Data |      |            |     |   Loader   | <--- |  Reader   |
  +------+      +------------+     +------------+      +-----------+
                       |                  |
                       v                  |
                +------------+    Reader  |
                |   Reader   |    Schema  |
                | Lifecycle  | <----------+
                +------------+            |
                       |                  |
                       v                  |
                  +---------+    Loaded   |
                  | Output  |    Vectors  |
                  | Builder | <-----------+
                  +---------+
                       |
                       v
                 Output Batch
 
Omitted are the details of implicit and missing columns. The scan lifecycle (not shown) orchestrates the whole process.

The result is a scan schema which can start entirely dynamic (just a wildcard or a list of column names), and which is then resolved via a series of steps (some of which involve the real work of the scanner: reading data.) The bottom is the output: a fully-resolved scan schema which exactly describes an output data batch.


Copyright © The Apache Software Foundation. All rights reserved.