Package org.apache.drill.exec.physical.impl.scan.v3.schema


Provides run-time semantic analysis of the projection list for the scan operator. The project list can include table columns and a variety of special columns. Requested columns can exist in the table, or may be "missing" with null values applied. The code here prepares a run-time projection plan based on the actual table schema.

Resolves a scan schema throughout the scan lifecycle. Schema resolution comes from a variety of sources. Resolution starts with preparing the schema for the first reader:

  • Project list (wildcard, empty, or explicit)
  • Optional provided schema (strict or lenient)
  • Implicit columns
  • An "early" reader schema (one determined before reading any data).
The result is a defined schema which may include:
  • Dynamic columns: those from the project list where we know only the column name, but not its type.
  • Resolved columns: implicit or provided columns where we know the name and type.
The schema itself can be one of two forms:
  • Open: meaning that the reader can add other columns. An open schema results from a wildcard projection. Since the wildcard can appear along with implicit columns, the schema can be open and have a set of columns. If a provided schema appears, then the provided schema is expanded here. If the schema is "lenient", then the reader can add additional columns as it discovers them.
  • Closed: meaning that the reader cannot add additional columns. A closed schema results from an empty or explicit projection list. A closed schema also results from a wildcard projection and a strict schema.

Internally, the schema may start as open (has a wildcard), but may transition to closed when processing a strict provided schema.
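
The open/closed rules above can be summarized in a few lines. This is a minimal sketch only: the class, enum, and method names are hypothetical and are not Drill's actual classes.

```java
// Illustrative only: these names are hypothetical, not Drill's actual classes.
class SchemaOpennessSketch {
  enum ProjectionType { WILDCARD, EMPTY, EXPLICIT }
  enum ProvidedSchemaMode { NONE, LENIENT, STRICT }

  /** Returns true if the reader may add columns to the scan schema. */
  static boolean isOpen(ProjectionType projection, ProvidedSchemaMode provided) {
    // An empty or explicit projection list always yields a closed schema.
    if (projection != ProjectionType.WILDCARD) {
      return false;
    }
    // A wildcard starts open, but a strict provided schema closes it.
    return provided != ProvidedSchemaMode.STRICT;
  }

  public static void main(String[] args) {
    System.out.println(isOpen(ProjectionType.WILDCARD, ProvidedSchemaMode.NONE));    // true
    System.out.println(isOpen(ProjectionType.WILDCARD, ProvidedSchemaMode.LENIENT)); // true
    System.out.println(isOpen(ProjectionType.WILDCARD, ProvidedSchemaMode.STRICT));  // false
    System.out.println(isOpen(ProjectionType.EXPLICIT, ProvidedSchemaMode.LENIENT)); // false
  }
}
```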

Once this class is complete, the scan can add columns only to an open schema. All such columns are inserted at the wildcard location. If the wildcard appears by itself, columns are appended. If the wildcard appears along with implicit columns, then the reader columns appear at the wildcard location, before the implicit columns.
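
As a rough illustration of inserting reader columns at the wildcard location, the sketch below expands a project list held as plain strings. The class and method names are hypothetical, and the real mechanism works on schema objects rather than name lists.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: hypothetical names, with columns reduced to plain strings.
class WildcardExpansionSketch {
  static final String WILDCARD = "*";

  /** Replaces the wildcard entry with the reader's columns, keeping other entries in place. */
  static List<String> expand(List<String> projectList, List<String> readerColumns) {
    List<String> result = new ArrayList<>();
    for (String column : projectList) {
      if (WILDCARD.equals(column)) {
        result.addAll(readerColumns);   // reader columns land at the wildcard position
      } else {
        result.add(column);             // e.g. implicit columns keep their position
      }
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(expand(Arrays.asList("filename", "*", "dir0"), Arrays.asList("a", "b")));
    // prints [filename, a, b, dir0]
  }
}
```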

Once we have the initial reader input schema, we can then further refine the schema with:

  • The reader "output" schema: the columns actually read by the reader.
  • The set of "missing" columns: those projected, but which the reader did not provide. We must make up a type for missing columns (and hope we guess correctly.) In fact, the purpose of the provided (and possibly early reader) schema is to avoid the need to guess.

Implicit (Wildcard) Projection

A query can contain a wildcard (*). In this case, the set of columns is driven by the reader. Each scan might drive one, two or many readers. In an ideal world, every reader would produce the same schema. In the real world, files tend to evolve: early files have three columns, later files have five. In this case some readers will produce one schema, other readers another. Much of the complexity of Drill comes from the simple fact that Drill is a SQL engine that requires a single schema for all rows, but Drill reads data sources which are free to return any schema that they want.

A wildcard projection starts by accepting the schema produced by the first reader. In "classic" mode, later readers can add columns (causing a schema change to be sent downstream), but cannot change the types of existing columns. The code here supports a "no schema change" mode in which the first reader discovers the schema, which is then fixed for all subsequent readers. This mode cannot, however, prevent schema conflicts across scans running in different fragments.

Explicit Projection

Explicit projection provides the list of columns, but not their types. Example: SELECT a, b, c.

The projection list holds the columns as requested by the user in the SELECT clause of the query, in the order in which columns appear in that clause, along with additional columns implied by other columns. The planner determines which columns to project. In Drill, projection is speculative: it is a list of names which the planner hopes will appear in the data files. The reader must make up columns (the infamous nullable INT) when it turns out that no such column exists. Otherwise, the reader must figure out the data type for any column that does exist.

An explicit projection starts with the requested set of columns, then looks in the table schema to find matches. Columns not in the project list are not projected (not written to vectors). The reader columns provide the types of the projected columns, "resolving" them to a concrete type.

An explicit projection may include columns that do not exist in the source schema. In this case, we fill in null columns for unmatched projections.

The challenge in this case is that Drill cannot know the type of missing columns; Drill can only guess. If a reader in Scan 1 guesses a type, but a reader in Scan 2 reads a column with a different type, then a schema conflict will occur downstream.
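
The explicit-projection behavior described above might be sketched as follows. The names are hypothetical, types are reduced to strings, and column names are matched exactly for simplicity; this is not how the package itself represents schemas.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: hypothetical names; types are plain strings and names match exactly.
class ExplicitProjectionSketch {
  // The guessed type for a projected column that the reader does not provide.
  static final String MISSING_TYPE = "NULLABLE INT";

  /** Resolves each projected name against the reader schema, in projection order. */
  static Map<String, String> resolve(List<String> projectList, Map<String, String> readerSchema) {
    Map<String, String> resolved = new LinkedHashMap<>();
    for (String name : projectList) {
      String type = readerSchema.get(name);
      // Unmatched projections become "missing" columns with a guessed type.
      resolved.put(name, type != null ? type : MISSING_TYPE);
    }
    // Reader columns not in the project list are never added: they are unprojected.
    return resolved;
  }

  public static void main(String[] args) {
    Map<String, String> reader = new LinkedHashMap<>();
    reader.put("a", "VARCHAR");
    reader.put("b", "BIGINT");
    reader.put("d", "FLOAT8");
    System.out.println(resolve(Arrays.asList("a", "b", "c"), reader));
    // prints {a=VARCHAR, b=BIGINT, c=NULLABLE INT}
  }
}
```

If another scan later reads a real column c of type VARCHAR, the guess made here conflicts with it, which is the downstream schema conflict the text warns about.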

Maps

Maps introduce a large amount of additional complexity. First, maps appear in the project list in one of the following forms:
  • A generic projection: just the name m, where m is a map. In this case, we project all members of the map. That is, the map itself is open in the above sense. Note that a map can be open even if the scan schema itself is closed. That is, if the projection list contains only m, the scan schema is closed, but the map is open (the reader will discover the fields that make up the map.)
  • A specific projection: a list of map members: m.x, m.y. In this case, we know that the downstream Project operator will pull just those two members to the top level and discard the rest of the map. We can thus project just those two members in the scan. As a result, the map is closed in the above sense: any additional map members discovered by the reader will be unprojected.
  • Hybrid: a projection list that includes both: m, m.x. Here, the generic projection takes precedence. If the specific projection includes qualifiers, m, m.x[1], then that information is used to check the type of column x.
  • Implied: in a wildcard projection, a column may turn out to be a map. In this case, the map is open when the schema itself is open. (Remember that a wildcard projection can result in a closed schema if paired with a strict provided schema.)
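
To make the generic, specific, and hybrid cases concrete, here is a minimal sketch that classifies a single map from a project list of dotted names. All names are hypothetical, and array qualifiers such as m.x[1] and nested maps are deliberately ignored.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: hypothetical names; array qualifiers and nested maps are ignored.
class MapProjectionSketch {
  /** How one map column is projected. */
  static final class MapProjection {
    boolean open;                                        // true if the map is named on its own
    final Set<String> members = new LinkedHashSet<>();   // explicitly named members
  }

  static MapProjection analyze(String mapName, List<String> projectList) {
    MapProjection result = new MapProjection();
    String prefix = mapName + ".";
    for (String item : projectList) {
      if (item.equals(mapName)) {
        result.open = true;                              // generic projection: all members
      } else if (item.startsWith(prefix)) {
        result.members.add(item.substring(prefix.length()));  // specific projection
      }
    }
    return result;
  }

  /** In the hybrid case the generic projection wins, so any member is projected. */
  static boolean isMemberProjected(MapProjection projection, String member) {
    return projection.open || projection.members.contains(member);
  }

  public static void main(String[] args) {
    MapProjection specific = analyze("m", Arrays.asList("a", "m.x", "m.y"));
    System.out.println(specific.open + " " + specific.members);   // prints false [x, y]
    MapProjection hybrid = analyze("m", Arrays.asList("m", "m.x"));
    System.out.println(isMemberProjected(hybrid, "z"));           // prints true
  }
}
```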

Schema Definition

This resolver is the first step in the scan schema process. The result is a (typically dynamic) defined schema. To understand this concept, it helps to compare Drill with other query engines. In most engines, the planner is responsible for working out the scan schema from table metadata, from the project list and so on. The scan is given a fully-defined schema which it must use.

Drill is unique in that it uses a dynamic schema with columns and/or types "to be named later." The scan must convert the dynamic schema into a concrete schema sent downstream. This class implements some of the steps in doing so.

The result of this class is a schema identical to a defined schema that a planner might produce. Since Drill is dynamic, the planner must be able to produce a dynamic schema of the form described above. If the planner has table metadata (here represented by a provided schema), then the planner could produce a concrete defined schema (all types are defined.) Or, with a lenient provided schema, the planner might produce a dynamic defined schema: one with some concrete columns, some dynamic (name-only) columns.

Implicit Columns

This class handles one additional source of schema information: implicit columns, which are those defined by Drill itself. Examples include filename, dir0, etc. Implicit columns are available (at present) only for the file storage plugin, but could be added for other storage plugins. The project list can contain the names of implicit columns. If the query contains a wildcard, then the project list may also contain implicit columns: filename, *, dir0.

Implicit columns are known to Drill, so Drill itself can provide type information for those columns, by an external implicit column parser. That parser locates implicit columns by name, marks the columns as implicit, and takes care of populating the columns at read time. We use a column property, IMPLICIT_COL_TYPE, to mark a column as implicit. Later the scan mechanism will omit such columns when preparing the reader schema.

If the planner were to provide a defined schema, then the planner would have parsed out the implicit columns, provided their types, and marked them as implicit. So, again, we see that this class produces, at scan time, the same defined schema that the planner might produce at plan time.

Because of the way we handle implicit columns, we can allow the provided schema to include them. The provided schema simply adds a column (with any name), and sets the IMPLICIT_COL_TYPE property to indicate which implicit column definition to use for that column. This is handy for allowing the implicit column to include partition directories as regular columns.

We now have a parsing flow for this package:

  • Projection list (so we know what to include)
  • Provided schema (to add/mark columns as implicit)
  • Implicit columns, which looks only for a) columns tagged as implicit or b) dynamic columns (those not defined in the provided schema).

Drill has long had a source of ambiguity: what happens if the reader has a column with the same name as an implicit column. In this flow, the ambiguity is resolved as follows:

  • If a provided schema has a column explicitly tagged as an implicit column, then that column is unambiguously an implicit column independent of name.
  • If a provided schema has a column with the same name as an implicit column (the names can be changed by a system/session option), then the fact that the column is not marked as implicit unambiguously tells us that the column is not implicit, despite the name.
  • If a column appears in the project list, but not in the provided schema, and that column matches the (effective) name of some implicit column, then the column is marked as implicit and is not passed to the reader. Further, the projection filter will mark that column as unprojected in the reader, even if the reader otherwise has a wildcard schema.
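
The three disambiguation rules above reduce to a short decision. The sketch below is an assumption-laden simplification: the provided schema is modeled as a map from column name to a flag saying whether the column carries the implicit tag (standing in for the IMPLICIT_COL_TYPE property), and the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative only: hypothetical names; the Boolean flag stands in for the implicit tag.
class ImplicitColumnSketch {
  /**
   * @param providedTags  provided-schema columns, mapped to "is tagged as implicit"
   * @param implicitNames effective names of the implicit columns
   */
  static boolean isImplicit(String name, Map<String, Boolean> providedTags,
      Set<String> implicitNames) {
    Boolean tagged = providedTags.get(name);
    if (tagged != null) {
      // Rules 1 and 2: the provided schema decides, independent of the column name.
      return tagged;
    }
    // Rule 3: not in the provided schema, so match against the implicit column names.
    return implicitNames.contains(name);
  }

  public static void main(String[] args) {
    Map<String, Boolean> provided = new HashMap<>();
    provided.put("src", true);        // tagged as implicit under a custom name
    provided.put("filename", false);  // ordinary column that happens to share a name
    Set<String> implicitNames = new HashSet<>(Arrays.asList("filename", "dir0"));

    System.out.println(isImplicit("src", provided, implicitNames));       // true
    System.out.println(isImplicit("filename", provided, implicitNames));  // false
    System.out.println(isImplicit("dir0", provided, implicitNames));      // true
  }
}
```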

Projection

In prior versions of the scan operator, projection tended to be quite simple: just check if a name appears in the project list. As we've seen from the above, projection is actually quite complex with the need to reuse type information where available, open and closed top-level and map schemas, the need to avoid projecting columns with the same name as implicit columns, etc.

The ProjectionFilter classes handle projection. As it turns out, this class must follow (variations of) the same rules when merging the provided schema with the projection list and so on. To ensure a single implementation of the complex projection rules, this class uses a projection filter when resolving the provided schema. The devil is in the details, knowing when a map is open or closed, enforcing consistency with known information, etc.

Provided Schema

With the advent of provided schema in Drill 1.16, the query plan can provide not just column names (dynamic columns) but also the data type (concrete columns.) In this case, the scan schema can resolve projected columns against the provided schema, rather than waiting for the reader schema. Readers can use the provided schema to choose a column type when the choice is ambiguous, or multiple choices are possible.

If the projection list is a wildcard, then the wildcard expands to include all columns from the provided schema, in the order of that schema. If the schema is strict, then the scan schema becomes fixed, as if an explicit projection list were used.

If the projection list is explicit, then each column is resolved against the provided schema. If the projection list includes a column not in the provided schema, then it falls to the reader (or missing columns mechanism) to resolve that particular column.

Early Reader Schema

Some readers can declare their schema before reading data. For example, a JDBC query gets back a row schema during the initial prepare step. In this case, the reader is said to be early schema. The reader indicates an early schema via its schema negotiator. The framework then uses this schema to resolve the dynamic columns in the scan schema. If all columns are resolved this way, then the scan can declare its own schema before reading any data.

An early reader schema can work with a provided schema. In this case, the early reader schema must declare the same column type as the provided schema. This is not a large obstacle: the provided schema should have originally come from the reader (or a description of the reader) so conflicts should not occur in normal operation.
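
The consistency requirement between the two schemas could be checked along these lines. This is a sketch under simplifying assumptions: schemas are name-to-type maps of strings, and the class and method names are hypothetical rather than part of the schema negotiator API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: hypothetical names; schemas are reduced to name-to-type maps.
class EarlySchemaCheckSketch {
  /** Returns the names of columns whose early reader type differs from the provided type. */
  static List<String> findConflicts(Map<String, String> provided, Map<String, String> early) {
    List<String> conflicts = new ArrayList<>();
    for (Map.Entry<String, String> column : early.entrySet()) {
      String providedType = provided.get(column.getKey());
      // Columns absent from the provided schema cannot conflict with it.
      if (providedType != null && !providedType.equals(column.getValue())) {
        conflicts.add(column.getKey());
      }
    }
    return conflicts;
  }

  public static void main(String[] args) {
    Map<String, String> provided = new HashMap<>();
    provided.put("id", "BIGINT");
    provided.put("name", "VARCHAR");
    Map<String, String> early = new HashMap<>();
    early.put("id", "INT");          // disagrees with the provided schema
    early.put("name", "VARCHAR");
    early.put("extra", "FLOAT8");    // not in the provided schema
    System.out.println(findConflicts(provided, early));   // prints [id]
  }
}
```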

Reader Output Schema

Once a reader loads a batch of data, it provides (via the ResultSetLoader) the reader's output schema: the set of columns actually read by the reader.

If the projection list contained a wildcard, then the reader output schema will determine the set of columns that replaces the wildcard. (That is, all reader columns are projected and the scan schema expands to reflect the actual columns.)

If the projection list is explicit (or made so by a strict provided schema), then the reader output schema must be a subset of the scan schema: it is an error for the reader to include extra columns as the scan mechanism won't know what to do with those vectors. The projection mechanism (see below) integrates with the ResultSetLoader to project only those columns needed; the others are given to the reader as "dummy" column writers: writers that accept, but discard their data.
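
The "dummy" writer idea can be illustrated with a small sketch. The interface and classes below are hypothetical stand-ins, not the ResultSetLoader's actual writer types; the point is only that the reader writes to every column the same way and unprojected data is silently dropped.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: hypothetical stand-ins for column writers, not Drill's actual types.
class DummyWriterSketch {
  interface ColumnWriter {
    void write(Object value);
  }

  /** Writer for a projected column: keeps the values (standing in for a vector). */
  static final class VectorWriter implements ColumnWriter {
    final List<Object> values = new ArrayList<>();
    @Override public void write(Object value) { values.add(value); }
  }

  /** Writer for an unprojected column: accepts values and discards them. */
  static final class DummyWriter implements ColumnWriter {
    @Override public void write(Object value) { /* intentionally discarded */ }
  }

  /** Hands the reader a real or dummy writer depending on projection. */
  static ColumnWriter writerFor(String column, Set<String> projected) {
    return projected.contains(column) ? new VectorWriter() : new DummyWriter();
  }

  public static void main(String[] args) {
    Set<String> projected = new HashSet<>(Arrays.asList("a", "b"));
    ColumnWriter a = writerFor("a", projected);
    ColumnWriter z = writerFor("z", projected);
    a.write(10);
    z.write(20);   // accepted, but nothing is stored
    System.out.println(((VectorWriter) a).values);   // prints [10]
    System.out.println(z instanceof DummyWriter);     // prints true
  }
}
```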

Note the major difference between the early reader schema and the reader output schema. The early reader schema includes all the columns that the reader can read. The reader output schema includes only those columns that the reader actually read (as controlled by the projection filter.) For most readers (CSV, JSON, etc.), there is no early reader schema; there is only the reader output schema: the set of columns (modulo projection) that turned out to be in the data source.

Projection

The projection list tells the reader which columns to read. In this mechanism, the projection list undergoes multiple transforms (expanding into a provided schema, identifying implicit columns, etc.) Further, as columns are resolved (via a provided schema, an earlier reader, etc.), the projection list can provide type information as well.

To handle this, projection is driven by the (evolving) scan schema. In fact, the schema mechanism uses the same projection implementation when applying the provided schema and early reader schema.

Assembling the Output Schema and Batch

The scan output schema consists of up to three parts:
  • Reader columns (the reader output schema)
  • Missing columns (reader input columns which the reader does not actually provide.)
  • Implicit columns.
Distinct mechanisms build each kind of schema. The reader builds the vectors for the reader schema. A missing column handler builds the missing columns (using provided or inferred types and values.) An implicit column manager fills in the implicit columns based on file information.

The scan schema tracker tracks all three schemas together to form the scan output schema. Tracking the combined schema ensures we preserve the user's requested project ordering. The reader manager builds the vectors using the above mechanisms, then merges the vectors (very easy to do in a columnar system) to produce the output batch which matches the scan schema.
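
The merge step can be pictured as a lookup per output column, in scan schema order. In this sketch each "vector" is just a string label and the three sources are maps; the class and method names are hypothetical and the real output builder works with value vectors.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: hypothetical names; each "vector" is represented by a string label.
class OutputAssemblySketch {
  /** Picks each output column from whichever mechanism built it, in scan schema order. */
  static List<String> assemble(List<String> scanSchema,
      Map<String, String> readerVectors,
      Map<String, String> missingVectors,
      Map<String, String> implicitVectors) {
    List<String> batch = new ArrayList<>();
    for (String name : scanSchema) {
      String vector = readerVectors.get(name);
      if (vector == null) {
        vector = missingVectors.get(name);
      }
      if (vector == null) {
        vector = implicitVectors.get(name);
      }
      if (vector == null) {
        throw new IllegalStateException("No mechanism built column: " + name);
      }
      batch.add(vector);
    }
    return batch;
  }

  public static void main(String[] args) {
    Map<String, String> reader = new HashMap<>();
    reader.put("a", "reader:a");
    Map<String, String> missing = new HashMap<>();
    missing.put("c", "null:c");
    Map<String, String> implicit = new HashMap<>();
    implicit.put("filename", "implicit:filename");
    System.out.println(assemble(Arrays.asList("filename", "a", "c"), reader, missing, implicit));
    // prints [implicit:filename, reader:a, null:c]
  }
}
```

Because the loop follows the scan schema rather than any one source, the user's requested column order is preserved no matter which mechanism produced each column.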

Architecture Overview

                   Scan Plan
                       |
                       v
               +--------------+
               | Project List |
               |    Parser    |
               +--------------+
                       |
                       v
                +-------------+
                | Scan Schema |     +-------------------+
                |   Tracker   | --->| Projection Filter |
                +-------------+     +-------------------+
                       |                  |
                       v                  v
  +------+      +------------+     +------------+      +-----------+
  | File | ---> |   Reader   |---->| Result Set | ---> | Data File |
  | Data |      |            |     |   Loader   | <--- |  Reader   |
  +------+      +------------+     +------------+      +-----------+
                       |                  |
                       v                  |
                +------------+    Reader  |
                |   Reader   |    Schema  |
                | Lifecycle  | <----------+
                +------------+            |
                       |                  |
                       v                  |
                  +---------+    Loaded   |
                  | Output  |    Vectors  |
                  | Builder | <-----------+
                  +---------+
                       |
                       v
                 Output Batch
 
Omitted are the details of implicit and missing columns. The scan lifecycle (not shown) orchestrates the whole process.

The result is a scan schema which can start entirely dynamic (just a wildcard or list of column names), which is then resolved via a series of steps (some of which involve the real work of the scanner: reading data.) The bottom is the output: a fully resolved scan schema which exactly describes an output data batch.