Skip navigation links

Package org.apache.drill.exec.physical.impl.scan.project

Provides run-time semantic analysis of the projection list for the scan operator.

See: Description

Package org.apache.drill.exec.physical.impl.scan.project Description

Provides run-time semantic analysis of the projection list for the scan operator. The project list can include table columns and a variety of special columns. Requested columns can exist in the table, or may be "missing" with null values applied. The code here prepares a run-time projection plan based on the actual table schema.

Overview

The projection framework look at schema as a set of transforms:

Projection turns out to be a very complex operation in a schema-on-read system such as Drill. Provided schema helps resolve ambiguities inherent in schema-on-read, but at the cost of some additional complexity.

Background

The Scan-level projection holds the list of columns (or the wildcard) as requested by the user in the query. The planner determines which columns to project. In Drill, projection is speculative: it is a list of names which the planner hopes will appear in the data files. The reader must make up columns (the infamous nullable INT) when it turns out that no such column exists. Else, the reader must figure out the data type for any columns that does exist.

With the advent of provided schema in Drill 1.16, the scan level projection integrates that schema information with the projection list provided in the physical operator. If a schema is provided, then each scan-level column tracks the schema information for that column.

The scan-level projection also implements the special rules for a "strict" provided schema: if the operator projection list contains a wildcard, a schema is provided, and the schema is strict, then the scan level projection expands the wildcard into the set of columns in the provided schema. Doing so ensures that the scan output contains exactly those columns from the schema, even if the columns must be null or at a default value. (The result set loader does additional filtering as well.)

The scan project list defines the set of columns which the scan operator is obliged to send downstream. Ideally, the scan operator sends exactly the same schema (the project list with types filled in) for all batches. Since batches may come from different files, the scan operator is obligated to unify the schemas from those files (or blocks.)

Reader (file)-level projection occurs for each reader. A single scan may use multiple readers to read data. From the reader's perspective, it offers the schema it discovers in the file. The reader itself is rather inflexible: it must deal with the data it finds, of the type found in the data source.

The reader thus tells the result set loader that it has such-and-so schema. It does that either at open time (so-called "early" schema, such as for CSV, JDBC or Parquet) or as it discovers the columns (so-called "late" schema as in JSON.) Again, in each case, the data source schema is what it is; it can't be changed due to the wishes of the scan-level projection.

Readers obtain column schema from the file or data source. For example, a Parquet reader can obtain schema information from the Parquet headers. A JDBC reader obtains schema information from the returned schema. As noted above, we use the term "early schema" when type information is available at open time, before reading the first row of data.

By contrast eaders such as JSON and CSV are "late schema": they don't know the data schema until they read the file. This is true "schema on read." Further, for JSON, the data may change from one batch to the next as the reader "discovers" fields that did not appear in earlier batches. This requires some amount of "schema smoothing": the ability to preserve a consistent output schema even as the input schema jiggles around some.

Drill supports many kinds of data sources via plugins. The DFS plugin works with files in a distributed store such as HDFS. Such file-based readers add implicit file or partition columns. Since these columns are generic to all format plugins, they are factored out into a file scan framework which inserts the "implicit" columns separate from the reader-provided columns.

Design

This leads to a multi-stage merge operation. The result set loader is presented with each column one-by-one (either at open time or during read.) When a column is presented, the projection framework makes a number of decisions:

The result is a refined schema: the scan level schema with more information filled in. For Parquet, all projection information can be filled in. For CSV or JSON, we can only add file metadata information, but not yet the actual data schema.

Batch-level schema: once a reader reads actual data, it now knows exactly what it read. This is the "schema on read model." Thus, after reading a batch, any remaining uncertainty about the projected schema is removed. The actual data defined data types and so on.

The goal of this mechanism is to handle the above use cases cleanly, in a common set of classes, and to avoid the need for each reader to figure out all these issues for themselves (as was the case with earlier versions of Drill.)

Because these issues are complex, the code itself is complex. To make the code easier to manage, each bit of functionality is encapsulated in a distinct class. Classes combine via composition to create a "framework" suitable for each kind of reader: whether it be early or late schema, file-based or something else, etc.

Nuances of Reader-Level Projection

We've said that the scan-level projection identifies what the query wants. We've said that the reader identifies what the external data actually is. We've mentioned how we bridge between the two. Here we explore this in more detail.

Run-time schema resolution occurs at various stages:

Reader-Level Projection Set

The Projection Set mechanism is designed to handle the increasing nuances of Drill run-time projection by providing a source of information about each column that the reader may discover:

Projection Via Rewrites

The core concept is one of successive refinement of the project list through a set of rewrites:

The following outlines the steps from scan plan to per-file data loading to producing the output batch. The center path is the projection metadata which turns into an actual output batch.
                   Scan Plan
                       |
                       v
               +--------------+
               | Project List |
               |    Parser    |
               +--------------+
                       |
                       v
                +------------+
                | Scan Level |     +----------------+
                | Projection | --->| Projection Set |
                +------------+     +----------------+
                       |                  |
                       v                  v
  +------+      +------------+     +------------+      +-----------+
  | File | ---> | File Level |     | Result Set | ---> | Data File |
  | Data |      | Projection |     |   Loader   | <--- |  Reader   |
  +------+      +------------+     +------------+      +-----------+
                       |                  |
                       v                  |
               +--------------+   Reader  |
               | Reader Level |   Schema  |
               |  Projection  | <---------+
               +--------------+           |
                       |                  |
                       v                  |
                  +--------+   Loaded     |
                  | Output |   Vectors    |
                  | Mapper | <------------+
                  +--------+
                       |
                       v
                 Output Batch
 

The left side can be thought of as the "what we want" description of the schema, with the right side being "what the reader actually discovered."

The output mapper includes mechanisms to populate implicit columns, create null columns, and to merge implicit, null and data columns, omitting unprojected data columns.

In all cases, projection must handle maps, which are a recursive structure much like a row. That is, Drill consists of nested tuples (the row and maps), each of which contains columns which can be maps. Thus, there is a set of alternating layers of tuples, columns, tuples, and so on until we get to leaf (non-map) columns. As a result, most of the above structures are in the form of tuple trees, requiring recursive algorithms to apply rules down through the nested layers of tuples.

The above mechanism is done at runtime, in each scan fragment. Since Drill is schema-on-read, and has no plan-time schema concept, run-time projection is required. On the other hand, if Drill were ever to support the "classic" plan-time schema resolution, then much of this work could be done at plan time rather than (redundantly) at runtime. The main change would be to do the work abstractly, working with column and row descriptions, rather than concretely with vectors as is done here. Then, that abstract description would feed directly into these mechanisms with the "final answer" about projection, batch layout, and so on. The parts of this mechanism that create and populate vectors would remain.

Skip navigation links

Copyright © 1970 The Apache Software Foundation. All rights reserved.