Core interface for a projected column.
Queries can contain a wildcard (*), table columns, or special system-defined columns (the file metadata columns AKA implicit columns, the `columns` column of CSV, etc.).
Reader-level projection is customizable.
Interface for add-on parsers; avoids the need to create a single, tightly-coupled parser for all types of columns.
Generic mechanism for retrieving vectors from a source tuple when projecting columns to the output tuple.
Represents a projected column that has not yet been bound to a table column, special column or a null column.
Represents an unresolved table column to be provided by the reader (or filled in with nulls.) May be associated with a provided schema column.
Populate metadata columns: either file metadata (AKA "implicit columns") or directory metadata (AKA "partition columns.") In both cases, the column type is nullable Varchar and the column value is predefined by the projection planner; this class simply copies that value into each row.
Perform a schema projection for the case of an explicit list of projected columns.
Do-nothing implementation of the metadata manager.
Manages null columns by creating a null column loader for each set of non-empty null columns.
Create and populate null columns for the case in which a SELECT statement refers to columns that do not exist in the actual table.
Computes the full output schema given a table (or batch) schema.
Orchestrates projection tasks for a single reader within the set that the scan operator manages.
A resolved column has a name, and a specification for how to project data from a source vector to a vector in the final output container.
Represents a column which is implicitly a map (because it has children in the project list), but which does not match any column in the table.
Projected column that serves as both a resolved column (provides projection mapping) and a null column spec (provides the information needed to create the required null vectors.)
Column that matches one provided by the table.
Drill rows are made up of a tree of tuples, with the row being the root tuple.
Represents a map implied by the project list, whether or not the map actually appears in the table schema.
Represents a map tuple (not the map column, rather the value of the map column.) When projecting, we create a new repeated map vector, but share the offsets vector from input to output.
Represents the top-level tuple which is projected to a vector container.
Parses and analyzes the projection list passed to the scanner.
Performs projection of a record reader, along with a set of static columns, to produce the final "public" result set (record batch) for the scan operator.
Implements a "schema smoothing" algorithm.
Resolve a table schema against the prior schema.
Base class for columns that take values based on the reader, not individual rows.
Perform a wildcard projection.
Perform a wildcard projection with an associated output schema.
Identifies the kind of projection done for this scan.
Exception thrown if the prior schema is not compatible with the new table schema.
Projection turns out to be a very complex operation in a schema-on-read system such as Drill. Provided schema helps resolve ambiguities inherent in schema-on-read, but at the cost of some additional complexity.
With the advent of provided schema in Drill 1.16, the scan level projection integrates that schema information with the projection list provided in the physical operator. If a schema is provided, then each scan-level column tracks the schema information for that column.
The scan-level projection also implements the special rules for a "strict" provided schema: if the operator projection list contains a wildcard, a schema is provided, and the schema is strict, then the scan level projection expands the wildcard into the set of columns in the provided schema. Doing so ensures that the scan output contains exactly those columns from the schema, even if the columns must be null or at a default value. (The result set loader does additional filtering as well.)
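The strict-schema wildcard rule above can be sketched in a few lines. This is an illustrative, language-agnostic sketch, not Drill's actual implementation; the function name and schema representation are hypothetical.

```python
# Sketch of the "strict provided schema" wildcard rule described above.
# Names and data shapes are illustrative, not Drill's actual classes.

def expand_projection(project_list, provided_schema=None, strict=False):
    """Expand a '*' entry against a strict provided schema.

    project_list:    column names from the physical operator, may contain '*'.
    provided_schema: ordered list of (name, type) pairs, or None.
    strict:          if True, '*' expands to exactly the provided columns.
    """
    if "*" in project_list and provided_schema and strict:
        # Wildcard + strict schema: output exactly the provided columns,
        # even if the reader must fill some with nulls or defaults.
        return [name for name, _ in provided_schema]
    return project_list

# Wildcard with a strict schema expands to the schema's columns.
schema = [("a", "INT"), ("b", "VARCHAR")]
print(expand_projection(["*"], schema, strict=True))   # ['a', 'b']
# Without a strict schema, the wildcard passes through for the
# reader to resolve against the discovered file schema.
print(expand_projection(["*"]))                        # ['*']
```

An explicit projection list is returned unchanged: the strict rule applies only to the wildcard.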
The scan project list defines the set of columns which the scan operator is obliged to send downstream. Ideally, the scan operator sends exactly the same schema (the project list with types filled in) for all batches. Since batches may come from different files, the scan operator is obligated to unify the schemas from those files (or blocks.)
Reader (file)-level projection occurs for each reader. A single scan may use multiple readers to read data. From the reader's perspective, it offers the schema it discovers in the file. The reader itself is rather inflexible: it must deal with the data it finds, of the type found in the data source.
The reader thus tells the result set loader that it has such-and-so schema. It does that either at open time (so-called "early" schema, such as for CSV, JDBC or Parquet) or as it discovers the columns (so-called "late" schema as in JSON.) Again, in each case, the data source schema is what it is; it can't be changed due to the wishes of the scan-level projection.
Readers obtain column schema from the file or data source. For example, a Parquet reader can obtain schema information from the Parquet headers. A JDBC reader obtains schema information from the returned schema. As noted above, we use the term "early schema" when type information is available at open time, before reading the first row of data.
By contrast, readers such as JSON and CSV are "late schema": they don't know the data schema until they read the file. This is true "schema on read." Further, for JSON, the data may change from one batch to the next as the reader "discovers" fields that did not appear in earlier batches. This requires some amount of "schema smoothing": the ability to preserve a consistent output schema even as the input schema jiggles around some.
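The smoothing idea can be sketched as follows. This is a simplified illustration of the rule, not Drill's SchemaSmoother, which applies more detailed per-column rules; the names here are hypothetical.

```python
# Simplified sketch of "schema smoothing": reuse the prior output schema
# when the newly discovered schema is compatible with it, so downstream
# operators see a stable schema across batches and files.

class IncompatibleSchemaException(Exception):
    """Raised when the discovered schema cannot reuse the prior schema."""

def smooth(prior, discovered):
    """prior, discovered: dicts of column name -> type string.

    Returns the schema to emit: the prior schema is reused when every
    discovered column matches it (prior columns missing from this file
    are filled with nulls). A type conflict or a brand-new column is a
    hard schema change and raises.
    """
    for name, dtype in discovered.items():
        if name not in prior:
            raise IncompatibleSchemaException("new column: " + name)
        if prior[name] != dtype:
            raise IncompatibleSchemaException("type change: " + name)
    return dict(prior)

prior = {"a": "INT", "b": "VARCHAR"}
# A file missing column b reuses the prior schema; b is filled with nulls.
print(smooth(prior, {"a": "INT"}))   # {'a': 'INT', 'b': 'VARCHAR'}
```

A type conflict (say, `a` arriving as VARCHAR) cannot be smoothed away and surfaces as a schema-change exception, mirroring the incompatible-schema exception listed among the classes above.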
Drill supports many kinds of data sources via plugins. The DFS plugin works with files in a distributed store such as HDFS. Such file-based readers add implicit file or partition columns. Since these columns are generic to all format plugins, they are factored out into a file scan framework which inserts the "implicit" columns separate from the reader-provided columns.
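The implicit-column values can be derived from the file path relative to the scan root. The column names below follow Drill's documented conventions (fqn, filepath, filename, suffix, dir0, dir1, ...), but the function itself is a simplified sketch, not Drill's implementation.

```python
# Sketch: derive implicit file and partition column values for one file.
# All values are strings (nullable Varchar in Drill); the projection
# planner computes them once per file and copies them into each row.
import posixpath

def implicit_columns(scan_root, file_path):
    """Return a dict of implicit-column name -> value for one file."""
    rel = posixpath.relpath(file_path, scan_root)
    parts = rel.split("/")
    name = parts[-1]
    cols = {
        "fqn": file_path,                       # fully qualified name
        "filepath": posixpath.dirname(file_path),
        "filename": name,
        "suffix": name.rsplit(".", 1)[-1] if "." in name else "",
    }
    # Directory levels below the scan root become partition columns.
    for i, d in enumerate(parts[:-1]):
        cols["dir%d" % i] = d
    return cols

cols = implicit_columns("/data/sales", "/data/sales/2023/03/orders.csv")
print(cols["filename"], cols["dir0"], cols["dir1"])   # orders.csv 2023 03
```

Because these values are constant per file, the reader never sees them; the file scan framework inserts them alongside whatever columns the reader produces.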
The result is a refined schema: the scan level schema with more information filled in. For Parquet, all projection information can be filled in. For CSV or JSON, we can only add file metadata information, but not yet the actual data schema.
Batch-level schema: once a reader reads actual data, it knows exactly what it read. This is the "schema on read" model. Thus, after reading a batch, any remaining uncertainty about the projected schema is removed: the actual data defines the data types and so on.
The goal of this mechanism is to handle the above use cases cleanly, in a common set of classes, and to avoid the need for each reader to figure out all these issues for itself (as was the case with earlier versions of Drill.)
Because these issues are complex, the code itself is complex. To make the code easier to manage, each bit of functionality is encapsulated in a distinct class. Classes combine via composition to create a "framework" suitable for each kind of reader: whether it be early or late schema, file-based or something else, etc.
Run-time schema resolution occurs at various stages:
Scan level: rewrites the projection list (a set of SchemaPath entries) into internal column nodes, tagging the nodes with the column type: wildcard, unresolved table column, or special columns (such as file metadata.) The scan-level rewrite is done once per scan operator.
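The scan-level rewrite can be sketched as a simple tagging pass. This is illustrative only: Drill's parser works on SchemaPath objects and supports add-on parsers for special columns, and the set of special names below is hypothetical.

```python
# Sketch of the scan-level rewrite: tag each projection-list entry with
# its column type. Real special-column detection is pluggable in Drill;
# the fixed name set here is a stand-in for illustration.

SPECIAL_COLS = {"fqn", "filepath", "filename", "suffix"}

def parse_projection(project_list):
    """Tag each entry as WILDCARD, SPECIAL, or UNRESOLVED (table column)."""
    nodes = []
    for name in project_list:
        if name == "*":
            kind = "WILDCARD"
        elif name.lower() in SPECIAL_COLS or name.lower().startswith("dir"):
            # File metadata or partition column, filled in by the scan
            # framework rather than the reader.
            kind = "SPECIAL"
        else:
            # Not yet known: resolved later against the reader's schema,
            # or filled with nulls if the table has no such column.
            kind = "UNRESOLVED"
        nodes.append((name, kind))
    return nodes

print(parse_projection(["a", "filename", "*", "dir0"]))
```

The UNRESOLVED nodes are exactly the columns that the reader-level and batch-level stages must later bind to table columns or null columns.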
                 Scan Plan
                     |
                     v
              +--------------+
              | Project List |
              |    Parser    |
              +--------------+
                     |
                     v
              +------------+
              | Scan Level |       +----------------+
              | Projection | --->  | Projection Set |
              +------------+       +----------------+
                     |                     |
                     v                     v
+------+      +------------+       +------------+      +-----------+
| File | ---> | File Level |       | Result Set | ---> | Data File |
| Data |      | Projection |       |   Loader   | <--- |  Reader   |
+------+      +------------+       +------------+      +-----------+
                     |                     |
                     v                     |
              +--------------+    Reader   |
              | Reader Level |    Schema   |
              |  Projection  | <-----------+
              +--------------+             |
                     |                     |
                     v                     |
                 +--------+     Loaded     |
                 | Output |     Vectors    |
                 | Mapper | <--------------+
                 +--------+
                     |
                     v
                Output Batch
The left side can be thought of as the "what we want" description of the schema, with the right side being "what the reader actually discovered."
The output mapper includes mechanisms to populate implicit columns, create null columns, and to merge implicit, null and data columns, omitting unprojected data columns.
In all cases, projection must handle maps, which are a recursive structure much like a row. That is, Drill consists of nested tuples (the row and maps), each of which contains columns which can be maps. Thus, there is a set of alternating layers of tuples, columns, tuples, and so on until we get to leaf (non-map) columns. As a result, most of the above structures are in the form of tuple trees, requiring recursive algorithms to apply rules down through the nested layers of tuples.
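The recursive resolution over nested tuples can be sketched as follows. This is an abstract illustration of the tuple-tree recursion, with hypothetical names; the real classes resolve to vectors, not type strings.

```python
# Sketch: resolve a projected tuple tree against a table schema.
# A row is a tuple; a map column's value is itself a tuple, so
# resolution recurses through alternating layers of tuples and columns.

def resolve_tuple(projected, table_schema):
    """projected: dict of name -> True (leaf) or nested dict (map members).
    table_schema: dict of name -> type string, or nested dict for maps.
    Returns dict of name -> resolved type, using 'NULL' for columns the
    table does not provide (these become null columns in the output).
    """
    out = {}
    for name, members in projected.items():
        actual = table_schema.get(name)
        if members is True:                    # leaf (non-map) column
            out[name] = actual if actual is not None else "NULL"
        else:                                  # projected map: recurse
            out[name] = resolve_tuple(members, actual or {})
    return out

projected = {"a": True, "m": {"x": True, "y": True}}
table = {"a": "INT", "m": {"x": "VARCHAR"}}
print(resolve_tuple(projected, table))
# {'a': 'INT', 'm': {'x': 'VARCHAR', 'y': 'NULL'}}
```

Note how `m.y`, projected but absent from the table, resolves to a null column inside the map, and a map projected against a table with no such column would resolve every member to null, matching the implied-map case in the class list above.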
The above mechanism is done at runtime, in each scan fragment. Since Drill is schema-on-read, and has no plan-time schema concept, run-time projection is required. On the other hand, if Drill were ever to support the "classic" plan-time schema resolution, then much of this work could be done at plan time rather than (redundantly) at runtime. The main change would be to do the work abstractly, working with column and row descriptions, rather than concretely with vectors as is done here. Then, that abstract description would feed directly into these mechanisms with the "final answer" about projection, batch layout, and so on. The parts of this mechanism that create and populate vectors would remain.
Copyright © The Apache Software Foundation. All rights reserved.