Class ScanLevelProjection

java.lang.Object
org.apache.drill.exec.physical.impl.scan.project.ScanLevelProjection

public class ScanLevelProjection extends Object
Parses and analyzes the projection list passed to the scanner. The scanner accepts a projection list and a plugin-specific set of items to read. The scan operator produces a series of output batches which, in the best case, all have the same schema. Since Drill is "schema on read", in practice the batch schema may evolve. The framework tries to "smooth" such changes where possible. An output schema adds another level of stability by specifying the set of columns to project (for wildcard queries) and the types of those columns (for all queries).

The projection list is per scan, independent of any tables that the scanner might scan. The projection list is then used as input to the per-table projection planning.

Overview

In most query engines, this kind of projection analysis is done at plan time. But, since Drill is schema-on-read, we don't know the available columns, or their types, until we start scanning a table. The table may provide the schema up-front, or may discover it as the read proceeds. Hence, the job here is to make sense of the projection list based on static a-priori information, then to create a list that can be further resolved against a table schema when it appears. This gives us two steps:
  • Scan-level projection: this class, which handles the schema for the entire scan operator.
  • Table-level projection: defined elsewhere, which merges the table and scan-level projections.

Accepts the inputs needed to plan a projection, builds the mappings, and constructs the projection mapping object.

Builds the per-scan projection plan given a set of projected columns. Determines the output schema, which columns to project from the data source, which are metadata, and so on.

An annoying aspect of SQL is that the projection list (the list of columns to appear in the output) is specified after the SELECT keyword. In relational theory, projection is about columns, selection is about rows...

Projection Mappings

Mappings can be based on four primary use cases:

  • SELECT *: Project all data source columns, whatever they happen to be. Create columns using names from the data source. The data source also determines the order of columns within the row.
  • SELECT columns: Similar to SELECT * in that it projects all columns from the data source, in data source order. But, rather than creating an individual output column for each data source column, it creates a single column, an array of VARCHAR, which holds the text form of each data source column as an array element.
  • SELECT a, b, c, ...: Project a specific set of columns, identified by case-insensitive name. The output row uses the names from the SELECT list, but types from the data source. Columns appear in the row in the order specified by the SELECT.
  • SELECT ...: Selects nothing; occurs in SELECT COUNT(*) type queries. The provided projection list contains no (table) columns, though it may contain metadata columns.
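
The four use cases above can be told apart from the projection list alone. The following is a minimal sketch in plain Java, not Drill's implementation: the class ProjectionUseCaseSketch, its UseCase enum and the useCase method are hypothetical names, and the caller is assumed to pass the metadata column names in lower case. Mixing columns with ordinary table columns is not modeled here.

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical illustration only; these names are not part of Drill. */
class ProjectionUseCaseSketch {

  enum UseCase { WILDCARD, COLUMNS_ARRAY, EXPLICIT, EMPTY }

  /**
   * Classifies a SELECT list into one of the four use cases.
   * metadataNames holds lower-case names that are NOT table columns.
   */
  static UseCase useCase(List<String> selectList, Set<String> metadataNames) {
    boolean sawColumnsArray = false;
    boolean sawTableColumn = false;
    for (String name : selectList) {
      String key = name.toLowerCase(Locale.ROOT); // names are case-insensitive
      if (key.equals("*")) {
        return UseCase.WILDCARD;                  // SELECT *
      }
      if (metadataNames.contains(key)) {
        continue;                                 // metadata does not count as a table column
      }
      if (key.equals("columns")) {
        sawColumnsArray = true;                   // SELECT columns
      } else {
        sawTableColumn = true;                    // SELECT a, b, c
      }
    }
    if (sawTableColumn) {
      return UseCase.EXPLICIT;
    }
    // No table columns at all: either the columns array or an empty projection.
    return sawColumnsArray ? UseCase.COLUMNS_ARRAY : UseCase.EMPTY;
  }

  public static void main(String[] args) {
    Set<String> metadata = Set.of("filename", "dir0");
    System.out.println(useCase(List.of("*"), metadata));
    System.out.println(useCase(List.of("a", "filename"), metadata));
    System.out.println(useCase(List.of("dir0"), metadata));
  }
}
```

Note that a list holding only metadata columns is classified as EMPTY, matching the description of the last use case.
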
Names in the SELECT list can reference any of five distinct types of output columns:

  • Wildcard ("*") column: indicates the place in the projection list to insert the table columns once found in the table projection plan.
  • Data source columns: columns from the underlying table. The table projection planner will determine if the column exists, or must be filled in with a null column.
  • The generic data source columns array: columns, or optionally specific members of the columns array such as columns[1].
  • Implicit columns: fqn, filename, filepath and suffix. These reference parts of the name of the file being scanned.
  • Partition columns: dir0, dir1, ...: These reference parts of the path name of the file.
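
As an illustration of the five column types listed above, the sketch below sorts individual SELECT-list names into those categories. It is plain Java with hypothetical names (ColumnKindSketch, Kind, classify) and is not how Drill's parsers are written. It also assumes all five kinds are always recognized and that matching is case-insensitive.

```java
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

/** Hypothetical illustration only; these names are not part of Drill. */
class ColumnKindSketch {

  enum Kind { WILDCARD, COLUMNS_ARRAY, IMPLICIT, PARTITION, TABLE }

  // Matches "columns" or a single member such as "columns[1]".
  private static final Pattern COLUMNS = Pattern.compile("columns(\\[\\d+\\])?");
  // Matches dir0, dir1, ...
  private static final Pattern PARTITION = Pattern.compile("dir\\d+");
  // The implicit columns named in the text above.
  private static final List<String> IMPLICIT =
      List.of("fqn", "filename", "filepath", "suffix");

  /** Returns the kind of output column a SELECT-list name refers to. */
  static Kind classify(String name) {
    String key = name.trim().toLowerCase(Locale.ROOT);
    if (key.equals("*")) {
      return Kind.WILDCARD;
    }
    if (COLUMNS.matcher(key).matches()) {
      return Kind.COLUMNS_ARRAY;
    }
    if (IMPLICIT.contains(key)) {
      return Kind.IMPLICIT;
    }
    if (PARTITION.matcher(key).matches()) {
      return Kind.PARTITION;
    }
    // Anything else is assumed to be a data source column; whether it
    // exists is only known once a table schema is available.
    return Kind.TABLE;
  }

  public static void main(String[] args) {
    for (String name : List.of("*", "columns[1]", "filename", "dir0", "a")) {
      System.out.println(name + " -> " + classify(name));
    }
  }
}
```
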

Projection with a Schema

The client can provide an output schema that defines the types (and defaults) for the tuple produced by the scan. When a schema is provided, the above use cases are extended as follows:

  • SELECT * with strict schema: All columns in the output schema are projected, and only those columns. If a reader offers additional columns, those columns are ignored. If the reader omits output columns, the default value (if any) for the column is used.
  • SELECT * with a non-strict schema: the output tuple contains all columns from the output schema as explained above. In addition, if the reader provides any columns not in the output schema, those columns are appended to the end of the tuple. (That is, the output schema acts as if it were from an imaginary "0th" reader.)
  • Explicit projection: only the requested columns appear, whether from the output schema, the reader, or as nulls.
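
The difference between the strict and non-strict wildcard cases can be sketched as a simple merge of column names. This is a simplified model in plain Java, not Drill's code: SchemaProjectionSketch and wildcardColumns are hypothetical names, columns are represented by name only (types and default values are omitted), and names are compared case-insensitively.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical illustration only; these names are not part of Drill. */
class SchemaProjectionSketch {

  /**
   * Computes the output column names for SELECT * when an output schema
   * is provided. Schema columns always come first, in schema order.
   */
  static List<String> wildcardColumns(
      List<String> schemaCols, List<String> readerCols, boolean strict) {
    List<String> out = new ArrayList<>(schemaCols);
    if (strict) {
      // Strict: only the schema columns; extra reader columns are ignored.
      return out;
    }
    Set<String> seen = new HashSet<>();
    for (String col : schemaCols) {
      seen.add(col.toLowerCase(Locale.ROOT));
    }
    // Non-strict: append reader columns the schema does not already define.
    for (String col : readerCols) {
      if (seen.add(col.toLowerCase(Locale.ROOT))) {
        out.add(col);
      }
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> schema = List.of("a", "b");
    List<String> reader = List.of("B", "c");
    System.out.println(wildcardColumns(schema, reader, true));
    System.out.println(wildcardColumns(schema, reader, false));
  }
}
```

In the non-strict case the schema behaves like a first reader whose columns fix the leading part of the tuple, which is the "0th reader" idea described above.
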

  • Field Details

  • Method Details

    • builder

      public static ScanLevelProjection.Builder builder()
    • build

      public static ScanLevelProjection build(List<SchemaPath> projectionList, List<ScanLevelProjection.ScanProjectionParser> parsers)
      Builder shortcut, primarily for tests.
    • build

      public static ScanLevelProjection build(List<SchemaPath> projectionList, List<ScanLevelProjection.ScanProjectionParser> parsers, TupleMetadata outputSchema)
      Builder shortcut, primarily for tests.
    • addTableColumn

      public void addTableColumn(ColumnProjection outCol)
    • addMetadataColumn

      public void addMetadataColumn(ColumnProjection outCol)
    • context

      public CustomErrorContext context()
    • requestedCols

      public List<SchemaPath> requestedCols()
      Returns the set of columns from the SELECT list.
      Returns:
      the SELECT list columns, in SELECT list order
    • columns

      public List<ColumnProjection> columns()
      The entire set of output columns, in output order. Output order is that specified in the SELECT (for an explicit list of columns) or table order (for SELECT * queries).
      Returns:
      the set of output columns in output order
    • projectionType

      public ScanLevelProjection.ScanProjectionType projectionType()
    • projectAll

      public boolean projectAll()
      Returns whether this is a SELECT * query.
      Returns:
      true if this is a SELECT * query
    • isEmptyProjection

      public boolean isEmptyProjection()
      Returns true if the projection list is empty. This usually indicates a SELECT COUNT(*) query (though the scan operator does not have the context to know that an empty list does, in fact, imply a count-only query...)
      Returns:
      true if no table columns are projected, false if at least one column is projected (or the query contained the wildcard)
    • rootProjection

      public RequestedTuple rootProjection()
    • readerProjection

      public ProjectionFilter readerProjection()
    • hasReaderSchema

      public boolean hasReaderSchema()
    • readerSchema

      public TupleMetadata readerSchema()
    • toString

      public String toString()
      Overrides:
      toString in class Object