Interface ScanSchemaTracker

All Known Implementing Classes:
AbstractSchemaTracker, ProjectionSchemaTracker, SchemaBasedTracker

public interface ScanSchemaTracker
Computes scan output schema from a variety of sources.

The scan operator output schema can be defined or dynamic.

Defined Schema

The planner computes a defined schema from metadata, as in a typical query engine. A defined schema defines the output schema directly: the defined schema is the output schema. Drill's planner does not yet support a defined schema, but work is in progress to get there for some cases.

With a defined schema, the reader is given a fully-defined schema and its job is to produce vectors that match the given schema. (The details are handled by the ResultSetLoader.)

At present, since the planner does not actually provide a defined schema, we support it in this class, and verify that the defined schema, if provided, exactly matches the names in the project list in the same order.

Dynamic Schema

A dynamic schema is one defined at run time: the traditional Drill approach. A dynamic schema starts with a projection list : a list of column names without types. This class converts the project list into a dynamic reader schema which is a schema in which each column has the type LATE, which basically means "a type to be named later" by the reader.
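The conversion can be pictured with a small sketch. This is not Drill's actual implementation (which uses TupleMetadata and column metadata classes); the class and enum names here are hypothetical stand-ins chosen to illustrate the idea of a LATE-typed dynamic schema.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: build a dynamic reader schema from a projection list. */
class DynamicSchemaSketch {
  // Stand-in for Drill's type enum; LATE means "a type to be named later".
  enum ColType { LATE, INT, VARCHAR }

  /** Each projected name starts with type LATE; the reader resolves it. */
  static Map<String, ColType> toDynamicSchema(List<String> projectList) {
    Map<String, ColType> schema = new LinkedHashMap<>();
    for (String col : projectList) {
      schema.put(col, ColType.LATE);   // type deferred to the reader
    }
    return schema;
  }
}
```

The insertion-ordered map preserves the projection list order, which matters because the SELECT list determines column order in the output row.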

Hybrid Schema

Some readers support a provided schema, which is a concept similar to, but distinct from, a defined schema. The provided schema offers hints about a schema. At present, it is an extra: the planner neither uses nor understands it. Thus, the projection list is independent of the provided schema: the lists may be disjoint.

With a provided schema, the project list defines the output schema. If the provided schema includes projected columns, then the provided types for those columns flow to the output schema, just as for a defined schema. Similarly, the reader is given a defined schema for those columns.

Where a provided schema differs is that the project list can include columns not in the provided schema; such columns act as in the dynamic case: the reader defines the column type.

Projection Types

Drill will pass in a project list which is one of three kinds:

  • SELECT *: Project all data source columns, whatever they happen to be. Create columns using names from the data source. The data source also determines the order of columns within the row.
  • SELECT a, b, c, ...: Project a specific set of columns, identified by case-insensitive name. The output row uses the names from the SELECT list, but types from the data source. Columns appear in the row in the order specified by the SELECT.
  • SELECT ...: Project nothing; occurs in SELECT COUNT(*) type queries. The provided projection list contains no (table) columns, though it may contain metadata columns.
Names in the project list can reference any of five distinct types of output columns:

  • Wildcard ("*") column: indicates the place in the projection list to insert the table columns once found in the table projection plan.
  • Data source columns: columns from the underlying table. The table projection planner will determine if the column exists, or must be filled in with a null column.
  • The generic data source columns array: columns, or optionally specific members of the columns array such as columns[1]. (Supported only by specific readers.)
  • Implicit columns: fqn, filename, filepath and suffix. These reference parts of the name of the file being scanned.
  • Partition columns: dir0, dir1, ...: These reference parts of the path name of the file.
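
The five kinds can be distinguished mechanically. The following is a simplified, hypothetical classifier (Drill's real parsing lives in its projection-parsing classes and implicit column names are configurable); it only illustrates the name patterns described above.

```java
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical sketch: classify a projected name into the five kinds above. */
class ProjectionKindSketch {
  enum Kind { WILDCARD, COLUMNS_ARRAY, IMPLICIT, PARTITION, DATA_SOURCE }

  // Default implicit column names; in Drill these are session options.
  private static final Set<String> IMPLICIT_COLS =
      Set.of("fqn", "filename", "filepath", "suffix");
  private static final Pattern PARTITION = Pattern.compile("dir\\d+");
  private static final Pattern COLUMNS_ARRAY = Pattern.compile("columns(\\[\\d+\\])?");

  /** Names are matched case-insensitively, as in the SELECT list. */
  static Kind classify(String name) {
    String key = name.toLowerCase();
    if (key.equals("*")) return Kind.WILDCARD;
    if (COLUMNS_ARRAY.matcher(key).matches()) return Kind.COLUMNS_ARRAY;
    if (IMPLICIT_COLS.contains(key)) return Kind.IMPLICIT;
    if (PARTITION.matcher(key).matches()) return Kind.PARTITION;
    return Kind.DATA_SOURCE;
  }
}
```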

Empty Schema

A special case occurs if the projection list is empty, which indicates that the query is a COUNT(*): we need only a count of columns, but none of the values. Implementation of the count is left to the specific reader, as some can optimize this case. The output schema may include a single dummy column. In this case, the first batch defines the schema expected from all subsequent readers and batches.

Implicit Columns

The project list can contain implicit columns for data sources which support them. Implicit columns are disjoint from data source columns and are provided by Drill itself. This class effectively splits the projection list into a set of implicit columns, and the remainder of the list which are the reader columns.

Reader Input Schema

The various forms of schema above produce a reader input schema: the schema given to the reader. The reader input schema is the set of projected columns, minus implicit columns, along with available type information.

If the reader can produce only one type for each column, then the provided or defined schema should already specify that type, and the reader can simply ignore the reader input schema. (This feature allows this scheme to be compatible with older readers.)

However, if the reader can convert a column to multiple types, then the reader should use the reader input schema to choose a type. If the input schema is dynamic (type is LATE), then the reader chooses the column type and should choose the "most natural" type.

Reader Output Schema

The reader proceeds to read a batch of data, choosing types for dynamic columns. The reader may provide a subset of projected columns if, say, the reader reads an older file that is missing some columns or (for a dynamic schema) the user specified columns which don't actually exist.

The result is the reader output schema: a subset of the reader input schema in which each included column has a concrete type. (The reader may have provided extra columns. In this case, the ResultSetLoader will have ignored those columns, providing a dummy column writer, and omitting non-projected columns from the reader output schema.)

The reader output schema is provided to this class which resolves any dynamic columns to the concrete type provided by the reader. If the column was already resolved, this class ensures that the reader's column type matches the resolved type to prevent column type changes.
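The resolution step can be sketched as a merge with a conflict check. This is a hypothetical simplification (Drill's implementation works on TupleMetadata and distinguishes wildcard from explicit projection); it shows only the core rule: LATE resolves to the reader's type, while an already-resolved type must match.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: merge a reader output schema into the scan schema. */
class SchemaResolveSketch {
  enum ColType { LATE, INT, VARCHAR }

  /**
   * Resolve dynamic (LATE) columns to the reader's concrete type. If a column
   * is already resolved, the reader's type must match; otherwise it is an
   * error. Columns not yet in the scan schema (the wildcard case) are added.
   */
  static Map<String, ColType> applyReaderSchema(
      Map<String, ColType> scanSchema, Map<String, ColType> readerOutput) {
    Map<String, ColType> resolved = new LinkedHashMap<>(scanSchema);
    for (Map.Entry<String, ColType> col : readerOutput.entrySet()) {
      ColType current = resolved.get(col.getKey());
      if (current == null || current == ColType.LATE) {
        resolved.put(col.getKey(), col.getValue());  // resolve or add
      } else if (current != col.getValue()) {
        throw new IllegalStateException("Type conflict for " + col.getKey());
      }
    }
    return resolved;
  }
}
```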

Dynamic Wildcard Schema

Traditional query planners resolve the wildcard (*) in the planner. When using a dynamic schema, Drill resolves the wildcard at run time. In this case, the reader input schema is empty and the reader defines the entire set of columns: names and types. This class then replaces the wildcard with the columns from the reader.

Missing Columns

When the reader output schema is a subset of the reader input schema, we have a set of missing columns (also called "null columns"). A part of the scan framework must invent vectors for these columns. If the type is available, then that is the type used; otherwise the missing column handler must invent a type (such as the nullable INT classically used). If the mode is nullable, the column is filled with nulls. If non-nullable, the column is filled with a default value. All of this work happens outside of this class.

The missing column handler defines its own output schema, which this class resolves in the same way it resolves the reader schema. The result is that all columns are now resolved to a concrete type.

Missing columns may be needed even for a wildcard if a first reader discovers, say, three columns, but a later reader encounters only two of them.
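The missing-column computation itself is plain set subtraction by column name, as the missingColumns() formula below also states. A minimal sketch, using hypothetical stand-in types rather than Drill's TupleMetadata:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: missing cols = reader input schema - reader output schema. */
class MissingColsSketch {
  enum ColType { LATE, INT, VARCHAR }

  /**
   * Set subtraction by column name: the columns the reader was asked for but
   * did not produce. Extra columns in the reader output are simply ignored.
   */
  static Map<String, ColType> missingColumns(
      Map<String, ColType> readerInput, Map<String, ColType> readerOutput) {
    Map<String, ColType> missing = new LinkedHashMap<>(readerInput);
    missing.keySet().removeAll(readerOutput.keySet());
    return missing;
  }
}
```

Note that a missing column may still carry type LATE here; in that case the missing column handler must choose a concrete type before the batch is built.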

Subsequent Readers and Schema Changes

All of the above occurs during the first batch of data. After that, the schema is fully defined: subsequent readers will encounter only a fully defined schema, which they must handle the same as if the scan were given a defined schema.

This rule works fine for an explicit project list. However, if the project list is dynamic and contains a wildcard, then the reader defines the output schema. What happens if a reader adds columns (or a second or later reader discovers new columns)? Traditionally, Drill simply adds those columns and sends an OK_NEW_SCHEMA (schema change) downstream for other operators to deal with.

This class supports the traditional approach as an option. This class also supports a more rational, strict rule: the schema is fixed after the first batch. That is, the first batch defines a schema commit point, after which the scan agrees not to change the schema. In this scenario, the first batch defines a schema (and project list) given to all subsequent readers. Any new columns are ignored (with a warning in the log).

Output Schema

All of the above contribute to the output schema: the schema sent downstream to the next operator. All of the above work is done to either:
  • Pass the defined schema to the output, with the reader (and missing columns handler) producing columns that match that schema.
  • Expand the dynamic schema with details provided by the reader (and missing columns handler), including the actual set of columns if the dynamic schema includes a wildcard.

Either way, the result is a schema which describes the actual vectors sent downstream.

Consumers

Information from this class is used in multiple ways:
  • A project list is given to the ResultSetLoader to specify which columns to project to vectors, and which to satisfy with a dummy column reader.
  • The reader, via the SchemaNegotiator, uses the reader input schema.
  • The reader, via the ResultSetLoader provides the reader output schema.
  • An implicit column manager handles the various implicit and partition directory columns: identifying them then later providing vector values.
  • A missing columns handler fills in missing columns.

Design

Schema resolution is a set of layers of choices. Each level and choice is represented by a class: virtual methods pick the right path based on class type rather than using a large collection of if-statements.

Maps

Maps present a difficult challenge. Drill allows projection within maps and we wish to exploit that in the scan. For example: m.a. The column state classes provide a map class. However, the projection notation is ambiguous: m.a could be a map m with a child column a. Or, it could be a DICT with a VARCHAR key.

To handle this, if we only have the project list, we use an unresolved column state, even if the projection itself has internal structure. We use a projection-based filter in the ResultSetLoader to handle the ambiguity. The projection filter, when presented with the reader's choice for column type, will check whether that type is consistent with the projection. If so, the reader will later present the reader output schema, which we use to resolve the projection-only unresolved column to a map column. (Or, if the column turns out to be a DICT, to a simple unresolved column.)

If the scan contains a second reader, then the second reader is given a stricter form of projection filter: one based on the actual MAP (or DICT) column.

If a defined or provided schema is available, then the schema tracker does have sufficient information to resolve the column directly to a map column, and the first reader will have the strict projection filter.

A user can project a map column which does not actually exist (or, at least, is not known to the first reader.) In that case, the missing column logic applies, but within the map. As a result, a second reader may encounter a type conflict if it discovers the previously-missing column, and finds that the default type conflicts with the real type.

  • Method Details

    • projectionType

    • columnProjection

      ProjectedColumn columnProjection(String colName)
      Return the projection for a column, if any.
    • isResolved

      boolean isResolved()
      Is the scan schema resolved? The schema is resolved depending on the complex lifecycle explained in the class comment. Resolution occurs when the wildcard (if any) is expanded, and all explicit projection columns obtain a definite type. If schema change is disabled, the schema will not change once it is resolved. If schema change is allowed, then batches or readers may extend the schema, triggering a schema change, and so the scan schema may move from one resolved state to another.

      The schema will be fully resolved after the first batch of data arrives from a reader (since the reader lifecycle will then fill in any missing columns.) The schema may be resolved sooner (such as if a strict provided schema, or an early reader schema is available and there are no missing columns.)

      Returns:
true if the schema is resolved, and hence the outputSchema() is available; false if the schema contains one or more dynamic columns which are not yet resolved.
    • schemaVersion

      int schemaVersion()
Gives the output schema version, which starts at some arbitrary positive number.

      If schema change is allowed, the schema version allows detecting schema changes as the scan schema moves from one resolved state to the next. Each schema will have a unique, increasing version number. A schema change has occurred if the version is newer than the previous output schema version.

      Returns:
      the schema version. The absolute number is not important, rather an increase indicates one or more columns were added at the top level or within a map at some nesting level
    • expandImplicitCol

      void expandImplicitCol(ColumnMetadata resolved, ImplicitColumnMarker marker)
Drill defines a wildcard to expand not just reader columns, but also partition columns. When the implicit column handler sees that the query has a wildcard (by calling #isProjectAll()), the handler then determines which partition columns are needed and calls this method to add each one.
    • applyImplicitCols

      TupleMetadata applyImplicitCols()
      Indicate that implicit column parsing is complete. Returns the implicit columns as identified by the implicit column handler, in the order of the projection list. Implicit columns do not appear in a reader input schema, and it is an error for the reader to produce such columns.
      Returns:
      a sub-schema of only implicit columns, in the order in which they appear in the output schema
    • applyEarlyReaderSchema

      void applyEarlyReaderSchema(TupleMetadata readerSchema)
      If a reader can define a schema before reading data, apply that schema to the scan schema. Allows the scan to report its output schema before the first batch of data if the scan schema becomes resolved after the early reader schema.
    • readerInputSchema

      TupleMetadata readerInputSchema()
The schema which the reader should produce. Depending on the type of the scan (specifically, if #isProjectAll() is true), the reader may produce additional columns beyond those in the reader input schema. However, for any batch, the reader, plus the missing columns handler, must produce all columns in the reader input schema.

      Formally:

      
       reader input schema = output schema - implicit col schema
       
      Returns:
      the sub-schema which includes those columns which the reader should provide, excluding implicit columns
    • missingColumns

      TupleMetadata missingColumns(TupleMetadata readerOutputSchema)
Identifies the missing columns given a reader output schema. The reader output schema contains those columns which the reader actually produced.

      Formally:

      
       missing cols = reader input schema - reader output schema
       

      The reader output schema can contain extra, newly discovered columns. Those are ignored when computing missing columns. Thus, the subtraction is set subtraction: remove columns common to the two sets.

    • outputSchema

      TupleMetadata outputSchema()
Returns the scan output schema. The computation is somewhat complicated and depends on the projection type.

      For a wildcard schema:

      
       output schema = implicit cols U reader output schema
       

      For an explicit projection:

      
       output schema = projection list
       
      Where the projection list is augmented by types from the provided schema, implicit columns or readers.

      A defined schema is the output schema, so:

       output schema = defined schema
       
      Returns:
      the complete output schema provided by the scan to downstream operators. Includes both reader and implicit columns, in the order of the projection list or, for a wildcard, in the order of the first reader
    • projectionFilter

      ProjectionFilter projectionFilter(CustomErrorContext errorContext)
      A reader is responsible for reading columns in the reader input schema. A reader may read additional columns. The projection filter is passed to the ResultSetLoader to determine which columns should be projected, allowing the reader to be blissfully ignorant of which columns are needed. The result set loader provides a dummy reader for unprojected columns. (A reader can, via the result set loader, find if a column is projected if doing so helps reader efficiency.)

The projection filter is the first line of defense for schema conflicts. The ResultSetLoader will query the filter with a full column schema. If that schema conflicts with the scan schema for that column, this method will raise a UserException, which typically indicates a programming error, or a very odd data source in which a column changes types between batches.

      Parameters:
      errorContext - the reader-specific error context to use if errors are found
      Returns:
      a filter used to decide which reader columns to project during reading
    • applyReaderSchema

      void applyReaderSchema(TupleMetadata readerOutputSchema, CustomErrorContext errorContext)
      Once a reader has read a batch, the reader will have provided a type for each projected column which the reader knows about. For a wildcard projection, the reader will have added all the columns that it found. This call takes the reader output schema and merges it with the current scan schema to resolve dynamic types to concrete types and to add newly discovered columns.

The process can raise an exception if the reader projects a column that it shouldn't (which is not actually possible because of the way the ResultSetLoader works). An error can also occur if the reader provides a type different from that already defined in the scan schema by a defined schema, a provided schema, or a previous reader in the same scan. In such cases, the reader is expected to have converted its input type to the specified type, which was presumably selected because the reader is capable of the required conversion.

      Parameters:
      readerOutputSchema - the actual schema produced by a reader when reading a record batch
      errorContext - the reader-specific error context to use if errors are found
    • resolveMissingCols

      void resolveMissingCols(TupleMetadata missingCols)
      The missing column handler obtains the list of missing columns from #missingColumns(). Depending on the scan lifecycle, some of the columns may have a type, others may be dynamic. The missing column handler chooses a type for any dynamic columns, then calls this method to tell the scan schema tracker the now-resolved column type.

      Note: a goal of the provided/defined schema system is to avoid the need to guess types for missing columns since doing so quite often leads to problems further downstream in the query. Ideally, the type of missing columns will be known (via the provided or defined schema) to avoid such conflicts.

    • errorContext

      CustomErrorContext errorContext()
      The scan-level error context used for errors which may occur before the first reader starts. The reader will provide a more detailed error context that describes what is being read.
      Returns:
      the scan-level error context
    • internalSchema

      MutableTupleSchema internalSchema()
      Returns the internal scan schema. Primarily for testing.
      Returns:
      the internal mutable scan schema