Class ScanSchemaOrchestrator

java.lang.Object
org.apache.drill.exec.physical.impl.scan.project.ScanSchemaOrchestrator

public class ScanSchemaOrchestrator extends Object
Performs projection of a record reader, along with a set of static columns, to produce the final "public" result set (record batch) for the scan operator. Primarily solves the "vector permanence" problem: that the scan operator must present the same set of vectors to downstream operators despite the fact that the scan operator hosts a series of readers, each of which builds its own result set.

Provides the option to continue a schema from one batch to the next. This can reduce spurious schema changes in formats, such as JSON, with varying fields. It is not, however, a complete solution as the outcome still depends on the order of file scans and the division of files across readers.

Provides the option to infer the schema from the first batch. The "quick path" to obtain the schema will read one batch, then use that schema as the returned schema, returning the full batch in the next call to next().

Publishing the Final Result Set

This class "publishes" a vector container that has the final, projected form of a scan. The projected schema include:
  • Columns from the reader.
  • Static columns, such as implicit or partition columns.
  • Null columns for items in the select list, but not found in either of the above two categories.
The order of columns is that set by the select list (or, by the reader for a SELECT * query.

Schema Handling

The mapping handles a variety of cases:
  • An early-schema table (one in which we know the schema and the schema remains constant for the whole table.
  • A late schema table (one in which we discover the schema as we read the table, and where the schema can change as the read progresses.)
    • Late schema table with SELECT * (we want all columns, whatever they happen to be.)
    • Late schema with explicit select list (we want only certain columns when they happen to appear in the input.)

Implementation Overview

Major tasks of this class include:
  • Project table columns (change position and or name).
  • Insert static and null columns.
  • Schema smoothing. That is, if table A produces columns (a, b), but table B produces only (a), use the type of the first table's b column for the null created for the missing b in table B.
  • Vector persistence: use the same set of vectors across readers as long as the reader schema does not cause a "hard" schema change (change in type, introduction of a new column.
  • Detection of schema changes (change of type, introduction of a new column for a SELECT * query, changing the projected schema, and reporting the change downstream.
A projection is needed to:
  • Reorder table columns
  • Select a subset of table columns
  • Fill in missing select columns
  • Fill in implicit or partition columns
Creates and returns the batch merger that does the projection.

Projection

To visualize this, assume we have numbered table columns, lettered implicit, null or partition columns:

 [ 1 | 2 | 3 | 4 ]    Table columns in table order
 [ A | B | C ]        Static columns
 
Now, we wish to project them into select order. Let's say that the SELECT clause looked like this, with "t" indicating table columns:

 SELECT t2, t3, C, B, t1, A, t2 ...
 
Then the projection looks like this:

 [ 2 | 3 | C | B | 1 | A | 2 ]
 
Often, not all table columns are projected. In this case, the result set loader presents the full table schema to the reader, but actually writes only the projected columns. Suppose we have:

 SELECT t3, C, B, t1,, A ...
 
Then the abbreviated table schema looks like this:

 [ 1 | 3 ]
Note that table columns retain their table ordering. The projection looks like this:

 [ 2 | C | B | 1 | A ]
 

The projector is created once per schema, then can be reused for any number of batches.

Merging is done in one of two ways, depending on the input source:

  • For the table loader, the merger discards any data in the output, then exchanges the buffers from the input columns to the output, leaving projected columns empty. Note that unprojected columns must be cleared by the caller.
  • For implicit and null columns, the output vector is identical to the input vector.
  • Field Details

    • MIN_BATCH_BYTE_SIZE

      public static final int MIN_BATCH_BYTE_SIZE
      See Also:
    • MAX_BATCH_BYTE_SIZE

      public static final int MAX_BATCH_BYTE_SIZE
      See Also:
    • DEFAULT_BATCH_ROW_COUNT

      public static final int DEFAULT_BATCH_ROW_COUNT
      See Also:
    • DEFAULT_BATCH_BYTE_COUNT

      public static final int DEFAULT_BATCH_BYTE_COUNT
    • MAX_BATCH_ROW_COUNT

      public static final int MAX_BATCH_ROW_COUNT
      See Also:
    • allocator

      protected final BufferAllocator allocator
    • options

    • metadataManager

      public final MetadataManager metadataManager
      Creates the metadata (file and directory) columns, if needed.
    • vectorCache

      protected final ResultVectorCacheImpl vectorCache
      Cache used to preserve the same vectors from one output batch to the next to keep the Project operator happy (which depends on exactly the same vectors.

      If the Project operator ever changes so that it depends on looking up vectors rather than vector instances, this cache can be deprecated.

    • scanProj

      protected final ScanLevelProjection scanProj
    • schemaSmoother

      protected final SchemaSmoother schemaSmoother
    • batchCount

      protected int batchCount
    • rowCount

      protected long rowCount
    • outputContainer

      protected VectorContainer outputContainer
  • Constructor Details

  • Method Details

    • startReader

      public ReaderSchemaOrchestrator startReader()
    • isProjectNone

      public boolean isProjectNone()
    • hasSchema

      public boolean hasSchema()
    • providedSchema

      public TupleMetadata providedSchema()
      Returns the provided reader schema.
    • output

      public VectorContainer output()
    • tallyBatch

      public void tallyBatch(int rowCount)
    • atLimit

      public boolean atLimit()
    • closeReader

      public void closeReader()
    • close

      public void close()