org.apache.drill.exec.physical.impl.scan.project.ScanSchemaOrchestrator

public class ScanSchemaOrchestrator extends Object

Performs projection of a record reader, along with a set of static columns, to produce the final "public" result set (record batch) for the scan operator. Primarily solves the "vector permanence" problem: that the scan operator must present the same set of vectors to downstream operators despite the fact that the scan operator hosts a series of readers, each of which builds its own result set.

Provides the option to continue a schema from one batch to the next. This can reduce spurious schema changes in formats, such as JSON, with varying fields. It is not, however, a complete solution as the outcome still depends on the order of file scans and the division of files across readers.

Provides the option to infer the schema from the first batch. The "quick path" to obtain the schema will read one batch, then use that schema as the returned schema, returning the full batch in the next call to next().

Publishing the Final Result Set

This class "publishes" a vector container that has the final, projected form of a scan. The projected schema include:

Columns from the reader.

Static columns, such as implicit or partition columns.

Null columns for items in the select list, but not found in either of the above two categories.

The order of columns is that set by the select list (or, by the reader for a `SELECT *` query.

Schema Handling

The mapping handles a variety of cases:

An early-schema table (one in which we know the schema and the schema remains constant for the whole table.
A late schema table (one in which we discover the schema as we read the table, and where the schema can change as the read progresses.)
- Late schema table with SELECT * (we want all columns, whatever they happen to be.)
- Late schema with explicit select list (we want only certain columns when they happen to appear in the input.)

Implementation Overview

Major tasks of this class include:

Project table columns (change position and or name).
Insert static and null columns.
Schema smoothing. That is, if table A produces columns (a, b), but table B produces only (a), use the type of the first table's b column for the null created for the missing b in table B.
Vector persistence: use the same set of vectors across readers as long as the reader schema does not cause a "hard" schema change (change in type, introduction of a new column.
Detection of schema changes (change of type, introduction of a new column for a SELECT * query, changing the projected schema, and reporting the change downstream.

A projection is needed to:

Reorder table columns
Select a subset of table columns
Fill in missing select columns
Fill in implicit or partition columns

Creates and returns the batch merger that does the projection.

Projection

To visualize this, assume we have numbered table columns, lettered implicit, null or partition columns:


 [ 1 | 2 | 3 | 4 ]    Table columns in table order
 [ A | B | C ]        Static columns

Now, we wish to project them into select order. Let's say that the SELECT clause looked like this, with "t" indicating table columns:


 SELECT t2, t3, C, B, t1, A, t2 ...

Then the projection looks like this:


 [ 2 | 3 | C | B | 1 | A | 2 ]

Often, not all table columns are projected. In this case, the result set loader presents the full table schema to the reader, but actually writes only the projected columns. Suppose we have:


 SELECT t3, C, B, t1,, A ...

Then the abbreviated table schema looks like this:


 [ 1 | 3 ]

Note that table columns retain their table ordering. The projection looks like this:


 [ 2 | C | B | 1 | A ]

The projector is created once per schema, then can be reused for any number of batches.

Merging is done in one of two ways, depending on the input source:

For the table loader, the merger discards any data in the output, then exchanges the buffers from the input columns to the output, leaving projected columns empty. Note that unprojected columns must be cleared by the caller.
For implicit and null columns, the output vector is identical to the input vector.

Nested Class Summary

Nested Classes

Modifier and Type

Class

Description

static class

ScanSchemaOrchestrator.ScanOrchestratorBuilder

static class

ScanSchemaOrchestrator.ScanSchemaOptions
Field Summary

Fields

Modifier and Type

Field

Description

protected final BufferAllocator

allocator

protected int

batchCount

static final int

DEFAULT_BATCH_BYTE_COUNT

static final int

DEFAULT_BATCH_ROW_COUNT

static final int

MAX_BATCH_BYTE_SIZE

static final int

MAX_BATCH_ROW_COUNT

final MetadataManager

metadataManager

Creates the metadata (file and directory) columns, if needed.

static final int

MIN_BATCH_BYTE_SIZE

protected final ScanSchemaOrchestrator.ScanSchemaOptions

options

protected VectorContainer

outputContainer

protected long

rowCount

protected final ScanLevelProjection

scanProj

protected final SchemaSmoother

schemaSmoother

protected final ResultVectorCacheImpl

vectorCache

Cache used to preserve the same vectors from one output batch to the next to keep the Project operator happy (which depends on exactly the same vectors.
Constructor Summary

Constructors

Constructor

Description

ScanSchemaOrchestrator(BufferAllocator allocator, ScanSchemaOrchestrator.ScanOrchestratorBuilder builder)
Method Summary

Modifier and Type

Method

Description

boolean

atLimit()

void

close()

void

closeReader()

boolean

hasSchema()

boolean

isProjectNone()

VectorContainer

output()

TupleMetadata

providedSchema()

Returns the provided reader schema.

ReaderSchemaOrchestrator

startReader()

void

tallyBatch(int rowCount)

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- MIN_BATCH_BYTE_SIZE
  
  public static final int MIN_BATCH_BYTE_SIZE
  See Also:
  
  Constant Field Values
- MAX_BATCH_BYTE_SIZE
  
  public static final int MAX_BATCH_BYTE_SIZE
  See Also:
  
  Constant Field Values
- DEFAULT_BATCH_ROW_COUNT
  
  public static final int DEFAULT_BATCH_ROW_COUNT
  See Also:
  
  Constant Field Values
- DEFAULT_BATCH_BYTE_COUNT
  
  public static final int DEFAULT_BATCH_BYTE_COUNT
- MAX_BATCH_ROW_COUNT
  
  public static final int MAX_BATCH_ROW_COUNT
  See Also:
  
  Constant Field Values
- allocator
  
  protected final BufferAllocator allocator
- options
  
  protected final ScanSchemaOrchestrator.ScanSchemaOptions options
- metadataManager
  
  public final MetadataManager metadataManager
  
  Creates the metadata (file and directory) columns, if needed.
- vectorCache
  
  protected final ResultVectorCacheImpl vectorCache
  
  Cache used to preserve the same vectors from one output batch to the next to keep the Project operator happy (which depends on exactly the same vectors.
  If the Project operator ever changes so that it depends on looking up vectors rather than vector instances, this cache can be deprecated.
- scanProj
  
  protected final ScanLevelProjection scanProj
- schemaSmoother
  
  protected final SchemaSmoother schemaSmoother
- batchCount
  
  protected int batchCount
- rowCount
  
  protected long rowCount
- outputContainer
  
  protected VectorContainer outputContainer
Constructor Details
- ScanSchemaOrchestrator
  
  public ScanSchemaOrchestrator(BufferAllocator allocator, ScanSchemaOrchestrator.ScanOrchestratorBuilder builder)
Method Details
- startReader
  
  public ReaderSchemaOrchestrator startReader()
- isProjectNone
  
  public boolean isProjectNone()
- hasSchema
  
  public boolean hasSchema()
- providedSchema
  
  public TupleMetadata providedSchema()
  
  Returns the provided reader schema.
- output
  
  public VectorContainer output()
- tallyBatch
  
  public void tallyBatch(int rowCount)
- atLimit
  
  public boolean atLimit()
- closeReader
  
  public void closeReader()
- close
  
  public void close()

Class ScanSchemaOrchestrator

Publishing the Final Result Set

Schema Handling

Implementation Overview

Projection

Nested Class Summary

Field Summary

Constructor Summary

Method Summary

Methods inherited from class java.lang.Object

Field Details

MIN_BATCH_BYTE_SIZE

MAX_BATCH_BYTE_SIZE

DEFAULT_BATCH_ROW_COUNT

DEFAULT_BATCH_BYTE_COUNT

MAX_BATCH_ROW_COUNT

allocator

options

metadataManager

vectorCache

scanProj

schemaSmoother

batchCount

rowCount

outputContainer

Constructor Details

ScanSchemaOrchestrator

Method Details

startReader

isProjectNone

hasSchema

providedSchema

output

tallyBatch

atLimit

closeReader

close