All Known Subinterfaces: ColumnsSchemaNegotiator, FileScanFramework.FileSchemaNegotiator

All Known Implementing Classes: ColumnsScanFramework.ColumnsSchemaNegotiatorImpl, FileScanFramework.FileSchemaNegotiatorImpl, SchemaNegotiatorImpl

public interface SchemaNegotiator

Negotiates the table schema with the scanner framework and provides context information for the reader. Scans use either a "dynamic" or a defined schema.

Regardless of the schema type, the result of building the schema is a result set loader used to prepare batches for use in the query. The reader can simply read all columns and let the framework discard unwanted values. Alternatively, for efficiency, the reader can check the column metadata to determine whether a column is projected and, if it is not, skip reading that column from the input source.
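The second option, skipping unprojected columns at the source, can be sketched as follows. This is a minimal illustration using plain Java collections: `ProjectionSketch`, `readProjected`, and the `projected` set are hypothetical names standing in for the column metadata, not part of Drill's API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: a set of projected names stands in for column metadata.
final class ProjectionSketch {

  /** Returns only the fields of a comma-separated line whose column is projected. */
  static List<String> readProjected(String line, List<String> columns, Set<String> projected) {
    String[] fields = line.split(",", -1);
    List<String> out = new ArrayList<>();
    for (int i = 0; i < columns.size(); i++) {
      // Skip the column entirely when the query does not project it.
      if (!projected.contains(columns.get(i))) {
        continue;
      }
      out.add(fields[i]);
    }
    return out;
  }

  public static void main(String[] args) {
    List<String> columns = List.of("a", "b", "c");
    System.out.println(readProjected("1,2,3", columns, Set.of("a", "c"))); // prints [1, 3]
  }
}
```

For a format where decoding a field is expensive, the saving comes from never touching the skipped field; in this toy CSV case the split still visits every field.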

Defined Schema

If defined, the execution plan provides the output schema (presumably computed from an accurate metadata source). The reader must populate the prescribed rows, performing column type conversions as needed. The reader can determine if the schema is defined by calling hasProvidedSchema().

At present, the scan framework filters the "provided schema" against the project list so that this class presents only the actual output schema. Future versions may do the filtering in the planner, but the result for readers will be the same either way.
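As a rough illustration of what "performing column type conversions as needed" can mean for a reader, the sketch below converts raw text values to the types named by a provided schema and ignores source columns the schema does not list. The `ColType` enum, the map-based schema, and `convertRow` are simplifying assumptions made for this example; Drill's real schema type is `TupleMetadata`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: a name-to-type map stands in for the provided schema.
final class DefinedSchemaSketch {
  enum ColType { INT, VARCHAR }

  /** Builds one output row in provided-schema order, converting text to the declared type. */
  static Map<String, Object> convertRow(Map<String, ColType> provided, Map<String, String> raw) {
    Map<String, Object> row = new LinkedHashMap<>();
    for (Map.Entry<String, ColType> col : provided.entrySet()) {
      String text = raw.get(col.getKey());
      if (text == null) {
        row.put(col.getKey(), null);                          // column absent from the source
      } else if (col.getValue() == ColType.INT) {
        row.put(col.getKey(), Integer.parseInt(text.trim())); // text to INT conversion
      } else {
        row.put(col.getKey(), text);                          // VARCHAR needs no conversion
      }
    }
    return row; // source columns not named in the provided schema are ignored
  }

  public static void main(String[] args) {
    Map<String, ColType> provided = new LinkedHashMap<>();
    provided.put("id", ColType.INT);
    provided.put("name", ColType.VARCHAR);
    System.out.println(convertRow(provided, Map.of("id", " 7", "name", "x", "extra", "y")));
    // prints {id=7, name=x}
  }
}
```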

Dynamic Schema

A dynamic schema occurs when the plan does not specify a schema. Drill is unique in its support for "schema on read" in the sense that Drill does not know the schema until the reader defines it at scan time.

The reader and scan framework coordinate to form the output schema. The reader offers the columns it has available. The scan framework uses the projection list to decide which to accept. Either way, the scan framework provides a column reader for the column, returning a do-nothing "dummy" reader if the column is unprojected.
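The offer-and-accept exchange can be pictured with a small model. In the sketch below, `ColumnSink` is a hypothetical name for the per-column object the framework hands back, and `negotiate` is an invented helper; neither is Drill's API. The point is only that the reader gets an object for every offered column, and values sent to an unprojected column are silently dropped.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical model of dynamic schema negotiation; not Drill's classes.
final class DynamicSchemaSketch {
  interface ColumnSink { void write(String value); }

  /** Handed out for a projected column: keeps every value. */
  static final class VectorSink implements ColumnSink {
    final List<String> values = new ArrayList<>();
    public void write(String value) { values.add(value); }
  }

  /** Handed out for an unprojected column: does nothing. */
  static final class DummySink implements ColumnSink {
    public void write(String value) { }
  }

  /** The reader offers columns; the projection set decides which are materialized. */
  static Map<String, ColumnSink> negotiate(List<String> offered, Set<String> projected) {
    Map<String, ColumnSink> sinks = new LinkedHashMap<>();
    for (String col : offered) {
      sinks.put(col, projected.contains(col) ? new VectorSink() : new DummySink());
    }
    return sinks;
  }

  public static void main(String[] args) {
    Map<String, ColumnSink> sinks = negotiate(List.of("a", "b"), Set.of("a"));
    sinks.get("a").write("1"); // stored
    sinks.get("b").write("2"); // discarded by the dummy
    System.out.println(((VectorSink) sinks.get("a")).values); // prints [1]
  }
}
```

The reader code stays uniform: it writes every column it reads and never has to branch on projection unless it wants the efficiency gain.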

With a dynamic schema, the reader provides the table schema in one of two ways: early schema or late schema. Either way, the project list from the physical plan determines which table columns are materialized and which are not. Column readers are provided for all table columns, which helps readers that must read sequentially, but only the materialized columns are written to value vectors.

Early Dynamic Schema

Some readers can determine the source schema at the start of a scan. For example, a CSV file has headers and a Parquet file has footers, both of which define a schema. This case is called "early schema." The reader defines the schema by calling tableSchema(TupleMetadata, boolean) to provide the known schema.
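A minimal sketch of the early-schema pattern, assuming a CSV-like source whose first line names the columns. The nested `Negotiator` class here is a hypothetical stand-in that takes a list of column names; the real method takes a `TupleMetadata`.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of an early-schema reader's open step; not Drill's classes.
final class EarlySchemaSketch {

  /** Stand-in negotiator that records what the reader declares. */
  static final class Negotiator {
    List<String> schema;
    boolean complete;

    void tableSchema(List<String> schema, boolean isComplete) {
      this.schema = schema;
      this.complete = isComplete;
    }
  }

  /** Declares the schema from the header line before any data row is read. */
  static void open(String headerLine, Negotiator negotiator) {
    List<String> columns = Arrays.asList(headerLine.split(","));
    // The header lists every column, so the schema is declared complete.
    negotiator.tableSchema(columns, true);
  }

  public static void main(String[] args) {
    Negotiator negotiator = new Negotiator();
    open("id,name", negotiator);
    System.out.println(negotiator.schema); // prints [id, name]
  }
}
```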

Late Dynamic Schema

Other readers don't know the input schema until the reader actually reads the data. For example, JSON typically has no schema, but does have sufficient structure (name/value pairs) to infer one.

The late schema reader calls RowSetLoader#addColumn() to add each column as it is discovered during the scan.

Note that, to avoid schema conflicts, a late schema reader must define the full set of columns in the first batch, and must stick to that schema for all subsequent batches. This allows the reader to look one batch ahead to learn the columns.

Drill, however, cannot predict the future. Without a defined schema, downstream operators cannot know which columns might appear later in the scan, with which types. Today this is a strong guideline. Future versions may enforce this rule.
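The first-batch guideline can be made concrete with a toy loader that adds columns as it discovers them and reports when a later batch introduces a new one. Everything here (`LateSchemaSketch`, `loadBatch`, rows modeled as maps) is invented for illustration; a real late-schema reader would add columns through the row set loader.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of late-schema discovery; not Drill's classes.
final class LateSchemaSketch {
  private final Set<String> columns = new LinkedHashSet<>();
  private boolean firstBatchDone;

  /**
   * Adds each newly seen column to the schema. Returns false if this batch
   * introduced a column after the first batch, which is the case the
   * guideline asks readers to avoid.
   */
  boolean loadBatch(List<Map<String, String>> rows) {
    boolean stable = true;
    for (Map<String, String> row : rows) {
      for (String name : row.keySet()) {
        // add() returns true only for a column not seen before.
        if (columns.add(name) && firstBatchDone) {
          stable = false;
        }
      }
    }
    firstBatchDone = true;
    return stable;
  }

  Set<String> columns() {
    return columns;
  }

  public static void main(String[] args) {
    LateSchemaSketch loader = new LateSchemaSketch();
    System.out.println(loader.loadBatch(List.of(Map.of("a", "1"), Map.of("b", "2")))); // prints true
    System.out.println(loader.loadBatch(List.of(Map.of("c", "3"))));                   // prints false
  }
}
```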

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| void | batchSize(int maxRecordsPerBatch) | Set the preferred batch size (which may be overridden by the result set loader in order to limit vector or batch size.) |
| ResultSetLoader | build() | Build the schema, plan the required projections and static columns and return a loader used to populate value vectors. |
| OperatorContext | context() | |
| com.typesafe.config.Config | drillConfig() | |
| boolean | hasProvidedSchema() | Report if the execution plan defines a provided schema. |
| boolean | isProjectionEmpty() | Report whether the projection list is empty, as occurs in two cases: SELECT COUNT(*) ... -- empty project. SELECT a, b FROM table(c d) -- disjoint project. |
| void | limit(long limit) | Push down a LIMIT into the scan. |
| CustomErrorContext | parentErrorContext() | The context to use as a parent when creating a custom context. |
| TupleMetadata | providedSchema() | Returns the provided schema, if defined. |
| OptionSet | queryOptions() | |
| void | setErrorContext(CustomErrorContext context) | Specify an advanced error context which allows the reader to fill in custom context values. |
| void | tableSchema(TupleMetadata schema, boolean isComplete) | Specify the table schema if this is an early-schema reader. |
| String | userName() | Name of the user running the query. |

Method Details
- context
  
  OperatorContext context()
- drillConfig
  
  com.typesafe.config.Config drillConfig()
- queryOptions
  
  OptionSet queryOptions()
- setErrorContext
  
  void setErrorContext(CustomErrorContext context)
  
  Specify an advanced error context which allows the reader to fill in custom context values.
- userName
  
  String userName()
  
  Name of the user running the query.
- hasProvidedSchema
  
  boolean hasProvidedSchema()
  
  Report if the execution plan defines a provided schema. If so, the reader should use that schema, converting or ignoring columns as needed. A scan without a provided schema has a "dynamic" schema to be defined by the scan operator itself along with the column projection list.
  
  Returns:
  
  true if the execution plan defines the output schema, false if the schema should be computed dynamically from the source schema and column projections
- providedSchema
  
  TupleMetadata providedSchema()
  
  Returns the provided schema, if defined. The provided schema is a description of the source schema viewed as a Drill schema.
  
  Returns:
  
  the output schema, if hasProvidedSchema() returns true, null otherwise
- tableSchema
  
  void tableSchema(TupleMetadata schema, boolean isComplete)
  
  Specify the table schema if this is an early-schema reader. Need not be called for late-schema readers. The schema provided here, if any, is a base schema: the reader is free to discover additional columns during the read. Should only be called if the schema is dynamic, that is, if hasProvidedSchema() returns false.
  
  Parameters:
  
  schema - the table schema if known at open time
  
  isComplete - true if the schema is complete: if it can be used to define an empty schema-only batch for the first reader. Set to false if the schema is partial: if the reader must read rows to determine the full schema
- batchSize
  
  void batchSize(int maxRecordsPerBatch)
  
  Set the preferred batch size (which may be overridden by the result set loader in order to limit vector or batch size.)
  
  Parameters:
  
  maxRecordsPerBatch - preferred number of records per batch
- limit
  
  void limit(long limit)
  
  Push down a LIMIT into the scan. This is a per-reader limit, not an overall scan limit.
- build
  
  ResultSetLoader build()
  
  Build the schema, plan the required projections and static columns and return a loader used to populate value vectors. If the select list includes a subset of table columns, then the loader will be set up in table schema order, but the unneeded column loaders will be null, meaning that the batch reader should skip setting those columns.
  
  Returns:
  
  the loader for the table with columns arranged in table schema order
- isProjectionEmpty
  
  boolean isProjectionEmpty()
  Report whether the projection list is empty, as occurs in two cases:
  
  SELECT COUNT(*) ... -- empty project.
  
  SELECT a, b FROM table(c d) -- disjoint project.
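One way a reader might act on an empty projection is to count rows without parsing any fields. The sketch below shows that idea under stated assumptions: `EmptyProjectionSketch` and `scan` are hypothetical, and whether a given Drill reader takes this shortcut is not stated in this page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a reader shortcut for an empty projection; not Drill's classes.
final class EmptyProjectionSketch {

  /** Returns the row count; fields are parsed into out only when something is projected. */
  static int scan(List<String> lines, boolean projectionEmpty, List<String[]> out) {
    int rows = 0;
    for (String line : lines) {
      if (!projectionEmpty) {
        out.add(line.split(",")); // parse fields only when a column is needed
      }
      rows++; // every row still counts, e.g. for SELECT COUNT(*)
    }
    return rows;
  }

  public static void main(String[] args) {
    List<String[]> parsed = new ArrayList<>();
    System.out.println(scan(List.of("1,2", "3,4"), true, parsed)); // prints 2
    System.out.println(parsed.size());                             // prints 0
  }
}
```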
