Class ScanOperatorExec

java.lang.Object
org.apache.drill.exec.physical.impl.scan.ScanOperatorExec
All Implemented Interfaces:
OperatorExec

public class ScanOperatorExec extends Object implements OperatorExec
Implementation of the revised scan operator that uses a mutator aware of batch sizes. This is the successor to ScanBatch and should be used by all new scan implementations.

The basic concept is to split the scan operator into layers:

  • The OperatorRecordBatch which implements Drill's Volcano-like protocol.
  • The scan operator "wrapper" (this class) which implements actions for the operator record batch specifically for scan. It iterates over readers, delegating semantic work to other classes.
  • The implementation of per-reader semantics in the two EVF versions and other ad-hoc implementations.
  • The result set loader and related classes which pack values into value vectors.
  • Value vectors, which store the data.

The layered structure can be confusing. However, each layer is itself fairly complex, so dividing the work among layers keeps the overall complexity under control.

Scanner Framework

Acts as an adapter between the operator protocol and the row reader protocol.

The scan operator itself is simply a framework for handling a set of readers: it knows nothing beyond the interfaces of the components it works with, delegating all knowledge of schemas, projection, reading and the like to implementations of those interfaces. Because that work is complex, a set of frameworks exists to handle the most common use cases, but a specialized reader can create a framework or reader from scratch.

Error handling in this class is minimal: the enclosing record batch iterator is responsible for handling exceptions. Error handling relies on the fact that the iterator will call close() regardless of which exceptions are thrown.
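A minimal sketch of that contract, standing in for the enclosing record batch iterator; only methods documented below are used, and the import path is an assumption from Drill's protocol package:

  import org.apache.drill.exec.physical.impl.protocol.OperatorExec;

  // Hypothetical driver code: not Drill's actual iterator, just the
  // close-regardless-of-exceptions guarantee in miniature.
  class IteratorSketch {
    void consume(OperatorExec scan) {
      try {
        while (scan.next()) {
          // Pass scan.batchAccessor() to the downstream operator here.
        }
      } finally {
        scan.close();  // runs whether next() returned false or threw
      }
    }
  }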

Protocol

The scanner works directly with two other interfaces:

The ScanOperatorEvents implementation provides the set of readers to use. That implementation can simply maintain a list of readers, or can create each reader on demand.

More subtly, the factory also handles projection issues and manages vectors across subsequent readers. A number of factories are available for the most common cases. Extend these to implement a version specific to a data source.
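As a concrete illustration, a list-backed factory might look like the sketch below. The bind()/nextReader()/close() shape of ScanOperatorEvents is inferred from the description above, and the import paths are assumptions; consult the actual interface for the exact contract.

  import java.util.Iterator;
  import java.util.List;
  import org.apache.drill.exec.ops.OperatorContext;
  import org.apache.drill.exec.physical.impl.scan.RowBatchReader;
  import org.apache.drill.exec.physical.impl.scan.ScanOperatorEvents;

  // Sketch only: real frameworks also handle projection and manage
  // vectors across readers, as noted above.
  class ListReaderFactory implements ScanOperatorEvents {
    private final Iterator<RowBatchReader> readers;

    ListReaderFactory(List<RowBatchReader> readers) {
      this.readers = readers.iterator();
    }

    @Override
    public void bind(OperatorContext context) { }  // nothing to bind here

    @Override
    public RowBatchReader nextReader() {
      // Returning null tells the scan operator that no readers remain.
      return readers.hasNext() ? readers.next() : null;
    }

    @Override
    public void close() { }  // each reader cleans up in its own close()
  }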

The RowBatchReader is a surprisingly minimal interface that nonetheless captures the essence of reading a result set as a set of batches. The factory implementations mentioned above implement this interface to provide commonly used services, the most important of which is access to a ResultSetLoader to write values into value vectors.
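The sketch below shows the shape of a reader's per-batch loop against the result set loader. The startBatch()/writer()/start()/save()/isFull() calls follow the result set loader framework but should be verified against the actual API (import paths vary across Drill versions); hasMoreRows() and readRow() are hypothetical stand-ins for source-specific logic.

  import org.apache.drill.exec.physical.resultSet.ResultSetLoader;
  import org.apache.drill.exec.physical.resultSet.RowSetLoader;

  // Skeleton of a reader's batch loop.
  abstract class ReaderSketch {
    protected abstract boolean hasMoreRows();             // source-specific
    protected abstract void readRow(RowSetLoader writer); // source-specific

    boolean nextBatch(ResultSetLoader loader) {
      loader.startBatch();
      RowSetLoader writer = loader.writer();
      // The loader enforces its batch size limit through isFull().
      while (!writer.isFull() && hasMoreRows()) {
        writer.start();    // begin a row
        readRow(writer);   // set column values through the column writers
        writer.save();     // commit the row
      }
      return hasMoreRows(); // false signals that this reader is exhausted
    }
  }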

Schema Versions

Readers may change schemas from time to time. To track such changes, this implementation maintains a batch schema version, updated by comparing each new schema with the previous one.

Readers can discover columns as they read data, such as with any JSON-based format. In this case, the row set mutator also provides a schema version, but a fine-grained one that changes each time a column is added.

The two schema versions serve different purposes and are not interchangeable. For example, if a scan reads two files, each builds up its own schema, increasing its internal version number as work proceeds. But, at the end of each batch, the two schemas may (and, in fact, should) be identical, in which case the batch-level schema version, the one downstream operators care about, stays the same.
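A sketch of the batch-level bookkeeping described above; the tracker class is hypothetical, and the assumption is that BatchSchema offers an isEquivalent() comparison:

  import org.apache.drill.exec.record.BatchSchema;

  // Hypothetical tracker: bump the batch-level version only when the
  // schema actually changes from one batch to the next.
  class SchemaVersionTracker {
    private BatchSchema prevSchema;
    private int version;

    int track(BatchSchema newSchema) {
      if (prevSchema == null || !newSchema.isEquivalent(prevSchema)) {
        version++;             // downstream operators see a schema change
        prevSchema = newSchema;
      }
      // Identical schemas from successive readers keep the same version,
      // even though each reader's loader advanced its fine-grained one.
      return version;
    }
  }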

Empty Files and/or Empty Schemas

A corner case occurs if the input is empty, such as a CSV file that contains no data. The general rules are as follows; a sketch of how a caller observes them appears after the list:
  • If the reader is "early schema" (the schema is defined at open time), then the result will be a single empty batch with the schema defined. Example: a CSV file without headers; in this case, we know the schema is always the single `columns` array.
  • If the reader is "late schema" (the schema is defined while the data is read), then no batch is returned because there is no schema. Example: a JSON file. It is not helpful to return a single batch with no columns; such a batch will simply conflict with some other non-empty-schema batch. It turns out that other DBs handle this case gracefully: a query of the form
    
     SELECT * FROM VALUES()

     will produce an empty result: no schema, no data.
  • The hybrid case: the reader is nominally "early schema", but the schema it can provide contains no columns. We treat this case identically to the late-schema case. Example: a CSV file with headers in which the header line is empty.
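How these rules surface to a caller through the buildSchema() contract documented below (the wrapper method here is hypothetical glue):

  import org.apache.drill.exec.physical.impl.protocol.OperatorExec;

  class EmptyInputSketch {
    static void checkSchema(OperatorExec scan) {
      if (scan.buildSchema()) {
        // Early-schema case: an empty, schema-only batch is available
        // through scan.batchAccessor().
      } else {
        // Late-schema (or hybrid) case: EOF arrived before any column
        // appeared, so there is no schema and no batch at all.
      }
    }
  }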
  • Constructor Details

    • ScanOperatorExec

      public ScanOperatorExec(ScanOperatorEvents factory, boolean allowEmptyResult)
  • Method Details

    • bind

      public void bind(OperatorContext context)
      Description copied from interface: OperatorExec
      Bind this operator to the context. The context provides access to per-operator, per-fragment and per-Drillbit services. Also provides access to the operator definition (AKA "pop config") for this operator.
      Specified by:
      bind in interface OperatorExec
      Parameters:
      context - operator context
    • batchAccessor

      public BatchAccessor batchAccessor()
      Description copied from interface: OperatorExec
      Provides a generic access mechanism to the batch's output data. This method is called after a successful return from OperatorExec.buildSchema() and OperatorExec.next(). The batch itself can be held in a standard VectorContainer, or in some other structure more convenient for this operator.
      Specified by:
      batchAccessor in interface OperatorExec
      Returns:
      the access for the batch's output container
    • context

      public OperatorContext context()
    • buildSchema

      public boolean buildSchema()
      Description copied from interface: OperatorExec
      Retrieves the schema of the batch before the first actual batch of data. The schema is returned via an empty batch (no rows, only schema) from OperatorExec.batchAccessor().
      Specified by:
      buildSchema in interface OperatorExec
      Returns:
      true if a schema is available, false if the operator reached EOF before a schema was found
    • next

      public boolean next()
      Description copied from interface: OperatorExec
      Retrieves the next batch of data. The data is returned via the OperatorExec.batchAccessor() method.
      Specified by:
      next in interface OperatorExec
      Returns:
      true if another batch of data is available, false if EOF was reached and no more data is available
    • cancel

      public void cancel()
      Description copied from interface: OperatorExec
      Alerts the operator that the query was cancelled. Generally optional, but allows the operator to realize that a cancellation was requested.
      Specified by:
      cancel in interface OperatorExec
    • close

      public void close()
      Description copied from interface: OperatorExec
      Close the operator by releasing all resources that the operator held. Called after OperatorExec.cancel() and after OperatorExec.batchAccessor() or OperatorExec.next() returns false.

      Note that there may be a significant delay between the last call to next() and the call to close() during which downstream operators do their work. A tidy operator will release resources immediately after EOF to avoid holding onto memory or other resources that could be used by downstream operators.

      Specified by:
      close in interface OperatorExec
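Putting the pieces together, a sketch of the full lifecycle using only the methods documented above; the driver class is hypothetical (in Drill, the operator record batch performs this sequence), and the import paths are assumptions:

  import org.apache.drill.exec.ops.OperatorContext;
  import org.apache.drill.exec.physical.impl.protocol.OperatorExec;

  class LifecycleSketch {
    void runScan(OperatorExec scan, OperatorContext context) {
      scan.bind(context);            // attach per-operator services
      try {
        if (scan.buildSchema()) {    // empty, schema-only batch first
          while (scan.next()) {      // then one data batch per call
            // Hand scan.batchAccessor() to the downstream operator.
          }
        }
        // On query cancellation, the caller invokes scan.cancel() here
        // before closing, per the close() contract above.
      } finally {
        scan.close();                // always runs, even on failure
      }
    }
  }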