Class ScanOperatorExec

java.lang.Object
org.apache.drill.exec.physical.impl.scan.ScanOperatorExec
All Implemented Interfaces:
OperatorExec

public class ScanOperatorExec extends Object implements OperatorExec
Implementation of the revised scan operator that uses a mutator aware of batch sizes. This is the successor to ScanBatch and should be used by all new scan implementations.

The basic concept is to split the scan operator into layers:

  • The OperatorRecordBatch which implements Drill's Volcano-like protocol.
  • The scan operator "wrapper" (this class) which implements actions for the operator record batch specifically for scan. It iterates over readers, delegating semantic work to other classes.
  • The implementation of per-reader semantics in the two EVF versions and other ad-hoc implementations.
  • The result set loader and related classes which pack values into value vectors.
  • Value vectors, which store the data.

The layered structure can be confusing. However, each layer is itself fairly complex, so dividing the work among layers keeps the overall complexity under control.

Scanner Framework

Acts as an adapter between the operator protocol and the row reader protocol.

The scan operator itself is simply a framework for handling a set of readers: it knows nothing beyond the interfaces of the components it works with, delegating all knowledge of schemas, projection, reading and the like to implementations of those interfaces. Because that work is complex, a set of frameworks exists to handle the most common use cases, but a specialized reader can create a framework or reader from scratch.

Error handling in this class is minimal: the enclosing record batch iterator is responsible for handling exceptions. Error handling relies on the fact that the iterator will call close() regardless of which exceptions are thrown.
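A minimal sketch of that contract, standing in for the enclosing record batch iterator; only methods documented below are used, and the import path is an assumption from Drill's protocol package:

  import org.apache.drill.exec.physical.impl.protocol.OperatorExec;

  // Hypothetical driver code: not Drill's actual iterator, just the
  // close-regardless-of-exceptions guarantee in miniature.
  class IteratorSketch {
    void consume(OperatorExec scan) {
      try {
        while (scan.next()) {
          // Pass scan.batchAccessor() to the downstream operator here.
        }
      } finally {
        scan.close();  // runs whether next() returned false or threw
      }
    }
  }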

Protocol

The scanner works directly with two other interfaces:

The ScanOperatorEvents implementation provides the set of readers to use. That implementation can simply maintain a list of readers, or can create each reader on demand.

More subtly, the factory also handles projection issues and manages vectors across subsequent readers. A number of factories are available for the most common cases. Extend these to implement a version specific to a data source.
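As a concrete illustration, a list-backed factory might look like the sketch below. The bind()/nextReader()/close() shape of ScanOperatorEvents is inferred from the description above, and the import paths are assumptions; consult the actual interface for the exact contract.

  import java.util.Iterator;
  import java.util.List;
  import org.apache.drill.exec.ops.OperatorContext;
  import org.apache.drill.exec.physical.impl.scan.RowBatchReader;
  import org.apache.drill.exec.physical.impl.scan.ScanOperatorEvents;

  // Sketch only: real frameworks also handle projection and manage
  // vectors across readers, as noted above.
  class ListReaderFactory implements ScanOperatorEvents {
    private final Iterator<RowBatchReader> readers;

    ListReaderFactory(List<RowBatchReader> readers) {
      this.readers = readers.iterator();
    }

    @Override
    public void bind(OperatorContext context) { }  // nothing to bind here

    @Override
    public RowBatchReader nextReader() {
      // Returning null tells the scan operator that no readers remain.
      return readers.hasNext() ? readers.next() : null;
    }

    @Override
    public void close() { }  // each reader cleans up in its own close()
  }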

The RowBatchReader is a surprisingly minimal interface that nonetheless captures the essence of reading a result set as a set of batches. The factory implementations mentioned above implement this interface to provide commonly used services, the most important of which is access to a ResultSetLoader to write values into value vectors.
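The sketch below shows the shape of a reader's per-batch loop against the result set loader. The startBatch()/writer()/start()/save()/isFull() calls follow the result set loader framework but should be verified against the actual API (import paths vary across Drill versions); hasMoreRows() and readRow() are hypothetical stand-ins for source-specific logic.

  import org.apache.drill.exec.physical.resultSet.ResultSetLoader;
  import org.apache.drill.exec.physical.resultSet.RowSetLoader;

  // Skeleton of a reader's batch loop.
  abstract class ReaderSketch {
    protected abstract boolean hasMoreRows();             // source-specific
    protected abstract void readRow(RowSetLoader writer); // source-specific

    boolean nextBatch(ResultSetLoader loader) {
      loader.startBatch();
      RowSetLoader writer = loader.writer();
      // The loader enforces its batch size limit through isFull().
      while (!writer.isFull() && hasMoreRows()) {
        writer.start();    // begin a row
        readRow(writer);   // set column values through the column writers
        writer.save();     // commit the row
      }
      return hasMoreRows(); // false signals that this reader is exhausted
    }
  }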

Schema Versions

Readers may change schemas from time to time. To track such changes, this implementation maintains a batch schema version, updated by comparing each new schema with the previous one.

Readers can discover columns as they read data, such as with any JSON-based format. In this case, the row set mutator also provides a schema version, but a fine-grained one that changes each time a column is added.

The two schema versions serve different purposes and are not interchangeable. For example, if a scan reads two files, each builds up its own schema, increasing its internal version number as work proceeds. But, at the end of each batch, the two schemas may (and, in fact, should) be identical, in which case the batch-level schema version, the one downstream operators care about, stays the same.
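A sketch of the batch-level bookkeeping described above; the tracker class is hypothetical, and the assumption is that BatchSchema offers an isEquivalent() comparison:

  import org.apache.drill.exec.record.BatchSchema;

  // Hypothetical tracker: bump the batch-level version only when the
  // schema actually changes from one batch to the next.
  class SchemaVersionTracker {
    private BatchSchema prevSchema;
    private int version;

    int track(BatchSchema newSchema) {
      if (prevSchema == null || !newSchema.isEquivalent(prevSchema)) {
        version++;             // downstream operators see a schema change
        prevSchema = newSchema;
      }
      // Identical schemas from successive readers keep the same version,
      // even though each reader's loader advanced its fine-grained one.
      return version;
    }
  }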

Empty Files and/or Empty Schemas

A corner case occurs if the input is empty, such as a CSV file that contains no data. The general rules are as follows; a sketch of how a caller observes them appears after the list:
  • If the reader is "early schema" (the schema is defined at open time), then the result will be a single empty batch with the schema defined. Example: a CSV file without headers; in this case, we know the schema is always the single `columns` array.
  • If the reader is "late schema" (the schema is defined while the data is read), then no batch is returned because there is no schema. Example: a JSON file. It is not helpful to return a single batch with no columns; such a batch will simply conflict with some other non-empty-schema batch. It turns out that other DBs handle this case gracefully: a query of the form
    
     SELECT * FROM VALUES()

     will produce an empty result: no schema, no data.
  • The hybrid case: the reader is nominally "early schema", but the schema it can provide contains no columns. We treat this case identically to the late-schema case. Example: a CSV file with headers in which the header line is empty.
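How these rules surface to a caller through the buildSchema() contract documented below (the wrapper method here is hypothetical glue):

  import org.apache.drill.exec.physical.impl.protocol.OperatorExec;

  class EmptyInputSketch {
    static void checkSchema(OperatorExec scan) {
      if (scan.buildSchema()) {
        // Early-schema case: an empty, schema-only batch is available
        // through scan.batchAccessor().
      } else {
        // Late-schema (or hybrid) case: EOF arrived before any column
        // appeared, so there is no schema and no batch at all.
      }
    }
  }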
  • Constructor Details

    • ScanOperatorExec

      public ScanOperatorExec(ScanOperatorEvents factory, boolean allowEmptyResult)
  • Method Details

    • bind

      public void bind(OperatorContext context)
      Description copied from interface: OperatorExec
      Bind this operator to the context. The context provides access to per-operator, per-fragment and per-Drillbit services. Also provides access to the operator definition (AKA "pop config") for this operator.
      Specified by:
      bind in interface OperatorExec
      Parameters:
      context - operator context
    • batchAccessor

      public BatchAccessor batchAccessor()
      Description copied from interface: OperatorExec
      Provides a generic access mechanism to the batch's output data. This method is called after a successful return from OperatorExec.buildSchema() and OperatorExec.next(). The batch itself can be held in a standard VectorContainer, or in some other structure more convenient for this operator.
      Specified by:
      batchAccessor in interface OperatorExec
      Returns:
      the access for the batch's output container
    • context

      public OperatorContext context()
    • buildSchema

      public boolean buildSchema()
      Description copied from interface: OperatorExec
      Retrieves the schema of the batch before the first actual batch of data. The schema is returned via an empty batch (no rows, only schema) from OperatorExec.batchAccessor().
      Specified by:
      buildSchema in interface OperatorExec
      Returns:
      true if a schema is available, false if the operator reached EOF before a schema was found
    • next

      public boolean next()
      Description copied from interface: OperatorExec
      Retrieves the next batch of data. The data is returned via the OperatorExec.batchAccessor() method.
      Specified by:
      next in interface OperatorExec
      Returns:
      true if another batch of data is available, false if EOF was reached and no more data is available
    • cancel

      public void cancel()
      Description copied from interface: OperatorExec
      Alerts the operator that the query was cancelled. Generally optional, but allows the operator to realize that a cancellation was requested.
      Specified by:
      cancel in interface OperatorExec
    • close

      public void close()
      Description copied from interface: OperatorExec
      Close the operator by releasing all resources that the operator held. Called after OperatorExec.cancel() and after OperatorExec.batchAccessor() or OperatorExec.next() returns false.

      Note that there may be a significant delay between the last call to next() and the call to close() during which downstream operators do their work. A tidy operator will release resources immediately after EOF to avoid holding onto memory or other resources that could be used by downstream operators.

      Specified by:
      close in interface OperatorExec
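Putting the pieces together, a sketch of the full lifecycle using only the methods documented above; the driver class is hypothetical (in Drill, the operator record batch performs this sequence), and the import paths are assumptions:

  import org.apache.drill.exec.ops.OperatorContext;
  import org.apache.drill.exec.physical.impl.protocol.OperatorExec;

  class LifecycleSketch {
    void runScan(OperatorExec scan, OperatorContext context) {
      scan.bind(context);            // attach per-operator services
      try {
        if (scan.buildSchema()) {    // empty, schema-only batch first
          while (scan.next()) {      // then one data batch per call
            // Hand scan.batchAccessor() to the downstream operator.
          }
        }
        // On query cancellation, the caller invokes scan.cancel() here
        // before closing, per the close() contract above.
      } finally {
        scan.close();                // always runs, even on failure
      }
    }
  }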