Package org.apache.drill.exec.physical.impl.scan

Defines the scan operation implementation. The scan operator is a generic mechanism that fits into the Drill Volcano-based iterator protocol to return record batches from one or more readers.

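Two versions of the scan operator exist: the original version, ScanBatch, which manages record readers, and the revised version, ScanOperatorExec, which manages batch readers.

New code should use the revised version; existing code will continue to use the ScanBatch version until all readers are converted to the new format.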

Further, the new version is designed to allow intensive unit testing without the need for the Drill server. New readers should exploit this feature to include intensive tests to keep Drill quality high.

See ScanOperatorExec for details of the scan operator protocol and components.

Traditional Class Structure

The original design was simple, but required each reader to handle many detailed tasks.

  +------------+          +-----------+
  | Scan Batch |    +---> | ScanBatch |
  |  Creator   |    |     +-----------+
  +------------+    |           |
         |          |           |
         v          |           |
  +------------+    |           v
  |   Format   | ---+   +---------------+
  |   Plugin   | -----> | Record Reader |
  +------------+        +---------------+

The scan batch creator is unique to each storage plugin and is created based on the physical operator configuration ("pop config"). The scan batch creator delegates to the format plugin to create both the scan batch (the scan operator) and the set of readers which the scan batch will manage.

The scan batch provides a Mutator that creates the vectors used by the record readers. Schema continuity comes from reusing the Mutator from one file/block to the next.
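The idea of schema continuity through a shared Mutator can be illustrated with a minimal sketch. Everything below is hypothetical: SketchMutator, addField(String), isNewSchema() and readFile() are simplified stand-ins invented for illustration, not Drill's actual Mutator API, and plain Java lists stand in for value vectors.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch; none of these classes are part of Drill.
class MutatorReuseSketch {

    /** Much-simplified stand-in for the Mutator owned by the scan batch. */
    static class SketchMutator {
        // Column name -> "vector" (a plain list in this sketch).
        private final Map<String, List<Object>> vectors = new LinkedHashMap<>();
        private boolean schemaChanged;

        /** Returns the existing vector for a column, creating it only on first use. */
        List<Object> addField(String name) {
            List<Object> vector = vectors.get(name);
            if (vector == null) {
                vector = new ArrayList<>();
                vectors.put(name, vector);
                schemaChanged = true;   // a genuinely new column changes the schema
            }
            return vector;
        }

        /** Reports whether columns were added since the last call, then resets the flag. */
        boolean isNewSchema() {
            boolean result = schemaChanged;
            schemaChanged = false;
            return result;
        }
    }

    /** Simulates a reader for one file: requests its columns and writes one value each. */
    static void readFile(SketchMutator mutator, String file, String... columns) {
        for (String column : columns) {
            mutator.addField(column).add(file + "." + column);
        }
    }

    public static void main(String[] args) {
        SketchMutator mutator = new SketchMutator();   // one mutator for the whole scan

        readFile(mutator, "file1", "a", "b");
        System.out.println("new schema after file1: " + mutator.isNewSchema());  // true

        readFile(mutator, "file2", "a", "b");          // same columns, same vectors
        System.out.println("new schema after file2: " + mutator.isNewSchema());  // false
    }
}
```

Because the second reader asks the same mutator for the same columns, it receives the vectors created for the first reader and no schema change is reported.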

One characteristic of this system is that all the record readers are created up front. If we must read 1000 blocks, we'll create 1000 record readers. Developers must be very careful to allocate resources only when the reader is opened, and to release them when the reader is closed. Otherwise, resource bloat becomes a large problem.
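The discipline described above can be sketched as follows. This is an illustrative example only: BlockReader, open(), read() and the buffersHeld counter are hypothetical names, not Drill's record reader interface. The point is that the constructor stays cheap, so creating 1000 readers up front holds no buffers until each reader is opened.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; these classes are not part of Drill.
class UpFrontReadersSketch {

    /** Number of simulated buffers currently held by all readers. */
    static int buffersHeld = 0;

    /** Stand-in for a record reader over one block. */
    static class BlockReader implements AutoCloseable {
        private final int blockId;
        private byte[] buffer;          // allocated in open(), never in the constructor

        BlockReader(int blockId) {
            this.blockId = blockId;     // cheap: remember what to read, allocate nothing
        }

        void open() {
            buffer = new byte[1024];    // resources are acquired only when opened
            buffersHeld++;
        }

        int read() {
            // A real reader would fill vectors here; the sketch just returns the block id.
            return blockId;
        }

        @Override
        public void close() {
            if (buffer != null) {       // release on close; safe to call twice
                buffer = null;
                buffersHeld--;
            }
        }
    }

    /** Creates every reader up front, then opens, reads and closes them one by one. */
    static int scan(int blockCount) {
        List<BlockReader> readers = new ArrayList<>();
        for (int i = 0; i < blockCount; i++) {
            readers.add(new BlockReader(i));
        }
        int peak = 0;
        for (BlockReader reader : readers) {
            reader.open();
            peak = Math.max(peak, buffersHeld);
            reader.read();
            reader.close();
        }
        return peak;                    // highest number of buffers held at once
    }

    public static void main(String[] args) {
        System.out.println("Peak buffers held: " + scan(1000));  // prints 1
    }
}
```

If the buffer were allocated in the constructor instead, the same scan would hold 1000 buffers before reading the first block.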

Revised Class Structure

The new design is more complex because it divides tasks up into separate classes. The class structure is larger, but each class is smaller, more focused and does just one task.

  +------------+          +---------------+
  | Scan Batch | -------> | Format Plugin |
  |  Creator   |          +---------------+
  +------------+          /        |       \
                         /         |        \
    +---------------------+        |         \ +---------------+
    | OperatorRecordBatch |        |     +---->| ScanFramework |
    +---------------------+        |     |     +---------------+
                                   v     |            |
                         +------------------+         |
                         | ScanOperatorExec |         |
                         +------------------+         v
                                   |            +--------------+
                                   +----------> | Batch Reader |
                                                +--------------+

Here, the scan batch creator again delegates to the format plugin. The format plugin creates three objects: the OperatorRecordBatch, the ScanOperatorExec and the ScanFramework.

The overall structure uses the "composition" pattern: what is combined into a small set of classes in the traditional model is broken out into focused classes in the revised model.

A key part of the scan strategy is the batch reader. ("Batch" because it reads an entire batch at a time, using the result set loader.) The framework creates batch readers one by one as needed. Resource bloat is less of an issue because only one batch reader instance exists at any time for each scan operator instance.
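A rough sketch of the one-reader-at-a-time behavior follows. It assumes a hypothetical SketchBatchReader and a scan() loop fed by reader factories; these names are invented for illustration and are not the ScanFramework or batch reader interfaces themselves. Each reader is created only when the previous one has been closed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical sketch; these classes are not part of Drill.
class OneReaderAtATimeSketch {

    static int liveReaders = 0;        // readers created but not yet closed
    static int peakLiveReaders = 0;    // highest value liveReaders has reached

    /** Stand-in for a batch reader: returns whole batches, then null at end of data. */
    static class SketchBatchReader implements AutoCloseable {
        private final String fileName;
        private boolean done;

        SketchBatchReader(String fileName) {
            this.fileName = fileName;
            liveReaders++;
            peakLiveReaders = Math.max(peakLiveReaders, liveReaders);
        }

        /** Each file yields exactly one two-row batch in this sketch. */
        List<String> nextBatch() {
            if (done) {
                return null;
            }
            done = true;
            return Arrays.asList(fileName + ":row1", fileName + ":row2");
        }

        @Override
        public void close() {
            liveReaders--;
        }
    }

    /** Asks for the next reader only after the previous one has been closed. */
    static List<String> scan(Iterator<Supplier<SketchBatchReader>> factories) {
        List<String> rows = new ArrayList<>();
        while (factories.hasNext()) {
            try (SketchBatchReader reader = factories.next().get()) {
                List<String> batch;
                while ((batch = reader.nextBatch()) != null) {
                    rows.addAll(batch);
                }
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        List<Supplier<SketchBatchReader>> factories = new ArrayList<>();
        for (String name : Arrays.asList("a.csv", "b.csv", "c.csv")) {
            factories.add(() -> new SketchBatchReader(name));
        }
        List<String> rows = scan(factories.iterator());
        System.out.println(rows.size() + " rows, peak live readers = " + peakLiveReaders);
    }
}
```

Passing factories rather than reader instances is what keeps the peak at one: the list of work can be arbitrarily long without any reader existing until its turn comes.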

Each of the above is further broken down into additional classes to handle projection and so on.

Copyright © 1970 The Apache Software Foundation. All rights reserved.