org.apache.drill.exec.physical.impl.scan.framework.ManagedScanFramework

All Implemented Interfaces: ScanOperatorEvents

Direct Known Subclasses: FileScanFramework

public class ManagedScanFramework extends Object implements ScanOperatorEvents

Basic scan framework for a "managed" reader which uses the scan schema mechanisms encapsulated in the scan schema orchestrator. Handles binding scan events to the scan orchestrator so that the scan schema is evolved as the scan progresses. Readers are created and managed via a reader factory class unique to each type of scan. The reader factory also provides the scan-specific schema negotiator to be passed to the reader.

This framework is a bridge between operator logic and the scan projection internals. It gathers scan-specific options in a builder abstraction, then passes them on to the scan orchestrator at the right time. By abstracting out this plumbing, a scan batch creator simply chooses the proper framework builder, passes config options, and implements the matching "managed reader" and factory. All details of setup, projection, and so on are handled by the framework and the components that the framework builds upon.
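
The "gather options in a builder, apply them later" idea can be pictured with a small, self-contained sketch. Every type below (SketchBuilder, SketchFramework, SketchReaderFactory) is a hypothetical stand-in invented for illustration; none of them is a Drill class, and the real builder's method names are not listed on this page.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a reader factory; not Drill's ReaderFactory.
interface SketchReaderFactory {
    String describe();
}

// Hypothetical builder: it only gathers options, nothing is wired up yet.
class SketchBuilder {
    final List<String> projection = new ArrayList<>();
    SketchReaderFactory readerFactory;

    SketchBuilder project(String column) {
        projection.add(column);
        return this;
    }

    SketchBuilder readerFactory(SketchReaderFactory factory) {
        this.readerFactory = factory;
        return this;
    }
}

// Hypothetical framework: keeps the builder and applies its options only
// when bind() is called, i.e. "at the right time".
class SketchFramework {
    private final SketchBuilder builder;
    private List<String> boundProjection; // stands in for orchestrator state

    SketchFramework(SketchBuilder builder) {
        this.builder = builder;
    }

    void bind() {
        if (builder.readerFactory == null) {
            throw new IllegalStateException("reader factory is required");
        }
        boundProjection = new ArrayList<>(builder.projection);
    }

    boolean isBound() {
        return boundProjection != null;
    }

    List<String> projection() {
        return boundProjection;
    }
}

class BuilderSketchDemo {
    // Plays the role of a scan batch creator: pick a builder, set options.
    static SketchFramework build() {
        SketchBuilder builder = new SketchBuilder()
            .project("a")
            .project("c")
            .readerFactory(() -> "csv");
        return new SketchFramework(builder);
    }

    public static void main(String[] args) {
        SketchFramework framework = build();
        System.out.println("bound before bind(): " + framework.isBound());
        framework.bind();
        System.out.println("projection after bind(): " + framework.projection());
    }
}
```

In this sketch the caller never touches the projection state directly; it only configures the builder, which is the division of labor the paragraph above describes.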

Inputs

At this basic level, a scan framework requires just a few simple inputs:

- The options defined by the scan projection framework such as the projection list.
- A reader factory to create a reader for each of the files or blocks to be scanned. (Readers are expected to be created one-by-one as files are read.)
- The operator context which provides access to a memory allocator and other plumbing items.

Orchestration

The above is sufficient to drive the entire scan operator functionality. Projection is done generically and is the same for all files. Only the reader (created via the factory class) differs from one type of file to another.
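
As a rough illustration of a factory that creates readers one at a time, consider the following sketch. SketchReader and LazyReaderFactory are hypothetical names, not Drill types, and returning null to mark the end of the input is an assumption of the sketch rather than something this page specifies.

```java
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical reader; a real reader would open a file and decode rows.
class SketchReader {
    final String source;

    SketchReader(String source) {
        this.source = source;
    }
}

// Hypothetical factory that creates one reader per call, on demand.
class LazyReaderFactory {
    private final Iterator<String> sources;
    private int created; // number of readers created so far

    LazyReaderFactory(List<String> sources) {
        this.sources = sources.iterator();
    }

    // Returns the next reader, or null when all sources are consumed.
    // Using null as the end marker is an assumption of this sketch.
    SketchReader next() {
        if (!sources.hasNext()) {
            return null;
        }
        created++;
        return new SketchReader(sources.next());
    }

    int createdCount() {
        return created;
    }
}

class LazyReaderDemo {
    public static void main(String[] args) {
        LazyReaderFactory factory =
            new LazyReaderFactory(Arrays.asList("f1.csv", "f2.csv", "f3.csv"));
        SketchReader reader;
        while ((reader = factory.next()) != null) {
            System.out.println("reading " + reader.source
                + ", readers created so far: " + factory.createdCount());
        }
    }
}
```

Because a reader only exists once it is requested, a scan over many files never holds more than the current reader's resources.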

The framework achieves the work described below by composing a large set of detailed classes, each of which performs some specific task. This structure leaves the reader to simply infer schema and read data.

In particular, rather than do all the orchestration here (which would tie that logic to the scan operation), the detailed work is delegated to the ScanSchemaOrchestrator class, with this class as a "shim" between the scan events API and the schema orchestrator implementation.

Reader Integration

The details of how a file is structured, how a schema is inferred, and how data is decoded are all encapsulated in the reader. The only real interaction between the reader and the framework is:

- The reader factory creates a reader and the corresponding schema negotiator.
- The reader "negotiates" a schema with the framework. The framework knows the projection list from the query plan, knows something about data types (whether a column should be scalar, a map or an array), and knows about the schema already defined by prior readers. The reader knows what schema it can produce (if it is "early schema"). The schema negotiator class handles this task.
- The reader reads data from the file and populates value vectors a batch at a time. The framework creates the result set loader to use for this work. The schema negotiator returns that loader to the reader, which uses it during read.
  
  It is important to note that the result set loader also defines a schema: the schema requested by the reader. If the reader wants to read three columns, a, b, and c, then that is the schema that the result set loader supports. This is true even if the query plan only wants column a, or wants columns c, a. The framework handles the projection task so the reader does not have to worry about it. Reading an unwanted column is low cost: the result set loader will have provided a "dummy" column writer that simply discards the value. This is just as fast as having the reader use if-statements or a table to determine which columns to save.
  
  A reader may be "late schema", true "schema on read." In this case, the reader simply tells the result set loader to create a new column writer on the fly. The framework will work out if that new column is to be projected and will return either a real column writer (projected column) or a dummy column writer (unprojected column).
- The reader then reads batches of data until all data is read. The result set loader signals when a batch is full; the reader should not worry about this detail itself.
- The reader then releases its resources.
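
The projection and batch-size behavior described in the steps above can be sketched in a few lines. LoaderSketch and ColWriter are hypothetical, heavily simplified stand-ins for the result set loader and its column writers; they are not Drill's interfaces, and the row-count batch limit is an assumption made to keep the example small.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical column writer; not Drill's column writer interface.
interface ColWriter {
    void write(Object value);
}

// Hypothetical, much-simplified stand-in for a result set loader.
class LoaderSketch {
    private final Set<String> projected;
    private final int batchLimit;
    final List<String> saved = new ArrayList<>(); // values kept for projected columns
    private int rowsInBatch;

    LoaderSketch(Set<String> projected, int batchLimit) {
        this.projected = projected;
        this.batchLimit = batchLimit;
    }

    // The reader may add a column at any time (early or late schema).
    // Projected columns get a real writer; others get a dummy that discards.
    ColWriter addColumn(String name) {
        if (projected.contains(name)) {
            return value -> saved.add(name + "=" + value);
        }
        return value -> { };
    }

    // The loader, not the reader, decides when the batch is full.
    boolean isFull() {
        return rowsInBatch >= batchLimit;
    }

    void endRow() {
        rowsInBatch++;
    }
}

class ProjectionSketchDemo {
    // Writes rows of (a, b, c) until input runs out or the batch is full.
    // The reader writes every column and never checks the projection list.
    static int readBatch(LoaderSketch loader, int[][] rows) {
        ColWriter a = loader.addColumn("a");
        ColWriter b = loader.addColumn("b");
        ColWriter c = loader.addColumn("c");
        int read = 0;
        while (read < rows.length && !loader.isFull()) {
            int[] row = rows[read++];
            a.write(row[0]);
            b.write(row[1]);
            c.write(row[2]);
            loader.endRow();
        }
        return read;
    }

    public static void main(String[] args) {
        Set<String> projected = new HashSet<>();
        projected.add("a"); // the query only wants column a
        LoaderSketch loader = new LoaderSketch(projected, 2);
        int[][] rows = { {1, 2, 3}, {4, 5, 6}, {7, 8, 9} };
        int read = readBatch(loader, rows);
        System.out.println(read + " rows read, saved: " + loader.saved);
    }
}
```

The reader code contains no projection logic at all: the choice between a real and a dummy writer is made once, when the column is added, which is why discarding unprojected values costs no per-row branching in the reader.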

Nested Class Summary

Nested Classes

| Modifier and Type | Class | Description |
| --- | --- | --- |
| static interface | ManagedScanFramework.ReaderFactory | Creates a batch reader on demand. |
| static class | ManagedScanFramework.ScanFrameworkBuilder | |

Field Summary

Fields

| Modifier and Type | Field | Description |
| --- | --- | --- |
| protected final ManagedScanFramework.ScanFrameworkBuilder | builder | |
| protected OperatorContext | context | |
| protected final ManagedScanFramework.ReaderFactory | readerFactory | |
| protected ScanSchemaOrchestrator | scanOrchestrator | |

Constructor Summary

Constructors

| Constructor | Description |
| --- | --- |
| ManagedScanFramework(ManagedScanFramework.ScanFrameworkBuilder builder) | |

Method Summary

| Modifier and Type | Method | Description |
| --- | --- | --- |
| void | bind(OperatorContext context) | Build the scan-level schema from the physical operator select list. |
| void | close() | Called when the scan operator itself is closed. |
| protected void | configure() | |
| OperatorContext | context() | |
| CustomErrorContext | errorContext() | |
| protected SchemaNegotiatorImpl | newNegotiator() | |
| RowBatchReader | nextReader() | A scanner typically reads multiple data sources (such as files or file blocks). A batch reader handles each read. |
| boolean | open(ShimBatchReader shimBatchReader) | |
| TupleMetadata | outputSchema() | |
| ScanSchemaOrchestrator | scanOrchestrator() | |

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait

Field Details
- builder
  
  protected final ManagedScanFramework.ScanFrameworkBuilder builder
- readerFactory
  
  protected final ManagedScanFramework.ReaderFactory readerFactory
- context
  
  protected OperatorContext context
- scanOrchestrator
  
  protected ScanSchemaOrchestrator scanOrchestrator

Constructor Details
- ManagedScanFramework
  
  public ManagedScanFramework(ManagedScanFramework.ScanFrameworkBuilder builder)

Method Details
- bind
  
  public void bind(OperatorContext context)
  
  Description copied from interface: ScanOperatorEvents
  
  Build the scan-level schema from the physical operator select list. The operator context is provided to allow access to the user name, to options, and to other information that might influence schema resolution.
  After this call, the schema manager should be ready to build a reader-specific schema for each reader as it is opened.
  
  Specified by:
  
  bind in interface ScanOperatorEvents
  
  Parameters:
  
  context - the operator context for the scan operator
- context
  
  public OperatorContext context()
- scanOrchestrator
  
  public ScanSchemaOrchestrator scanOrchestrator()
- outputSchema
  
  public TupleMetadata outputSchema()
- errorContext
  
  public CustomErrorContext errorContext()
- configure
  
  protected void configure()
- nextReader
  
  public RowBatchReader nextReader()
  
  Description copied from interface: ScanOperatorEvents
  
  A scanner typically reads multiple data sources (such as files or file blocks). A batch reader handles each read. This method returns the next reader in whatever sequence that this scan defines.
  The preferred implementation is to create each batch reader in this call to minimize resource usage. Production queries may read thousands of files or blocks, so incremental reader creation can be far more efficient than creating readers at the start of the scan.
  
  Specified by:
  
  nextReader in interface ScanOperatorEvents
  
  Returns:
  
  a batch reader for one of the scan elements within the scan physical plan for this scan operator
- newNegotiator
  
  protected SchemaNegotiatorImpl newNegotiator()
- open
  
  public boolean open(ShimBatchReader shimBatchReader)
- close
  
  public void close()
  
  Description copied from interface: ScanOperatorEvents
  
  Called when the scan operator itself is closed. Indicates that no more readers are available.
  
  Specified by:
  
  close in interface ScanOperatorEvents
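
The order in which the three events documented above arrive (bind, repeated nextReader, then close) can be sketched as a small driver loop. ScanEventsSketch, RecordingEvents, and ScanDriverSketch are hypothetical names used only for illustration; the string-typed context and reader, and the use of null to signal that no readers remain, are assumptions of this sketch and not the signatures of the Drill interface.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical mirror of the three events; not the Drill interface.
interface ScanEventsSketch {
    void bind(String context);
    String nextReader(); // a reader name; null means "no more" (assumption)
    void close();
}

// Records the order in which events arrive.
class RecordingEvents implements ScanEventsSketch {
    final List<String> log = new ArrayList<>();
    private final String[] readers;
    private int index;

    RecordingEvents(String... readers) {
        this.readers = readers;
    }

    @Override
    public void bind(String context) {
        log.add("bind:" + context);
    }

    @Override
    public String nextReader() {
        if (index >= readers.length) {
            return null;
        }
        String reader = readers[index++];
        log.add("reader:" + reader);
        return reader;
    }

    @Override
    public void close() {
        log.add("close");
    }
}

class ScanDriverSketch {
    // Drives one scan: bind once, pull readers one at a time, then close.
    static void run(ScanEventsSketch events, String context) {
        events.bind(context);
        try {
            while (events.nextReader() != null) {
                // A real operator would open the reader and read its batches here.
            }
        } finally {
            events.close(); // always signal the end of the scan
        }
    }

    public static void main(String[] args) {
        RecordingEvents events = new RecordingEvents("block-0", "block-1");
        run(events, "scan-op");
        System.out.println(events.log);
    }
}
```

Placing close() in a finally block is a design choice of the sketch: it keeps the "no more readers" signal reliable even if reading a batch throws.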
