Interface RowBatchReader

All Known Implementing Classes:
ReaderLifecycle, ShimBatchReader

public interface RowBatchReader
Extended version of a record reader used by the revised scan batch operator. Use this for all new readers. Replaces the original RecordReader interface.

Classes that extend from this interface must handle all aspects of reading data, creating vectors, handling projection and so on. That is, extensions of this class are intended to be frameworks.

For most cases, a plugin probably wants to start from a base implementation, such as the ManagedReader class, which provides services for handling projection, setting up the result set loader, handling schema smoothing, sharing vectors across batches, etc.

Note that this interface reads a batch of rows, not a single row. (The original RecordReader could be confusing in this aspect.)

The expected lifecycle is:

  • Construction. Allocate no resources.
  • open(): Allocate resources and set up the schema (if early schema.)
  • output() and schemaVersion() to determine the initial version. If no schema is available on open (the reader is late-schema), return null for the output and -1 for the schema version. Else, return non-negative for the version number and an empty batch with a schema.
  • next()} to retrieve the next record batch. Return true if a batch is available, false if EOF. There is no requirement to return a batch; the first call to next() can return false if no data is available.}
  • output() and schemaVersion() to obtain the batch of records read, and to detect if the version of the schema is different from the previous batch.
  • close() when the reader is no longer needed. This may occur before next() returns false if an error occurs or a limit is reached.
Although the reader should not care, the scanner must return an empty batch downstream to provide a "fast schema" for other operators. The scan operator handles this transparently:
  • If the reader is early-schema, then the reader itself returns an empty batch (via {@code output()</tt), with only schema after the call to {@code open()}. The scanner sends this empty batch downstream directly.</li> <li>If the reader is late-schema, then the reader will read the first batch from the reader. It will save that batch, extract the schema, and send an "artificial" batch downstream to satisfy the "fast schema" protocol.</li> </ul> As a result, there is little reason for the reader to worry about "fast schema" other than providing an early schema, if available, else don't worry about it. <p> If an error occurs, the reader can throw a {@link RuntimeException} from any method. A {@link org.apache.drill.common.exceptions.UserException} is preferred to provide detailed information about the source of the problem.
  • Method Summary

    Modifier and Type
    Method
    Description
    void
    Release resources.
    boolean
    Called for the first reader within a scan.
    Name used when reporting errors.
    boolean
    Read the next batch.
    boolean
    Setup the record reader.
    Return the container with the reader's output.
    int
    Return the version of the schema returned by output().
  • Method Details

    • name

      String name()
      Name used when reporting errors. Can simply be the class name.
      Returns:
      display name for errors
    • open

      boolean open()
      Setup the record reader. Called just before the first call to next(). Allocate resources here, not in the constructor. Example: open files, allocate buffers, etc.
      Returns:
      true if the reader is open and ready to read (possibly no) rows. false for a "soft" failure in which no schema or data is available, but the scanner should not fail, it should move onto another reader
      Throws:
      RuntimeException - for "hard" errors that should terminate the query. UserException preferred to explain the problem better than the scan operator can by guessing at the cause
    • defineSchema

      boolean defineSchema()
      Called for the first reader within a scan. Allows the reader to provide an empty batch with only the schema filled in. Readers that are "early schema" (know the schema up front) should return true and create an empty batch. Readers that are "late schema" should return false. In that case, the scan operator will ask the reader to load an actual data batch, and infer the schema from that batch.

      This step is optional and is purely for performance.

      Returns:
      true if this reader can (and has) defined an empty batch to describe the schema, false otherwise
    • next

      boolean next()
      Read the next batch. Reading continues until either EOF, or until the mutator indicates that the batch is full. The batch is considered valid if it is non-empty. Returning true with an empty batch is valid, and is helpful on the very first batch (returning schema only.) An empty batch with a false return code indicates EOF and the batch will be discarded. A non-empty batch along with a false return result indicates a final, valid batch, but that EOF was reached and no more data is available.

      This somewhat complex protocol avoids the need to allocate a final batch just to find out that no more data is available; it allows EOF to be returned along with the final batch.

      Returns:
      true if more data may be available (and so next() should be called again, false to indicate that EOF was reached
      Throws:
      RuntimeException - (UserException preferred) if an error occurs that should fail the query.
    • output

      VectorContainer output()
      Return the container with the reader's output. This method is called at two times:
      • Directly after the call to open(). If the data source can provide a schema at open time, then the reader should provide an empty batch with the schema set. The scanner will return this schema downstream to inform other operators of the schema.
      • Directly after a successful call to next() to retrieve the batch produced by that call. (No call is made if next() returns false.
      Note that most operators require the same vectors be present in each container. So, in practice, a reader must return the same container, and same set of vectors, on each call.
      Returns:
      a vector container, with the record count and schema set, that announces the schema after open() (optional) or returns rows read after next() (required)
    • schemaVersion

      int schemaVersion()
      Return the version of the schema returned by output(). The schema is assumed to start at -1 (no schema). The reader is free to use any numbering system it likes as long as:
      • The numbers are non-negative, and increase (by any increment),
      • Numbers between successive calls are idential if the batch schemas are identical,
      • The number increases if a batch has a different schema than the previous batch.
      Numbers increment (or not) on calls to next(). Thus Two successive calls to this method should return the same number if no next() call lies between.

      If the reader can return a schema on open (so-called "early-schema), then this method must return a non-negative version number, even if the schema happens to be empty (such as reading an empty file.)

      However, if the reader cannot return a schema on open (so-called "late schema"), then this method must return -1 (and output() must return null) to indicate now schema is available when called before the first call to next().

      No calls will be made to this method before open() after {@code close(){@code or after {@code next()} returns false. The implementation is thus not required to handle these cases. @return the schema version, or -1 if no schema version is yet available

    • close

      void close()
      Release resources. Called just after a failure, when the scanner is cancelled, or after next() returns EOF. Release all resources and close files. Guaranteed to be called if open() returns normally; will not be called if open() throws an exception.
      Throws:
      RutimeException - (UserException preferred) if an error occurs that should fail the query.