Interface | Description |
---|---|
ArrayReader | Generic array reader. |
ArrayWriter | Writer for values into an array. |
ColumnReader | Base interface for all column readers, defining a generic set of methods that all readers provide. |
ColumnReaderIndex | The reader structure is heavily recursive. |
ColumnWriter | Generic information about a column writer, including metadata and the write position information needed by a vector overflow implementation. |
ColumnWriterIndex | A Drill record batch consists of a variety of vectors, including maps and lists. |
DictReader | |
DictWriter | Physically the writer is an array writer with a special tuple writer as its element. |
KeyAccessor | |
ObjectReader | Defines a reader to get values from value vectors using a simple, uniform interface modeled after a JSON object. |
ObjectWriter | Represents a column within a tuple. |
ScalarReader | Defines a reader to obtain values from value vectors using a simple, uniform interface. |
ScalarWriter | Represents a scalar value: a required column, a nullable column, or one element within an array of scalars. |
SqlAccessor | Column-data accessor that implements JDBC's Java-null-when-SQL-NULL mapping. |
TupleReader | Interface for reading from tuples (rows or maps). |
TupleWriter | Writer for a tuple. |
ValueWriter | Writer for a scalar value. |
VariantReader | Reader for a Drill "union vector." The union vector is presented as a reader over a set of variants. |
VariantWriter | Writer for a Drill "union vector." The union vector is presented as a writer over a set of variants. |
VariantWriter.VariantWriterListener | |
WriterPosition | Position information about a writer used during vector overflow. |
Enum | Description |
---|---|
ObjectType | Type of writer. |
ValueType | Represents the primitive types supported to read and write data from value vectors. |
Exception | Description |
---|---|
InvalidAccessException | |
InvalidConversionError | Raised when a conversion from one type to another is supported at setup time, but a value provided at runtime is not valid for that conversion. |
TupleWriter.UndefinedColumnException | Unchecked exception thrown when attempting to access a column writer by name for an undefined column. |
UnsupportedConversionError | Raised when a column accessor reads or writes the value using the wrong Java type (which may indicate a data inconsistency in the input data). |
row : tuple
tuple : (name column) *
column : scalar obj | array obj | tuple obj
scalar obj : scalar accessor
array obj : array accessor
array accessor : element accessor
tuple obj : tuple
As seen above, the accessor tree starts with a tuple (a row in the form of a class provided by the consumer). Each column in the tuple is represented by an object accessor. That object accessor contains a scalar, tuple, or array accessor. This models Drill's JSON structure: a row can have a list of lists of tuples that contain lists of ints, say.
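As a concrete sketch of walking such a tree, the fragment below recursively visits a column accessor. The method names (type(), scalar(), tuple(), array(), next(), entry(), getObject(), columnCount(), column()) are assumed from the structure described above rather than quoted from a specific Drill release.

```java
// Sketch only: method names are assumed, not authoritative.
// Recursively prints a column, following the grammar above: an object
// accessor wraps either a scalar, a tuple (map), or an array.
void printColumn(String label, ObjectReader col) {
  switch (col.type()) {
    case SCALAR:                              // leaf value
      System.out.println(label + " = " + col.scalar().getObject());
      break;
    case TUPLE: {                             // map: recurse into member columns
      TupleReader map = col.tuple();
      for (int i = 0; i < map.columnCount(); i++) {
        printColumn(label + "." + i, map.column(i));
      }
      break;
    }
    case ARRAY: {                             // array: recurse into each element
      ArrayReader array = col.array();
      int posn = 0;
      while (array.next()) {
        printColumn(label + "[" + posn++ + "]", array.entry());
      }
      break;
    }
    default:                                  // variants handled separately
      break;
  }
}
```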
ScalarReader and ColumnWriter are the core abstractions: they provide simplified access to the myriad of Drill column types via a simple, uniform API. TupleReader and TupleWriter provide a simplified API to rows and maps (both of which are tuples in Drill). AccessorUtilities provides a number of data conversion tools.
Both the column reader and writer use a reduced set of data types to access values. Drill provides about 38 different types, but they can be mapped to a smaller set for programmatic access. For example, the signed byte, short, and int types, along with the unsigned 8-bit and 16-bit values, can all be mapped to int for get/set. The result is a much simpler set of get/set methods compared to the underlying set of vector types.
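For instance, a single setInt() call can serve several underlying vector widths; each vector-specific writer narrows the value as needed. The column names and the rowWriter tuple writer below are assumed for illustration.

```java
// Illustrative sketch: one setInt() call serves several integer-like vectors.
// "rowWriter" is an assumed TupleWriter for the row being written.
ScalarWriter tinyCol  = rowWriter.scalar("tiny_col");   // backed by a TINYINT vector
ScalarWriter smallCol = rowWriter.scalar("small_col");  // backed by a SMALLINT vector
ScalarWriter intCol   = rowWriter.scalar("int_col");    // backed by an INT vector

tinyCol.setInt(42);       // narrowed to one byte by the TINYINT writer
smallCol.setInt(1_000);   // narrowed to two bytes by the SMALLINT writer
intCol.setInt(100_000);   // stored as-is by the INT writer
```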
Different implementations of the row index handle the case of no selection vector, a selection vector 2, or a selection vector 4.
You can think of the (row index + vector accessor, column index) as forming a coordinate pair. The row index provides the y coordinate (vertical position along the rows). The vector accessor maps the row position to a vector when needed. The column index picks out the x coordinate (horizontal position along the columns).
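A minimal sketch of that indirection, using hypothetical classes rather than Drill's actual row-index implementations:

```java
// Hypothetical illustration of the y coordinate: a row index maps a logical
// row position to a physical position within the vectors.
interface RowIndex {
  int vectorIndex(int logicalRow);
}

// No selection vector: logical and physical positions are the same.
class DirectRowIndex implements RowIndex {
  @Override public int vectorIndex(int logicalRow) { return logicalRow; }
}

// Selection vector 2: the logical row is redirected through the SV2.
class Sv2RowIndex implements RowIndex {
  private final int[] sv2;                   // stand-in for a selection vector 2
  Sv2RowIndex(int[] sv2) { this.sv2 = sv2; }
  @Override public int vectorIndex(int logicalRow) { return sv2[logicalRow]; }
}
```

The column index (the x coordinate) is then simply the column's position within the tuple; an SV4-style index would additionally select the target batch through the vector accessor.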
Drill is unusual among query and DB engines in that it does not normally identify columns by position (index). The reason is easy to understand. Suppose two files contain columns a and b. File 1, read by minor fragment 0, contains the columns in the order (a, b). But, file 2, read by minor fragment 1, contains the columns in the order (b, a). Drill considers this the same schema. Since column order can vary, Drill has avoided depending on column order. (But, only partly; many bugs have cropped up because some parts of the code do require a common ordering.)
Here we observe that column order varies only across fragments. We have control of the column order within our own fragment. (We can coerce varying order into a desired order: if the above two files are read by the same scan operator, then the first file sets the order at (a, b), and the second file's (b, a) order can be coerced into the (a, b) order.)
Given this insight, the readers and writers here promote position to a first-class concept. Code can access columns by name (for convenience, especially in testing) or by position (for efficiency).
Further, it is often convenient to fetch a column accessor (reader or writer) once, then cache it. The design here ensures that such caching works well. The goal is that, eventually, operators will code-generate references to cached readers and writers instead of generating code that works directly with the vectors.
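For instance, a client can look up writers once, by position, outside the per-row loop and reuse them for every row. The rowWriter, column positions, and input iteration below are assumed for illustration; only the caching pattern matters.

```java
// Fetch the writers once (by position for efficiency), then reuse them per row.
// "rowWriter", InputRow, and inputRows are hypothetical stand-ins.
ScalarWriter aWriter = rowWriter.scalar(0);   // column "a" at position 0
ScalarWriter bWriter = rowWriter.scalar(1);   // column "b" at position 1

for (InputRow row : inputRows) {
  aWriter.setInt(row.a);
  bWriter.setString(row.b);
}
```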
Drill's other types have a more-or-less simple mapping to the relational model, allowing simple reader and writer interfaces. But, the Union and List types are not a good fit and cause a very large amount of complexity in the reader and writer models.
A Union is just that: it is a container for a variety of typed vectors. It is like a "union" in C: it has members for each type, but only one type is in use at any one time. However, unlike C, the implementation is more like a C "struct": every vector takes space for every row, even if no value is stored in that row. That is, a Drill union is as if a naive C programmer used a "struct" when s/he should have used a union.
Unions are designed to evolve dynamically as data is read. Suppose we read the following JSON:
{a: 10} {a: "foo"} {a: null} {a: 12.34}
Here, we discover the need for an Int type, then a Varchar, then mark a value as null, and finally need a Float. The union adds the desired types as we request them. The writer mimics this behavior, using a listener to do the needed vector work.
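A writer-side sketch of that sequence, assuming a VariantWriter-style API in which requesting a scalar writer for a new type adds the corresponding member vector; the method names are illustrative and the per-row transitions are elided.

```java
// Illustrative only: mirrors the JSON rows above, one value per incoming row.
// Requesting a type the union does not yet have is assumed to add it via the
// writer's listener.
variantWriter.scalar(MinorType.INT).setInt(10);             // {a: 10}    adds an Int member
variantWriter.scalar(MinorType.VARCHAR).setString("foo");   // {a: "foo"} adds a Varchar member
variantWriter.setNull();                                    // {a: null}  whole union is null
variantWriter.scalar(MinorType.FLOAT8).setDouble(12.34);    // {a: 12.34} adds a Float member
```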
Further, a union can be null. It carries a types vector that indicates the type of each row. A zero value indicates that the union as a whole is null. In this case, null means no value; it is not, say, a null Int or null Varchar: it is simply null (as in JSON). Since at most one vector within the union carries a value, the element vectors must also be nullable. This means that a union has two null bits: one for the union, the other for the selected type. It is not clear what the Drill semantics are supposed to be. Here the writers assume that either the whole union is null, or that exactly one member is non-null. Readers are more paranoid: they assume each member is null if either the union is null or the member itself is null. (Yes, a bit of a mess...)
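The reader-side rule amounts to a two-level check, sketched here with assumed accessor names:

```java
// Sketch: a union's value is treated as null when either the types vector
// marks the whole union null, or the selected member's own null bit is set.
boolean isEffectivelyNull(VariantReader union) {
  return union.isNull()                              // union-level null bit
      || union.scalar(union.dataType()).isNull();    // member-level null bit
}
```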
The current union vector format is highly inefficient. If the union concept is needed, then it should be redesigned, perhaps as a variable-width vector in which each entry consists of a type/value pair. (For variable-width values such as strings, the implementation would be a triple of (type, length, value).) The API here is designed to abstract away the implementation and should work equally well for the current "union" implementation and the possible "variant" implementation. As a result, when changing the API, avoid introducing methods that assume a particular implementation.
Lists add another layer of complexity. A list is, logically, a repeated union. But, for whatever historical reasons, a List can be other things as well. First, it can have no type at all: a list of nothing. This likely has no meaning, but the semantics of the List vector allow it. Second, the List can be an array of a single type in which each entry can be null. (Normal Repeated types can have an empty array for a row, but cannot have a null entry. Lists can have either an empty array or a null array in order to model the JSON null and [] cases.)
When a List has a single type, it stores the backing vector directly within the List. But, a list can also be a list of unions. In this case, the List stores a union vector as its backing vector. Here, we now have three ways to indicate null: the List's bits vector, the type vector in the union, and the bits vector in each element vector. Again, the writer assumes that if the List vector is null, the entire value for that row is null. The reader is again paranoid and handles all three nullable states. (Again, a huge mess.)
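For a list of unions, the same paranoid check grows to three levels, sketched below with assumed accessor names:

```java
// Sketch: an element of a list of unions is null if any of the three
// indicators described above says so.
boolean isElementNull(ArrayReader list, VariantReader element) {
  return list.isNull()                                 // List's bits vector
      || element.isNull()                              // union's types vector
      || element.scalar(element.dataType()).isNull();  // member's bits vector
}
```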
The readers can provide a nice API for these cases since we know the List format up front. They can present the list as either a nullable array of a single type, or as an array of unions.
Writers have more of a challenge. If we knew that a List was being used as a list of, say, Nullable Int, we could present the List as an array writer with Int elements. But, the List allows dynamic type addition, as with unions. (In the case of the list, it has internal special handling for the single vs. many type case.)
To isolate the client from the list representation, it is simpler to always present a List as an array of variants. But, this is awkward in the single-type case. The solution is to use metadata. If the metadata promises to use only a single type, the writer can use the nullable array of X format. If the metadata says to use a union (the default), then the List is presented as an array of unions, even when the list has 0 or 1 member types. (The complexity here is excessive: Drill should really redesign this feature to make it simpler and to better fit the relational model.)
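The decision itself is simple, as the hypothetical sketch below shows; the enum and the boolean test are illustrative stand-ins, not Drill's actual metadata API.

```java
// Hypothetical: choose how a List column is presented to the client,
// based on what the column metadata promises.
enum ListPresentation {
  NULLABLE_ARRAY_OF_SINGLE_TYPE,   // e.g. a nullable array of Int
  ARRAY_OF_VARIANTS                // the default, even for 0 or 1 member types
}

ListPresentation choosePresentation(boolean metadataPromisesSingleType) {
  return metadataPromisesSingleType
      ? ListPresentation.NULLABLE_ARRAY_OF_SINGLE_TYPE
      : ListPresentation.ARRAY_OF_VARIANTS;
}
```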
The Apache Arrow project created a refined version of the vector classes. Much talk has occurred about ripping out Drill's implementation to use Arrow instead.
However, even Arrow has limits:
Therefore, a goal of this reader/writer layer is to isolate the operators from vector implementation. For this to work, the accessors must be at least as efficient as direct vector access. (They are now more efficient.)
Once all operators use this layer, a switch to Arrow, or an evolution toward Value Vectors 2.0 will be much easier. Simply change the vector format and update the reader and writer implementations. The rest of the code will remain unchanged. (Note, to achieve this goal, it is important to carefully design the accessor API [interfaces] to hide implementation details.)
This layer handles all that work, providing a simple API that encourages more custom readers because the work to create the readers becomes far simpler. (Other layers tackle other parts of the problem as well.)