Package org.apache.drill.exec.vector.accessor.writer

Implementation of the vector writers.

Package org.apache.drill.exec.vector.accessor.writer Description

Implementation of the vector writers. The code will make much more sense if we start with a review of Drill’s complex vector data model. Drill has 38+ data ("minor") types and three cardinalities ("modes"), which together yield more than 120 different vector types. Add maps, repeated maps, lists and repeated lists, and you rapidly get an explosion of types that the writer code must handle.

Understanding the Vector Model

Vectors can be categorized along multiple dimensions:

A repeated map, a list, a repeated list and any array (repeated) scalar are all array-like. Nullable and required modes both hold a single value per row; a nullable vector additionally carries an is-set ("bit") vector.
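The difference between required and nullable modes can be sketched with a toy model. The class names below are hypothetical and are not Drill's vector classes; the use of `java.util.BitSet` for the is-set flags is also an assumption made for brevity, not a statement about how Drill stores them.

```java
import java.util.BitSet;

/** Toy model only: these are NOT Drill's vector classes. */
final class ToyIntVectors {

    /** "Required" mode: one value slot per row, no null tracking. */
    static final class RequiredIntVector {
        private final int[] values;
        RequiredIntVector(int rowCount) { values = new int[rowCount]; }
        void set(int row, int value) { values[row] = value; }
        int get(int row) { return values[row]; }
    }

    /** "Nullable" mode: the same value storage plus an is-set ("bit") vector. */
    static final class NullableIntVector {
        private final int[] values;
        private final BitSet isSet = new BitSet();  // assumption: a BitSet stands in for the bits vector
        NullableIntVector(int rowCount) { values = new int[rowCount]; }

        void set(int row, int value) {
            values[row] = value;
            isSet.set(row);  // mark the row as non-null
        }

        boolean isNull(int row) { return !isSet.get(row); }

        int get(int row) {
            if (isNull(row)) {
                throw new IllegalStateException("row " + row + " is null");
            }
            return values[row];
        }
    }
}
```

Rows that are never written stay null in the nullable version, while the required version has no way to express a missing value.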

The writers (and readers) borrow concepts from JSON and relational theory to simplify the problem:

Repeat Levels

JSON and Parquet can be understood as a series of one or more "repeat levels." First, let's identify the repeat levels above the batch level:

Then, within a batch:
  • Scalar arrays introduce a repeat level: each row has 0, 1 or many elements in the array-valued column. An offset vector indexes to the first value for each row. Each scalar array has its own per-array index to point to the next write position.
  • Map arrays introduce a repeat level for a group of columns (those that make up the map). A single offset vector points to the common start position for the columns. A common index points to the common next write position.
  • Lists also introduce a repeat level. (Details to be worked out.)

For repeated vectors, one can think of the structure either top-down or bottom-up:
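As a rough illustration of the scalar-array case, the sketch below pairs an offset vector with a flat data vector and a per-array write index. `ToyRepeatedIntColumn` is a hypothetical class, not part of Drill, and the convention that `offsets[row + 1]` marks the end of a row is an assumption of this sketch.

```java
import java.util.Arrays;

/** Toy repeated-int column: an offset vector plus a flat data vector. Not Drill code. */
final class ToyRepeatedIntColumn {
    // Assumption: offsets[row] is the first value of a row, offsets[row + 1] is one past its last.
    private int[] offsets = new int[] {0};
    private int[] data = new int[8];
    private int rowCount = 0;
    private int writeIndex = 0;  // per-array index: the next write position in data

    /** Appends one element to the row currently being written. */
    void addElement(int value) {
        if (writeIndex == data.length) {
            data = Arrays.copyOf(data, data.length * 2);  // grow the data vector
        }
        data[writeIndex++] = value;
    }

    /** Closes the current row by recording where the next row starts. */
    void endRow() {
        rowCount++;
        offsets = Arrays.copyOf(offsets, rowCount + 1);
        offsets[rowCount] = writeIndex;
    }

    /** Reads a row back by slicing the data vector between two offsets. */
    int[] row(int row) {
        return Arrays.copyOfRange(data, offsets[row], offsets[row + 1]);
    }

    int rowCount() { return rowCount; }
}
```

Writing the rows `[10, 20]`, `[]` and `[30]` means calling `addElement` twice, `endRow`, `endRow` again for the empty row, then `addElement` once and a final `endRow`. An empty row simply repeats the previous offset.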

    Writer Data Model

    The above leads to a very simple, JSON-like data model:

    This data model is similar to, but has important differences from, the prior generated readers and writers. This version is based on the concept of minimizing the number of writer classes, and leveraging Java primitives to keep the number of get/set methods to a reasonable size. This version also automates vector allocation, vector overflow and so on.

    The object layer is new: it is the simplest way to model the three "object types." An app using this code would use just the leaf scalar readers and writers.

    Similarly, the ColumnWriter interface provides a uniform way to access services common to all writer types, while allowing each JSON-like writer to provide type-specific ways to access data.
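    The idea of a common writer interface with type-specific subtypes can be sketched as follows. Everything here is hypothetical: the `Toy*` names, their methods, and the assumption that the three JSON-like writer kinds are scalar, tuple and array are illustrative only and do not reproduce Drill's actual ColumnWriter API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Toy writers only; the names and methods are illustrative, not Drill's API. */
final class ToyWriters {

    /** Services common to every writer, whatever its type. */
    interface ToyColumnWriter {
        String name();
        Object value();  // what has been written so far
    }

    /** Leaf writer: type-specific setters built on Java primitives. */
    static final class ToyScalarWriter implements ToyColumnWriter {
        private final String name;
        private Object value;
        ToyScalarWriter(String name) { this.name = name; }
        public String name() { return name; }
        public Object value() { return value; }
        void setInt(int v) { value = v; }
        void setString(String v) { value = v; }
    }

    /** Array writer: appends elements to a single column. */
    static final class ToyArrayWriter implements ToyColumnWriter {
        private final String name;
        private final List<Object> elements = new ArrayList<>();
        ToyArrayWriter(String name) { this.name = name; }
        public String name() { return name; }
        public Object value() { return new ArrayList<>(elements); }
        void addInt(int v) { elements.add(v); }
    }

    /** Tuple writer: a named group of member writers, created on first use. */
    static final class ToyTupleWriter implements ToyColumnWriter {
        private final String name;
        private final Map<String, ToyColumnWriter> members = new LinkedHashMap<>();
        ToyTupleWriter(String name) { this.name = name; }
        public String name() { return name; }

        ToyScalarWriter scalar(String col) {
            return (ToyScalarWriter) members.computeIfAbsent(col, ToyScalarWriter::new);
        }

        ToyArrayWriter array(String col) {
            return (ToyArrayWriter) members.computeIfAbsent(col, ToyArrayWriter::new);
        }

        public Object value() {
            Map<String, Object> out = new LinkedHashMap<>();
            for (ToyColumnWriter w : members.values()) {
                out.put(w.name(), w.value());  // uses only the common interface
            }
            return out;
        }
    }
}
```

    Note that `ToyTupleWriter.value()` touches its members only through the common interface, while client code reaches the type-specific setters through `scalar(...)` and `array(...)`.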

    Writer Performance

    To maximize performance, the writers have a single version for all "data modes" (nullable, required, repeated). Some items of note:

    Lists

    As described in the API package, Lists and Unions in Drill are highly complex and not well supported. This creates huge problems in the writer layer because we must support something which is broken and under-used, but which most people assume works (it is part of Drill's JSON-like, schema-free model). Our goal here is to support Union and List well enough that nothing new is broken, though this layer cannot fix the issues elsewhere in Drill.

    The most complex part is List support for the transition from a single type to a union of types. The API should be simple: the client should not have to be aware of the transition.

    To make this work, the writers provide two options:

    1. Use metadata to state that a List will have exactly one type and to specify that type. The List will present as an array of that type in which each array can be null.
    2. Otherwise, the list is a repeated union (an array of variants), even if the list happens to have 0 or 1 types. In this case, the list presents as an array of variants.
    The result is that client code assumes one or the other model, and never has to worry about transitioning from one to the other within a single operator.

    The PromotableListWriter handles the complex details of providing the above simple API in the array-of-variant case.
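    The goal of hiding the single-type-to-union transition from the client can be illustrated with a toy container. `ToyPromotableList` is a hypothetical class written for this note; it does not describe how PromotableListWriter is implemented, and storing variants as plain `Object` values is an assumption of the sketch.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy list that promotes itself from "all ints" to "variants". Not Drill code. */
final class ToyPromotableList {
    private List<Integer> ints = new ArrayList<>();  // single-type storage, used until promotion
    private List<Object> variants;                    // non-null once the list has been promoted

    /** The client calls the same method before and after the transition. */
    void add(Object value) {
        if (variants == null && value instanceof Integer) {
            ints.add((Integer) value);
            return;
        }
        if (variants == null) {
            promote();
        }
        variants.add(value);
    }

    /** Copies existing values into variant storage and drops the typed storage. */
    private void promote() {
        variants = new ArrayList<>(ints);
        ints = null;
    }

    boolean isPromoted() { return variants != null; }

    int size() { return variants == null ? ints.size() : variants.size(); }

    Object get(int i) { return variants == null ? ints.get(i) : variants.get(i); }
}
```

    Adding `1`, `2` and then `"three"` promotes the list on the third call, yet the calling code never changes which method it uses.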

    Possible Improvements

    The code here works and has extensive unit tests. However, many improvements are possible:

    Caveats

    The column accessors are divided into two packages: vector and java-exec. It is easy to add functionality in the wrong place, breaking abstraction and encapsulation. Here are some general guidelines:

    Given all this, plan carefully where to make any improvement. If your change violates the dependencies below, consider another way to make the change.
                                                      +------------+
                          +-------------------------- | Result Set |
                          v                           |   Loader   |
                 +----------------+     +---------+   +------------+
                 |    Metadata    | <-- | Row Set |     |
                 | Implementation |     |  Tools  |     |
                 +----------------+     +---------+     |
     java-exec           |                     |        |
     ------------------- | ------------------- | ------ | ------------
     vector              v                     v        v
                   +------------+            +-----------+
                   | Metadata   | <--------- |  Column   |
                   | Interfaces |            | Accessors |
                   +------------+            +-----------+
                                                   |
                                                   v
                                              +---------+
                                              |  Value  |
                                              | Vectors |
                                              +---------+
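
    One way to keep the diagram in mind while planning a change is to write its arrows down as data and check a proposed dependency against them. The sketch below is hypothetical: the class name is invented, and the edge list is one reading of the arrows in the diagram above rather than an authoritative list.

```java
import java.util.Map;
import java.util.Set;

/** Toy dependency check; the edges are one reading of the diagram, not an official list. */
final class ToyDependencyCheck {

    // Key: a component. Value: the components it may depend on directly.
    static final Map<String, Set<String>> ALLOWED = Map.of(
        "Result Set Loader", Set.of("Metadata Implementation", "Column Accessors"),
        "Row Set Tools", Set.of("Metadata Implementation", "Column Accessors"),
        "Metadata Implementation", Set.of("Metadata Interfaces"),
        "Column Accessors", Set.of("Metadata Interfaces", "Value Vectors"),
        "Metadata Interfaces", Set.<String>of(),
        "Value Vectors", Set.<String>of());

    /** Returns true if the diagram, as read here, has an arrow from one component to the other. */
    static boolean isAllowed(String from, String to) {
        return ALLOWED.getOrDefault(from, Set.<String>of()).contains(to);
    }
}
```

    For example, `isAllowed("Column Accessors", "Value Vectors")` holds, while a dependency from "Column Accessors" back up to "Row Set Tools" does not appear in the diagram and is rejected.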
     
    Copyright © 1970 The Apache Software Foundation. All rights reserved.