Class ListVector

All Implemented Interfaces:
Closeable, AutoCloseable, Iterable<ValueVector>, ContainerVectorLike, RepeatedValueVector, ValueVector

public class ListVector extends BaseRepeatedValueVector
"Non-repeated" LIST vector. This vector holds some other vector as its data element. Unlike a repeated vector, the child element can change dynamically. It starts as nothing (the LATE type). It can then change to a single type (typically a map, but it can be anything). If another type is needed, the list morphs again, this time to a list of unions. The prior single type becomes a member of the new union (which requires back-filling is-set values).

Why this odd behavior? The LIST type apparently attempts to model certain JSON types. In JSON, we can have lists like this:


 {a: [null, null]}
 {a: [10, "foo"]}
 {a: [{name: "fred", balance: 10}, null]}
 {a: null}
 
Compared with Drill, JSON has a number of additional list-related abilities:
  • A list can be null. (In Drill, an array can be empty, but not null.)
  • A list element can be null. (In Drill, a repeated type is an array of non-nullable elements, so list elements can't be null.)
  • A list can contain heterogeneous types. (In Drill, repeated types are arrays of a single type.)
The LIST vector is an attempt to implement full JSON behavior. The list:
  • Allows the list value for a row to be null. (To handle the {list: null} case.)
  • Allows list elements to be null. (To handle the {list: [10, null, 30]} case.)
  • Allows the list to be a single type. (To handle the list of nullable ints above.)
  • Allows the list to be of multiple types, by creating a list of UNIONs. (To handle the {list: ["fred", 10]} case.)
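
The progression from no type, to a single type, to a list of UNIONs can be pictured with a small toy model. This is only a sketch: `ListTypeEvolutionSketch`, `ElementState`, and `addType` are hypothetical names invented for illustration, type names are plain strings, and nothing here is Drill's actual implementation.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Toy model (not Drill code) of how a LIST's element type evolves. */
final class ListTypeEvolutionSketch {

  /** Hypothetical states for this sketch only. */
  enum ElementState { LATE, SINGLE, UNION }

  // Distinct element types seen so far, in first-seen order.
  private final Set<String> memberTypes = new LinkedHashSet<>();

  /** Records that a value of the given type was written to the list. */
  void addType(String typeName) {
    memberTypes.add(typeName);
  }

  /** No types yet: LATE. One type: SINGLE. Two or more: UNION. */
  ElementState state() {
    if (memberTypes.isEmpty()) {
      return ElementState.LATE;
    }
    return memberTypes.size() == 1 ? ElementState.SINGLE : ElementState.UNION;
  }

  public static void main(String[] args) {
    ListTypeEvolutionSketch list = new ListTypeEvolutionSketch();
    System.out.println(list.state()); // LATE
    list.addType("INT");
    System.out.println(list.state()); // SINGLE
    list.addType("VARCHAR");
    System.out.println(list.state()); // UNION
  }
}
```

Writing the same type twice leaves the model in the SINGLE state; only a second, different type triggers the move to UNION.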

Background

The above is the theory. The problem is that the goals are very difficult to achieve, and the code here does not quite do so. The code here is difficult to maintain and understand. The first thing to understand is that union vectors are broken in most operators, and so major bugs remain in union and list vectors that have not had to be fixed. Recent revisions attempt to fix or work around some of the bugs, but many remain.

Unions have a null bit for the union itself. That is, a union can be an int, say, or a Varchar, or null. Oddly, the Int and Varchar can also be null (we use nullable vectors so we can mark the unused values as null). So, we have a two-level null bit. The most logical way to interpret it is that a union value can be:

  • Untyped null, if the type is not set and the null bit (really, the isSet bit) is unset.
  • Typed null, if the type is set and EITHER the union's isSet bit is unset OR the union's isSet bit is set, but the data vector's isSet bit is not set.
It is not clear in the code which convention is assumed, or whether different code does it differently.

Now, add all that to a list. A list can be a list of something (ints, say, or maps). When the list is a list of maps, the entire value for a row can be null. But individual maps can't be null. In a list, however, individual ints can be null (because we use a nullable int vector).
So, when a list of (non-nullable) maps converts to a list of unions (one of which is a map), we suddenly have both the list null bit and the union null bit to worry about. We have to go and back-patch the isSet vector for all the existing map entries in the new union so that we don't end up with all previous entries becoming null by default.
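
To make the two-level null reading and the back-patch step concrete, here is a minimal sketch over plain Java values. `UnionNullSketch`, `classify`, and `backfillIsSet` are hypothetical names, the isSet vector is modeled as a `byte[]`, and treating a slot with no type as an untyped null regardless of its isSet bit is an assumption of this sketch; as noted above, the real code does not make its convention clear.

```java
import java.util.Arrays;

/** Toy model of the null handling described above; not Drill code. */
final class UnionNullSketch {

  enum Kind { UNTYPED_NULL, TYPED_NULL, VALUE }

  /**
   * One possible reading of the two-level null bits.
   * Assumption: a slot with no type is an untyped null whatever its isSet bit says.
   */
  static Kind classify(boolean typeSet, boolean unionIsSet, boolean memberIsSet) {
    if (!typeSet) {
      return Kind.UNTYPED_NULL;
    }
    if (!unionIsSet || !memberIsSet) {
      return Kind.TYPED_NULL;
    }
    return Kind.VALUE;
  }

  /**
   * Builds the union-level isSet bits for a list that already holds
   * existingCount non-null entries (for example, maps).
   */
  static byte[] backfillIsSet(int capacity, int existingCount) {
    if (existingCount < 0 || existingCount > capacity) {
      throw new IllegalArgumentException("existingCount out of range");
    }
    byte[] isSet = new byte[capacity];              // zero-filled: every slot reads as null
    Arrays.fill(isSet, 0, existingCount, (byte) 1); // mark the pre-existing entries as set
    return isSet;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(backfillIsSet(5, 3))); // [1, 1, 1, 0, 0]
    System.out.println(classify(true, true, false));          // TYPED_NULL
  }
}
```

Without the `Arrays.fill` call, all five slots would stay at zero, which is the "previous entries becoming null by default" failure that the back-patch avoids.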

Another issue is that the metadata for a list should reflect the structure of the list. The MaterializedField contains a child field, which points to the element of the list. If that child is a UNION, then the UNION's MaterializedField contains subtypes for each type in the union. Now, note that the LIST's metadata contains the child, so we need to update the LIST's MaterializedField each time we add a type to the UNION. And, since the LIST is part of a row or map, then we have to update the metadata in those objects to propagate the change.

The problem is that the original design assumed that MaterializedField is immutable. The above shows that it clearly is not. So, we have a tension between the original immutable design and the necessity of mutating the MaterializedField to keep everything in sync.
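
The propagation cost can be illustrated with a toy immutable descriptor. This is a sketch under stated assumptions: `Field` below is a hypothetical stand-in, not Drill's MaterializedField, and the row, list, and union are each assumed to sit at child index 0 of their parent.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy model of metadata propagation; not Drill code. */
final class MetadataPropagationSketch {

  /** Immutable stand-in for a field descriptor. */
  static final class Field {
    final String name;
    final List<Field> children;

    Field(String name, List<Field> children) {
      this.name = name;
      this.children = Collections.unmodifiableList(new ArrayList<>(children));
    }

    /** Returns a copy with one more child; the receiver is unchanged. */
    Field withChild(Field child) {
      List<Field> copy = new ArrayList<>(children);
      copy.add(child);
      return new Field(name, copy);
    }

    /** Returns a copy with the child at the given index replaced. */
    Field withChildReplaced(int index, Field replacement) {
      List<Field> copy = new ArrayList<>(children);
      copy.set(index, replacement);
      return new Field(name, copy);
    }
  }

  /** Adds a subtype to row -> list -> union, rebuilding every ancestor on the way up. */
  static Field addUnionSubtype(Field row, String subtype) {
    Field list = row.children.get(0);
    Field union = list.children.get(0);
    Field newUnion = union.withChild(new Field(subtype, Collections.<Field>emptyList()));
    Field newList = list.withChildReplaced(0, newUnion);
    return row.withChildReplaced(0, newList);
  }

  public static void main(String[] args) {
    Field union = new Field("union",
        Collections.singletonList(new Field("INT", Collections.<Field>emptyList())));
    Field list = new Field("a", Collections.singletonList(union));
    Field row = new Field("row", Collections.singletonList(list));

    Field newRow = addUnionSubtype(row, "VARCHAR");
    System.out.println(row.children.get(0).children.get(0).children.size());    // 1
    System.out.println(newRow.children.get(0).children.get(0).children.size()); // 2
  }
}
```

In this model, anything still holding the old `row` object sees the stale one-subtype union, which is one way to picture how metadata drifts out of sync when every holder is not updated.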

Of course, there is another solution: don't include subtypes and children in the MaterializedField, then we don't have the propagation problem.

The code for this class kind of punts on the issue: the metadata is not maintained and can get out of sync. This makes the metadata useless: one must recover actual structure by traversing vectors. There was an attempt to fix this, but doing so changes the metadata structure, which broke clients. So, we have to live with broken metadata and work around the issues. The metadata sync issue exists in many places, but is most obvious in the LIST vector because of the sheer complexity in this class.

This is why the code notes say that this is a mess.

It is hard to simply fix the bugs because this is a design problem. If the list and union vectors don't need to work (they barely work today), then any design is fine. See the list of JIRA tickets below for more information.

Fundamental issue: should Drill support unions and lists? Is the current approach compatible with SQL? Is there a better approach? If such changes are made, they are breaking changes, and so must be done as part of a major version, such as the much-discussed "Drill 2.0". Or, perhaps as part of a conversion to use Apache Arrow, which also would be a major breaking change.

  • Method Details

    • getWriter

      public UnionListWriter getWriter()
    • allocateNew

      public void allocateNew() throws OutOfMemoryException
      Description copied from interface: ValueVector
      Allocate new buffers. ValueVector implements logic to determine how much to allocate.
      Throws:
      OutOfMemoryException - Thrown if no memory can be allocated.
    • transferTo

      public void transferTo(ListVector target)
    • copyFromSafe

      public void copyFromSafe(int inIndex, int outIndex, ListVector from)
    • copyFrom

      public void copyFrom(int inIndex, int outIndex, ListVector from)
    • copyEntry

      public void copyEntry(int toIndex, ValueVector from, int fromIndex)
    • getDataVector

      public ValueVector getDataVector()
      Specified by:
      getDataVector in interface RepeatedValueVector
      Overrides:
      getDataVector in class BaseRepeatedValueVector
      Returns:
      the underlying data vector or null if none exists.
    • getBitsVector

      public ValueVector getBitsVector()
    • getTransferPair

      public TransferPair getTransferPair(String ref, BufferAllocator allocator)
    • makeTransferPair

      public TransferPair makeTransferPair(ValueVector target)
      Description copied from interface: ValueVector
      Returns a new transfer pair that is used to transfer underlying buffers into the target vector.
    • getAccessor

      public ListVector.Accessor getAccessor()
      Description copied from interface: ValueVector
      Returns an accessor that is used to read from this vector instance.
    • getMutator

      public ListVector.Mutator getMutator()
      Description copied from interface: ValueVector
      Returns a mutator that is used to write to this vector instance.
    • getReader

      public FieldReader getReader()
      Description copied from interface: ValueVector
      Returns a field reader that supports reading values from this vector.
    • allocateNewSafe

      public boolean allocateNewSafe()
      Description copied from interface: ValueVector
      Allocates new buffers. ValueVector implements logic to determine how much to allocate.
      Specified by:
      allocateNewSafe in interface ValueVector
      Overrides:
      allocateNewSafe in class BaseRepeatedValueVector
      Returns:
      Returns true if allocation was successful.
    • getMetadataBuilder

      protected UserBitShared.SerializedField.Builder getMetadataBuilder()
      Overrides:
      getMetadataBuilder in class BaseRepeatedValueVector
    • addOrGetVector

      public <T extends ValueVector> AddOrGetResult<T> addOrGetVector(VectorDescriptor descriptor)
      Description copied from interface: ContainerVectorLike
      Creates and adds a child vector if none with the same name exists, else returns the vector instance.
      Specified by:
      addOrGetVector in interface ContainerVectorLike
      Overrides:
      addOrGetVector in class BaseRepeatedValueVector
      Parameters:
      descriptor - vector descriptor
      Returns:
      result of operation wrapping vector corresponding to the given descriptor and whether it's newly created
    • getBufferSize

      public int getBufferSize()
      Description copied from interface: ValueVector
      Returns the number of bytes that are used by this vector instance. This is a bit of a misnomer. Returns the number of bytes used by data in this instance.
      Specified by:
      getBufferSize in interface ValueVector
      Overrides:
      getBufferSize in class BaseRepeatedValueVector
    • clear

      public void clear()
      Description copied from interface: ValueVector
      Release the underlying DrillBuf and reset the ValueVector to empty.
      Specified by:
      clear in interface ValueVector
      Overrides:
      clear in class BaseRepeatedValueVector
    • getBuffers

      public DrillBuf[] getBuffers(boolean clear)
      Description copied from interface: ValueVector
      Return the underlying buffers associated with this vector. Note that this doesn't impact the reference counts for this buffer so it only should be used for in-context access. Also note that this buffer changes regularly thus external classes shouldn't hold a reference to it (unless they change it).
      Specified by:
      getBuffers in interface ValueVector
      Overrides:
      getBuffers in class BaseRepeatedValueVector
      Parameters:
      clear - Whether to clear vector before returning; the buffers will still be refcounted; but the returned array will be the only reference to them
      Returns:
      The underlying buffers that are used by this vector instance.
    • load

      public void load(UserBitShared.SerializedField metadata, DrillBuf buffer)
      Description copied from interface: ValueVector
      Load the data provided in the buffer. Typically used when deserializing from the wire.
      Specified by:
      load in interface ValueVector
      Overrides:
      load in class BaseRepeatedValueVector
      Parameters:
      metadata - Metadata used to decode the incoming buffer.
      buffer - The buffer that contains the ValueVector.
    • isEmptyType

      public boolean isEmptyType()
    • setChildVector

      public void setChildVector(ValueVector childVector)
      Overrides:
      setChildVector in class BaseRepeatedValueVector
    • promoteToUnion

      public UnionVector promoteToUnion()
      Promote the list to a union. Called from old-style writers. This implementation relies on the caller to set the types vector for any existing values. This method simply clears the existing vector.
      Returns:
      the new union vector
    • fullPromoteToUnion

      public UnionVector fullPromoteToUnion()
      Revised form of promote to union that correctly fixes up the list field metadata to match the new union type. Since this form handles both the vector and metadata revisions, it is a "full" promotion.
      Returns:
      the new union vector
    • convertToUnion

      public UnionVector convertToUnion(int allocValueCount, int valueCount)
      Promote to a union, preserving the existing data vector as a member of the new union. Back-fill the types vector with the proper type value for existing rows.
      Returns:
      the new union vector
    • collectLedgers

      public void collectLedgers(Set<AllocationManager.BufferLedger> ledgers)
      Description copied from interface: ValueVector
      Add the ledgers underlying the buffers underlying the components of the vector to the set provided. Used to determine actual memory allocation.
      Specified by:
      collectLedgers in interface ValueVector
      Overrides:
      collectLedgers in class BaseRepeatedValueVector
      Parameters:
      ledgers - set of ledgers to which to add ledgers for this vector
    • getPayloadByteCount

      public int getPayloadByteCount(int valueCount)
      Description copied from interface: ValueVector
      Return the number of value bytes consumed by actual data.
      Specified by:
      getPayloadByteCount in interface ValueVector
      Overrides:
      getPayloadByteCount in class BaseRepeatedValueVector