Class ListVector
- All Implemented Interfaces: Closeable, AutoCloseable, Iterable<ValueVector>, ContainerVectorLike, RepeatedValueVector, ValueVector
Why this odd behavior? The LIST type apparently attempts to model certain JSON types. In JSON, we can have lists like this:
{a: [null, null]}
{a: [10, "foo"]}
{a: [{name: "fred", balance: 10}, null]}
{a: null}
Compared with Drill, JSON has a number of additional list-related abilities:
- A list can be null. (In Drill, an array can be empty, but not null.)
- A list element can be null. (In Drill, a repeated type is an array of non-nullable elements, so list elements can't be null.)
- A list can contain heterogeneous types. (In Drill, repeated types are arrays of a single type.)
To cover these cases, the LIST type:
- Allows the list value for a row to be null. (To handle the {list: null} case.)
- Allows list elements to be null. (To handle the {list: [10, null, 30]} case.)
- Allows the list to be a single type. (To handle the list of nullable ints above.)
- Allows the list to be of multiple types, by creating a list of UNIONs. (To handle the {list: ["fred", 10]} case.)
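As a rough illustration of the cases listed above, the following sketch classifies a JSON-like list value by the element typing it would need. `ListShapeSketch` and `classify` are hypothetical names used only for this example; they are not part of Drill, and the sketch makes no claim about how the LIST vector stores data.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical model, not Drill code: names the typing a list value requires. */
final class ListShapeSketch {

  /** Describes the element typing a JSON-like list value needs. */
  static String classify(List<Object> list) {
    if (list == null) {
      return "null list"; // the {list: null} case
    }
    Set<String> types = new LinkedHashSet<>();
    boolean hasNull = false;
    for (Object element : list) {
      if (element == null) {
        hasNull = true; // a null element, as in [10, null, 30]
      } else {
        types.add(element.getClass().getSimpleName());
      }
    }
    String nullability = hasNull ? " with null elements" : "";
    if (types.isEmpty()) {
      return "untyped list" + nullability; // [] or [null, null]
    }
    if (types.size() == 1) {
      return "list of " + types.iterator().next() + nullability;
    }
    // Heterogeneous elements, as in ["fred", 10], need a list of UNIONs.
    return "list of UNION" + types + nullability;
  }

  public static void main(String[] args) {
    System.out.println(classify(null));
    System.out.println(classify(Arrays.<Object>asList(10, null, 30)));
    System.out.println(classify(Arrays.<Object>asList("fred", 10)));
  }
}
```

The single-type and UNION branches correspond to the last two bullets: a list stays a list of one type until a second type shows up.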
Background
The above is the theory. The problem is that these goals are very difficult to achieve, and the code here does not quite achieve them. The code is also difficult to understand and maintain. The first thing to understand is that union vectors are broken in most operators, so major bugs remain in union and list vectors that have never had to be fixed. Recent revisions attempt to fix or work around some of the bugs, but many remain.
Unions have a null bit for the union itself. That is, a union can be an int, say, or a Varchar, or null. Oddly, the Int and Varchar can also be null (we use nullable vectors so we can mark the unused values as null). So, we have a two-level null bit. The most logical way to interpret it is that a union value can be:
- Untyped null, if the type is not set and the null bit (really, the isSet bit) is unset.
- Typed null, if the type is set and EITHER the union's isSet bit is unset OR the union's isSet bit is set but the data vector's isSet bit is not set.
It is not clear in the code which convention is assumed, or if different code does it differently.
Now, add all that to a list. A list can be a list of something (ints, say, or maps). When the list is a list of maps, the entire value for a row can be null, but individual maps can't be null. In a list, however, individual ints can be null (because we use a nullable int vector).
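The two-level null interpretation described above can be written down as a small decision function. This is only a sketch of the "most logical" reading; `UnionNullSketch`, `State`, and `interpret` are hypothetical names, and rejecting the combination "no type but isSet set" is an assumption, since the text does not define that case.

```java
/** Hypothetical sketch, not Drill code: one reading of a union's two-level null bits. */
final class UnionNullSketch {

  enum State { UNTYPED_NULL, TYPED_NULL, VALUE }

  /**
   * @param typeSet     whether the union records a type for this value
   * @param unionIsSet  the union's own isSet bit
   * @param memberIsSet the isSet bit of the nullable member vector for that type
   */
  static State interpret(boolean typeSet, boolean unionIsSet, boolean memberIsSet) {
    if (!typeSet) {
      if (unionIsSet) {
        // Assumption: the source text does not say what this combination means.
        throw new IllegalStateException("isSet bit is set but no type is recorded");
      }
      return State.UNTYPED_NULL;
    }
    // Typed null: either level may carry the null.
    if (!unionIsSet || !memberIsSet) {
      return State.TYPED_NULL;
    }
    return State.VALUE;
  }

  public static void main(String[] args) {
    System.out.println(interpret(false, false, false));
    System.out.println(interpret(true, true, false));
    System.out.println(interpret(true, true, true));
  }
}
```

Code that checks only one of the two isSet bits would disagree with this function on some inputs, which is the kind of inconsistency the text warns about.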
Another issue is that the metadata for a list should reflect the structure of the list. The MaterializedField contains a child field, which points to the element of the list. If that child is a UNION, then the UNION's MaterializedField contains subtypes for each type in the union. Now, note that the LIST's metadata contains the child, so we need to update the LIST's MaterializedField each time we add a type to the UNION. And, since the LIST is part of a row or map, then we have to update the metadata in those objects to propagate the change.
The problem is that the original design assumed that MaterializedField is immutable. The above shows that it clearly is not. So, we have a tension between the original immutable design and the necessity of mutating the MaterializedField to keep everything in sync.
Of course, there is another solution: don't include subtypes and children in the MaterializedField, then we don't have the propagation problem.
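To make the propagation problem concrete, here is a minimal sketch of what adding a subtype costs when field metadata really is immutable: every ancestor of the UNION has to be rebuilt. `FieldSketch` and `MetadataPropagationSketch` are hypothetical stand-ins written for this example; they do not reproduce the MaterializedField API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical immutable field metadata; a stand-in, not MaterializedField. */
final class FieldSketch {
  final String name;
  final List<FieldSketch> children;

  FieldSketch(String name, List<FieldSketch> children) {
    this.name = name;
    this.children = Collections.unmodifiableList(new ArrayList<>(children));
  }

  static FieldSketch leaf(String name) {
    return new FieldSketch(name, Collections.<FieldSketch>emptyList());
  }

  /** Returns a copy with one more child; this instance is unchanged. */
  FieldSketch withChild(FieldSketch child) {
    List<FieldSketch> copy = new ArrayList<>(children);
    copy.add(child);
    return new FieldSketch(name, copy);
  }

  /** Returns a copy in which the child at the given index is replaced. */
  FieldSketch withChildReplaced(int index, FieldSketch replacement) {
    List<FieldSketch> copy = new ArrayList<>(children);
    copy.set(index, replacement);
    return new FieldSketch(name, copy);
  }

  @Override
  public String toString() {
    return children.isEmpty() ? name : name + children;
  }
}

final class MetadataPropagationSketch {
  /** Adds a subtype to the UNION under row -> list -> union, rebuilding each ancestor. */
  static FieldSketch addUnionSubtype(FieldSketch row, FieldSketch subtype) {
    FieldSketch list = row.children.get(0);
    FieldSketch union = list.children.get(0);
    FieldSketch newUnion = union.withChild(subtype);           // UNION gains a subtype
    FieldSketch newList = list.withChildReplaced(0, newUnion); // LIST must be rebuilt
    return row.withChildReplaced(0, newList);                  // and so must the row
  }
}
```

Anything still holding the old row, list, or union object now sees stale metadata, which is the sync problem described above in miniature.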
The code for this class kind of punts on the issue: the metadata is not maintained and can get out of sync. This makes the metadata useless: one must recover the actual structure by traversing the vectors. There was an attempt to fix this, but doing so changes the metadata structure, which broke clients. So, we have to live with broken metadata and work around the issues. The metadata sync issue exists in many places, but is most obvious in the LIST vector because of the sheer complexity in this class.
This is why the code notes say that this is a mess.
It is hard to simply fix the bugs because this is a design problem. If the list and union vectors don't need to work (they barely work today), then any design is fine. See the list of JIRA tickets below for more information.
Fundamental issue: should Drill support unions and lists? Is the current approach compatible with SQL? Is there a better approach? If such changes are made, they are breaking changes, and so must be done as part of a major version, such as the much-discussed "Drill 2.0". Or, perhaps as part of a conversion to use Apache Arrow, which also would be a major breaking change.
-
Nested Class Summary
Nested classes/interfaces inherited from class org.apache.drill.exec.vector.complex.BaseRepeatedValueVector
BaseRepeatedValueVector.BaseRepeatedAccessor, BaseRepeatedValueVector.BaseRepeatedMutator, BaseRepeatedValueVector.BaseRepeatedValueVectorTransferPair<T extends BaseRepeatedValueVector>
Nested classes/interfaces inherited from class org.apache.drill.exec.vector.BaseValueVector
BaseValueVector.BaseAccessor, BaseValueVector.BaseMutator
Nested classes/interfaces inherited from interface org.apache.drill.exec.vector.complex.RepeatedValueVector
RepeatedValueVector.RepeatedAccessor, RepeatedValueVector.RepeatedMutator
-
Field Summary
Fields inherited from class org.apache.drill.exec.vector.complex.BaseRepeatedValueVector
DATA_VECTOR_NAME, DEFAULT_DATA_VECTOR, offsets, OFFSETS_FIELD, OFFSETS_VECTOR_NAME, vector
Fields inherited from class org.apache.drill.exec.vector.BaseValueVector
allocator, field, INITIAL_VALUE_ALLOCATION, MAX_ALLOCATION_SIZE
Fields inherited from interface org.apache.drill.exec.vector.complex.RepeatedValueVector
DEFAULT_REPEAT_PER_RECORD
Fields inherited from interface org.apache.drill.exec.vector.ValueVector
BITS_VECTOR_NAME, MAX_BUFFER_SIZE, MAX_ROW_COUNT, MIN_ROW_COUNT, VALUES_VECTOR_NAME
-
Constructor Summary
ListVector(MaterializedField field, BufferAllocator allocator, CallBack callBack)
-
Method Summary
<T extends ValueVector> AddOrGetResult<T> addOrGetVector(VectorDescriptor descriptor): Creates and adds a child vector if none with the same name exists, else returns the vector instance.
void allocateNew(): Allocate new buffers.
boolean allocateNewSafe(): Allocates new buffers.
void clear(): Release the underlying DrillBuf and reset the ValueVector to empty.
void collectLedgers(Set<AllocationManager.BufferLedger> ledgers): Add the ledgers underlying the buffers underlying the components of the vector to the set provided.
convertToUnion(int allocValueCount, int valueCount): Promote to a union, preserving the existing data vector as a member of the new union.
void copyEntry(int toIndex, ValueVector from, int fromIndex)
void copyFrom(int inIndex, int outIndex, ListVector from)
void copyFromSafe(int inIndex, int outIndex, ListVector from)
fullPromoteToUnion(): Revised form of promote to union that correctly fixes up the list field metadata to match the new union type.
getAccessor(): Returns an accessor that is used to read from this vector instance.
getBitsVector()
DrillBuf[] getBuffers(boolean clear): Return the underlying buffers associated with this vector.
int getBufferSize(): Returns the number of bytes that is used by this vector instance.
getDataVector()
protected UserBitShared.SerializedField.Builder getMetadataBuilder()
getMutator(): Returns a mutator that is used to write to this vector instance.
int getPayloadByteCount(int valueCount): Return the number of value bytes consumed by actual data.
getReader(): Returns a field reader that supports reading values from this vector.
getTransferPair(String ref, BufferAllocator allocator)
getWriter()
boolean isEmptyType()
void load(UserBitShared.SerializedField metadata, DrillBuf buffer): Load the data provided in the buffer.
makeTransferPair(ValueVector target): Returns a new transfer pair that is used to transfer underlying buffers into the target vector.
promoteToUnion(): Promote the list to a union.
void setChildVector(ValueVector childVector)
void transferTo(ListVector target)
Methods inherited from class org.apache.drill.exec.vector.complex.BaseRepeatedValueVector
exchange, getAllocatedSize, getBufferSizeFor, getOffsetVector, getValueCapacity, iterator, replaceDataVector, setInitialCapacity, size
Methods inherited from class org.apache.drill.exec.vector.BaseValueVector
checkBufRefs, close, fillBitsVector, getAllocator, getField, getField, getMetadata, getTransferPair, toNullable, toString
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
Methods inherited from interface java.lang.Iterable
forEach, spliterator
Methods inherited from interface org.apache.drill.exec.vector.ValueVector
close, getAllocator, getField, getMetadata, getTransferPair, toNullable
-
Field Details
-
UNION_VECTOR_NAME
-
Constructor Details
-
ListVector
-
-
Method Details
-
getWriter
-
allocateNew
Description copied from interface: ValueVector
Allocate new buffers. ValueVector implements logic to determine how much to allocate.
- Throws: OutOfMemoryException - Thrown if no memory can be allocated.
-
transferTo
-
copyFromSafe
-
copyFrom
-
copyEntry
-
getDataVector
- Specified by: getDataVector in interface RepeatedValueVector
- Overrides: getDataVector in class BaseRepeatedValueVector
- Returns: the underlying data vector or null if none exists.
-
getBitsVector
-
getTransferPair
-
makeTransferPair
Description copied from interface: ValueVector
Returns a new transfer pair that is used to transfer underlying buffers into the target vector.
-
getAccessor
Description copied from interface: ValueVector
Returns an accessor that is used to read from this vector instance.
-
getMutator
Description copied from interface: ValueVector
Returns a mutator that is used to write to this vector instance.
-
getReader
Description copied from interface: ValueVector
Returns a field reader that supports reading values from this vector.
-
allocateNewSafe
public boolean allocateNewSafe()
Description copied from interface: ValueVector
Allocates new buffers. ValueVector implements logic to determine how much to allocate.
- Specified by: allocateNewSafe in interface ValueVector
- Overrides: allocateNewSafe in class BaseRepeatedValueVector
- Returns: true if allocation was successful.
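The difference between the two allocation contracts (allocateNew throws when no memory can be allocated, allocateNewSafe reports failure through its return value) can be sketched as follows. `AllocationSketch`, its `limitBytes` limit, and the explicit size parameter are assumptions made so the example is self-contained; the real methods take no arguments, and this is not how Drill allocates buffers.

```java
/** Hypothetical sketch, not Drill code: the two allocation contracts side by side. */
final class AllocationSketch {

  /** Stand-in for Drill's OutOfMemoryException. */
  static final class OutOfMemory extends RuntimeException {
    OutOfMemory(String message) {
      super(message);
    }
  }

  private final int limitBytes;        // assumed allocator limit
  private byte[] buffer = new byte[0]; // stand-in for the vector's buffers

  AllocationSketch(int limitBytes) {
    this.limitBytes = limitBytes;
  }

  /** "Safe" form: reports failure through the return value and keeps the old buffer. */
  boolean allocateNewSafe(int bytes) {
    if (bytes > limitBytes) {
      return false;
    }
    buffer = new byte[bytes];
    return true;
  }

  /** Throwing form: same allocation, but failure surfaces as an exception. */
  void allocateNew(int bytes) {
    if (!allocateNewSafe(bytes)) {
      throw new OutOfMemory("cannot allocate " + bytes + " bytes");
    }
  }

  int capacity() {
    return buffer.length;
  }
}
```

A caller that can recover (for example by finishing the current batch early) would prefer the boolean form; a caller that cannot would use the throwing form.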
-
getMetadataBuilder
- Overrides: getMetadataBuilder in class BaseRepeatedValueVector
-
addOrGetVector
Description copied from interface: ContainerVectorLike
Creates and adds a child vector if none with the same name exists, else returns the vector instance.
- Specified by: addOrGetVector in interface ContainerVectorLike
- Overrides: addOrGetVector in class BaseRepeatedValueVector
- Parameters: descriptor - vector descriptor
- Returns: result of operation wrapping vector corresponding to the given descriptor and whether it's newly created
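The add-or-get contract, including the "newly created" flag on the result, can be illustrated with a minimal sketch. `AddOrGetSketch` and its `Result` class are hypothetical and use plain strings in place of vectors and descriptors; they only mirror the documented behavior, not the Drill implementation.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch, not Drill code: get-or-create with a "created" flag. */
final class AddOrGetSketch {

  /** Stand-in for AddOrGetResult: the child plus whether this call created it. */
  static final class Result {
    final String vector;
    final boolean created;

    Result(String vector, boolean created) {
      this.vector = vector;
      this.created = created;
    }
  }

  // Children keyed by name; a string stands in for the child vector.
  private final Map<String, String> children = new LinkedHashMap<>();

  Result addOrGet(String name, String type) {
    String existing = children.get(name);
    if (existing != null) {
      return new Result(existing, false); // same name: return the existing child
    }
    String child = name + ":" + type;
    children.put(name, child);
    return new Result(child, true);       // no such child yet: create and add it
  }
}
```

The flag lets a caller decide whether follow-up work, such as allocation or a schema-change notification, is needed.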
-
getBufferSize
public int getBufferSize()
Description copied from interface: ValueVector
Returns the number of bytes that is used by this vector instance. This is a bit of a misnomer. Returns the number of bytes used by data in this instance.
- Specified by: getBufferSize in interface ValueVector
- Overrides: getBufferSize in class BaseRepeatedValueVector
-
clear
public void clear()
Description copied from interface: ValueVector
Release the underlying DrillBuf and reset the ValueVector to empty.
- Specified by: clear in interface ValueVector
- Overrides: clear in class BaseRepeatedValueVector
-
getBuffers
Description copied from interface: ValueVector
Return the underlying buffers associated with this vector. Note that this doesn't impact the reference counts for these buffers, so it should only be used for in-context access. Also note that the buffers change regularly, so external classes shouldn't hold a reference to them (unless they change them).
- Specified by: getBuffers in interface ValueVector
- Overrides: getBuffers in class BaseRepeatedValueVector
- Parameters: clear - Whether to clear the vector before returning; the buffers will still be refcounted, but the returned array will be the only reference to them
- Returns: The underlying buffers that are used by this vector instance.
-
isEmptyType
public boolean isEmptyType()
-
setChildVector
- Overrides: setChildVector in class BaseRepeatedValueVector
-
promoteToUnion
Promote the list to a union. Called from old-style writers. This implementation relies on the caller to set the types vector for any existing values. This method simply clears the existing vector.
- Returns: the new union vector
-
fullPromoteToUnion
Revised form of promote to union that correctly fixes up the list field metadata to match the new union type. Since this form handles both the vector and metadata revisions, it is a "full" promotion.
- Returns: the new union vector
-
convertToUnion
Promote to a union, preserving the existing data vector as a member of the new union. Back-fill the types vector with the proper type value for existing rows.
- Returns: the new union vector
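The back-fill step can be pictured with a small sketch: once the existing single-type data becomes a member of the union, every row already written needs its entry in the types vector set to that member's type code. `ConvertToUnionSketch`, `backFillTypes`, and the `UNSET` code are hypothetical, and marking all existing rows (including any null ones) with the member type is an assumption, not a statement about what the Drill code does.

```java
import java.util.Arrays;

/** Hypothetical sketch, not Drill code: back-filling a union's types vector. */
final class ConvertToUnionSketch {

  static final int UNSET = 0; // assumed code meaning "no type recorded"

  /**
   * Builds a types array of allocValueCount entries in which the first
   * valueCount rows carry the type code of the pre-existing member vector.
   */
  static int[] backFillTypes(int memberTypeCode, int allocValueCount, int valueCount) {
    if (valueCount < 0 || valueCount > allocValueCount) {
      throw new IllegalArgumentException("valueCount must be in [0, allocValueCount]");
    }
    int[] types = new int[allocValueCount];             // every entry starts as UNSET
    Arrays.fill(types, 0, valueCount, memberTypeCode);  // existing rows keep their old type
    return types;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(backFillTypes(3, 5, 2)));
  }
}
```

Without this step, rows written before the promotion would read as untyped even though their data is still present in the member vector.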
-
collectLedgers
Description copied from interface: ValueVector
Add the ledgers underlying the buffers underlying the components of the vector to the set provided. Used to determine actual memory allocation.
- Specified by: collectLedgers in interface ValueVector
- Overrides: collectLedgers in class BaseRepeatedValueVector
- Parameters: ledgers - set of ledgers to which to add ledgers for this vector
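One plausible reason for collecting ledgers into a set is that several buffers can share one underlying allocation, so summing per buffer would count that memory more than once. The sketch below shows the idea; `LedgerSketch`, `Ledger`, and `Buffer` are hypothetical types, and the sharing scenario is an assumption used for illustration.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical sketch, not Drill code: de-duplicating shared allocations via a set. */
final class LedgerSketch {

  /** Stand-in for a buffer ledger: one underlying allocation. */
  static final class Ledger {
    final int allocatedBytes;

    Ledger(int allocatedBytes) {
      this.allocatedBytes = allocatedBytes;
    }
  }

  /** Stand-in for a buffer that is a view onto a ledger's memory. */
  static final class Buffer {
    final Ledger ledger;

    Buffer(Ledger ledger) {
      this.ledger = ledger;
    }
  }

  /** Adds each buffer's ledger to the set; a shared ledger is added only once. */
  static void collectLedgers(Buffer[] buffers, Set<Ledger> ledgers) {
    for (Buffer buffer : buffers) {
      ledgers.add(buffer.ledger);
    }
  }

  /** Sums the allocation of each distinct ledger. */
  static int totalAllocated(Set<Ledger> ledgers) {
    int total = 0;
    for (Ledger ledger : ledgers) {
      total += ledger.allocatedBytes;
    }
    return total;
  }

  public static void main(String[] args) {
    Ledger shared = new Ledger(1024);
    Buffer[] buffers = {new Buffer(shared), new Buffer(shared), new Buffer(new Ledger(256))};
    Set<Ledger> ledgers = new HashSet<>();
    collectLedgers(buffers, ledgers);
    System.out.println(totalAllocated(ledgers));
  }
}
```

`Ledger` does not override `equals`, so the set de-duplicates by identity, which is the intended behavior in this sketch.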
-
getPayloadByteCount
public int getPayloadByteCount(int valueCount)
Description copied from interface: ValueVector
Return the number of value bytes consumed by actual data.
- Specified by: getPayloadByteCount in interface ValueVector
- Overrides: getPayloadByteCount in class BaseRepeatedValueVector
-