Class ListVector
- All Implemented Interfaces:
Closeable, AutoCloseable, Iterable<ValueVector>, ContainerVectorLike, RepeatedValueVector, ValueVector
Why this odd behavior? The LIST type apparently attempts to model certain JSON types. In JSON, we can have lists like this:
{a: [null, null]}
{a: [10, "foo"]}
{a: [{name: "fred", balance: 10}, null]}
{a: null}
Compared with Drill, JSON has a number of additional list-related abilities:
- A list can be null. (In Drill, an array can be empty, but not null.)
- A list element can be null. (In Drill, a repeated type is an array of non-nullable elements, so list elements can't be null.)
- A list can contain heterogeneous types. (In Drill, repeated types are arrays of a single type.)
To model these cases, the LIST type:
- Allows the list value for a row to be null. (To handle the {list: null} case.)
- Allows list elements to be null. (To handle the {list: [10, null, 30]} case.)
- Allows the list to be a single type. (To handle the list of nullable ints above.)
- Allows the list to be of multiple types, by creating a list of UNIONs. (To handle the {list: ["fred", 10]} case.)
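The cases listed above can be pictured with ordinary Java collections. The following is a minimal sketch, not Drill code: the class JsonListCases and its method describe are hypothetical names introduced here only to show how a null list, null elements, a single element type and mixed element types differ from one another.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical illustration only; this is not Drill code. */
final class JsonListCases {

  /** Names the case a list value falls into, following the bullets above. */
  static String describe(List<?> list) {
    if (list == null) {
      return "null list";                 // the {list: null} case
    }
    boolean hasNull = false;
    boolean mixed = false;
    Class<?> firstType = null;
    for (Object element : list) {
      if (element == null) {
        hasNull = true;                   // a null element inside the list
      } else if (firstType == null) {
        firstType = element.getClass();   // first non-null element fixes the type
      } else if (firstType != element.getClass()) {
        mixed = true;                     // heterogeneous: would need a list of UNIONs
      }
    }
    String kind;
    if (firstType == null) {
      kind = "untyped";                   // empty list, or nothing but nulls
    } else {
      kind = mixed ? "mixed types" : "single type";
    }
    return hasNull ? kind + " with null elements" : kind;
  }

  public static void main(String[] args) {
    System.out.println(describe(null));
    System.out.println(describe(Arrays.asList(10, null, 30)));
    System.out.println(describe(Arrays.asList("fred", 10)));
  }
}
```

A plain Drill repeated type could represent only the "single type" result without nulls; every other result is something the LIST type has to add.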
Background
The above is the theory. The problem is, the goals are very difficult
to achieve, and the code here does not quite do so.
The code here is difficult to maintain and understand.
The first thing to understand is that union vectors are broken in most operators, and so major bugs remain in union and list vectors that have not had to be fixed. Recent revisions attempt to fix or work around some of the bugs, but many remain.
Unions have a null bit for the union itself. That is, a union can be an int, say, or a Varchar, or null. Oddly, the Int and Varchar can also be null (we use nullable vectors so we can mark the unused values as null).
So, we have a two-level null bit. The most logical way to interpret it is
that a union value can be:
- Untyped null, if the type is not set and the null bit (really, the isSet bit) is unset.
- Typed null, if the type is set and EITHER the union's isSet bit is unset OR the union's isSet bit is set, but the data vector's isSet bit is not set.
It is not clear in the code which convention is assumed, or if different code does it differently.
Now, add all that to a list. A list can be a list of something (ints, say, or maps). When the list is a list of maps, the entire value for a row can be null. But individual maps can't be null. In a list, however, individual ints can be null (because we use a nullable int vector).
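The two-level null interpretation of a union value described above can be written down as a small decision function. This is a sketch under stated assumptions, not Drill's UnionVector: the class UnionSlotSketch, the State enum and the three boolean parameters are hypothetical. The sketch accepts both typed-null conventions, because the text says the code does not make clear which one is intended, and it treats any slot without a type as an untyped null, which is also an assumption.

```java
/** Hypothetical model of one union slot; this is not Drill's UnionVector. */
final class UnionSlotSketch {

  enum State { UNTYPED_NULL, TYPED_NULL, VALUE }

  /**
   * @param typeSet     whether a type has been recorded for the slot
   * @param unionIsSet  the union-level isSet bit
   * @param memberIsSet the isSet bit of the nullable member (data) vector
   */
  static State classify(boolean typeSet, boolean unionIsSet, boolean memberIsSet) {
    if (!typeSet) {
      // Assumption: with no type recorded, the slot is an untyped null
      // whatever the isSet bits say.
      return State.UNTYPED_NULL;
    }
    if (!unionIsSet || !memberIsSet) {
      // Either convention marks a typed null: the union bit is unset, or the
      // union bit is set but the member vector's bit is not.
      return State.TYPED_NULL;
    }
    return State.VALUE;
  }
}
```

For example, classify(true, true, false) yields TYPED_NULL under the second convention, while classify(true, false, true) yields TYPED_NULL under the first.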
So, when a list of (non-nullable maps) converts to a list of unions (one of
which is a map), we suddenly now have the list null bit and the union null
bit to worry about. We have to go and back-patch the isSet vector for all
the existing map entries in the new union so that we don't end up with all
previous entries becoming null by default.
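The back-patching step can be illustrated with a plain byte array standing in for the union's isSet vector. This is a minimal sketch, not the actual implementation: the class IsSetBackPatchSketch, the method backPatch and both parameters are hypothetical, and the sketch assumes a newly allocated bits array starts out as all zeros.

```java
import java.util.Arrays;

/** Hypothetical illustration only; this is not Drill code. */
final class IsSetBackPatchSketch {

  /**
   * Builds union-level isSet bits after promoting a list of maps.
   *
   * @param existingCount number of map entries written before the promotion
   * @param capacity      number of entries the new union can hold
   */
  static byte[] backPatch(int existingCount, int capacity) {
    if (existingCount < 0 || existingCount > capacity) {
      throw new IllegalArgumentException("existingCount out of range");
    }
    // A freshly allocated bits array is all zeros, so every entry reads as null.
    byte[] isSet = new byte[capacity];
    // The existing maps were non-nullable, so mark each of them as set.
    Arrays.fill(isSet, 0, existingCount, (byte) 1);
    return isSet;
  }
}
```

Without the fill step, the first existingCount entries would keep their zero bits and the previously written maps would all appear to be null.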
Another issue is that the metadata for a list should reflect the structure of the list. The MaterializedField contains a child field, which points to the element of the list. If that child is a UNION, then the UNION's MaterializedField contains subtypes for each type in the union. Now, note that the LIST's metadata contains the child, so we need to update the LIST's MaterializedField each time we add a type to the UNION. And, since the LIST is part of a row or map, we have to update the metadata in those objects to propagate the change.
The problem is that the original design assumed that MaterializedField is immutable. The above shows that it clearly is not. So, we have a tension between the original immutable design and the necessity of mutating the MaterializedField to keep everything in sync.
Of course, there is another solution: don't include subtypes and children in the MaterializedField; then we don't have the propagation problem.
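To make the propagation problem concrete, the sketch below shows what adding one subtype would cost if field metadata really were immutable: every ancestor of the union has to be rebuilt. This is an illustration only. FieldSketch and MetadataPropagationDemo are hypothetical classes, not Drill's MaterializedField, and the row, list, union nesting with the list as the row's first child is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical immutable field descriptor; this is not Drill's MaterializedField. */
final class FieldSketch {
  final String name;
  final List<FieldSketch> children;

  FieldSketch(String name, List<FieldSketch> children) {
    this.name = name;
    this.children = Collections.unmodifiableList(new ArrayList<FieldSketch>(children));
  }

  static FieldSketch leaf(String name) {
    return new FieldSketch(name, new ArrayList<FieldSketch>());
  }

  /** Returns a copy with one more child; this instance is unchanged. */
  FieldSketch withChild(FieldSketch child) {
    List<FieldSketch> copy = new ArrayList<FieldSketch>(children);
    copy.add(child);
    return new FieldSketch(name, copy);
  }

  /** Returns a copy with the child at the given index replaced. */
  FieldSketch withChildReplaced(int index, FieldSketch replacement) {
    List<FieldSketch> copy = new ArrayList<FieldSketch>(children);
    copy.set(index, replacement);
    return new FieldSketch(name, copy);
  }
}

final class MetadataPropagationDemo {

  /** Adds a subtype to row -> list -> union, rebuilding each level on the way up. */
  static FieldSketch addUnionSubtype(FieldSketch row, String subtype) {
    FieldSketch list = row.children.get(0);
    FieldSketch union = list.children.get(0);
    FieldSketch newUnion = union.withChild(FieldSketch.leaf(subtype)); // new UNION metadata
    FieldSketch newList = list.withChildReplaced(0, newUnion);         // LIST must change too
    return row.withChildReplaced(0, newList);                          // and so must the row
  }
}
```

Any holder of the old row, list or union descriptor still sees the old structure, which is the out-of-sync situation the text describes.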
The code for this class kind of punts on the issue: the metadata is not maintained and can get out of sync. This makes the metadata useless: one must recover actual structure by traversing vectors. There was an attempt to fix this, but doing so changed the metadata structure, which broke clients. So, we have to live with broken metadata and work around the issues. The metadata sync issue exists in many places, but is most obvious in the LIST vector because of the sheer complexity in this class.
This is why the code notes say that this is a mess.
It is hard to simply fix the bugs because this is a design problem. If the list
and union vectors don't need to work (they barely work today), then any
design is fine. See the list of JIRA tickets below for more information.
Fundamental issue: should Drill support unions and lists? Is the current
approach compatible with SQL? Is there a better approach? If such changes
are made, they are breaking changes, and so must be done as part of a major
version, such as the much-discussed "Drill 2.0". Or, perhaps as part of
a conversion to use Apache Arrow, which also would be a major breaking
change.
-
Nested Class Summary
Nested Classes
Nested classes/interfaces inherited from class org.apache.drill.exec.vector.complex.BaseRepeatedValueVector
BaseRepeatedValueVector.BaseRepeatedAccessor, BaseRepeatedValueVector.BaseRepeatedMutator, BaseRepeatedValueVector.BaseRepeatedValueVectorTransferPair<T extends BaseRepeatedValueVector>
Nested classes/interfaces inherited from class org.apache.drill.exec.vector.BaseValueVector
BaseValueVector.BaseAccessor, BaseValueVector.BaseMutator
Nested classes/interfaces inherited from interface org.apache.drill.exec.vector.complex.RepeatedValueVector
RepeatedValueVector.RepeatedAccessor, RepeatedValueVector.RepeatedMutator
-
Field Summary
Fields
Fields inherited from class org.apache.drill.exec.vector.complex.BaseRepeatedValueVector
DATA_VECTOR_NAME, DEFAULT_DATA_VECTOR, offsets, OFFSETS_FIELD, OFFSETS_VECTOR_NAME, vector
Fields inherited from class org.apache.drill.exec.vector.BaseValueVector
allocator, field, INITIAL_VALUE_ALLOCATION, MAX_ALLOCATION_SIZE
Fields inherited from interface org.apache.drill.exec.vector.complex.RepeatedValueVector
DEFAULT_REPEAT_PER_RECORD
Fields inherited from interface org.apache.drill.exec.vector.ValueVector
BITS_VECTOR_NAME, MAX_BUFFER_SIZE, MAX_ROW_COUNT, MIN_ROW_COUNT, VALUES_VECTOR_NAME
-
Constructor Summary
Constructors
Constructor
Description
ListVector(MaterializedField field, BufferAllocator allocator, CallBack callBack)
-
Method Summary
Modifier and Type
Method
Description
<T extends ValueVector> AddOrGetResult<T> addOrGetVector(VectorDescriptor descriptor)
Creates and adds a child vector if none with the same name exists, else returns the vector instance.
void allocateNew()
Allocate new buffers.
boolean allocateNewSafe()
Allocates new buffers.
void clear()
Release the underlying DrillBuf and reset the ValueVector to empty.
void collectLedgers(Set<AllocationManager.BufferLedger> ledgers)
Add the ledgers underlying the buffers underlying the components of the vector to the set provided.
convertToUnion(int allocValueCount, int valueCount)
Promote to a union, preserving the existing data vector as a member of the new union.
void copyEntry(int toIndex, ValueVector from, int fromIndex)
void copyFrom(int inIndex, int outIndex, ListVector from)
void copyFromSafe(int inIndex, int outIndex, ListVector from)
fullPromoteToUnion()
Revised form of promote to union that correctly fixes up the list field metadata to match the new union type.
getAccessor()
Returns an accessor that is used to read from this vector instance.
DrillBuf[] getBuffers(boolean clear)
Return the underlying buffers associated with this vector.
int getBufferSize()
Returns the number of bytes that is used by this vector instance.
protected UserBitShared.SerializedField.Builder getMetadataBuilder()
getMutator()
Returns a mutator that is used to write to this vector instance.
int getPayloadByteCount(int valueCount)
Return the number of value bytes consumed by actual data.
getReader()
Returns a field reader that supports reading values from this vector.
getTransferPair(String ref, BufferAllocator allocator)
boolean isEmptyType()
void load(UserBitShared.SerializedField metadata, DrillBuf buffer)
Load the data provided in the buffer.
makeTransferPair(ValueVector target)
Returns a new transfer pair that is used to transfer underlying buffers into the target vector.
promoteToUnion()
Promote the list to a union.
void setChildVector(ValueVector childVector)
void transferTo(ListVector target)
Methods inherited from class org.apache.drill.exec.vector.complex.BaseRepeatedValueVector
exchange, getAllocatedSize, getBufferSizeFor, getOffsetVector, getValueCapacity, iterator, replaceDataVector, setInitialCapacity, size
Methods inherited from class org.apache.drill.exec.vector.BaseValueVector
checkBufRefs, close, fillBitsVector, getAllocator, getField, getField, getMetadata, getTransferPair, toNullable, toString
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
Methods inherited from interface java.lang.Iterable
forEach, spliterator
Methods inherited from interface org.apache.drill.exec.vector.ValueVector
close, getAllocator, getField, getMetadata, getTransferPair, toNullable
-
Field Details
-
UNION_VECTOR_NAME
-
Constructor Details
-
ListVector
-
Method Details
-
getWriter
-
allocateNew
Description copied from interface: ValueVector
Allocate new buffers. ValueVector implements logic to determine how much to allocate.
- Throws:
OutOfMemoryException
- Thrown if no memory can be allocated.
-
transferTo
-
copyFromSafe
-
copyFrom
-
copyEntry
-
getDataVector
- Specified by:
getDataVector
in interface RepeatedValueVector
- Overrides:
getDataVector
in class BaseRepeatedValueVector
- Returns:
- the underlying data vector or null if none exists.
-
getBitsVector
-
getTransferPair
-
makeTransferPair
Description copied from interface: ValueVector
Returns a new transfer pair that is used to transfer underlying buffers into the target vector.
-
getAccessor
Description copied from interface: ValueVector
Returns an accessor that is used to read from this vector instance.
-
getMutator
Description copied from interface: ValueVector
Returns a mutator that is used to write to this vector instance.
-
getReader
Description copied from interface: ValueVector
Returns a field reader that supports reading values from this vector.
-
allocateNewSafe
public boolean allocateNewSafe()
Description copied from interface: ValueVector
Allocates new buffers. ValueVector implements logic to determine how much to allocate.
- Specified by:
allocateNewSafe
in interface ValueVector
- Overrides:
allocateNewSafe
in class BaseRepeatedValueVector
- Returns:
- Returns true if allocation was successful.
-
getMetadataBuilder
- Overrides:
getMetadataBuilder
in class BaseRepeatedValueVector
-
addOrGetVector
Description copied from interface: ContainerVectorLike
Creates and adds a child vector if none with the same name exists, else returns the vector instance.
- Specified by:
addOrGetVector
in interface ContainerVectorLike
- Overrides:
addOrGetVector
in class BaseRepeatedValueVector
- Parameters:
descriptor
- vector descriptor
- Returns:
- result of operation wrapping vector corresponding to the given descriptor and whether it's newly created
-
getBufferSize
public int getBufferSize()
Description copied from interface: ValueVector
Returns the number of bytes used by this vector instance.
This is a bit of a misnomer. Returns the number of bytes used by
data in this instance.
- Specified by:
getBufferSize
in interface ValueVector
- Overrides:
getBufferSize
in class BaseRepeatedValueVector
-
clear
public void clear()
Description copied from interface: ValueVector
Release the underlying DrillBuf and reset the ValueVector to empty.
- Specified by:
clear
in interface ValueVector
- Overrides:
clear
in class BaseRepeatedValueVector
-
getBuffers
Description copied from interface: ValueVector
Return the underlying buffers associated with this vector. Note that this doesn't impact the reference counts for
this buffer so it only should be used for in-context access. Also note that this buffer changes regularly thus
external classes shouldn't hold a reference to it (unless they change it).
- Specified by:
getBuffers
in interface ValueVector
- Overrides:
getBuffers
in class BaseRepeatedValueVector
- Parameters:
clear
- Whether to clear the vector before returning; the buffers will still be refcounted, but the returned array will be the only reference to them
- Returns:
- The underlying buffers that are used by this vector instance.
-
-
isEmptyType
public boolean isEmptyType()
-
setChildVector
- Overrides:
setChildVector
in class BaseRepeatedValueVector
-
promoteToUnion
Promote the list to a union. Called from old-style writers. This implementation
relies on the caller to set the types vector for any existing values.
This method simply clears the existing vector.
- Returns:
- the new union vector
-
fullPromoteToUnion
Revised form of promote to union that correctly fixes up the list
field metadata to match the new union type. Since this form handles
both the vector and metadata revisions, it is a "full" promotion.
- Returns:
- the new union vector
-
convertToUnion
Promote to a union, preserving the existing data vector as a member of the
new union. Back-fill the types vector with the proper type value for
existing rows.
- Returns:
- the new union vector
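The difference from promoteToUnion, which clears the existing vector and leaves the types to the caller, is that the existing data survives and the types vector is back-filled. The sketch below models that with plain arrays. It is an assumption-laden illustration, not the real method: ConvertToUnionSketch, UnionSketch, NO_TYPE and INT_TYPE are hypothetical, the existing member is assumed to be an int vector, and the meaning given to allocValueCount (capacity of the new union) and valueCount (rows already written) is inferred from the parameter names only.

```java
import java.util.Arrays;

/** Hypothetical illustration only; this is not Drill's ListVector or UnionVector. */
final class ConvertToUnionSketch {

  static final byte NO_TYPE = 0;   // assumed code for "no type recorded"
  static final byte INT_TYPE = 1;  // assumed code for the existing member type

  /** Result of the sketch: per-row type codes plus the preserved member data. */
  static final class UnionSketch {
    final byte[] types;
    final int[] intMember;

    UnionSketch(byte[] types, int[] intMember) {
      this.types = types;
      this.intMember = intMember;
    }
  }

  static UnionSketch convert(int[] existingInts, int allocValueCount, int valueCount) {
    if (valueCount < 0 || valueCount > allocValueCount || valueCount > existingInts.length) {
      throw new IllegalArgumentException("valueCount out of range");
    }
    byte[] types = new byte[allocValueCount];     // every row starts as NO_TYPE
    Arrays.fill(types, 0, valueCount, INT_TYPE);  // back-fill rows that already hold data
    return new UnionSketch(types, existingInts);  // data kept as a member, not cleared
  }
}
```

Rows beyond valueCount keep the NO_TYPE code until something is written to them.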
-
collectLedgers
Description copied from interface: ValueVector
Add the ledgers underlying the buffers underlying the components of the
vector to the set provided. Used to determine actual memory allocation.
- Specified by:
collectLedgers
in interface ValueVector
- Overrides:
collectLedgers
in class BaseRepeatedValueVector
- Parameters:
ledgers
- set of ledgers to which to add ledgers for this vector
-
getPayloadByteCount
public int getPayloadByteCount(int valueCount)
Description copied from interface: ValueVector
Return the number of value bytes consumed by actual data.
- Specified by:
getPayloadByteCount
in interface ValueVector
- Overrides:
getPayloadByteCount
in class BaseRepeatedValueVector