org.apache.drill.exec.vector.complex.ListVector

All Implemented Interfaces: Closeable, AutoCloseable, Iterable<ValueVector>, ContainerVectorLike, RepeatedValueVector, ValueVector

public class ListVector extends BaseRepeatedValueVector

"Non-repeated" LIST vector. This vector holds some other vector as its data element. Unlike a repeated vector, the child element can change dynamically. It starts as nothing (the LATE type). It can then change to a single type (typically a map but can be anything.) If another type is needed, the list morphs again, this time to a list of unions. The prior single type becomes a member of the new union (which requires back-filling is-set values.)

Why this odd behavior? The LIST type apparently attempts to model certain JSON types. In JSON, we can have lists like this:


 {a: [null, null]}
 {a: [10, "foo"]}
 {a: [{name: "fred", balance: 10}, null]}
 {a: null}

Compared with Drill, JSON has a number of additional list-related abilities:

A list can be null. (In Drill, an array can be empty, but not null.)
A list element can be null. (In Drill, a repeated type is an array of non-nullable elements, so list elements can't be null.)
A list can contain heterogeneous types. (In Drill, repeated types are arrays of a single type.)

The LIST vector is an attempt to implement full JSON behavior. The list:

Allows the list value for a row to be null. (To handle the {list: null} case.)
Allows list elements to be null. (To handle the {list: [10, null, 30]} case.)
Allows the list to be a single type. (To handle the list of nullable ints above.)
Allows the list to be of multiple types, by creating a list of UNIONs. (To handle the {list: ["fred", 10]} case.)
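
The progression of the element type (nothing, then a single type, then a union) can be pictured with a small toy model. This is only a sketch under stated assumptions: ListTypeSketch, Kind, and the string type names are hypothetical stand-ins for illustration and are not part of Drill's API; the real class also has to manage vectors, offsets, and is-set bits.

```java
import java.util.LinkedHashSet;
import java.util.Set;

/** Toy model of a LIST's element type; hypothetical, not Drill's ListVector. */
class ListTypeSketch {
    enum Kind { LATE, SINGLE, UNION }

    // Type names seen so far, in the order they were first written.
    private final Set<String> memberTypes = new LinkedHashSet<>();

    /** Records that a value of the given (hypothetical) type name was written. */
    void addType(String typeName) {
        memberTypes.add(typeName);
    }

    /** The element starts as LATE, becomes a single type, then a UNION. */
    Kind kind() {
        if (memberTypes.isEmpty()) {
            return Kind.LATE;
        }
        return memberTypes.size() == 1 ? Kind.SINGLE : Kind.UNION;
    }

    Set<String> memberTypes() {
        return memberTypes;
    }

    public static void main(String[] args) {
        ListTypeSketch list = new ListTypeSketch();
        System.out.println(list.kind());   // no element type yet
        list.addType("MAP");
        System.out.println(list.kind());   // exactly one element type
        list.addType("VARCHAR");           // the prior type stays as a union member
        System.out.println(list.kind() + " of " + list.memberTypes());
    }
}
```

Note that the earlier single type is kept as a member when the second type arrives, which mirrors the statement that the prior type becomes a member of the new union.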



 Background

 The above is the theory. The problem is, the goals are very difficult
 to achieve, and the code here does not quite do so.
 The code here is difficult to maintain and understand.
 The first thing to understand is that union vectors are
 broken in most operators, and so major bugs remain in union
 and list vectors that
 have not had to be fixed. Recent revisions attempt to
 fix or work around some of the bugs, but many remain.
 
 Unions have a null bit for the union itself. That is, a union can be an
 int, say, or a Varchar, or null. Oddly, the Int and Varchar can also be
 null (we use nullable vectors so we can mark the unused values as null.)
 So, we have a two-level null bit. The most logical way to interpret it is
 that a union value can be:
 

 Untyped null (if the type is not set and the null bit (really, the isSet
 bit) is unset.)
 Typed null if the type is set and EITHER the union's isSet bit is unset OR
 the union's isSet bit is set, but the data vector's isSet bit is not set.

 It is not clear in the code which convention is assumed, or
 if different code does it differently.
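
One way to read the two-level null convention above is as a small decision function. The sketch below is an interpretation, not the code's actual behavior: UnionNullSketch, State, and classify are hypothetical names, and the handling of a slot with no type but a set union bit is an assumption, since the text does not define that case.

```java
/** Toy classification of one union slot; hypothetical, not Drill's API. */
class UnionNullSketch {
    enum State { UNTYPED_NULL, TYPED_NULL, VALUE }

    /**
     * @param typeSet    whether a type has been recorded for the slot
     * @param unionIsSet the union's own isSet bit
     * @param dataIsSet  the isSet bit of the nullable member vector
     */
    static State classify(boolean typeSet, boolean unionIsSet, boolean dataIsSet) {
        if (!typeSet) {
            // Assumption: any slot without a type counts as an untyped null,
            // even if the union bit happens to be set.
            return State.UNTYPED_NULL;
        }
        if (!unionIsSet || !dataIsSet) {
            // Either level can mark a typed slot as null.
            return State.TYPED_NULL;
        }
        return State.VALUE;
    }

    public static void main(String[] args) {
        System.out.println(classify(false, false, false));
        System.out.println(classify(true, false, true));
        System.out.println(classify(true, true, false));
        System.out.println(classify(true, true, true));
    }
}
```
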
 Now, add all that to a list. A list can be a list of something (ints, say,
 or maps.) When the list is a list of maps, the entire value for a row can
 be null. But individual maps can't be null. In a list, however, individual
 ints can be null (because we use a nullable int vector.)
 
 So, when a list of (non-nullable maps) converts to a list of unions (one of
 which is a map), we suddenly now have the list null bit and the union null
 bit to worry about. We have to go and back-patch the isSet vector for all
 the existing map entries in the new union so that we don't end up with all
 previous entries becoming null by default.
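
The back-patching step can be sketched with a plain boolean array standing in for the union's isSet vector. This is a simplified illustration under stated assumptions: BackFillSketch and its methods are hypothetical, and a real bits vector is a Drill buffer, not a Java array.

```java
import java.util.Arrays;

/** Toy back-fill of union isSet flags; hypothetical, not Drill's API. */
class BackFillSketch {

    /**
     * Builds union-level isSet flags for a list that held valueCount
     * non-nullable maps before promotion.
     */
    static boolean[] backFillIsSet(int valueCount, int capacity) {
        if (valueCount < 0 || valueCount > capacity) {
            throw new IllegalArgumentException("valueCount out of range");
        }
        // A fresh array is all false, so every existing entry would read as null.
        boolean[] unionIsSet = new boolean[capacity];
        // Mark the entries that already hold a map as set.
        Arrays.fill(unionIsSet, 0, valueCount, true);
        return unionIsSet;
    }

    /** Counts how many of the first valueCount entries read as null. */
    static int countNulls(boolean[] isSet, int valueCount) {
        int nulls = 0;
        for (int i = 0; i < valueCount; i++) {
            if (!isSet[i]) {
                nulls++;
            }
        }
        return nulls;
    }

    public static void main(String[] args) {
        boolean[] withoutBackFill = new boolean[5];
        boolean[] withBackFill = backFillIsSet(3, 5);
        System.out.println(countNulls(withoutBackFill, 3));
        System.out.println(countNulls(withBackFill, 3));
    }
}
```

Without the fill, all three existing maps would appear null after promotion; with it, none do.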
 
 Another issue is that the metadata for a list should reflect the structure
 of the list. The MaterializedField contains a child field, which
 points to the element of the list. If that child is a UNION, then the UNION's
 MaterializedField contains subtypes for each type in the
 union. Now, note that the LIST's metadata contains the child, so we need
 to update the LIST's MaterializedField each time we add a
 type to the UNION. And, since the LIST is part of a row or map, then we
 have to update the metadata in those objects to propagate the change.
 

 The problem is that the original design assumed that
 MaterializedField is immutable. The above shows that it
 clearly is not. So, we have a tension between the original immutable
 design and the necessity of mutating the MaterializedField
 to keep everything in sync.
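
To see why an immutable design forces the change to propagate, consider a toy immutable field tree. This is a sketch only: FieldSketch is a hypothetical stand-in and does not reproduce MaterializedField or its methods.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Toy immutable field; hypothetical, not Drill's MaterializedField. */
final class FieldSketch {
    final String name;
    final List<FieldSketch> children;

    private FieldSketch(String name, List<FieldSketch> children) {
        this.name = name;
        this.children = Collections.unmodifiableList(new ArrayList<>(children));
    }

    static FieldSketch of(String name, FieldSketch... children) {
        return new FieldSketch(name, Arrays.asList(children));
    }

    /** Returns a copy with one more child appended. */
    FieldSketch plusChild(FieldSketch child) {
        List<FieldSketch> copy = new ArrayList<>(children);
        copy.add(child);
        return new FieldSketch(name, copy);
    }

    /** Returns a copy with the child at the given index replaced. */
    FieldSketch withChild(int index, FieldSketch replacement) {
        List<FieldSketch> copy = new ArrayList<>(children);
        copy.set(index, replacement);
        return new FieldSketch(name, copy);
    }

    @Override
    public String toString() {
        return children.isEmpty() ? name : name + children;
    }

    public static void main(String[] args) {
        FieldSketch union = of("UNION", of("MAP"));
        FieldSketch list = of("LIST", union);
        FieldSketch row = of("ROW", list);

        // Adding VARCHAR to the union requires new LIST and ROW objects too.
        FieldSketch newUnion = union.plusChild(of("VARCHAR"));
        FieldSketch newList = list.withChild(0, newUnion);
        FieldSketch newRow = row.withChild(0, newList);

        System.out.println(row);
        System.out.println(newRow);
    }
}
```

Every holder of the old row object still sees the old structure, which is the synchronization problem described above.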
 

 Of course, there is another solution: don't include subtypes and children
 in the MaterializedField; then we don't have the propagation
 problem.
 

 The code for this class kind of punts on the issue: the metadata is not
 maintained and can get out of sync. This makes the metadata useless: one must
 recover actual structure by traversing vectors. There was an attempt to fix
 this, but doing so changes the metadata structure, which broke clients. So,
 we have to live with broken metadata and work around the issues. The metadata
 sync issue exists in many places, but is most obvious in the LIST vector
 because of the sheer complexity in this class.
 

 This is why the code notes say that this is a mess.
 

 It is hard to simply fix the bugs because this is a design problem. If the list
 and union vectors don't need to work (they barely work today), then any
 design is fine. See the list of JIRA tickets below for more information.
 

 Fundamental issue: should Drill support unions and lists? Is the current
 approach compatible with SQL? Is there a better approach? If such changes
 are made, they are breaking changes, and so must be done as part of a major
 version, such as the much-discussed "Drill 2.0". Or, perhaps as part of
 a conversion to use Apache Arrow, which also would be a major breaking
 change.







Nested Class Summary
Nested Classes

Modifier and Type
Class
Description
class 
ListVector.Accessor
 
class 
ListVector.Mutator
 


Nested classes/interfaces inherited from class org.apache.drill.exec.vector.complex.BaseRepeatedValueVector
BaseRepeatedValueVector.BaseRepeatedAccessor, BaseRepeatedValueVector.BaseRepeatedMutator, BaseRepeatedValueVector.BaseRepeatedValueVectorTransferPair<T extends BaseRepeatedValueVector>

Nested classes/interfaces inherited from class org.apache.drill.exec.vector.BaseValueVector
BaseValueVector.BaseAccessor, BaseValueVector.BaseMutator

Nested classes/interfaces inherited from interface org.apache.drill.exec.vector.complex.RepeatedValueVector
RepeatedValueVector.RepeatedAccessor, RepeatedValueVector.RepeatedMutator





Field Summary
Fields

Modifier and Type
Field
Description
static final String
UNION_VECTOR_NAME
 


Fields inherited from class org.apache.drill.exec.vector.complex.BaseRepeatedValueVector
DATA_VECTOR_NAME, DEFAULT_DATA_VECTOR, offsets, OFFSETS_FIELD, OFFSETS_VECTOR_NAME, vector

Fields inherited from class org.apache.drill.exec.vector.BaseValueVector
allocator, field, INITIAL_VALUE_ALLOCATION, MAX_ALLOCATION_SIZE

Fields inherited from interface org.apache.drill.exec.vector.complex.RepeatedValueVector
DEFAULT_REPEAT_PER_RECORD

Fields inherited from interface org.apache.drill.exec.vector.ValueVector
BITS_VECTOR_NAME, MAX_BUFFER_SIZE, MAX_ROW_COUNT, MIN_ROW_COUNT, VALUES_VECTOR_NAME





Constructor Summary
Constructors

Constructor
Description
ListVector(MaterializedField field,
 BufferAllocator allocator,
 CallBack callBack)
 






Method Summary




Modifier and Type
Method
Description
<T extends ValueVector>
AddOrGetResult<T>
addOrGetVector(VectorDescriptor descriptor)

Creates and adds a child vector if none with the same name exists, else returns the vector instance.

void
allocateNew()

Allocate new buffers.

boolean
allocateNewSafe()

Allocates new buffers.

void
clear()

Release the underlying DrillBuf and reset the ValueVector to empty.

void
collectLedgers(Set<AllocationManager.BufferLedger> ledgers)

Add the ledgers underlying the buffers underlying the components of the
 vector to the set provided.

UnionVector
convertToUnion(int allocValueCount,
 int valueCount)

Promote to a union, preserving the existing data vector as a member of the
 new union.

void
copyEntry(int toIndex,
 ValueVector from,
 int fromIndex)
 
void
copyFrom(int inIndex,
 int outIndex,
 ListVector from)
 
void
copyFromSafe(int inIndex,
 int outIndex,
 ListVector from)
 
UnionVector
fullPromoteToUnion()

Revised form of promote to union that correctly fixes up the list
 field metadata to match the new union type.

ListVector.Accessor
getAccessor()

Returns an accessor that is used to read from this vector
 instance.

ValueVector
getBitsVector()
 
DrillBuf[]
getBuffers(boolean clear)

Return the underlying buffers associated with this vector.

int
getBufferSize()

Returns the number of bytes that is used by this vector instance.

ValueVector
getDataVector()
 
protected UserBitShared.SerializedField.Builder
getMetadataBuilder()
 
ListVector.Mutator
getMutator()

Returns a mutator that is used to write to this vector
 instance.

int
getPayloadByteCount(int valueCount)

Return the number of value bytes consumed by actual data.

FieldReader
getReader()

Returns a field reader that supports reading values
 from this vector.

TransferPair
getTransferPair(String ref,
 BufferAllocator allocator)
 
UnionListWriter
getWriter()
 
boolean
isEmptyType()
 
void
load(UserBitShared.SerializedField metadata,
 DrillBuf buffer)

Load the data provided in the buffer.

TransferPair
makeTransferPair(ValueVector target)

Returns a new transfer pair that is used to transfer underlying
 buffers into the target vector.

UnionVector
promoteToUnion()

Promote the list to a union.

void
setChildVector(ValueVector childVector)
 
void
transferTo(ListVector target)
 




Methods inherited from class org.apache.drill.exec.vector.complex.BaseRepeatedValueVector
exchange, getAllocatedSize, getBufferSizeFor, getOffsetVector, getValueCapacity, iterator, replaceDataVector, setInitialCapacity, size

Methods inherited from class org.apache.drill.exec.vector.BaseValueVector
checkBufRefs, close, fillBitsVector, getAllocator, getField, getField, getMetadata, getTransferPair, toNullable, toString

Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait

Methods inherited from interface java.lang.Iterable
forEach, spliterator

Methods inherited from interface org.apache.drill.exec.vector.ValueVector
close, getAllocator, getField, getMetadata, getTransferPair, toNullable









Field Details



UNION_VECTOR_NAME
public static final String UNION_VECTOR_NAME

See Also:


Constant Field Values











Constructor Details



ListVector
public ListVector(MaterializedField field,
 BufferAllocator allocator,
 CallBack callBack)








Method Details



getWriter
public UnionListWriter getWriter()




allocateNew
public void allocateNew()
                 throws OutOfMemoryException
Description copied from interface: ValueVector
Allocate new buffers. ValueVector implements logic to determine how much to allocate.

Throws:
OutOfMemoryException - Thrown if no memory can be allocated.





transferTo
public void transferTo(ListVector target)




copyFromSafe
public void copyFromSafe(int inIndex,
 int outIndex,
 ListVector from)




copyFrom
public void copyFrom(int inIndex,
 int outIndex,
 ListVector from)




copyEntry
public void copyEntry(int toIndex,
 ValueVector from,
 int fromIndex)




getDataVector
public ValueVector getDataVector()

Specified by:
getDataVector in interface RepeatedValueVector
Overrides:
getDataVector in class BaseRepeatedValueVector
Returns:
the underlying data vector or null if none exists.





getBitsVector
public ValueVector getBitsVector()




getTransferPair
public TransferPair getTransferPair(String ref,
 BufferAllocator allocator)




makeTransferPair
public TransferPair makeTransferPair(ValueVector target)
Description copied from interface: ValueVector
Returns a new transfer pair that is used to transfer underlying
 buffers into the target vector.




getAccessor
public ListVector.Accessor getAccessor()
Description copied from interface: ValueVector
Returns an accessor that is used to read from this vector
 instance.




getMutator
public ListVector.Mutator getMutator()
Description copied from interface: ValueVector
Returns a mutator that is used to write to this vector
 instance.




getReader
public FieldReader getReader()
Description copied from interface: ValueVector
Returns a field reader that supports reading values
 from this vector.




allocateNewSafe
public boolean allocateNewSafe()
Description copied from interface: ValueVector
Allocates new buffers. ValueVector implements logic to determine how much to allocate.

Specified by:
allocateNewSafe in interface ValueVector
Overrides:
allocateNewSafe in class BaseRepeatedValueVector
Returns:
Returns true if allocation was successful.





getMetadataBuilder
protected UserBitShared.SerializedField.Builder getMetadataBuilder()

Overrides:
getMetadataBuilder in class BaseRepeatedValueVector





addOrGetVector
public <T extends ValueVector> AddOrGetResult<T> addOrGetVector(VectorDescriptor descriptor)
Description copied from interface: ContainerVectorLike
Creates and adds a child vector if none with the same name exists, else returns the vector instance.

Specified by:
addOrGetVector in interface ContainerVectorLike
Overrides:
addOrGetVector in class BaseRepeatedValueVector
Parameters:
descriptor - vector descriptor
Returns:
result of operation wrapping vector corresponding to the given descriptor and whether it's newly created





getBufferSize
public int getBufferSize()
Description copied from interface: ValueVector
Returns the number of bytes that is used by this vector instance.
 This is a bit of a misnomer. Returns the number of bytes used by
 data in this instance.

Specified by:
getBufferSize in interface ValueVector
Overrides:
getBufferSize in class BaseRepeatedValueVector





clear
public void clear()
Description copied from interface: ValueVector
Release the underlying DrillBuf and reset the ValueVector to empty.

Specified by:
clear in interface ValueVector
Overrides:
clear in class BaseRepeatedValueVector





getBuffers
public DrillBuf[] getBuffers(boolean clear)
Description copied from interface: ValueVector
Return the underlying buffers associated with this vector. Note that this doesn't impact the reference counts for
 this buffer so it only should be used for in-context access. Also note that this buffer changes regularly thus
 external classes shouldn't hold a reference to it (unless they change it).

Specified by:
getBuffers in interface ValueVector
Overrides:
getBuffers in class BaseRepeatedValueVector
Parameters:
clear - Whether to clear vector before returning; the buffers will still be refcounted;
   but the returned array will be the only reference to them
Returns:
The underlying buffers that are used by this vector instance.





load
public void load(UserBitShared.SerializedField metadata,
 DrillBuf buffer)
Description copied from interface: ValueVector
Load the data provided in the buffer. Typically used when deserializing from the wire.

Specified by:
load in interface ValueVector
Overrides:
load in class BaseRepeatedValueVector
Parameters:
metadata - Metadata used to decode the incoming buffer.
buffer - The buffer that contains the ValueVector.





isEmptyType
public boolean isEmptyType()




setChildVector
public void setChildVector(ValueVector childVector)

Overrides:
setChildVector in class BaseRepeatedValueVector





promoteToUnion
public UnionVector promoteToUnion()
Promote the list to a union. Called from old-style writers. This implementation
 relies on the caller to set the types vector for any existing values.
 This method simply clears the existing vector.

Returns:
the new union vector





fullPromoteToUnion
public UnionVector fullPromoteToUnion()
Revised form of promote to union that correctly fixes up the list
 field metadata to match the new union type. Since this form handles
 both the vector and metadata revisions, it is a "full" promotion.

Returns:
the new union vector





convertToUnion
public UnionVector convertToUnion(int allocValueCount,
 int valueCount)
Promote to a union, preserving the existing data vector as a member of the
 new union. Back-fill the types vector with the proper type value for
 existing rows.

Returns:
the new union vector





collectLedgers
public void collectLedgers(Set<AllocationManager.BufferLedger> ledgers)
Description copied from interface: ValueVector
Add the ledgers underlying the buffers underlying the components of the
 vector to the set provided. Used to determine actual memory allocation.

Specified by:
collectLedgers in interface ValueVector
Overrides:
collectLedgers in class BaseRepeatedValueVector
Parameters:
ledgers - set of ledgers to which to add ledgers for this vector





getPayloadByteCount
public int getPayloadByteCount(int valueCount)
Description copied from interface: ValueVector
Return the number of value bytes consumed by actual data.

Specified by:
getPayloadByteCount in interface ValueVector
Overrides:
getPayloadByteCount in class BaseRepeatedValueVector
