public class HashJoinBatch extends AbstractBinaryRecordBatch<HashJoinPOP> implements RowKeyJoin
This implementation splits the incoming Build side rows into multiple
Partitions, thus allowing spilling of some of these partitions to disk if
memory gets tight. Each partition is implemented as a HashPartition
.
After the build phase is over, in the most general case, some of the
partitions were spilled, and the others are in memory. Each of the partitions
in memory would get a HashTable
built.
Next the Probe side is read, and each row is key matched with a Build partition. If that partition is in memory, then the key is used to probe and perform the join, and the results are added to the outgoing batch. But if that build side partition was spilled, then the matching Probe size partition is spilled as well.
After all the Probe side was processed, we are left with pairs of spilled partitions. Then each pair is processed individually (that Build partition should be smaller than the original, hence likely fit whole into memory to allow probing; if not -- see below).
Processing of each spilled pair is EXACTLY like processing the original
Build/Probe incomings. (As a fact, the innerNext()
method calls
itself recursively !!). Thus the spilled build partition is read and divided
into new partitions, which in turn may spill again (and again...). The code
tracks these spilling "cycles". Normally any such "again" (i.e. cycle of 2 or
greater) is a waste, indicating that the number of partitions chosen was too
small.
Modifier and Type | Class and Description |
---|---|
static class |
HashJoinBatch.HashJoinSpilledPartition
This holds information about the spilled partitions for the build and probe
side.
|
class |
HashJoinBatch.HashJoinUpdater |
static class |
HashJoinBatch.Metric |
AbstractRecordBatch.BatchState
RowKeyJoin.RowKeyJoinState
RecordBatch.IterOutcome
batchMemoryManager, left, LEFT_INDEX, leftUpstream, numInputs, right, RIGHT_INDEX, rightUpstream
batchStatsContext, container, context, oContext, state, stats, unionTypeEnabled
MAX_BATCH_ROW_COUNT
Constructor and Description |
---|
HashJoinBatch(HashJoinPOP popConfig,
FragmentContext context,
RecordBatch left,
RecordBatch right)
The constructor
|
Modifier and Type | Method and Description |
---|---|
protected void |
buildSchema() |
protected void |
cancelIncoming() |
void |
close() |
void |
dump()
Perform dump of this batch's state to logs.
|
RecordBatch.IterOutcome |
executeBuildPhase()
Execute the BUILD phase; first read incoming and split rows into
partitions; may decide to spill some of the partitions
|
AbstractRecordBatch.BatchState |
getBatchState()
Get the current BatchState (this is useful when performing row key join)
|
HashJoinMemoryCalculator |
getCalculatorImpl()
Determines the memory calculator to use.
|
int |
getRecordCount()
Get the number of records.
|
RowKeyJoin.RowKeyJoinState |
getRowKeyJoinState()
Get the current RowKeyJoinState.
|
boolean |
hasRowKeyBatch()
Is the next batch of row keys ready to be returned
|
RecordBatch.IterOutcome |
innerNext() |
boolean |
isSpilledInner(int part) |
String |
makeDebugString()
This creates a string that summarizes the memory usage of the operator.
|
org.apache.commons.lang3.tuple.Pair<ValueVector,Integer> |
nextRowKeyBatch()
Get the hash table iterator that is created for the build side of the hash
join if this hash join was instantiated as a row-key join.
|
void |
setBatchState(AbstractRecordBatch.BatchState newState)
Set the BatchState (this is useful when performing row key join)
|
void |
setRowKeyJoinState(RowKeyJoin.RowKeyJoinState newState)
Set the RowKeyJoinState (this is useful for maintaining state for row key join algorithm)
|
HashJoinProbe |
setupHashJoinProbe() |
void |
updateMetrics() |
checkForEarlyFinish, getBatchMemoryManager, prefetchFirstBatchFromBothSides, updateBatchMemoryManagerStats, verifyOutcomeToSetBatchState
cancel, checkContinue, getContainer, getContext, getOutgoingContainer, getPopConfig, getRecordBatchStatsContext, getSchema, getSelectionVector2, getSelectionVector4, getValueAccessorById, getValueVectorId, getWritableBatch, isRecordBatchStatsLoggingEnabled, iterator, next, next, next, schemaChangeException, schemaChangeException
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
forEach, spliterator
public HashJoinBatch(HashJoinPOP popConfig, FragmentContext context, RecordBatch left, RecordBatch right) throws OutOfMemoryException
popConfig
- context
- left
- -- probe/outer side incoming inputright
- -- build/iner side incoming inputOutOfMemoryException
public int getRecordCount()
VectorAccessible
getRecordCount
in interface VectorAccessible
protected void buildSchema()
buildSchema
in class AbstractRecordBatch<HashJoinPOP>
public HashJoinMemoryCalculator getCalculatorImpl()
public RecordBatch.IterOutcome innerNext()
innerNext
in class AbstractRecordBatch<HashJoinPOP>
public RecordBatch.IterOutcome executeBuildPhase() throws SchemaChangeException
RecordBatch.IterOutcome
if a
termination condition is reached. Otherwise returns null.SchemaChangeException
public boolean isSpilledInner(int part)
public String makeDebugString()
public org.apache.commons.lang3.tuple.Pair<ValueVector,Integer> nextRowKeyBatch()
nextRowKeyBatch
in interface RowKeyJoin
public boolean hasRowKeyBatch()
RowKeyJoin
hasRowKeyBatch
in interface RowKeyJoin
public AbstractRecordBatch.BatchState getBatchState()
RowKeyJoin
getBatchState
in interface RowKeyJoin
public void setBatchState(AbstractRecordBatch.BatchState newState)
RowKeyJoin
setBatchState
in interface RowKeyJoin
protected void cancelIncoming()
cancelIncoming
in class AbstractBinaryRecordBatch<HashJoinPOP>
public void updateMetrics()
public void setRowKeyJoinState(RowKeyJoin.RowKeyJoinState newState)
RowKeyJoin
setRowKeyJoinState
in interface RowKeyJoin
public RowKeyJoin.RowKeyJoinState getRowKeyJoinState()
RowKeyJoin
getRowKeyJoinState
in interface RowKeyJoin
public void close()
close
in interface AutoCloseable
close
in class AbstractRecordBatch<HashJoinPOP>
public HashJoinProbe setupHashJoinProbe()
public void dump()
RecordBatch
dump
in interface RecordBatch
Copyright © 1970 The Apache Software Foundation. All rights reserved.