Class ScanSchemaOrchestrator
Provides the option to continue a schema from one batch to the next. This can reduce spurious schema changes in formats, such as JSON, with varying fields. It is not, however, a complete solution as the outcome still depends on the order of file scans and the division of files across readers.
Provides the option to infer the schema from the first batch. The "quick path" to obtain the schema will read one batch, then use that schema as the returned schema, returning the full batch in the next call to next().
Publishing the Final Result Set
This class "publishes" a vector container that has the final, projected
form of a scan. The projected schema include:
- Columns from the reader.
- Static columns, such as implicit or partition columns.
- Null columns for items in the select list, but not found in either
of the above two categories.
The order of columns is that set by the select list (or, by the reader for
a SELECT * query.
Schema Handling
The mapping handles a variety of cases:
- An early-schema table (one in which we know the schema and
the schema remains constant for the whole table.
- A late schema table (one in which we discover the schema as
we read the table, and where the schema can change as the read
progresses.)
- Late schema table with SELECT * (we want all columns, whatever
they happen to be.)
- Late schema with explicit select list (we want only certain
columns when they happen to appear in the input.)
Implementation Overview
Major tasks of this class include:
- Project table columns (change position and or name).
- Insert static and null columns.
- Schema smoothing. That is, if table A produces columns (a, b), but
table B produces only (a), use the type of the first table's b column for the
null created for the missing b in table B.
- Vector persistence: use the same set of vectors across readers as long
as the reader schema does not cause a "hard" schema change (change in type,
introduction of a new column.
- Detection of schema changes (change of type, introduction of a new column
for a SELECT * query, changing the projected schema, and reporting
the change downstream.
A projection is needed to:
- Reorder table columns
- Select a subset of table columns
- Fill in missing select columns
- Fill in implicit or partition columns
Creates and returns the batch merger that does the projection.
Projection
To visualize this, assume we have numbered table columns, lettered
implicit, null or partition columns:
[ 1 | 2 | 3 | 4 ] Table columns in table order
[ A | B | C ] Static columns
Now, we wish to project them into select order.
Let's say that the SELECT clause looked like this, with "t"
indicating table columns:
SELECT t2, t3, C, B, t1, A, t2 ...
Then the projection looks like this:
[ 2 | 3 | C | B | 1 | A | 2 ]
Often, not all table columns are projected. In this case, the
result set loader presents the full table schema to the reader,
but actually writes only the projected columns. Suppose we
have:
SELECT t3, C, B, t1,, A ...
Then the abbreviated table schema looks like this:
[ 1 | 3 ]
Note that table columns retain their table ordering.
The projection looks like this:
[ 2 | C | B | 1 | A ]
- Columns from the reader.
- Static columns, such as implicit or partition columns.
- Null columns for items in the select list, but not found in either of the above two categories.
Schema Handling
The mapping handles a variety of cases:- An early-schema table (one in which we know the schema and the schema remains constant for the whole table.
- A late schema table (one in which we discover the schema as
we read the table, and where the schema can change as the read
progresses.)
- Late schema table with SELECT * (we want all columns, whatever they happen to be.)
- Late schema with explicit select list (we want only certain columns when they happen to appear in the input.)
Implementation Overview
Major tasks of this class include:- Project table columns (change position and or name).
- Insert static and null columns.
- Schema smoothing. That is, if table A produces columns (a, b), but table B produces only (a), use the type of the first table's b column for the null created for the missing b in table B.
- Vector persistence: use the same set of vectors across readers as long as the reader schema does not cause a "hard" schema change (change in type, introduction of a new column.
- Detection of schema changes (change of type, introduction of a new column for a SELECT * query, changing the projected schema, and reporting the change downstream.
- Reorder table columns
- Select a subset of table columns
- Fill in missing select columns
- Fill in implicit or partition columns
Projection
To visualize this, assume we have numbered table columns, lettered implicit, null or partition columns:
[ 1 | 2 | 3 | 4 ] Table columns in table order
[ A | B | C ] Static columns
Now, we wish to project them into select order.
Let's say that the SELECT clause looked like this, with "t"
indicating table columns:
SELECT t2, t3, C, B, t1, A, t2 ...
Then the projection looks like this:
[ 2 | 3 | C | B | 1 | A | 2 ]
Often, not all table columns are projected. In this case, the
result set loader presents the full table schema to the reader,
but actually writes only the projected columns. Suppose we
have:
SELECT t3, C, B, t1,, A ...
Then the abbreviated table schema looks like this:
[ 1 | 3 ]
Note that table columns retain their table ordering.
The projection looks like this:
[ 2 | C | B | 1 | A ]
The projector is created once per schema, then can be reused for any number of batches.
Merging is done in one of two ways, depending on the input source:
- For the table loader, the merger discards any data in the output, then exchanges the buffers from the input columns to the output, leaving projected columns empty. Note that unprojected columns must be cleared by the caller.
- For implicit and null columns, the output vector is identical to the input vector.
-
Nested Class Summary
Modifier and TypeClassDescriptionstatic class
static class
-
Field Summary
Modifier and TypeFieldDescriptionprotected final BufferAllocator
protected int
static final int
static final int
static final int
static final int
final MetadataManager
Creates the metadata (file and directory) columns, if needed.static final int
protected final ScanSchemaOrchestrator.ScanSchemaOptions
protected VectorContainer
protected long
protected final ScanLevelProjection
protected final SchemaSmoother
protected final ResultVectorCacheImpl
Cache used to preserve the same vectors from one output batch to the next to keep the Project operator happy (which depends on exactly the same vectors. -
Constructor Summary
ConstructorDescriptionScanSchemaOrchestrator
(BufferAllocator allocator, ScanSchemaOrchestrator.ScanOrchestratorBuilder builder) -
Method Summary
Modifier and TypeMethodDescriptionboolean
atLimit()
void
close()
void
boolean
boolean
output()
Returns the provided reader schema.void
tallyBatch
(int rowCount)
-
Field Details
-
MIN_BATCH_BYTE_SIZE
public static final int MIN_BATCH_BYTE_SIZE- See Also:
-
MAX_BATCH_BYTE_SIZE
public static final int MAX_BATCH_BYTE_SIZE- See Also:
-
DEFAULT_BATCH_ROW_COUNT
public static final int DEFAULT_BATCH_ROW_COUNT- See Also:
-
DEFAULT_BATCH_BYTE_COUNT
public static final int DEFAULT_BATCH_BYTE_COUNT -
MAX_BATCH_ROW_COUNT
public static final int MAX_BATCH_ROW_COUNT- See Also:
-
allocator
-
options
-
metadataManager
Creates the metadata (file and directory) columns, if needed. -
vectorCache
Cache used to preserve the same vectors from one output batch to the next to keep the Project operator happy (which depends on exactly the same vectors.If the Project operator ever changes so that it depends on looking up vectors rather than vector instances, this cache can be deprecated.
-
scanProj
-
schemaSmoother
-
batchCount
protected int batchCount -
rowCount
protected long rowCount -
outputContainer
-
-
Constructor Details
-
ScanSchemaOrchestrator
public ScanSchemaOrchestrator(BufferAllocator allocator, ScanSchemaOrchestrator.ScanOrchestratorBuilder builder)
-
-
Method Details
-
startReader
-
isProjectNone
public boolean isProjectNone() -
hasSchema
public boolean hasSchema() -
providedSchema
Returns the provided reader schema. -
output
-
tallyBatch
public void tallyBatch(int rowCount) -
atLimit
public boolean atLimit() -
closeReader
public void closeReader() -
close
public void close()
-