Class BaseScalarWriter

All Implemented Interfaces:
ColumnWriter, ScalarWriter, ValueWriter, WriterEvents, WriterPosition
Direct Known Subclasses:
AbstractFixedWidthWriter, BaseVarWidthWriter

public abstract class BaseScalarWriter extends AbstractScalarWriterImpl
Column writer implementation that acts as the basis for the generated, vector-specific implementations. All set methods throw an exception; subclasses simply override the supported method(s).
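
The "reject by default, override what you support" pattern described above can be sketched as follows. This is a minimal illustration, not the actual Drill code: the class names `SketchBaseWriter` and `SketchIntWriter` are hypothetical, and `UnsupportedOperationException` is an assumed stand-in for whatever exception the real setters throw.

```java
// Hypothetical, simplified stand-ins; not the actual Drill classes.
class SetterDefaultsSketch {

  /** Base writer: every setter rejects the call unless a subclass overrides it. */
  static abstract class SketchBaseWriter {
    public void setInt(int value) { throw unsupported("int"); }
    public void setLong(long value) { throw unsupported("long"); }
    public void setString(String value) { throw unsupported("String"); }

    // Assumption: UnsupportedOperationException stands in for the
    // exception type used by the real writers.
    private UnsupportedOperationException unsupported(String type) {
      return new UnsupportedOperationException(
          getClass().getSimpleName() + " does not accept " + type + " values");
    }
  }

  /** Vector-specific writer: overrides only the setter it supports. */
  static class SketchIntWriter extends SketchBaseWriter {
    int lastValue;

    @Override
    public void setInt(int value) { lastValue = value; }
  }

  public static void main(String[] args) {
    SketchIntWriter writer = new SketchIntWriter();
    writer.setInt(42);              // supported: handled by the override
    try {
      writer.setString("oops");     // unsupported: falls through to the base class
    } catch (UnsupportedOperationException e) {
      System.out.println("Rejected: " + e.getMessage());
    }
  }
}
```

With this layout a generated subclass needs only one or two overrides, and a call to the wrong setter fails immediately instead of silently writing nothing.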

The only tricky part to this class is understanding the state of the write indexes as the write proceeds. There are two pointers to consider:

  • lastWriteIndex: The position in the vector at which the client last asked us to write data. This index is maintained in this class because it depends only on the actions of this class.
  • vectorIndex: The position in the vector at which we will write if the client chooses to write a value at this time. The vector index is shared by all columns at the same repeat level. It is incremented as the client steps through the write and is observed in this class each time a write occurs.
A repeat level is defined as any of the following:
  • The set of top-level scalar columns, or those within a top-level, non-repeated map, or nested to any depth within non-repeated maps rooted at the top level.
  • The values for a single scalar array.
  • The set of scalar columns within a repeated map, or nested within non-repeated maps within a repeated map.
Items at a repeat level index together and share a vector index. However, the columns within a repeat level do not share a last write index: some can lag further behind than others.

To illustrate, consider a single column and the four states that can occur during a write:

  • Behind: the last write index is more than one position behind the vector index. Zero-filling will be needed to catch up to the vector index.
  • Written: the last write index is the same as the vector index because the client wrote data at this position (and previous values were back-filled with nulls, empties or zeros).
  • Unwritten: the last write index is one behind the vector index. This occurs when the column was written, then the client moved to the next row or array position.
  • Restarted: the current row is abandoned (perhaps filtered out) and is to be rewritten. The last write position moves back one position. Note that the Restarted state is indistinguishable from the Unwritten state: the only real difference is that the current slot (pointed to by the vector index) contains the previously written value, which must be overwritten or back-filled. This is fine, because we assume that unwritten values are garbage anyway.
To illustrate:

      Behind      Written    Unwritten    Restarted
       |X|          |X|         |X|          |X|
   lw >|X|          |X|         |X|          |X|
       | |          |0|         |0|     lw > |0|
    v >| |  lw, v > |X|    lw > |X|      v > |X|
                            v > | |
 
The illustrated state transitions are:
  • Suppose the state starts in Behind.
    • If the client writes a value, then the empty slot is back-filled and the state moves to Written.
    • If the client does not write a value, the state stays at Behind, and the gap of unfilled values grows.
  • When in the Written state:
    • If the client saves the current row or array position, the vector index increments and we move to the Unwritten state.
    • If the client abandons the row, the last write position moves back one to recreate the unwritten state. We've shown this state separately above just to illustrate the two transitions from Written.
  • When in the Unwritten (or Restarted) states:
    • If the client writes a value, then the writer moves back to the Written state.
    • If the client skips the value, then the vector index increments again, leaving a gap, and the writer moves to the Behind state.
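
The states and transitions listed above depend only on the gap between the two indexes, which can be sketched as below. This is an illustrative model only: the `State` enum and `classify` helper are hypothetical names and do not exist in the writer itself.

```java
// Hypothetical model of the write states; not part of the actual writer.
class WriteStateSketch {

  enum State { BEHIND, WRITTEN, UNWRITTEN }

  /** Classify the state from the gap between the two indexes. */
  static State classify(int lastWriteIndex, int vectorIndex) {
    int gap = vectorIndex - lastWriteIndex;
    if (gap == 0) return State.WRITTEN;
    if (gap == 1) return State.UNWRITTEN;   // also covers Restarted
    if (gap > 1) return State.BEHIND;
    throw new IllegalStateException("last write index is ahead of the vector index");
  }

  public static void main(String[] args) {
    int lastWrite = 1;
    int vector = 3;
    System.out.println(classify(lastWrite, vector));  // start in Behind

    lastWrite = vector;                               // client writes; gap is back-filled
    System.out.println(classify(lastWrite, vector));  // Written

    vector++;                                         // client saves the row
    System.out.println(classify(lastWrite, vector));  // Unwritten

    vector++;                                         // client skips this column for a row
    System.out.println(classify(lastWrite, vector));  // Behind again

    lastWrite = vector;                               // client writes
    lastWrite--;                                      // client abandons the row (Restarted)
    System.out.println(classify(lastWrite, vector));  // reported as Unwritten
  }
}
```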

We've already noted that the Restarted state is identical to the Unwritten state (it was shown separately only to make the flow a bit clearer). The astute reader will have noticed that the Behind state is also the same as the Unwritten state if we define the combined state as any case in which the last write position is behind the vector index.

Further, if one simply treats the gap between the last write index and the vector index as the amount (which may be zero) to back-fill, then there is just one state. This is, in fact, how the code works: it always writes to the vector index (and can do so multiple times for a single row), back-filling as necessary.

The states, then, are more for our use in understanding the algorithm. They are also very useful when working through the logic of performing a roll-over when a vector overflows.
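
The single-state view described above (always write at the vector index, back-filling whatever gap exists) can be sketched with a plain `int[]` standing in for the vector. This is a simplified illustration under stated assumptions: `SketchIntColumn` is a hypothetical class, the vector index is passed in as a parameter rather than read from a shared object, and overflow is ignored.

```java
import java.util.Arrays;

// Hypothetical, simplified model of back-filling; not the actual writer.
class BackFillSketch {

  static class SketchIntColumn {
    final int[] values;
    int lastWriteIndex = -1;   // nothing written yet

    SketchIntColumn(int capacity) {
      values = new int[capacity];
      Arrays.fill(values, -1); // simulate garbage in unwritten slots
    }

    /** Write at the shared vector index, zero-filling any skipped slots first. */
    void setInt(int vectorIndex, int value) {
      // The gap may be empty, in which case this loop does nothing.
      for (int i = lastWriteIndex + 1; i < vectorIndex; i++) {
        values[i] = 0;
      }
      values[vectorIndex] = value;
      lastWriteIndex = vectorIndex;
    }
  }

  public static void main(String[] args) {
    SketchIntColumn col = new SketchIntColumn(5);
    col.setInt(0, 10);
    col.setInt(1, 20);
    // Row 2 is skipped for this column.
    col.setInt(3, 40);         // back-fills slot 2 with zero
    col.setInt(3, 41);         // a second write to the same row simply overwrites
    System.out.println(Arrays.toString(col.values)); // prints [10, 20, 0, 41, -1]
  }
}
```

Because the fill loop is a no-op when the gap is empty, the write path needs no explicit state variable; the Behind, Written, and Unwritten cases all pass through the same code.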

  • Field Details

    • MIN_BUFFER_SIZE

      public static final int MIN_BUFFER_SIZE
    • listener

      Listener invoked if the vector overflows. If not provided, then the writer does not support vector overflow.
    • emptyValue

      protected byte[] emptyValue
      Value to use to fill empties. Must be at least as wide as each value.
    • drillBuf

      protected DrillBuf drillBuf
    • capacity

      protected int capacity
      Capacity, in values, of the currently allocated buffer that backs the vector. Updated each time the buffer changes. The capacity is in values (rather than bytes) to streamline the per-write logic.
  • Constructor Details

    • BaseScalarWriter

      public BaseScalarWriter()
  • Method Details

    • bindListener

      public void bindListener(WriterEvents.ColumnWriterListener listener)
      Description copied from interface: WriterEvents
      Bind a listener to the underlying vector writer. This listener reports on vector events (overflow, growth), and so is called only when the writer is backed by a vector. The listener is ignored (and never called) for dummy (non-projected) columns. If the column is compound (such as for a nullable or repeated column, or for a map), then the writer is bound to the individual components.
      Specified by:
      bindListener in interface WriterEvents
      Overrides:
      bindListener in class AbstractScalarWriter
      Parameters:
      listener - the vector event listener to bind
    • bindSchema

      public void bindSchema(ColumnMetadata schema)
      Overrides:
      bindSchema in class AbstractScalarWriterImpl
    • setBuffer

      protected abstract void setBuffer()
      All buffer changes go through this method so that the buffer address and capacity can be captured. There are only two ways to set the buffer: by binding to a vector in bindVector(), or by resizing the vector in prepareWrite().
    • realloc

      protected void realloc(int size)
    • canExpand

      protected boolean canExpand(int delta)
      The vector is about to grow. Give the listener a chance to veto the growth and opt for overflow instead.
      Parameters:
      delta - the new amount of memory to allocate
      Returns:
      true if the vector can be grown, false if an overflow should be triggered
    • overflowed

      protected void overflowed()
      Handle vector overflow. If this is an array, then there is a slim chance we may need to grow the vector immediately after overflow. Since a double overflow is not allowed, this recursive call won't continue forever.
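
The growth-versus-overflow decision described for canExpand and overflowed can be sketched as below. This is a rough illustration only: `SketchListener`, `BudgetListener`, `SketchWriter`, and `ensureCapacity` are hypothetical names, the listener signatures are simplified assumptions (the real WriterEvents.ColumnWriterListener may take different parameters), and the doubling policy is assumed for the example.

```java
// Hypothetical, simplified model of listener-controlled growth.
class GrowthSketch {

  /** Assumed, simplified listener; not the real ColumnWriterListener. */
  interface SketchListener {
    boolean canExpand(int deltaBytes);
    void overflowed();
  }

  /** Example policy: allow growth until a fixed byte budget is used up. */
  static class BudgetListener implements SketchListener {
    int remainingBytes;
    int overflowCount;

    BudgetListener(int budgetBytes) { remainingBytes = budgetBytes; }

    @Override
    public boolean canExpand(int deltaBytes) {
      if (deltaBytes > remainingBytes) return false;  // veto the growth
      remainingBytes -= deltaBytes;
      return true;
    }

    @Override
    public void overflowed() { overflowCount++; }
  }

  static class SketchWriter {
    private final SketchListener listener;  // null means overflow is not supported
    int[] buffer;
    int capacity;                           // in values, not bytes

    SketchWriter(int capacity, SketchListener listener) {
      this.buffer = new int[capacity];
      this.capacity = capacity;
      this.listener = listener;
    }

    /** Returns true if the write can proceed, false if overflow was signalled. */
    boolean ensureCapacity(int writeIndex) {
      if (writeIndex < capacity) return true;
      int newCapacity = Math.max(capacity * 2, writeIndex + 1);
      int deltaBytes = (newCapacity - capacity) * Integer.BYTES;
      if (listener == null || listener.canExpand(deltaBytes)) {
        buffer = java.util.Arrays.copyOf(buffer, newCapacity);
        capacity = newCapacity;
        return true;
      }
      listener.overflowed();                // listener chose overflow over growth
      return false;
    }
  }

  public static void main(String[] args) {
    BudgetListener listener = new BudgetListener(16);
    SketchWriter writer = new SketchWriter(4, listener);
    System.out.println(writer.ensureCapacity(4));  // growth of 16 bytes fits the budget
    System.out.println(writer.ensureCapacity(8));  // growth of 32 bytes is vetoed
    System.out.println(listener.overflowCount);
  }
}
```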
    • skipNulls

      public abstract void skipNulls()
    • nullable

      public boolean nullable()
      Description copied from interface: ColumnWriter
      Whether this writer allows nulls. This is not as simple as checking for the TypeProtos.DataMode.OPTIONAL type in the schema. List entries are nullable, if they are primitive, but not if they are maps or lists. Unions are nullable, regardless of cardinality.
      Returns:
      true if a call to ColumnWriter.setNull() is supported, false if not
    • setNull

      public void setNull()
      Description copied from interface: ColumnWriter
      Set the current value to null. Support depends on the underlying implementation: only nullable types support this operation. Throws IllegalStateException if called on a non-nullable value.
    • setBoolean

      public void setBoolean(boolean value)
    • setInt

      public void setInt(int value)
    • setLong

      public void setLong(long value)
    • setFloat

      public void setFloat(float value)
    • setDouble

      public void setDouble(double value)
    • setString

      public void setString(String value)
    • setBytes

      public void setBytes(byte[] value, int len)
    • appendBytes

      public void appendBytes(byte[] value, int len)
    • setDecimal

      public void setDecimal(BigDecimal value)
    • setPeriod

      public void setPeriod(org.joda.time.Period value)
    • setDate

      public void setDate(LocalDate value)
    • setTime

      public void setTime(LocalTime value)
    • setTimestamp

      public void setTimestamp(Instant value)
    • dump

      public void dump(HierarchicalFormatter format)
      Specified by:
      dump in interface WriterEvents
      Overrides:
      dump in class AbstractScalarWriterImpl