Class SmoothingProjection

java.lang.Object
org.apache.drill.exec.physical.impl.scan.project.ReaderLevelProjection
org.apache.drill.exec.physical.impl.scan.project.SmoothingProjection

public class SmoothingProjection extends ReaderLevelProjection
Resolve a table schema against the prior schema. This works only if the types match and if all columns in the table schema already appear in the prior schema.

Consider this an experimental mechanism. The hope was that, with clever techniques, we could "smooth over" some of the issues that cause schema change events in Drill. As it turned out, however, creating this mechanism revealed that it is not possible, even in theory, to handle most schema changes because of the time dimension:

  • An event in a later batch may provide information that would have caused us to make a different decision in an earlier batch. For example, we are asked for column `foo`, did not see such a column in the first batch, block or file, guessed some type, and later saw that the column was of a different type. We can't "time travel" to tell our earlier selves, nor, when we make the initial type decision, can we jump to the future to see what type we'll discover.
  • Readers in this fragment may see column `foo` but readers in another fragment read files/blocks that don't have that column. The two readers cannot communicate to agree on a type.

What this mechanism can do is make decisions based on history: when a column appears, we can adjust its type a bit to try to avoid an unnecessary change. For example, if a prior file in this scan saw `foo` as nullable Varchar, but the present file has the column as required Varchar, we can use the more general nullable form. But, again, the "can't predict the future" problem bites us: we can handle a nullable-to-required column change, but not vice versa.
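
The history-based rule described above can be illustrated with a minimal sketch. This is not Drill's implementation: the `SmoothingSketch` class, its `Col` and `Mode` types, and the `smooth` method are hypothetical names invented for this example, and column types are reduced to plain strings. The sketch reuses the prior schema only when every column in the table schema already appears in it with the same type and a cardinality that is at least as general.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical illustration only; none of these names are part of Drill's API. */
class SmoothingSketch {

  enum Mode { REQUIRED, NULLABLE }

  /** A column reduced to a type name plus cardinality (an assumption made for brevity). */
  static final class Col {
    final String type;
    final Mode mode;

    Col(String type, Mode mode) {
      this.type = type;
      this.mode = mode;
    }
  }

  /**
   * Try to reuse the prior schema for the current table schema.
   * Returns the prior schema if smoothing is possible, or empty if a
   * schema change cannot be avoided.
   */
  static Optional<Map<String, Col>> smooth(Map<String, Col> prior, Map<String, Col> table) {
    for (Map.Entry<String, Col> entry : table.entrySet()) {
      Col before = prior.get(entry.getKey());
      Col now = entry.getValue();
      if (before == null) {
        return Optional.empty();          // column never seen before
      }
      if (!before.type.equals(now.type)) {
        return Optional.empty();          // types must match
      }
      if (before.mode == Mode.REQUIRED && now.mode == Mode.NULLABLE) {
        return Optional.empty();          // earlier batches were already sent as required
      }
      // Same mode, or prior nullable and current required: the prior,
      // more general definition still fits.
    }
    // Prior columns absent from the table are kept as-is in this sketch;
    // how they would be filled is out of scope here.
    Map<String, Col> result = new LinkedHashMap<>(prior);
    return Optional.of(result);
  }
}
```

With a prior `foo` of nullable `VARCHAR` and a current `foo` of required `VARCHAR`, `smooth` returns the prior schema and `foo` stays nullable. Swapping the two arguments returns an empty result, which is the "can't predict the future" case.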

What this mechanism will tell the careful reader is that the only general solution to the schema-change problem is to know the full schema up front: for the planner to be told the schema and to communicate that schema to all readers so that all readers agree on the final schema.

When that is done, the techniques shown here can be used to adjust any per-file variation of schema to match the up-front schema.