Class ParquetReaderUtility
java.lang.Object
org.apache.drill.exec.store.parquet.ParquetReaderUtility
Utility class where we can capture common logic between the two parquet readers.
-
Nested Class Summary
Nested Classes
- static enum ParquetReaderUtility.DateCorruptionStatus: For most recently created parquet files, we can determine if we have corrupted dates (see DRILL-4203) based on the file metadata.
- static class: Utilities for converting from parquet INT96 binary (Impala, Hive timestamp) to a date time value.
-
Field Summary
Fields
- static final String ALLOWED_DRILL_VERSION_FOR_BINARY
- static final long CORRECT_CORRUPT_DATE_SHIFT: All old parquet files (those without the "is.date.correct=true" or "parquet-writer.version" properties in metadata) have a corrupt date shift of 4881176L days, or 2 * 2440588L.
- static final int DATE_CORRUPTION_THRESHOLD: The year 5000 (or day 1106685 from the Unix epoch) is chosen as the threshold for auto-detecting date corruption.
- static final int DRILL_WRITER_VERSION_STD_DATE_FORMAT: Version 2 (and later) of the Drill Parquet writer uses the date format described in the Parquet spec.
- static final long JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH: Number of days between the Julian day epoch (January 1, 4713 BC) and the Unix day epoch (January 1, 1970).
-
Constructor Summary
Constructors
- ParquetReaderUtility()
-
Method Summary
Methods
- static int autoCorrectCorruptedDate(int corruptedDate)
- static void checkDecimalTypeEnabled(OptionManager options)
- static ParquetReaderUtility.DateCorruptionStatus checkForCorruptDateValuesInStatistics(org.apache.parquet.hadoop.metadata.ParquetMetadata footer, List<SchemaPath> columns, boolean autoCorrectCorruptDates): Detect corrupt date values by looking at the min/max values in the metadata.
- static boolean containsComplexColumn(org.apache.parquet.hadoop.metadata.ParquetMetadata footer, List<SchemaPath> columns): Check whether any of the columns in the given list is either nested or repeated.
- static void correctDatesInMetadataCache(MetadataBase.ParquetTableMetadataBase parquetTableMetadata)
- static ParquetReaderUtility.DateCorruptionStatus detectCorruptDates(org.apache.parquet.hadoop.metadata.ParquetMetadata footer, List<SchemaPath> columns, boolean autoCorrectCorruptDates): Check for corrupted dates in a parquet file.
- static Map<String,org.apache.parquet.column.ColumnDescriptor> getColNameToColumnDescriptorMapping(org.apache.parquet.hadoop.metadata.ParquetMetadata footer): Map full column paths to all ColumnDescriptors in the file schema.
- static Map<String,org.apache.parquet.format.SchemaElement> getColNameToSchemaElementMapping(org.apache.parquet.hadoop.metadata.ParquetMetadata footer): Map full schema paths in format `a`.`b`.`c` to respective SchemaElement objects.
- static List<TypeProtos.MajorType> getComplexTypes(List<org.apache.parquet.schema.OriginalType> originalTypes): Converts a list of OriginalTypes to a list of TypeProtos.MajorTypes.
- static TypeProtos.DataMode getDataMode(org.apache.parquet.schema.Type.Repetition repetition): Converts Parquet's Type.Repetition to Drill's TypeProtos.DataMode.
- static String getFullColumnPath(org.apache.parquet.column.ColumnDescriptor column): Generates the full path of the column in format `a`.`b`.`c`.
- static int getIntFromLEBytes(byte[] input, int start)
- static TypeProtos.MinorType getMinorType(org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName type, org.apache.parquet.schema.OriginalType originalType): Builds minor type using given OriginalType originalType or PrimitiveTypeName type.
- static TypeProtos.MajorType getType(org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName type, org.apache.parquet.schema.OriginalType originalType, int precision, int scale): Builds major type using given OriginalType originalType or PrimitiveTypeName type.
- static boolean isLogicalListType(org.apache.parquet.schema.GroupType groupType): Checks whether a group field approximately matches the pattern for Logical Lists.
- static boolean isLogicalMapType(org.apache.parquet.schema.GroupType groupType): Checks whether a group field matches the pattern for the Logical Map type.
- static void transformBinaryInMetadataCache(MetadataBase.ParquetTableMetadataBase parquetTableMetadata, ParquetReaderConfig readerConfig): Transforms values for min/max binary statistics to byte array.
-
Field Details
-
JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH
public static final long JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH
Number of days between the Julian day epoch (January 1, 4713 BC) and the Unix day epoch (January 1, 1970). The value of this constant is 2440588L.
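This constant is what allows Julian day numbers (as stored in Parquet INT96 timestamps) to be converted to days since the Unix epoch. A minimal sketch of that arithmetic, using only the constant documented above (the class and method names here are illustrative, not part of the Drill API):

```java
// Illustrative sketch: converting a Julian day number to days since the
// Unix epoch. JulianDays is a hypothetical helper, not Drill's source.
public class JulianDays {
    public static final long JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH = 2440588L;

    // Julian day 2440588 is January 1, 1970, i.e. Unix epoch day 0.
    public static long julianToUnixEpochDays(long julianDay) {
        return julianDay - JULIAN_DAY_NUMBER_FOR_UNIX_EPOCH;
    }

    public static void main(String[] args) {
        System.out.println(julianToUnixEpochDays(2440588L)); // prints 0
    }
}
```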
-
CORRECT_CORRUPT_DATE_SHIFT
public static final long CORRECT_CORRUPT_DATE_SHIFT
All old parquet files (those without the "is.date.correct=true" or "parquet-writer.version" properties in metadata) have a corrupt date shift of 4881176L days, or 2 * 2440588L.
-
DATE_CORRUPTION_THRESHOLD
public static final int DATE_CORRUPTION_THRESHOLD
The year 5000 (or day 1106685 from the Unix epoch) is chosen as the threshold for auto-detecting date corruption. This balances two possible cases of bad auto-correction. External tools writing dates in the future will not be shifted unless they are past this threshold (and we cannot identify them as external files based on the metadata). On the other hand, historical dates written with Drill would not risk being incorrectly shifted unless they were something like 10,000 years in the past.
-
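The heuristic described above can be sketched as a simple comparison: an epoch-day value beyond the year-5000 threshold is presumed to carry the corrupt shift. This is a plain-Java illustration of the idea, not Drill's actual detection code (which also inspects file metadata):

```java
// Illustrative sketch of the threshold heuristic: values past day 1106685
// (the year 5000) are presumed corrupt. Names are hypothetical.
public class DateCorruptionCheck {
    public static final int DATE_CORRUPTION_THRESHOLD = 1_106_685;

    public static boolean looksCorrupt(int epochDays) {
        return epochDays > DATE_CORRUPTION_THRESHOLD;
    }

    public static void main(String[] args) {
        System.out.println(looksCorrupt(18000));           // a 2019 date: false
        System.out.println(looksCorrupt(18000 + 4881176)); // same date, shifted: true
    }
}
```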
DRILL_WRITER_VERSION_STD_DATE_FORMAT
public static final int DRILL_WRITER_VERSION_STD_DATE_FORMAT
Version 2 (and later) of the Drill Parquet writer uses the date format described in the Parquet spec. Prior versions had dates formatted with CORRECT_CORRUPT_DATE_SHIFT.
-
ALLOWED_DRILL_VERSION_FOR_BINARY
public static final String ALLOWED_DRILL_VERSION_FOR_BINARY
-
-
Constructor Details
-
ParquetReaderUtility
public ParquetReaderUtility()
-
-
Method Details
-
checkDecimalTypeEnabled
public static void checkDecimalTypeEnabled(OptionManager options)
-
getIntFromLEBytes
public static int getIntFromLEBytes(byte[] input, int start) -
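The method above reads a 32-bit little-endian integer out of a byte array. A self-contained plain-Java equivalent of that operation (not Drill's source; the class name is illustrative):

```java
// Illustrative sketch of decoding a 32-bit little-endian int from a byte
// array, the operation getIntFromLEBytes performs.
public class LEBytes {
    public static int intFromLEBytes(byte[] input, int start) {
        // Least significant byte first; mask each byte to avoid sign extension.
        return (input[start] & 0xFF)
             | (input[start + 1] & 0xFF) << 8
             | (input[start + 2] & 0xFF) << 16
             | (input[start + 3] & 0xFF) << 24;
    }
}
```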
getColNameToSchemaElementMapping
public static Map<String,org.apache.parquet.format.SchemaElement> getColNameToSchemaElementMapping(org.apache.parquet.hadoop.metadata.ParquetMetadata footer)
Map full schema paths in format `a`.`b`.`c` to respective SchemaElement objects.
Parameters:
footer - Parquet file metadata
Returns:
schema full path to SchemaElement map
-
getFullColumnPath
public static String getFullColumnPath(org.apache.parquet.column.ColumnDescriptor column)
Generates the full path of the column in format `a`.`b`.`c`.
Parameters:
column - ColumnDescriptor object
Returns:
full path in format `a`.`b`.`c`
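The `a`.`b`.`c` format above can be produced by backtick-quoting each path segment and joining with dots. A plain-Java sketch of that formatting (illustrative names, not Drill's implementation):

```java
import java.util.StringJoiner;

// Illustrative sketch: building the `a`.`b`.`c` path format from a
// column's path segments.
public class ColumnPaths {
    public static String fullColumnPath(String[] pathSegments) {
        StringJoiner joiner = new StringJoiner(".");
        for (String segment : pathSegments) {
            joiner.add("`" + segment + "`"); // quote each segment in backticks
        }
        return joiner.toString();
    }
}
```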
-
getColNameToColumnDescriptorMapping
public static Map<String,org.apache.parquet.column.ColumnDescriptor> getColNameToColumnDescriptorMapping(org.apache.parquet.hadoop.metadata.ParquetMetadata footer)
Map full column paths to all ColumnDescriptors in the file schema.
Parameters:
footer - Parquet file metadata
Returns:
column full path to ColumnDescriptor object map
-
autoCorrectCorruptedDate
public static int autoCorrectCorruptedDate(int corruptedDate) -
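Given the CORRECT_CORRUPT_DATE_SHIFT field documented above (2 * 2440588 = 4881176 days), the correction amounts to subtracting that shift from the corrupted epoch-day value. A minimal standalone sketch of that arithmetic (not Drill's source):

```java
// Illustrative sketch: undoing the documented corrupt date shift of
// 2 * 2440588 days. Class name is hypothetical.
public class DateShift {
    public static final int CORRECT_CORRUPT_DATE_SHIFT = 2 * 2440588;

    public static int autoCorrectCorruptedDate(int corruptedDate) {
        return corruptedDate - CORRECT_CORRUPT_DATE_SHIFT;
    }
}
```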
correctDatesInMetadataCache
public static void correctDatesInMetadataCache(MetadataBase.ParquetTableMetadataBase parquetTableMetadata) -
transformBinaryInMetadataCache
public static void transformBinaryInMetadataCache(MetadataBase.ParquetTableMetadataBase parquetTableMetadata, ParquetReaderConfig readerConfig)
Transforms values for min / max binary statistics to byte array. Transformation logic depends on metadata file version.
Parameters:
parquetTableMetadata - table metadata that should be corrected
readerConfig - parquet reader config
-
detectCorruptDates
public static ParquetReaderUtility.DateCorruptionStatus detectCorruptDates(org.apache.parquet.hadoop.metadata.ParquetMetadata footer, List<SchemaPath> columns, boolean autoCorrectCorruptDates)
Check for corrupted dates in a parquet file. See DRILL-4203.
-
checkForCorruptDateValuesInStatistics
public static ParquetReaderUtility.DateCorruptionStatus checkForCorruptDateValuesInStatistics(org.apache.parquet.hadoop.metadata.ParquetMetadata footer, List<SchemaPath> columns, boolean autoCorrectCorruptDates)
Detect corrupt date values by looking at the min/max values in the metadata. This should only be used when a file does not have enough metadata to determine if the data was written with an external tool or an older version of Drill (ParquetRecordWriter.WRITER_VERSION_PROPERTY < DRILL_WRITER_VERSION_STD_DATE_FORMAT). This method only checks the first row group, because Drill has only ever written a single row group per file.
Parameters:
footer - parquet footer
columns - list of column schema paths
autoCorrectCorruptDates - user setting to allow enabling/disabling of auto-correction of corrupt dates. There are some rare cases (storing dates thousands of years into the future, with tools other than Drill writing files) that would result in the date values being "corrected" into bad values.
-
getType
public static TypeProtos.MajorType getType(org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName type, org.apache.parquet.schema.OriginalType originalType, int precision, int scale)
Builds major type using given OriginalType originalType or PrimitiveTypeName type. For DECIMAL, a major type with scale and precision is returned.
Parameters:
type - parquet primitive type
originalType - parquet original type
precision - type precision (used for DECIMAL type)
scale - type scale (used for DECIMAL type)
Returns:
major type
-
getMinorType
public static TypeProtos.MinorType getMinorType(org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName type, org.apache.parquet.schema.OriginalType originalType)
Builds minor type using given OriginalType originalType or PrimitiveTypeName type.
Parameters:
type - parquet primitive type
originalType - parquet original type
Returns:
minor type
-
containsComplexColumn
public static boolean containsComplexColumn(org.apache.parquet.hadoop.metadata.ParquetMetadata footer, List<SchemaPath> columns)
Check whether any of the columns in the given list is either nested or repeated.
Parameters:
footer - Parquet file schema
columns - list of query SchemaPath objects
-
getComplexTypes
public static List<TypeProtos.MajorType> getComplexTypes(List<org.apache.parquet.schema.OriginalType> originalTypes)
Converts a list of OriginalTypes to a list of TypeProtos.MajorTypes. NOTE: the current implementation handles only OriginalType.MAP and OriginalType.LIST, converting them to TypeProtos.MinorType.DICT and TypeProtos.MinorType.LIST respectively. Other original types are converted to null, because there is no definite correspondence between the two (and no actual need for one: these types are used only to differentiate between Drill's MAP and DICT types, and arrays thereof, when constructing TupleSchema).
Parameters:
originalTypes - list of Parquet's types
Returns:
list containing either null or types with minor type TypeProtos.MinorType.DICT or TypeProtos.MinorType.LIST
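The MAP-to-DICT, LIST-to-LIST, everything-else-to-null mapping described above can be sketched with plain strings standing in for Parquet's OriginalType and Drill's MinorType enums (a self-contained illustration, not Drill's implementation):

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the getComplexTypes mapping: MAP -> DICT,
// LIST -> LIST, anything else -> null. Strings stand in for the enums.
public class ComplexTypeMapping {
    public static List<String> getComplexTypes(List<String> originalTypes) {
        List<String> result = new ArrayList<>();
        for (String type : originalTypes) {
            if ("MAP".equals(type)) {
                result.add("DICT");
            } else if ("LIST".equals(type)) {
                result.add("LIST");
            } else {
                result.add(null); // no defined correspondence for other types
            }
        }
        return result;
    }
}
```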
-
isLogicalListType
public static boolean isLogicalListType(org.apache.parquet.schema.GroupType groupType)
Checks whether a group field approximately matches the pattern for Logical Lists:

<list-repetition> group <name> (LIST) {
  repeated group list {
    <element-repetition> <element-type> element;
  }
}

(See https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#lists for more details.) Note that the standard field names 'list' and 'element' are intentionally not checked, because Hive lists use the names 'bag' and 'array_element' instead.
Parameters:
groupType - type which may have LIST original type
Returns:
whether the type is LIST and the nested field is a repeated group
-
isLogicalMapType
public static boolean isLogicalMapType(org.apache.parquet.schema.GroupType groupType)
Checks whether a group field matches the pattern for the Logical Map type:

<map-repetition> group <name> (MAP) {
  repeated group key_value {
    required <key-type> key;
    <value-repetition> <value-type> value;
  }
}

Note that the actual group names are not checked specifically.
Parameters:
groupType - parquet type which may be of MAP type
Returns:
whether the type is MAP
-
getDataMode
public static TypeProtos.DataMode getDataMode(org.apache.parquet.schema.Type.Repetition repetition)
Converts Parquet's Type.Repetition to Drill's TypeProtos.DataMode.
Parameters:
repetition - repetition to be converted
Returns:
data mode corresponding to Parquet's repetition
-