Class ParquetTableMetadataUtils

java.lang.Object
org.apache.drill.exec.store.parquet.ParquetTableMetadataUtils

public class ParquetTableMetadataUtils extends Object
Utility class for converting parquet metadata classes to Metastore metadata classes.
  • Method Details

    • addImplicitColumnsStatistics

      public static Map<SchemaPath,ColumnStatistics<?>> addImplicitColumnsStatistics(Map<SchemaPath,ColumnStatistics<?>> columnsStatistics, List<SchemaPath> columns, List<String> partitionValues, OptionManager optionManager, org.apache.hadoop.fs.Path location, boolean supportsFileImplicitColumns)
      Creates a new map based on the specified columnsStatistics, with added statistics for implicit and partition (dir) columns.
      Parameters:
      columnsStatistics - map of column statistics to expand
      columns - list of all columns including implicit or partition ones
      partitionValues - list of partition values
      optionManager - option manager
      location - location of metadata part
      supportsFileImplicitColumns - whether file implicit columns are supported
      Returns:
      map with added statistics for implicit and partition (dir) columns
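
      The following is a minimal sketch of the idea behind this method, not Drill's implementation. It assumes partition columns are named with a label followed by an index (for example dir0, dir1) and that each one receives constant statistics equal to its partition value. The class PartitionStatsSketch, the type Range, and the method withPartitionColumns are hypothetical stand-ins that use only the Java standard library.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class PartitionStatsSketch {

  // Hypothetical stand-in for column statistics: only a min and a max value.
  static final class Range {
    final String min;
    final String max;

    Range(String min, String max) {
      this.min = min;
      this.max = max;
    }
  }

  // Copies the input map (the input is never modified) and adds a constant
  // range (min == max) for every requested partition column whose index has
  // a corresponding partition value.
  static Map<String, Range> withPartitionColumns(Map<String, Range> columnsStatistics,
      List<String> columns, List<String> partitionValues, String label) {
    Map<String, Range> result = new HashMap<>(columnsStatistics);
    // Assumption: a partition column is the label followed by a decimal index.
    Pattern partitionColumn = Pattern.compile(Pattern.quote(label) + "(\\d{1,9})");
    for (String column : columns) {
      Matcher matcher = partitionColumn.matcher(column);
      if (matcher.matches()) {
        int index = Integer.parseInt(matcher.group(1));
        if (index < partitionValues.size()) {
          String value = partitionValues.get(index);
          result.put(column, new Range(value, value));
        }
      }
    }
    return result;
  }

  public static void main(String[] args) {
    Map<String, Range> base = new HashMap<>();
    base.put("id", new Range("1", "9"));
    Map<String, Range> expanded = withPartitionColumns(
        base, Arrays.asList("id", "dir0", "dir1"), Arrays.asList("2024", "01"), "dir");
    System.out.println(expanded.get("dir0").min + " " + expanded.get("dir1").min);
  }
}
```

      Returning a copy rather than mutating the argument mirrors the "creates new map" wording above; how the real method treats file implicit columns is not covered by this sketch.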
    • getRowGroupsMetadata

      public static org.apache.drill.shaded.guava.com.google.common.collect.Multimap<org.apache.hadoop.fs.Path,RowGroupMetadata> getRowGroupsMetadata(MetadataBase.ParquetTableMetadataBase tableMetadata)
      Returns a multimap of RowGroupMetadata, keyed by Path, obtained by converting the parquet row group metadata taken from the specified tableMetadata. Each row group is assigned an index based on its position in the file's metadata; empty / fake row groups are assigned the index -1.
      Parameters:
      tableMetadata - the source of row groups to be converted
      Returns:
      multimap of RowGroupMetadata keyed by Path
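
      The index assignment described above can be sketched as follows. This is an illustration under stated assumptions, not Drill's code: FileEntry and indexRowGroups are hypothetical names, a row group with zero rows stands in for an "empty / fake" row group, and an empty row group is assumed not to consume a position.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RowGroupIndexSketch {

  // Hypothetical stand-in for a file entry in the parquet table metadata:
  // a path plus the row counts of its row groups, in file order.
  static final class FileEntry {
    final String path;
    final List<Long> rowGroupRowCounts;

    FileEntry(String path, List<Long> rowGroupRowCounts) {
      this.path = path;
      this.rowGroupRowCounts = rowGroupRowCounts;
    }
  }

  // Returns, per file path, the index assigned to each of its row groups.
  static Map<String, List<Integer>> indexRowGroups(List<FileEntry> files) {
    Map<String, List<Integer>> result = new LinkedHashMap<>();
    for (FileEntry file : files) {
      List<Integer> indexes = new ArrayList<>();
      int position = 0;
      for (long rowCount : file.rowGroupRowCounts) {
        // Assumption: zero rows marks an empty / fake row group, which gets -1
        // and does not advance the position counter.
        indexes.add(rowCount == 0 ? -1 : position++);
      }
      result.put(file.path, indexes);
    }
    return result;
  }

  public static void main(String[] args) {
    List<FileEntry> files = Arrays.asList(
        new FileEntry("/t/a.parquet", Arrays.asList(10L, 20L)),
        new FileEntry("/t/b.parquet", Arrays.asList(0L)));
    System.out.println(indexRowGroups(files));
  }
}
```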
    • getRowGroupMetadata

      public static RowGroupMetadata getRowGroupMetadata(MetadataBase.ParquetTableMetadataBase tableMetadata, MetadataBase.RowGroupMetadata rowGroupMetadata, int rgIndexInFile, org.apache.hadoop.fs.Path location)
      Returns RowGroupMetadata instance converted from specified parquet rowGroupMetadata.
      Parameters:
      tableMetadata - table metadata which contains row group metadata to convert
      rowGroupMetadata - row group metadata to convert
      rgIndexInFile - index of current row group within the file
      location - location of file with current row group
      Returns:
      RowGroupMetadata instance converted from specified parquet rowGroupMetadata
    • getFileMetadata

      public static FileMetadata getFileMetadata(Collection<RowGroupMetadata> rowGroups)
      Returns a FileMetadata instance obtained by merging the specified collection of RowGroupMetadata.
      Parameters:
      rowGroups - collection of RowGroupMetadata to be merged
      Returns:
      FileMetadata instance
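
      As a rough illustration of what merging row-group-level metadata into file-level metadata involves, the sketch below sums row counts and combines the min/max of a single integer column. It is an assumption-laden simplification: Stats and merge are hypothetical names, and the statistics Drill actually merges per column are not shown in this page.

```java
import java.util.Arrays;
import java.util.Collection;

final class FileStatsMergeSketch {

  // Hypothetical stand-in for the statistics of one row group (one column only).
  static final class Stats {
    final long rowCount;
    final int min;
    final int max;

    Stats(long rowCount, int min, int max) {
      this.rowCount = rowCount;
      this.min = min;
      this.max = max;
    }
  }

  // Merges row group statistics into one file-level value:
  // row counts are summed, min is the smallest min, max is the largest max.
  static Stats merge(Collection<Stats> rowGroups) {
    if (rowGroups.isEmpty()) {
      throw new IllegalArgumentException("at least one row group is required");
    }
    long rowCount = 0;
    int min = Integer.MAX_VALUE;
    int max = Integer.MIN_VALUE;
    for (Stats rowGroup : rowGroups) {
      rowCount += rowGroup.rowCount;
      min = Math.min(min, rowGroup.min);
      max = Math.max(max, rowGroup.max);
    }
    return new Stats(rowCount, min, max);
  }

  public static void main(String[] args) {
    Stats merged = merge(Arrays.asList(new Stats(10, 1, 5), new Stats(20, 3, 9)));
    System.out.println(merged.rowCount + " rows, min " + merged.min + ", max " + merged.max);
  }
}
```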
    • getPartitionMetadata

      public static PartitionMetadata getPartitionMetadata(SchemaPath partitionColumn, List<FileMetadata> files)
      Returns a PartitionMetadata instance obtained by merging the specified list of FileMetadata.
      Parameters:
      partitionColumn - partition column
      files - list of files to be merged
      Returns:
      PartitionMetadata instance
    • getRowGroupColumnStatistics

      public static Map<SchemaPath,ColumnStatistics<?>> getRowGroupColumnStatistics(MetadataBase.ParquetTableMetadataBase tableMetadata, MetadataBase.RowGroupMetadata rowGroupMetadata)
      Converts specified MetadataBase.RowGroupMetadata into the map of ColumnStatistics instances with column names as keys.
      Parameters:
      tableMetadata - the source of column types
      rowGroupMetadata - metadata to convert
      Returns:
      map with converted row group metadata
    • getNonInterestingColumnsMeta

      public static NonInterestingColumnsMetadata getNonInterestingColumnsMeta(MetadataBase.ParquetTableMetadataBase parquetTableMetadata)
      Returns the metadata of the non-interesting columns.
      Parameters:
      parquetTableMetadata - the source of column metadata for the non-interesting columns' statistics
      Returns:
      non-interesting columns metadata
    • getValue

      public static Object getValue(Object value, org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName primitiveType, org.apache.parquet.schema.OriginalType originalType)
      Handles the passed value, taking into account its type and the specified primitiveType and originalType.
      Parameters:
      value - value to handle
      primitiveType - primitive type of the column whose value should be handled
      originalType - original type of the column whose value should be handled
      Returns:
      handled value
    • getFileFields

      Returns a map of column names to their Drill types for the specified file.
      Parameters:
      parquetTableMetadata - the source of primitive and original column types
      file - file whose columns should be discovered
      Returns:
      map of column names to their Drill types
    • getRowGroupFields

      public static Map<SchemaPath,TypeProtos.MajorType> getRowGroupFields(MetadataBase.ParquetTableMetadataBase parquetTableMetadata, MetadataBase.RowGroupMetadata rowGroup)
      Returns a map of column names to their Drill types for the specified rowGroup.
      Parameters:
      parquetTableMetadata - the source of primitive and original column types
      rowGroup - row group whose columns should be discovered
      Returns:
      map of column names to their Drill types
    • getIntermediateFields

      public static Map<SchemaPath,TypeProtos.MajorType> getIntermediateFields(MetadataBase.ParquetTableMetadataBase parquetTableMetadata, MetadataBase.RowGroupMetadata rowGroup)
      Returns a map of column names to their Drill types for every NameSegment of each SchemaPath in the specified rowGroup. The type for a SchemaPath may be null when it cannot be determined. Currently this hierarchy is needed only to account for TypeProtos.MinorType.DICT, so that filters on DICT values (get by key) are not pruned out before the actual filtering happens.
      Parameters:
      parquetTableMetadata - the source of column types
      rowGroup - row group whose columns should be discovered
      Returns:
      map of column names to their Drill types
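
      To illustrate what "every NameSegment in SchemaPath" amounts to, the sketch below lists each intermediate path of a nested column. It is only an approximation using dotted strings: PathPrefixSketch and prefixes are hypothetical names, the real SchemaPath is a chain of segments rather than a string, and type resolution is omitted.

```java
import java.util.ArrayList;
import java.util.List;

final class PathPrefixSketch {

  // Returns every intermediate path of a dotted column path, from the root
  // segment down to the full path.
  static List<String> prefixes(String dottedPath) {
    List<String> result = new ArrayList<>();
    StringBuilder current = new StringBuilder();
    for (String segment : dottedPath.split("\\.")) {
      if (current.length() > 0) {
        current.append('.');
      }
      current.append(segment);
      result.add(current.toString());
    }
    return result;
  }

  public static void main(String[] args) {
    System.out.println(prefixes("a.b.c")); // prints [a, a.b, a.b.c]
  }
}
```

      In such a hierarchy, an intermediate entry (here a or a.b) is where a DICT type would be recorded, and its type may remain unknown (null) when it cannot be determined.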
    • getOriginalType

      public static org.apache.parquet.schema.OriginalType getOriginalType(MetadataBase.ParquetTableMetadataBase parquetTableMetadata, MetadataBase.ColumnMetadata column)
      Returns the OriginalType of the specified column.
      Parameters:
      parquetTableMetadata - the source of column type
      column - column whose OriginalType should be returned
      Returns:
      OriginalType of the specified column
    • getPrimitiveTypeName

      public static org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName getPrimitiveTypeName(MetadataBase.ParquetTableMetadataBase parquetTableMetadata, MetadataBase.ColumnMetadata column)
      Returns the PrimitiveType.PrimitiveTypeName of the specified column.
      Parameters:
      parquetTableMetadata - the source of column type
      column - column whose PrimitiveType.PrimitiveTypeName should be returned
      Returns:
      PrimitiveType.PrimitiveTypeName of the specified column
    • getColumnStatistics

      public static Map<SchemaPath,ColumnStatistics<?>> getColumnStatistics(TupleMetadata schema, DrillStatsTable statistics)
      Returns a map of schema paths to ColumnStatistics obtained from the specified DrillStatsTable for all columns of the specified schema.
      Parameters:
      schema - source of column names
      statistics - source of column statistics
      Returns:
      map with schema path and ColumnStatistics