Class Metadata
java.lang.Object
org.apache.drill.exec.store.parquet.metadata.Metadata
This is a utility class, a holder for Parquet table metadata and ParquetReaderConfig. All creation of the Parquet metadata cache through the create APIs is forced to run as the process user, since only that user has write permission for the cache file.
Field Summary
-
Method Summary
Modifier and Type / Method / Description
static void createMeta(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path path, ParquetReaderConfig readerConfig, boolean allColumnsInteresting, Set<SchemaPath> columnSet)
Create the parquet metadata file for the directory at the given path, and for any subdirectories.
static org.apache.hadoop.fs.Path getDirFileName(org.apache.hadoop.fs.Path metadataParentDir)
static Metadata_V4.ParquetFileAndRowCountMetadata getParquetFileMetadata_v4(Metadata_V4.ParquetTableMetadata_v4 parquetTableMetadata, org.apache.parquet.hadoop.metadata.ParquetMetadata footer, org.apache.hadoop.fs.FileStatus file, org.apache.hadoop.fs.FileSystem fs, boolean allColumnsInteresting, boolean skipNonInteresting, Set<SchemaPath> columnSet, ParquetReaderConfig readerConfig)
Get the file metadata for a single file.
static Metadata_V4.ParquetTableMetadata_v4 getParquetTableMetadata(Map<org.apache.hadoop.fs.FileStatus, org.apache.hadoop.fs.FileSystem> fileStatusMap, ParquetReaderConfig readerConfig)
Get the parquet metadata for a list of parquet files.
static Metadata_V4.ParquetTableMetadata_v4 getParquetTableMetadata(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path path, ParquetReaderConfig readerConfig)
Get the parquet metadata for the parquet files in the given directory, including those in subdirectories.
static Metadata_V4.MetadataSummary getSummary(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path metadataParentDir, boolean autoRefreshTriggered, ParquetReaderConfig readerConfig)
Reads the summary from the metadata cache file; if the cache file is stale, recreates the metadata.
static org.apache.hadoop.fs.Path getSummaryFileName(org.apache.hadoop.fs.Path metadataParentDir)
static MetadataBase.ParquetTableMetadataBase readBlockMeta(org.apache.hadoop.fs.FileSystem fs, List<org.apache.hadoop.fs.Path> paths, MetadataContext metaContext, ParquetReaderConfig readerConfig)
Get the parquet metadata for the table by reading the metadata file.
static ParquetTableMetadataDirs readMetadataDirs(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path path, MetadataContext metaContext, ParquetReaderConfig readerConfig)
Get the parquet metadata for all subdirectories by reading the metadata file.
-
Field Details
-
OLD_METADATA_FILENAMES
-
OLD_METADATA_FILENAME
-
METADATA_DIRECTORIES_FILENAME
-
METADATA_FILENAME
-
METADATA_SUMMARY_FILENAME
-
CURRENT_METADATA_FILENAMES
-
DEFAULT_NULL_COUNT
-
NULL_COUNT_NOT_EXISTS
-
-
Method Details
-
createMeta
public static void createMeta(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path path, ParquetReaderConfig readerConfig, boolean allColumnsInteresting, Set<SchemaPath> columnSet) throws IOException
Create the parquet metadata file for the directory at the given path, and for any subdirectories.
Parameters:
fs - file system
path - path
readerConfig - parquet reader configuration
allColumnsInteresting - if set, store column metadata for all the columns
columnSet - Set of columns for which column metadata has to be stored
Throws:
IOException
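The "directory and any subdirectories" behaviour can be pictured with a small JDK-only sketch. It does not call Drill or Hadoop: the class RecursiveCacheSketch, the method createCaches, and the file name .example_metadata_cache are all hypothetical stand-ins, and the one-line "parquetFiles=N" content is only a placeholder for real metadata.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class RecursiveCacheSketch {
    // Hypothetical cache file name; the real value of METADATA_FILENAME is not given here.
    static final String CACHE_NAME = ".example_metadata_cache";

    /** Writes one placeholder cache file into root and every directory below it. */
    static int createCaches(Path root) throws IOException {
        // Collect the directories first so that writing files does not disturb the walk.
        List<Path> dirs;
        try (Stream<Path> walk = Files.walk(root)) {
            dirs = walk.filter(Files::isDirectory).collect(Collectors.toList());
        }
        for (Path dir : dirs) {
            long parquetFiles;
            try (Stream<Path> children = Files.list(dir)) {
                parquetFiles = children
                        .filter(p -> p.getFileName().toString().endsWith(".parquet"))
                        .count();
            }
            // Placeholder content standing in for the real per-directory metadata.
            Files.writeString(dir.resolve(CACHE_NAME), "parquetFiles=" + parquetFiles);
        }
        return dirs.size();
    }

    public static void main(String[] args) throws IOException {
        Path root = Files.createTempDirectory("table");
        Files.createDirectory(root.resolve("part1"));
        System.out.println("directories processed: " + createCaches(root));
    }
}
```

The sketch writes as whichever user runs the JVM; it does not model the switch to the process user described in the class overview.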
-
getParquetTableMetadata
public static Metadata_V4.ParquetTableMetadata_v4 getParquetTableMetadata(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path path, ParquetReaderConfig readerConfig) throws IOException
Get the parquet metadata for the parquet files in the given directory, including those in subdirectories.
Parameters:
fs - file system
path - path
readerConfig - parquet reader configuration
Returns:
parquet table metadata
Throws:
IOException
-
getParquetTableMetadata
public static Metadata_V4.ParquetTableMetadata_v4 getParquetTableMetadata(Map<org.apache.hadoop.fs.FileStatus, org.apache.hadoop.fs.FileSystem> fileStatusMap, ParquetReaderConfig readerConfig) throws IOException
Get the parquet metadata for a list of parquet files.
Parameters:
fileStatusMap - file statuses and corresponding file systems
readerConfig - parquet reader configuration
Returns:
parquet table metadata
Throws:
IOException
-
readBlockMeta
public static MetadataBase.ParquetTableMetadataBase readBlockMeta(org.apache.hadoop.fs.FileSystem fs, List<org.apache.hadoop.fs.Path> paths, MetadataContext metaContext, ParquetReaderConfig readerConfig)
Get the parquet metadata for the table by reading the metadata file.
Parameters:
fs - current file system
paths - The path to the metadata file, located in the directory that contains the parquet files
metaContext - metadata context
readerConfig - parquet reader configuration
Returns:
parquet table metadata. Null if metadata cache is missing, unsupported or corrupted
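Because the return value is null when the cache is missing, unsupported or corrupted, callers need a null check before using it. A minimal JDK-only sketch of that pattern follows; CacheFallbackSketch and readOrRebuild are hypothetical names, and the two suppliers merely stand in for a cache read (such as readBlockMeta) and a rebuild from the files themselves (such as getParquetTableMetadata). Whether Drill falls back in exactly this way is an assumption.

```java
import java.util.function.Supplier;

class CacheFallbackSketch {
    /**
     * Returns the cached value when the cache reader yields one,
     * otherwise falls back to the slower rebuild path.
     */
    static <T> T readOrRebuild(Supplier<T> cacheReader, Supplier<T> rebuild) {
        T cached = cacheReader.get();   // may be null: cache missing, unsupported or corrupted
        if (cached != null) {
            return cached;
        }
        return rebuild.get();           // hypothetical fallback, e.g. reading file footers
    }

    public static void main(String[] args) {
        String hit = readOrRebuild(() -> "cached-metadata", () -> "rebuilt-metadata");
        String miss = CacheFallbackSketch.<String>readOrRebuild(() -> null, () -> "rebuilt-metadata");
        System.out.println(hit);
        System.out.println(miss);
    }
}
```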
-
readMetadataDirs
public static ParquetTableMetadataDirs readMetadataDirs(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path path, MetadataContext metaContext, ParquetReaderConfig readerConfig)
Get the parquet metadata for all subdirectories by reading the metadata file.
Parameters:
fs - current file system
path - The path to the metadata file, located in the directory that contains the parquet files
metaContext - metadata context
readerConfig - parquet reader configuration
Returns:
parquet metadata for a directory. Null if metadata cache is missing, unsupported or corrupted
-
getParquetFileMetadata_v4
public static Metadata_V4.ParquetFileAndRowCountMetadata getParquetFileMetadata_v4(Metadata_V4.ParquetTableMetadata_v4 parquetTableMetadata, org.apache.parquet.hadoop.metadata.ParquetMetadata footer, org.apache.hadoop.fs.FileStatus file, org.apache.hadoop.fs.FileSystem fs, boolean allColumnsInteresting, boolean skipNonInteresting, Set<SchemaPath> columnSet, ParquetReaderConfig readerConfig) throws IOException, InterruptedException
Get the file metadata for a single file.
Parameters:
parquetTableMetadata - The table metadata to be updated with all the columns' info
footer - If non null, use this footer instead of reading it from the file
file - The file
allColumnsInteresting - If true, read the min/max metadata for all the columns
skipNonInteresting - If true, collect info only for the interesting columns
columnSet - Specifies specific columns for which min/max metadata is collected
readerConfig - for the options
Returns:
the file metadata
Throws:
IOException
InterruptedException
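The interaction of allColumnsInteresting, skipNonInteresting and columnSet can be read as two per-column decisions. The sketch below encodes one plausible reading of the parameter descriptions above. It is not taken from Drill's implementation, uses plain String column names in place of SchemaPath, and the names InterestingColumnsSketch, collectsStats and keepsColumn are hypothetical.

```java
import java.util.Set;

class InterestingColumnsSketch {
    /** Assumed rule: min/max statistics are collected for every column, or only for listed ones. */
    static boolean collectsStats(String column, boolean allColumnsInteresting, Set<String> columnSet) {
        return allColumnsInteresting || (columnSet != null && columnSet.contains(column));
    }

    /** Assumed rule: with skipNonInteresting set, columns without statistics are left out entirely. */
    static boolean keepsColumn(String column, boolean allColumnsInteresting,
                               boolean skipNonInteresting, Set<String> columnSet) {
        return !skipNonInteresting || collectsStats(column, allColumnsInteresting, columnSet);
    }

    public static void main(String[] args) {
        Set<String> interesting = Set.of("order_id");
        System.out.println(collectsStats("order_id", false, interesting));
        System.out.println(keepsColumn("comment", false, true, interesting));
    }
}
```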
-
getSummaryFileName
public static org.apache.hadoop.fs.Path getSummaryFileName(org.apache.hadoop.fs.Path metadataParentDir)
-
getDirFileName
public static org.apache.hadoop.fs.Path getDirFileName(org.apache.hadoop.fs.Path metadataParentDir)
-
getSummary
public static Metadata_V4.MetadataSummary getSummary(org.apache.hadoop.fs.FileSystem fs, org.apache.hadoop.fs.Path metadataParentDir, boolean autoRefreshTriggered, ParquetReaderConfig readerConfig)
Reads the summary from the metadata cache file; if the cache file is stale, recreates the metadata.
Parameters:
fs - file system
metadataParentDir - parent directory that holds metadata files
autoRefreshTriggered - true if the auto-refresh is already triggered
readerConfig - Parquet reader config
Returns:
metadata summary
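The read-or-recreate behaviour can be sketched in memory as follows. The staleness rule used here (the cache is older than the directory's last modification) is an assumption, since the criterion is not stated on this page; SummaryCacheSketch and its members are hypothetical names, and the autoRefreshTriggered flag is not modelled.

```java
import java.util.function.Supplier;

class SummaryCacheSketch {
    private String cachedSummary;       // stand-in for the cached summary contents
    private long cachedAtMillis = -1L;  // directory timestamp the cache was built against
    int rebuilds = 0;                   // counts how often the summary was recreated

    /** Returns the cached summary, recreating it first when it is missing or stale. */
    String getSummary(long dirModifiedMillis, Supplier<String> rebuild) {
        boolean stale = cachedSummary == null || cachedAtMillis < dirModifiedMillis;
        if (stale) {
            cachedSummary = rebuild.get();
            cachedAtMillis = dirModifiedMillis;
            rebuilds++;
        }
        return cachedSummary;
    }

    public static void main(String[] args) {
        SummaryCacheSketch cache = new SummaryCacheSketch();
        System.out.println(cache.getSummary(100L, () -> "summary-v1")); // first call builds the cache
        System.out.println(cache.getSummary(100L, () -> "summary-v2")); // unchanged directory: cache reused
        System.out.println(cache.getSummary(200L, () -> "summary-v2")); // newer directory: cache recreated
    }
}
```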
-