Class ParquetFileWriter

java.lang.Object
org.apache.parquet.hadoop.ParquetFileWriter

public class ParquetFileWriter extends Object
Internal implementation of the Parquet file writer as a block container.
Note: this is a temporary Drill-Parquet class needed to write empty Parquet files. Details in PARQUET-2026 and DRILL-7907.
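
For illustration, a minimal sketch of the empty-file use case described above. It assumes parquet-hadoop's HadoopOutputFile helper and a CREATE constant on ParquetFileWriter.Mode (neither is documented on this page); the schema, path, and size values are placeholders.

  import java.util.Collections;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.parquet.hadoop.ParquetFileWriter;
  import org.apache.parquet.hadoop.util.HadoopOutputFile;
  import org.apache.parquet.schema.MessageType;
  import org.apache.parquet.schema.MessageTypeParser;

  public class EmptyParquetFileSketch {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      MessageType schema = MessageTypeParser.parseMessageType(
          "message example { required int32 id; }");

      // Wrap a Hadoop path as an org.apache.parquet.io.OutputFile (assumed helper).
      ParquetFileWriter writer = new ParquetFileWriter(
          HadoopOutputFile.fromPath(new Path("/tmp/empty.parquet"), conf),
          schema,
          ParquetFileWriter.Mode.CREATE, // assumed enum constant
          128 * 1024 * 1024,             // rowGroupSize (placeholder)
          8 * 1024 * 1024,               // maxPaddingSize (placeholder)
          64,                            // columnIndexTruncateLength (placeholder)
          Integer.MAX_VALUE,             // statisticsTruncateLength (placeholder)
          true);                         // pageWriteChecksumEnabled

      writer.start();                     // start the file
      writer.end(Collections.emptyMap()); // write the footer and close; no row groups added
    }
  }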
  • Nested Class Summary

    Nested Classes
    Modifier and Type    Class                     Description
    static enum          ParquetFileWriter.Mode
  • Field Summary

    Fields
    Modifier and Type                                             Field                           Description
    static final int                                              CURRENT_VERSION
    static final String                                           EF_MAGIC_STR
    static final byte[]                                           EFMAGIC
    static final byte[]                                           MAGIC
    static final String                                           MAGIC_STR
    protected final org.apache.parquet.io.PositionOutputStream    out
    static final String                                           PARQUET_COMMON_METADATA_FILE
    static final String                                           PARQUET_METADATA_FILE
  • Constructor Summary

    Constructors
    Constructor
    Description
    ParquetFileWriter(org.apache.hadoop.conf.Configuration configuration, org.apache.parquet.schema.MessageType schema, org.apache.hadoop.fs.Path file)
    Deprecated.
    will be removed in 2.0.0
    ParquetFileWriter(org.apache.hadoop.conf.Configuration configuration, org.apache.parquet.schema.MessageType schema, org.apache.hadoop.fs.Path file, ParquetFileWriter.Mode mode)
    Deprecated.
    will be removed in 2.0.0
    ParquetFileWriter(org.apache.hadoop.conf.Configuration configuration, org.apache.parquet.schema.MessageType schema, org.apache.hadoop.fs.Path file, ParquetFileWriter.Mode mode, long rowGroupSize, int maxPaddingSize)
    Deprecated.
    will be removed in 2.0.0
    ParquetFileWriter(org.apache.parquet.io.OutputFile file, org.apache.parquet.schema.MessageType schema, ParquetFileWriter.Mode mode, long rowGroupSize, int maxPaddingSize)
    Deprecated.
    will be removed in 2.0.0
    ParquetFileWriter(org.apache.parquet.io.OutputFile file, org.apache.parquet.schema.MessageType schema, ParquetFileWriter.Mode mode, long rowGroupSize, int maxPaddingSize, int columnIndexTruncateLength, int statisticsTruncateLength, boolean pageWriteChecksumEnabled)
     
    ParquetFileWriter(org.apache.parquet.io.OutputFile file, org.apache.parquet.schema.MessageType schema, ParquetFileWriter.Mode mode, long rowGroupSize, int maxPaddingSize, int columnIndexTruncateLength, int statisticsTruncateLength, boolean pageWriteChecksumEnabled, org.apache.parquet.crypto.FileEncryptionProperties encryptionProperties)
     
  • Method Summary

    Modifier and Type
    Method
    Description
    void
    appendColumnChunk(org.apache.parquet.column.ColumnDescriptor descriptor, org.apache.parquet.io.SeekableInputStream from, org.apache.parquet.hadoop.metadata.ColumnChunkMetaData chunk, org.apache.parquet.column.values.bloomfilter.BloomFilter bloomFilter, org.apache.parquet.internal.column.columnindex.ColumnIndex columnIndex, org.apache.parquet.internal.column.columnindex.OffsetIndex offsetIndex)
     
    void
    appendFile(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path file)
    Deprecated.
    will be removed in 2.0.0; use appendFile(InputFile) instead
    void
    appendFile(org.apache.parquet.io.InputFile file)
     
    void
    appendRowGroup(org.apache.hadoop.fs.FSDataInputStream from, org.apache.parquet.hadoop.metadata.BlockMetaData rowGroup, boolean dropColumns)
    Deprecated.
    will be removed in 2.0.0; use appendRowGroup(SeekableInputStream,BlockMetaData,boolean) instead
    void
    appendRowGroup(org.apache.parquet.io.SeekableInputStream from, org.apache.parquet.hadoop.metadata.BlockMetaData rowGroup, boolean dropColumns)
     
    void
    appendRowGroups(org.apache.hadoop.fs.FSDataInputStream file, List<org.apache.parquet.hadoop.metadata.BlockMetaData> rowGroups, boolean dropColumns)
    Deprecated.
    will be removed in 2.0.0; use appendRowGroups(SeekableInputStream,List,boolean) instead
    void
    appendRowGroups(org.apache.parquet.io.SeekableInputStream file, List<org.apache.parquet.hadoop.metadata.BlockMetaData> rowGroups, boolean dropColumns)
     
    void
    end(Map<String,String> extraMetaData)
    ends a file once all blocks have been written.
    void
    endBlock()
    ends a block once all column chunks have been written
    void
    endColumn()
    end a column (once all rep, def and data have been written)
    org.apache.parquet.hadoop.metadata.ParquetMetadata
    getFooter()
     
    long
    getNextRowGroupSize()
     
    long
    getPos()
     
    static org.apache.parquet.hadoop.metadata.ParquetMetadata
    mergeMetadataFiles(List<org.apache.hadoop.fs.Path> files, org.apache.hadoop.conf.Configuration conf)
    Deprecated.
    metadata files are not recommended and will be removed in 2.0.0
    static org.apache.parquet.hadoop.metadata.ParquetMetadata
    mergeMetadataFiles(List<org.apache.hadoop.fs.Path> files, org.apache.hadoop.conf.Configuration conf, org.apache.parquet.hadoop.metadata.KeyValueMetadataMergeStrategy keyValueMetadataMergeStrategy)
    Deprecated.
    metadata files are not recommended and will be removed in 2.0.0
    void
    start()
    start the file
    void
    startBlock(long recordCount)
    start a block
    void
    startColumn(org.apache.parquet.column.ColumnDescriptor descriptor, long valueCount, org.apache.parquet.hadoop.metadata.CompressionCodecName compressionCodecName)
    start a column inside a block
    void
    writeDataPage(int valueCount, int uncompressedPageSize, org.apache.parquet.bytes.BytesInput bytes, org.apache.parquet.column.Encoding rlEncoding, org.apache.parquet.column.Encoding dlEncoding, org.apache.parquet.column.Encoding valuesEncoding)
    Deprecated.
    void
    writeDataPage(int valueCount, int uncompressedPageSize, org.apache.parquet.bytes.BytesInput bytes, org.apache.parquet.column.statistics.Statistics statistics, long rowCount, org.apache.parquet.column.Encoding rlEncoding, org.apache.parquet.column.Encoding dlEncoding, org.apache.parquet.column.Encoding valuesEncoding)
    Writes a single page
    void
    writeDataPage(int valueCount, int uncompressedPageSize, org.apache.parquet.bytes.BytesInput bytes, org.apache.parquet.column.statistics.Statistics statistics, org.apache.parquet.column.Encoding rlEncoding, org.apache.parquet.column.Encoding dlEncoding, org.apache.parquet.column.Encoding valuesEncoding)
    Deprecated.
    this method does not support writing column indexes; Use writeDataPage(int, int, BytesInput, Statistics, long, Encoding, Encoding, Encoding) instead
    void
    writeDataPageV2(int rowCount, int nullCount, int valueCount, org.apache.parquet.bytes.BytesInput repetitionLevels, org.apache.parquet.bytes.BytesInput definitionLevels, org.apache.parquet.column.Encoding dataEncoding, org.apache.parquet.bytes.BytesInput compressedData, int uncompressedDataSize, org.apache.parquet.column.statistics.Statistics<?> statistics)
    Writes a single v2 data page
    void
    writeDictionaryPage(org.apache.parquet.column.page.DictionaryPage dictionaryPage)
    writes a dictionary page
    void
    writeDictionaryPage(org.apache.parquet.column.page.DictionaryPage dictionaryPage, org.apache.parquet.format.BlockCipher.Encryptor headerBlockEncryptor, byte[] AAD)
     
    static void
    writeMergedMetadataFile(List<org.apache.hadoop.fs.Path> files, org.apache.hadoop.fs.Path outputPath, org.apache.hadoop.conf.Configuration conf)
    Deprecated.
    metadata files are not recommended and will be removed in 2.0.0
    static void
    writeMetadataFile(org.apache.hadoop.conf.Configuration configuration, org.apache.hadoop.fs.Path outputPath, List<org.apache.parquet.hadoop.Footer> footers)
    Deprecated.
    metadata files are not recommended and will be removed in 2.0.0
    static void
    writeMetadataFile(org.apache.hadoop.conf.Configuration configuration, org.apache.hadoop.fs.Path outputPath, List<org.apache.parquet.hadoop.Footer> footers, org.apache.parquet.hadoop.ParquetOutputFormat.JobSummaryLevel level)
    Deprecated.
    metadata files are not recommended and will be removed in 2.0.0

    Methods inherited from class java.lang.Object

    clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
  • Field Details

  • Constructor Details

    • ParquetFileWriter

      @Deprecated public ParquetFileWriter(org.apache.hadoop.conf.Configuration configuration, org.apache.parquet.schema.MessageType schema, org.apache.hadoop.fs.Path file) throws IOException
      Deprecated.
      will be removed in 2.0.0
      Parameters:
      configuration - Hadoop configuration
      schema - the schema of the data
      file - the file to write to
      Throws:
      IOException - if the file can not be created
    • ParquetFileWriter

      @Deprecated public ParquetFileWriter(org.apache.hadoop.conf.Configuration configuration, org.apache.parquet.schema.MessageType schema, org.apache.hadoop.fs.Path file, ParquetFileWriter.Mode mode) throws IOException
      Deprecated.
      will be removed in 2.0.0
      Parameters:
      configuration - Hadoop configuration
      schema - the schema of the data
      file - the file to write to
      mode - file creation mode
      Throws:
      IOException - if the file can not be created
    • ParquetFileWriter

      @Deprecated public ParquetFileWriter(org.apache.hadoop.conf.Configuration configuration, org.apache.parquet.schema.MessageType schema, org.apache.hadoop.fs.Path file, ParquetFileWriter.Mode mode, long rowGroupSize, int maxPaddingSize) throws IOException
      Deprecated.
      will be removed in 2.0.0
      Parameters:
      configuration - Hadoop configuration
      schema - the schema of the data
      file - the file to write to
      mode - file creation mode
      rowGroupSize - the row group size
      maxPaddingSize - the maximum padding
      Throws:
      IOException - if the file can not be created
    • ParquetFileWriter

      @Deprecated public ParquetFileWriter(org.apache.parquet.io.OutputFile file, org.apache.parquet.schema.MessageType schema, ParquetFileWriter.Mode mode, long rowGroupSize, int maxPaddingSize) throws IOException
      Deprecated.
      will be removed in 2.0.0
      Parameters:
      file - OutputFile to create or overwrite
      schema - the schema of the data
      mode - file creation mode
      rowGroupSize - the row group size
      maxPaddingSize - the maximum padding
      Throws:
      IOException - if the file can not be created
    • ParquetFileWriter

      public ParquetFileWriter(org.apache.parquet.io.OutputFile file, org.apache.parquet.schema.MessageType schema, ParquetFileWriter.Mode mode, long rowGroupSize, int maxPaddingSize, int columnIndexTruncateLength, int statisticsTruncateLength, boolean pageWriteChecksumEnabled) throws IOException
      Parameters:
      file - OutputFile to create or overwrite
      schema - the schema of the data
      mode - file creation mode
      rowGroupSize - the row group size
      maxPaddingSize - the maximum padding
      columnIndexTruncateLength - the length to which the min/max values in column indexes are truncated
      statisticsTruncateLength - the length to which the min/max values in row groups are truncated
      pageWriteChecksumEnabled - whether to write out page level checksums
      Throws:
      IOException - if the file can not be created
    • ParquetFileWriter

      public ParquetFileWriter(org.apache.parquet.io.OutputFile file, org.apache.parquet.schema.MessageType schema, ParquetFileWriter.Mode mode, long rowGroupSize, int maxPaddingSize, int columnIndexTruncateLength, int statisticsTruncateLength, boolean pageWriteChecksumEnabled, org.apache.parquet.crypto.FileEncryptionProperties encryptionProperties) throws IOException
      Throws:
      IOException
  • Method Details

    • start

      public void start() throws IOException
      start the file
      Throws:
      IOException - if there is an error while writing
    • startBlock

      public void startBlock(long recordCount) throws IOException
      start a block
      Parameters:
      recordCount - the record count in this block
      Throws:
      IOException - if there is an error while writing
    • startColumn

      public void startColumn(org.apache.parquet.column.ColumnDescriptor descriptor, long valueCount, org.apache.parquet.hadoop.metadata.CompressionCodecName compressionCodecName) throws IOException
      start a column inside a block
      Parameters:
      descriptor - the column descriptor
      valueCount - the value count in this column
      compressionCodecName - a compression codec name
      Throws:
      IOException - if there is an error while writing
    • writeDictionaryPage

      public void writeDictionaryPage(org.apache.parquet.column.page.DictionaryPage dictionaryPage) throws IOException
      writes a dictionary page
      Parameters:
      dictionaryPage - the dictionary page
      Throws:
      IOException - if there is an error while writing
    • writeDictionaryPage

      public void writeDictionaryPage(org.apache.parquet.column.page.DictionaryPage dictionaryPage, org.apache.parquet.format.BlockCipher.Encryptor headerBlockEncryptor, byte[] AAD) throws IOException
      Throws:
      IOException
    • writeDataPage

      @Deprecated public void writeDataPage(int valueCount, int uncompressedPageSize, org.apache.parquet.bytes.BytesInput bytes, org.apache.parquet.column.Encoding rlEncoding, org.apache.parquet.column.Encoding dlEncoding, org.apache.parquet.column.Encoding valuesEncoding) throws IOException
      Deprecated.
      writes a single page
      Parameters:
      valueCount - count of values
      uncompressedPageSize - the size of the data once uncompressed
      bytes - the compressed data for the page without header
      rlEncoding - encoding of the repetition level
      dlEncoding - encoding of the definition level
      valuesEncoding - encoding of values
      Throws:
      IOException - if there is an error while writing
    • writeDataPage

      @Deprecated public void writeDataPage(int valueCount, int uncompressedPageSize, org.apache.parquet.bytes.BytesInput bytes, org.apache.parquet.column.statistics.Statistics statistics, org.apache.parquet.column.Encoding rlEncoding, org.apache.parquet.column.Encoding dlEncoding, org.apache.parquet.column.Encoding valuesEncoding) throws IOException
      Deprecated.
      this method does not support writing column indexes; Use writeDataPage(int, int, BytesInput, Statistics, long, Encoding, Encoding, Encoding) instead
      writes a single page
      Parameters:
      valueCount - count of values
      uncompressedPageSize - the size of the data once uncompressed
      bytes - the compressed data for the page without header
      statistics - statistics for the page
      rlEncoding - encoding of the repetition level
      dlEncoding - encoding of the definition level
      valuesEncoding - encoding of values
      Throws:
      IOException - if there is an error while writing
    • writeDataPage

      public void writeDataPage(int valueCount, int uncompressedPageSize, org.apache.parquet.bytes.BytesInput bytes, org.apache.parquet.column.statistics.Statistics statistics, long rowCount, org.apache.parquet.column.Encoding rlEncoding, org.apache.parquet.column.Encoding dlEncoding, org.apache.parquet.column.Encoding valuesEncoding) throws IOException
      Writes a single page
      Parameters:
      valueCount - count of values
      uncompressedPageSize - the size of the data once uncompressed
      bytes - the compressed data for the page without header
      statistics - the statistics of the page
      rowCount - the number of rows in the page
      rlEncoding - encoding of the repetition level
      dlEncoding - encoding of the definition level
      valuesEncoding - encoding of values
      Throws:
      IOException - if any I/O error occurs during writing the file
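
      As a rough illustration of the surrounding call order (startBlock, startColumn, this method, endColumn, endBlock), a sketch that writes one row group with a single plain-encoded, uncompressed, required int32 column so the page bytes can be built by hand. The writer is assumed to be constructed and start()ed as in the class-level sketch; names and values are placeholders.

      import java.io.IOException;
      import java.nio.ByteBuffer;
      import java.nio.ByteOrder;
      import org.apache.parquet.bytes.BytesInput;
      import org.apache.parquet.column.ColumnDescriptor;
      import org.apache.parquet.column.Encoding;
      import org.apache.parquet.column.statistics.Statistics;
      import org.apache.parquet.hadoop.ParquetFileWriter;
      import org.apache.parquet.hadoop.metadata.CompressionCodecName;
      import org.apache.parquet.schema.MessageType;

      public class RowGroupSketch {
        static void writeOneRowGroup(ParquetFileWriter writer, MessageType schema) throws IOException {
          // Three PLAIN-encoded int32 values; a required top-level column carries no
          // repetition/definition level bytes, so the page body is just the values.
          byte[] plain = ByteBuffer.allocate(3 * 4).order(ByteOrder.LITTLE_ENDIAN)
              .putInt(1).putInt(2).putInt(3).array();

          ColumnDescriptor column = schema.getColumns().get(0);
          Statistics<?> stats = Statistics.createStats(column.getPrimitiveType()); // empty page statistics

          writer.startBlock(3);                                             // 3 records in this block
          writer.startColumn(column, 3, CompressionCodecName.UNCOMPRESSED); // 3 values, no compression
          writer.writeDataPage(
              3,                      // valueCount
              plain.length,           // uncompressedPageSize
              BytesInput.from(plain), // page bytes (identical to the compressed form here)
              stats,                  // statistics for the page
              3L,                     // rowCount
              Encoding.RLE,           // rlEncoding
              Encoding.RLE,           // dlEncoding
              Encoding.PLAIN);        // valuesEncoding
          writer.endColumn();
          writer.endBlock();
        }
      }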
    • writeDataPageV2

      public void writeDataPageV2(int rowCount, int nullCount, int valueCount, org.apache.parquet.bytes.BytesInput repetitionLevels, org.apache.parquet.bytes.BytesInput definitionLevels, org.apache.parquet.column.Encoding dataEncoding, org.apache.parquet.bytes.BytesInput compressedData, int uncompressedDataSize, org.apache.parquet.column.statistics.Statistics<?> statistics) throws IOException
      Writes a single v2 data page
      Parameters:
      rowCount - count of rows
      nullCount - count of nulls
      valueCount - count of values
      repetitionLevels - repetition level bytes
      definitionLevels - definition level bytes
      dataEncoding - encoding for data
      compressedData - compressed data bytes
      uncompressedDataSize - the size of uncompressed data
      statistics - the statistics of the page
      Throws:
      IOException - if any I/O error occurs during writing the file
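
      For comparison, a fragment continuing the sketch above (reusing its hypothetical writer, plain, and stats names): the v2 call passes the level bytes separately, and for a required top-level column they can be empty.

      writer.writeDataPageV2(
          3,                      // rowCount
          0,                      // nullCount
          3,                      // valueCount
          BytesInput.empty(),     // repetitionLevels (max repetition level is 0)
          BytesInput.empty(),     // definitionLevels (max definition level is 0)
          Encoding.PLAIN,         // dataEncoding
          BytesInput.from(plain), // compressedData (UNCOMPRESSED codec assumed)
          plain.length,           // uncompressedDataSize
          stats);                 // statistics for the page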
    • endColumn

      public void endColumn() throws IOException
      end a column (once all rep, def and data have been written)
      Throws:
      IOException - if there is an error while writing
    • endBlock

      public void endBlock() throws IOException
      ends a block once all column chunks have been written
      Throws:
      IOException - if there is an error while writing
    • appendFile

      @Deprecated public void appendFile(org.apache.hadoop.conf.Configuration conf, org.apache.hadoop.fs.Path file) throws IOException
      Deprecated.
      will be removed in 2.0.0; use appendFile(InputFile) instead
      Parameters:
      conf - a configuration
      file - a file path to append the contents of to this file
      Throws:
      IOException - if there is an error while reading or writing
    • appendFile

      public void appendFile(org.apache.parquet.io.InputFile file) throws IOException
      Throws:
      IOException
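
      A minimal sketch of concatenating existing Parquet files via this method, assuming parquet-hadoop's HadoopInputFile helper and inputs written with a schema compatible with this writer's schema; names and paths are placeholders.

      import java.io.IOException;
      import java.util.Collections;
      import java.util.List;
      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.Path;
      import org.apache.parquet.hadoop.ParquetFileWriter;
      import org.apache.parquet.hadoop.util.HadoopInputFile;

      public class ConcatSketch {
        static void concatenate(ParquetFileWriter writer, Configuration conf, List<Path> inputs)
            throws IOException {
          writer.start();
          for (Path input : inputs) {
            // Appends the row groups of each input file to the file being written.
            writer.appendFile(HadoopInputFile.fromPath(input, conf));
          }
          writer.end(Collections.emptyMap());
        }
      }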
    • appendRowGroups

      @Deprecated public void appendRowGroups(org.apache.hadoop.fs.FSDataInputStream file, List<org.apache.parquet.hadoop.metadata.BlockMetaData> rowGroups, boolean dropColumns) throws IOException
      Deprecated.
      will be removed in 2.0.0; use appendRowGroups(SeekableInputStream,List,boolean) instead
      Parameters:
      file - a file stream to read from
      rowGroups - row groups to copy
      dropColumns - whether to drop columns from the file that are not in this file's schema
      Throws:
      IOException - if there is an error while reading or writing
    • appendRowGroups

      public void appendRowGroups(org.apache.parquet.io.SeekableInputStream file, List<org.apache.parquet.hadoop.metadata.BlockMetaData> rowGroups, boolean dropColumns) throws IOException
      Throws:
      IOException
    • appendRowGroup

      @Deprecated public void appendRowGroup(org.apache.hadoop.fs.FSDataInputStream from, org.apache.parquet.hadoop.metadata.BlockMetaData rowGroup, boolean dropColumns) throws IOException
      Deprecated.
      will be removed in 2.0.0; use appendRowGroup(SeekableInputStream,BlockMetaData,boolean) instead
      Parameters:
      from - a file stream to read from
      rowGroup - row group to copy
      dropColumns - whether to drop columns from the file that are not in this file's schema
      Throws:
      IOException - if there is an error while reading or writing
    • appendRowGroup

      public void appendRowGroup(org.apache.parquet.io.SeekableInputStream from, org.apache.parquet.hadoop.metadata.BlockMetaData rowGroup, boolean dropColumns) throws IOException
      Throws:
      IOException
    • appendColumnChunk

      public void appendColumnChunk(org.apache.parquet.column.ColumnDescriptor descriptor, org.apache.parquet.io.SeekableInputStream from, org.apache.parquet.hadoop.metadata.ColumnChunkMetaData chunk, org.apache.parquet.column.values.bloomfilter.BloomFilter bloomFilter, org.apache.parquet.internal.column.columnindex.ColumnIndex columnIndex, org.apache.parquet.internal.column.columnindex.OffsetIndex offsetIndex) throws IOException
      Parameters:
      descriptor - the descriptor for the target column
      from - a file stream to read from
      chunk - the column chunk to be copied
      bloomFilter - the bloomFilter for this chunk
      columnIndex - the column index for this chunk
      offsetIndex - the offset index for this chunk
      Throws:
      IOException
    • end

      public void end(Map<String,String> extraMetaData) throws IOException
      ends a file once all blocks have been written and closes the file
      Parameters:
      extraMetaData - the extra meta data to write in the footer
      Throws:
      IOException - if there is an error while writing
    • getFooter

      public org.apache.parquet.hadoop.metadata.ParquetMetadata getFooter()
    • mergeMetadataFiles

      @Deprecated public static org.apache.parquet.hadoop.metadata.ParquetMetadata mergeMetadataFiles(List<org.apache.hadoop.fs.Path> files, org.apache.hadoop.conf.Configuration conf) throws IOException
      Deprecated.
      metadata files are not recommended and will be removed in 2.0.0
      Given a list of metadata files, merge them into a single ParquetMetadata. Requires that the schemas be compatible and the extraMetaData be exactly equal.
      Parameters:
      files - a list of files to merge metadata from
      conf - a configuration
      Returns:
      merged parquet metadata for the files
      Throws:
      IOException - if there is an error while writing
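
      A short fragment illustrating this deprecated call; the _metadata paths are placeholders.

      // Merge the footers of two (hypothetical) _metadata files into one ParquetMetadata.
      org.apache.parquet.hadoop.metadata.ParquetMetadata merged =
          ParquetFileWriter.mergeMetadataFiles(
              java.util.Arrays.asList(
                  new org.apache.hadoop.fs.Path("/data/part1/_metadata"),
                  new org.apache.hadoop.fs.Path("/data/part2/_metadata")),
              new org.apache.hadoop.conf.Configuration());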
    • mergeMetadataFiles

      @Deprecated public static org.apache.parquet.hadoop.metadata.ParquetMetadata mergeMetadataFiles(List<org.apache.hadoop.fs.Path> files, org.apache.hadoop.conf.Configuration conf, org.apache.parquet.hadoop.metadata.KeyValueMetadataMergeStrategy keyValueMetadataMergeStrategy) throws IOException
      Deprecated.
      metadata files are not recommended and will be removed in 2.0.0
      Given a list of metadata files, merge them into a single ParquetMetadata. Requires that the schemas be compatible and the extraMetaData be exactly equal.
      Parameters:
      files - a list of files to merge metadata from
      conf - a configuration
      keyValueMetadataMergeStrategy - strategy to merge values for the same key, if there are multiple
      Returns:
      merged parquet metadata for the files
      Throws:
      IOException - if there is an error while writing
    • writeMergedMetadataFile

      @Deprecated public static void writeMergedMetadataFile(List<org.apache.hadoop.fs.Path> files, org.apache.hadoop.fs.Path outputPath, org.apache.hadoop.conf.Configuration conf) throws IOException
      Deprecated.
      metadata files are not recommended and will be removed in 2.0.0
      Given a list of metadata files, merge them into a single metadata file. Requires that the schemas be compatible, and the extraMetaData be exactly equal. This is useful when merging 2 directories of parquet files into a single directory, as long as both directories were written with compatible schemas and equal extraMetaData.
      Parameters:
      files - a list of files to merge metadata from
      outputPath - path to write merged metadata to
      conf - a configuration
      Throws:
      IOException - if there is an error while reading or writing
    • writeMetadataFile

      @Deprecated public static void writeMetadataFile(org.apache.hadoop.conf.Configuration configuration, org.apache.hadoop.fs.Path outputPath, List<org.apache.parquet.hadoop.Footer> footers) throws IOException
      Deprecated.
      metadata files are not recommended and will be removed in 2.0.0
      writes a _metadata and _common_metadata file
      Parameters:
      configuration - the configuration to use to get the FileSystem
      outputPath - the directory to write the _metadata file to
      footers - the list of footers to merge
      Throws:
      IOException - if there is an error while writing
    • writeMetadataFile

      @Deprecated public static void writeMetadataFile(org.apache.hadoop.conf.Configuration configuration, org.apache.hadoop.fs.Path outputPath, List<org.apache.parquet.hadoop.Footer> footers, org.apache.parquet.hadoop.ParquetOutputFormat.JobSummaryLevel level) throws IOException
      Deprecated.
      metadata files are not recommended and will be removed in 2.0.0
      writes a _common_metadata file, and optionally a _metadata file, depending on the ParquetOutputFormat.JobSummaryLevel provided
      Parameters:
      configuration - the configuration to use to get the FileSystem
      outputPath - the directory to write the _metadata file to
      footers - the list of footers to merge
      level - level of summary to write
      Throws:
      IOException - if there is an error while writing
    • getPos

      public long getPos() throws IOException
      Returns:
      the current position in the underlying file
      Throws:
      IOException - if there is an error while getting the current stream's position
    • getNextRowGroupSize

      public long getNextRowGroupSize() throws IOException
      Throws:
      IOException