Choosing a Storage Format

Nov 2, 2018

Drill supports several data file formats, including CSV, TSV, PSV, JSON, and Parquet. Changing the storage format is a common functional change that can improve performance. Drill runs fastest against Parquet files because the Parquet data representation is almost identical to how Drill represents data.
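
As a minimal sketch of switching the storage format, the statements below set the `store.format` option for the current session and then write query results as Parquet with CREATE TABLE AS. The sketch assumes that the `dfs.tmp` workspace is writable; the file path, table name, and column aliases are hypothetical.

```sql
-- Write new tables as Parquet for the rest of this session.
ALTER SESSION SET `store.format` = 'parquet';

-- Hypothetical example: convert a headerless CSV file to a Parquet table.
-- Drill exposes the fields of each CSV row through the `columns` array.
CREATE TABLE dfs.tmp.`orders_parquet` AS
SELECT CAST(columns[0] AS INT)         AS order_id,
       CAST(columns[1] AS VARCHAR(64)) AS customer,
       CAST(columns[2] AS DOUBLE)      AS total
FROM dfs.`/data/orders.csv`;
```

Casting each field during the conversion means the resulting Parquet table carries explicit types rather than plain text values.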

Optimized for working with large files, Parquet arranges data in columns, putting related values in close proximity to each other to optimize query performance, minimize I/O, and facilitate compression. Parquet detects repeated or similar values and encodes them compactly, using techniques such as dictionary and run-length encoding, which conserves storage and I/O.

When using Parquet as the storage format, balance the number of files against the file size to achieve maximum parallelization. See Configuring the Size of Parquet Files.
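
One option that influences this balance is `store.parquet.block-size`, which sets the target size, in bytes, of the Parquet data that Drill writes. The statement below is a sketch only; the value is illustrative, not a recommendation, and the right size depends on your file system and cluster.

```sql
-- Illustrative value only: 268435456 bytes = 256 MB.
ALTER SESSION SET `store.parquet.block-size` = 268435456;
```

The setting applies to Parquet files that Drill writes afterward in the session, for example with CREATE TABLE AS.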

When Drill reads Parquet data, it loads only the columns that the query needs, which reduces I/O. By reading only a small piece of the Parquet data from a data file or table, Drill can examine and analyze all values for a column across multiple files.
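
To illustrate the effect of column selection, the following sketch compares a query that names its columns with one that selects everything. The table `dfs.tmp.orders_parquet` and its columns are hypothetical.

```sql
-- Only the `customer` and `total` columns need to be read from the Parquet files.
SELECT customer, SUM(total) AS total_spent
FROM dfs.tmp.`orders_parquet`
GROUP BY customer;

-- By contrast, SELECT * requires Drill to read every column.
SELECT *
FROM dfs.tmp.`orders_parquet`
LIMIT 10;
```

Naming only the columns you need is the simplest way to benefit from the columnar layout.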

SQL does not support all Parquet data types. To prevent Drill from inferring a type other than the one you want, use the CAST function or the CONVERT_TO and CONVERT_FROM functions. See Data Type Conversion.
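
A short sketch of explicit conversion follows. It assumes a hypothetical Parquet file in which `order_id` should be treated as a BIGINT and `customer` is stored as binary data that holds UTF-8 text; the path and column names are not from the Drill documentation.

```sql
-- Hypothetical file and column names.
SELECT CAST(t.order_id AS BIGINT)       AS order_id,  -- force the numeric type
       CONVERT_FROM(t.customer, 'UTF8') AS customer   -- decode binary as UTF-8 text
FROM dfs.`/data/orders_from_other_tool.parquet` t;
```

CAST changes one SQL type into another, while CONVERT_FROM decodes binary data using the named encoding.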

See Parquet Format for more information about Parquet with Drill. You may also be interested in the JSON Data Model, Data Sources and File Formats Introduction, and Supported Data Types.