Choosing a Storage Format
Drill supports several file formats for data, including CSV, TSV, PSV, JSON, and Parquet. Changing the default storage format is a common functional change that can improve performance. Drill runs fastest against Parquet files because the Parquet data representation is almost identical to how Drill represents data.
Optimized for working with large files, Parquet arranges data in columns, putting related values in close proximity to each other to optimize query performance, minimize I/O, and facilitate compression. Parquet detects identical or similar values and encodes them compactly, which conserves storage and I/O resources.
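As a rough illustration of why a columnar layout facilitates compression, the following Python sketch applies run-length encoding to a single column. This is only a simplified stand-in for the kind of encoding described above, not Parquet's actual on-disk format; the function names and sample rows are hypothetical.

```python
from itertools import groupby

def run_length_encode(values):
    """Collapse each run of consecutive equal values into a (value, count) pair."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]

def run_length_decode(pairs):
    """Expand (value, count) pairs back into the original list of values."""
    return [value for value, count in pairs for _ in range(count)]

# Hypothetical rows of (order_id, country).
rows = [(1, "US"), (2, "US"), (3, "US"), (4, "DE"), (5, "DE")]

# In a columnar layout, all values of one column sit next to each other,
# so repeated values form runs that encode compactly.
country_column = [country for _, country in rows]
encoded = run_length_encode(country_column)
```

Here the five country values collapse into two pairs, and decoding the pairs restores the original column. In a row-oriented layout the same values would be interleaved with order IDs, so no such runs would form.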
When using Parquet as the storage format, balance the number of files against the file size to achieve maximum parallelization. See Configuring the Size of Parquet Files.
When Drill reads Parquet data, it loads only the columns that the query needs, which reduces I/O. Because it reads only a small portion of the Parquet data from a data file or table, Drill can examine and analyze all values for a column across multiple files.
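The effect of loading only the necessary columns can be sketched in a few lines of Python. This is a conceptual model under simplifying assumptions, not Drill's reader: the table, the column names, and the read_columns helper are all hypothetical.

```python
# Hypothetical table stored column by column: each column is its own list.
columnar_table = {
    "order_id": [1, 2, 3, 4],
    "country": ["US", "US", "DE", "FR"],
    "amount": [10.0, 25.5, 7.25, 40.0],
}

def read_columns(table, needed):
    """Return only the requested columns; the other columns are never accessed."""
    return {name: table[name] for name in needed}

# A query that sums the amount column needs only that one column.
projected = read_columns(columnar_table, ["amount"])
total = sum(projected["amount"])
```

Only one of the three columns is read to compute the total. With a row-oriented format such as CSV, every field of every row would have to be read and parsed to answer the same query.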