Apache Drill is a low latency distributed query engine for large-scale datasets, including structured and semi-structured/nested data. Inspired by Google's Dremel, Drill is designed to scale to several thousands of nodes and query petabytes of data at interactive speeds that BI/Analytics environments require.

High-Level Architecture

Drill includes a distributed execution environment, purpose built for large-scale data processing. At the core of Apache Drill is the 'Drillbit' service, which is responsible for accepting requests from the client, processing the queries, and returning results to the client.

A Drillbit service can be installed and run on all of the required nodes in a Hadoop cluster to form a distributed cluster environment. When a Drillbit runs on each data node in the cluster, Drill can maximize data locality during query execution without moving data over the network or between nodes. Drill uses ZooKeeper to maintain cluster membership and health-check information.

Note that though Drill works in a Hadoop cluster environment, Drill is not tied to Hadoop and can run in any distributed cluster environment. The only pre-requisite for Drill is Zookeeper.

Query Flow in Drill

The following image represents the flow of a Drill query: The flow of a Drill query The flow of a Drill query typically involves the following steps:

Core Modules within a Drillbit

The following image represents Drillbit components:

Drillbit components

• RPC end point: Drill exposes a low overhead protobuf-based RPC protocol to communicate with the clients. Additionally, a C++ and Java API layers are also available for the client applications to interact with Drill. Clients can communicate to a specific Drillbit directly or go through a ZooKeeper quorum to discover the available Drillbits before submitting queries. It is recommended that the clients always go through ZooKeeper to shield clients from the intricacies of cluster management, such as the addition or removal of nodes.

• SQL parser: Drill uses Calcite, the open source framework, to parse incoming queries. The output of the parser component is a language agnostic, computer-friendly logical plan that represents the query.

• Optimizer: Drill uses various standard database optimizations such as rule based/cost based, as well as data locality and other optimization rules exposed by the storage engine to re-write and split the query. The output of the optimizer is a distributed physical query plan that represents the most efficient and fastest way to execute the query across different nodes in the cluster.

• Execution engine: Drill provides a MPP execution engine built to perform distributed query processing across the various nodes in the cluster.

• Storage plugin interfaces: Drill serves as a query layer on top of several data sources. Storage plugins in Drill represent the abstractions that Drill uses to interact with the data sources. Storage plugins provide Drill with the following information:

• Metadata available in the source
• Interfaces for Drill to read from and write to data sources
• Location of data and a set of optimization rules to help with efficient and faster execution of Drill queries on a specific data source

In the context of Hadoop, Drill provides storage plugins for files and HBase/M7. Drill also integrates with Hive as a storage plugin since Hive provides a metadata abstraction layer on top of files, HBase/M7, and provides libraries to read data and operate on these sources (SerDes and UDFs).

When users query files and HBase/M7 with Drill, they can do it directly or go through Hive if they have metadata defined there. Drill integration with Hive is only for metadata. Drill does not invoke the Hive execution engine for any requests.

• Distributed cache: Drill uses a distributed cache to manage metadata (not the data) and configuration information across various nodes. Sample metadata information that is stored in the cache includes query plan fragments, intermediate state of the query execution, and statistics. Drill uses Infinispan as its cache technology.

Architectural Highlights

The goal for Drill is to bring the SQL ecosystem and performance of the relational systems to Hadoop scale data WITHOUT compromising on the flexibility of Hadoop/NoSQL systems. There are several core architectural elements in Apache Drill that make it a highly flexible and efficient query engine.



Drill is designed from the ground up for high performance on large datasets. Few core elements of Drill processing that help Drill achieve its performance include: