Why Drill

Top 10 Reasons to Use Drill

1. Get started in minutes

It only takes a couple minutes to start working with Drill. Untar it on your Mac or Windows laptop and run a query on a local file. No need to set up any infrastructure. No need to define schemas. Just point at the data and drill!

$ tar -xvf apache-drill-0.6.0-incubating.tar.gz
$ apache-drill-0.6.0-incubating/bin/sqlline -u jdbc:drill:zk=local
0: jdbc:drill:zk=local> SELECT * FROM dfs.root.`path/to/employee.json` limit 5;
| employee_id | full_name        | first_name | last_name  | position_id | position_title       |  store_id  | department_id | birt 
| 1           | Sheri Nowmer     | Sheri      | Nowmer     | 1           | President            | 0          | 1             | 19   
| 2           | Derrick Whelply  | Derrick    | Whelply    | 2           | VP Country Manager   | 0          | 1             |
| 4           | Michael Spence   | Michael    | Spence     | 2           | VP Country Manager   | 0          | 1             |
| 5           | Maya Gutierrez   | Maya       | Gutierrez  | 2           | VP Country Manager   | 0          | 1             |
| 6           | Roberta Damstra  | Roberta    | Damstra    | 3           | VP Information Systems | 0        | 2             |

2. Schema-free JSON model

Drill is the world's first and only distributed SQL engine that doesn't require schemas. It shares the same schema-free JSON model as MongoDB and Elasticsearch. Instead of spending weeks or months defining schemas, transforming data (ETL) and maintaining those schemas, simply point Drill at your data (file, directory, HBase table, etc.) and run your queries. Drill automatically understands the structure of the data. Drill's self-service approach reduces the burden on IT and increases the productivity and agility of analysts and developers.

3. Query complex, semi-structured data in-situ

Drill's schema-free JSON model allows you to query complex, semi-structured data in situ. No need to flatten or transform the data prior to or during query execution. Drill also provides intuitive extensions to SQL to work with nested data. Here's a simple query on a JSON file demonstrating how to access nested elements and arrays:

SELECT * FROM (SELECT t.trans_id,
                      t.trans_info.prod_id[0] AS prod_id,
                      t.trans_info.purch_flag AS purchased
               FROM `clicks/clicks.json` t) sq
WHERE sq.prod_id BETWEEN 700 AND 750 AND
      sq.purchased = 'true'
ORDER BY sq.prod_id;

4. Real SQL - not "SQL-like"

Drill supports the standard SQL:2003 syntax. No need to learn a new "SQL-like" language or struggle with a semi-functional BI tool. Drill supports many data types including DATE, INTERVAL, TIMESTAMP, VARCHAR and DECIMAL, as well as complex query constructs such as correlated sub-queries and joins in WHERE clauses. Here is an example of a TPC-H standard query that runs in Drill "as is":

# TPC-H query 4
SELECT  o.o_orderpriority, count(*) AS order_count
FROM orders o
WHERE o.o_orderdate >= date '1996-10-01'
      AND o.o_orderdate < date '1996-10-01' + interval '3' month
                 SELECT * FROM lineitem l 
                 WHERE l.l_orderkey = o.o_orderkey
                 AND l.l_commitdate < l.l_receiptdate
      GROUP BY o.o_orderpriority
      ORDER BY o.o_orderpriority;

5. Leverage standard BI tools

Drill works with standard BI tools. You can keep using the tools you love, such as Tableau, MicroStrategy, QlikView and Excel. No need to introduce yet another visualization or dashboard tool. Combine a self-service BI tool with the only self-service SQL engine to enable true self-service data exploration.

6. Interactive queries on Hive tables

Apache Drill lets you leverage your investments in Hive. You can run interactive queries with Drill on your Hive tables and access all Hive input/output formats (including custom SerDes). You can join tables associated with different Hive metastores, and you can join a Hive table with an HBase table or a directory of log files. Here's a simple query in Drill on a Hive table:

SELECT `month`, state, sum(order_total) AS sales
FROM hive.orders 
GROUP BY `month`, state

7. Access multiple data sources

Drill is designed with extensibility in mind. It provides out-of-the-box connectivity to file systems (local or distributed file systems such as S3, HDFS and MapR-FS), HBase and Hive. You can implement a storage plugin to make Drill work with any other data source. Drill can combine data from multiple data sources on the fly in a single query, with no centralized metadata definitions. Here's a query that combines data from a Hive table, an HBase table (view) and a JSON file:

SELECT custview.membership, sum(orders.order_total) AS sales
FROM hive.orders, custview, dfs.`clicks/clicks.json` c 
WHERE orders.cust_id = custview.cust_id AND orders.cust_id = c.user_info.cust_id 
GROUP BY custview.membership

8. User-Defined Functions (UDFs)

Drill exposes a simple and high-performance Java API to build custom functions (UDFs and UDAFs) so that you can add your own business logic. If you have already built UDFs in Hive, you can reuse them with Drill with no modifications. Refer to Developing Custom Functions for more information.

9. High performance

Drill is designed fround the ground up for high throughput and low latency. It doesn't use a general purpose execution engine like MapReduce, Tez or Spark. As a result, Drill is able to deliver its unparalleled flexibility (schema-free JSON model) without compromising performance. Drill's optimizer leverages rule- and cost-based techniques, as well as data locality and operator push-down (the ability to push down query fragments into the back-end data sources). Drill also provides a columnar and vectorized execution engine, resulting in higher memory and CPU efficiency.

10. Scales from a single laptop to a 1000-node cluster

Drill is available as a simple download you can run on your laptop. When you're ready to analyze larger datasets, simply deploy Drill on your Hadoop cluster (up to 1000 commodity servers). Drill leverages the aggregate memory in the cluster to execute queries using an optimistic pipelined model, and automatically spills to disk when the working set doesn't fit in memory.