Frequently Asked Questions

Overview

Why Drill?

The 40-year monopoly of the RDBMS is over. With the exponential growth of data in recent years, and the shift towards rapid application development, new data is increasingly being stored in non-relational datastores including Hadoop, NoSQL and cloud storage. Apache Drill enables analysts, business users, data scientists and developers to explore and analyze this data without sacrificing the flexibility and agility offered by these datastores. Drill processes the data in-situ without requiring users to define schemas or transform data.

What are some of Drill's key features?

Drill is an innovative distributed SQL engine designed to enable data exploration and analytics on non-relational datastores. Users can query the data using standard SQL and BI tools without having to create and manage schemas. Some of the key features are:

How does Drill achieve performance?

Drill is built from the ground up to achieve high throughput and low latency. The following capabilities help accomplish that:

What datastores does Drill support?

Drill is primarily focused on non-relational datastores, including Hadoop, NoSQL and cloud storage. The following datastores are currently supported:

A new datastore can be added by developing a storage plugin. Drill's unique schema-free JSON data model enables it to query non-relational datastores in-situ (many of these systems store complex or schema-free data).

What clients are supported?

Comparisons

Is Drill a 'SQL-on-Hadoop' engine?

Drill supports a variety of non-relational datastores in addition to Hadoop. Drill takes a different approach compared to traditional SQL-on-Hadoop technologies like Hive and Impala. For example, users can directly query self-describing data (eg, JSON, Parquet) without having to create and manage schemas.

The following table provides a more detailed comparison between Drill and traditional SQL-on-Hadoop technologies:

  Drill SQL-on-Hadoop (Hive, Impala, etc.)
Use case Self-service, in-situ, SQL-based analytics Data warehouse offload
Data sources Hadoop, NoSQL, cloud storage (including multiple instances) A single Hadoop cluster
Data model Schema-free JSON (like MongoDB) Relational
User experience Point-and-query Ingest data → define schemas → query
Deployment model Standalone service or co-located with Hadoop or NoSQL Co-located with Hadoop
Data management Self-service IT-driven
SQL ANSI SQL SQL-like
1.0 availability Q2 2015 Q2 2013 or earlier

Is Spark SQL similar to Drill?

No. Spark SQL is primarily designed to enable developers to incorporate SQL statements in Spark programs. Drill does not depend on Spark, and is targeted at business users, analysts, data scientists and developers.

Does Drill replace Hive?

Hive is a batch processing framework most suitable for long-running jobs. For data exploration and BI, Drill provides a much better experience than Hive.

In addition, Drill is not limited to Hadoop. For example, it can query NoSQL databases (eg, MongoDB, HBase) and cloud storage (eg, Amazon S3, Google Cloud Storage, Azure Blob Storage, Swift).

Metadata

How does Drill support queries on self-describing data?

Drill's flexible JSON data model and on-the-fly schema discovery enable it to query self-describing data.

But I already have schemas defined in Hive Metastore? Can I use that with Drill?

Absolutely. Drill has a storage plugin for Hive tables, so you can simply point Drill to the Hive Metastore and start performing low-latency queries on Hive tables. In fact, a single Drill cluster can query data from multiple Hive Metastores, and even perform joins across these datasets.

Is Drill "anti-schema" or "anti-DBA"?

Not at all. Drill actually takes advantage of schemas when available. For example, Drill leverages the schema information in Hive when querying Hive tables. However, when querying schema-free datastores like MongoDB, or raw files on S3 or Hadoop, schemas are not available, and Drill is still able to query that data.

Centralized schemas work well if the data structure is static, and the value of data is well understood and ready to be operationalized for regular reporting purposes. However, during data exploration, discovery and interactive analysis, requiring rigid modeling poses significant challenges. For example:

Drill is all about flexibility. The flexible schema management capabilities in Drill allow users to explore raw data and then create models/structure with CREATE TABLE or CREATE VIEW statements, or with Hive Metastore.

What does a Drill query look like?

Drill uses a decentralized metadata model and relies on its storage plugins to provide metadata. There is a storage plugin associated with each data source that is supported by Drill.

The name of the table in a query tells Drill where to get the data:

SELECT * FROM dfs1.root.`/my/log/files/`;
SELECT * FROM dfs2.root.`/home/john/log.json`;
SELECT * FROM mongodb1.website.users;
SELECT * FROM hive1.logs.frontend;
SELECT * FROM hbase1.events.clicks;

What SQL functionality does Drill support?

Drill supports standard SQL (aka ANSI SQL). In addition, it features several extensions that help with complex data, such as the KVGEN and FLATTEN functions. For more details, refer to the SQL Reference.

Do I need to load data into Drill to start querying it?

No. Drill can query data 'in-situ'.

Getting Started

What is the best way to get started with Drill?

The best way to get started is to try it out. It only takes a few minutes and all you need is a laptop (Mac, Windows or Linux). We've compiled several tutorials to help you get started.

How can I ask questions and provide feedback?

Please post your questions and feedback to user@drill.apache.org. We are happy to help!

How can I contribute to Drill?

The documentation has information on how to contribute.