The 40-year monopoly of the RDBMS is over. With the exponential growth of data in recent years, and the shift towards rapid application development, new data is increasingly being stored in non-relational datastores including Hadoop, NoSQL and cloud storage. Apache Drill enables analysts, business users, data scientists and developers to explore and analyze this data without sacrificing the flexibility and agility offered by these datastores. Drill processes the data in-situ without requiring users to define schemas or transform data.
Drill is an innovative distributed SQL engine designed to enable data exploration and analytics on non-relational datastores. Users can query the data using standard SQL and BI tools without having to create and manage schemas. Some of the key features are:
Drill is built from the ground up to achieve high throughput and low latency. The following capabilities help accomplish that:
Drill is primarily focused on non-relational datastores, including Hadoop, NoSQL and cloud storage. The following datastores are currently supported:
A new datastore can be added by developing a storage plugin. Drill’s unique schema-free JSON data model enables it to query non-relational datastores in-situ (many of these systems store complex or schema-free data).
Drill supports a variety of non-relational datastores in addition to Hadoop, and it takes a different approach from traditional SQL-on-Hadoop technologies like Hive and Impala. For example, users can directly query self-describing data (e.g., JSON, Parquet) without having to create and manage schemas.
The following table provides a more detailed comparison between Drill and traditional SQL-on-Hadoop technologies:
| |Drill|SQL-on-Hadoop (Hive, Impala, etc.)|
|---|---|---|
|Use case|Self-service, in-situ, SQL-based analytics|Data warehouse offload|
|Data sources|Hadoop, NoSQL, cloud storage (including multiple instances)|A single Hadoop cluster|
|Data model|Schema-free JSON (like MongoDB)|Relational|
|User experience|Point-and-query|Ingest data → define schemas → query|
|Deployment model|Standalone service or co-located with Hadoop or NoSQL|Co-located with Hadoop|
|1.0 availability|Q2 2015|Q2 2013 or earlier|
No. Spark SQL is primarily designed to enable developers to incorporate SQL statements in Spark programs. Drill does not depend on Spark, and is targeted at business users, analysts, data scientists and developers.
Hive is a batch processing framework most suitable for long-running jobs. For data exploration and BI, Drill provides a much better experience than Hive.
In addition, Drill is not limited to Hadoop. For example, it can query NoSQL databases (e.g., MongoDB, HBase) and cloud storage (e.g., Amazon S3, Google Cloud Storage, Azure Blob Storage, Swift).
Drill’s flexible JSON data model and on-the-fly schema discovery enable it to query self-describing data.
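As a rough illustration of querying self-describing data, the sketch below reads a JSON file directly, with no schema defined beforehand. The plugin name `dfs1`, the workspace `root`, the file path, and every field name are hypothetical and chosen only for this example.

```sql
-- Hypothetical plugin (dfs1), workspace (root), file path, and field names.
-- No CREATE TABLE or schema definition is issued beforehand.
SELECT t.name,
       t.address.city AS city   -- nested field reached through the table alias
FROM dfs1.root.`/data/customers.json` t
WHERE t.address.state = 'CA'
LIMIT 10;
```

The table alias (`t`) matters here: Drill resolves dotted references such as `t.address.city` as nested fields only when they start from an alias.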
Absolutely. Drill has a storage plugin for Hive tables, so you can simply point Drill to the Hive Metastore and start performing low-latency queries on Hive tables. In fact, a single Drill cluster can query data from multiple Hive Metastores, and even perform joins across these datasets.
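A join across two Hive Metastores might look like the following sketch. It assumes two Hive storage plugins registered under the hypothetical names `hive1` and `hive2`; the databases, tables, and columns are likewise invented for illustration.

```sql
-- hive1 and hive2 are assumed to be two Hive storage plugins,
-- each pointing at a different Hive Metastore (hypothetical names).
SELECT o.order_id,
       c.customer_name
FROM hive1.sales.orders o
JOIN hive2.crm.customers c
  ON o.customer_id = c.customer_id
LIMIT 20;
```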
Not at all. Drill actually takes advantage of schemas when available. For example, Drill leverages the schema information in Hive when querying Hive tables. However, when querying schema-free datastores like MongoDB, or raw files on S3 or Hadoop, schemas are not available, and Drill is still able to query that data.
Centralized schemas work well if the data structure is static, and the value of data is well understood and ready to be operationalized for regular reporting purposes. However, during data exploration, discovery and interactive analysis, requiring rigid modeling poses significant challenges. For example:
Drill is all about flexibility. The flexible schema management capabilities in Drill allow users to explore raw data and then create models or structure with CREATE TABLE or CREATE VIEW statements, or with the Hive Metastore.
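The explore-then-model workflow could be sketched as follows. This assumes a writable workspace named `dfs1.tmp` and a raw JSON file whose fields (`user_id`, `page.url`) are hypothetical; adjust the names to your own configuration.

```sql
-- Step 1: give the raw data a typed, named shape with a view.
-- dfs1.tmp is assumed to be a writable workspace (hypothetical name).
CREATE VIEW dfs1.tmp.clicks_v AS
SELECT CAST(t.user_id AS INT) AS user_id,
       t.page.url             AS url
FROM dfs1.root.`/data/clicks.json` t;

-- Step 2: optionally materialize the view as a table (CREATE TABLE ... AS).
CREATE TABLE dfs1.tmp.clicks AS
SELECT user_id, url
FROM dfs1.tmp.clicks_v;
```

A view adds structure without copying data, while the CREATE TABLE AS form writes a new copy into the workspace, which can be worthwhile once the shape of the data has settled.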
Drill uses a decentralized metadata model and relies on its storage plugins to provide metadata. There is a storage plugin associated with each data source that is supported by Drill.
The name of the table in a query tells Drill where to get the data:
SELECT * FROM dfs1.root.`/my/log/files/`;
SELECT * FROM dfs2.root.`/home/john/log.json`;
SELECT * FROM mongodb1.website.users;
SELECT * FROM hive1.logs.frontend;
SELECT * FROM hbase1.events.clicks;
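Because the plugin name is part of the table name, a single query can in principle reference more than one datastore. The sketch below reuses the `mongodb1` and `dfs1` names from the queries above; the `user_id` and `name` fields are assumptions, and real data may need explicit casts before the join keys compare cleanly.

```sql
-- Join a MongoDB collection with log files on a file system.
-- user_id and name are hypothetical fields, not taken from the source.
SELECT u.name,
       COUNT(*) AS hits
FROM mongodb1.website.users u
JOIN dfs1.root.`/my/log/files/` l
  ON u.user_id = l.user_id
GROUP BY u.name;
```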
Drill supports standard SQL (also known as ANSI SQL). In addition, it features several extensions that help with complex data, such as the FLATTEN function. For more details, refer to the SQL Reference.
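As a small illustration of FLATTEN, the query below turns each element of a repeated field into its own output row. The file path and the `order_id` and `items` fields are hypothetical.

```sql
-- Each order record is assumed to hold an array field named items.
-- FLATTEN emits one output row per array element.
SELECT t.order_id,
       FLATTEN(t.items) AS item
FROM dfs1.root.`/data/orders.json` t;
```

An order with three entries in `items` would therefore produce three rows, each carrying the same `order_id`.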
No. Drill can query data ‘in-situ’.
The best way to get started is to try it out. It only takes a few minutes and all you need is a laptop (Mac, Windows or Linux). We’ve compiled several tutorials to help you get started.
Please post your questions and feedback to firstname.lastname@example.org. We are happy to help!
The documentation has information on how to contribute.