Apache Drill Contribution Ideas

  • Fixing JIRAs
  • SQL functions
  • Support for new file format readers/writers
  • Support for new data sources
  • New query language parsers
  • Application interfaces
    • BI Tool testing
  • General CLI improvements
  • Ecosystem integrations
    • MapReduce
    • Hive views
    • YARN
    • Spark
    • Hue
    • Phoenix

Fixing JIRAs

This is a good place to begin if you are new to Drill. Feel free to pick issues from the Drill JIRA list. When you pick an issue, assign it to yourself, inform the team, and start fixing it.

For any questions, seek help from the team through the mailing list.

https://issues.apache.org/jira/browse/DRILL/?selectedTab=com.atlassian.jira.jira-projects-plugin:summary-panel

SQL functions

One of the next simple places to start is implementing a DrillFunc.
DrillFuncs are the way Drill expresses all scalar functions (UDF or system).
First, file a JIRA for a DrillFunc that Drill does not yet have but should (referencing the capabilities of something like Postgres
or SQL Server, or your own use case). Then try to implement it.

One example DrillFunc:
ComparisonFunctions.java


Additional ideas for functions that could be added to SQL support:

  • MADlib integration
  • Machine learning functions
  • Approximate aggregate functions (such as what is available in BlinkDB)
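
To make the approximate-aggregate idea concrete, here is a minimal sketch of a sampling-based SUM estimate. It is only an illustration of the general technique (aggregate over a uniform sample, then scale up); it is not how BlinkDB or Drill implements anything, and the function name `approximate_sum` and its parameters are hypothetical.

```python
import random


def approximate_sum(values, sample_fraction, seed=0):
    """Estimate sum(values) from a uniform random sample.

    `values` is a list of numbers; `sample_fraction` is the share of rows
    to read, in (0, 1]. Both names are hypothetical, chosen for this sketch.
    """
    if not 0 < sample_fraction <= 1:
        raise ValueError("sample_fraction must be in (0, 1]")
    if not values:
        return 0.0
    rng = random.Random(seed)  # fixed seed keeps the sketch repeatable
    sample_size = max(1, round(len(values) * sample_fraction))
    sample = rng.sample(values, sample_size)
    # Scale the sample mean up to the full population size.
    return len(values) * (sum(sample) / sample_size)
```

The trade-off is the usual one: a smaller `sample_fraction` reads less data but gives a noisier estimate, so a real function would also need to report an error bound.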

Support for new file format readers/writers

Currently Drill natively supports the text, JSON, and Parquet file formats when interacting with the file system. More readers/writers can be introduced by implementing custom storage plugins. Example formats include:

  • Sequence
  • RC
  • ORC
  • Protobuf
  • XML
  • Thrift

Support for new data sources

Writing a new file-based storage plugin, such as a JSON or text-based storage plugin, simply involves implementing a couple of interfaces. The JSON storage plugin is a good example.

You can refer to the GitHub commits for the MongoDB and HBase storage plugins for implementation details.

Focus on implementing/extending the classes listed below, using the corresponding MongoDB and HBase implementations as a guide. Ignore the MongoDB plugin's optimizer rules for pushing predicates into the scan.

Initially, concentrate on basics:

  • AbstractGroupScan (MongoGroupScan, HBaseGroupScan)
  • SubScan (MongoSubScan, HBaseSubScan)
  • RecordReader (MongoRecordReader, HBaseRecordReader)
  • BatchCreator (MongoScanBatchCreator, HBaseScanBatchCreator)
  • AbstractStoragePlugin (MongoStoragePlugin, HBaseStoragePlugin)
  • StoragePluginConfig (MongoStoragePluginConfig, HBaseStoragePluginConfig)
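
As a rough orientation, the roles of the classes above can be modelled in a few lines. This is a toy Python model, not Drill's Java API: every class, method, and field name below is a hypothetical stand-in that only mirrors the role of the corresponding Drill abstraction, and the "data source" is an in-memory tuple.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass(frozen=True)
class PluginConfig:
    """Role of StoragePluginConfig: serializable connection settings."""
    rows: tuple  # hypothetical in-memory stand-in for an external source


@dataclass(frozen=True)
class SubScan:
    """Role of SubScan: the share of the scan given to one fragment."""
    rows: tuple


class GroupScan:
    """Role of AbstractGroupScan: the whole scan, as seen by the planner."""

    def __init__(self, config: PluginConfig):
        self.config = config

    def split(self, width: int) -> List[SubScan]:
        # Round-robin the rows across `width` parallel fragments.
        rows = self.config.rows
        return [SubScan(rows[i::width]) for i in range(width)]


class RecordReader:
    """Role of RecordReader: turns one SubScan into batches of records."""

    def __init__(self, sub_scan: SubScan, batch_size: int = 2):
        self.sub_scan = sub_scan
        self.batch_size = batch_size

    def batches(self) -> Iterator[list]:
        rows = self.sub_scan.rows
        for start in range(0, len(rows), self.batch_size):
            yield list(rows[start:start + self.batch_size])


def create_readers(sub_scan: SubScan) -> List[RecordReader]:
    """Role of BatchCreator: builds the readers for a received SubScan."""
    return [RecordReader(sub_scan)]


class StoragePlugin:
    """Role of AbstractStoragePlugin: entry point created from the config."""

    def __init__(self, config: PluginConfig):
        self.config = config

    def get_group_scan(self) -> GroupScan:
        return GroupScan(self.config)
```

Reading the model top to bottom follows the suggested order of work: a config creates the plugin, the plugin hands the planner a group scan, the group scan is split into sub-scans, and each sub-scan is read in batches. The real interfaces carry much more (schema, value vectors, endpoint affinity), so treat this only as a map of responsibilities.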

Implement custom storage plugins for the following non-Hadoop data sources:

  • NoSQL databases (such as MongoDB, Cassandra, CouchDB, etc.)
  • Search engines (such as Solr, Lucidworks, Elasticsearch, etc.)
  • SQL databases (MySQL, PostgreSQL, etc.)
  • Generic JDBC/ODBC data sources
  • HTTP URL

New query language parsers

Drill exposes strongly typed JSON APIs for logical and physical plans. Drill provides a SQL parser today, but any language parser that can generate logical or physical plans can use Drill on the backend as a distributed, low-latency query execution engine, along with its support for self-describing and complex/multi-structured data.

  • Pig parser: Use Pig as the language to query data from Drill. Great for existing Pig users.
  • Hive parser: Use HiveQL as the language to query data from Drill. Great for existing Hive users.

Application interfaces

Drill currently provides JDBC/ODBC drivers for applications to connect with, along with a basic REST API and a C++ API. The following list provides a few possible application interface opportunities:

BI Tool testing

Drill provides JDBC/ODBC drivers to connect to BI tools. We need to make sure Drill works with all major BI tools. Quick sanity testing with your favorite BI tool is a good way to learn Drill and to uncover integration issues.
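
For a quick sanity check without a BI tool, the REST API mentioned above can also be exercised from a short script. The sketch below assumes the endpoint is `POST /query.json` on port 8047 with a JSON body containing `queryType` and `query`, as described in current Drill documentation; older builds may differ, and the helper names `build_query_request` and `run_query` are hypothetical.

```python
import json
import urllib.request


def build_query_request(sql, base_url="http://localhost:8047"):
    """Build (but do not send) a POST request for Drill's REST query endpoint.

    Assumption: the Drillbit web server listens on port 8047 and accepts
    {"queryType": "SQL", "query": ...} at /query.json.
    """
    payload = json.dumps({"queryType": "SQL", "query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/query.json",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_query(sql, base_url="http://localhost:8047", timeout=30):
    """Send the query and return the decoded JSON response."""
    request = build_query_request(sql, base_url)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

With a Drillbit running locally, something like ``run_query("SELECT * FROM cp.`employee.json` LIMIT 5")`` should return the result as JSON. Building the request separately from sending it keeps the payload easy to inspect when a query fails.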

General CLI improvements

Currently Drill uses SQLLine as the CLI. The goal of this effort is to improve the CLI experience by adding functionality such as executing statements from a file, writing results to a file, displaying version information, and so on.
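
One small piece of "execute statements from a file" is splitting a script into individual statements. The following is a minimal sketch of that step, assuming statements are separated by semicolons and that semicolons inside single quotes, double quotes, or backticks must be ignored. It does not handle SQL comments, and `split_statements` is a hypothetical helper, not part of SQLLine or Drill.

```python
def split_statements(script):
    """Split a SQL script on semicolons that appear outside quotes."""
    statements = []
    current = []
    quote = None  # the quote character we are currently inside, if any
    for ch in script:
        if quote:
            current.append(ch)
            if ch == quote:
                quote = None  # closing quote; a doubled quote simply reopens
        elif ch in ("'", '"', "`"):
            quote = ch
            current.append(ch)
        elif ch == ";":
            statement = "".join(current).strip()
            if statement:
                statements.append(statement)
            current = []
        else:
            current.append(ch)
    tail = "".join(current).strip()
    if tail:  # last statement without a trailing semicolon
        statements.append(tail)
    return statements
```

Each returned statement could then be submitted in order, with results appended to an output file, which covers two of the features listed above.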

Ecosystem integrations

MapReduce

Allow result sets from Drill queries to be used as input to Hadoop/MapReduce jobs.

Hive views

Query data from existing Hive views using Drill queries. Drill needs to parse the HiveQL view definitions and translate them appropriately (into Drill's SQL or logical/physical plans) to execute the requests.

YARN

https://issues.apache.org/jira/browse/DRILL-1170

Spark

Provide the ability to invoke Drill queries as part of Apache Spark programs. This lets Spark developers and users leverage the richness of Drill's query layer, both for data source access and as a low-latency execution engine.

Hue

Hue is a GUI for users to interact with various Hadoop ecosystem components (such as Hive, Oozie, Pig, HBase, and Impala). The goal of this project is to expose Drill as an application inside Hue so users can explore Drill metadata and run SQL queries.

Phoenix

Phoenix provides a low latency query layer on HBase for operational applications. The goal of this effort is to explore opportunities for integrating Phoenix with Drill.