Apache Drill Contribution Ideas
- Fixing JIRAs
- SQL functions
- Support for new file format readers/writers
- Support for new data sources
- New query language parsers
- Application interfaces
- BI Tool testing
- General CLI improvements
- Ecosystem integrations
- Spark
- Hue
- Phoenix
Fixing JIRAs
This is a good place to begin if you are new to Drill. Feel free to pick issues from the Drill JIRA list. When you pick an issue, assign it to yourself, inform the team, and start fixing it.
For any questions, seek help from the team through the mailing list. Kindly note that the mailing lists are text only, so please use attachments or pastebin services instead of inline images.
SQL functions
One of the next simple places to start is to implement a DrillFunc. DrillFuncs are the way Drill expresses all scalar functions (UDF or system). First, put together a JIRA for one of the DrillFuncs we don’t yet have but should (referencing the capabilities of something like Postgres or SQL Server, or your own use case). Then try to implement one.
One example DrillFunc: ComparisonFunctions.java
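The skeleton of a simple scalar DrillFunc looks roughly like the sketch below. The function name add_tax and its arguments are hypothetical; the annotations, holder classes, and the setup()/eval() methods are the standard DrillSimpleFunc machinery.

```java
import org.apache.drill.exec.expr.DrillSimpleFunc;
import org.apache.drill.exec.expr.annotations.FunctionTemplate;
import org.apache.drill.exec.expr.annotations.Output;
import org.apache.drill.exec.expr.annotations.Param;
import org.apache.drill.exec.expr.holders.Float8Holder;

// Hypothetical scalar UDF: add_tax(price, rate) returns price * (1 + rate).
@FunctionTemplate(
    name = "add_tax",
    scope = FunctionTemplate.FunctionScope.SIMPLE,
    nulls = FunctionTemplate.NullHandling.NULL_IF_NULL)  // NULL in -> NULL out
public class AddTaxFunction implements DrillSimpleFunc {

  @Param Float8Holder price;   // first argument (DOUBLE)
  @Param Float8Holder rate;    // second argument (DOUBLE)
  @Output Float8Holder out;    // return value (DOUBLE)

  public void setup() {
    // one-time initialization; nothing needed for this function
  }

  public void eval() {
    out.value = price.value * (1.0 + rate.value);
  }
}
```

To make the function callable (for example as SELECT add_tax(price, 0.08) FROM ...), the class and its sources are packaged into a jar and registered with Drill, either through drill-module.conf or the dynamic UDF feature.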
Additional ideas for functions that could be added to SQL support:
- Madlib integration
- Machine learning functions
- Approximate aggregate functions, such as those available in BlinkDB (see the sketch after this list)
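For the aggregate ideas above, Drill also supports custom aggregates via the DrillAggFunc interface. The sketch below is just a plain sum over DOUBLE values (the name my_sum_double is hypothetical); a real approximate aggregate would keep its sketch/summary data structure in @Workspace state instead of a single running total.

```java
import org.apache.drill.exec.expr.DrillAggFunc;
import org.apache.drill.exec.expr.annotations.FunctionTemplate;
import org.apache.drill.exec.expr.annotations.Output;
import org.apache.drill.exec.expr.annotations.Param;
import org.apache.drill.exec.expr.annotations.Workspace;
import org.apache.drill.exec.expr.holders.Float8Holder;

// Hypothetical aggregate UDF: my_sum_double(col) sums DOUBLE values per group.
@FunctionTemplate(
    name = "my_sum_double",
    scope = FunctionTemplate.FunctionScope.POINT_AGGREGATE,
    nulls = FunctionTemplate.NullHandling.INTERNAL)  // the function handles nulls itself
public class MySumDoubleFunction implements DrillAggFunc {

  @Param Float8Holder in;          // value from the current input row
  @Workspace Float8Holder total;   // running state carried across rows in a group
  @Output Float8Holder out;        // final aggregate emitted by output()

  public void setup() {
    total = new Float8Holder();
    total.value = 0.0;
  }

  public void add() {
    total.value += in.value;
  }

  public void output() {
    out.value = total.value;
  }

  public void reset() {
    total.value = 0.0;
  }
}
```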
Support for new file format readers/writers
Currently Drill natively supports the text, JSON, and Parquet file formats when interacting with the file system. More readers/writers can be introduced by implementing custom format plugins.
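A new format usually starts with a small Jackson-serializable configuration class that declares the options users can set under "formats" in the file system storage plugin configuration. The format name "csvx" and its options below are hypothetical; FormatPluginConfig is the actual extension point, and the reader/writer itself would then be built the way the existing text and JSON formats are (for example, by extending EasyFormatPlugin).

```java
import java.util.List;
import java.util.Objects;

import com.fasterxml.jackson.annotation.JsonTypeName;
import org.apache.drill.common.logical.FormatPluginConfig;

// Hypothetical "csvx" format: this class only declares the configurable options;
// the reader/writer is implemented separately and keyed off this config type.
@JsonTypeName("csvx")
public class CsvxFormatConfig implements FormatPluginConfig {

  public List<String> extensions;   // file extensions handled by this format, e.g. ["csvx"]
  public String delimiter = ",";    // example format-specific option

  @Override
  public int hashCode() {
    return Objects.hash(extensions, delimiter);
  }

  @Override
  public boolean equals(Object obj) {
    if (this == obj) {
      return true;
    }
    if (obj == null || getClass() != obj.getClass()) {
      return false;
    }
    CsvxFormatConfig other = (CsvxFormatConfig) obj;
    return Objects.equals(extensions, other.extensions)
        && Objects.equals(delimiter, other.delimiter);
  }
}
```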
Support for new data sources
Writing a new file-based storage plugin, such as a JSON or text-based storage plugin, simply involves implementing a couple of interfaces. The JSON storage plugin is a good example.
You can refer to the GitHub commits for the MongoDB and HBase storage plugins for implementation details.
Focus on implementing/extending the following classes, using the corresponding MongoDB and HBase implementations as references. Ignore the MongoDB plugin's optimizer rules for pushing predicates into the scan.
Initially, concentrate on the basics (a minimal configuration-class sketch follows the list below):
- AbstractGroupScan (MongoGroupScan, HBaseGroupScan)
- SubScan (MongoSubScan, HBaseSubScan)
- RecordReader (MongoRecordReader, HBaseRecordReader)
- BatchCreator (MongoScanBatchCreator, HBaseScanBatchCreator)
- AbstractStoragePlugin (MongoStoragePlugin, HBaseStoragePlugin)
- StoragePluginConfig (MongoStoragePluginConfig, HBaseStoragePluginConfig)
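The simplest of these classes is usually the StoragePluginConfig subclass, which just carries the connection settings entered on the Web UI's Storage page. A minimal sketch, assuming a hypothetical "couch" plugin with a single connection string (base-class details vary slightly between Drill versions):

```java
import java.util.Objects;

import com.fasterxml.jackson.annotation.JsonCreator;
import com.fasterxml.jackson.annotation.JsonProperty;
import com.fasterxml.jackson.annotation.JsonTypeName;
import org.apache.drill.common.logical.StoragePluginConfig;

// Hypothetical "couch" plugin configuration. Drill deserializes this class from
// the plugin's JSON configuration and passes it to the AbstractStoragePlugin
// implementation, which in turn builds the GroupScan/SubScan/RecordReader chain.
@JsonTypeName("couch")
public class CouchStoragePluginConfig extends StoragePluginConfig {

  private final String connection;   // e.g. "http://localhost:5984"

  @JsonCreator
  public CouchStoragePluginConfig(@JsonProperty("connection") String connection) {
    this.connection = connection;
  }

  public String getConnection() {
    return connection;
  }

  @Override
  public int hashCode() {
    return Objects.hash(connection);
  }

  @Override
  public boolean equals(Object obj) {
    if (this == obj) {
      return true;
    }
    if (obj == null || getClass() != obj.getClass()) {
      return false;
    }
    return Objects.equals(connection, ((CouchStoragePluginConfig) obj).connection);
  }
}
```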
Implement custom storage plugins for the following non-Hadoop data sources:
- NoSQL databases (such as MongoDB, Cassandra, CouchDB, etc.)
- Search engines (such as Solr, Lucidworks, Elasticsearch, etc.)
- SQL databases (MySQL, PostgreSQL, etc.)
- Generic JDBC/ODBC data sources
- HTTP URL
New query language parsers
Drill exposes strongly typed JSON APIs for logical and physical plans. Drill provides a SQL parser today, but any language parser that can generate logical/physical plans can use Drill as its backend: a distributed, low-latency query execution engine with support for self-describing data and complex/multi-structured data.
- Pig parser: Use Pig as the language to query data from Drill. Great for existing Pig users.
- Hive parser: Use HiveQL as the language to query data from Drill. Great for existing Hive users.
Application interfaces
Drill currently provides JDBC/ODBC drivers for applications to interact with, along with a basic REST API and a C++ API. The following list provides a few possible application interface opportunities (a minimal JDBC usage sketch follows the list):
- Expose Drill tables/views as REST APIs
- Language drivers for Drill (Python, etc.)
- ….
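For reference, this is roughly what an application talking to Drill through the existing JDBC driver looks like; the zk=local connection URL and the query against the bundled cp.`employee.json` sample data are illustrative and would change for a distributed cluster. New interfaces (REST endpoints, language drivers, and so on) would wrap this same query capability for other environments.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class DrillJdbcExample {
  public static void main(String[] args) throws Exception {
    // "zk=local" targets a local/embedded Drillbit; point this at a ZooKeeper
    // quorum (e.g. "jdbc:drill:zk=zk1:2181,zk2:2181/drill/drillbits1") for a cluster.
    try (Connection conn = DriverManager.getConnection("jdbc:drill:zk=local");
         Statement stmt = conn.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT full_name, salary FROM cp.`employee.json` LIMIT 3")) {
      while (rs.next()) {
        System.out.println(rs.getString("full_name") + " : " + rs.getDouble("salary"));
      }
    }
  }
}
```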
BI Tool testing
Drill provides JDBC/ODBC drivers to connect to BI tools. We need to make sure Drill works with all major BI tools. Doing quick sanity testing with your favorite BI tool is a good way to learn Drill and also to uncover integration issues along the way.
General CLI improvements
Currently Drill uses SQLLine as its CLI. The goal of this effort is to improve the CLI experience by adding functionality such as executing statements from a file, writing results to a file, displaying version information, and so on.
Ecosystem integrations
Spark
Provide the ability to invoke Drill queries as part of Apache Spark programs. This lets Spark developers/users leverage the richness of Drill's query layer for data source access and as a low-latency execution engine.
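There is no native Drill-Spark connector today; as a sketch of what such an integration could build on, a Spark job can already pull Drill query results in over JDBC. The URL, query, and class below are illustrative, and the Drill JDBC driver jar must be on the Spark classpath.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DrillFromSpark {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("drill-from-spark")
        .master("local[*]")          // local mode for experimentation
        .getOrCreate();

    // Read the result of a Drill query into a Spark DataFrame over JDBC.
    Dataset<Row> df = spark.read()
        .format("jdbc")
        .option("url", "jdbc:drill:zk=local")
        .option("driver", "org.apache.drill.jdbc.Driver")
        .option("query", "SELECT * FROM cp.`employee.json` LIMIT 10")
        .load();

    df.show();
    spark.stop();
  }
}
```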
Hue
Hue is a GUI for users to interact with various Hadoop ecosystem components (such as Hive, Oozie, Pig, HBase, Impala, …). The goal of this project is to expose Drill as an application inside Hue so users can explore Drill metadata and run SQL queries.
Phoenix (https://issues.apache.org/jira/browse/DRILL-7863)
Phoenix provides a low latency query layer on HBase for operational applications. The goal of this effort is to explore opportunities for integrating Phoenix with Drill.