Query Directory Functions

Jun 29, 2016

You can use the following query directory functions when querying multiple files or directories:

  • MAXDIR
  • MINDIR
  • IMAXDIR
  • IMINDIR

The query directory functions restrict a query to one of a number of subdirectories. For example, suppose you had time-series data in subdirectories named 2015, 2014, and 2013. You could use the MAXDIR function to get the latest data and MINDIR to get the earliest data.

In the case where the directory names contain alphabetic characters, the MAXDIR and MINDIR functions return the highest or lowest values, respectively in a case-sensitive string ordering. The IMAXDIR and IMINDIR functions return the corresponding values with case-insensitive ordering.

The query directory functions are recommended instead of the MAX or MIN aggregate functions to prevent Drill from scanning all data in directories.

Query Directory Function Syntax

The following syntax shows how to construct a SELECT statement that using the MAXDIR function:

   SELECT * FROM <plugin>.<workspace>.`<filename>`
   WHERE dir<n> = MAXDIR('<plugin>.<workspace>'[, '<filename>']);

Enclose query directory function arguments in single-quotation marks, not back ticks. The first argument to the function is the plugin and workspace names in dot notation. The second argument is the directory name. The directory name is an optional argument. The dir variable, dir0, dir1, and so on, refers to subdirectories in your workspace path, as explained in "Querying Directories".

Query Directory Function Example

This example creates a top-level directory called querylogs in the /tmp directory. Using the /tmp directory is convenient for example purposes because /tmp is predefined as a workspace in the default dfs storage plugin. In the querylogs directory, you create three subdirectories:

  • 2015
  • 2014
  • 2013

In each subdirectory, you create a CSV file that contains arbitrary log data. Issue MAXDIR and MINDIR queries to exercise the query directory functions:

SELECT * FROM dfs.tmp.`querylogs` WHERE dir0 = MAXDIR('dfs.tmp','querylogs');
+------------+------------+
|  columns   |    dir0    |
+------------+------------+
| ["2015","some arbitrary data"] | 2015       |
+------------+------------+
1 row selected (0.112 seconds)

SELECT * FROM dfs.tmp.`querylogs` WHERE dir0 = MINDIR('dfs.tmp','querylogs');
+------------+------------+
|  columns   |    dir0    |
+------------+------------+
| ["2013","even more data"] | 2013       |
+------------+------------+
1 row selected (0.119 seconds)

In this example, using any variable other than dir0 does not return results because the subdirectory level nesting is only one level deep.