Aggregate and Aggregate Statistical

Aggregate Functions

The following tables list the aggregate and aggregate statistical functions that you can use in Drill queries:

Function Argument Type Return Type
AVG(expression) SMALLINT, INTEGER, BIGINT, FLOAT, DOUBLE, DECIMAL, INTERVAL DECIMAL for any integer-type argument, DOUBLE for a floating-point argument, otherwise the same as the argument data type
COUNT(*) - BIGINT
COUNT([DISTINCT] expression) any BIGINT
MAX(expression) BINARY, DECIMAL, VARCHAR, DATE, TIME, or TIMESTAMP same as argument type
MIN(expression) BINARY, DECIMAL, VARCHAR, DATE, TIME, or TIMESTAMP same as argument type
SUM(expression) SMALLINT, INTEGER, BIGINT, FLOAT, DOUBLE, DECIMAL, INTERVAL BIGINT for SMALLINT or INTEGER arguments, DECIMAL for BIGINT arguments, DOUBLE for floating-point arguments, otherwise the same as the argument data type

* In this release, Drill disables the DECIMAL data type, including casting to DECIMAL and reading DECIMAL types from Parquet and Hive. You can enable the DECIMAL type, but this is not recommended.

AVG, COUNT, MIN, MAX, and SUM accept ALL and DISTINCT keywords. The default is ALL.

These examples of aggregate functions use the cp storage plugin to access a the employee.json file installed with Drill. By default, JSON reads numbers as double-precision floating point numbers. These examples assume that you are using the default option all_text_mode set to false.

AVG

Averages a column of all records in a data source. Averages a column of one or more groups of records. Which records to include in the calculation can be based on a condition.

AVG Syntax

SELECT AVG([ALL | DISTINCT] aggregate_expression)
FROM tables
WHERE conditions;

SELECT expression1, expression2, ... expression_n,
       AVG([ALL | DISTINCT] aggregate_expression)
FROM tables
WHERE conditions
GROUP BY expression1, expression2, ... expression_n;

Expressions listed within the AVG function and must be included in the GROUP BY clause.

AVG Examples

ALTER SESSION SET `store.json.all_text_mode` = false;
+-------+------------------------------------+
|  ok   |              summary               |
+-------+------------------------------------+
| true  | store.json.all_text_mode updated.  |
+-------+------------------------------------+
1 row selected (0.073 seconds)

Take a look at the salaries of employees having IDs 1139, 1140, and 1141. These are the salaries that subsequent examples will average and sum.

SELECT * FROM cp.`employee.json` WHERE employee_id IN (1139, 1140, 1141);
+--------------+------------------+-------------+------------+--------------+--------------------------+-----------+----------------+-------------+------------------------+-------------+----------------+------------------+-----------------+---------+-----------------------+
| employee_id  |    full_name     | first_name  | last_name  | position_id  |      position_title      | store_id  | department_id  | birth_date  |       hire_date        |   salary    | supervisor_id  | education_level  | marital_status  | gender  |    management_role    |
+--------------+------------------+-------------+------------+--------------+--------------------------+-----------+----------------+-------------+------------------------+-------------+----------------+------------------+-----------------+---------+-----------------------+
| 1139         | Jeanette Belsey  | Jeanette    | Belsey     | 12           | Store Assistant Manager  | 18        | 11             | 1972-05-12  | 1998-01-01 00:00:00.0  | 10000.0000  | 17             | Graduate Degree  | S               | M       | Store Management      |
| 1140         | Mona Jaramillo   | Mona        | Jaramillo  | 13           | Store Shift Supervisor   | 18        | 11             | 1961-09-24  | 1998-01-01 00:00:00.0  | 8900.0000   | 1139           | Partial College  | S               | M       | Store Management      |
| 1141         | James Compagno   | James       | Compagno   | 15           | Store Permanent Checker  | 18        | 15             | 1914-02-02  | 1998-01-01 00:00:00.0  | 6400.0000   | 1139           | Graduate Degree  | S               | M       | Store Full Time Staf  |
+--------------+------------------+-------------+------------+--------------+--------------------------+-----------+----------------+-------------+------------------------+-------------+----------------+------------------+-----------------+---------+-----------------------+
3 rows selected (0.284 seconds)
SELECT AVG(salary) FROM cp.`employee.json` WHERE employee_id IN (1139, 1140, 1141);
+--------------------+
|       EXPR$0       |
+--------------------+
| 8433.333333333334  |
+--------------------+
1 row selected (0.208 seconds)

SELECT AVG(ALL salary) FROM cp.`employee.json` WHERE employee_id IN (1139, 1140, 1141);
+--------------------+
|       EXPR$0       |
+--------------------+
| 8433.333333333334  |
+--------------------+
1 row selected (0.17 seconds)

SELECT AVG(DISTINCT salary) FROM cp.`employee.json`;
+---------------------+
|       EXPR$0        |
+---------------------+
| 12773.333333333334  |
+---------------------+
1 row selected (0.384 seconds)
SELECT education_level, AVG(salary) FROM cp.`employee.json` GROUP BY education_level;
+----------------------+---------------------+
|   education_level    |       EXPR$1        |
+----------------------+---------------------+
| Graduate Degree      | 4392.823529411765   |
| Bachelors Degree     | 4492.404181184669   |
| Partial College      | 4047.1180555555557  |
| High School Degree   | 3516.1565836298932  |
| Partial High School  | 3511.0852713178297  |
+----------------------+---------------------+
5 rows selected (0.495 seconds)

COUNT

Returns the number of rows that match the given criteria.

COUNT Syntax

SELECT COUNT([DISTINCT | ALL] column) FROM . . .
SELECT COUNT(*) FROM . . .

  • column
    Returns the number of values of the specified column.
  • DISTINCT column
    Returns the number of distinct values in the column.
  • ALL column
    Returns the number of values of the specified column.
  • * (asterisk) Returns the number of records in the table.

COUNT Examples

SELECT COUNT(DISTINCT salary) FROM cp.`employee.json`;
+---------+
| EXPR$0  |
+---------+
| 48      |
+---------+
1 row selected (0.159 seconds)

SELECT COUNT(ALL salary) FROM cp.`employee.json`;
+---------+
| EXPR$0  |
+---------+
| 1155    |
+---------+
1 row selected (0.106 seconds)

SELECT COUNT(salary) FROM cp.`employee.json`;
+---------+
| EXPR$0  |
+---------+
| 1155    |
+---------+
1 row selected (0.102 seconds)

SELECT COUNT(*) FROM cp.`employee.json`;
+---------+
| EXPR$0  |
+---------+
| 1155    |
+---------+
1 row selected (0.174 seconds)

MIN and MAX Functions

These functions return the smallest and largest values of the selected columns, respectively.

MIN and MAX Syntax

MIN(column)
MAX(column)

MIN and MAX Examples

SELECT MIN(salary) FROM cp.`employee.json`;
+---------+
| EXPR$0  |
+---------+
| 20.0    |
+---------+
1 row selected (0.138 seconds)

SELECT MAX(salary) FROM cp.`employee.json`;
+----------+
|  EXPR$0  |
+----------+
| 80000.0  |
+----------+
1 row selected (0.139 seconds)

Use a correlated subquery to find the names and salaries of the lowest paid employees:

SELECT full_name, SALARY FROM cp.`employee.json` WHERE salary = (SELECT MIN(salary) FROM cp.`employee.json`);
+------------------------+---------+
|       full_name        | SALARY  |
+------------------------+---------+
| Leopoldo Renfro        | 20.0    |
| Donna Brockett         | 20.0    |
| Laurie Anderson        | 20.0    |
. . .

SUM Function

Returns the total of a numeric column.

SUM syntax

SUM(column)

Examples

SELECT SUM(ALL salary) FROM cp.`employee.json`;
+------------+
|   EXPR$0   |
+------------+
| 4642640.0  |
+------------+
1 row selected (0.123 seconds)

SELECT SUM(DISTINCT salary) FROM cp.`employee.json`;
+-----------+
|  EXPR$0   |
+-----------+
| 613120.0  |
+-----------+
1 row selected (0.309 seconds)

SELECT SUM(salary) FROM cp.`employee.json` WHERE employee_id IN (1139, 1140, 1141);
+----------+
|  EXPR$0  |
+----------+
| 25300.0  |
+----------+
1 row selected (1.995 seconds)

Aggregate Statistical Functions

Drill provides following aggregate statistics functions:

  • stddev(expression)
    An alias for stddev_samp
  • stddev_pop(expression) Population standard deviate of input values
  • stddev_samp(expression) Sample standard deviate of input values
  • variance(expression) An alias for var_samp
  • var_pop(expression) Population variance of input values (the population standard deviated squared)
  • var_samp(expression) Sample variance of input values (sample standard deviation squared)

These functions take a SMALLINT, INTEGER, BIGINT, FLOAT, DOUBLE, or DECIMAL expression as the argument. If the expression is FLOAT, the function returns DOUBLE; otherwise, the function returns DECIMAL. As previously mentioned, DECIMAL is disabled. You can enable the DECIMAL type, but this is not recommended.