Aggregate and Aggregate Statistical

Aggregate Functions

The following table lists the aggregate functions that you can use in Drill queries.

Function Argument Type Return Type
ANY_VALUE(x) BIT, INT, BIGINT, FLOAT4, FLOAT8, DATE, TIMESTAMP, TIME, VARCHAR, VARBINARY, LIST, MAP, INTERVAL, INTERVALDAY, INTERVALYEAR, VARDECIMAL Same as argument type
AVG(x) SMALLINT, INTEGER, BIGINT, FLOAT, DOUBLE, DECIMAL, INTERVAL DECIMAL for DECIMAL argument, DOUBLE for all other arguments
BOOL_AND(x), BOOL_OR(x) BIT BIT
BIT_AND(x), BIT_OR(x), BIT_XOR(x) INT, BIGINT Same as argument type
CORR(x,y) Numeric DOUBLE
COVAR_POP(x,y) Numeric DOUBLE
COVAR_SAMP(x,y) Numeric DOUBLE
COUNT(*) - BIGINT
COUNT([DISTINCT] expression) any BIGINT
KENDALL_CORRELATION(x, y) Numeric DOUBLE
MAX(expression), MIN(expression) BINARY, DECIMAL, VARCHAR, DATE, TIME, or TIMESTAMP Same as argument type
REGR_INTERCEPT(x, y) Numeric DOUBLE
REGR_SLOPE(x, y) Numeric DOUBLE
STDDEV, STDDEV_POP, STDDEV_SAMP Numeric DOUBLE
SUM(expression) SMALLINT, INTEGER, BIGINT, FLOAT, DOUBLE, DECIMAL, INTERVAL DECIMAL for DECIMAL argument, BIGINT for any integer-type argument (including BIGINT), DOUBLE for floating-point arguments
VARIANCE, VAR_POP, VAR_SAMP Numeric Numeric
  • Drill 1.14 and later supports the ANY_VALUE function.
  • Starting in Drill 1.14, the DECIMAL data type is enabled by default.
  • AVG, COUNT, MIN, MAX, and SUM accept ALL and DISTINCT keywords. The default is ALL.
  • The aggregate function examples use the cp storage plugin to access the employee.json file installed with Drill. By default, JSON reads numbers as double-precision floating point numbers. These examples assume that you are using the default option all_text_mode set to false.

ANY_VALUE

Introdcued in release: 1.14

Returns one of the values of value across all input values. This function is NOT specified in the SQL standard.

ANY_VALUE Syntax

ANY_VALUE([ ALL | DISTINCT ] value)

ANY_VALUE Examples

SELECT ANY_VALUE(employee_id) AS anyemp FROM cp.`employee.json`;
|--------|
| anyemp |
|--------|
| 1156   |
|--------|

SELECT ANY_VALUE(ALL employee_id) AS anyemp FROM cp.`employee.json`;
|--------|
| anyemp |
|--------|
| 1156   |
|--------|

SELECT ANY_VALUE(DISTINCT employee_id) AS anyemp FROM cp.`employee.json`;
|--------|
| anyemp |
|--------|
| 1156   |
|--------|

SELECT ANY_VALUE(employee_id) as anyemp, salary as empsal FROM cp.`employee.json` GROUP BY salary;
|--------|---------|
| anyemp | empsal  |
|--------|---------|
| 1155   | 20.0    |
| 197    | 3700.0  |
| 1115   | 4200.0  |
| 589    | 4300.0  |
| 403    | 4400.0  |
| 204    | 4500.0  |
...

SELECT ANY_VALUE(employee_id) as anyemp FROM cp.`employee.json` GROUP BY salary ORDER BY anyemp;
|--------|
| anyemp |
|--------|
| 1      |
| 4      |
| 6      |
| 8      |
| 10     |
| 13     |
...

Filtered Aggregates

Introduced in release: 1.21

Starting in Drill 1.21 it is possible to follow an aggregate function invocation with a boolean expression that will filter the values procesed by the aggregate using the following syntax.

agg_func( column ) FILTER(WHERE boolean_expression)

For example

SELECT
  count(n_name) FILTER(WHERE n_regionkey = 1) AS nations_count_in_1_region,
  count(n_name) FILTER(WHERE n_regionkey = 2) AS nations_count_in_2_region,
  count(n_name) FILTER(WHERE n_regionkey = 3) AS nations_count_in_3_region,
  count(n_name) FILTER(WHERE n_regionkey = 4) AS nations_count_in_4_region,
  count(n_name) FILTER(WHERE n_regionkey = 0) AS nations_count_in_0_region
FROM cp.`tpch/nation.parquet`

will return

+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+
| nations_count_in_1_region | nations_count_in_2_region | nations_count_in_3_region | nations_count_in_4_region | nations_count_in_0_region |
+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+
| 5                         | 5                         | 5                         | 5                         | 5                         |
+---------------------------+---------------------------+---------------------------+---------------------------+---------------------------+

Note

Some versions of Drill prior to 1.21 do not fail if FILTER expressions are included with aggregate function calls, but silently do no filtering yielding incorrect results. Filtered aggregates are only supported from version 1.21 onward.

AVG

Returns the average of a numerical expression.

AVG Syntax

AVG([ALL | DISTINCT] expression)

AVG Examples

ALTER SESSION SET `store.json.all_text_mode` = false;
|------|-----------------------------------|
| ok   | summary                           |
|------|-----------------------------------|
| true | store.json.all_text_mode updated. |
|------|-----------------------------------|
1 row selected (0.073 seconds)

Take a look at the salaries of employees having IDs 1139, 1140, and 1141. These are the salaries that subsequent examples will average and sum.

SELECT * FROM cp.`employee.json` WHERE employee_id IN (1139, 1140, 1141);
|-------------|-----------------|------------|-----------|-------------|-------------------------|----------|---------------|------------|-----------------------|------------|---------------|-----------------|----------------|--------|----------------------|
| employee_id | full_name       | first_name | last_name | position_id | position_title          | store_id | department_id | birth_date | hire_date             | salary     | supervisor_id | education_level | marital_status | gender | management_role      |
|-------------|-----------------|------------|-----------|-------------|-------------------------|----------|---------------|------------|-----------------------|------------|---------------|-----------------|----------------|--------|----------------------|
| 1139        | Jeanette Belsey | Jeanette   | Belsey    | 12          | Store Assistant Manager | 18       | 11            | 1972-05-12 | 1998-01-01 00:00:00.0 | 10000.0000 | 17            | Graduate Degree | S              | M      | Store Management     |
| 1140        | Mona Jaramillo  | Mona       | Jaramillo | 13          | Store Shift Supervisor  | 18       | 11            | 1961-09-24 | 1998-01-01 00:00:00.0 | 8900.0000  | 1139          | Partial College | S              | M      | Store Management     |
| 1141        | James Compagno  | James      | Compagno  | 15          | Store Permanent Checker | 18       | 15            | 1914-02-02 | 1998-01-01 00:00:00.0 | 6400.0000  | 1139          | Graduate Degree | S              | M      | Store Full Time Staf |
|-------------|-----------------|------------|-----------|-------------|-------------------------|----------|---------------|------------|-----------------------|------------|---------------|-----------------|----------------|--------|----------------------|
3 rows selected (0.284 seconds)

SELECT AVG(salary) FROM cp.`employee.json` WHERE employee_id IN (1139, 1140, 1141);
|-------------------|
| EXPR$0            |
|-------------------|
| 8433.333333333334 |
|-------------------|
1 row selected (0.208 seconds)

SELECT AVG(ALL salary) FROM cp.`employee.json` WHERE employee_id IN (1139, 1140, 1141);
|-------------------|
| EXPR$0            |
|-------------------|
| 8433.333333333334 |
|-------------------|
1 row selected (0.17 seconds)

SELECT AVG(DISTINCT salary) FROM cp.`employee.json`;
|--------------------|
| EXPR$0             |
|--------------------|
| 12773.333333333334 |
|--------------------|
1 row selected (0.384 seconds)

SELECT education_level, AVG(salary) FROM cp.`employee.json` GROUP BY education_level;
|---------------------|--------------------|
| education_level     | EXPR$1             |
|---------------------|--------------------|
| Graduate Degree     | 4392.823529411765  |
| Bachelors Degree    | 4492.404181184669  |
| Partial College     | 4047.1180555555557 |
| High School Degree  | 3516.1565836298932 |
| Partial High School | 3511.0852713178297 |
|---------------------|--------------------|
5 rows selected (0.495 seconds)

BOOL_AND and BOOL_OR

Returns the result of a logical AND (resp. OR) over the specified expression.

BOOL_AND and BOOL_OR Syntax

BOOL_AND(expression)
BOOL_OR(expression)

BOOL_AND and BOOL_OR Examples

SELECT BOOL_AND(last_name = 'Spence') FROM cp.`employee.json`;
|--------|
| EXPR$0 |
|--------|
| false  |
|--------|

SELECT BOOL_OR(last_name = 'Spence') FROM cp.`employee.json`;
|--------|
| EXPR$0 |
|--------|
| true   |
|--------|

BOOL_AND and BOOL_OR Usage Notes

  1. EVERY is nearly an alias for BOOL_AND but returns a TINYINT rather than a BIT.

BIT_AND, BIT_OR and BIT_XOR

Returns the result of a bitwise AND (resp. OR, XOR) over the specified expression.

BIT_AND, BIT_OR and BIT_XOR Syntax

BIT_AND(expression)
BIT_OR(expression)
BIT_XOR(expression)

BIT_AND, BIT_OR, BIT_XOR Examples

SELECT BIT_AND(position_id) FROM cp.`employee.json`;
|--------|
| EXPR$0 |
|--------|
| 0      |
|--------|

SELECT BIT_OR(position_id) FROM cp.`employee.json`;
|--------|
| EXPR$0 |
|--------|
| 31     |
|--------|

SELECT BIT_XOR(position_id) FROM cp.`employee.json`;
|--------|
| EXPR$0 |
|--------|
| 4      |
|--------|

CORR

Returns the Pearson Correlation Coefficient for a given x, y.

SELECT CORR (department_id, salary) as correlation FROM cp.`employee.json`;
+---------------------+
|     correlation     |
+---------------------+
| -0.6699481585713246 |
+---------------------+

COVAR_POP

Returns the population covariance for a data set.

SELECT covar_pop (department_id, salary) AS covariance FROM cp.`employee.json`;
+--------------------+
|     covariance     |
+--------------------+
| -9988.315451359776 |
+--------------------+

COVAR_SAMP

Returns the sample covariance for a data set.

select covar_samp (department_id, salary) as covariance from cp.`employee.json`;
+--------------------+
|     covariance     |
+--------------------+
| -9996.970837366154 |
+--------------------+

COUNT

Returns the number of rows that match the given criteria.

COUNT Syntax

SELECT COUNT([ALL | DISTINCT] expression) FROM . . .
SELECT COUNT(*) FROM . . .
  • expression
    Returns the number of values of the specified expression.
  • DISTINCT expression
    Returns the number of distinct values in the expression.
  • ALL expression
    Returns the number of values of the specified expression.
  • * (asterisk)
    Returns the number of records in the table.

COUNT Examples

SELECT COUNT(DISTINCT salary) FROM cp.`employee.json`;
|--------|
| EXPR$0 |
|--------|
| 48     |
|--------|
1 row selected (0.159 seconds)

SELECT COUNT(ALL salary) FROM cp.`employee.json`;
|--------|
| EXPR$0 |
|--------|
| 1155   |
|--------|
1 row selected (0.106 seconds)

SELECT COUNT(salary) FROM cp.`employee.json`;
|--------|
| EXPR$0 |
|--------|
| 1155   |
|--------|
1 row selected (0.102 seconds)

SELECT COUNT(*) FROM cp.`employee.json`;
|--------|
| EXPR$0 |
|--------|
| 1155   |
|--------|
1 row selected (0.174 seconds)

KENDALL_CORRELATION

Returns the Kendall correlation coefficient.

KENDALL_CORRELATION Syntax

KENDALL_CORRELATION( expression1, expression2 )

KENDALL_CORRELATION Examples

with seq as (
  select row_number() over (order by 1) x from cp.`employee.json` limit 10
)
select kendall_correlation(x+random(), x+random()+5) from seq;
EXPR$0  0.2

1 row selected (0.213 seconds)

MIN and MAX

These functions return the smallest and largest values of the selected expressions, respectively.

MIN and MAX Syntax

MIN(expression)
MAX(expression)

MIN and MAX Examples

SELECT MIN(salary) FROM cp.`employee.json`;
|--------|
| EXPR$0 |
|--------|
| 20.0   |
|--------|
1 row selected (0.138 seconds)

SELECT MAX(salary) FROM cp.`employee.json`;
|---------|
| EXPR$0  |
|---------|
| 80000.0 |
|---------|
1 row selected (0.139 seconds)

Use a correlated subquery to find the names and salaries of the lowest paid employees:

SELECT full_name, SALARY FROM cp.`employee.json` WHERE salary = (SELECT MIN(salary) FROM cp.`employee.json`);
|-----------------|--------|
| full_name       | SALARY |
|-----------------|--------|
| Leopoldo Renfro | 20.0   |
| Donna Brockett  | 20.0   |
| Laurie Anderson | 20.0   |
. . .

SUM

Returns the sum of a numerical expresion.

SUM syntax

SUM([DISTINCT | ALL] expression)

Examples

SELECT SUM(ALL salary) FROM cp.`employee.json`;
|-----------|
| EXPR$0    |
|-----------|
| 4642640.0 |
|-----------|
1 row selected (0.123 seconds)

SELECT SUM(DISTINCT salary) FROM cp.`employee.json`;
|----------|
| EXPR$0   |
|----------|
| 613120.0 |
|----------|
1 row selected (0.309 seconds)

SELECT SUM(salary) FROM cp.`employee.json` WHERE employee_id IN (1139, 1140, 1141);
|---------|
| EXPR$0  |
|---------|
| 25300.0 |
|---------|
1 row selected (1.995 seconds)

Aggregate Statistical Functions

The following table lists the aggregate statistical functions that you can use in Drill queries.

Function Argument Type Return Type
APPROX_COUNT_DUPS(expression) any BIGINT
STDDEV(expression) SMALLINT, INTEGER, BIGINT, FLOAT, DOUBLE, DECIMAL DECIMAL for DECIMAL arguments, otherwise DOUBLE
STDDEV_POP(expression) SMALLINT, INTEGER, BIGINT, FLOAT, DOUBLE, DECIMAL DECIMAL for DECIMAL arguments, otherwise DOUBLE
VARIANCE(expression) SMALLINT, INTEGER, BIGINT, FLOAT, DOUBLE, DECIMAL DECIMAL for DECIMAL arguments, otherwise DOUBLE
VAR_POP(expression) SMALLINT, INTEGER, BIGINT, FLOAT, DOUBLE, DECIMAL DECIMAL for DECIMAL arguments, otherwise DOUBLE

APPROX_COUNT_DUPS

Returns an approximate count of the values that are duplicates (not unique).

APPROX_COUNT_DUPS Syntax

APPROX_COUNT_DUPS( expression )

APPROX_COUNT_DUPS Examples

select
  COUNT(*),
  APPROX_COUNT_DUPS(e1.employee_id),
  APPROX_COUNT_DUPS(e1.gender)
FROM cp.`employee.json` e1

|--------|--------|--------|
| EXPR$0 | EXPR$1 | EXPR$2 |
| ------ | ------ | ------ |
| 1155   | 0      | 1153   |
|--------|--------|--------|

Use COUNT - APPROX_COUNT_DUPS to approximate a distinct count.

select
  COUNT(*),
  COUNT(salary) - APPROX_COUNT_DUPS(salary),
  COUNT(distinct salary)
from cp.`employee.json`;

|--------|--------|--------|
| EXPR$0 | EXPR$1 | EXPR$2 |
|--------|--------|--------|
| 1155   | 48     | 48     |
|--------|--------|--------|

APPROX_COUNT_DUPS Usage Notes

The underlying Bloom filter is a probabilistic data structure that may return a false positive when an element is tested for duplication. Consequently, the approximate count returned overestimates the true duplicate count. In return for this inaccuracy, Bloom filters are highly space- and time-efficient at large scales with the specifics determined by the parameters of the filter (see below).

Configuration options

Note

The APPROX_COUNT_DUPS function is used internally by Drill when it computes table statistics. As a result, setting configuration options that affect it in the global configuration scope will affect the computation of table statistics accordingly.

  • exec.statistics.ndv_extrapolation_bf_elements
  • exec.statistics.ndv_extrapolation_bf_fpprobability

REGR_INTERCEPT

Returns the intercept of the least squares linear regression fit.

REGR_INTERCEPT Syntax

REGR_INTERCEPT( expression1, expression2 )

REGR_INTERCEPT Examples

with seq as (
  select row_number() over (order by 1) x from cp.`employee.json` limit 10
)
select regr_intercept(x+random(), x+random()+5) from seq;
EXPR$0  4.9715020855109255

1 row selected (0.221 seconds)

REGR_SLOPE

Returns the slope of the least squares linear regression fit.

REGR_INTERCEPT Syntax

REGR_INTERCEPT( expression1, expression2 )

REGR_INTERCEPT Examples

with seq as (
  select row_number() over (order by 1) x from cp.`employee.json` limit 10
)
select regr_slope(x+random(), x+random()+5) from seq;
EXPR$0  0.9719696129009783

1 row selected (0.283 seconds)

STDDEV

Returns the sample standard deviation.

STDDEV Syntax

STDDEV(expression)

STDDEV Examples

SELECT STDDEV(salary) from cp.`employee.json`;

|-------------------|
| EXPR$0            |
|-------------------|
| 5371.847873988941 |
|-------------------|

STDDEV Usage Notes

  1. Aliases: STDDEV_SAMP

STDDEV_POP

Returns the estimate of the population standard deviation obtained by applying Bessel’s correction to the sample standard deviation.

STDDEV_POP Syntax

STDDEV_POP(expression)

STDDEV_POP Examples

SELECT STDDEV_POP(salary) from cp.`employee.json`;

|-------------------|
| EXPR$0            |
|-------------------|
| 5369.521895151171 |
|-------------------|

VARIANCE

Returns the sample variance.

VARIANCE Syntax

VARIANCE(expression)

VARIANCE Examples

SELECT VARIANCE(salary) from cp.`employee.json`;

|--------------------|
| EXPR$0             |
|--------------------|
| 28856749.581279505 |
|--------------------|

VARIANCE Usage Notes

  1. Aliases: VAR_SAMP

VAR_POP

Returns the estimate of the population variance obtained by applying Bessel’s correction to the sample variance.

VAR_POP Syntax

VAR_POP(expression)

VAR_POP Examples

SELECT VAR_POP(salary) from cp.`employee.json`;

|--------------------|
| EXPR$0             |
|--------------------|
| 28831765.382507823 |
|--------------------|