Supported Data Types

Drill reads from and writes to data sources having a wide variety of types.

SQL Data Type Description Example
BIGINT 8-byte signed integer in the range -9,223,372,036,854,775,808 to 9,223,372,036,854,775,807 9223372036854775807
BINARY Variable-length byte string B@e6d9eb7
BOOLEAN True or false true
DATE Years, months, and days in YYYY-MM-DD format since 4713 BC 2015-12-30
DECIMAL(p,s), or DEC(p,s), NUMERIC(p,s)1 38-digit precision number, precision is p, and scale is s DECIMAL(6,2) is 1234.56, 4 digits before and 2 digits after the decimal point
FLOAT 4-byte floating point number 0.456
DOUBLE, DOUBLE PRECISION 8-byte floating point number, precision-scalable 0.456
INTEGER or INT 4-byte signed integer in the range -2,147,483,648 to 2,147,483,647 2147483646
INTERVAL2 A day-time or year-month interval ‘1 10:20:30.123’ (day-time) or ‘1-2’ year to month (year-month)
SMALLINT3 2-byte signed integer in the range -32,768 to 32,767 32000
TIME 24-hour based time before or after January 1, 2001 in hours, minutes, seconds format: HH:mm:ss 22:55:55.23
TIMESTAMP JDBC timestamp in year, month, date hour, minute, second, and optional milliseconds format: yyyy-MM-dd HH:mm:ss.SSS 2015-12-30 22:55:55.23
CHARACTER VARYING, CHARACTER, CHAR4 or VARCHAR UTF8-encoded variable-length string. The default limit is 1 character. The maximum character limit is 2,147,483,647. CHAR(30) casts data to a 30-character string maximum.

DECIMAL Data Type

Starting in Drill 1.14, DECIMAL data type support is enabled by default. Drill uses the vardecimal data type to store decimal and numeric data types in a compressed format that optimizes storage space. The vardecimal data type stores decimal and numeric values as variable length columns that can represent any decimal precision.

In Drill, the SQL DECIMAL and NUMERIC data types map to the java.math.BigDecimal Java data type.

Syntax

The DECIMAL data type accepts numeric values, for which you can define a precision and a scale, as shown:

   column_name DECIMAL[(precision[,scale])]

Precision is an integer that indicates how many digits the number will contain. Scale is an integer that indicates the number of digits to the right of the decimal point. For example, the number 2325.67 has a precision of 6 and scale of 2. In a query, DECIMAL (6,2).

Scale cannot be greater than the precision. If you do not indicate a precision or scale, the default precision is 38 and scale is 0.

Decimal Data Type Storage Format Support

Drill can read decimal data types from and write decimal data types to the following storage format:

  • parquet

Drill can read decimal data types from the following storage formats, but cannot write decimal data types to them:

  • hive
  • jdbc
  • avro

Decimal Type Options

You can use the SET command with the planner.enable_decimal_data_type option to turn DECIMAL data type support off and on. If you disable the planner.enable_decimal_data_type option (by setting it to false), Drill treats decimal literals as DOUBLE.

Drill 1.14 introduces the following additional decimal-related options for the Parquet storage format that you can modify using the SET command:

  • store.parquet.writer.use_primitive_types_for_decimals (boolean) Allows Drill to use INT32 and INT64 logical types when storing decimal values in Parquet if the precision of the value allows it. Default is true. Set to false to prevent Drill from using INT32 and INT64 logical types, as shown:

      SET `store.parquet.writer.use_primitive_types_for_decimals` = false;
    
  • store.parquet.writer.logical_type_for_decimals Indicates the logical type that Drill should use to store decimals. The default is fixed_len_byte_array. Alternatively, you can set this option to binary, as shown:

      SET `store.parquet.writer.logical_type_for_decimals` = 'binary';
    

Composite Types

Drill supports the following composite types:

  • Map
  • Array

A map is a set of name/value pairs. A value in a map can be a scalar type, such as string or int, or a complex type, such as an array or another map. An array is a repeated list of values. A value in an array can be a scalar type, such as string or int, or an array can be a complex type, such as a map or another array.

Drill uses map and array data types internally for reading complex and nested data structures from data sources. In this release of Drill, you cannot reference a composite type by name in a query, but Drill supports array values coming from data sources. For example, you can use the index syntax to query data and get the value of an array element:

a[1]

You can refer to the value for a key in a map using dot notation:

t.m.k

The section “Query Complex Data” shows how to use composite types to access nested arrays. “Handling Different Data Types” includes examples of JSON maps and arrays. Drill provides functions for handling array and map types:

ANY Type

The ANY type is a key technological advance in Drill that enables it to address late typing problems. Drill uses the ANY type internally and you might see references to ANY in the output of the DESCRIBE or other commands. You cannot cast a value to the ANY type in this release.

Using the ANY type, the parser postpones the problem of resolving the type of some value until the query is actually running. At that point, Drill has an empirical schema available for each record batch to use for final code generation and optimization. If the empirical schema changes due to changes in the data processing, Drill regenerates the code as necessary.

Casting and Converting Data Types

In Drill, you cast or convert data to the required type for moving data from one data source to another. You do not assign a data type to every column name in a CREATE TABLE statement to define the table as you do in database software. Instead, you use the CREATE TABLE AS (CTAS) statement with one or more of the following functions to define the table:

In some cases, Drill converts schema-less data to correctly-typed data implicitly. In this case, you do not need to cast the data. The file format of the data and the nature of your query determines the requirement for casting or converting. Differences in casting depend on the data source. The following list describes how Drill treats data types from various data sources:

  • HBase Does not implicitly cast input to SQL types. Convert data to appropriate types as as described in the section “Querying HBase”. Use CONVERT_TO or CONVERT_FROM data types.
  • Hive Implicitly casts Hive types to SQL types as shown in the Hive type mapping example
  • JSON Implicitly casts JSON data to its corresponding SQL types or to VARCHAR if Drill is in all text mode.
  • MapR-DB Implicitly casts MapR-DB data to SQL types when you use the maprdb format for reading MapR-DB data. The dfs storage plugin defines the format when you install Drill from the mapr-drill package on a MapR node.
  • Parquet Implicitly casts Parquet data to the SQL types shown in SQL Data Types to Parquet.
  • Text: CSV, TSV, and other text Implicitly casts all textual data to VARCHAR.

Implicit Casting Precedence of Data Types

Drill’s implicit casting logic was overhauled in version 1.21, changing internally from a linear ranking of the SQL data types by “precedence” to a directed graph over the SQL data types which encodes the cost5 of casting between them in a transitive way. Without detailing the cost of every cast, the following principles give a good idea of what implicit cast Drill will attempt to insert, if any.

  1. The cost of casting from type A to type B is the path of least cost over the data type DAG.
  2. The NULL type is castable to any type with VARCHAR being the cheapest.
  3. It is cheaper to cast a NULL argument to any non-NULL type than to cast a non-NULL to another non-NULL.
  4. One expensive cast is generally cheaper than multiple cheap casts.
  5. Where type widenings naturally occur, e.g. INT4 → INT8, these casts are cheap.

Here are three examples of implcit casting at work.

select sqrt('2');
EXPR$0  1.4142135623730951

1 row selected (0.24 seconds)

select current_date > '1970-01-01';
EXPR$0  true

1 row selected (0.262 seconds)

select 'False' = false;
EXPR$0  true

1 row selected (0.133 seconds)

Explicit Casting

In a textual file, such as CSV, Drill interprets every field as a VARCHAR, as previously mentioned. To handle textual data, you can use the following functions to cast and convert compatible data types:

If the SELECT statement includes a WHERE clause that compares a column of an unknown data type, cast both the value of the column and the comparison value in the WHERE clause.

Explicit Type Casting Maps

The following tables show data types that Drill can cast to/from other data types. Not all types are available for explicit casting in the current release.

Numerical and Character Data Types

To SMALLINT INT BIGINT DECIMAL FLOAT CHAR FIXEDBINARY VARCHAR VARBINARY
From:                  
SMALLINT6 yes yes yes yes yes yes yes yes yes
INT yes yes yes yes yes yes yes yes yes
BIGINT yes yes yes yes yes yes yes yes yes
DECIMAL yes yes yes yes yes yes yes yes yes
DOUBLE yes yes yes yes yes yes no yes no
FLOAT yes yes yes yes yes yes no yes no
CHAR yes yes yes yes yes no yes yes yes
FIXEDBINARY7 yes yes yes yes yes no no yes yes
VARCHAR8 yes yes yes yes yes yes yes no yes
VARBINARY7 yes yes yes yes yes no yes yes no

Date and Time Data Types

To: DATE TIME TIMESTAMP INTERVALYEAR INTERVALDAY
From:          
CHAR Yes Yes Yes Yes Yes
FIXEDBINARY7 No No No No No
VARCHAR Yes Yes Yes Yes Yes
VARBINARY7 No No Yes No No
DATE No No Yes No No
TIME No Yes Yes No No
TIMESTAMP Yes Yes Yes No No
INTERVALYEAR Yes No Yes No Yes
INTERVALDAY Yes No Yes Yes No

Data Types for CONVERT_TO and CONVERT_FROM Functions

The CONVERT_TO function converts data to bytes from the input type. The CONVERT_FROM function converts data from bytes to the input type. For example, the following CONVERT_TO function converts an integer to bytes using big endian encoding:

CONVERT_TO(mycolumn, 'INT_BE')

The following table lists the data types you can use with the CONVERT_TO and CONVERT_FROM functions:

Type Input Type Output Type
JSON bytes varchar
BOOLEAN_BYTE bytes(1) BOOLEAN
TINYINT_BE bytes(1) TINYINT
TINYINT bytes(1) TINYINT
SMALLINT_BE bytes(2) SMALLINT
SMALLINT bytes(2) SMALLINT
INT_BE bytes(4) INT
INT bytes(4) INT
BIGINT_BE bytes(8) BIGINT
BIGINT bytes(8) BIGINT
FLOAT bytes(4) FLOAT (float4)
DOUBLE bytes(8) DOUBLE (float8)
INT_HADOOPV bytes(1-9) INT
BIGINT_HADOOPV BYTES(1-9) BIGINT
DATE_EPOCH_BE bytes(8) DATE
DATE_EPOCH bytes(8) DATE
TIME_EPOCH_BE bytes(8) TIME
TIME_EPOCH bytes(8) TIME
TIMESTAMP_EPOCH bytes(8) TIMESTAMP
TIMESTAMP_IMPALA* bytes(12) TIMESTAMP
UTF8 bytes VARCHAR
UTF16 bytes VAR16CHAR
UINT8 bytes(8) UINT8
UINT8_BE bytes(8) UINT8

* In Drill 1.2 and later, use the TIMESTAMP_IMPALA type with the CONVERT_FROM function to decode a timestamp from Hive or Impala, as shown in the section, “About INT96 Support”.

This table includes types such as INT, for converting little endian-encoded data and types such as INT_BE for converting big endian-encoded data to Drill internal types. You need to convert binary representations, such as data in HBase, to a Drill internal format as you query the data. If you are unsure that the size of the source and destination INT or BIGINT you are converting is the same, use CAST to convert these data types to/from binary.

*_HADOOPV in the data type name denotes the variable length integer as defined by Hadoop libraries. Use a *_HADOOPV type if user data is encoded in this format by a Hadoop tool outside MapR.

Footnotes

  1. Starting in Drill 1.14, the DECIMAL data type is enabled by default. 

  2. Internally, INTERVAL is represented as INTERVALDAY or INTERVALYEAR. 

  3. SMALLINT is not currently supported. 

  4. The CHAR data type is internally represented as VARCHAR by Drill. 

  5. Casting costs in Drill are set so as to achieve a target set of query behaviours, as measured by the test suite. They are not the computational cost nor are do they derive from any similar single principle. 

  6. Not supported in this release. 

  7. Used to cast binary UTF-8 data coming to/from sources such as HBase. The CAST function does not support all representations of FIXEDBINARY and VARBINARY. Only the UTF-8 format is supported. If your FIXEDBINARY or VARBINARY data is in a format other than UTF-8, or big-endian encoded, use the CONVERT_TO/FROM functions instead of CAST.  2 3 4

  8. You cannot convert a character string having a decimal point to an INT or BIGINT.