String Distance Functions

Introduced in release: 1.14.

Drill provides a functions for calculating a variety well known string distance metrics. Typically, you use string distance functions in the WHERE clause of a query to measure the difference between two strings. For example, if you want to match a street address, but do not know how to spell a street name, you could execute a query on the data source with the street addresses:

   SELECT street_address
   FROM address-data
   WHERE cosine_distance( `street_address`, “1234 North Quail Ln” ) <  0.5;

The search would return addresses from rows with street addresses similar to 1234 North Quail Ln, such as:

   1234 N. Quail Lane
   1234 N Quaile Lan

Drill supports the following string distance functions.

Function	Return type	Description
COSINE_DISTANCE(string1, string2)	FLOAT8	Returns the cosine distance, a measurement of the angular distance between between two strings regarded as word vectors.
FUZZY_SCORE(string1, string2)	FLOAT8	Returns the score from a fuzzy string matching algorithm¹. Higher scores indicate greater similarity.
HAMMING_DISTANCE(string1, string2)	FLOAT8	Returns the Hamming distance between two strings of equal length, a measurement of the number of positions at which corresponding characters differ.
JACCARD_DISTANCE(string1, string2)	FLOAT8	Returns the Jaccard distance between two strings regarded as unordered sets of characters, a measurement of the overlap between two sets.
JARO_DISTANCE(string1, string2)	FLOAT8	Returns the Jaro-Winkler distance, a measurement of the fraction of matching characters between two strings.
LEVENSHTEIN_DISTANCE(string1, string2)	FLOAT8	Returns the Levenshtein distance between two strings, a measurement of the number of single character modifications needed change one string into another.
LONGEST_COMMON_SUBSTRING_DISTANCE(string1, string2)	FLOAT8	Returns the length of the longest common substring across two strings².

Calculates the score from a matching algorithm similar to the searching algorithms implemented in editors such as Sublime Text, TextMate, Atom, and others. One point is given for every matched character. Subsequent matches yield two bonus points. ↩
Generally this algorithm is fairly inefficient, as for length m, n of the input CharSequence’s left and right respectively, the runtime of the algorithm is O(m*n). ↩

← Phonetic FunctionsCryptography Functions →