String Distance Functions
Introduced in release: 1.14.
Drill provides a functions for calculating a variety well known string distance metrics. Typically, you use string distance functions in the WHERE clause of a query to measure the difference between two strings. For example, if you want to match a street address, but do not know how to spell a street name, you could execute a query on the data source with the street addresses:
SELECT street_address
FROM address-data
WHERE cosine_distance( `street_address`, “1234 North Quail Ln” ) < 0.5;
The search would return addresses from rows with street addresses similar to 1234 North Quail Ln, such as:
1234 N. Quail Lane
1234 N Quaile Lan
Drill supports the following string distance functions.
Function | Return type | Description |
---|---|---|
COSINE_DISTANCE(string1, string2) | FLOAT8 | Returns the cosine distance, a measurement of the angular distance between between two strings regarded as word vectors. |
FUZZY_SCORE(string1, string2) | FLOAT8 | Returns the score from a fuzzy string matching algorithm1. Higher scores indicate greater similarity. |
HAMMING_DISTANCE(string1, string2) | FLOAT8 | Returns the Hamming distance between two strings of equal length, a measurement of the number of positions at which corresponding characters differ. |
JACCARD_DISTANCE(string1, string2) | FLOAT8 | Returns the Jaccard distance between two strings regarded as unordered sets of characters, a measurement of the overlap between two sets. |
JARO_DISTANCE(string1, string2) | FLOAT8 | Returns the Jaro-Winkler distance, a measurement of the fraction of matching characters between two strings. |
LEVENSHTEIN_DISTANCE(string1, string2) | FLOAT8 | Returns the Levenshtein distance between two strings, a measurement of the number of single character modifications needed change one string into another. |
LONGEST_COMMON_SUBSTRING_DISTANCE(string1, string2) | FLOAT8 | Returns the length of the longest common substring across two strings2. |
-
Calculates the score from a matching algorithm similar to the searching algorithms implemented in editors such as Sublime Text, TextMate, Atom, and others. One point is given for every matched character. Subsequent matches yield two bonus points. ↩
-
Generally this algorithm is fairly inefficient, as for length m, n of the input CharSequence’s left and right respectively, the runtime of the algorithm is O(m*n). ↩