Levenshtein distance in databases

September 17, 2023 - Drygast

Database MS SQL MySQL Software Dev

In databases, Levenshtein Distance can be used to detect errors in string fields, such as misspellings or typos. For example, given a list of names with corresponding addresses, Levenshtein Distance can be used to identify typos in spellings, such as Jones/Jimes, or to detect duplicates in certain fields, such as address1/address2.

What is it?

Levenshtein Distance is a measure of similarity between two strings or sequences of words or symbols. It is also sometimes referred to as Edit Distance because it can be used to measure the number of edits (insertions, deletions, and substitutions) it would take to transform one sequence to the other. It can be used to compare strings of equal or different lengths and is especially useful in data processing, data management, and computational linguistics.

It works by calculating the edit distance between two strings. The result is the number of inserts, deletes, and substitutions it would take to transform one string to the other. For example, if two strings x and y differ in length by more than one character, the edit distance between them is infinite, since it would take an infinite number of edits to match their lengths.

The Levenshtein Distance algorithm has several variations, all of which can be used to rapidly evaluate similarities between strings. One of the most popular variations, the Wagner-Fischer algorithm is based on dynamic programming. It starts by creating a matrix of distances, and then calculates the edit distance between two strings based on the values stored in the matrix.

One of the most popular applications of Levenshtein Distance in databases is fuzzy string matching. For example, given a list of reference strings stored in a database, the edit distance between a query string and each reference string can be quickly calculated to determine the closest match. By setting a certain threshold, mismatched characters can be ignored, allowing the algorithm to identify nearly identical strings.

Pros and cons

Pros

Levenshtein distance allows us to identify how similar two pieces of data are to each other by determining the minimum number of edits required for one string to become the other.
Levenshtein distance can be used to match data from different sources when the exact match isn't available.
It can be used to detect typos and misspellings in entered data, which can aid in data accuracy and quality.
Different applications of Levenshtein distance to fuzzy matching can help to improve the efficiency of the database and improve search results accuracy.

Cons

Levenshtein distance can be computationally expensive, and the calculations can bog down a system if not used properly.
Using Levenshtein distance can reduce query performance, meaning that the system may not be as responsive as it could be if other comparisons were used.
Levenshtein distance is only applicable to strings, and so it cannot be used to match data from different sources if they don't have any common identifying text.
There is a certain amount of arbitrariness with Levenshtein distance and how you define your parameters for determining the similarity between the pieces of data.

Examples

MS SQL

Here is a slightly reworked version of the code written by Arnold Fribble in https://www.sqlteam.com/forums/topic.asp?TOPIC_ID=51540.

CREATE FUNCTION fn_levenshtein_distance(@s nvarchar(4000), @t nvarchar(4000), @d int)
RETURNS int
AS
BEGIN
  DECLARE @sl int, @tl int, @i int, @j int, @sc nchar, @c int, @c1 int,
    @cv0 nvarchar(4000), @cv1 nvarchar(4000), @cmin int
  SELECT @sl = LEN(@s), @tl = LEN(@t), @cv1 = ', @j = 1, @i = 1, @c = 0
  WHILE @j <= @tl
    SELECT @cv1 = @cv1 + NCHAR(@j), @j = @j + 1
  WHILE @i <= @sl
  BEGIN
    SELECT @sc = SUBSTRING(@s, @i, 1), @c1 = @i, @c = @i, @cv0 = ', @j = 1, @cmin = 4000
    WHILE @j <= @tl BEGIN SET @c = @c + 1 SET @c1 = @c1 - CASE WHEN @sc = SUBSTRING(@t, @j, 1) THEN 1 ELSE 0 END IF @c > @c1 SET @c = @c1
      SET @c1 = UNICODE(SUBSTRING(@cv1, @j, 1)) + 1
      IF @c > @c1 SET @c = @c1
      IF @c < @cmin SET @cmin = @c SELECT @cv0 = @cv0 + NCHAR(@c), @j = @j + 1 END IF @cmin > @d BREAK
    SELECT @cv1 = @cv0, @i = @i + 1
  END
  RETURN CASE WHEN @cmin <= @d AND @c <= @d THEN @c ELSE -1 END
END

It could be used in a number of different ways.

SELECT f.filmId, f.title, dbo.fn_levenshtein_distance(f.title, 'Matrix', 8) AS distance FROM Films f;
----------------------------------
| filmId | title      | distance |
| 1      | The Matrix | 4        |
| 2      | Godfather  | -1       |
----------------------------------

SELECT f.filmId, f.title FROM Films f WHERE dbo.fn_levenshtein_distance(f.title, 'Matrix', 8) BETWEEN 0 AND 4;
-----------------------
| filmId | title      |
| 1      | The Matrix |
-----------------------

MySQL

This code is directly from https://lucidar.me/en/web-dev/levenshtein-distance-in-mysql/.

DELIMITER $$
CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) )
    RETURNS INT
    DETERMINISTIC
    BEGIN
        DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
        DECLARE s1_char CHAR;
        -- max strlen=255
        DECLARE cv0, cv1 VARBINARY(256);

        SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;

        IF s1 = s2 THEN
            RETURN 0;
        ELSEIF s1_len = 0 THEN
            RETURN s2_len;
        ELSEIF s2_len = 0 THEN
            RETURN s1_len;
        ELSE
            WHILE j <= s2_len DO
                SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
            END WHILE;
            WHILE i <= s1_len DO
                SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
                WHILE j <= s2_len DO
                    SET c = c + 1;
                    IF s1_char = SUBSTRING(s2, j, 1) THEN
                        SET cost = 0; ELSE SET cost = 1;
                    END IF;
                    SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
                    IF c > c_temp THEN SET c = c_temp; END IF;
                    SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
                    IF c > c_temp THEN
                        SET c = c_temp;
                    END IF;
                    SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
                END WHILE;
                SET cv1 = cv0, i = i + 1;
            END WHILE;
        END IF;
        RETURN c;
    END$$
DELIMITER ;

It could be used in a number of different ways.

USE sakila;
SET @keyword = 'Florencia';
SELECT city_id, city, levenshtein(@keyword, city) AS distance FROM sakila.city WHERE levenshtein(@keyword, city) BETWEEN 0 AND 4;
----------------------------------
| city_id | city      | distance |
| 91      | Brescia   | 4        |
| 170     | Florencia | 0        |
| 561     | Valencia  | 4        |
----------------------------------

In conclusion

Levenshtein distance can be very useful for databases when doing string comparisons. It is a numerical measurement used to compare two strings of text, indicating how many word changes would have to be made in order to turn one string into the other. For example, if you are looking to compare words with different spellings, Levenshtein distance can be very helpful. It is also a great tool when trying to match two strings within a range of similarity thresholds.

However, Levenshtein distance should not be used for databases if performance is a priority. It is a time consuming process and requires a lot of computation power as compared to more time-efficient algorithms such as Boyer-Moore string-matching algorithm. In addition, it cannot identify true matches that are beyond the set threshold. For these reasons, it is generally better to use alternative methods if performance is essential.