In databases, Levenshtein Distance can be used to detect errors in string fields, such as misspellings or typos. For example, given a list of names with corresponding addresses, Levenshtein Distance can be used to identify typos in spellings, such as Jones/Jimes, or to detect duplicates in certain fields, such as address1/address2.
What is it?
Levenshtein Distance is a measure of similarity between two strings or sequences of words or symbols. It is also sometimes referred to as Edit Distance because it can be used to measure the number of edits (insertions, deletions, and substitutions) it would take to transform one sequence to the other. It can be used to compare strings of equal or different lengths and is especially useful in data processing, data management, and computational linguistics.
It works by calculating the edit distance between two strings. The result is the number of inserts, deletes, and substitutions it would take to transform one string to the other. For example, if two strings x and y differ in length by more than one character, the edit distance between them is infinite, since it would take an infinite number of edits to match their lengths.
The Levenshtein Distance algorithm has several variations, all of which can be used to rapidly evaluate similarities between strings. One of the most popular variations, the Wagner-Fischer algorithm is based on dynamic programming. It starts by creating a matrix of distances, and then calculates the edit distance between two strings based on the values stored in the matrix.
One of the most popular applications of Levenshtein Distance in databases is fuzzy string matching. For example, given a list of reference strings stored in a database, the edit distance between a query string and each reference string can be quickly calculated to determine the closest match. By setting a certain threshold, mismatched characters can be ignored, allowing the algorithm to identify nearly identical strings.
Pros and cons
Pros
- Levenshtein distance allows us to identify how similar two pieces of data are to each other by determining the minimum number of edits required for one string to become the other.
- Levenshtein distance can be used to match data from different sources when the exact match isn't available.
- It can be used to detect typos and misspellings in entered data, which can aid in data accuracy and quality.
- Different applications of Levenshtein distance to fuzzy matching can help to improve the efficiency of the database and improve search results accuracy.
Cons
- Levenshtein distance can be computationally expensive, and the calculations can bog down a system if not used properly.
- Using Levenshtein distance can reduce query performance, meaning that the system may not be as responsive as it could be if other comparisons were used.
- Levenshtein distance is only applicable to strings, and so it cannot be used to match data from different sources if they don't have any common identifying text.
- There is a certain amount of arbitrariness with Levenshtein distance and how you define your parameters for determining the similarity between the pieces of data.
Examples
MS SQL
Here is a slightly reworked version of the code written by Arnold Fribble in https://www.sqlteam.com/forums/topic.asp?TOPIC_ID=51540.
CREATE FUNCTION fn_levenshtein_distance(@s nvarchar(4000), @t nvarchar(4000), @d int)
RETURNS int
AS
BEGIN
DECLARE @sl int, @tl int, @i int, @j int, @sc nchar, @c int, @c1 int,
@cv0 nvarchar(4000), @cv1 nvarchar(4000), @cmin int
SELECT @sl = LEN(@s), @tl = LEN(@t), @cv1 = ', @j = 1, @i = 1, @c = 0
WHILE @j <= @tl
SELECT @cv1 = @cv1 + NCHAR(@j), @j = @j + 1
WHILE @i <= @sl
BEGIN
SELECT @sc = SUBSTRING(@s, @i, 1), @c1 = @i, @c = @i, @cv0 = ', @j = 1, @cmin = 4000
WHILE @j <= @tl BEGIN SET @c = @c + 1 SET @c1 = @c1 - CASE WHEN @sc = SUBSTRING(@t, @j, 1) THEN 1 ELSE 0 END IF @c > @c1 SET @c = @c1
SET @c1 = UNICODE(SUBSTRING(@cv1, @j, 1)) + 1
IF @c > @c1 SET @c = @c1
IF @c < @cmin SET @cmin = @c SELECT @cv0 = @cv0 + NCHAR(@c), @j = @j + 1 END IF @cmin > @d BREAK
SELECT @cv1 = @cv0, @i = @i + 1
END
RETURN CASE WHEN @cmin <= @d AND @c <= @d THEN @c ELSE -1 END
END
It could be used in a number of different ways.
SELECT f.filmId, f.title, dbo.fn_levenshtein_distance(f.title, 'Matrix', 8) AS distance FROM Films f;
----------------------------------
| filmId | title | distance |
| 1 | The Matrix | 4 |
| 2 | Godfather | -1 |
----------------------------------
SELECT f.filmId, f.title FROM Films f WHERE dbo.fn_levenshtein_distance(f.title, 'Matrix', 8) BETWEEN 0 AND 4;
-----------------------
| filmId | title |
| 1 | The Matrix |
-----------------------
MySQL
This code is directly from https://lucidar.me/en/web-dev/levenshtein-distance-in-mysql/.
DELIMITER $$
CREATE FUNCTION levenshtein( s1 VARCHAR(255), s2 VARCHAR(255) )
RETURNS INT
DETERMINISTIC
BEGIN
DECLARE s1_len, s2_len, i, j, c, c_temp, cost INT;
DECLARE s1_char CHAR;
-- max strlen=255
DECLARE cv0, cv1 VARBINARY(256);
SET s1_len = CHAR_LENGTH(s1), s2_len = CHAR_LENGTH(s2), cv1 = 0x00, j = 1, i = 1, c = 0;
IF s1 = s2 THEN
RETURN 0;
ELSEIF s1_len = 0 THEN
RETURN s2_len;
ELSEIF s2_len = 0 THEN
RETURN s1_len;
ELSE
WHILE j <= s2_len DO
SET cv1 = CONCAT(cv1, UNHEX(HEX(j))), j = j + 1;
END WHILE;
WHILE i <= s1_len DO
SET s1_char = SUBSTRING(s1, i, 1), c = i, cv0 = UNHEX(HEX(i)), j = 1;
WHILE j <= s2_len DO
SET c = c + 1;
IF s1_char = SUBSTRING(s2, j, 1) THEN
SET cost = 0; ELSE SET cost = 1;
END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j, 1)), 16, 10) + cost;
IF c > c_temp THEN SET c = c_temp; END IF;
SET c_temp = CONV(HEX(SUBSTRING(cv1, j+1, 1)), 16, 10) + 1;
IF c > c_temp THEN
SET c = c_temp;
END IF;
SET cv0 = CONCAT(cv0, UNHEX(HEX(c))), j = j + 1;
END WHILE;
SET cv1 = cv0, i = i + 1;
END WHILE;
END IF;
RETURN c;
END$$
DELIMITER ;
It could be used in a number of different ways.
USE sakila;
SET @keyword = 'Florencia';
SELECT city_id, city, levenshtein(@keyword, city) AS distance FROM sakila.city WHERE levenshtein(@keyword, city) BETWEEN 0 AND 4;
----------------------------------
| city_id | city | distance |
| 91 | Brescia | 4 |
| 170 | Florencia | 0 |
| 561 | Valencia | 4 |
----------------------------------
In conclusion
Levenshtein distance can be very useful for databases when doing string comparisons. It is a numerical measurement used to compare two strings of text, indicating how many word changes would have to be made in order to turn one string into the other. For example, if you are looking to compare words with different spellings, Levenshtein distance can be very helpful. It is also a great tool when trying to match two strings within a range of similarity thresholds.
However, Levenshtein distance should not be used for databases if performance is a priority. It is a time consuming process and requires a lot of computation power as compared to more time-efficient algorithms such as Boyer-Moore string-matching algorithm. In addition, it cannot identify true matches that are beyond the set threshold. For these reasons, it is generally better to use alternative methods if performance is essential.