June 15, 2015 at 7:55 pm
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=51540
The above URL giving clear picture about Levenshtein algorithm between two strings and it seems to be working fine as expected. How do I apply this into one table? My requirement here is, I want to show matched strings from Name Column in my DB table. Let's assume my DB table has 10 rows and should be displayed as follows..
ABCD
AB_B
QWE_
QWER
PQRSWE
P__S_W_
AS23_R
AS_3T_
Is there any possiblity to apply this algorithm for DB table?
June 15, 2015 at 8:57 pm
You could apply it to a table like so:
USE tempdb
GO
CREATE FUNCTION edit_distance(@s1 nvarchar(3999), @s2 nvarchar(3999))
RETURNS int
AS
BEGIN
DECLARE @s1_len int, @s2_len int, @i int, @j int, @s1_char nchar, @c int, @c_temp int,
@cv0 varbinary(8000), @cv1 varbinary(8000)
SELECT @s1_len = LEN(@s1), @s2_len = LEN(@s2), @cv1 = 0x0000, @j = 1, @i = 1, @c = 0
WHILE @j <= @s2_len
SELECT @cv1 = @cv1 + CAST(@j AS binary(2)), @j = @j + 1
WHILE @i <= @s1_len
BEGIN
SELECT @s1_char = SUBSTRING(@s1, @i, 1), @c = @i, @cv0 = CAST(@i AS binary(2)), @j = 1
WHILE @j <= @s2_len
BEGIN
SET @c = @c + 1
SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j-1, 2) AS int) +
CASE WHEN @s1_char = SUBSTRING(@s2, @j, 1) THEN 0 ELSE 1 END
IF @c > @c_temp SET @c = @c_temp
SET @c_temp = CAST(SUBSTRING(@cv1, @j+@j+1, 2) AS int)+1
IF @c > @c_temp SET @c = @c_temp
SELECT @cv0 = @cv0 + CAST(@c AS binary(2)), @j = @j + 1
END
SELECT @cv1 = @cv0, @i = @i + 1
END
RETURN @c
END
GO
DECLARE @strings TABLE (stID int identity, s1 varchar(8000), s2 varchar(8000));
INSERT @strings(s1, s2) VALUES
('ABCD', 'AB_B'),
('QWE','QWER'),
('PQRSWE','P__S_W_'),
('AS23_R','AS_3T_');
SELECT s1, s2, ED = dbo.edit_distance(s1,s2)
FROM @strings;
-- Itzik Ben-Gan 2001
June 15, 2015 at 9:47 pm
I put together a real-life example of where the Levenshtein Distance is used.
Say you had a SQL Job or ETL package that imported data into a data warehouse or data mart. You have a lookup table and only values that had a match in that table could be imported. Values that are not matched are inserted into a table for unmatched values for further processing either by some Data Quality Services (DQS) software or to be looked at by a real person....
Now let's say we wanted to provide that software or person with the closest possible match (that's what Levenshtein is all about). You could do something like this:
-- Sample Data
DECLARE @CityLookup TABLE(city varchar(8000));
DECLARE @UnmatchedValues TABLE (city varchar(8000));
INSERT @CityLookup VALUES ('Chicago'),('Detroit'),('Mexico City');
INSERT @UnmatchedValues VALUES ('Chicaggo'),('Detrot'),('MexicCity');
-- Finding the "closest match"
WITH BestGuess AS
(
SELECT
UnmatchedValue = unmatched.city,
ClosestMatch = c.city,
ED = (dbo.edit_distance(unmatched.city, c.city)),
LevRank = RANK() OVER
(PARTITION BY unmatched.city
ORDER BY (dbo.edit_distance(unmatched.city, c.city)))
FROM
(
SELECT i.city
FROM @UnmatchedValues i
WHERE NOT EXISTS (SELECT 1 FROM @CityLookup c WHERE i.city=c.city)
) unmatched
CROSS JOIN @CityLookup c
)
SELECT UnmatchedValue, ClosestMatch, ED
FROM BestGuess
WHERE LevRank = 1;
Examining the above code you can see that for "Chicaggo" the closest possible match is Chicago, for "Detrot" - Detroit, for "MexicCity" - Mexico City.... I hope that makes sense, if not let us know.
Note that the Levenshtein function you posted is actually very slow (note the comments on that page). There is a guy who developed a much faster version here: http://randomrumenations.blogspot.com/2009/06/improved-t-sql-levenshtein-distance.html. Lastly - the Damerau-Levenshtein Distance is a more accurate algorithm but it's going to be slower. You can learn more about that here: http://www.sqlservercentral.com/articles/Fuzzy+Match/92822/%5B/url%5D.
-- Itzik Ben-Gan 2001
Viewing 3 posts - 1 through 3 (of 3 total)
You must be logged in to reply to this topic. Login to reply
This website stores cookies on your computer.
These cookies are used to improve your website experience and provide more personalized services to you, both on this website and through other media.
To find out more about the cookies we use, see our Privacy Policy