SQL Clone
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


Fuzzy-String Search: Find misspelled information with T-SQL


Fuzzy-String Search: Find misspelled information with T-SQL

Author
Message
Thomas Keller
Thomas Keller
SSC Eights!
SSC Eights! (938 reputation)SSC Eights! (938 reputation)SSC Eights! (938 reputation)SSC Eights! (938 reputation)SSC Eights! (938 reputation)SSC Eights! (938 reputation)SSC Eights! (938 reputation)SSC Eights! (938 reputation)

Group: General Forum Members
Points: 938 Visits: 158
@Davin21: Indeed, I had originally linked the very informative SimMetrics Web Site, which was linked from your sourceforge link, but it seems to be permanently down nowSad
Thomas Keller
Thomas Keller
SSC Eights!
SSC Eights! (938 reputation)SSC Eights! (938 reputation)SSC Eights! (938 reputation)SSC Eights! (938 reputation)SSC Eights! (938 reputation)SSC Eights! (938 reputation)SSC Eights! (938 reputation)SSC Eights! (938 reputation)

Group: General Forum Members
Points: 938 Visits: 158
@Martin Bastable: Thanks for your kind words, and thanks to those who proofread various drafts of my article, including Kellie Sanchez, Dan Hess, Eric Freeman, Ernest Cook, and Chris Anderson! (Sorry I didn't thank y'all in the article itself.)

I have also implemented systems such as you describe, that pre-process the data and combine it in various ways, to make "fuzzy" searching quicker and more flexible. But for this article, I wanted to present a method that can be dropped into any database with a minimum of work, and still have acceptable performance when "fuzzy" searching on one field.
Martin Bastable
Martin Bastable
Ten Centuries
Ten Centuries (1.3K reputation)Ten Centuries (1.3K reputation)Ten Centuries (1.3K reputation)Ten Centuries (1.3K reputation)Ten Centuries (1.3K reputation)Ten Centuries (1.3K reputation)Ten Centuries (1.3K reputation)Ten Centuries (1.3K reputation)

Group: General Forum Members
Points: 1338 Visits: 248
I call it a win win! You could use it as a stand alone search in a field, or combine it with other things for a super search Smile

M.
dmusick
dmusick
SSC-Enthusiastic
SSC-Enthusiastic (119 reputation)SSC-Enthusiastic (119 reputation)SSC-Enthusiastic (119 reputation)SSC-Enthusiastic (119 reputation)SSC-Enthusiastic (119 reputation)SSC-Enthusiastic (119 reputation)SSC-Enthusiastic (119 reputation)SSC-Enthusiastic (119 reputation)

Group: General Forum Members
Points: 119 Visits: 49
Several years ago, I had the recurring task of locating and merging records for duplicate customer names in a large database. Because the names, DOB and SSN are not generally identical, I had to develop an algorithm that could compare any two strings and give a "percent similar" number, so I could filter out only the most likely duplicates and go through those manually.

I put a space at the beginning and end of each string and break each string up into 2 character sequences of strings. Then, I count how many of these sequences in the shorter string are also in the longer string, pairing them up, so I don't double-count. The number of common 2 character strings divided by the total in the shorter string is the percent similar.

For example, the word 'farm' and 'charm' can be broken into (using an underscore to represent the space):

_f fa ar rm m_

and

_c ch ha ar rm m_

Comparing the two, we can pair up three of these 2 character strings (ar rm m_). Since the shorter string has 5, this is 3/5, or 60% similarity.

If the second string was 'family charm', it would be 100% similar, because it would contain all the 2 character sequences. This is one weakness of this algorithm, although it is easy to compensate for by comparing string length also. This problem mostly comes up when one string is very short.

Overall, I have used this algorithm quite a lot, and it seems to work very well, finding duplicates and similarities that would be extremely difficult otherwise.

Here is the code I wrote to implement this algorithm:

CREATE FUNCTION [dbo].[percentSimilar](@string1 as varchar(1000), @string2 as varchar(1000) )

RETURNS decimal(10,9)
AS
BEGIN
Declare @percentSimilar decimal(10,9);
Declare @length1 int;
Declare @length2 int;
Declare @inCommon int;
Declare @minSequences int;
Declare @maxLength int;
Declare @position int;
Declare @tally table ( sequence char(2),
times1 int,
times2 int);
Declare @intersect table ( sequence char(2),
overlap int);

If @string1 <> '' and @string2 <> '' --if either string is an empty string, return 0
Begin
Set @string1 = ' ' + @string1 + ' ';
Set @string2 = ' ' + @string2 + ' '; --put a space at the beginning and end of each string
Set @length1 = Len(@string1);
Set @length2 = Len(@string2);

If @length1 < @length2 --find the minimum 2-character sequences, to divide the total overlap
Set @minSequences = @length1 - 1; --by and get the percentage
Else
Set @minSequences = @length2 - 1;

Set @position = 1; --add the two character strings from string 1 to the tally table
While @position < @length1
Begin
Insert Into @tally (sequence, times1, times2) Values(Substring(@string1, @position, 2),1,0);
Set @position = @position + 1
End

Set @position = 1; --add the two character strings from string 2 to the tally table
While @position < @length2
Begin
Insert Into @tally (sequence, times1, times2) Values(Substring(@string2, @position, 2),0,1);
Set @position = @position + 1
End

Insert Into @intersect --count how many times each sequence overlaps the other
Select sequence, case when sum(times1) < sum(times2) then sum(times1) else sum(times2) end as overlap
From @tally
Group By sequence

Select @inCommon = Sum(overlap) From @intersect; --the total overlap of two character sequences

Set @percentSimilar = 1.000000000 * @inCommon / @minSequences;

End
Else
Set @percentSimilar = 0;

RETURN @percentSimilar;

END
dmusick
dmusick
SSC-Enthusiastic
SSC-Enthusiastic (119 reputation)SSC-Enthusiastic (119 reputation)SSC-Enthusiastic (119 reputation)SSC-Enthusiastic (119 reputation)SSC-Enthusiastic (119 reputation)SSC-Enthusiastic (119 reputation)SSC-Enthusiastic (119 reputation)SSC-Enthusiastic (119 reputation)

Group: General Forum Members
Points: 119 Visits: 49
I wanted to add that one of the big strengths of this algorithm is that it will detect similarities even if the words are scrambled in different orders or if one string has an additional word in the beginning. Because it is only looking at 2 character sequences, it doesn't care about positioning within the string. This is very helpful when dealing with names, which can often be out of order, such as when the data entry person switches the first and last name, or they put a double last name (common among Hispanics) in the middle name, then the last name, and then the next person does it the opposite way.

I have seen so many bizarre data entry issues, where the same person was entered into the database quite differently each time, but my algorithm was able to find them.
Thomas Keller
Thomas Keller
SSC Eights!
SSC Eights! (938 reputation)SSC Eights! (938 reputation)SSC Eights! (938 reputation)SSC Eights! (938 reputation)SSC Eights! (938 reputation)SSC Eights! (938 reputation)SSC Eights! (938 reputation)SSC Eights! (938 reputation)

Group: General Forum Members
Points: 938 Visits: 158
@dmusick: Interesting approach, and certainly useful. Note that DLD shows the same similarity for 'farm' and 'charm' (40% diff = 60% sim). For performance, I would try to rework your UDF to avoid a local table variable within the UDF. When benchmarking my OpHTML UDF, used in the final SELECT, I noticed that SQL opens transactions for all table variable activity. My Lim UDF, used in the WHERE clause, has better performance because it has no local table variable.
ChrisM@Work
ChrisM@Work
SSC Guru
SSC Guru (157K reputation)SSC Guru (157K reputation)SSC Guru (157K reputation)SSC Guru (157K reputation)SSC Guru (157K reputation)SSC Guru (157K reputation)SSC Guru (157K reputation)SSC Guru (157K reputation)

Group: General Forum Members
Points: 157174 Visits: 21385
Interesting article and well worth a second read, and a play with the code.

I've used two TSQL fuzzy-matching algorithms in the past, both work "well enough" - subjective and also data-dependant.
The first uses token-matching, like dmusick's algorithm in a previous post. Grab the first three letters from the reference word, see if it matches in the target. Then do the same with letters 2,3,4, then 3,4,5 and so on. It's not overly complicated and can be squeezed into an iTVF. The downside is that matching on words of twice the token length or less isn't reliable, so matching algorithms will depend upon the word length. Here's a typical token-matching iTVF employing 3-character tokens for reference words of 7 characters or more, and two-character tokens for reference words between 3 and 6 characters long;

CREATE FUNCTION [dbo].[FuzzyMatch_iTVF2k5] 
(@Reference VARCHAR(100) = NULL,
@Target VARCHAR(100) = NULL)
RETURNS table WITH SCHEMABINDING
AS
-- Chris Morris 2012
-- Fuzzy-matching using tokens
-- See also http://research.microsoft.com/pubs/75996/bm_sigmod03.pdf

RETURN
SELECT
d.Method,
MatchRatio = CAST(CASE
WHEN d.Method = 1 THEN 100
WHEN d.Method = 3 THEN LenTarget*100.00/LenReference
WHEN d.Method = 4 THEN LenReference*100.00/LenTarget
WHEN d.Method = 5 THEN
(
SELECT
MatchPC = (100.00 * ISNULL(NULLIF(SUM(
CASE WHEN Tally.n < PosInTarget THEN Tally.n/PosInTarget ELSE PosInTarget/Tally.n END
),0)+2.00,0) / LenReference)
* CASE WHEN LenTarget > LenReference THEN LenReference/LenTarget ELSE 1.00 END
FROM ( -- Tally
SELECT TOP (CAST(LenReference AS INT)-2) n = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM (SELECT n FROM (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1) d (n)) a,
(SELECT n FROM (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1) d (n)) b
) Tally
CROSS APPLY (SELECT PosInTarget = CHARINDEX(SUBSTRING(@Reference, Tally.n, 3), @Target)) x
)
WHEN d.Method = 6 THEN
(
SELECT
MatchPC = (100.00 * ISNULL(NULLIF(SUM(
CASE WHEN Tally.n < PosInTarget THEN Tally.n/PosInTarget ELSE PosInTarget/Tally.n END
),0)+1.00,0) / LenReference)
* CASE WHEN LenTarget > LenReference THEN LenReference/LenTarget ELSE 1.00 END
FROM ( -- Tally
SELECT TOP (CAST(LenReference AS INT)-1) n = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1) d (n)
) Tally
CROSS APPLY (SELECT PosInTarget = CAST(CHARINDEX(SUBSTRING(@Reference, Tally.n, 2), @Target) AS DECIMAL(5,2))) x
)
ELSE NULL
END AS DECIMAL(5,2))

FROM (
SELECT Method = CASE
WHEN @Reference = @Target THEN 1
WHEN @Reference IS NULL OR @Target IS NULL THEN 2
WHEN @Reference LIKE '%'+@Target+'%' THEN 3
WHEN @Target LIKE '%'+@Reference+'%' THEN 4
WHEN DATALENGTH(@Reference) >= 7 AND DATALENGTH(@Target) >= 7 THEN 5
WHEN DATALENGTH(@Reference) > 2 AND DATALENGTH(@Target) > 2 THEN 6
ELSE 7
END,
LenTarget = CAST(DATALENGTH(@Target) AS DECIMAL(5,2)),
LenReference = CAST(DATALENGTH(@Reference) AS DECIMAL(5,2))
) d



When comparing two columns in the same table, this function will take about 30s to process a million rows.


The second method is superficially similar to the DLD algorithm but simplified for performance; for n = 1 to length of the reference word, run through the letters of the reference word one at a time, and measure where they are found in the target word and the reference word starting at position n-1. A little simple arithmetic on the resulting numbers yields a match score. Here's the code:

DECLARE @Reference VARCHAR(25) = 'Joones' 
DECLARE @Target VARCHAR(25) = 'Joonse'

;WITH Tally AS (
SELECT TOP (DATALENGTH(@Reference)) n = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM (SELECT n FROM (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1) d (n)) a,
(SELECT n FROM (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1) d (n)) b
)
SELECT
-- * -- switch the comments around to see how it works
MatchRatio = SUM(Matches)/DATALENGTH(@Reference)
FROM Tally
CROSS APPLY (SELECT TestLetter = SUBSTRING(@Reference, Tally.n, 1)) x
CROSS APPLY (
SELECT -- start search one character position *before* n to catch switched letters
PosInReference = CAST(CHARINDEX(TestLetter, @Reference, n-1) AS DECIMAL(5,2)),
PosInTarget = CAST(CHARINDEX(TestLetter, @Target, n-1) AS DECIMAL(5,2))
) z
CROSS APPLY (
SELECT Matches = CASE WHEN PosInReference > PosInTarget
THEN PosInTarget/PosInReference ELSE PosInReference/PosInTarget END
) m



It's substantially faster than the token-matching function, processing a million rows in about 4 seconds.

“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
Exploring Recursive CTEs by Example Dwain Camps
Thomas Keller
Thomas Keller
SSC Eights!
SSC Eights! (938 reputation)SSC Eights! (938 reputation)SSC Eights! (938 reputation)SSC Eights! (938 reputation)SSC Eights! (938 reputation)SSC Eights! (938 reputation)SSC Eights! (938 reputation)SSC Eights! (938 reputation)

Group: General Forum Members
Points: 938 Visits: 158
@ChrisM@Work: Interesting approaches, but I'm wondering why you use DATALENGTH instead of LEN, are you trying to include trailing spaces? (In order to work with NVARCHAR instead of VARCHAR, you would need to divide DATALENGTH by two.)
dmusick
dmusick
SSC-Enthusiastic
SSC-Enthusiastic (119 reputation)SSC-Enthusiastic (119 reputation)SSC-Enthusiastic (119 reputation)SSC-Enthusiastic (119 reputation)SSC-Enthusiastic (119 reputation)SSC-Enthusiastic (119 reputation)SSC-Enthusiastic (119 reputation)SSC-Enthusiastic (119 reputation)

Group: General Forum Members
Points: 119 Visits: 49
I have been trying to think of good ways to get rid of the table variable. I was actually quite reluctant to use it at all, but I couldn't think of a better way to do it. I would have preferred to use arrays, but SQL doesn't have them. I may have to come up with a system for using string arrays to achieve what I need (like the good old days of programming!)
ChrisM@Work
ChrisM@Work
SSC Guru
SSC Guru (157K reputation)SSC Guru (157K reputation)SSC Guru (157K reputation)SSC Guru (157K reputation)SSC Guru (157K reputation)SSC Guru (157K reputation)SSC Guru (157K reputation)SSC Guru (157K reputation)

Group: General Forum Members
Points: 157174 Visits: 21385
Thomas Keller (9/19/2012)
@ChrisM@Work: Interesting approaches, but I'm wondering why you use DATALENGTH instead of LEN, are you trying to include trailing spaces? (In order to work with NVARCHAR instead of VARCHAR, you would need to divide DATALENGTH by two.)


Habit, Tom, nothing more.

“Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

For fast, accurate and documented assistance in answering your questions, please read this article.
Understanding and using APPLY, (I) and (II) Paul White
Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden
Exploring Recursive CTEs by Example Dwain Camps
Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum








































































































































































SQLServerCentral


Search