RE: Fuzzy-String Search: Find misspelled information with T-SQL

SSC Enthusiast

Points: 119

September 18, 2012 at 8:26 am

Several years ago, I had the recurring task of locating and merging records for duplicate customer names in a large database. Because the names, DOB and SSN are not generally identical, I had to develop an algorithm that could compare any two strings and give a "percent similar" number, so I could filter out only the most likely duplicates and go through those manually.

I put a space at the beginning and end of each string and break each string up into 2 character sequences of strings. Then, I count how many of these sequences in the shorter string are also in the longer string, pairing them up, so I don't double-count. The number of common 2 character strings divided by the total in the shorter string is the percent similar.

For example, the word 'farm' and 'charm' can be broken into (using an underscore to represent the space):

_f fa ar rm m_

and

_c ch ha ar rm m_

Comparing the two, we can pair up three of these 2 character strings (ar rm m_). Since the shorter string has 5, this is 3/5, or 60% similarity.

If the second string was 'family charm', it would be 100% similar, because it would contain all the 2 character sequences. This is one weakness of this algorithm, although it is easy to compensate for by comparing string length also. This problem mostly comes up when one string is very short.

Overall, I have used this algorithm quite a lot, and it seems to work very well, finding duplicates and similarities that would be extremely difficult otherwise.

Here is the code I wrote to implement this algorithm:

CREATE FUNCTION [dbo].[percentSimilar](@string1 as varchar(1000), @string2 as varchar(1000) )

RETURNS decimal(10,9)

BEGIN

Declare @percentSimilar decimal(10,9);

Declare @length1 int;

Declare @length2 int;

Declare @inCommon int;

Declare @minSequences int;

Declare @maxLength int;

Declare @position int;

Declare @tally table (sequence char(2),

times1 int,

times2 int);

Declare @intersect table (sequence char(2),

overlapint);

If @string1 <> '' and @string2 <> '' --if either string is an empty string, return 0

Begin

Set @string1 = ' ' + @string1 + ' ';

Set @string2 = ' ' + @string2 + ' '; --put a space at the beginning and end of each string

Set @length1 = Len(@string1);

Set @length2 = Len(@string2);

If @length1 < @length2--find the minimum 2-character sequences, to divide the total overlap

Set @minSequences = @length1 - 1;--by and get the percentage

Else

Set @minSequences = @length2 - 1;

Set @position = 1;--add the two character strings from string 1 to the tally table

While @position < @length1

Begin

Insert Into @tally (sequence, times1, times2) Values(Substring(@string1, @position, 2),1,0);

Set @position = @position + 1

End

Set @position = 1;--add the two character strings from string 2 to the tally table

While @position < @length2

Begin

Insert Into @tally (sequence, times1, times2) Values(Substring(@string2, @position, 2),0,1);

Set @position = @position + 1

End

Insert Into @intersect--count how many times each sequence overlaps the other

Select sequence, case when sum(times1) < sum(times2) then sum(times1) else sum(times2) end as overlap

From @tally

Group By sequence

Select @inCommon = Sum(overlap) From @intersect; --the total overlap of two character sequences

Set @percentSimilar = 1.000000000 * @inCommon / @minSequences;

End

Else

Set @percentSimilar = 0;

RETURN @percentSimilar;

END