Several years ago, I had the recurring task of locating and merging records for duplicate customer names in a large database. Because the names, DOB and SSN are not generally identical, I had to develop an algorithm that could compare any two strings and give a "percent similar" number, so I could filter out only the most likely duplicates and go through those manually.
I put a space at the beginning and end of each string and break each string up into 2 character sequences of strings. Then, I count how many of these sequences in the shorter string are also in the longer string, pairing them up, so I don't double-count. The number of common 2 character strings divided by the total in the shorter string is the percent similar.
For example, the word 'farm' and 'charm' can be broken into (using an underscore to represent the space):
_f fa ar rm m_
and
_c ch ha ar rm m_
Comparing the two, we can pair up three of these 2 character strings (ar rm m_). Since the shorter string has 5, this is 3/5, or 60% similarity.
If the second string was 'family charm', it would be 100% similar, because it would contain all the 2 character sequences. This is one weakness of this algorithm, although it is easy to compensate for by comparing string length also. This problem mostly comes up when one string is very short.
Overall, I have used this algorithm quite a lot, and it seems to work very well, finding duplicates and similarities that would be extremely difficult otherwise.
Here is the code I wrote to implement this algorithm:
CREATE FUNCTION [dbo].[percentSimilar](@string1 as varchar(1000), @string2 as varchar(1000) )
RETURNS decimal(10,9)
AS
BEGIN
Declare @percentSimilar decimal(10,9);
Declare @length1 int;
Declare @length2 int;
Declare @inCommon int;
Declare @minSequences int;
Declare @maxLength int;
Declare @position int;
Declare @tally table (sequence char(2),
times1 int,
times2 int);
Declare @intersect table (sequence char(2),
overlapint);
If @string1 <> '' and @string2 <> '' --if either string is an empty string, return 0
Begin
Set @string1 = ' ' + @string1 + ' ';
Set @string2 = ' ' + @string2 + ' '; --put a space at the beginning and end of each string
Set @length1 = Len(@string1);
Set @length2 = Len(@string2);
If @length1 < @length2--find the minimum 2-character sequences, to divide the total overlap
Set @minSequences = @length1 - 1;--by and get the percentage
Else
Set @minSequences = @length2 - 1;
Set @position = 1;--add the two character strings from string 1 to the tally table
While @position < @length1
Begin
Insert Into @tally (sequence, times1, times2) Values(Substring(@string1, @position, 2),1,0);
Set @position = @position + 1
End
Set @position = 1;--add the two character strings from string 2 to the tally table
While @position < @length2
Begin
Insert Into @tally (sequence, times1, times2) Values(Substring(@string2, @position, 2),0,1);
Set @position = @position + 1
End
Insert Into @intersect--count how many times each sequence overlaps the other
Select sequence, case when sum(times1) < sum(times2) then sum(times1) else sum(times2) end as overlap
From @tally
Group By sequence
Select @inCommon = Sum(overlap) From @intersect; --the total overlap of two character sequences
Set @percentSimilar = 1.000000000 * @inCommon / @minSequences;
End
Else
Set @percentSimilar = 0;
RETURN @percentSimilar;
END