• Several years ago, I had the recurring task of locating and merging records for duplicate customer names in a large database. Because the names, DOB and SSN are not generally identical, I had to develop an algorithm that could compare any two strings and give a "percent similar" number, so I could filter out only the most likely duplicates and go through those manually.

    I put a space at the beginning and end of each string and break each string up into 2 character sequences of strings. Then, I count how many of these sequences in the shorter string are also in the longer string, pairing them up, so I don't double-count. The number of common 2 character strings divided by the total in the shorter string is the percent similar.

    For example, the word 'farm' and 'charm' can be broken into (using an underscore to represent the space):

    _f fa ar rm m_

    and

    _c ch ha ar rm m_

    Comparing the two, we can pair up three of these 2 character strings (ar rm m_). Since the shorter string has 5, this is 3/5, or 60% similarity.

    If the second string was 'family charm', it would be 100% similar, because it would contain all the 2 character sequences. This is one weakness of this algorithm, although it is easy to compensate for by comparing string length also. This problem mostly comes up when one string is very short.

    Overall, I have used this algorithm quite a lot, and it seems to work very well, finding duplicates and similarities that would be extremely difficult otherwise.

    Here is the code I wrote to implement this algorithm:

    CREATE FUNCTION [dbo].[percentSimilar](@string1 as varchar(1000), @string2 as varchar(1000) )

    RETURNS decimal(10,9)

    AS

    BEGIN

    Declare @percentSimilar decimal(10,9);

    Declare @length1 int;

    Declare @length2 int;

    Declare @inCommon int;

    Declare @minSequences int;

    Declare @maxLength int;

    Declare @position int;

    Declare @tally table (sequence char(2),

    times1 int,

    times2 int);

    Declare @intersect table (sequence char(2),

    overlapint);

    If @string1 <> '' and @string2 <> '' --if either string is an empty string, return 0

    Begin

    Set @string1 = ' ' + @string1 + ' ';

    Set @string2 = ' ' + @string2 + ' '; --put a space at the beginning and end of each string

    Set @length1 = Len(@string1);

    Set @length2 = Len(@string2);

    If @length1 < @length2--find the minimum 2-character sequences, to divide the total overlap

    Set @minSequences = @length1 - 1;--by and get the percentage

    Else

    Set @minSequences = @length2 - 1;

    Set @position = 1;--add the two character strings from string 1 to the tally table

    While @position < @length1

    Begin

    Insert Into @tally (sequence, times1, times2) Values(Substring(@string1, @position, 2),1,0);

    Set @position = @position + 1

    End

    Set @position = 1;--add the two character strings from string 2 to the tally table

    While @position < @length2

    Begin

    Insert Into @tally (sequence, times1, times2) Values(Substring(@string2, @position, 2),0,1);

    Set @position = @position + 1

    End

    Insert Into @intersect--count how many times each sequence overlaps the other

    Select sequence, case when sum(times1) < sum(times2) then sum(times1) else sum(times2) end as overlap

    From @tally

    Group By sequence

    Select @inCommon = Sum(overlap) From @intersect; --the total overlap of two character sequences

    Set @percentSimilar = 1.000000000 * @inCommon / @minSequences;

    End

    Else

    Set @percentSimilar = 0;

    RETURN @percentSimilar;

    END