Roll Your Own Fuzzy Match / Grouping (Jaro Winkler) - T-SQL

  • This is a great article, thank you so much for posting. I am trying to implement a similar solution in my organization, and was doing some testing on your function. I found some odd results when comparing strings that use the same letters. I am wondering if these results should be expected in these cases?

    DECLARE @string1 VARCHAR(100), @string2 VARCHAR(100)
    SET @string1 = 'sottosanti'
    SET @string2 = 'ottositnas'
    PRINT dbo.JaroWinkler(@string1, @string2)
    GO

    DECLARE @string1 VARCHAR(100), @string2 VARCHAR(100)
    SET @string1 = 'margin'
    SET @string2 = 'arming'
    PRINT dbo.JaroWinkler(@string1, @string2)

    Thanks,

    Adam Sottosanti

  • Hi Adam,

    I suggest you try a string-matching function I developed and have been using for years. It compares two strings based on the sequence of letters they contain.

    if exists (select 1 from dbo.sysobjects where id = object_id(N'[dbo].[fn_CompareThePair]'))
        drop function [dbo].[fn_CompareThePair]
    GO
    --
    create function dbo.fn_CompareThePair
        (@String1 varchar(100), @String2 varchar(100), @FirstIn bit = 1, @LastIn bit = 1)
    returns integer
    /*************************************************************************************************
    Function CompareThePair

    This function accepts two string values and returns an integer value between
    zero and one hundred indicating the similarity between the two strings. The
    algorithm was developed specifically to search for spelling variations of
    names and addresses. For this purpose, it is superior to SOUNDEX, which
    searches only for similar-sounding words. SOUNDEX fails when dealing with
    non-Latin names (Asian names, for example) or with their interpretations.

    Usage: select dbo.fn_CompareThePair('margin', 'arming', 1, 1)

    Input Parameters
        @String1 first string
        @String2 second string
        @FirstIn option to include a comparison of the first character
        @LastIn  option to include a comparison of the last character

    -- History
    * Developed by marat 21/04/2008, based on the string-matching algorithm developed in 2003 and used for address and name matching.
    *
    ****************************************************************************************************/
    begin
        declare @Target integer
        declare @Hits integer
        declare @ExHits integer
        declare @Pos integer
        declare @index integer
        declare @Cursor integer
        declare @Keep integer
        --
        set @Hits = 0
        set @ExHits = 0
        --
        if @FirstIn > 0
        begin
            if left(@String1, 1) = left(@String2, 1) set @ExHits = @ExHits + 1
        end
        if @LastIn > 0
        begin
            if right(@String1, 1) = right(@String2, 1) set @ExHits = @ExHits + 1
        end
        set @Target =
            case when len(@String1) >= len(@String2) then len(@String1) - 1 + @FirstIn + @LastIn
                 else len(@String2) - 1 + @FirstIn + @LastIn
            end
        -- Run I: count character pairs of @String1 found, in order, in @String2
        set @Pos = 1
        set @Cursor = 1
        while @Pos < len(@String1)
        begin
            set @index = charindex(substring(@String1, @Pos, 2), @String2, @Cursor)
            if @index > 0
            begin
                set @Cursor = @index + 1
                set @Hits = @Hits + 1
            end
            set @Pos = @Pos + 1
        end
        set @Keep = @Hits -- keep first result
        -- Run II: repeat in the other direction
        set @Pos = 1
        set @Cursor = 1
        set @Hits = 0
        while @Pos < len(@String2)
        begin
            set @index = charindex(substring(@String2, @Pos, 2), @String1, @Cursor)
            if @index > 0
            begin
                set @Cursor = @index + 1
                set @Hits = @Hits + 1
            end
            set @Pos = @Pos + 1
        end
        finish:
        if @Keep > @Hits set @Hits = @Keep
        return 100 * (@Hits + @ExHits) / @Target
    end
    GO
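
    For reference, a quick way to run this against the two anagram pairs from the earlier post (with both the first- and last-character checks enabled):

    -- Example calls using the test strings quoted above;
    -- the function returns an integer similarity score from 0 to 100.
    select dbo.fn_CompareThePair('sottosanti', 'ottositnas', 1, 1) as Pair1,
           dbo.fn_CompareThePair('margin', 'arming', 1, 1)         as Pair2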

  • Thank you for the code, Anatoliy. I'll play around with this. It seems like a good approach might be to blend this with the Jaro-Winkler function.

    Adam

    Adam Sottosanti

  • Edit: I guess it also helps to read the entire article and discussion before asking questions, since doing so makes it abundantly clear I was using the wrong function to begin with! Whoops!

    Adam Sottosanti

  • Just wanted to say thanks very much!

    I wanted to use Jaro-Winkler matching of names and email addresses across two different db tables to filter out the bogus entries. This works perfectly: several hundred rows matched on two Jaro-Winkler scores and their average in less than 2 minutes (a sketch of that kind of query follows this post).

    We might try this on addresses next, but I'm less confident of the results. I've tried something similar before (not in SQL) and the data is just too messy to get reliable results.

    Thanks again!!!
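
    A minimal sketch of that kind of query, assuming the article's dbo.JaroWinkler T-SQL function and hypothetical table and column names (dbo.Contacts, dbo.Signups, Name, Email); the 0.85 threshold and the LEFT() blocking predicate are only illustrative:

    -- Score name and email separately and filter on the average of the two scores;
    -- the LEFT() predicate is a cheap blocking condition to avoid a full cross join.
    SELECT  c.Name  AS ContactName,  s.Name  AS SignupName,
            c.Email AS ContactEmail, s.Email AS SignupEmail,
            dbo.JaroWinkler(c.Name,  s.Name)  AS NameScore,
            dbo.JaroWinkler(c.Email, s.Email) AS EmailScore
    FROM    dbo.Contacts c
            INNER JOIN dbo.Signups s
                ON LEFT(c.Name, 2) = LEFT(s.Name, 2)
    WHERE   (dbo.JaroWinkler(c.Name, s.Name) + dbo.JaroWinkler(c.Email, s.Email)) / 2.0 >= 0.85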

  • Here's my C# SQL CLR implementation - just deploy it to a SQL Server instance (a deployment sketch follows at the end of this post).

    In my opinion, for anything involving significant string manipulation, a CLR language is vastly superior to T-SQL almost every time (as well as being a lot more elegant).

    The JaroWinkler algorithm was ported from the Java LingPipe version, so I'd advise checking the output before committing to production.

    Also, it's very fast: I was processing and sorting 36k rows on JaroWinkler scores in under a second.

    JaroWinkler.cs

    using System;
    using System.Collections.Generic;
    using System.Text;

    namespace JaroWinkler
    {
        class JaroWinklerDistance
        {
            private double mWeightThreshold;
            private double mNumChars;

            /**
             * Construct a basic Jaro string distance without the Winkler
             * modifications. See the class documentation above for more
             * information on the exact algorithm and its parameters.
             */
            public JaroWinklerDistance()
            {
                mNumChars = Double.PositiveInfinity;
                mWeightThreshold = 0;
            }

            /**
             * Construct a Winkler-modified Jaro string distance with the
             * specified weight threshold for refinement and an initial number
             * of characters over which to reweight. See the class
             * documentation above for more information on the exact algorithm
             * and its parameters.
             */
            public JaroWinklerDistance(double weightThreshold, int numChars)
            {
                mNumChars = numChars;
                mWeightThreshold = weightThreshold;
            }

            /**
             * Returns the Jaro-Winkler distance between the specified character
             * sequences. The distance is symmetric and will fall in the
             * range 0 (perfect match) to 1 (no overlap).
             *
             * This method is defined to be:
             *   distance(cSeq1, cSeq2) = 1 - proximity(cSeq1, cSeq2)
             *
             * @param cSeq1 First character sequence to compare.
             * @param cSeq2 Second character sequence to compare.
             * @return The Jaro-Winkler distance value for the two character sequences.
             */
            public double distance(String cSeq1, String cSeq2)
            {
                return 1.0 - proximity(cSeq1, cSeq2);
            }

            /**
             * Return the Jaro-Winkler comparison value between the specified
             * character sequences. The comparison is symmetric and will fall
             * in the range 0 (no match) to 1 (perfect match) inclusive.
             *
             * The method distance(cSeq1, cSeq2) returns a distance measure
             * that is one minus the comparison value.
             *
             * @param cSeq1 First character sequence to compare.
             * @param cSeq2 Second character sequence to compare.
             * @return The Jaro-Winkler comparison value for the two character sequences.
             */
            public double proximity(String cSeq1, String cSeq2)
            {
                int len1 = cSeq1.Length;
                int len2 = cSeq2.Length;
                if (len1 == 0)
                    return len2 == 0 ? 1.0 : 0.0;

                int searchRange = Math.Max(0, Math.Max(len1, len2) / 2 - 1);

                bool[] matched1 = new bool[len1];   // C# bool arrays default to all false
                bool[] matched2 = new bool[len2];

                // Count characters common to both strings within the search range.
                int numCommon = 0;
                for (int i = 0; i < len1; ++i)
                {
                    int start = Math.Max(0, i - searchRange);
                    int end = Math.Min(i + searchRange + 1, len2);
                    for (int j = start; j < end; ++j)
                    {
                        if (matched2[j]) continue;
                        if (cSeq1[i] != cSeq2[j]) continue;
                        matched1[i] = true;
                        matched2[j] = true;
                        ++numCommon;
                        break;
                    }
                }
                if (numCommon == 0) return 0.0;

                // Count transpositions among the matched characters.
                int numHalfTransposed = 0;
                int k = 0;
                for (int i = 0; i < len1; ++i)
                {
                    if (!matched1[i]) continue;
                    while (!matched2[k]) { ++k; }
                    if (cSeq1[i] != cSeq2[k])
                        ++numHalfTransposed;
                    ++k;
                }
                int numTransposed = numHalfTransposed / 2;

                double numCommonD = numCommon;
                double weight = (numCommonD / len1
                                 + numCommonD / len2
                                 + (numCommon - numTransposed) / numCommonD) / 3.0;

                if (weight <= mWeightThreshold) return weight;

                // Winkler modification: boost the score for a common prefix.
                double max = Math.Min(mNumChars, Math.Min(cSeq1.Length, cSeq2.Length));
                int pos = 0;
                while (pos < max && cSeq1[pos] == cSeq2[pos])
                    ++pos;
                if (pos == 0) return weight;
                return weight + 0.1 * pos * (1.0 - weight);
            }

            /**
             * A constant for the Jaro distance. The value is the same as would
             * be returned by the nullary constructor JaroWinklerDistance().
             * Instances are thread safe, so this single distance instance
             * may be used for all comparisons within an application.
             */
            //public static JaroWinklerDistance JARO_DISTANCE = new JaroWinklerDistance();

            /**
             * A constant for the Jaro-Winkler distance with defaults set as
             * in Winkler's papers. The value is the same as would be returned
             * by the constructor JaroWinklerDistance(0.70, 4).
             * Instances are thread safe, so this single distance instance
             * may be used for all comparisons within an application.
             */
            //public static JaroWinklerDistance JARO_WINKLER_DISTANCE = new JaroWinklerDistance(0.70, 4);
        }
    }

    JaroWinklerWrapper.cs

    using System;
    using System.Data;
    using System.Data.SqlClient;
    using System.Data.SqlTypes;
    using Microsoft.SqlServer.Server;
    using JaroWinkler;

    public partial class UserDefinedFunctions
    {
        [Microsoft.SqlServer.Server.SqlFunction]
        public static SqlDouble JaroWinklerDistance(string s, string t)
        {
            JaroWinklerDistance a = new JaroWinklerDistance();
            return a.distance(s, t);
        }

        [Microsoft.SqlServer.Server.SqlFunction]
        public static SqlDouble JaroWinklerProximity(string s, string t)
        {
            JaroWinklerDistance a = new JaroWinklerDistance();
            return a.proximity(s, t);
        }
    }

    Then call it from your code, something like this:

    select
        Name,
        'TestString',
        dbo.JaroWinklerDistance('TestString', Name)
    from Names
    where Name is not null
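
    A deployment sketch, assuming the two .cs files above are compiled into an assembly at C:\CLR\JaroWinkler.dll (the path, parameter types, and return type are illustrative) and that you are allowed to enable CLR integration on the instance:

    -- Enable CLR integration (instance-level setting).
    EXEC sp_configure 'clr enabled', 1;
    RECONFIGURE;
    GO

    -- Register the assembly and expose the two static methods as scalar functions.
    CREATE ASSEMBLY JaroWinkler
    FROM 'C:\CLR\JaroWinkler.dll'
    WITH PERMISSION_SET = SAFE;
    GO

    CREATE FUNCTION dbo.JaroWinklerDistance (@s NVARCHAR(4000), @t NVARCHAR(4000))
    RETURNS FLOAT
    AS EXTERNAL NAME JaroWinkler.UserDefinedFunctions.JaroWinklerDistance;
    GO

    CREATE FUNCTION dbo.JaroWinklerProximity (@s NVARCHAR(4000), @t NVARCHAR(4000))
    RETURNS FLOAT
    AS EXTERNAL NAME JaroWinkler.UserDefinedFunctions.JaroWinklerProximity;
    GO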

  • For addresses, I'd recommend first using nested REPLACE() calls (SQL Server 2000, at least, will easily go 100 or 120 deep) to standardize abbreviations; a small sketch follows at the end of this post. You'll have to look at your own data to include common misspellings, as well.

    Personally, I also standardize on the singular abbreviations, on the assumption that a plural vs. singular difference, when most other factors are the same, is more often a typo than two different streets with the same name having the same numeric address on both.

    USPS Standard Abbreviations

    Splitting addresses into street #, street name, and suite can be very... interesting.
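
    A small sketch of the nested-REPLACE idea; the table, column, and handful of abbreviations are only illustrative, and a real list (per the USPS standard abbreviations) would be much longer:

    -- Standardize a few common street-type abbreviations in one pass of nested REPLACEs.
    SELECT  AddressLine1,
            REPLACE(REPLACE(REPLACE(REPLACE(UPPER(AddressLine1),
                ' STREET', ' ST'),
                ' AVENUE', ' AVE'),
                ' ROAD', ' RD'),
                ' APARTMENT', ' APT') AS StandardizedAddress
    FROM    dbo.Addresses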

  • mike.renwick (2/2/2011)


    Here's my C# SQL CLR implementation- just deploy to an SQL server.

    In my opinion, for anything involving significant string manipulation, a CLR language is vastly superior to SQL server almost every time (as well as being a lot more elegant).

    The JaroWinkler algorithm was ported from the java lingpipe version, so I'd advise checking the output before committing to production.

    ...

    Sweet, thanks Mike! If I ever get some free time, maybe I'll run some side by side compares. I run new records through multiple "rounds" of matching against a lookup table which is at 250K records and growing, using T-SQL, and it keeps getting slower... and slower... and slower...

    Adam Sottosanti

  • Adam Sottosanti (2/3/2011)


    Sweet, thanks Mike! If I ever get some free time, maybe I'll run some side by side compares. I run new records through multiple "rounds" of matching against a lookup table which is at 250K records and growing, using T-SQL, and it keeps getting slower... and slower... and slower...

    I've found (heavily) indexed #temp tables to help significantly with speed on larger datasets. Calculate your double metaphone and cleaned up variants once, and then do Jaro-Winkler based on a WHERE clause (first two characters match, first two characters of double metaphone match, etc.). It's naturally a cartesian [O(n^2)] operation, which does get very slow, very fast, so to speak.
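
    A minimal sketch of that approach: dbo.DoubleMetaphone is a hypothetical helper, dbo.JaroWinkler is the article's T-SQL function, and the table and column names are made up.

    -- Precompute the expensive pieces once into indexed temp tables...
    SELECT  RecordId, Name, dbo.DoubleMetaphone(Name) AS NameDm
    INTO    #new
    FROM    dbo.NewRecords;
    CREATE INDEX IX_new ON #new (NameDm, Name);

    SELECT  RecordId, Name, dbo.DoubleMetaphone(Name) AS NameDm
    INTO    #lookup
    FROM    dbo.LookupTable;
    CREATE INDEX IX_lookup ON #lookup (NameDm, Name);

    -- ...then only score candidates whose first two characters and metaphone prefix agree.
    SELECT  n.RecordId AS NewId, l.RecordId AS LookupId,
            dbo.JaroWinkler(n.Name, l.Name) AS Score
    FROM    #new n
            INNER JOIN #lookup l
                ON  LEFT(n.Name, 2)   = LEFT(l.Name, 2)
                AND LEFT(n.NameDm, 2) = LEFT(l.NameDm, 2)
    WHERE   dbo.JaroWinkler(n.Name, l.Name) >= 0.85;   -- illustrative threshold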

  • I've found (heavily) indexed #temp tables to help significantly with speed on larger datasets. Calculate your double metaphone and cleaned up variants once, and then do Jaro-Winkler based on a WHERE clause (first two characters match, first two characters of double metaphone match, etc.). It's naturally a cartesian [O(n^2)] operation, which does get very slow, very fast, so to speak.

    Agreed. I try to avoid the cartesian joins during matching as much as I can, and generally require certain demographics to match (inner join on DOB + Gender instead of cross join) to limit the result set, then also limit the amount I run through at a time. I may miss a handful of records with this approach, but they are easy to identify, and since the matching is fuzzy to begin with, you are going to need a clean up process anyway.
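
    A sketch of that kind of demographic pre-filter; all table and column names (dbo.NewPatients, dbo.MasterPatients, DOB, Gender, LastName) are hypothetical, and dbo.JaroWinkler is the article's function:

    -- Equi-join on demographics instead of a cross join, then fuzzy-match names
    -- only within each (DOB, Gender) block.
    SELECT  n.PatientId AS NewId, m.PatientId AS MasterId,
            dbo.JaroWinkler(n.LastName, m.LastName) AS LastNameScore
    FROM    dbo.NewPatients n
            INNER JOIN dbo.MasterPatients m
                ON  n.DOB    = m.DOB
                AND n.Gender = m.Gender
    WHERE   dbo.JaroWinkler(n.LastName, m.LastName) >= 0.9;   -- illustrative threshold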

    Adam Sottosanti

  • Adam Sottosanti (2/3/2011)


    Agreed. I try to avoid the cartesian joins during matching as much as I can, and generally require certain demographics to match (inner join on DOB + Gender instead of cross join) to limit the result set, then also limit the amount I run through at a time. I may miss a handful of records with this approach, but they are easy to identify, and since the matching is fuzzy to begin with, you are going to need a clean up process anyway.

    Have you considered building a (large) precomputed Jaro-Winkler lookup table, and indexing it? Perhaps run your precise matching first, and once you're at the Jaro-Winkler stage, run against the lookup table first, and then work on those values that weren't in the table (and add them to the table)?
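
    A sketch of that caching pattern; dbo.JWCache and the #pairs table of candidate pairs are hypothetical, and dbo.JaroWinkler is the article's function:

    -- One-time setup: an indexed lookup table of previously computed scores.
    CREATE TABLE dbo.JWCache
    (
        String1 VARCHAR(100) NOT NULL,
        String2 VARCHAR(100) NOT NULL,
        Score   FLOAT        NOT NULL,
        CONSTRAINT PK_JWCache PRIMARY KEY (String1, String2)
    );

    -- At matching time: compute and store only the pairs not already cached...
    INSERT INTO dbo.JWCache (String1, String2, Score)
    SELECT  p.String1, p.String2, dbo.JaroWinkler(p.String1, p.String2)
    FROM    #pairs p    -- candidate pairs left over after the precise-matching passes
    WHERE   NOT EXISTS (SELECT 1 FROM dbo.JWCache c
                        WHERE c.String1 = p.String1 AND c.String2 = p.String2);

    -- ...then read every score from the indexed cache instead of recomputing it.
    SELECT  p.String1, p.String2, c.Score
    FROM    #pairs p
            INNER JOIN dbo.JWCache c
                ON c.String1 = p.String1 AND c.String2 = p.String2;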

  • Yep, for the most part that is exactly what I'm doing for the fuzzy matching.

    Adam Sottosanti

  • Hi, I am trying to use this Jaro-Winkler algorithm.

    I can't create a stored procedure with the scalar functions here.

    Please provide me the code to create a stored procedure.

    Thanks in advance.

  • kishorreddy.yuva (12/29/2013)


    Hi, I am trying to use this Jaro-Winkler algorithm.

    I can't create a stored procedure with the scalar functions here.

    Please provide me the code to create a stored procedure.

    Thanks in advance.

    Code to create a CLR function has been included, as well as a T-SQL function. Just call either of the functions from your stored proc; a minimal example follows below.
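
    For example, a minimal stored procedure wrapping the article's T-SQL function (the CLR dbo.JaroWinklerDistance could be called the same way); the procedure name is only illustrative:

    CREATE PROCEDURE dbo.usp_CompareStrings
        @String1 VARCHAR(100),
        @String2 VARCHAR(100)
    AS
    BEGIN
        SET NOCOUNT ON;
        -- Return the similarity score computed by the scalar function.
        SELECT dbo.JaroWinkler(@String1, @String2) AS Score;
    END
    GO

    -- Usage:
    EXEC dbo.usp_CompareStrings 'margin', 'arming';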

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events

  • Hi Jason,

    I just copied the code from here and tried to execute it on my machine, but when I check, these are all functions, not stored procedures.

    Please provide the code either for the CLR function or to convert all of these functions into stored procedures.
