Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase «««12345»»

Roll Your Own Fuzzy Match / Grouping (Jaro Winkler) - T-SQL Expand / Collapse
Author
Message
Posted Tuesday, December 28, 2010 12:59 PM


SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: Wednesday, September 3, 2014 3:55 PM
Points: 143, Visits: 360
This is a great article, thank you so much for posting. I am trying to implement a similar solution in my organization, and was doing some testing on your function. I found some odd results when comparing strings that use the same letters. I am wondering if these results should be expected in these cases?


DECLARE @string1 VARCHAR(100), @string2 VARCHAR(100)

SET @string1 = 'sottosanti'
SET @string2 = 'ottositnas'

PRINT dbo.JaroWinkler(@string1, @string2)

GO

DECLARE @string1 VARCHAR(100), @string2 VARCHAR(100)

SET @string1 = 'margin'
SET @string2 = 'arming'

PRINT dbo.JaroWinkler(@string1, @string2)


Thanks,
Adam Sottosanti


Adam Sottosanti
Post #1039963
Posted Tuesday, December 28, 2010 3:32 PM
SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: Wednesday, January 29, 2014 8:13 PM
Points: 194, Visits: 1,096
Hi Adam,
I suggest you to try string matching function I developed and was using for years. This function compares two strings based on sequence of letters in the strings.


if exists (select 1 from dbo.sysobjects where id = object_id(N'[dbo].[fn_CompareThePair]'))
drop function [dbo].[fn_CompareThePair]
GO
--
create function dbo.fn_CompareThePair
(@String1 varchar (100), @String2 varchar (100), @FirstIn bit = 1, @LastIn bit = 1)
returns integer
/*************************************************************************************************
Function CompareThePair
This function accepts two string values and returns an integer value between
zero and one hundred indicating the similarity between the two string. This
algorithm was developed specifically to search for the spelling
variations of the names and addresses. For this purpose,
it is superior to SOUNDEX, which searches only for similar sounding words.
SOUNDEX fails when dealing with not Latin names(asian names as an example)
or with their interpretations.
Usage: select dbo.fn_CompareThePair('margin', 'arming', 1, 1)
Input Parameters
@String1 first string
@String2 second string
@FirstIn option to include comparison on the first character
@LastIn option to include comparison on the last character
--History
* Developed by marat 21/04/2008 based on the string matching algorithm developed in 2003 and used for address and name matching.
*
****************************************************************************************************/
begin

declare @Target integer
declare @Hits integer
declare @ExHits integer
declare @Pos integer
declare @Index integer
declare @Cursor integer
declare @Keep integer

--
set @Hits = 0
set @ExHits = 0
--
if @FirstIn > 0
begin
if left(@string1,1) = left(@string2,1) set @ExHits = @ExHits + 1
end
if @LastIn > 0
begin
if right(@string1,1) = right(@string2,1) set @ExHits = @ExHits + 1
end


set @Target =
case when len(@String1) >= len(@String2) then len(@String1) -1 + @FirstIn + @LastIn
else len(@String2) -1 + @FirstIn + @LastIn
end
--Run I
set @Pos = 1
set @Cursor = 1
while @Pos < len(@String1)
begin
set @Index = charindex(substring(@String1, @Pos, 2), @String2, @Cursor)
if @Index > 0
begin
set @Cursor = @Index + 1
set @Hits = @Hits + 1
end
set @Pos = @Pos + 1
end

set @Keep = @Hits --keep first result
--Run II
set @Pos = 1
set @Cursor = 1
set @Hits = 0
while @Pos < len(@String2)
begin
set @Index = charindex(substring(@String2, @Pos, 2), @String1, @Cursor)
if @Index > 0
begin
set @Cursor = @Index + 1
set @Hits = @Hits + 1
end
set @Pos = @Pos + 1
end

finish:
if @Keep > @Hits set @Hits = @Keep
return 100*(@Hits + @ExHits)/@Target
end
GO
Post #1040026
Posted Tuesday, December 28, 2010 4:13 PM


SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: Wednesday, September 3, 2014 3:55 PM
Points: 143, Visits: 360
Thank you for the code Anatoliy. I'll play around with this. It seems like a good approach might be blend this with the Jaro Winkler function.

Adam


Adam Sottosanti
Post #1040039
Posted Wednesday, December 29, 2010 10:24 AM


SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: Wednesday, September 3, 2014 3:55 PM
Points: 143, Visits: 360
Edit: I guess it also helps to read the entire article and discussion before asking questions, since doing so makes it abundantly clear I was using the wrong function to begin with! Whoops!

Adam Sottosanti
Post #1040448
Posted Wednesday, February 2, 2011 10:16 AM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Thursday, May 16, 2013 8:00 AM
Points: 1, Visits: 4
Just wanted to say thanks very much!

I wanted to use Jaro-Winkler matching of names and email addresses across two different db tables to filter out the bogus entries. This works perfectly. Several hundred rows matched on two JW's and an average in less than 2 minutes.

We might try this on addresses next, but I'm less confident of the results. I've tried similar before (not in sql) and the data is just too messy to get reliable results.

Thanks again!!!
Post #1057590
Posted Wednesday, February 2, 2011 10:48 AM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Yesterday @ 6:25 AM
Points: 3, Visits: 89
Here's my C# SQL CLR implementation- just deploy to an SQL server.

In my opinion, for anything involving significant string manipulation, a CLR language is vastly superior to SQL server almost every time (as well as being a lot more elegant).

The JaroWinkler algorithm was ported from the java lingpipe version, so I'd advise checking the output before committing to production.

Also, it's very fast, I was getting 36k rows processing and sorting on JaroWinkler scores in under a second.

JaroWinkler.cs
using System;
using System.Collections.Generic;
using System.Text;

namespace JaroWinkler
{
class JaroWinklerDistance
{

private double mWeightThreshold;
private double mNumChars;

/**
* Construct a basic Jaro string distance without the Winkler
* modifications. See the class documentation above for more information
* on the exact algorithm and its parameters.
*/
public JaroWinklerDistance() {
mNumChars=Double.PositiveInfinity;
mWeightThreshold=0;
}

/**
* Construct a Winkler-modified Jaro string distance with the
* specified weight threshold for refinement and an initial number
* of characters over which to reweight. See the class
* documentation above for more information on the exact algorithm
* and its parameters.
*/
public JaroWinklerDistance(double weightThreshold, int numChars) {
mNumChars = numChars;
mWeightThreshold = weightThreshold;
}

/**
* Returns the Jaro-Winkler distance between the specified character
* sequences. Teh distance is symmetric and will fall in the
* range <code>0</code> (perfect match) to <code>1</code> (no overlap).
* See the class definition above for formal definitions.
*
*

This method is defined to be:
*
* <pre>
* distance(cSeq1,cSeq2) = 1 - proximity(cSeq1,cSeq2)</code></pre>
*
* @param cSeq1 First character sequence to compare.
* @param cSeq2 Second character sequence to compare.
* @return The Jaro-Winkler comparison value for the two character
* sequences.
*/
public double distance(String cSeq1, String cSeq2)
{
return 1.0 - proximity(cSeq1,cSeq2);
}

/**
* Return the Jaro-Winkler comparison value between the specified
* character sequences. The comparison is symmetric and will fall
* in the range <code>0</code> (no match) to <code>1</code>
* (perfect match)inclusive. See the class definition above for
* an exact definition of Jaro-Winkler string comparison.
*
*

The method {@link #distance(CharSequence,CharSequence)} returns
* a distance measure that is one minus the comparison value.
*
* @param cSeq1 First character sequence to compare.
* @param cSeq2 Second character sequence to compare.
* @return The Jaro-Winkler comparison value for the two character
* sequences.
*/
public double proximity(String cSeq1, String cSeq2) {
int len1 = cSeq1.Length;
int len2 = cSeq2.Length;
if (len1 == 0)
return len2 == 0 ? 1.0 : 0.0;

int searchRange = Math.Max(0,Math.Max(len1,len2)/2 - 1);

bool[] matched1 = new bool[len1];
matched1.Initialize();
//Arrays.Fill(matched1,false);
bool[] matched2 = new bool[len2];
matched2.Initialize();
//Arrays.fill(matched2,false);

int numCommon = 0;
for (int i = 0; i < len1; ++i) {
int start = Math.Max(0,i-searchRange);
int end = Math.Min(i+searchRange+1,len2);
for (int j = start; j < end; ++j) {
if (matched2[j]) continue;
if (cSeq1[i] != cSeq2[j])
continue;
matched1[i] = true;
matched2[j] = true;
++numCommon;
break;
}
}
if (numCommon == 0) return 0.0;

int numHalfTransposed = 0;
int k = 0;
for (int i = 0; i < len1; ++i) {
if (!matched1[i]) continue;
while (!matched2[k]) { ++k; }
if (cSeq1[i] != cSeq2[k])
{
++numHalfTransposed;
}
++k;
}
// System.out.println("numHalfTransposed=" + numHalfTransposed);
int numTransposed = numHalfTransposed/2;

// System.out.println("numCommon=" + numCommon
// + " numTransposed=" + numTransposed);
double numCommonD = numCommon;
double weight = (numCommonD/len1
+ numCommonD/len2
+ (numCommon - numTransposed)/numCommonD)/3.0;

if (weight <= mWeightThreshold) return weight;
double max = Math.Min(mNumChars,Math.Min(cSeq1.Length,cSeq2.Length));
int pos = 0;
while (pos < max && cSeq1[pos] == cSeq2[pos])
{
++pos;
}
if (pos == 0) return weight;
return weight + 0.1 * pos * (1.0 - weight);

}

/**
* A constant for the Jaro distance. The value is the same as
* would be returned by the nullary constructor
* <code>JaroWinklerDistance()</code>.
*
*

Instances are thread safe, so this single distance instance
* may be used for all comparisons within an application.
*/
//public static JaroWinklerDistance JARO_DISTANCE = new JaroWinklerDistance();

/**
* A constant for the Jaro-Winkler distance with defaults set as
* in Winkler's papers. The value is the same as would be
* returned by the nullary constructor
* <code>JaroWinklerDistance(0.7,4)</code>.
*
*

Instances are thread safe, so this single distance instance
* may be used for all comparisons within an application.
*/
//public static JaroWinklerDistance JARO_WINKLER_DISTANCE = new JaroWinklerDistance(0.70,4);


}

}

JaroWinklerWrapper.cs
using System;
using System.Data;
using System.Data.SqlClient;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;
using JaroWinkler;

public partial class UserDefinedFunctions
{
[Microsoft.SqlServer.Server.SqlFunction]
public static SqlDouble JaroWinklerDistance(string s, string t)
{
JaroWinklerDistance a = new JaroWinklerDistance();
return a.distance(s, t);
}

[Microsoft.SqlServer.Server.SqlFunction]
public static SqlDouble JaroWinklerProximity(string s, string t)
{
JaroWinklerDistance a = new JaroWinklerDistance();
return a.proximity(s, t);
}
};

Then call it from your code something like this
select 
Name,
'TestString',
dbo.JaroWinklerDistance('TestString',Name)
from
Names
WHERE
Name is not null


Post #1057608
Posted Thursday, February 3, 2011 11:45 AM
SSC Eights!

SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!

Group: General Forum Members
Last Login: Yesterday @ 11:09 AM
Points: 869, Visits: 2,399
For addresses, I'd recommend first using nested REPLACE()'s (SQL Server 2000, at least, will easily go 100 or 120 deep) to standardize abbreviations. You'll have to look at your own data to include common misspellings, as well.

Personally, I also standardize on the singular abbreviations, on the assumption that plural vs. singular when most other factors are the same is more often a typo than two different streets with the same <entity> having the same numeric address on both.

USPS Standard Abbreviations

Splitting addresses into street #, street name, and suite can be very... interesting.
Post #1058271
Posted Thursday, February 3, 2011 12:05 PM


SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: Wednesday, September 3, 2014 3:55 PM
Points: 143, Visits: 360
mike.renwick (2/2/2011)
Here's my C# SQL CLR implementation- just deploy to an SQL server.

In my opinion, for anything involving significant string manipulation, a CLR language is vastly superior to SQL server almost every time (as well as being a lot more elegant).

The JaroWinkler algorithm was ported from the java lingpipe version, so I'd advise checking the output before committing to production.

...



Sweet, thanks Mike! If I ever get some free time, maybe I'll run some side by side compares. I run new records through multiple "rounds" of matching against a lookup table which is at 250K records and growing, using T-SQL, and it keeps getting slower... and slower... and slower...



Adam Sottosanti
Post #1058281
Posted Thursday, February 3, 2011 12:22 PM
SSC Eights!

SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!

Group: General Forum Members
Last Login: Yesterday @ 11:09 AM
Points: 869, Visits: 2,399
Adam Sottosanti (2/3/2011)


Sweet, thanks Mike! If I ever get some free time, maybe I'll run some side by side compares. I run new records through multiple "rounds" of matching against a lookup table which is at 250K records and growing, using T-SQL, and it keeps getting slower... and slower... and slower...



I've found (heavily) indexed #temp tables to help significantly with speed on larger datasets. Calculate your double metaphone and cleaned up variants once, and then do Jaro-Winkler based on a WHERE clause (first two characters match, first two characters of double metaphone match, etc.). It's naturally a cartesian [O(n^2)] operation, which does get very slow, very fast, so to speak.
Post #1058292
Posted Thursday, February 3, 2011 5:01 PM


SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: Wednesday, September 3, 2014 3:55 PM
Points: 143, Visits: 360


I've found (heavily) indexed #temp tables to help significantly with speed on larger datasets. Calculate your double metaphone and cleaned up variants once, and then do Jaro-Winkler based on a WHERE clause (first two characters match, first two characters of double metaphone match, etc.). It's naturally a cartesian [O(n^2)] operation, which does get very slow, very fast, so to speak.


Agreed. I try to avoid the cartesian joins during matching as much as I can, and generally require certain demographics to match (inner join on DOB + Gender instead of cross join) to limit the result set, then also limit the amount I run through at a time. I may miss a handful of records with this approach, but they are easy to identify, and since the matching is fuzzy to begin with, you are going to need a clean up process anyway.



Adam Sottosanti
Post #1058463
« Prev Topic | Next Topic »

Add to briefcase «««12345»»

Permissions Expand / Collapse