Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase «««12345»»»

Roll Your Own Fuzzy Match / Grouping (Jaro Winkler) - T-SQL Expand / Collapse
Author
Message
Posted Friday, June 19, 2009 5:42 PM
SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: Wednesday, January 29, 2014 8:13 PM
Points: 194, Visits: 1,096
OK,
Address matching isn't quite exactly related to the topic, but of course it's related to name matching somehow. Because of the scale of the address matching topic I will not discuss it on this BLOG and rather have a little bit discussion on name match.
Regarding matching on initial, it's quite useful when matching on household level; when you have father/son relationship, when they share the same first name.
For each name part we can have the following cases if we using fuzzy matching:
1. Exact match
2. Nick name match
3. Fuzzy match
4. Match on initial
5 No match
all cases are listed according their matching weight, actually you can have it differently;for surname we don't do match on first letter(case 4) and nickname(case2); also there could be cases when surname could change, in case of marriage.
Each name part itself has matching weight; for example match on surname has higher weight than match on first or second name, not sure about to compare weight of first name and second name against surname.
That mean we having here a scoring engine with possible tuneup parameters.
That was my approach.
Also a word about cleansing function used by Ira.
I think it's kind of so called matchkey generator and it take me back to the times when no fuzzy matching was used but we were using matchkey to do our match.
For example keeping first 3 letters and removing wovels and doubles.
Match key generation could be another topic of discussion, but I would stop here.
Thank you again for the interesting discussion and useful script.
Post #738712
Posted Sunday, July 12, 2009 6:29 PM


SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Monday, April 14, 2014 1:31 AM
Points: 42, Visits: 289
Update: Based on several excellent forum contribution’s I asked Michael Capes(friend with big brain) to review them and provide feedback. The result is a revision to the original script/article. The following is the result of his review and the code for for the article has been updated. The code in the article for the JaroWinkler algothrim has has been updated and new code is in the zip file TestJaroWinkler.ZIP an will be available soon. Any question please let me know.



I believe the "expected" results that the forum poster gave are incorrect. I was looking through the pdf document from Winkler, that I believe the poster took his test set. The paper is titled: "AN APPLICATION OF THE FELLEGI-SUNTER MODEL OF RECORD LINKAGE TO THE 1990 U.S. DECENNIAL CENSUS" There's a table in that paper which matches the poster's comments. I don't beleive this table was produced with Jaro-Winkler. Instead, it looks like it was produced by another algorithm.

Furthermore, I got the C# SimMetrics utility to run this morning. It's results also matched the earlier results from my Oracle and T-SQL test.

I believe the revised T-SQL is correctly implementing the Jaro-Winkler algorthim. I've verified it against two other implementations, and all three results match.


Ira



Ira Warren Whiteside
Post #751785
Posted Monday, July 13, 2009 12:53 PM


SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Monday, April 14, 2014 1:31 AM
Points: 42, Visits: 289
I agree but keep in mind many folks still rely on tsql. I have used SSIS extensively, but have found it has a bit of a learning curve. I wanted to remove the "Black Box"aura surronding Fuzzy Matching

Ira Warren Whiteside
Post #752293
Posted Monday, July 13, 2009 12:54 PM


SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Monday, April 14, 2014 1:31 AM
Points: 42, Visits: 289
Martin,

I have updated the article and code.

Thanks
Ira


Ira Warren Whiteside
Post #752294
Posted Thursday, July 29, 2010 8:02 AM
SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: Monday, March 31, 2014 8:17 AM
Points: 110, Visits: 182
I got jaro-winkler functionality working by exposing a C# class via CLR- any thoughts on how this might compare speed/efficiency wise? I figured C# would be a more efficient / speedy string handler?

Interested parties can PM me for the code if they want.

Regards,
Mike
Post #960731
Posted Monday, November 15, 2010 4:37 PM
SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Wednesday, June 11, 2014 3:02 PM
Points: 39, Visits: 382
Hi Guys Can you please tell how can we achieve Fuzzy Grouping with this.
I am able to get fuzzy lookup but no fuzzy goruping.
Can you please give me the query.

Thanks in advance.
Post #1021153
Posted Monday, November 15, 2010 4:42 PM
SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Wednesday, June 11, 2014 3:02 PM
Points: 39, Visits: 382
Hi Guys Can you please tell how can we achieve Fuzzy Grouping with this.
I am able to get fuzzy lookup but no fuzzy goruping.
Can you please give me the query.

Thanks in advance.
Post #1021154
Posted Tuesday, November 16, 2010 9:23 AM
SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Wednesday, June 11, 2014 3:02 PM
Points: 39, Visits: 382
Hi All I am try to do a fuzzy grouping
with the example given I am not able to do it please let me know if I am missing anything.
Post #1021578
Posted Tuesday, November 16, 2010 9:35 AM


SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Monday, April 14, 2014 1:31 AM
Points: 42, Visits: 289
Hi can you provide more detail?

Ira Warren Whiteside
Post #1021596
Posted Wednesday, November 17, 2010 10:37 AM
SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Wednesday, June 11, 2014 3:02 PM
Points: 39, Visits: 382
Hi ,

I have set of records as shown below.

ID String
1 help
2 HELP
3 hel
4 he
5 helipad
6 Helpad
7 vice
8 VICEadmin
9 vice admin
10 Vice President

I wanted to aggregate the count of common strings based on the similarity.
My end should look like

COUNTS STRING
2 help
1 hel
1 he
2 helipad
1 vice
2 Vice admin
1 Vice President

Please let me know thanks in advance.
Post #1022299
« Prev Topic | Next Topic »

Add to briefcase «««12345»»»

Permissions Expand / Collapse