SQL Clone
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


Roll Your Own Fuzzy Match / Grouping (Jaro Winkler) - T-SQL


Roll Your Own Fuzzy Match / Grouping (Jaro Winkler) - T-SQL

Author
Message
Nadrek
Nadrek
SSCommitted
SSCommitted (2K reputation)SSCommitted (2K reputation)SSCommitted (2K reputation)SSCommitted (2K reputation)SSCommitted (2K reputation)SSCommitted (2K reputation)SSCommitted (2K reputation)SSCommitted (2K reputation)

Group: General Forum Members
Points: 1975 Visits: 2729
Adam Sottosanti (2/3/2011)


Agreed. I try to avoid the cartesian joins during matching as much as I can, and generally require certain demographics to match (inner join on DOB + Gender instead of cross join) to limit the result set, then also limit the amount I run through at a time. I may miss a handful of records with this approach, but they are easy to identify, and since the matching is fuzzy to begin with, you are going to need a clean up process anyway.



Have you considered building a (large) precomputed Jaro-Winkler lookup table, and indexing it? Perhaps run your precise matching first, and once you're at the Jaro-Winkler stage, run against the lookup table first, and then work on those values that weren't in the table (and add them to the table)?
Adam Sottosanti
Adam Sottosanti
SSC-Enthusiastic
SSC-Enthusiastic (152 reputation)SSC-Enthusiastic (152 reputation)SSC-Enthusiastic (152 reputation)SSC-Enthusiastic (152 reputation)SSC-Enthusiastic (152 reputation)SSC-Enthusiastic (152 reputation)SSC-Enthusiastic (152 reputation)SSC-Enthusiastic (152 reputation)

Group: General Forum Members
Points: 152 Visits: 404
Yep, for the most part that is exactly what I'm dong for the fuzzy matching.

Adam Sottosanti
kishorreddy.yuva
kishorreddy.yuva
Grasshopper
Grasshopper (14 reputation)Grasshopper (14 reputation)Grasshopper (14 reputation)Grasshopper (14 reputation)Grasshopper (14 reputation)Grasshopper (14 reputation)Grasshopper (14 reputation)Grasshopper (14 reputation)

Group: General Forum Members
Points: 14 Visits: 95
Hi I AM TRYING TO USE THIS JARO-WINKLER ALGORITHM,

I CANT CREATE A STORED PROCEDURE WITH THE SCALAR FUNCTIONS HERE,
PLEASE PROVIDE ME THE CODE TO CREATE A STORED PROCEDURE

THANKS IN ADVANCE
SQLRNNR
SQLRNNR
SSC-Dedicated
SSC-Dedicated (33K reputation)SSC-Dedicated (33K reputation)SSC-Dedicated (33K reputation)SSC-Dedicated (33K reputation)SSC-Dedicated (33K reputation)SSC-Dedicated (33K reputation)SSC-Dedicated (33K reputation)SSC-Dedicated (33K reputation)

Group: General Forum Members
Points: 33678 Visits: 18560
kishorreddy.yuva (12/29/2013)
Hi I AM TRYING TO USE THIS JARO-WINKLER ALGORITHM,

I CANT CREATE A STORED PROCEDURE WITH THE SCALAR FUNCTIONS HERE,
PLEASE PROVIDE ME THE CODE TO CREATE A STORED PROCEDURE

THANKS IN ADVANCE


Code to create a CLR function has been included as well as a TSQL function. Just call either of the functions from your stored proc.



Jason AKA CirqueDeSQLeil
I have given a name to my pain...
MCM SQL Server, MVP


SQL RNNR

Posting Performance Based Questions - Gail Shaw

kishorreddy.yuva
kishorreddy.yuva
Grasshopper
Grasshopper (14 reputation)Grasshopper (14 reputation)Grasshopper (14 reputation)Grasshopper (14 reputation)Grasshopper (14 reputation)Grasshopper (14 reputation)Grasshopper (14 reputation)Grasshopper (14 reputation)

Group: General Forum Members
Points: 14 Visits: 95
Hi Jason,

I just copy the code from here and try to execute in my machine, But when i check these all are function not stored procedure.

Please provide me the code either for CLR function or to convert all this functions into stored procedure
SQLRNNR
SQLRNNR
SSC-Dedicated
SSC-Dedicated (33K reputation)SSC-Dedicated (33K reputation)SSC-Dedicated (33K reputation)SSC-Dedicated (33K reputation)SSC-Dedicated (33K reputation)SSC-Dedicated (33K reputation)SSC-Dedicated (33K reputation)SSC-Dedicated (33K reputation)

Group: General Forum Members
Points: 33678 Visits: 18560
kishorreddy.yuva (12/29/2013)
Hi Jason,

I just copy the code from here and try to execute in my machine, But when i check these all are function not stored procedure.

Please provide me the code either for CLR function or to convert all this functions into stored procedure


If you look through this thread you will see an example of how to call the function from a stored procedure.



Jason AKA CirqueDeSQLeil
I have given a name to my pain...
MCM SQL Server, MVP


SQL RNNR

Posting Performance Based Questions - Gail Shaw

Nadrek
Nadrek
SSCommitted
SSCommitted (2K reputation)SSCommitted (2K reputation)SSCommitted (2K reputation)SSCommitted (2K reputation)SSCommitted (2K reputation)SSCommitted (2K reputation)SSCommitted (2K reputation)SSCommitted (2K reputation)

Group: General Forum Members
Points: 1975 Visits: 2729
I've renamed the pure SQL functions somewhat, but here's an example of how I would call them from a stored procedure:


DECLARE @result FLOAT
SET @result = CalculateJaroWinkler('Bob','Bill')
PRINT @result
SET @result = CalculateJaroWinkler('Bob','Bub') --0.7999
PRINT @result
SET @result = CalculateJaroWinkler('Barnes','Banres') --0.9555
PRINT @result



Alternately, for a very primitive table compare example:

DECLARE @JaroThreshold FLOAT
SET @JaroThreshold = 0.9

SELECT top 500 LEFT(MyWord,2) AS LeftTwo, MyWord AS Word
INTO #tempA
FROM MyListOfWords
WHERE MyWord LIKE '__a%'
ORDER BY MyWord

ALTER TABLE #tempA ADD CONSTRAINT [PK_tempA] PRIMARY KEY CLUSTERED
(
[Word] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = ON, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 100) ON [PRIMARY]
GO

SELECT top 500 LEFT(MyWord,2) AS LeftTwo, MyWord AS word
INTO #tempB
FROM MyListOfWords
WHERE MyWord LIKE '__a%'
ORDER BY MyWord

ALTER TABLE #tempB ADD CONSTRAINT [PK_tempB] PRIMARY KEY CLUSTERED
(
[Word] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = ON, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 100) ON [PRIMARY]
GO

SELECT a.word, b.word, CalculateJaroWinkler(a.word,b.word)
FROM #tempA a
INNER JOIN #tempB b
ON b.LeftTwo = a.LeftTwo
AND b.Word <> a.Word
WHERE CalculateJaroWinkler(a.word,b.word) > @JaroThreshold

DROP TABLE #tempA
DROP TABLE #tempB


juanabreu
juanabreu
Forum Newbie
Forum Newbie (6 reputation)Forum Newbie (6 reputation)Forum Newbie (6 reputation)Forum Newbie (6 reputation)Forum Newbie (6 reputation)Forum Newbie (6 reputation)Forum Newbie (6 reputation)Forum Newbie (6 reputation)

Group: General Forum Members
Points: 6 Visits: 7
I'm hoping someone is still looking at this post. I have implemented the algorithm and it is working just fine. However, it performs quite horribly with my table with 60k rows. Is there anything I can do to speed things up? my table has basically first name and last name and that't it.

running a query as such

select [dbo].[JaroWinkler]('smith', LAST_NAME ) from LEIE could take up to 7 minutes.

Any Idea would be most appreciative.

Thanks!
mike.renwick-894639
mike.renwick-894639
SSC-Enthusiastic
SSC-Enthusiastic (140 reputation)SSC-Enthusiastic (140 reputation)SSC-Enthusiastic (140 reputation)SSC-Enthusiastic (140 reputation)SSC-Enthusiastic (140 reputation)SSC-Enthusiastic (140 reputation)SSC-Enthusiastic (140 reputation)SSC-Enthusiastic (140 reputation)

Group: General Forum Members
Points: 140 Visits: 190
I'd try the CLR approach.

Also it sounds like you could match directly rather than via fuzzy, and perhaps only run jaro winkler on an unmatched subset?

When I use this in exercises, it's usually the last type of matching I attempt having exhausted other identifiers or direct matching approaches.
juanabreu
juanabreu
Forum Newbie
Forum Newbie (6 reputation)Forum Newbie (6 reputation)Forum Newbie (6 reputation)Forum Newbie (6 reputation)Forum Newbie (6 reputation)Forum Newbie (6 reputation)Forum Newbie (6 reputation)Forum Newbie (6 reputation)

Group: General Forum Members
Points: 6 Visits: 7
Thanks SSC, I'm trying to use this with a web application, which would query a DB and get potential matches. I was hoping to have an option for a fuzzy search based on last name , first name combo or one or the other.
I'm quite new to the CLR thing, do you have any resources docs on how that would work? thanks for replying, or on your suggestion with the subset how would you go about doing a subset based on a last name search with a table that has 60k rows.

Thanks again!!!
Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum

































































































































































SQLServerCentral


Search