Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase «««12345

Roll Your Own Fuzzy Match / Grouping (Jaro Winkler) - T-SQL Expand / Collapse
Author
Message
Posted Thursday, February 3, 2011 5:10 PM
SSC Eights!

SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!

Group: General Forum Members
Last Login: Today @ 8:06 AM
Points: 880, Visits: 2,434
Adam Sottosanti (2/3/2011)


Agreed. I try to avoid the cartesian joins during matching as much as I can, and generally require certain demographics to match (inner join on DOB + Gender instead of cross join) to limit the result set, then also limit the amount I run through at a time. I may miss a handful of records with this approach, but they are easy to identify, and since the matching is fuzzy to begin with, you are going to need a clean up process anyway.



Have you considered building a (large) precomputed Jaro-Winkler lookup table, and indexing it? Perhaps run your precise matching first, and once you're at the Jaro-Winkler stage, run against the lookup table first, and then work on those values that weren't in the table (and add them to the table)?
Post #1058464
Posted Thursday, February 3, 2011 5:19 PM


SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: Monday, October 6, 2014 2:55 PM
Points: 143, Visits: 363
Yep, for the most part that is exactly what I'm dong for the fuzzy matching.

Adam Sottosanti
Post #1058468
Posted Sunday, December 29, 2013 7:45 AM
Grasshopper

GrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopper

Group: General Forum Members
Last Login: Monday, October 20, 2014 10:07 AM
Points: 14, Visits: 92
Hi I AM TRYING TO USE THIS JARO-WINKLER ALGORITHM,

I CANT CREATE A STORED PROCEDURE WITH THE SCALAR FUNCTIONS HERE,
PLEASE PROVIDE ME THE CODE TO CREATE A STORED PROCEDURE

THANKS IN ADVANCE
Post #1526346
Posted Sunday, December 29, 2013 11:38 AM


SSCoach

SSCoachSSCoachSSCoachSSCoachSSCoachSSCoachSSCoachSSCoachSSCoachSSCoachSSCoach

Group: General Forum Members
Last Login: Today @ 7:55 AM
Points: 17,822, Visits: 15,746
kishorreddy.yuva (12/29/2013)
Hi I AM TRYING TO USE THIS JARO-WINKLER ALGORITHM,

I CANT CREATE A STORED PROCEDURE WITH THE SCALAR FUNCTIONS HERE,
PLEASE PROVIDE ME THE CODE TO CREATE A STORED PROCEDURE

THANKS IN ADVANCE


Code to create a CLR function has been included as well as a TSQL function. Just call either of the functions from your stored proc.




Jason AKA CirqueDeSQLeil
I have given a name to my pain...
MCM SQL Server, MVP


SQL RNNR

Posting Performance Based Questions - Gail Shaw
Post #1526354
Posted Sunday, December 29, 2013 11:06 PM
Grasshopper

GrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopper

Group: General Forum Members
Last Login: Monday, October 20, 2014 10:07 AM
Points: 14, Visits: 92
Hi Jason,

I just copy the code from here and try to execute in my machine, But when i check these all are function not stored procedure.

Please provide me the code either for CLR function or to convert all this functions into stored procedure
Post #1526388
Posted Wednesday, January 1, 2014 5:05 PM


SSCoach

SSCoachSSCoachSSCoachSSCoachSSCoachSSCoachSSCoachSSCoachSSCoachSSCoachSSCoach

Group: General Forum Members
Last Login: Today @ 7:55 AM
Points: 17,822, Visits: 15,746
kishorreddy.yuva (12/29/2013)
Hi Jason,

I just copy the code from here and try to execute in my machine, But when i check these all are function not stored procedure.

Please provide me the code either for CLR function or to convert all this functions into stored procedure


If you look through this thread you will see an example of how to call the function from a stored procedure.




Jason AKA CirqueDeSQLeil
I have given a name to my pain...
MCM SQL Server, MVP


SQL RNNR

Posting Performance Based Questions - Gail Shaw
Post #1526958
Posted Thursday, January 2, 2014 8:53 AM
SSC Eights!

SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!

Group: General Forum Members
Last Login: Today @ 8:06 AM
Points: 880, Visits: 2,434
I've renamed the pure SQL functions somewhat, but here's an example of how I would call them from a stored procedure:

DECLARE @result FLOAT
SET @result = CalculateJaroWinkler('Bob','Bill')
PRINT @result
SET @result = CalculateJaroWinkler('Bob','Bub') --0.7999
PRINT @result
SET @result = CalculateJaroWinkler('Barnes','Banres') --0.9555
PRINT @result

Alternately, for a very primitive table compare example:
DECLARE @JaroThreshold FLOAT
SET @JaroThreshold = 0.9

SELECT top 500 LEFT(MyWord,2) AS LeftTwo, MyWord AS Word
INTO #tempA
FROM MyListOfWords
WHERE MyWord LIKE '__a%'
ORDER BY MyWord

ALTER TABLE #tempA ADD CONSTRAINT [PK_tempA] PRIMARY KEY CLUSTERED
(
[Word] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = ON, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 100) ON [PRIMARY]
GO

SELECT top 500 LEFT(MyWord,2) AS LeftTwo, MyWord AS word
INTO #tempB
FROM MyListOfWords
WHERE MyWord LIKE '__a%'
ORDER BY MyWord

ALTER TABLE #tempB ADD CONSTRAINT [PK_tempB] PRIMARY KEY CLUSTERED
(
[Word] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = ON, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 100) ON [PRIMARY]
GO

SELECT a.word, b.word, CalculateJaroWinkler(a.word,b.word)
FROM #tempA a
INNER JOIN #tempB b
ON b.LeftTwo = a.LeftTwo
AND b.Word <> a.Word
WHERE CalculateJaroWinkler(a.word,b.word) > @JaroThreshold

DROP TABLE #tempA
DROP TABLE #tempB

Post #1527168
« Prev Topic | Next Topic »

Add to briefcase «««12345

Permissions Expand / Collapse