Roll Your Own Fuzzy Match / Grouping (Jaro Winkler) - T-SQL

  • kishorreddy.yuva (12/29/2013)


    Hi Jason,

    I just copy the code from here and try to execute in my machine, But when i check these all are function not stored procedure.

    Please provide me the code either for CLR function or to convert all this functions into stored procedure

    If you look through this thread you will see an example of how to call the function from a stored procedure.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • I've renamed the pure SQL functions somewhat, but here's an example of how I would call them from a stored procedure:

    DECLARE @result FLOAT

    SET @result = CalculateJaroWinkler('Bob','Bill')

    PRINT @result

    SET @result = CalculateJaroWinkler('Bob','Bub') --0.7999

    PRINT @result

    SET @result = CalculateJaroWinkler('Barnes','Banres') --0.9555

    PRINT @result

    Alternately, for a very primitive table compare example:

    DECLARE @JaroThreshold FLOAT

    SET @JaroThreshold = 0.9

    SELECT top 500 LEFT(MyWord,2) AS LeftTwo, MyWord AS Word

    INTO #tempA

    FROM MyListOfWords

    WHERE MyWord LIKE '__a%'

    ORDER BY MyWord

    ALTER TABLE #tempA ADD CONSTRAINT [PK_tempA] PRIMARY KEY CLUSTERED

    (

    [Word] ASC

    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = ON, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 100) ON [PRIMARY]

    GO

    SELECT top 500 LEFT(MyWord,2) AS LeftTwo, MyWord AS word

    INTO #tempB

    FROM MyListOfWords

    WHERE MyWord LIKE '__a%'

    ORDER BY MyWord

    ALTER TABLE #tempB ADD CONSTRAINT [PK_tempB] PRIMARY KEY CLUSTERED

    (

    [Word] ASC

    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = ON, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, FILLFACTOR = 100) ON [PRIMARY]

    GO

    SELECT a.word, b.word, CalculateJaroWinkler(a.word,b.word)

    FROM #tempA a

    INNER JOIN #tempB b

    ON b.LeftTwo = a.LeftTwo

    AND b.Word <> a.Word

    WHERE CalculateJaroWinkler(a.word,b.word) > @JaroThreshold

    DROP TABLE #tempA

    DROP TABLE #tempB

  • I'm hoping someone is still looking at this post. I have implemented the algorithm and it is working just fine. However, it performs quite horribly with my table with 60k rows. Is there anything I can do to speed things up? my table has basically first name and last name and that't it.

    running a query as such

    select [dbo].[JaroWinkler]('smith', LAST_NAME ) from LEIE could take up to 7 minutes.

    Any Idea would be most appreciative.

    Thanks!

  • I'd try the CLR approach.

    Also it sounds like you could match directly rather than via fuzzy, and perhaps only run jaro winkler on an unmatched subset?

    When I use this in exercises, it's usually the last type of matching I attempt having exhausted other identifiers or direct matching approaches.

  • Thanks SSC, I'm trying to use this with a web application, which would query a DB and get potential matches. I was hoping to have an option for a fuzzy search based on last name , first name combo or one or the other.

    I'm quite new to the CLR thing, do you have any resources docs on how that would work? thanks for replying, or on your suggestion with the subset how would you go about doing a subset based on a last name search with a table that has 60k rows.

    Thanks again!!!

  • See my code on page 4 of this thread - and it's Mike 😉

    I can't go through the details of CLR with you- but basically it's c# that you publish to sql server (and you or your DBA need to do a bit of config to enable CLR on the server itself). SQL Server can then call the routine.

    When I did this before it did 36k rows in under a second (according to my last post anyway!)

  • Hi,

    I'm new in the SQL stuff and have been engaged in a project to work with strings matching. I have read the entire article on (http://www.sqlservercentral.com/articles/Fuzzy+Match/65702/) and that is really interesting for what I have been working.

    I tried to create exactly the same scenario here, but in your query to call the Jaro Winkler function you mentioned the dbo.JaroWinkler, but this function name does not exist in your code.

    Below is the query to match the strings.

    ------------------------------

    SELECT NameLookup.name_group_id, NameInput.Cust_Id, NameInput.Name_Input, NameLookup.first_name ,NameLookup.first_name_normalized, NameInput.Last_Name, dbo.JaroWinkler(NameInput.Name_Input, NameLookup.first_name) AS Jaro3,

    RANK() OVER (Partition BY NameLookup.name_group_id ORDER BY dbo.JaroWinkler(NameInput.Name_Input, NameLookup.first_name) Desc) Jar02

    FROM NameInput CROSS JOIN NameLookup Where dbo.JaroWinkler(NameInput.Name_Input, NameLookup.first_name) > .95order by NameLookup.name_group_id

    Would you mind provide me some guidance on this case?

    Thanks in advance

    Eric

  • Hey Eric, Check out the zip files at the bottom of the article, you should find it there.

    Adam Sottosanti

  • Hey Ira,

    Thank you so much for this. I'm implementing the Jaro-Winkler algorithm to do fuzzy matching in a data warehouse. I'll be able to integrate this TSQL script into my SSIS ETL Packages

  • The CLR implementation found in the assembly Microsoft.MasterDataServices.DataQuality, can be used in SQL and it is about 300 times faster..

  • ip7915 (3/4/2016)


    The CLR implementation found in the assembly Microsoft.MasterDataServices.DataQuality, can be used in SQL and it is about 300 times faster..

    Yep. And if you have DQS setup it's already available for you.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Deleted

  • Hi Ira,

    Does not seem to work for some cases... For example:

      select dbo.[fn_calculateJaro]('Busan-Gimhae LRT Warranty','Busan Gimhae')
    returns 0 because of a dash character '-'

    However...
    select DIFFERENCE('Busan-Gimhae LRT Warranty','Busan Gimhae') 
    returns 4, meaning a good match.

Viewing 13 posts - 46 through 57 (of 57 total)

You must be logged in to reply to this topic. Login to reply