Fuzzy-String Search: Find misspelled information with T-SQL

  • I would absolutely start with "cleaning up" the data, followed by an exact search on the cleaned version, followed by a Double Metaphone implementation, since that simply generates one or two possible codes for any given input string. If you choose, you can save (and index) those codes; once generated, matching is purely 1-3 equality lookups, which can be indexed (see the sketch after this post).

    Pure T-SQL Double Metaphone implementation:

    http://www.sqlservercentral.com/scripts/Miscellaneous/30219/

    Following Double Metaphone, if you must get into CROSS JOIN type compares of string A against string B, for names (and addresses) I prefer to go to Jaro-Winkler; its use by the U.S. Census is a significant reference.

    Pure T-SQL Jaro-Winkler implementation:

    http://www.sqlservercentral.com/articles/Fuzzy+Match/65702/
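
    To make the save-and-index idea concrete, here is a minimal sketch. The [dbo].[Person] table, its [vcLastName] column, and a two-code table-valued [dbo].[udfDoubleMetaphone] function are assumptions for illustration, not the actual interface of the script linked above:

    -- Persist the one or two metaphone codes per name, then index them.
    -- ([dbo].[Person] and [dbo].[udfDoubleMetaphone] are hypothetical.)
    ALTER TABLE [dbo].[Person] ADD
        [vcMetaphone1] VARCHAR(4) NULL
        , [vcMetaphone2] VARCHAR(4) NULL
    ;
    GO
    UPDATE P SET
        [vcMetaphone1] = DM.[vcPrimary]
        , [vcMetaphone2] = DM.[vcAlternate]
    FROM [dbo].[Person] P
    CROSS APPLY [dbo].[udfDoubleMetaphone] (P.[vcLastName]) DM
    ;
    CREATE INDEX IX_Person_Metaphone1 ON [dbo].[Person] ([vcMetaphone1]);
    CREATE INDEX IX_Person_Metaphone2 ON [dbo].[Person] ([vcMetaphone2]);

    -- Lookup is then 1-3 indexed equality compares, no string distance.
    DECLARE @vc1 VARCHAR(4), @vc2 VARCHAR(4);
    SELECT @vc1 = [vcPrimary], @vc2 = [vcAlternate]
    FROM [dbo].[udfDoubleMetaphone] (N'Smyth')
    ;
    SELECT P.*
    FROM [dbo].[Person] P
    WHERE P.[vcMetaphone1] = @vc1 -- strong match
       OR P.[vcMetaphone1] = @vc2 -- medium match
       OR P.[vcMetaphone2] = @vc1 -- medium match
    ;

    Only the candidates that survive this indexed pass would then need a Jaro-Winkler CROSS JOIN compare.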

  • @jobudzin, I love Confio Ignite, but I quickly realized the pain of naming many SQL statements legibly... and then re-naming them, without duplication, when they were recompiled. Unfortunately, that is a completely manual process in their standard product.

    So I started putting "--Ignite" followed by a short legible text near the beginning of each SQL statement of interest, and "--Ignite" followed by a version number near the beginning of each SQL procedure. This made the manual process more of a cut-and-paste (see the sketch at the end of this post).

    Finally, with some help from Confio Support, I came up with a SQL script that I can run after each recompile, to automatically name all the hashes according to those comments.

    If you are interested, I shall post it on SSC.

    @Nadrek, excellent references when dealing with names. But I needed something more "exact" for VIN matching.
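
    For illustration, the tagging convention mentioned above looks like this; the procedure, table, and tag names here are made up:

    CREATE PROCEDURE [usr].[uspLoadOrders]
    AS
    BEGIN -- Ignite=LoadOrders v1.0.3

        -- Each statement of interest gets its own short legible tag:
        UPDATE [usr].[tblOrder] -- Ignite=LoadOrders MarkLoaded
        SET [dtLoaded] = GETDATE()
        WHERE [dtLoaded] IS NULL
        ;
        SELECT COUNT(*) AS [iRemaining] -- Ignite=LoadOrders CountLeft
        FROM [usr].[tblOrder]
        WHERE [dtLoaded] IS NULL
        ;
    END

    The naming script can then turn each statement's hash into a name like "LoadOrders MarkLoaded v1.0.3" automatically.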

  • I really enjoyed the article but had some questions. You had mentioned the potential use of an Inline Table-Valued UDF and a Recursive CTE; I was wondering if you had actually done anything with that?

    Also, I am working on altering the information provided to handle a cross-referencing task that I have, and before I reinvent the wheel, I thought I might inquire whether you or someone else has used a method similar to this for cross-referencing data.

  • I am very interested if you could post the script. I worked with Confio support and have a similar script, but nothing that takes into account the comments in the actual stored procedure itself. Thanks!

  • @swoozie, I have not put further effort into "achieving the nirvana of an Inline Table-Valued UDF using a Recursive CTE (Common Table Expression) with zero logical reads" for DLD, and I haven't heard of anyone else doing it yet.

    Regarding cross-referencing, I assume you mean finding the items in List A that a simple inner join says are not in List B, but which are only a few typos off from items in List B. (If you are cross-referencing keywords in articles for relevance matching, then the lists could be the keywords in two articles, with appearance counts.)

    Below is a working example, in which I find the words that have been searched for on fuzzy-string.com that were not found in the dictionary, but are only 1 typo off from dictionary words. (Of course, it does not use the table in which I have already saved the UDF results, so you can see how to use the UDF.)

    Step 2 would be to repeat the process, allowing 2 typos, after saving the results from Step 1 and eliminating them from List A. (No need to find 2-typo matches for a word if you already found 1-typo matches for it.) This can be done in a loop where @iThreshold goes from 1 to the maximum difference you want to allow; a loop sketch follows the example below.

    Remember that DLD is measured in typos; if you need a percentage similarity, you may be better off using the Jaro-Winkler algorithm.

    The "XRef" query is the important one that uses the UDF, but the "ListA" query is also important since it eliminates the exact matches, and caches the lengths. Note the --Ignite comments, which will help me when the long-running statements show up in Confio.

    If your lists are long, this can take a while. On fuzzy-string.com, searching one word against a hundred thousand can take a second or two, so cross-referencing a hundred thousand against a hundred thousand could take a day or two (100,000 searches × 1-2 seconds each is roughly 28-56 hours). This is why I included the TOP 1000, to limit the time taken for this example to a minute or two.

    BEGIN -- Ignite=NonWdSrch v1.0.0
        DECLARE @vcCosts VARCHAR(253) -- Ignite=NonWdSrch Declare
            , @iThreshold INT
        ;
        CREATE TABLE #ListA_NotInListB ( -- Ignite=NonWdSrch CreateA
            [idWord] BIGINT NOT NULL PRIMARY KEY
            , [vcWord] NVARCHAR(150) NOT NULL UNIQUE
            , [biSearches] BIGINT NOT NULL
            , [iLen] INT NOT NULL
        )
        ;
        CREATE TABLE #ListB ( -- Ignite=NonWdSrch CreateB
            [idWord] BIGINT NOT NULL PRIMARY KEY
            , [vcWord] NVARCHAR(150) NOT NULL UNIQUE
            , [biSearches] BIGINT NOT NULL
        )
        ;
        SELECT @iThreshold = 1 -- Ignite=NonWdSrch Init
            , @vcCosts = [usr].[udfDamerauLevenshteinRow0] (@iThreshold)
        ;
        WITH ListA_NotInListB AS ( -- Ignite=NonWdSrch ListA
            SELECT A.[idWord], COUNT_BIG(*) AS biSearches
            FROM [usr].[tblSearch] A
            LEFT JOIN [usr].[tblWordTWL] B
                ON B.[idWord] = A.[idWord]
            WHERE B.[idWord] IS NULL
            GROUP BY A.[idWord]
        ) INSERT INTO #ListA_NotInListB (
            [idWord], [vcWord], [biSearches], [iLen]
        ) SELECT TOP 1000 -- limit time taken for example
            A.[idWord], LkpA.[vcLookupString], A.[biSearches]
            , LEN(LkpA.[vcLookupString]) AS [iLen]
        FROM ListA_NotInListB A
        JOIN [usr].[tblLookupString] LkpA
            ON LkpA.[idLookupString] = A.[idWord]
        ORDER BY LkpA.[vcLookupString]
        ;
        INSERT INTO #ListB ( -- Ignite=NonWdSrch ListB
            [idWord], [vcWord], [biSearches]
        ) SELECT TOP 1000 -- limit time taken for example
            B.[idWord], LkpB.[vcLookupString], InfoB.[biSearches]
        FROM [usr].[tblWordTWL] B
        JOIN [usr].[tblLookupString] LkpB
            ON LkpB.[idLookupString] = B.[idWord]
        JOIN [usr].[tblWordLenTWL] InfoB
            ON InfoB.[idWord] = B.[idWord]
        ORDER BY LkpB.[vcLookupString]
        ;
        WITH Comp AS ( -- Ignite=NonWdSrch XRef
            SELECT A.[idWord] AS [idWordA], B.[idWord] AS [idWordB]
                , [usr].[udfDamerauLevenshteinLim] (B.[vcWord], A.[vcWord]
                    , @iThreshold, A.[iLen], @vcCosts) AS [iDLDistance]
            FROM #ListA_NotInListB A CROSS JOIN #ListB B
        ) SELECT C.[idWordA], A.[vcWord] AS [vcWordA]
            , A.[biSearches] AS [biSearchesA], C.[iDLDistance]
            , C.[idWordB], B.[vcWord] AS [vcWordB]
            , B.[biSearches] AS [biSearchesB]
        FROM Comp C
        JOIN #ListA_NotInListB A
            ON A.[idWord] = C.[idWordA]
        JOIN #ListB B
            ON B.[idWord] = C.[idWordB]
        WHERE C.[iDLDistance] = @iThreshold
        ORDER BY A.[vcWord], B.[vcWord]
        ;
        DROP TABLE #ListB; -- Ignite=NonWdSrch DropB
        DROP TABLE #ListA_NotInListB; -- Ignite=NonWdSrch DropA
    END
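
    For reference, here is a minimal sketch of the Step 2 loop described above. It would replace the single-threshold "XRef" query inside the example (before the temp tables are dropped). The #Matches accumulator table is my own invention for this sketch; the other names reuse those from the example, and for a real run you would remove the TOP 1000 limits.

    CREATE TABLE #Matches ( -- hypothetical accumulator for this sketch
        [idWordA] BIGINT NOT NULL
        , [idWordB] BIGINT NOT NULL
        , [iDLDistance] INT NOT NULL
    )
    ;
    DECLARE @iMaxTypos INT -- maximum difference you want to allow
    ;
    SELECT @iThreshold = 1, @iMaxTypos = 2
    ;
    WHILE @iThreshold <= @iMaxTypos
    BEGIN
        -- recompute the cost row for the current threshold
        SELECT @vcCosts = [usr].[udfDamerauLevenshteinRow0] (@iThreshold);

        WITH Comp AS (
            SELECT A.[idWord] AS [idWordA], B.[idWord] AS [idWordB]
                , [usr].[udfDamerauLevenshteinLim] (B.[vcWord], A.[vcWord]
                    , @iThreshold, A.[iLen], @vcCosts) AS [iDLDistance]
            FROM #ListA_NotInListB A CROSS JOIN #ListB B
            -- no need to re-check a word already matched at a lower threshold
            WHERE NOT EXISTS (SELECT 1 FROM #Matches M
                WHERE M.[idWordA] = A.[idWord])
        ) INSERT INTO #Matches ([idWordA], [idWordB], [iDLDistance])
        SELECT [idWordA], [idWordB], [iDLDistance]
        FROM Comp
        WHERE [iDLDistance] = @iThreshold;

        SELECT @iThreshold = @iThreshold + 1;
    END
    DROP TABLE #Matches;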

  • @jobudzin, I have submitted my "NameSQLinConfioIgnite" script to SSC and will advise when it is published. I just ran it to capture the usage of my example in the previous post, and it named these SQL statements that had taken enough milliseconds for Ignite to notice:

    NonWdSrch CreateB v1.0.0

    NonWdSrch ListA v1.0.0

    NonWdSrch XRef v1.0.0

  • @jobudzin, my "NameSQLinConfioIgnite" script is scheduled to publish on SSC Feb 5-6. If you want it sooner, you can send me an email from my website http://www.tomkellerconsulting.com/ (linked from my profile here, or from the bottom of fuzzy-string.com).

  • Thanks!

  • No rush. That works for me. Thanks! Appreciate the help.

  • @jobudzin, my NameSQLinConfioIgnite script did get published at http://www.sqlservercentral.com/scripts/Performance+Tuning/106514/. Enjoy, and please let me know if you have success or any issue with it.

  • @thomas, that is the exact same script, word for word, that Confio support provided me. I was wondering if they would be different, but I guess not. It works for the most part, but I am going to tweak it to remove the numbering, as I don't think I need the names to be unique; they refer to the same stored procedure (if applicable), just different parts of it. That has its advantages, but for now I want to focus at a higher level on which stored procedures are having issues; then I can dig deeper if needed to see the particular parts of each procedure causing the most performance issues. Thanks again!
