SQLServerCentral

Fuzzy-String Search: Find misspelled information with T-SQL


https://www.sqlservercentral.com/Forums/Topic1360540.aspx

By Thomas Keller - Monday, September 17, 2012 2:46 PM

Comments posted to this topic are about the item Fuzzy-String Search: Find misspelled information with T-SQL
By dwain.c - Monday, September 17, 2012 3:24 PM

This is really cool stuff!

Only scanned the article so far but have bookmarked it for a return read this weekend.

I had often wondered if there's an existing algorithm out there to do this sort of thing.

Thanks Thomas!
By ZA_Crafty - Monday, September 17, 2012 5:59 PM

Awesome!!!!! Thanks for sharing this knowledge!!!
I've been pondering this for a while now. Especially with my interest in Neural Networks as well.

Perhaps the performance can be improved by implementing an external assembly and a CLR function?
By Thomas Keller - Monday, September 17, 2012 6:14 PM

@ZA_Crafty: Perhaps CLR could be faster, but a Transact-SQL-only solution was important to me and to my clients.
By Davin21 - Monday, September 17, 2012 6:18 PM

I've tended to find that a CLR implementation of the Jaro-Winkler algorithm suits my needs a bit better, but it's really cool to see a T-SQL-only alternative!
By Thomas Keller - Monday, September 17, 2012 6:26 PM

@Davin21 I have some non-T-SQL implementations (including a Jaro-Winkler) listed on my Reference page - please let me know of others I should consider linking to.
By Eric Hobbs - Monday, September 17, 2012 6:53 PM

:-D
Fantastic article. Thanks for sharing some of this knowledge. I'm sure that I am not the only one who will be making use of this technique.
By Davin21 - Monday, September 17, 2012 7:08 PM

@ThomasKeller - I found the information from SimMetrics (linky) to be quite useful - what I think would be really interesting to see would be a weighted version of this that gives precedence to the first n characters in the string, as I've found that weighting the first four characters (particularly for first name / surname validation) gives much more reliable results.
By Jinxseee - Monday, September 17, 2012 8:15 PM

Thanks for the info.
I find the CLR/external assembly implementation runs a lot faster for the algorithms I have implemented (Jaro, Levenshtein, etc.), but having a T-SQL version is certainly handy.
By Martin Bastable - Monday, September 17, 2012 8:16 PM

Great article! Brilliantly written too!

Out of interest, what other fuzzy search methods do people use?

One I came up with for a CRM system a fair few years ago works as follows...

(please note the examples given below are all bad examples, but quick to type)

Requirements:
Fuzzy matching of data for automated de-duplication, automated `existing record` matching so duplicates wouldn't get created in the first place, and fuzzy searches by users on varying fields of data

Solution:
Firstly, we identified the main fields our system needed to handle for the above, e.g. Firstname, Lastname, email address, 1st line of their address, phone numbers.

Secondly, we came up with some algorithms for `preparing data`, i.e. cleaning up variants of strings and common errors and getting rid of useless data, while leaving the original data untouched.
e.g. Phone numbers: internationalization of numbers, stripping of excess characters etc. so (+44) 123 45678 90 -> 441234567890, 01234567890 -> 441234567890, (0712)34567890 ex 123 -> 071234567890.
e.g. Names: Get rid of characters that don't matter much, punctuation, extra words, etc. Making up some examples (not the actual algorithm we use): Fred -> frd, Mc Donald -> MCDONALD, Bob Smithly-Smithington -> BOB, Frank J. -> FRANK
e.g. Addresses: 123, Little House -> 123LITTLEHOUSE, Flat 4b, 57, Some Street -> FLAT4B57SOMEST

You get the idea. Character removal can help with misspellings, variants, etc., and there are plenty of fancy algorithms, encoding methods, etc. around to tweak this; it can vary per field/data type. We did some analysis on our data to find what was effective, which may vary by data type - names, addresses, emails, etc.
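For illustration only (this is not Martin's actual code; the function name and the UK-style '44' prefix rule are assumptions based on his examples), a minimal T-SQL sketch of such a `preparing data` function for phone numbers might look like:

CREATE FUNCTION dbo.EncodePhoneNumber (@Raw VARCHAR(50))
RETURNS VARCHAR(50)
AS
BEGIN
    DECLARE @Clean VARCHAR(50) = '';
    DECLARE @Pos INT = 1;
    DECLARE @Ch CHAR(1);

    -- Keep only the digits, dropping spaces, brackets, '+' and so on.
    WHILE @Pos <= LEN(@Raw)
    BEGIN
        SET @Ch = SUBSTRING(@Raw, @Pos, 1);
        IF @Ch LIKE '[0-9]' SET @Clean = @Clean + @Ch;
        SET @Pos = @Pos + 1;
    END;

    -- Normalize a leading trunk '0' to the international '44' prefix.
    IF LEFT(@Clean, 1) = '0' SET @Clean = '44' + SUBSTRING(@Clean, 2, 50);

    RETURN @Clean;
END;

-- SELECT dbo.EncodePhoneNumber('(+44) 123 45678 90');  -- 441234567890
-- SELECT dbo.EncodePhoneNumber('01234567890');         -- 441234567890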

Anyway

Thirdly,

Create a second duplicate set of fields to hold the above encoded data. I.e. EncodedFirstname, EncodedPhoneNumber, etc.

So we end up with a set of `encoded fields` - e.g.
Firstname: Mc Donald EncodedFirstname: MCDONALD
Address1: , 12 3 I can't type road, EncodedAddress1: 123ICANTTYPERD
etc.

Note that users still see the data the way they entered it, e.g. (+44) 567890 as a phone number, not the purified 44567890 that we search against.

Results:

Multi-part searches can be completed very easily based on the encoded fields.
E.g. search for a phone number 0123 456 789: depending on the level of accuracy required, you can exactly match on 44123456789, partially match on 123456789, or return matches on only a certain number of digits, e.g. 1234567 - but you always search against a consistent phone number.
E.g. a name search might return exact matches only on the encoded string, or include exact matches plus matches where the first X or last X characters match.

We can also keep multiple separate versions of the encoded fields, with different encoding/cleaning methods for the same source field.

Depending on the fuzziness of results required by the end user, we just vary the scope of the match - the number of characters matched on, etc.
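To illustrate varying the scope of the match (the table and column names here are made up, not Martin's schema, and dbo.EncodePhoneNumber is the hypothetical function sketched earlier):

DECLARE @Search VARCHAR(50) = dbo.EncodePhoneNumber('0123 456 789');

-- Exact match on the encoded value.
SELECT CustomerID FROM dbo.Customer WHERE EncodedPhoneNumber = @Search;

-- Looser: match on the last 7 digits only.
SELECT CustomerID FROM dbo.Customer
WHERE RIGHT(EncodedPhoneNumber, 7) = RIGHT(@Search, 7);

-- Fuzzier still: prefix match on the first 3 characters of an encoded name.
SELECT CustomerID FROM dbo.Customer
WHERE EncodedFirstname LIKE LEFT('MCDONALD', 3) + '%';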

You can use multiple fields combined with each other to generate unique keys for automated duplicate matching, with varying levels of confidence in a match.

e.g. ENCODEDFIRSTNAME+ENCODEDSURNAME+DOB
e.g. ENCODEDFIRSTNAME+LEFT(ENCODEDADDRESSLINE1, 5)
e.g. LEFT(ENCODEDFIRSTNAME, 3) + LEFT(ENCODEDSURNAME, 3) + LEFT(ADDRESSLINE1, 3)

The common combinations of the encoded fields we stick into separate extra fields KEYFIRSTNAMESURNAMEDOB etc. for speed of later searches.
(Incidentally, we don't use firstname + surname + date of birth as an automated duplicate match, as there were too many people in our datasets with the same name and potential date of birth.)
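One way to materialize a key like that (a hypothetical sketch; Martin describes populating his key fields from application code rather than computed columns) is a persisted, indexed computed column:

ALTER TABLE dbo.Customer
    ADD KeyFirstnameAddress AS (EncodedFirstname + LEFT(EncodedAddressLine1, 5)) PERSISTED;

CREATE INDEX IX_Customer_KeyFirstnameAddress ON dbo.Customer (KeyFirstnameAddress);

-- Candidate duplicates: any key shared by more than one record.
SELECT KeyFirstnameAddress, COUNT(*) AS Records
FROM dbo.Customer
GROUP BY KeyFirstnameAddress
HAVING COUNT(*) > 1;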

Very fuzzy matches against multiple fields can still return small sets of records (e.g. first 3 or last 3 letters of EncodedFirstname = "abc" plus first 3 or last 3 letters of EncodedSurname = "zyx") - combining 2 key fields greatly reduces the number of matches, even with very fuzzy input data.

User interface - often overlooked - when we present the results of a search to the users, rather than just a massive list, we give a massive list with visual clues as to the more likely matches. So e.g. for a firstname/surname search, entries exactly matching both first name and surname are highlighted, entries exactly matching first name OR surname get a different highlight, and fuzzy matches appear normally. The users have e.g. `Fuzzy, Partial and Exact` buttons they can click to quickly try more restrictive variants of the search if too many results are returned. (We can also automatically re-search on a more restrictive result set if too many records are returned.) We can also stick the more likely matches at the top of the search results.

We share the `encoding` functions that generate our encoded strings, numbers etc. across parts of our applications. To update or upgrade what we encode, how we encode it, our algorithms or some of our searches, we merely update our code in one place (.NET libraries, which we cross-use with SQL Server - good old CLR) and then apply it against our encoded fields/tables to update the encoded data. All parts of our system (which has a desktop front end, public web side, back-end applications and automated processes) point at the same data. Of course, all search code passes the search criteria through the encoding routines before doing the search, to make sure our search data is in the correct format.

Creation of the encoded fields is spread across time (i.e. when records are added or updated). Records are always up to date. Possible to update all encoded fields in bulk at the same time too (e.g. when updating algorithms). Doesn't touch the original data.

Searches are very very fast - all searches are performed against what are simply extra fields in the database, with relevant indexes.

Automated de-duplication routines can match on various combinations of relevant matching fields (note I say relevant - we only de-duplicate at certain levels of certainty of a match) and feed back statistics on the most-used matches, which help us identify areas for improvement.

We have had this running for years and years, catching, preventing and hunting down huge numbers of potential duplicates. Wipes the floor with the `bought in` CRM systems I have tried.

Of course we also use front end validation etc. in our interfaces to help, but hey - you know what data can be like.

Unfortunately I've not been allowed the time for a few years to add more fancy techniques into the mix - ah, the potential!

There are overheads - overhead of the extra encoded/key fields etc. - but this isn't a problem for us.

We run the above against millions of records.

Hoping this might be of interest to someone!

M.
By Thomas Keller - Monday, September 17, 2012 8:31 PM

@Davin21: Indeed, I had originally linked the very informative SimMetrics Web Site, which was linked from your sourceforge link, but it seems to be permanently down now.
By Thomas Keller - Monday, September 17, 2012 8:42 PM

@Martin Bastable: Thanks for your kind words, and thanks to those who proofread various drafts of my article, including Kellie Sanchez, Dan Hess, Eric Freeman, Ernest Cook, and Chris Anderson! (Sorry I didn't thank y'all in the article itself.)

I have also implemented systems such as you describe, that pre-process the data and combine it in various ways, to make "fuzzy" searching quicker and more flexible. But for this article, I wanted to present a method that can be dropped into any database with a minimum of work, and still have acceptable performance when "fuzzy" searching on one field.
By Martin Bastable - Tuesday, September 18, 2012 12:26 AM

I call it a win-win! You could use it as a stand-alone search on a field, or combine it with other things for a super search.

M.
By dmusick - Tuesday, September 18, 2012 1:26 AM

Several years ago, I had the recurring task of locating and merging records for duplicate customer names in a large database. Because the names, DOB and SSN are not generally identical, I had to develop an algorithm that could compare any two strings and give a "percent similar" number, so I could filter out only the most likely duplicates and go through those manually.

I put a space at the beginning and end of each string and break each string up into 2-character sequences. Then I count how many of the sequences in the shorter string are also in the longer string, pairing them up so I don't double-count. The number of common 2-character sequences divided by the total in the shorter string is the percent similar.

For example, the words 'farm' and 'charm' can be broken into (using an underscore to represent the space):

_f fa ar rm m_

and

_c ch ha ar rm m_

Comparing the two, we can pair up three of these 2 character strings (ar rm m_). Since the shorter string has 5, this is 3/5, or 60% similarity.

If the second string was 'family charm', it would be 100% similar, because it would contain all the 2 character sequences. This is one weakness of this algorithm, although it is easy to compensate for by comparing string length also. This problem mostly comes up when one string is very short.

Overall, I have used this algorithm quite a lot, and it seems to work very well, finding duplicates and similarities that would be extremely difficult otherwise.

Here is the code I wrote to implement this algorithm:

CREATE FUNCTION [dbo].[percentSimilar] (@string1 as varchar(1000), @string2 as varchar(1000))
RETURNS decimal(10,9)
AS
BEGIN
    Declare @percentSimilar decimal(10,9);
    Declare @length1 int;
    Declare @length2 int;
    Declare @inCommon int;
    Declare @minSequences int;
    Declare @maxLength int;
    Declare @position int;
    Declare @tally table (sequence char(2), times1 int, times2 int);
    Declare @intersect table (sequence char(2), overlap int);

    If @string1 <> '' and @string2 <> '' --if either string is an empty string, return 0
    Begin
        Set @string1 = ' ' + @string1 + ' '; --put a space at the beginning and end of each string
        Set @string2 = ' ' + @string2 + ' ';
        Set @length1 = Len(@string1);
        Set @length2 = Len(@string2);

        If @length1 < @length2 --find the minimum number of 2-character sequences,
            Set @minSequences = @length1 - 1; --to divide the total overlap by and get the percentage
        Else
            Set @minSequences = @length2 - 1;

        Set @position = 1; --add the two-character sequences from string 1 to the tally table
        While @position < @length1
        Begin
            Insert Into @tally (sequence, times1, times2) Values (Substring(@string1, @position, 2), 1, 0);
            Set @position = @position + 1;
        End

        Set @position = 1; --add the two-character sequences from string 2 to the tally table
        While @position < @length2
        Begin
            Insert Into @tally (sequence, times1, times2) Values (Substring(@string2, @position, 2), 0, 1);
            Set @position = @position + 1;
        End

        Insert Into @intersect --count how many times each sequence overlaps the other
        Select sequence, case when sum(times1) < sum(times2) then sum(times1) else sum(times2) end as overlap
        From @tally
        Group By sequence;

        Select @inCommon = Sum(overlap) From @intersect; --the total overlap of two-character sequences

        Set @percentSimilar = 1.000000000 * @inCommon / @minSequences;
    End
    Else
        Set @percentSimilar = 0;

    RETURN @percentSimilar;
END
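If the function is created exactly as posted, a quick check against the worked example (results shown as comments, rounded to the function's DECIMAL(10,9) return type) gives:

SELECT dbo.percentSimilar('farm', 'charm');         -- 0.600000000 (3 of 5 bigrams shared)
SELECT dbo.percentSimilar('farm', 'family charm');  -- 1.000000000 (the short-string weakness noted above)
SELECT dbo.percentSimilar('Smith', 'Snith');        -- 0.666666667 (4 of 6 bigrams shared)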
By dmusick - Tuesday, September 18, 2012 1:34 AM

I wanted to add that one of the big strengths of this algorithm is that it will detect similarities even if the words are scrambled into different orders or if one string has an additional word at the beginning. Because it is only looking at 2-character sequences, it doesn't care about positioning within the string. This is very helpful when dealing with names, which can often be out of order, such as when the data entry person switches the first and last name, or puts a double last name (common among Hispanic names) in the middle-name field and then the last-name field, while the next person does it the opposite way.

I have seen so many bizarre data entry issues, where the same person was entered into the database quite differently each time, but my algorithm was able to find them.
By Thomas Keller - Tuesday, September 18, 2012 4:12 AM

@dmusick: Interesting approach, and certainly useful. Note that DLD shows the same similarity for 'farm' and 'charm' (40% diff = 60% sim). For performance, I would try to rework your UDF to avoid a local table variable within the UDF. When benchmarking my OpHTML UDF, used in the final SELECT, I noticed that SQL opens transactions for all table variable activity. My Lim UDF, used in the WHERE clause, has better performance because it has no local table variable.
By ChrisM@Work - Tuesday, September 18, 2012 9:39 PM

Interesting article and well worth a second read, and a play with the code.

I've used two T-SQL fuzzy-matching algorithms in the past; both work "well enough" - subjective, and also data-dependent.
The first uses token-matching, like dmusick's algorithm in a previous post. Grab the first three letters from the reference word and see if they match in the target. Then do the same with letters 2,3,4, then 3,4,5 and so on. It's not overly complicated and can be squeezed into an iTVF. The downside is that matching on words of twice the token length or less isn't reliable, so the matching algorithm will depend upon the word length. Here's a typical token-matching iTVF employing 3-character tokens for reference words of 7 characters or more, and 2-character tokens for reference words between 3 and 6 characters long:

CREATE FUNCTION [dbo].[FuzzyMatch_iTVF2k5] 
(@Reference VARCHAR(100) = NULL,
@Target VARCHAR(100) = NULL)
RETURNS table WITH SCHEMABINDING
AS
-- Chris Morris 2012
-- Fuzzy-matching using tokens
-- See also http://research.microsoft.com/pubs/75996/bm_sigmod03.pdf

RETURN
SELECT
d.Method,
MatchRatio = CAST(CASE
WHEN d.Method = 1 THEN 100
WHEN d.Method = 3 THEN LenTarget*100.00/LenReference
WHEN d.Method = 4 THEN LenReference*100.00/LenTarget
WHEN d.Method = 5 THEN
(
SELECT
MatchPC = (100.00 * ISNULL(NULLIF(SUM(
CASE WHEN Tally.n < PosInTarget THEN Tally.n/PosInTarget ELSE PosInTarget/Tally.n END
),0)+2.00,0) / LenReference)
* CASE WHEN LenTarget > LenReference THEN LenReference/LenTarget ELSE 1.00 END
FROM ( -- Tally
SELECT TOP (CAST(LenReference AS INT)-2) n = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM (SELECT n FROM (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1) d (n)) a,
(SELECT n FROM (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1) d (n)) b
) Tally
CROSS APPLY (SELECT PosInTarget = CAST(CHARINDEX(SUBSTRING(@Reference, Tally.n, 3), @Target) AS DECIMAL(5,2))) x -- cast as in Method 6 below, so the position ratios above aren't integer division
)
WHEN d.Method = 6 THEN
(
SELECT
MatchPC = (100.00 * ISNULL(NULLIF(SUM(
CASE WHEN Tally.n < PosInTarget THEN Tally.n/PosInTarget ELSE PosInTarget/Tally.n END
),0)+1.00,0) / LenReference)
* CASE WHEN LenTarget > LenReference THEN LenReference/LenTarget ELSE 1.00 END
FROM ( -- Tally
SELECT TOP (CAST(LenReference AS INT)-1) n = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1) d (n)
) Tally
CROSS APPLY (SELECT PosInTarget = CAST(CHARINDEX(SUBSTRING(@Reference, Tally.n, 2), @Target) AS DECIMAL(5,2))) x
)
ELSE NULL
END AS DECIMAL(5,2))

FROM (
SELECT Method = CASE
WHEN @Reference = @Target THEN 1
WHEN @Reference IS NULL OR @Target IS NULL THEN 2
WHEN @Reference LIKE '%'+@Target+'%' THEN 3
WHEN @Target LIKE '%'+@Reference+'%' THEN 4
WHEN DATALENGTH(@Reference) >= 7 AND DATALENGTH(@Target) >= 7 THEN 5
WHEN DATALENGTH(@Reference) > 2 AND DATALENGTH(@Target) > 2 THEN 6
ELSE 7
END,
LenTarget = CAST(DATALENGTH(@Target) AS DECIMAL(5,2)),
LenReference = CAST(DATALENGTH(@Reference) AS DECIMAL(5,2))
) d



When comparing two columns in the same table, this function will take about 30s to process a million rows.
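A typical call, assuming a made-up dbo.Person table with two name columns to compare, would be:

SELECT p.FirstName, p.PreferredName, fm.Method, fm.MatchRatio
FROM dbo.Person p
CROSS APPLY dbo.FuzzyMatch_iTVF2k5(p.FirstName, p.PreferredName) fm
WHERE fm.MatchRatio >= 75;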


The second method is superficially similar to the DLD algorithm but simplified for performance; for n = 1 to length of the reference word, run through the letters of the reference word one at a time, and measure where they are found in the target word and the reference word starting at position n-1. A little simple arithmetic on the resulting numbers yields a match score. Here's the code:

DECLARE @Reference VARCHAR(25) = 'Joones' 
DECLARE @Target VARCHAR(25) = 'Joonse'

;WITH Tally AS (
SELECT TOP (DATALENGTH(@Reference)) n = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM (SELECT n FROM (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1) d (n)) a,
(SELECT n FROM (SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1) d (n)) b
)
SELECT
-- * -- switch the comments around to see how it works
MatchRatio = SUM(Matches)/DATALENGTH(@Reference)
FROM Tally
CROSS APPLY (SELECT TestLetter = SUBSTRING(@Reference, Tally.n, 1)) x
CROSS APPLY (
SELECT -- start search one character position *before* n to catch switched letters
PosInReference = CAST(CHARINDEX(TestLetter, @Reference, n-1) AS DECIMAL(5,2)),
PosInTarget = CAST(CHARINDEX(TestLetter, @Target, n-1) AS DECIMAL(5,2))
) z
CROSS APPLY (
SELECT Matches = CASE WHEN PosInReference > PosInTarget
THEN PosInTarget/PosInReference ELSE PosInReference/PosInTarget END
) m



It's substantially faster than the token-matching function, processing a million rows in about 4 seconds.
By Thomas Keller - Wednesday, September 19, 2012 2:32 AM

@ChrisM@Work: Interesting approaches, but I'm wondering why you use DATALENGTH instead of LEN, are you trying to include trailing spaces? (In order to work with NVARCHAR instead of VARCHAR, you would need to divide DATALENGTH by two.)
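For illustration, the difference is easy to see:

DECLARE @v VARCHAR(10) = 'abc ', @n NVARCHAR(10) = N'abc ';
SELECT LEN(@v)        AS LenVarchar,      -- 3 (LEN ignores trailing spaces)
       DATALENGTH(@v) AS BytesVarchar,    -- 4 (one byte per character, trailing space included)
       LEN(@n)        AS LenNvarchar,     -- 3
       DATALENGTH(@n) AS BytesNvarchar;   -- 8 (two bytes per character)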
By dmusick - Wednesday, September 19, 2012 2:49 AM

I have been trying to think of good ways to get rid of the table variable. I was actually quite reluctant to use it at all, but I couldn't think of a better way to do it. I would have preferred to use arrays, but SQL doesn't have them. I may have to come up with a system for using string arrays to achieve what I need (like the good old days of programming!)
By ChrisM@Work - Wednesday, September 19, 2012 3:00 AM

Thomas Keller (9/19/2012)
@ChrisM@Work: Interesting approaches, but I'm wondering why you use DATALENGTH instead of LEN, are you trying to include trailing spaces? (In order to work with NVARCHAR instead of VARCHAR, you would need to divide DATALENGTH by two.)


Habit, Tom, nothing more.
By Thomas Keller - Wednesday, September 19, 2012 11:11 AM

@dmusick: That (replacing a table variable with string variables) is essentially what I've done in my Lim UDF. Since I only need to keep three rows, I only need three strings.

In my OpHTML UDF, I need to keep all the rows, so it still uses a table variable. In Confio Ignite, I can see the direct relationship between transactions/sec and executions of OpHTML (once per row in search results, and in history of search results). This is true even WITH SCHEMABINDING.

I have thought about replacing the table variable with one VARCHAR(MAX), or TEXT in SQL2000, but for most use cases, returning just a few rows from a search, it wouldn't make that much difference to tweak the UDF that's only used on the matched rows. (In my site, the history of search results was using too many resources, so I just limited it to 250 most recent.)
By Thomas Keller - Friday, September 21, 2012 10:36 AM

@all: Scripts updated 2012-09-21 from v1.0.0 to v1.1.0
Performance of [udfDamerauLevenshteinOpHTML] has been improved, by doing one less INSERT and one less SELECT on the table variable. (The values of @Pos1 and @theOps at the end of the first loop are proper to start the second loop, so they do not need to be saved and reloaded.) All other changes are only in comments.
By fmendes - Saturday, September 22, 2012 3:46 PM

Out of interest, what other fuzzy search methods do people use?


Metaphone comes to mind.
By Thomas Keller - Sunday, September 23, 2012 9:59 AM

@Martin Bastable @fmendes Yes, the algorithms I've heard of most in a SQL context (using extended stored procedures) include Jaro-Winkler and Metaphone, so I've linked those among others on my Reference page.

Note also the SimMetrics link to Java implementations of various algorithms provided by @Davin21.

But as far as I know, mine is the first published Transact-SQL implementation of a generic string-distance algorithm.
By rajesh R-424835 - Monday, September 24, 2012 7:38 AM

This is awesome. You really hit it!

I took it, modified it, and ran it against my database of persons' names. If I search for "Snith", it is not suggesting "Smith"!
It seems to have some limitation on the size of the columns being searched.
By Amy.G - Monday, September 24, 2012 7:44 AM

Is there an algorithm out there similar to this that will discern from people's first names what gender they are? This may be asking a bit much, but I thought it wouldn't hurt to ask.
By Thomas Keller - Tuesday, September 25, 2012 9:14 AM

@Amy.G I believe you need a database of first names, because the most an algorithm could do for you is to normalize the names. I googled "infer gender from first name" and found several useful pages including https://sites.google.com/site/facebooknamelist/namelist
By Thomas Keller - Tuesday, September 25, 2012 9:28 AM

@rajesh R-424835: Thanks for saying it's awesome, although you had an issue! The only limitation on the size of the column being searched is that only NVARCHAR(253) is passed into the UDF.

But the assumption is that you are searching a column that only stores one search term per row.

If you are trying to find misspelled terms inside a longer string, you need to extract the terms first, as described in Todd Fifield's article here.

Since you mention "database of person's name", do you have the first and last names separated? If you have a row with "Smith" in a column, and you use my code to search for "Snith" on that column, it should certainly find it.

If you'd like me to help further, please post more details of the SQL you are trying. If you want to do that privately, you can click the green "Make a Suggestion" tab on the right side of http://fuzzy-string.com/, then click Contact Support.
By Neha05 - Tuesday, December 25, 2012 1:38 AM

Good article.
By jobudzin - Friday, January 10, 2014 1:25 AM

What exactly do you mean in your article under the Confio Ignite graph picture where you state "Comments like "--Ignite=" are included in my code, to make it easy to name the individual SQL statements in the graph."? We use Ignite at our company and the only way I know how to name the specific SQL Hash that they provide is to insert a record into their table. Are you saying if the developer puts that in the comments of the stored procedure or view that it will be displayed that way? Just curious.
By Nadrek - Friday, January 10, 2014 2:41 AM

I would absolutely start with "cleaning" up data, followed by exact search on the cleaned version, followed by a double metaphone implementation, since that simply generates one or two possible values for any given input string - if you choose, you can save (and index) these - once generated, they're purely 1-3 equality lookups, which can be indexed.

Pure T-SQL Double Metaphone implementation:
http://www.sqlservercentral.com/scripts/Miscellaneous/30219/
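As a sketch of the save-and-index idea (hypothetical only: dbo.Contact and dbo.DoubleMetaphonePrimary are made-up names standing in for your table and for whatever the linked script exposes):

ALTER TABLE dbo.Contact ADD LastNameMetaphone VARCHAR(10) NULL;

UPDATE dbo.Contact
SET LastNameMetaphone = dbo.DoubleMetaphonePrimary(LastName);

CREATE INDEX IX_Contact_LastNameMetaphone ON dbo.Contact (LastNameMetaphone);

-- Later, a "sounds like" search becomes an indexed equality lookup.
SELECT ContactID, LastName
FROM dbo.Contact
WHERE LastNameMetaphone = dbo.DoubleMetaphonePrimary('Smyth');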


Following double metaphone, if you must get into CROSS JOIN-type compares of string A against string B, for names (and addresses) I prefer to go to Jaro-Winkler; its use by the U.S. Census Bureau is a significant reference.

Pure T-SQL Jaro-Winkler implementation:
http://www.sqlservercentral.com/articles/Fuzzy+Match/65702/
By Thomas Keller - Monday, January 13, 2014 7:38 AM

@jobudzin, I love Confio Ignite, but I quickly realized the pain of naming many SQL statements legibly... and then re-naming them, without duplication, when they were recompiled. It's a completely manual process in their standard product, unfortunately.

So I started putting "--Ignite" followed by a short legible text near the beginning of each SQL statement of interest, and "--Ignite" followed by a version number near the beginning of each SQL procedure. This made the manual process more of a cut and paste.

Finally, with some help from Confio Support, I came up with a SQL script that I can run after each recompile, to automatically name all the hashes according to those comments.

If you are interested, I shall post it on SSC.

@Nadrek, excellent references when dealing with names. But I needed something more "exact" for VIN matching.
By swoozie - Thursday, January 23, 2014 6:19 AM

I really enjoyed the article but had some questions. You had mentioned the potential use of an Inline Table-Valued UDF and a Recursive CTE; I was wondering if you had actually done anything with that?

Also, I am working on altering the information provided to handle a cross-referencing task that I have, and before I reinvent the wheel, I thought I might inquire whether you or someone else has used a method similar to this for cross-referencing data.
By jobudzin - Thursday, January 23, 2014 7:09 AM

I am very interested if you could post the script. I worked with Confio support and have a similar script, but nothing that takes into account the comments of the actual stored procedure itself. Thanks!
By Thomas Keller - Friday, January 24, 2014 4:44 PM

@swoozie, I have not put further effort into "achieving the nirvana of an Inline Table-Valued UDF using a Recursive CTE (Common Table Expression) with zero logical reads" for DLD, and I haven't heard of anyone else doing it yet.

Regarding cross-referencing, I assume you mean finding the items in List A that are not found in List B with a simple inner join, but which are only a few typos off from items in List B. (If you are cross-referencing keywords in articles for relevance matching, then the lists could be the keywords in two articles, with appearance counts.)

Below is a working example, in which I find the words that have been searched for on fuzzy-string.com, which were not found in the dictionary, but are only 1 typo off from dictionary words. (Of course, without using the table in which I have already saved the results of the UDF, so you can see how to use the UDF.)

Step 2 would be to repeat the process, allowing 2 typos, after saving the results from Step 1 and eliminating them from List A. (No need to find 2-typo matches for a word if you already found 1-typo matches for it.) This can be done in a loop where @iThreshold goes from 1 to the maximum difference you want to allow; a sketch of such a loop follows the example below.

Remember that DLD is measured in typos; if you need a percentage similarity, you may be better off using the Jaro-Winkler algorithm.

The "XRef" query is the important one that uses the UDF, but the "ListA" query is also important since it eliminates the exact matches, and caches the lengths. Note the --Ignite comments, which will help me when the long-running statements show up in Confio.

If your lists are long, this can take a while. On fuzzy-string.com, searching one word against a hundred thousand can take a second or two, so cross-referencing a hundred thousand to a hundred thousand could take a day or two. This is why I included the TOP 1000, to limit the time taken for this example to a minute or two.


BEGIN -- Ignite=NonWdSrch v1.0.0
DECLARE @vcCosts VARCHAR(253) -- Ignite=NonWdSrch Declare
, @iThreshold INT
;
CREATE TABLE #ListA_NotInListB ( -- Ignite=NonWdSrch CreateA
[idWord] BIGINT NOT NULL PRIMARY KEY
, [vcWord] NVARCHAR(150) NOT NULL UNIQUE
, [biSearches] BIGINT NOT NULL
, [iLen] INT NOT NULL
)
;
CREATE TABLE #ListB ( -- Ignite=NonWdSrch CreateB
[idWord] BIGINT NOT NULL PRIMARY KEY
, [vcWord] NVARCHAR(150) NOT NULL UNIQUE
, [biSearches] BIGINT NOT NULL
)
;
SELECT @iThreshold = 1 -- Ignite=NonWdSrch Init
, @vcCosts = [usr].[udfDamerauLevenshteinRow0] (@iThreshold)
;
WITH ListA_NotInListB AS ( -- Ignite=NonWdSrch ListA
SELECT A.[idWord], COUNT_BIG(*) AS biSearches
FROM [usr].[tblSearch] A
LEFT JOIN [usr].[tblWordTWL] B
ON B.[idWord] = A.[idWord]
WHERE B.[idWord] IS NULL
GROUP BY A.[idWord]
) INSERT INTO #ListA_NotInListB (
[idWord], [vcWord], [biSearches], [iLen]
) SELECT TOP 1000 -- limit time taken for example
A.[idWord], LkpA.[vcLookupString], A.[biSearches]
, LEN( LkpA.[vcLookupString]) AS [iLen]
FROM ListA_NotInListB A
JOIN [usr].[tblLookupString] LkpA
ON LkpA.[idLookupString] = A.[idWord]
ORDER BY LkpA.[vcLookupString]
;
INSERT INTO #ListB ( -- Ignite=NonWdSrch ListB
[idWord], [vcWord], [biSearches]
) SELECT TOP 1000 -- limit time taken for example
B.[idWord], LkpB.[vcLookupString], InfoB.[biSearches]
FROM [usr].[tblWordTWL] B
JOIN [usr].[tblLookupString] LkpB
ON LkpB.[idLookupString] = B.[idWord]
JOIN [usr].[tblWordLenTWL] InfoB
ON InfoB.[idWord] = B.[idWord]
ORDER BY LkpB.[vcLookupString]
;
WITH Comp AS ( -- Ignite=NonWdSrch XRef
SELECT A.[idWord] AS [idWordA], B.[idWord] AS [idWordB]
, [usr].[udfDamerauLevenshteinLim] (B.[vcWord], A.[vcWord]
, @iThreshold, A.[iLen], @vcCosts) AS [iDLDistance]
FROM #ListA_NotInListB A CROSS JOIN #ListB B
) SELECT C.[idWordA], A.[vcWord] AS [vcWordA]
, A.[biSearches] AS [biSearchesA], C.[iDLDistance]
, C.[idWordB], B.[vcWord] AS [vcWordB]
, B.[biSearches] AS [biSearchesB]
FROM Comp C
JOIN #ListA_NotInListB A
ON A.[idWord] = C.[idWordA]
JOIN #ListB B
ON B.[idWord] = C.[idWordB]
WHERE C.[iDLDistance] = @iThreshold
ORDER BY A.[vcWord], B.[vcWord]
;
DROP TABLE #ListB; -- Ignite=NonWdSrch DropB
DROP TABLE #ListA_NotInListB; -- Ignite=NonWdSrch DropA
END
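
A sketch of that Step 2+ loop (not part of the original example; it assumes the #ListA_NotInListB and #ListB temp tables above are already populated, adds a hypothetical #Matched table, and reuses the UDF signatures shown above) could look like:

CREATE TABLE #Matched (
  [idWordA] BIGINT NOT NULL
, [idWordB] BIGINT NOT NULL
, [iDLDistance] INT NOT NULL
);
DECLARE @iThreshold INT = 1
      , @iMaxThreshold INT = 3
      , @vcCosts VARCHAR(253);

WHILE @iThreshold <= @iMaxThreshold
BEGIN
    SET @vcCosts = [usr].[udfDamerauLevenshteinRow0] (@iThreshold);

    WITH Comp AS (
        SELECT A.[idWord] AS [idWordA], B.[idWord] AS [idWordB]
             , [usr].[udfDamerauLevenshteinLim] (B.[vcWord], A.[vcWord]
             , @iThreshold, A.[iLen], @vcCosts) AS [iDLDistance]
        FROM #ListA_NotInListB A CROSS JOIN #ListB B
    )
    INSERT INTO #Matched ([idWordA], [idWordB], [iDLDistance])
    SELECT [idWordA], [idWordB], [iDLDistance]
    FROM Comp
    WHERE [iDLDistance] = @iThreshold;

    -- Words already matched at this threshold need no further passes.
    DELETE A
    FROM #ListA_NotInListB A
    WHERE EXISTS (SELECT 1 FROM #Matched M WHERE M.[idWordA] = A.[idWord]);

    SET @iThreshold = @iThreshold + 1;
END;

The final display query joining back to the word tables would then run once at the end against #Matched.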

By Thomas Keller - Friday, January 24, 2014 5:13 PM

@jobudzin, I have submitted my "NameSQLinConfioIgnite" script to SSC, will advise when it is published. I just ran it to capture the usage of my example in the previous post, and it named these SQL statements that had taken enough milliseconds for Ignite to notice:

NonWdSrch CreateB v1.0.0
NonWdSrch ListA v1.0.0
NonWdSrch XRef v1.0.0
By Thomas Keller - Monday, January 27, 2014 5:02 AM

@jobudzin, my "NameSQLinConfioIgnite" script is scheduled to publish on SSC Feb 5-6. If you want it sooner, you can send me an email from my website http://www.tomkellerconsulting.com/ (linked from my profile here, or from the bottom of fuzzy-string.com).
By swoozie - Monday, January 27, 2014 5:05 AM

Thanks!
By jobudzin - Monday, January 27, 2014 7:34 AM

No rush. That works for me. Thanks! Appreciate the help.
By Thomas Keller - Wednesday, February 12, 2014 11:39 PM

@jobudzin, my NameSQLinConfioIgnite script did get published at http://www.sqlservercentral.com/scripts/Performance+Tuning/106514/. Enjoy, and please let me know if you have success or any issue with it.
By jobudzin - Thursday, February 13, 2014 3:46 AM

@Thomas, that is the exact same script, word for word, that Confio support provided me. I was wondering if they were going to be different, but I guess not. This works for the most part, but I am going to tweak it to remove the numbering, as I don't think I need the names to be unique - they refer to the same stored proc (if applicable), just different parts. It has its advantages, but for now I just want to focus at a higher level on which stored procedures are having issues; then I can dig deeper if needed to see the particular parts of the stored procedure that are causing the most performance issues. Thanks again!