virtually nothing about the subject in SQL Server
Except for Integration Services, that is. Fuzzy Lookup and Fuzzy Grouping use q-grams: n-grams with a distance component.
see Fuzzy Lookup and Fuzzy Grouping in SQL Server Integration Services 2005
It's been around for a while. These components also use the Jaccard index and Levenshtein distance.
MDS and DQS also leverage these algorithms.
see mdq.Similarity (Transact-SQL)
Then there's the SSIS Data Profiling task, which has been using these techniques for a while.
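For anyone who hasn't met these measures before, the core ideas are small enough to sketch in a few lines. This is Python purely for illustration (not what the SSIS components run internally): n-grams are the contiguous length-N substrings, the Jaccard index compares two strings' n-gram sets, and Levenshtein distance is the classic edit-distance DP.

```python
def ngrams(s, n):
    """All contiguous substrings of length n (the tokens q-grams build on)."""
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def jaccard(a, b, n=2):
    """Jaccard index over n-gram sets: |A & B| / |A | B|."""
    A, B = set(ngrams(a, n)), set(ngrams(b, n))
    return len(A & B) / len(A | B)

def levenshtein(a, b):
    """Edit distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

print(ngrams("server", 2))          # ['se', 'er', 'rv', 've', 'er']
print(jaccard("fuzzy", "fuzy", 2))  # 0.75
print(levenshtein("kitten", "sitting"))  # 3
```

The q-gram trick the fuzzy components rely on is that similar strings share most of their n-grams, so set overlap approximates edit distance far more cheaply than computing it exactly.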
Thanks for the comment.
Yep - I got the idea to write a purely set-based T-SQL N-Grams function while playing around with mdq.ngrams (which I referenced in the article). Not everyone uses or can use SSIS/DQS/MDS, and the only N-Grams iTVF returned by a Google search for "t-sql n-grams" or "ngrams in sql" is an older version of the function I write about in this article. That was part of the motivation for writing this article: to expose the concept of N-Grams to a wider audience.
Really? I couldn't spot that reference (even after cleaning my glasses!)
From the article:
Anyhow, I've gotta believe that the mdq version (a CLR, is it not?) is gonna beat the pants off a T-SQL function.
Nope. Part of the problem (and I'll get into this more in Part 3) with mdq.ngrams is the added cost of the padding functionality. For a string of length L, the padded output of mdq.ngrams is L+(N-1) tokens versus L-(N-1) for NGrams8K (or the variations of it). I originally included test times with parallel execution plans (using make_parallel) and also included mdq.NGrams when I first submitted the article, but it was too long so I edited those results out. Here are the original test results (the SQLCLR version is mdq.ngrams):
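As an aside on the formula above, the padding overhead is easy to check on paper. A quick Python sketch (assuming padding means N-1 filler characters appended to each end of the string, which is what the L+(N-1) figure implies):

```python
def ngram_count(length, n, padded=False):
    """Token count for n-grams over a string of the given length.
    Padding adds n-1 filler characters to each end before tokenizing."""
    if padded:
        length += 2 * (n - 1)
    return max(length - (n - 1), 0)

# Trigrams (N=3) over a 10-character string:
print(ngram_count(10, 3))               # 8  -> L-(N-1), NGrams8K
print(ngram_count(10, 3, padded=True))  # 12 -> L+(N-1), padded mdq.ngrams
```

So for short grams the padded version emits a couple of extra rows per string; across millions of rows that difference adds up, which is the cost being described here.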
I also excluded a second test where I compared NGrams8K to the loop version, the rCTE version and mdq.NGrams for a specialized "Remove Duplicate Characters" function:
CREATE FUNCTION dbo.RemoveDupes8K(@string varchar(8000), @preserved varchar(100))
RETURNS TABLE WITH SCHEMABINDING AS RETURN
SELECT CleanedString =
( SELECT token+''
FROM dbo.NGrams8K(@string,1) -- break the string into unigrams (token, position)
WHERE token <> SUBSTRING(@string,position+1,1) -- exclude chars equal to the next char
OR token LIKE @preserved -- preserve characters that match the @preserved pattern
ORDER BY position
FOR XML PATH(''),TYPE
).value('.','varchar(8000)'); -- using Wayne Sheffield’s concatenation logic
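The logic that query computes (keep a character when it differs from the one after it, or when it matches the preserved pattern) can be sketched procedurally. This Python version is only a conceptual model; it approximates the single-character LIKE pattern with a plain character set, which is an assumption for illustration:

```python
def remove_dupes(s, preserved=""):
    """Collapse runs of repeated characters, always keeping characters
    that appear in `preserved` (stands in for the LIKE @preserved test)."""
    out = []
    for i, ch in enumerate(s):
        nxt = s[i + 1] if i + 1 < len(s) else None  # SUBSTRING(@string, position+1, 1)
        if ch != nxt or ch in preserved:            # token <> next char OR token LIKE @preserved
            out.append(ch)
    return "".join(out)

print(remove_dupes("aabbbc"))       # 'abc'
print(remove_dupes("aabbbc", "b"))  # 'abbbc' (runs of b are preserved)
```

The set-based T-SQL version does the same comparison for every position at once via the unigram table, then reassembles the survivors in position order with FOR XML PATH.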
Here are the results of that test:
I suspect that this could be done slightly faster with a better CLR than mdq.ngrams, which was part of the motivation behind this post. The LINQ version Con Alexis posted, however, was notably slower.
"I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."
-- Itzik Ben-Gan 2001