Nasty Fast N-Grams (Part 1): Character-Level Unigrams

  • Alan Burstein

    SSC Guru

    Points: 61067

    Comments posted to this topic are about the item Nasty Fast N-Grams (Part 1): Character-Level Unigrams

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • David.Poole

    SSC Guru

    Points: 75199

    Good read. I'm looking forward to the rest of the series.

    Is the CTE tally table to encapsulate the tally table in a single function for when there isn't already one within the database?

    Given that a tally table has so many uses and even with a million records they are tiny I tend to have a physical tally table in my databases.

    In fact they are so useful I put them in the model database so any new databases automatically gain a pre-populated tally table.

    I've not used Azure, but I know there are limitations on what you can do with system tables. I do a cross apply on sys.columns to generate the ROW_NUMBER() values that populate the tally table. Saves typing and copy/pasting all those NULLs.
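    Something along these lines (a rough sketch of the idea, not my exact code; I've used sys.all_columns and a plain CROSS JOIN here so it works even in a near-empty database):

    CREATE TABLE dbo.Tally (N int NOT NULL PRIMARY KEY CLUSTERED);

    -- sys.all_columns holds thousands of rows even in a fresh database,
    -- so a self cross join easily covers the million rows we want
    INSERT INTO dbo.Tally (N)
    SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM sys.all_columns a
    CROSS JOIN sys.all_columns b;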

  • g.britton

    SSChampion

    Points: 13686

    virtually nothing about the subject in SQL Server

    Except for Integration Services that is. Fuzzy Lookup and Fuzzy Grouping use q-grams: n-grams with a distance component.

    see Fuzzy Lookup and Fuzzy Grouping in SQL Server Integration Services 2005

    It's been around for a while. These components also use Jaccard index and Levenshtein distance.

    MDS and DQS also leverage these algorithms.

    see mdq.Similarity (Transact-SQL)

    Then there's the SSIS Data Profiling task, which has been using these techniques for a while.
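    To give a feel for the Jaccard side of it: the index is just the size of the intersection of two token sets divided by the size of their union. A rough T-SQL sketch using the article's NGrams8K (purely illustrative; the Fuzzy components' internals are more involved than this):

    DECLARE @s1 varchar(8000) = 'gerald', @s2 varchar(8000) = 'gerard';

    WITH A AS (SELECT DISTINCT token FROM dbo.NGrams8K(@s1, 1)),
         B AS (SELECT DISTINCT token FROM dbo.NGrams8K(@s2, 1))
    SELECT Jaccard = -- intersection count divided by union count
      (SELECT COUNT(*) FROM A JOIN B ON A.token = B.token) * 1.0 /
      (SELECT COUNT(*) FROM (SELECT token FROM A UNION SELECT token FROM B) u);

    Here that returns 5/6 (about 0.833), since the two names share five of their six distinct characters.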

    Gerald Britton, MCSE-DP, MVP | Toronto PASS Chapter

  • Eirikur Eiriksson

    SSC Guru

    Points: 182367

    Nice piece and very good work Alan!

    😎

  • Y.B.

    SSChampion

    Points: 11534

    Excellent article! While I've had no need for this yet, I can easily see how this could come in handy one day. Added to my briefcase...looking forward to the rest of the series.


    SELECT quote FROM brain WHERE original = 1
    0 rows returned

  • Alan Burstein

    SSC Guru

    Points: 61067

    David.Poole (6/23/2016)


    Good read. I'm looking forward to the rest of the series.

    Is the CTE tally table to encapsulate the tally table in a single function for when there isn't already one within the database?

    Thanks for the feedback, David. There are a couple of reasons for the CTE tally table. First, putting the logic into a CTE allows people to just copy/paste the code. Second, the CTE tally tends to be a little faster and certainly generates fewer reads. That said, a memory-optimized tally table seems to outperform the CTE tally table (something I did not mention in the article).
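    For reference, a minimal sketch of the memory-optimized variety (this assumes SQL Server 2014+ and a database with a MEMORY_OPTIMIZED_DATA filegroup; the table name is illustrative):

    CREATE TABLE dbo.TallyMO
    (
      N int NOT NULL PRIMARY KEY NONCLUSTERED -- range index supports ordered scans
    )
    WITH (MEMORY_OPTIMIZED = ON, DURABILITY = SCHEMA_AND_DATA);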

    Given that a tally table has so many uses and even with a million records they are tiny I tend to have a physical tally table in my databases.

    In fact they are so useful I put them in the model database so any new databases automatically gain a pre-populated tally table.

    Me too. Even though CTE tally tables are faster and don't produce any reads, there are some situations where a physical tally table is better. Part of the motivation of this article was also to promote tally tables.

    I've not used Azure, but I know there are limitations on what you can do with system tables. I do a cross apply on sys.columns to generate the ROW_NUMBER() values that populate the tally table. Saves typing and copy/pasting all those NULLs.

    I use the sys.columns method often. The downside, however, is that it creates a busier execution plan and can't be used in views and functions with SCHEMABINDING. A good numbers iTVF is always nice to have around. 😉
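    If you don't have one installed, here's a minimal schemabound sketch in the same spirit (not the exact function behind that link):

    CREATE FUNCTION dbo.GetNums(@rows bigint)
    RETURNS TABLE WITH SCHEMABINDING AS RETURN
    -- E1 = 10 rows, E3 = 1,000, E9 = 1,000,000,000; no base tables, so zero reads
    WITH E1(N) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t(N)),
         E3(N) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b CROSS JOIN E1 c),
         E9(N) AS (SELECT 1 FROM E3 a CROSS JOIN E3 b CROSS JOIN E3 c)
    SELECT TOP (@rows) N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM E9;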

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Alan Burstein

    SSC Guru

    Points: 61067

    g.britton (6/23/2016)


    virtually nothing about the subject in SQL Server

    Except for Integration Services that is. Fuzzy Lookup and Fuzzy Grouping use q-grams: n-grams with a distance component.

    see Fuzzy Lookup and Fuzzy Grouping in SQL Server Integration Services 2005

    It's been around for a while. These components also use Jaccard index and Levenshtein distance.

    MDS and DQS also leverage these algorithms.

    see mdq.Similarity (Transact-SQL)

    Then there's the SSIS Data Profiling task, which has been using these techniques for a while.

    Thanks for the comment.

    Yep - I got the idea to write a purely set-based T-SQL N-Grams function while playing around with mdq.NGrams (which I referenced in the article). Not everyone uses or can use SSIS/DQS/MDS, and the only N-Grams iTVF returned by a Google search for "t-sql n-grams" or "ngrams in sql" is an older version of the function I write about in this article. This was part of the motivation for writing it: to expose the concept of N-Grams to a wider audience.
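    For anyone who hasn't read the article yet, the function's shape is roughly this (a condensed sketch; the NGrams8K in the article adds parameter validation and edge-case handling that this omits):

    CREATE FUNCTION dbo.NGramsDemo(@string varchar(8000), @N int)
    RETURNS TABLE WITH SCHEMABINDING AS RETURN
    WITH E1(N) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t(N)),
         E4(N) AS (SELECT 1 FROM E1 a CROSS JOIN E1 b CROSS JOIN E1 c CROSS JOIN E1 d),
         iTally(N) AS
    (
      -- one row per N-gram: string length minus (N-1)
      SELECT TOP (DATALENGTH(@string) - (@N - 1))
             ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
      FROM E4 a CROSS JOIN E4 b
    )
    SELECT position = N,
           token    = SUBSTRING(@string, N, @N)
    FROM iTally;

    SELECT position, token FROM dbo.NGramsDemo('dog', 1); -- 3 rows: d, o, g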

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Alan Burstein

    SSC Guru

    Points: 61067

    Eirikur Eiriksson (6/23/2016)


    Nice piece and very good work Alan!

    😎

    Thank you, sir! 😎

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Alan Burstein

    SSC Guru

    Points: 61067

    Y.B. (6/23/2016)


    Excellent article! While I've had no need for this yet, I can easily see how this could come in handy one day. Added to my briefcase...looking forward to the rest of the series.

    Thanks a lot YB!

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • g.britton

    SSChampion

    Points: 13686

    Alan.B (6/23/2016)


    g.britton (6/23/2016)


    virtually nothing about the subject in SQL Server

    Except for Integration Services that is. Fuzzy Lookup and Fuzzy Grouping use q-grams: n-grams with a distance component.

    see Fuzzy Lookup and Fuzzy Grouping in SQL Server Integration Services 2005

    It's been around for a while. These components also use Jaccard index and Levenshtein distance.

    MDS and DQS also leverage these algorithms.

    see mdq.Similarity (Transact-SQL)

    Then there's the SSIS Data Profiling task, which has been using these techniques for a while.

    Thanks for the comment.

    Yep - I got the idea to write a purely set-based T-SQL N-Grams function while playing around with mdq.NGrams (which I referenced in the article). Not everyone uses or can use SSIS/DQS/MDS, and the only N-Grams iTVF returned by a Google search for "t-sql n-grams" or "ngrams in sql" is an older version of the function I write about in this article. This was part of the motivation for writing it: to expose the concept of N-Grams to a wider audience.

    Really? I couldn't spot that reference (even after cleaning my glasses!). Anyhow, I've gotta believe that the mdq version (a CLR, is it not?) is gonna beat the pants off a T-SQL function. Then there's the Similarity function; even better, I think.

    Gerald Britton, MCSE-DP, MVP | Toronto PASS Chapter

  • g.britton

    SSChampion

    Points: 13686

    A little off-topic, but Master Data Services comes with CLR-functions that do N-Grams and other nice things. Here's a reference on that:

    Regular Expressions, advanced string matching and new Split function in SQL Server 2008 R2

    to whet your appetite, the functions include:

    mdq.NGrams

    mdq.RegexReplace

    mdq.RegexExtract

    mdq.RegexSplit

    mdq.RegexIsMatch

    mdq.Similarity

    mdq.RegexIsValid

    mdq.SimilarityDate

    mdq.RegexMask

    mdq.Split

    They pretty much do what their names imply.

    Gerald Britton, MCSE-DP, MVP | Toronto PASS Chapter

  • Alan Burstein

    SSC Guru

    Points: 61067

    g.britton (6/23/2016)


    Alan.B (6/23/2016)


    g.britton (6/23/2016)


    virtually nothing about the subject in SQL Server

    Except for Integration Services that is. Fuzzy Lookup and Fuzzy Grouping use q-grams: n-grams with a distance component.

    see Fuzzy Lookup and Fuzzy Grouping in SQL Server Integration Services 2005

    It's been around for a while. These components also use Jaccard index and Levenshtein distance.

    MDS and DQS also leverage these algorithms.

    see mdq.Similarity (Transact-SQL)

    Then there's the SSIS Data Profiling task, which has been using these techniques for a while.

    Thanks for the comment.

    Yep - I got the idea to write a purely set-based T-SQL N-Grams function while playing around with mdq.NGrams (which I referenced in the article). Not everyone uses or can use SSIS/DQS/MDS, and the only N-Grams iTVF returned by a Google search for "t-sql n-grams" or "ngrams in sql" is an older version of the function I write about in this article. This was part of the motivation for writing it: to expose the concept of N-Grams to a wider audience.

    Really? I couldn't spot that reference (even after cleaning my glasses!).

    From the article:

    Even Microsoft Data Quality Services includes an N-Grams CLR function.

    Anyhow, I've gotta believe that the mdq version (a CLR, is it not?) is gonna beat the pants off a T-SQL function.

    Nope. Part of the problem (and I'll get into this more in Part 3) with mdq.NGrams is the added cost of the padding functionality. The formula for the number of tokens returned by mdq.NGrams is L+(N-1), versus L-(N-1) for NGrams8K (or the variations of it), where L is the input length and N is the gram size. I originally included test times with parallel execution plans (using make_parallel) and also included mdq.NGrams when I first submitted the article, but it was too long, so I edited those results out. Here are the original test results (the SQLCLR version is mdq.NGrams):
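    To put concrete numbers on that: for a 7-character input with N = 2, a padded implementation returns 7+(2-1) = 8 tokens, while NGrams8K returns 7-(2-1) = 6. Those extra rows add up across millions of strings.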

    I also excluded a second test where I compared NGrams8K to the loop version, the rCTE version, and mdq.NGrams for a specialized "remove duplicate characters" function:

    CREATE FUNCTION dbo.RemoveDupes8K(@string varchar(8000), @preserved varchar(100))
    RETURNS TABLE WITH SCHEMABINDING AS RETURN
    SELECT CleanedString =
    (
      SELECT token+''
      FROM dbo.NGrams8K(@string,1)
      WHERE token <> SUBSTRING(@string,position+1,1) -- exclude chars equal to the next char
         OR token LIKE @preserved                    -- preserve characters that match the @preserved pattern
      FOR XML PATH(''),TYPE
    ).value('.','varchar(8000)'); -- using Wayne Sheffield's concatenation logic
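    For illustration, a made-up call (assuming NGrams8K is installed):

    SELECT CleanedString FROM dbo.RemoveDupes8K('aaabbbccc', '[x]'); -- returns 'abc'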

    Here are the results of that test:

    I suspect that this could be done slightly faster with a better CLR than mdq.NGrams, which was the motivation behind this post. The LINQ version Con Alexis posted, however, was notably slower.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • g.britton

    SSChampion

    Points: 13686

    Alan.B (6/23/2016)[snip]

    Thanks for all the detail, Alan. It really helps. I suspect the mdq.NGrams function could easily be improved (it could be multi-threaded, for example), though that's probably not a high priority for the team. I'm also thinking towards matching and lookup using n-grams (and other, better algorithms). For that goal, I'd recommend the mdq.Similarity function over anything in pure T-SQL, since it offers more algorithm choices.

    I'll probably stick with doing most of the work in SSIS with the fuzzy matching transformations or in MDS/DQS for other stuff.

    YMMV!

    Gerald Britton, MCSE-DP, MVP | Toronto PASS Chapter

  • johnmackintosh

    SSC Rookie

    Points: 46

    Thanks for this, highly informative post.

    I had no idea this was possible in SQL Server.

    I've done some text mining work with n-grams in R, but I imagine it will be much more scalable if SQL Server is doing the heavy lifting.

    Very excited to see the next posts in this series

  • Alan Burstein

    SSC Guru

    Points: 61067

    johnmackintosh (6/23/2016)


    Thanks for this, highly informative post.

    I had no idea this was possible in SQL Server.

    I've done some text mining work with n-grams in R, but I imagine it will be much more scalable if SQL Server is doing the heavy lifting.

    Very excited to see the next posts in this series

    Thanks John! (I'm sorry I missed this post).

    I had no idea this was possible in SQL Server.

    That was the point of the article. 😛

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001
