Nasty Fast N-Grams (Part 1): Character-Level Unigrams

  • Ed Wagner

    SSC Guru

    Points: 286979

    Very nice work, Alan. I think you explained the concept really well and used some very good examples. I have to tell you that it reminds me of a roll-your-own index to support leading wildcards in a LIKE operator.

    I'm definitely looking forward to the rest of the series.

  • Alan Burstein

    SSC Guru

    Points: 61079

    Ed Wagner (7/8/2016)


    Very nice work, Alan. I think you explained the concept really well and used some very good examples. I have to tell you that it reminds me of a roll-your-own index to support leading wildcards in a LIKE operator.

    I'm definitely looking forward to the rest of the series.

    Thanks a lot Ed!

    "I can't stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Jason A. Long

    SSC-Insane

    Points: 23649

    Alan - I know I'm a little late to the party (can't believe I'm finding this just now), excellent work sir! 
    I haven't had the opportunity to give it the attention it deserves... I plan on doing that (and reading the full series) this evening.

  • Alan Burstein

    SSC Guru

    Points: 61079

    Jason A. Long - Monday, November 6, 2017 11:46 AM

    Alan - I know I'm a little late to the party (can't believe I'm finding this just now), excellent work sir! 
    I haven't had the opportunity to give it the attention it deserves... I plan on doing that (and reading the full series) this evening.

    Thanks a lot Jason! 

    If you read Part 1 you're all caught up. Being a new dad has dug into my writing time but I'm trying my best to get the next couple installments out by the end of the year.


  • Jason A. Long

    SSC-Insane

    Points: 23649

    Alan.B - Tuesday, November 7, 2017 8:51 PM


    Thanks a lot Jason! 

    If you read Part 1 you're all caught up. Being a new dad has dug into my writing time but I'm trying my best to get the next couple installments out by the end of the year.

    I had no idea... Congratulations on the new member of the family!
    The kudos are well deserved, my friend. Of course, you also gave me a kick in the backside to go back and re-investigate a previously abandoned idea for the table-less working days function.
    I figured out how to compress a full date (at least the ones in my working range) all the way down to BINARY(2) and bring it back without loss of information...
    Then I do something similar to the NGrams approach: use a cte_tally to slide along the concatenated binary and find the min & max ROW_NUMBER() values between the begin and end dates.
    Of course, the tally has me back to square one with the Cartesian product against the outer table, but I'm still beating it down, one step at a time...
    I need to reread this to see what, if anything, you're doing to deal with the tally/outer table thing...
    In any case, great work!
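
The date-compression idea described in this post could look roughly like the sketch below. It is written in Python rather than T-SQL only to keep it self-contained, and it is not the poster's actual code. The base date, the big-endian day-offset encoding, and the names `BASE_DATE`, `pack_date`, `unpack_date`, and `working_days_between` are all assumptions made for illustration. Each working day is stored as a 2-byte day offset, the values are concatenated into one binary string, and a tally of positions slides along it in 2-byte steps to find the first and last positions that fall inside the requested range.

```python
from datetime import date, timedelta

# Assumed start of the "working range"; 2 bytes cover 65,536 days (~179 years).
BASE_DATE = date(1990, 1, 1)


def pack_date(d):
    """Compress a date to 2 bytes as a big-endian day offset from BASE_DATE."""
    offset = (d - BASE_DATE).days
    if not 0 <= offset <= 0xFFFF:
        raise ValueError("date is outside the 2-byte working range")
    return offset.to_bytes(2, "big")


def unpack_date(b):
    """Restore the original date from its 2-byte form (no information lost)."""
    return BASE_DATE + timedelta(days=int.from_bytes(b, "big"))


def working_days_between(blob, begin, end):
    """Count working days in [begin, end] from a concatenated 2-byte blob.

    The range() plays the role of the tally: position n addresses the
    n-th 2-byte slice. The blob must hold working days in ascending order,
    so the count is simply max(position) - min(position) + 1.
    """
    positions = [
        n
        for n in range(1, len(blob) // 2 + 1)
        if begin <= unpack_date(blob[(n - 1) * 2 : n * 2]) <= end
    ]
    return max(positions) - min(positions) + 1 if positions else 0
```

A possible usage, again with made-up data: build the blob from the weekdays of one month with `b"".join(pack_date(d) for d in days)` and call `working_days_between(blob, begin, end)`. Note that this sketch still scans every position for each call, which is the same tally-versus-outer-table cost mentioned above; it illustrates the encoding, not a fix for that problem.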

  • lauri.pietarinen

    SSC Veteran

    Points: 236

    Hi,

    there has, in fact, been some research on fast q-grams with SQL; see, for instance:

    http://www.cs.columbia.edu/~gravano/Papers/2001/deb01.pdf

    Lauri

  • lauri.pietarinen

    SSC Veteran

    Points: 236

  • Alan Burstein

    SSC Guru

    Points: 61079

    Thanks for posting this, Lauri. I had written you a longer reply, but my browser crashed. I just read both of these; very interesting stuff!


  • lauri.pietarinen

    SSC Veteran

    Points: 236

    Hi Alan,

    thanks for reading them. It would have been nice to see your longer answer.

    Microsoft has (had?) a research group on data cleaning, https://www.microsoft.com/en-us/research/project/data-cleaning/,
    which resulted in the fuzzy-matching function in SSIS; it is also available as an add-in for Excel. The group's research papers reference the papers I mentioned in my post.

    The SSIS function actually works quite well. I used it in a project to match customers by name and address, and it gave good results when used iteratively and interactively.

    Lauri

