Nasty Fast N-Grams (Part 1): Character-Level Unigrams

  • Very nice work, Alan. I think you explained the concept really well and used some very good examples. I have to tell you that it reminds me of a roll-your-own index to support leading wildcards in a LIKE operator.

    I'm definitely looking forward to the rest of the series.

  • Ed Wagner (7/8/2016)


    Very nice work, Alan. I think you explained the concept really well and used some very good examples. I have to tell you that it reminds me of a roll-your-own index to support leading wildcards in a LIKE operator.

    I'm definitely looking forward to the rest of the series.

    Thanks a lot, Ed!

    "I can't stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Alan - I know I'm a little late to the party (can't believe I'm finding this just now), excellent work sir! 
    I haven't had the opportunity to give it the attention it deserves... I plan on doing that (and reading the full series) this evening.

  • Jason A. Long - Monday, November 6, 2017 11:46 AM

    Alan - I know I'm a little late to the party (can't believe I'm finding this just now), excellent work sir! 
    I haven't had the opportunity to give it the attention it deserves... I plan on doing that (and reading the full series) this evening.

    Thanks a lot, Jason!

    If you read Part 1 you're all caught up. Being a new dad has dug into my writing time but I'm trying my best to get the next couple installments out by the end of the year.

    "I can't stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Alan.B - Tuesday, November 7, 2017 8:51 PM

    Jason A. Long - Monday, November 6, 2017 11:46 AM

    Alan - I know I'm a little late to the party (can't believe I'm finding this just now), excellent work sir! 
    I haven't had the opportunity to give it the attention it deserves... I plan on doing that (and reading the full series) this evening.

    Thanks a lot, Jason!

    If you read Part 1 you're all caught up. Being a new dad has dug into my writing time but I'm trying my best to get the next couple installments out by the end of the year.

    I had no idea... Congratulations on the new member of the family!
    The kudos are well deserved, my friend. Of course, you also gave me a kick in the backside to go back and re-investigate a previously abandoned idea for the table-less working-days function.
    I figured out how to compress a full date (at least the ones in my working range) all the way down to BINARY(2) and bring it back w/o loss of information...
    Then I'm doing something similar to the NGrams approach: using a cte_tally to slide along the concatenated binary and find the min & max ROW_NUMBER() values between the begin and end dates.
    Of course, the tally has me back to square one with the Cartesian product against the outer table, but I'm still beating it down, one step at a time...
    I need to reread this to see what, if anything, you're doing to deal with the tally/outer-table issue...
    In any case, great work!
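The BINARY(2) idea above can be sketched outside of T-SQL as well: store each date as a 2-byte unsigned day offset from a fixed base date, which round-trips losslessly for any "working range" of up to 65,536 days (about 179 years). This is a minimal Python illustration of that encoding; the base date of 1990-01-01 is an assumption, not from the original post.

```python
import datetime
import struct

# Assumed anchor for the working range -- pick any date at or before your earliest data.
BASE = datetime.date(1990, 1, 1)

def encode_date(d: datetime.date) -> bytes:
    """Pack a date as a 2-byte big-endian day offset from BASE (like BINARY(2))."""
    days = (d - BASE).days
    if not 0 <= days <= 0xFFFF:
        raise ValueError("date outside the 2-byte working range")
    return struct.pack(">H", days)

def decode_date(b: bytes) -> datetime.date:
    """Recover the original date from its 2-byte offset -- no information lost."""
    return BASE + datetime.timedelta(days=struct.unpack(">H", b)[0])

d = datetime.date(2017, 11, 6)
packed = encode_date(d)
assert len(packed) == 2 and decode_date(packed) == d  # lossless round trip
```

Concatenating many such 2-byte values gives the fixed-width binary string that a tally can slide along, two bytes at a time.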

  • Hi,

    There has, in fact, been some research on fast q-grams in SQL; see for instance

    http://www.cs.columbia.edu/~gravano/Papers/2001/deb01.pdf

    Lauri
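The core idea behind the q-gram work Lauri links is simple: slide a window of length q along a string and collect every contiguous substring. Strings that share many q-grams are candidates for a match, which lets approximate matching be driven by ordinary equality joins. A minimal sketch (in Python rather than SQL, for brevity):

```python
def qgrams(s: str, q: int) -> list[str]:
    """All contiguous substrings of length q (the q-grams of s)."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]

# Two strings that share many q-grams are likely similar:
print(qgrams("gravano", 3))  # ['gra', 'rav', 'ava', 'van', 'ano']
```

In a database this list becomes a (string_id, gram) table, so a `LIKE '%abc%'` style search turns into an indexed seek on the gram column.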

    Thanks for posting this, Lauri. I just left you a longer reply, but my browser crashed. I just read both of these; very interesting stuff!

    "I can't stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Hi Alan,

    Thanks for reading them. It would have been nice to see your longer answer.

    Microsoft has (had?) a research group on data cleaning, https://www.microsoft.com/en-us/research/project/data-cleaning/,
    which resulted in the fuzzy-matching function in SSIS; it is also available as an add-on for Excel. If you look at their research papers, they reference the papers I mentioned in my post.

    The SSIS function actually works quite well; I used it in a project to match customers by name and address. Used iteratively and interactively, it gives good results.

    Lauri
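The exact algorithm inside the SSIS fuzzy-matching component isn't spelled out here, but a common q-gram-based stand-in for this kind of name/address matching is Jaccard similarity over q-gram sets. This is only an assumed analogue, not the SSIS implementation; the padding and q = 3 choices are illustrative.

```python
def qgrams(s: str, q: int = 3) -> set:
    """Set of q-grams of s, lowercased and padded so edge characters contribute."""
    s = f" {s.lower()} "
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def jaccard(a: str, b: str) -> float:
    """Share of q-grams the two strings have in common (0.0 to 1.0)."""
    ga, gb = qgrams(a), qgrams(b)
    return len(ga & gb) / len(ga | gb)

# Near-duplicate customer names score high; unrelated ones score low.
assert jaccard("Jon Smith", "John Smith") > jaccard("Jon Smith", "Lauri")
```

Thresholding such a score, then reviewing the borderline pairs by hand, mirrors the iterative-and-interactive workflow described above.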
