RE: Best way to filter columns

SSC Guru

Points: 1003865

June 27, 2016 at 8:25 am

frontrunner148 (6/27/2016)
Jeff Moden (6/25/2016)
My first inclination would be to use the UNIQUEIDENTIFIER datatype for the GUID instead of VARCHAR but that may be contrary to formatting of the data received. If it's not a correctly formatted GUID, it will cause rejection of the row during import.
Shifting gears to the subject at hand and with performance in mind, I'd be tempted to do a word level split of the title and description into a table that would take the GUID and the split-out word for each word, possibly not including such noise words as "and", "or", "a", "the", etc. Then it would be a simple matter to join that table to a "black-ball" list of words in a high performance fashion to return a unique list of GUIDs that contained black-balled words. Think of it as an easy-to-maintain, poor-mans version of "FULL TEXT" lookups that you can have some pretty exquisite control over.
You could, of course, do a LIKE '%____%" for every black-balled word but that will become really slow as the number of black-balled words and number of RSS feeds continues to grow.
If you want to try a really fast, T-SQL only splitter, look for the "DelimitedSplitN4K" splitter in the "Resources" section of the following article. The only thing that would be faster is a well-written, properly designed/operating SQLCLR to do the splits with. Sadly, many such SQLCLRs leave out many nuances as to how to provide correct returns for such splitters so be really careful in your testing if you decide to go that route. Even the SQLCLR splitter in the article I reference has such problems.
Here's the link to the article...
http://www.sqlservercentral.com/articles/Tally+Table/72993/
Thanks jeff...
It's like indexing each row in the two columns.
Great idea... and fast...

Thanks for the feedback. Actually, though, it's more like indexing each word in the RSS feed titles and description. And, as you imply, it's nasty fast.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)