Hints for searching text?

  • I've got a table with like 6-7 million rows. Sometimes I'd like to search for certain text in a few columns. I can't index those columns because of text-length so I've to figure something out (because right now I can go on a really long coffebreak while doing a search).

    Searching using EQUAL is okay:

    WHERE Message='......'

    I'm trying to figure out something to slightly speed things up without recoding everything.

    I've came up with the idea of including a computed-CHECKSUM column (which I *can* index) on the textfields which seems to work better. However I'm curious if anyone got any clever ideas I should look into instead?

    • This topic was modified 1 year, 2 months ago by  oRBIT.
  • You have a some options at least to help speed things up.  I'm sure someone can think of others.  (Entries are numbered only to ease possible discussion of them, not necessarily to suggest that one is a preferred method over another).

    1. Use some form of trigram (or other n-gram) search.

    2. If you are ok with just being able to search for full words, or matches to full words, you can use full-text indexing.

    3. If you are ok with only searching the first 900 or so bytes of the column, you could index on that.

    4. You could store the long text broken apart in another table, limited to ~900 bytes per row, and search that instead (that is the most complex option).

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Full text indexes are an option. Although, in all honesty, I hate 'em. Hard to maintain and frequently don't actually speed things up much, if at all. Extra storage. Ugh. Anyhow, it is something to test for large text searches.

    Alternatively, if you're storing massive amounts of text in SQL Server, stop. It's not its forte. Instead look at something like ElasticSearch or similar text focused data stores. Tool for the job & all that.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • I did some experiments with BINARY_CHECKSUM and I think that'll do for me and for my needs, it's good enough and simple to implement.

  • Always a fan of "whatever works". Glad you found a solution.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • oRBIT wrote:

    I did some experiments with BINARY_CHECKSUM and I think that'll do for me and for my needs, it's good enough and simple to implement.

    That's not going to help you find individual words in a column, which is what it sounds like you're trying to do.  It won't even help you with entries that "start with" something.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • oRBIT wrote:

    I did some experiments with BINARY_CHECKSUM and I think that'll do for me and for my needs, it's good enough and simple to implement.

    Also, if that solution works for you, the I agree with Grant... whatever gets the job done.

    Also, two way street here... can you share your solution with a bit more detail so that someone else might be able to do the same as you?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I'm aware I'm only able to use EQUAL's here, not words, but it'll do for me. What I did was to implement a persisted computed column:

    ALTER TABLE MyTable Add MessageChecksum AS (BINARY_CHECKSUM([Message])) PERSISTED

    Then I added an index for that column, including the relevant columns (only one, in my case)

    CREATE NONCLUSTERED INDEX [IX_MyTable_MessageChecksum] ON [dbo].[MyTable]
    (
    [MessageChecksum] ASC
    )
    INCLUDE([Time]) WITH (PAD_INDEX = OFF, FILLFACTOR = 80, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON, OPTIMIZE_FOR_SEQUENTIAL_KEY = OFF) ON [PRIMARY]
    GO

    Amazing performance in tests anyway.. However, I'm slightly worried about that index, since MessageChecksum is not really incremental and would probably lead to a fragmented index in very short amounts of time..

    I decreased the FillFactor because of that but it's slightly guesswork from my side..

     

  • You'll find that logical fragmentation just doesn't matter that much.  Since you lowered the FILL FACTOR as you would for a Random GUID (which the BINARY CHECKSUM resembles in nature), I'll tell you to NEVER use REORGANIZE on that index.  It will perpetuate and actually be the cause of page splits.  I'll also tell you that you should consider doing a REBUILD when it gets to >1% fragmentation to help prevent the page splits, which are hell on the log file and concurrency.

    Since that sounds so completely heterodoxic compared to what you've likely learned about index maintenance, please see the following 81 minute video with the understanding that it's NOT just about Random GUIDs... I just use those to destroy some myths and lay waste to what people are calling "Best Practices".  Here's the link...

    https://www.youtube.com/watch?v=rvZwMNJxqVo

    Be advised that they added 3 advertisements at the 15:00, 30:15, and 45:25. The advertisements are important to help support the event where this was conducted but I wanted to let you know that they're sudden and they're loud! They WILL lift your headset!

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden wrote:

    You'll find that logical fragmentation just doesn't matter that much.  Since you lowered the FILL FACTOR as you would for a Random GUID (which the BINARY CHECKSUM resembles in nature), I'll tell you to NEVER use REORGANIZE on that index.  It will perpetuate and actually be the cause of page splits.  I'll also tell you that you should consider doing a REBUILD when it gets to >1% fragmentation to help prevent the page splits, which are hell on the log file and concurrency.

    Since that sounds so completely heterodoxic compared to what you've likely learned about index maintenance, please see the following 81 minute video with the understanding that it's NOT just about Random GUIDs... I just use those to destroy some myths and lay waste to what people are calling "Best Practices".  Here's the link...

    https://www.youtube.com/watch?v=rvZwMNJxqVo

    Be advised that they added 3 advertisements at the 15:00, 30:15, and 45:25. The advertisements are important to help support the event where this was conducted but I wanted to let you know that they're sudden and they're loud! They WILL lift your headset!

    I'll check that link! I understand you have opinions about that checksum-index? How would you have designed it then? My table will only have INSERT's, I don't think ever UPDATE's or DELETE's..

     

    • This reply was modified 1 year, 2 months ago by  oRBIT.
    • This reply was modified 1 year, 2 months ago by  oRBIT.
  • I'm really curious as to the why behind adding a checksum value to the table, based on a column named "Message", while then including the column named "Time", that somehow speeds up access based on a text search...

    Also am wondering why not just create an IDENTITY column instead of checksum value...   I'm not sure I'm understanding how your scenario achieves a faster result, but I may be assuming that the column being searched is "Message", when it would kinda have to be "Time" instead, given what I've read so far.    If I am confused, I'm okay with that... just need some splainin' to take place...

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • I am skeptical that pre-fragmenting the index to 20% fragmentation with the 80% fill-factor is preferable to a relatively small amount of fragmentation that would result from changes to checksum values - that index is at most 9 bytes wide combining the checksum column and leaf node. (unsure how fragmentation would be impacted in the leaf node - I would assume only the branch node would fragment? in which case its only 4 bytes susceptible to fragmentation)

    Orbit: I assume that your situation is you know the exact text value you are searching for and are using the checksum index to reduce down the number of rows that must be scanned?

  • CreateIndexNonclustered wrote:

    I am skeptical that pre-fragmenting the index to 20% fragmentation with the 80% fill-factor is preferable to a relatively small amount of fragmentation that would result from changes to checksum values - that index is at most 9 bytes wide combining the checksum column and leaf node. (unsure how fragmentation would be impacted in the leaf node - I would assume only the branch node would fragment? in which case its only 4 bytes susceptible to fragmentation)

    Orbit: I assume that your situation is you know the exact text value you are searching for and are using the checksum index to reduce down the number of rows that must be scanned?

    The index doesn't contain only the checksum.  Also, page splits for narrow indexes are far worse than for wider indexes because there are more rows that get moved and that also means more log file and a longer period of locking.

    And, you have to remember... it's not just changes for such evenly distributed indexes.  It's the inserts.  Done correctly, there won't be a need to rebuild again for months.

     

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Right - but only the checksum column is keyed. With the number of collisions in checksum I would think you would hit a point of maximum entropy in index fragmentation pretty quickly unless there is a very high rate of new record inserts. Maybe I am overestimating the number of collisions that would occur?

    I really don't like the idea of starting out day 1 increasing the amount of physical i/o by 20%. I had a system in the past that used a clustered uniqueIdentifier that served to intentionally fragment the clustered index across file groups on separate spindles when it still ran on HDD, (and also to accommodate empty record inserts with dozens or hundreds of updates per record before the record was 'closed') and even after 10 years it was less than 20% fragmented. The only time it ever had issues with bad page splits was after having to back fill several months of data

  • sgmunson wrote:

    I'm really curious as to the why behind adding a checksum value to the table, based on a column named "Message", while then including the column named "Time", that somehow speeds up access based on a text search...

    Also am wondering why not just create an IDENTITY column instead of checksum value...   I'm not sure I'm understanding how your scenario achieves a faster result, but I may be assuming that the column being searched is "Message", when it would kinda have to be "Time" instead, given what I've read so far.    If I am confused, I'm okay with that... just need some splainin' to take place...

    I probably left out a few details.. The problem with the table in the first place, I'd like to check when certain messages appeared in time. Doing a SELECT .. WHERE Message='abc123' could take forever to perform. I implemented a computed column with checksum of Message. So if I wanted to search for 'abc123' I search for the checksum value instead which has an index that speed things up. It's obviously not exact and doesn't support partial searches but it works for my purpose right now.

    The current remaining question though is the index itself, since the checksum value will be pretty much random and cause a fragmented index. If that's an issue or not, I have yet to discover from all post comments. 🙂

     

Viewing 15 posts - 1 through 15 (of 22 total)

You must be logged in to reply to this topic. Login to reply