Finding one of a list of strings in a number of text rows

  • If I have a list of phrases in a table and I want to find if they occur within the text in a separate table. My tables will get large. I don't mind using disk space but I do need speed. My phrases wont grow over 3 words long. I can only think of 2 ways:

    1) break up the large text blocks into parts of up to 3 words and then do a join

    2) do a scary like '%phrase%'

    I am having difficulty finding any solutions so any thoughts and help gratefully received!

    CREATE TABLE dbo.PickPhrase

    (

    id int NOT NULL,

    phrase varchar(200) NOT NULL

    )

    insert into pickphrase (id, phrase) values (1, 'Computers')

    insert into pickphrase (id, phrase) values (2, 'Fast Computers')

    insert into pickphrase (id, phrase) values (3, 'Hopper')

    insert into pickphrase (id, phrase) values (4, 'doh')

    CREATE TABLE [dbo].[PickitFrom](

    [id] [int] NOT NULL,

    [PickFrom] [varchar](max) NOT NULL

    )

    insert into PickitFrom (id, pickfrom) values (1, 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. In augue. Nunc ornare est accumsan purus. Cras elit justo, feugiat ac, luctus sed, interdum eget, eros.')

    insert into PickitFrom (id, pickfrom) values (2, 'Suspendisse potenti. Morbi aliquet, mi in molestie ultrices, velit justo vestibulum nunc, fast computers vitae facilisis mi turpis ac lectus.')

    insert into PickitFrom (id, pickfrom) values (3, 'doh, doh, doh')

  • Not knowing all that much about your application, table size etc have you looked at Full Text Searching? If not you might want to start in Books On Line at:

    Introduction to Full-Text Search

    ms-help://MS.SQLCC.v9/MS.SQLSVR.v9.en/fulltxt9/html/db3747e0-db12-4c69-9d81-b5011984eb3a.htm

    If everything seems to be going well, you have obviously overlooked something.

    Ron

    Please help us, help you -before posting a question please read[/url]
    Before posting a performance problem please read[/url]

  • If you can't use full-text for whatever reason, another approach would be to use a trigger on the PickItFrom table to save the occurrences of a PickPhrase within each new or modified row in another table.

    You could use CHARINDEX instead of LIKE (which, despite the name, is *not* limited to searching for a particular character - it can do strings) or PATINDEX (same as CHARINDEX but with full wildcard support).

    Depending on how static the PickItFrom data is once inserted, how many PickItFrom phrases there are, and how much overhead you can tolerate in the INSERT/UPDATE process, a trigger-based solution might be worth exploring; also it should out-perform full-text, at the expense of the overhead and disk space for the lookup table.

    Of course you would need additional triggers to cater for rows being inserted, updated, or deleted in the two base tables. Depending on how frequently that happened, it could be expensive.

    A quick illustration follows, note that it is quick-and-dirty and has no indexes or anything...

    DROP TABLE dbo.PickPhrase, dbo.PickitFrom, dbo.PhraseMatch

    CREATE TABLE dbo.PickPhrase

    (

    id int NOT NULL,

    phrase varchar(200) NOT NULL

    )

    CREATE TABLE dbo.PickitFrom(

    [id] [int] NOT NULL,

    [PickFrom] [varchar](max) NOT NULL

    )

    CREATE TABLE dbo.PhraseMatch

    (

    row_idINTEGER IDENTITY(1,1) NOT NULL PRIMARY KEY,

    phrase_idINTEGER NOT NULL,

    pick_from_idINTEGER NOT NULL,

    start_posINTEGER NOT NULL,

    end_posINTEGER NOT NULL,

    );

    GO

    CREATE TRIGGER dbo.trg_PickItFrom_After_Insert

    ON dbo.PickItFrom

    AFTER INSERT

    AS

    BEGIN

    DECLARE@Rows INT;

    SET@Rows = @@rowcount;

    SETNOCOUNT ON;

    IF@Rows = 0 RETURN;

    IF@Rows > 1

    BEGIN

    RAISERROR('Single row inserts only on PickItFrom!', 16, 1);

    ROLLBACK TRANSACTION;

    RETURN;

    END;

    INSERTdbo.PhraseMatch (phrase_id, pick_from_id, start_pos, end_pos)

    SELECTP.id, i.id, CHARINDEX(P.phrase, i.PickFrom), CHARINDEX(P.phrase, i.PickFrom) + LEN(P.phrase) - 1

    FROMinsertedAS i

    JOINdbo.PickPhraseAS P

    ON(CHARINDEX(P.phrase, i.PickFrom) != 0);

    WHILE(@@rowcount > 0)

    BEGIN

    INSERTdbo.PhraseMatch (phrase_id, pick_from_id, start_pos, end_pos)

    SELECTP.id, i.id, CHARINDEX(P.phrase, i.PickFrom, M.end_pos + 1), CHARINDEX(P.phrase, i.PickFrom, M.end_pos + 1) + LEN(P.phrase) - 1

    FROMinsertedAS i

    JOINdbo.PhraseMatchAS M

    ON(M.pick_from_id = i.id)

    JOINdbo.PickPhraseAS P

    ON(P.id = M.phrase_id)

    WHERECHARINDEX(P.phrase, i.PickFrom, M.end_pos + 1) > ALL

    (

    SELECTM2.start_pos

    FROMdbo.PhraseMatch AS M2

    WHEREM2.phrase_id = M.phrase_id

    ANDM2.pick_from_id = M.pick_from_id

    )

    ;

    END;

    END

    GO

    insert into pickphrase (id, phrase) values (1, 'Computers')

    insert into pickphrase (id, phrase) values (2, 'Fast Computers')

    insert into pickphrase (id, phrase) values (3, 'Hopper')

    insert into pickphrase (id, phrase) values (4, 'doh')

    insert into PickitFrom (id, pickfrom) values (1, 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. In augue. Nunc ornare est accumsan purus. Cras elit justo, feugiat ac, luctus sed, interdum eget, eros.')

    insert into PickitFrom (id, pickfrom) values (2, 'Suspendisse potenti. Morbi aliquet, mi in molestie ultrices, velit justo vestibulum nunc, fast computers vitae facilisis mi turpis ac lectus.')

    insert into PickitFrom (id, pickfrom) values (3, 'doh, doh, doh')

    select * from phrasematch

    Cheers,

    Paul

    edit: added sql

  • How many phrases are you going to search for? I've got a solution that is child's play to code, but it's never been tested at large volumes. The execution plan for the code below is a simple nested loop which will scan your pickphase table once for every row in pickitfrom. If you try it with production volumes, I'd really like to hear about the result.

    Warning, if multiple phrases appear in a row, it will be returned multiple times: once for each phrase match. If needed, the duplicates can be eliminated by using a cross apply to a query rather than a cross join to the phrase table itself.

    use tempdb;

    drop table pickphrase;

    drop table pickitfrom;

    CREATE TABLE dbo.PickPhrase

    (

    id int NOT NULL,

    phrase varchar(200) NOT NULL

    )

    -- preloading % signs into pickphrase to save time building LIKE strings

    insert into pickphrase (id, phrase) values (1, '%Computers%')

    insert into pickphrase (id, phrase) values (2, '%Fast Computers%')

    insert into pickphrase (id, phrase) values (3, '%Hopper%')

    insert into pickphrase (id, phrase) values (4, '%doh%')

    CREATE TABLE [dbo].[PickitFrom](

    [id] [int] NOT NULL,

    [PickFrom] [varchar](max) NOT NULL

    )

    insert into PickitFrom (id, pickfrom) values (1, 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. In augue. Nunc ornare est accumsan purus. Cras elit justo, feugiat ac, luctus sed, interdum eget, eros.')

    insert into PickitFrom (id, pickfrom) values (2, 'Suspendisse potenti. Morbi aliquet, mi in molestie ultrices, velit justo vestibulum nunc, fast computers vitae facilisis mi turpis ac lectus.')

    insert into PickitFrom (id, pickfrom) values (3, 'doh, doh, doh')

    select phrase,pickfrom

    from PickitFrom

    cross join pickphrase

    where pickfrom like phrase

    Paul, your solution is interesting and I'm guessing will perform better. I'm going to set up and test a little with this. Maybe try for a tally table solution. 😉 I'll post my results up and you can throw in some CLR if you want. This feels vaguely like Flo's parsing thread.

    __________________________________________________

    Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
    Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills

  • Hey thanks guys 😀

    I normally stay away from full text as I normally need good multilingual support. In this case i am only english so it would be good to get to the stemming without using my own stemmer 🙂

    I think I will give that a go to see how it goes.

    Thanks!

  • Bob Hovious (4/25/2009)


    Paul, your solution is interesting and I'm guessing will perform better. I'm going to set up and test a little with this. Maybe try for a tally table solution. 😉

    I'll post my results up and you can throw in some CLR if you want. This feels vaguely like Flo's parsing thread.

    Luckily I have now waterproofed my monitor. :laugh:

    I can't see a particular advantage in writing any C# for this, so I would stick with a T-SQL solution. Your solution has a neatness to it, and will be more robust than a trigger-based solution, though I would agree performance will likely not be as good.

    I suggested the trigger solution mostly because it is the fastest method I can think of, given that extra disk space is not an issue.

    I do tend to try to find non-trigger solutions because trigger solutions require extensive testing, a lot of analysis, and a guarantee that triggers will never be disabled. Coding the triggers for multiple-rows can sometimes be challenging too. The worst part would be having to rebuild the mapping table if the triggers ever became 'out of step' with the base tables.

    In summary (wait for it)...it depends!

    Paul

  • Bob Hovious (4/25/2009)


    Paul, your solution is interesting and I'm guessing will perform better. I'm going to set up and test a little with this. Maybe try for a tally table solution. 😉 I'll post my results up and you can throw in some CLR if you want. This feels vaguely like Flo's parsing thread.

    :laugh::laugh::laugh:

    Just let me take a coke and popcorn!

    The Splitter

Viewing 7 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply