April 25, 2009 at 6:12 am
If I have a list of phrases in a table and I want to find if they occur within the text in a separate table. My tables will get large. I don't mind using disk space but I do need speed. My phrases wont grow over 3 words long. I can only think of 2 ways:
1) break up the large text blocks into parts of up to 3 words and then do a join
2) do a scary like '%phrase%'
I am having difficulty finding any solutions so any thoughts and help gratefully received!
CREATE TABLE dbo.PickPhrase
(
id int NOT NULL,
phrase varchar(200) NOT NULL
)
insert into pickphrase (id, phrase) values (1, 'Computers')
insert into pickphrase (id, phrase) values (2, 'Fast Computers')
insert into pickphrase (id, phrase) values (3, 'Hopper')
insert into pickphrase (id, phrase) values (4, 'doh')
CREATE TABLE [dbo].[PickitFrom](
[id] [int] NOT NULL,
[PickFrom] [varchar](max) NOT NULL
)
insert into PickitFrom (id, pickfrom) values (1, 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. In augue. Nunc ornare est accumsan purus. Cras elit justo, feugiat ac, luctus sed, interdum eget, eros.')
insert into PickitFrom (id, pickfrom) values (2, 'Suspendisse potenti. Morbi aliquet, mi in molestie ultrices, velit justo vestibulum nunc, fast computers vitae facilisis mi turpis ac lectus.')
insert into PickitFrom (id, pickfrom) values (3, 'doh, doh, doh')
April 25, 2009 at 8:57 am
Not knowing all that much about your application, table size etc have you looked at Full Text Searching? If not you might want to start in Books On Line at:
Introduction to Full-Text Search
ms-help://MS.SQLCC.v9/MS.SQLSVR.v9.en/fulltxt9/html/db3747e0-db12-4c69-9d81-b5011984eb3a.htm
April 25, 2009 at 11:11 am
If you can't use full-text for whatever reason, another approach would be to use a trigger on the PickItFrom table to save the occurrences of a PickPhrase within each new or modified row in another table.
You could use CHARINDEX instead of LIKE (which, despite the name, is *not* limited to searching for a particular character - it can do strings) or PATINDEX (same as CHARINDEX but with full wildcard support).
Depending on how static the PickItFrom data is once inserted, how many PickItFrom phrases there are, and how much overhead you can tolerate in the INSERT/UPDATE process, a trigger-based solution might be worth exploring; also it should out-perform full-text, at the expense of the overhead and disk space for the lookup table.
Of course you would need additional triggers to cater for rows being inserted, updated, or deleted in the two base tables. Depending on how frequently that happened, it could be expensive.
A quick illustration follows, note that it is quick-and-dirty and has no indexes or anything...
DROP TABLE dbo.PickPhrase, dbo.PickitFrom, dbo.PhraseMatch
CREATE TABLE dbo.PickPhrase
(
id int NOT NULL,
phrase varchar(200) NOT NULL
)
CREATE TABLE dbo.PickitFrom(
[id] [int] NOT NULL,
[PickFrom] [varchar](max) NOT NULL
)
CREATE TABLE dbo.PhraseMatch
(
row_idINTEGER IDENTITY(1,1) NOT NULL PRIMARY KEY,
phrase_idINTEGER NOT NULL,
pick_from_idINTEGER NOT NULL,
start_posINTEGER NOT NULL,
end_posINTEGER NOT NULL,
);
GO
CREATE TRIGGER dbo.trg_PickItFrom_After_Insert
ON dbo.PickItFrom
AFTER INSERT
AS
BEGIN
DECLARE@Rows INT;
SET@Rows = @@rowcount;
SETNOCOUNT ON;
IF@Rows = 0 RETURN;
IF@Rows > 1
BEGIN
RAISERROR('Single row inserts only on PickItFrom!', 16, 1);
ROLLBACK TRANSACTION;
RETURN;
END;
INSERTdbo.PhraseMatch (phrase_id, pick_from_id, start_pos, end_pos)
SELECTP.id, i.id, CHARINDEX(P.phrase, i.PickFrom), CHARINDEX(P.phrase, i.PickFrom) + LEN(P.phrase) - 1
FROMinsertedAS i
JOINdbo.PickPhraseAS P
ON(CHARINDEX(P.phrase, i.PickFrom) != 0);
WHILE(@@rowcount > 0)
BEGIN
INSERTdbo.PhraseMatch (phrase_id, pick_from_id, start_pos, end_pos)
SELECTP.id, i.id, CHARINDEX(P.phrase, i.PickFrom, M.end_pos + 1), CHARINDEX(P.phrase, i.PickFrom, M.end_pos + 1) + LEN(P.phrase) - 1
FROMinsertedAS i
JOINdbo.PhraseMatchAS M
ON(M.pick_from_id = i.id)
JOINdbo.PickPhraseAS P
ON(P.id = M.phrase_id)
WHERECHARINDEX(P.phrase, i.PickFrom, M.end_pos + 1) > ALL
(
SELECTM2.start_pos
FROMdbo.PhraseMatch AS M2
WHEREM2.phrase_id = M.phrase_id
ANDM2.pick_from_id = M.pick_from_id
)
;
END;
END
GO
insert into pickphrase (id, phrase) values (1, 'Computers')
insert into pickphrase (id, phrase) values (2, 'Fast Computers')
insert into pickphrase (id, phrase) values (3, 'Hopper')
insert into pickphrase (id, phrase) values (4, 'doh')
insert into PickitFrom (id, pickfrom) values (1, 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. In augue. Nunc ornare est accumsan purus. Cras elit justo, feugiat ac, luctus sed, interdum eget, eros.')
insert into PickitFrom (id, pickfrom) values (2, 'Suspendisse potenti. Morbi aliquet, mi in molestie ultrices, velit justo vestibulum nunc, fast computers vitae facilisis mi turpis ac lectus.')
insert into PickitFrom (id, pickfrom) values (3, 'doh, doh, doh')
select * from phrasematch
Cheers,
Paul
edit: added sql
April 25, 2009 at 1:16 pm
How many phrases are you going to search for? I've got a solution that is child's play to code, but it's never been tested at large volumes. The execution plan for the code below is a simple nested loop which will scan your pickphase table once for every row in pickitfrom. If you try it with production volumes, I'd really like to hear about the result.
Warning, if multiple phrases appear in a row, it will be returned multiple times: once for each phrase match. If needed, the duplicates can be eliminated by using a cross apply to a query rather than a cross join to the phrase table itself.
use tempdb;
drop table pickphrase;
drop table pickitfrom;
CREATE TABLE dbo.PickPhrase
(
id int NOT NULL,
phrase varchar(200) NOT NULL
)
-- preloading % signs into pickphrase to save time building LIKE strings
insert into pickphrase (id, phrase) values (1, '%Computers%')
insert into pickphrase (id, phrase) values (2, '%Fast Computers%')
insert into pickphrase (id, phrase) values (3, '%Hopper%')
insert into pickphrase (id, phrase) values (4, '%doh%')
CREATE TABLE [dbo].[PickitFrom](
[id] [int] NOT NULL,
[PickFrom] [varchar](max) NOT NULL
)
insert into PickitFrom (id, pickfrom) values (1, 'Lorem ipsum dolor sit amet, consectetur adipiscing elit. In augue. Nunc ornare est accumsan purus. Cras elit justo, feugiat ac, luctus sed, interdum eget, eros.')
insert into PickitFrom (id, pickfrom) values (2, 'Suspendisse potenti. Morbi aliquet, mi in molestie ultrices, velit justo vestibulum nunc, fast computers vitae facilisis mi turpis ac lectus.')
insert into PickitFrom (id, pickfrom) values (3, 'doh, doh, doh')
select phrase,pickfrom
from PickitFrom
cross join pickphrase
where pickfrom like phrase
Paul, your solution is interesting and I'm guessing will perform better. I'm going to set up and test a little with this. Maybe try for a tally table solution. 😉 I'll post my results up and you can throw in some CLR if you want. This feels vaguely like Flo's parsing thread.
__________________________________________________
Against stupidity the gods themselves contend in vain. -- Friedrich Schiller
Stop, children, what's that sound? Everybody look what's going down. -- Stephen Stills
April 25, 2009 at 5:08 pm
Hey thanks guys 😀
I normally stay away from full text as I normally need good multilingual support. In this case i am only english so it would be good to get to the stemming without using my own stemmer 🙂
I think I will give that a go to see how it goes.
Thanks!
April 25, 2009 at 7:38 pm
Bob Hovious (4/25/2009)
Paul, your solution is interesting and I'm guessing will perform better. I'm going to set up and test a little with this. Maybe try for a tally table solution. 😉I'll post my results up and you can throw in some CLR if you want. This feels vaguely like Flo's parsing thread.
Luckily I have now waterproofed my monitor. :laugh:
I can't see a particular advantage in writing any C# for this, so I would stick with a T-SQL solution. Your solution has a neatness to it, and will be more robust than a trigger-based solution, though I would agree performance will likely not be as good.
I suggested the trigger solution mostly because it is the fastest method I can think of, given that extra disk space is not an issue.
I do tend to try to find non-trigger solutions because trigger solutions require extensive testing, a lot of analysis, and a guarantee that triggers will never be disabled. Coding the triggers for multiple-rows can sometimes be challenging too. The worst part would be having to rebuild the mapping table if the triggers ever became 'out of step' with the base tables.
In summary (wait for it)...it depends!
Paul
April 26, 2009 at 10:28 am
Bob Hovious (4/25/2009)
Paul, your solution is interesting and I'm guessing will perform better. I'm going to set up and test a little with this. Maybe try for a tally table solution. 😉 I'll post my results up and you can throw in some CLR if you want. This feels vaguely like Flo's parsing thread.
:laugh::laugh::laugh:
Just let me take a coke and popcorn!
The Splitter
Viewing 7 posts - 1 through 7 (of 7 total)
You must be logged in to reply to this topic. Login to reply