Server-Friendly Way To Do Big String Search?
autoexcrement
Ten Centuries
Group: General Forum Members
Points: 1074 Visits: 888
I am working on a project for a company that offers online continuing education classes. The course content is stored in a database table, with one record per "page" of content. Each course can have up to 20-30 pages. The content field is of data type "text".

They want me to write some code that will scour all the courses to find any dead links to external sites.

I am planning on using non-SQL code to actually parse out and then check the URLs, etc. But I need to first generate that list of URLs somehow, and that's what I'd love your feedback on.

Simply running a query like this...
SELECT [content]
FROM [table]
WHERE [content] LIKE '%http%'


...seems like it would take a very long time and gobble up resources on the server, since it forces a scan of every row's full text.

Is there a better alternative? Should I try using some sort of WAITFOR DELAY loop to process only X records at a time?
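
For concreteness, the kind of batching I had in mind would be something like this (the [id] key column, batch size, and delay are just placeholders, not actual names from the table):

DECLARE @LastId INT, @MaxId INT, @BatchSize INT

SET @LastId = 0
SET @BatchSize = 500

SELECT @MaxId = MAX([id]) FROM [table]

WHILE @LastId < @MaxId
BEGIN
    -- search one slice of the table at a time
    SELECT [id], [content]
    FROM [table]
    WHERE [id] > @LastId
      AND [id] <= @LastId + @BatchSize
      AND [content] LIKE '%http%'

    SET @LastId = @LastId + @BatchSize

    -- give the server a breather between batches
    WAITFOR DELAY '00:00:01'
END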

Thanks in advance for any general feedback/ideas.


"If I had been drinking out of that toilet, I might have been killed." -Ace Ventura
irobertson
Say Hey Kid
Group: General Forum Members
Points: 703 Visits: 675
Have you thought about using a full-text index? It might be helpful for identifying which records contain links. You'd need to be careful with which stemming/word-breaking characters are used so the links themselves don't get broken up, though.
http://technet.microsoft.com/en-us/library/ms345119(v=sql.90).aspx
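
Roughly, that would look something like this; the table, column, and key index names below are just placeholders, and full-text obviously has to be available on the instance:

-- one-off setup (placeholder names)
create fulltext catalog CourseCatalog
create fulltext index on dbo.CoursePages ([content])
    key index PK_CoursePages on CourseCatalog

-- finding pages that contain links then becomes
select PageId, [content]
from dbo.CoursePages
where contains([content], '"http*"')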

For parsing the links out of the data, I'd be tempted to handle it outside of SQL Server. You might find that a parsing function built in C# or VB.NET and called via CLR performs better.
http://technet.microsoft.com/en-us/library/ms345136(v=sql.90).aspx
http://www.pawlowski.cz/2010/09/string-splitting-t-sql-vs-clr/
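
On the T-SQL side, calling such a function might look like the sketch below. dbo.ExtractUrls is a hypothetical CLR table-valued function (the name and signature are made up for illustration); nothing like it ships with SQL Server.

-- hypothetical CLR TVF: dbo.ExtractUrls(@text nvarchar(max)) returns one row per URL found
select p.PageId, u.Url
from dbo.CoursePages p
cross apply dbo.ExtractUrls(cast(p.[content] as nvarchar(max))) u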
autoexcrement
Ten Centuries
Group: General Forum Members
Points: 1074 Visits: 888
Thanks much for the reply. Unfortunately it's a shared server, so we don't have access to full-text indexing or other special settings.

And yes, I am going to use some non-SQL language to do the actual parsing, probably PHP or something similar. I just need to identify which records contain "http:" first.


"If I had been drinking out of that toilet, I might have been killed." -Ace Ventura
irobertson
Say Hey Kid
Group: General Forum Members
Points: 703 Visits: 675
As far as I can see, then, you're stuck with one of four options:

LIKE
CHARINDEX
PATINDEX
REPLACE

The script below exercises each one, with some test results. From the looks of it, there isn't much difference between the first three.

-------------------------------
-- create temporary tally table
-------------------------------

; with e1(n) as (
select 1 union all select 1 union all select 1 union all
select 1 union all select 1 union all select 1 union all
select 1 union all select 1 union all select 1 union all
select 1
)
, e2(n) as (select 1 from e1 a cross join e1 b)
, e4(n) as (select 1 from e2 a cross join e2 b)
select n = isnull(cast((row_number() over (order by (select null))-1) as int),0)
into #Tally
from e4
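-- e1 = 10 rows, e2 = 10 x 10 = 100, e4 = 100 x 100 = 10,000; n runs from 0 to 9,999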

-- add clustered index to the temporary tally table
alter table #Tally add constraint pk_qt primary key clustered (n)
-- (10000 row(s) affected)

------------------------------------------
-- create temporary table with "documents"
------------------------------------------

create table #Documents (DocId int, DocText varchar(4000))

-- create 10000 "documents"
insert #Documents select n, 'Have you thought about using a full text index? It might be helpful in identifying records with links.
You''d need to be careful with which stemming characters you use to avoid breaking the links up though.
http://technet.microsoft.com/en-us/library/ms345119(v=sql.90).aspx For parsing the links from the data,
I''d be tempted to handle this outside of sql server.
You might find that a parsing function built in C# or VB.net and called via CLR performs better.
http://technet.microsoft.com/en-us/library/ms345136(v=sql.90).aspx
http://www.pawlowski.cz/2010/09/string-splitting-t-sql-vs-clr/
Have you thought about using a full text index? It might be helpful in identifying records with links.
You''d need to be careful with which stemming characters you use to avoid breaking the links up though.
http://technet.microsoft.com/en-us/library/ms345119(v=sql.90).aspx For parsing the links from the data,
I''d be tempted to handle this outside of sql server.
You might find that a parsing function built in C# or VB.net and called via CLR performs better.
http://technet.microsoft.com/en-us/library/ms345136(v=sql.90).aspx
http://www.pawlowski.cz/2010/09/string-splitting-t-sql-vs-clr/
Have you thought about using a full text index? It might be helpful in identifying records with links.
You''d need to be careful with which stemming characters you use to avoid breaking the links up though.
http://technet.microsoft.com/en-us/library/ms345119(v=sql.90).aspx For parsing the links from the data,
I''d be tempted to handle this outside of sql server.
You might find that a parsing function built in C# or VB.net and called via CLR performs better.
http://technet.microsoft.com/en-us/library/ms345136(v=sql.90).aspx
http://www.pawlowski.cz/2010/09/string-splitting-t-sql-vs-clr/
I''d be tempted to handle this outside of sql server.
You might find that a parsing function built in C# or VB.net and called via CLR performs better.
I''d be tempted to handle this outside of sql server.
You might find that a parsing function built in C# or VB.net and called via CLR performs better.
I''d be tempted to handle this outside of sql server.
You might find that a parsing function built in C# or VB.net and called via CLR performs better.
I''d be tempted to handle this outside of sql server.
You might find that a parsing function built in C# or VB.net and called via CLR performs better.'
from #Tally
-- (10000 row(s) affected)

set statistics io on

-- like
select DocId from #Documents where DocText like '%http%'

-- charindex
select DocId from #Documents where charindex('http',DocText) > 0

-- patindex
select DocId from #Documents where patindex('%http%',DocText) > 0

-- substitution
select DocId from #Documents where len(DocText) <> len(replace(DocText,'http',''))

/*
(10000 row(s) affected)
Table '#Documents'. Scan count 1, logical reads 5000, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(10000 row(s) affected)
Table '#Documents'. Scan count 1, logical reads 5000, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(10000 row(s) affected)
Table '#Documents'. Scan count 1, logical reads 5000, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

(10000 row(s) affected)
Table '#Documents'. Scan count 1, logical reads 5000, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
*/

set statistics io off

------------
-- time test
------------

create table #TestResults (i int, l int, c int, p int, s int)

declare @like int
, @char int
, @pat int
, @sub int
, @i int
, @start datetime
, @end datetime

set @i = 1


while @i <=50
begin

-- like
set @start = getdate()

select DocId from #Documents where DocText like '%http%'

set @end = getdate()

set @like = datediff(ms,@start,@end)

-- charindex
set @start = getdate()

select DocId from #Documents where charindex('http',DocText) > 0

set @end = getdate()

set @char = datediff(ms,@start,@end)

-- patindex
set @start = getdate()

select DocId from #Documents where patindex('%http%',DocText) > 0

set @end = getdate()

set @pat = datediff(ms,@start,@end)

-- substitution
set @start = getdate()

select DocId from #Documents where len(DocText) <> len(replace(DocText,'http',''))

set @end = getdate()

set @sub = datediff(ms,@start,@end)

insert #TestResults select @i, @like, @char, @pat, @sub

set @i = @i+1

end

-- show results
select l=avg(l),c=avg(c),p=avg(p),s=avg(s)
from #TestResults

/*
average elapsed ms over 50 runs:
l     c     p     s
544   557   548   5641
*/



Depending on the proportion of documents that contain links, I might consider just farming everything out via CLR, regardless of whether a given doc contains links or not. You could then split the input into an array using "http:" as the delimiter and use the upper bound of the array to tell whether there are multiple elements (i.e. the doc contains links). Of course, this relies on you being able to create CLR functions on your server; if you can't, you can fall back to one of the approaches above.
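
If CLR isn't an option, here's a rough set-based sketch that pulls the URL candidates out directly in T-SQL using the same tally table. It assumes every URL ends at the first space or line break, which may not hold for real content:

-- find every occurrence of 'http' and read forward to the next whitespace character
select d.DocId
, Url = substring(d.DocText, t.n,
        patindex('%[ ' + char(13) + char(10) + ']%', substring(d.DocText, t.n, 4000) + ' ') - 1)
from #Documents d
join #Tally t
  on t.n between 1 and len(d.DocText)
 and substring(d.DocText, t.n, 4) = 'http'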
autoexcrement
Ten Centuries
Group: General Forum Members
Points: 1074 Visits: 888
Thanks, that's a great example of how to compare the efficiency of the various options! Glad to see that the most straightforward one (LIKE) is also basically the fastest.


"If I had been drinking out of that toilet, I might have been killed." -Ace Ventura