Searching for URL contents using CHARINDEX

  • I need to query the table #URLTbl created by my code below, looking at the field "URL" to spot any occurrences of words contained in the second table created here, #Badwords.

    I've succeeded at finding the words hidden in the string in the URL field by using the select statement

    select * from dbo.#URLTbl

    where CHARINDEX('Porn',URL) > 0 or CHARINDEX('naked',URL) > 0

    but I cannot figure out how to make this statement use the contents of the #Badwords table to return the matching rows from #URLTbl.

    Thanks for any thoughts!

    --To create #URLTbl and #BadWords temp tables run this entire section

    IF OBJECT_ID('TempDB..#URLTbl','U') IS NOT NULL DROP TABLE #URLTbl

    CREATE TABLE #URLTbl

    (MsgDate Varchar(1024),

    Username Varchar (1024),

    URL Varchar (1024),

    Rating Varchar (1024))

    --===== Setup any special required conditions especially where dates are concerned

    SET DATEFORMAT YMD

    INSERT INTO #URLTbl(MsgDate,Username,URL,Rating)

    SELECT '2010-11-10','webuser\patr','http://view.atdmt.com/action/porn/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL

    SELECT '2010-11-10','webuser\mhoag','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL

    SELECT '2010-11-10','webuser\hlavaty3398n','http://content.murder.yieldmanager.com/ak/q.gif','internet services' UNION ALL

    SELECT '2010-11-10','webuser\matth','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news' union all

    SELECT '2010-11-10','webuser\patr','http://view.atdmt.com/action/kill/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL

    SELECT '2010-11-10','webuser\mmhoag','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL

    SELECT '2010-11-10','webuser\hmlavaty3398n','http://content.porn.yieldmanager.com/ak/q.gif','internet services' UNION ALL

    SELECT '2010-11-10','webuser\mamtth','http://assets3b.msnbc.msn.com/fight/rendering/img/bing.png','news' union all

    SELECT '2010-11-10','webuser\patmr','http://view.atdmt.com/action/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL

    SELECT '2010-11-10','webuser\mhoamg','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL

    SELECT '2010-11-10','webuser\hlavamty3398n','http://content.yieldmanager.com/ak/q.gif','internet services' UNION ALL

    SELECT '2010-11-10','webuser\mattmh','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news'union all

    SELECT '2010-11-10','webuser\patrm','http://view.atdmt.com/action/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL

    SELECT '2010-11-12','webuser\mhoamg','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL

    SELECT '2010-11-12','webuser\hlavmaty3398n','http://content.yieldmanager.com/ak/q.gif','internet services' UNION ALL

    SELECT '2010-11-12','webuser\mattmh','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news' union all

    SELECT '2010-11-15','webuser\patrm','http://view.atdmt.com/action/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL

    SELECT '2010-11-15','webuser\mhoamg','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL

    SELECT '2010-11-15','webuser\hlavmaty3398n','http://content.yieldmanager.com/ak/q.gif','internet services' UNION ALL

    SELECT '2010-11-15','webuser\mattmh','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news'union all

    SELECT '2010-11-17','webuser\patr','http://view.atdmt.com/action/porn/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL

    SELECT '2010-11-10','webuser\mhoag','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL

    SELECT '2010-11-17','webuser\hlavaty3398n','http://content.murder.yieldmanager.com/ak/q.gif','internet services' UNION ALL

    SELECT '2010-11-10','webuser\matth','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news' union all

    SELECT '2010-11-17','webuser\patr','http://view.atdmt.com/action/kill/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL

    SELECT '2010-11-10','webuser\mmhoag','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL

    SELECT '2010-11-21','webuser\hmlavaty3398n','http://content.yieldmanager.com/ak/q.gif','internet services' UNION ALL

    SELECT '2010-11-21','webuser\mamtth','http://assets3b.msnbc.msn.com/fight/rendering/img/bing.png','news' union all

    SELECT '2010-11-21','webuser\patmr','http://view.atdmt.com/action/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL

    SELECT '2010-11-10','webuser\mhoamg','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL

    SELECT '2010-11-10','webuser\hlavamty3398n','http://content.yieldmanager.com/ak/q.gif','internet services' UNION ALL

    SELECT '2010-11-10','webuser\mattmh','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news'union all

    SELECT '2010-11-21','webuser\patrm','http://view.atdmt.com/action/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL

    SELECT '2010-11-21','webuser\mhoamg','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL

    SELECT '2010-11-10','webuser\hlavmaty3398n','http://content.yieldmanager.com/ak/q.gif','internet services' UNION ALL

    SELECT '2010-11-10','webuser\mattmh','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news' union all

    SELECT '2010-11-10','webuser\patrm','http://view.atdmt.com/action/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL

    SELECT '2010-11-10','webuser\mhoamg','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL

    SELECT '2010-11-10','webuser\hlavmaty3398n','http://content.yieldmanager.com/ak/q.gif','internet services' UNION ALL

    SELECT '2010-11-10','webuser\mattmh','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news'union all

    SELECT '2010-11-10','webuser\patr','http://view.atdmt.com/action/porn/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL

    SELECT '2010-11-10','webuser\mhoag','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL

    SELECT '2010-11-10','webuser\hlavaty3398n','http://content.murder.yieldmanager.com/ak/q.gif','internet services' UNION ALL

    SELECT '2010-11-10','webuser\matth','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news' union all

    SELECT '2010-11-10','webuser\patr','http://view.atdmt.com/action/kill/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL

    SELECT '2010-11-10','webuser\mmhoag','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL

    SELECT '2010-11-10','webuser\hmlavaty3398n','http://content.yieldmanager.com/ak/q.gif','internet services' UNION ALL

    SELECT '2010-11-10','webuser\mamtth','http://assets3b.msnbc.msn.com/fight/rendering/img/bing.png','news' union all

    SELECT '2010-11-10','webuser\patmr','http://view.atdmt.com/action/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL

    SELECT '2010-11-10','webuser\mhoamg','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL

    SELECT '2010-11-25','webuser\hlavamty3398n','http://content.yieldmanager.com/ak/q.gif','internet services' UNION ALL

    SELECT '2010-11-25','webuser\mattmh','http://assets3bomb.msnbc.msn.com/rendering/img/bing.png','news'union all

    SELECT '2010-11-25','webuser\patrm','http://view.atdmt.com/action/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL

    SELECT '2010-11-25','webuser\mhoamg','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL

    SELECT '2010-11-25','webuser\hlavmaty3398n','http://content.yieldmanager.com/ak/q.gif','internet services' UNION ALL

    SELECT '2010-11-25','webuser\mattmh','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news' union all

    SELECT '2010-11-25','webuser\patrm','http://view.atdmt.com/action/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL

    SELECT '2010-11-25','webuser\mhoamg','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL

    SELECT '2010-11-25','webuser\hlavmaty3398n','http://content.yieldmanager.com/ak/q.gif','internet services' UNION ALL

    SELECT '2010-11-29','webuser\mattmh','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news'union all

    SELECT '2010-11-29','webuser\patr','http://view.atdmt.com/action/porn/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL

    SELECT '2010-11-29','webuser\mhoag','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL

    SELECT '2010-11-29','webuser\hlavaty3398n','http://content.murder.yieldmanager.com/ak/q.gif','internet services' UNION ALL

    SELECT '2010-11-29','webuser\matth','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news' union all

    SELECT '2010-11-29','webuser\patr','http://view.atdmt.com/action/kill/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL

    SELECT '2010-11-29','webuser\mmhoag','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL

    SELECT '2010-11-29','webuser\hmlavaty3398n','http://content.yieldmanagergun.com/ak/q.gif','internet services' UNION ALL

    SELECT '2010-11-29','webuser\mamtth','http://assets3b.msnbc.msn.com/fight/gun/rendering/img/bing.png','news' union all

    SELECT '2010-11-07','webuser\patmr','http://view.atdmt.com/action/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL

    SELECT '2010-11-07','webuser\mhoamg','http://cdn.unicast.msn.com/porn/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL

    SELECT '2010-11-07','webuser\hlavamty3398n','http://content.yieldmanager.com/ak/q.gif','internet services' UNION ALL

    SELECT '2010-11-07','webuser\mattmh','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news'union all

    SELECT '2010-11-07','webuser\patrm','http://view.atdmt.com/action/MSN_Homebombpage_Remessaging_111808/nc?a=1','web banners' UNION ALL

    SELECT '2010-11-07','webuser\mhoamg','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL

    SELECT '2010-11-07','webuser\hlavmaty3398n','http://content.yieldmanager.com/ak/q.gif','internet services' UNION ALL

    SELECT '2010-11-07','webuser\mattmh','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news' union all

    SELECT '2010-11-07','webuser\patrm','http://view.atdmt.com/action/nude/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL

    SELECT '2010-11-07','webuser\mhoamg','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C18naked1757/large.jpg?0.5824313560398167','web search' UNION ALL

    SELECT '2010-11-07','webuser\hlavmaty3398n','http://content.bomb.yieldmanager.com/ak/q.gif','internet services' UNION ALL

    SELECT '2010-11-10','webuser\mattmh','http://assets3b.msnbc.msn.com/fight/rendering/img/bing.png','news'

    --Create Badwords table

    IF OBJECT_ID('TempDB..#badwords','U') IS NOT NULL DROP TABLE #badwords

    CREATE TABLE #Badwords

    (Words nchar(25))

    --===== Setup any special required conditions especially where dates are concerned

    SET DATEFORMAT YMD

    --===== Insert the test data into the test table

    INSERT INTO #badwords

    (words)

    SELECT 'porn ' UNION ALL

    SELECT 'nude ' UNION ALL

    SELECT 'naked ' UNION ALL

    SELECT 'gun ' UNION ALL

    SELECT 'bomb ' UNION ALL

    SELECT 'kill ' UNION ALL

    SELECT 'murder '

    --===== Set the identity insert back to normal

    -- SET IDENTITY_INSERT #badwords ON

    select * from dbo.#badwords

    select * from dbo.#URLTbl

  • Okay, it took me a few minutes, but here's code that works with your sample:

    SELECT

    CHARINDEX(RTRIM(B.Words), U.url),

    *

    FROM

    #urltbl U

    CROSS JOIN #Badwords AS B

    WHERE

    CHARINDEX(RTRIM(B.Words), U.url) <> 0

    ORDER BY

    username,

    url

    There are two key points. First, I'm using a CROSS JOIN so that each URL is checked against each bad word. This means that a URL containing more than one bad word will be returned once for each bad word it contains. Second, your example uses nchar(25) for the bad words, which means each entry is padded out to 25 characters (50 bytes), AND you also padded the bad words with spaces in the sample data. I had to use RTRIM to remove the padding in order to find matches, since the bad words in a URL are not likely to be followed by spaces.
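    If one row per URL is preferred no matter how many bad words it contains, the CROSS JOIN can be swapped for an EXISTS test. What follows is a minimal sketch rather than part of the original answer; it assumes the #URLTbl and #Badwords temp tables from the first post exist in the current session. The first query is only there to make the nchar(25) padding visible.

    ```sql
    -- Show the padding: DATALENGTH counts stored bytes (nchar(25) is always 50),
    -- while LEN ignores trailing spaces and reports the visible length.
    SELECT Words,
           DATALENGTH(Words) AS StoredBytes,
           LEN(Words)        AS VisibleLength
    FROM #Badwords;

    -- Return each row of #URLTbl at most once if its URL contains any bad word.
    -- RTRIM strips the padding so the word can match inside the URL.
    SELECT U.MsgDate, U.Username, U.URL, U.Rating
    FROM #URLTbl AS U
    WHERE EXISTS (
        SELECT 1
        FROM #Badwords AS B
        WHERE CHARINDEX(RTRIM(B.Words), U.URL) > 0
    )
    ORDER BY U.Username, U.URL;
    ```

    The trade-off is that the EXISTS form no longer tells you which word matched; keep the CROSS JOIN version when that matters.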

  • Wow... so simple when you know how!

    Thanks! .. and.. happy new year!

  • If I want to add a date-range restriction to this, given that the date/time fields are all Varchar types, how would I do that? For example, say I wanted all records from the URL table with 'MSGDATE' between 12/10/2010 and 1/1/2011?

    (Quoting Jack Corbett, 12/29/2010: the CROSS JOIN query and explanation posted above.)

  • Just add that to the Where Clause.

    SELECT

    CHARINDEX(RTRIM(B.Words), U.url),

    *

    FROM

    #urltbl U

    CROSS JOIN #Badwords AS B

    WHERE

    MSGDATE >= '12/10/2010' AND

    /* using the next day because if you want all of 1/1 and there are any times

    included you need to do less than the next day */

    MSGDATE < '1/2/2011' AND

    CHARINDEX(RTRIM(B.Words), U.url) <> 0

    ORDER BY

    username,

    url
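    Because MsgDate is a varchar in the sample table, the comparisons above are string comparisons rather than date comparisons. One way around that, while the column is still character data, is to convert it explicitly before comparing. This is a sketch under assumptions: every MsgDate value is a valid 'YYYY-MM-DD' string as in the sample data (a single malformed value would raise a conversion error), and the range endpoints are written in the unambiguous 'YYYYMMDD' form. The MatchPosition alias is only for illustration.

    ```sql
    SELECT
        CHARINDEX(RTRIM(B.Words), U.URL) AS MatchPosition,
        U.MsgDate, U.Username, U.URL, U.Rating,
        RTRIM(B.Words) AS BadWord
    FROM #URLTbl AS U
    CROSS JOIN #Badwords AS B
    WHERE
        -- style 120 tells CONVERT the text is in yyyy-mm-dd form
        CONVERT(datetime, U.MsgDate, 120) >= '20101210' AND
        -- less than the next day, so all of 1/1/2011 is included
        CONVERT(datetime, U.MsgDate, 120) <  '20110102' AND
        CHARINDEX(RTRIM(B.Words), U.URL) > 0
    ORDER BY
        U.Username,
        U.URL;
    ```

    The sample rows in the first post are all dated November 2010, so this particular range returns nothing against them; widen the range to see matches.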

  • Yes... I had tried that, but I was searching on a field containing date/time data whose datatype was set to Varchar(30), which caused the search to bomb.

    This brings up a datatype question. I'm searching here on the field "URL" and the field "time", and I want to create a clustered index for faster searching, but I think I need to minimize the size of the URL field to allow more than one index if possible. What datatype and size would be best to hold URL data and allow efficient searching, while perhaps leaving enough room to create an index on the date field "TIME"? I'm thinking Varchar(1024) is bigger than it needs to be, but I also wonder whether "TEXT" would allow for any decent searching.

    My goal is to be able to search on the time field (datatype "datetime") and the URL field (current datatype Varchar(1024)).

    (Quoting Jack Corbett, 1/3/2011: the date-range query posted above.)

  • I didn't spend any time looking at your data types, but now that I have, I do have some suggestions.

    1. If the column contains dates then it should be one of the date data types. If you keep it as a character data type you are only asking for trouble. If you are forced to use varchar() for it, then you should be doing date validation before you insert/update that column AND I suggest a computed column that IS a date data type, or a view that converts that column to a date data type. Ideally you'll put the correct data type on the column.

    2. You definitely do NOT want to use the TEXT data type because it is a BLOB data type, not a regular character type, AND it has been deprecated by Microsoft.

    3. An index on msgdate while it is a character column won't give you the results you expect, as it will sort in alphabetical order, NOT date order. So 1/1/01 will come AFTER 01/2/02.

    4. An index on the url column will not help this query. Because you have to use a function, CHARINDEX, on the column, SQL Server MUST do a scan of every row. You might be better off looking into Full Text indexing if you need to speed up this query. I am NOT a full text expert, so I can't really help or tell you if it is the right technology for this problem.
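    Points 1 and 3 can be sketched roughly as follows. Treat this as an illustration under assumptions, not a tested design: it assumes every MsgDate value is a valid 'YYYY-MM-DD' string as in the sample data, and the names MsgDateAsDate and IX_URLTbl_MsgDateAsDate are made up for the example. GO is the batch separator used by SSMS and sqlcmd; it is there so the new column exists before later statements reference it.

    ```sql
    -- Point 3: character data sorts alphabetically, not chronologically.
    -- '01/2/02' sorts ahead of '1/1/01' because '0' comes before '1'.
    SELECT d
    FROM (SELECT '1/1/01' AS d UNION ALL SELECT '01/2/02') AS x
    ORDER BY d;
    GO

    -- Point 1: expose the varchar column as a real datetime through a computed column.
    -- An explicit style (120 = yyyy-mm-dd) keeps the conversion deterministic,
    -- which SQL Server requires before a computed column can be indexed.
    ALTER TABLE #URLTbl
        ADD MsgDateAsDate AS CONVERT(datetime, MsgDate, 120);
    GO

    CREATE INDEX IX_URLTbl_MsgDateAsDate ON #URLTbl (MsgDateAsDate);
    GO

    -- Date-range filter against the computed column: all of 2010-11-10.
    SELECT MsgDate, Username, URL
    FROM #URLTbl
    WHERE MsgDateAsDate >= '20101110'
      AND MsgDateAsDate <  '20101111';
    ```

    If any existing row holds text that is not a valid date, creating the index will fail with a conversion error, which is another reason to validate the data or change the column type outright.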

    I hope this has helped, a little.
