Retun list of top domain names?

  • Is it even possible to query a table containing web URL strings as in the table built by the code below and return from the URL field a list of each seperate root domain listed in the table?

    Example... the final output might be

    msnbc.com - 200 or http://msnbc.com 200 ..

    for root domain "msnbc.com" hit 200 times?

    --To create #URLTbl

    IF OBJECT_ID('TempDB..#URLTbl','U') IS NOT NULL DROP TABLE #URLTbl

    CREATE TABLE #URLTbl

    (MsgDate Varchar(1024),

    Username Varchar (1024),

    URL Varchar (1024),

    Rating Varchar (1024))

    --===== Setup any special required conditions especially where dates are concerned

    SET DATEFORMAT YMD

    INSERT INTO #URLTbl(MsgDate,Username,URL,Rating)

    SELECT '2010-11-10','webuser\patr','http://view.atdmt.com/action/porn/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL

    SELECT '2010-11-10','webuser\mhoag','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL

    SELECT '2010-11-10','webuser\hlavaty3398n','http://content.murder.yieldmanager.com/ak/q.gif','internet services' UNION ALL

    SELECT '2010-11-10','webuser\matth','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news' union all

    SELECT '2010-11-10','webuser\patr','http://view.atdmt.com/action/kill/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL

    SELECT '2010-11-10','webuser\mmhoag','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL

    SELECT '2010-11-10','webuser\hmlavaty3398n','http://content.porn.yieldmanager.com/ak/q.gif','internet services' UNION ALL

    SELECT '2010-11-10','webuser\mamtth','http://assets3b.msnbc.msn.com/fight/rendering/img/bing.png','news' union all

    SELECT '2010-11-10','webuser\patmr','http://view.atdmt.com/action/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL

    SELECT '2010-11-10','webuser\mhoamg','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL

    SELECT '2010-11-10','webuser\hlavamty3398n','http://content.yieldmanager.com/ak/q.gif','internet services' UNION ALL

    SELECT '2010-11-10','webuser\mattmh','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news'union all

    SELECT '2010-11-10','webuser\patrm','http://view.atdmt.com/action/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL

    SELECT '2010-11-12','webuser\mhoamg','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL

    SELECT '2010-11-12','webuser\hlavmaty3398n','http://content.yieldmanager.com/ak/q.gif','internet services' UNION ALL

    SELECT '2010-11-12','webuser\mattmh','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news' union all

    SELECT '2010-11-15','webuser\patrm','http://view.atdmt.com/action/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL

    SELECT '2010-11-15','webuser\mhoamg','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL

    SELECT '2010-11-15','webuser\hlavmaty3398n','http://content.yieldmanager.com/ak/q.gif','internet services' UNION ALL

    SELECT '2010-11-15','webuser\mattmh','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news'union all

    SELECT '2010-11-17','webuser\patr','http://view.atdmt.com/action/porn/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL

    SELECT '2010-11-10','webuser\mhoag','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL

    SELECT '2010-11-17','webuser\hlavaty3398n','http://content.murder.yieldmanager.com/ak/q.gif','internet services' UNION ALL

    SELECT '2010-11-10','webuser\matth','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news' union all

    SELECT '2010-11-17','webuser\patr','http://view.atdmt.com/action/kill/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL

    SELECT '2010-11-10','webuser\mmhoag','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL

    SELECT '2010-11-21','webuser\hmlavaty3398n','http://content.yieldmanager.com/ak/q.gif','internet services' UNION ALL

    SELECT '2010-11-21','webuser\mamtth','http://assets3b.msnbc.msn.com/fight/rendering/img/bing.png','news' union all

    SELECT '2010-11-21','webuser\patmr','http://view.atdmt.com/action/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL

    SELECT '2010-11-10','webuser\mhoamg','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL

    SELECT '2010-11-10','webuser\hlavamty3398n','http://content.yieldmanager.com/ak/q.gif','internet services' UNION ALL

    SELECT '2010-11-10','webuser\mattmh','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news'union all

    SELECT '2010-11-21','webuser\patrm','http://view.atdmt.com/action/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL

    SELECT '2010-11-21','webuser\mhoamg','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL

    SELECT '2010-11-10','webuser\hlavmaty3398n','http://content.yieldmanager.com/ak/q.gif','internet services' UNION ALL

    SELECT '2010-11-10','webuser\mattmh','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news' union all

    SELECT '2010-11-10','webuser\patrm','http://view.atdmt.com/action/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL

    SELECT '2010-11-10','webuser\mhoamg','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL

    SELECT '2010-11-10','webuser\hlavmaty3398n','http://content.yieldmanager.com/ak/q.gif','internet services' UNION ALL

    SELECT '2010-11-10','webuser\mattmh','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news'union all

    SELECT '2010-11-10','webuser\patr','http://view.atdmt.com/action/porn/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL

    SELECT '2010-11-10','webuser\mhoag','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL

    SELECT '2010-11-10','webuser\hlavaty3398n','http://content.murder.yieldmanager.com/ak/q.gif','internet services' UNION ALL

    SELECT '2010-11-10','webuser\matth','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news' union all

    SELECT '2010-11-10','webuser\patr','http://view.atdmt.com/action/kill/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL

    SELECT '2010-11-10','webuser\mmhoag','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL

    SELECT '2010-11-10','webuser\hmlavaty3398n','http://content.yieldmanager.com/ak/q.gif','internet services' UNION ALL

    SELECT '2010-11-10','webuser\mamtth','http://assets3b.msnbc.msn.com/fight/rendering/img/bing.png','news' union all

    SELECT '2010-11-10','webuser\patmr','http://view.atdmt.com/action/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL

    SELECT '2010-11-10','webuser\mhoamg','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL

    SELECT '2010-11-25','webuser\hlavamty3398n','http://content.yieldmanager.com/ak/q.gif','internet services' UNION ALL

    SELECT '2010-11-25','webuser\mattmh','http://assets3bomb.msnbc.msn.com/rendering/img/bing.png','news'union all

    SELECT '2010-11-25','webuser\patrm','http://view.atdmt.com/action/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL

    SELECT '2010-11-25','webuser\mhoamg','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL

    SELECT '2010-11-25','webuser\hlavmaty3398n','http://content.yieldmanager.com/ak/q.gif','internet services' UNION ALL

    SELECT '2010-11-25','webuser\mattmh','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news' union all

    SELECT '2010-11-25','webuser\patrm','http://view.atdmt.com/action/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL

    SELECT '2010-11-25','webuser\mhoamg','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL

    SELECT '2010-11-25','webuser\hlavmaty3398n','http://content.yieldmanager.com/ak/q.gif','internet services' UNION ALL

    SELECT '2010-11-29','webuser\mattmh','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news'union all

    SELECT '2010-11-29','webuser\patr','http://view.atdmt.com/action/porn/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL

    SELECT '2010-11-29','webuser\mhoag','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL

    SELECT '2010-11-29','webuser\hlavaty3398n','http://content.murder.yieldmanager.com/ak/q.gif','internet services' UNION ALL

    SELECT '2010-11-29','webuser\matth','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news' union all

    SELECT '2010-11-29','webuser\patr','http://view.atdmt.com/action/kill/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL

    SELECT '2010-11-29','webuser\mmhoag','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL

    SELECT '2010-11-29','webuser\hmlavaty3398n','http://content.yieldmanagergun.com/ak/q.gif','internet services' UNION ALL

    SELECT '2010-11-29','webuser\mamtth','http://assets3b.msnbc.msn.com/fight/gun/rendering/img/bing.png','news' union all

    SELECT '2010-11-07','webuser\patmr','http://view.atdmt.com/action/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL

    SELECT '2010-11-07','webuser\mhoamg','http://cdn.unicast.msn.com/porn/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL

    SELECT '2010-11-07','webuser\hlavamty3398n','http://content.yieldmanager.com/ak/q.gif','internet services' UNION ALL

    SELECT '2010-11-07','webuser\mattmh','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news'union all

    SELECT '2010-11-07','webuser\patrm','http://view.atdmt.com/action/MSN_Homebombpage_Remessaging_111808/nc?a=1','web banners' UNION ALL

    SELECT '2010-11-07','webuser\mhoamg','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL

    SELECT '2010-11-07','webuser\hlavmaty3398n','http://content.yieldmanager.com/ak/q.gif','internet services' UNION ALL

    SELECT '2010-11-07','webuser\mattmh','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news' union all

    SELECT '2010-11-07','webuser\patrm','http://view.atdmt.com/action/nude/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL

    SELECT '2010-11-07','webuser\mhoamg','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C18naked1757/large.jpg?0.5824313560398167','web search' UNION ALL

    SELECT '2010-11-07','webuser\hlavmaty3398n','http://content.bomb.yieldmanager.com/ak/q.gif','internet services' UNION ALL

    SELECT '2010-11-10','webuser\mattmh','http://assets3b.msnbc.msn.com/fight/rendering/img/bing.png','news'

  • I would say yes, but the code to crack the root domain out is going to be slow. And since you are using 7 or 2000 you can't use persisted computed columns which would really help for later queries.

    I will have to think about a cracking process to make it viable..

    CEWII

  • Very quick way to get the full domain out is:

    SELECT Domain = SUBSTRING( URL, 8, CHARINDEX( '/', URL, 8 ) - 8 )

    I'm still working on taking something like view.atdmt.com or cdn.unicast.msn.com down to atdmt.com or msn.com.. Thats a bit tougher..

    The above code makes 2 assumptions, first that the URL starts with "http://" and that the domain portion ends at the first "/" following the "http://". As I said quick.. Additional checks like https:// or ftp:// or something else will cause it to run slower, but I'll look..

    Also, just thought of this.. If you are using Enterprise edition you could build an indexed view and create an index on this calculated column, that would have the effect of materializing it. Basically you pay for the calculation ONCE and it gets stored as part of the index.. However, if you don't have Enterprise Edition I think you can still create indexed views but the Query Optimizer won't use them automatically, I think you have to give it query hints.. By materializing the data the calculation occurs once, gets stored, and then when you query for the count it uses the index and doesn't need to go to the base table to query it.

    CEWII

  • Actually.. I'm usinig SQL2005 .. should have mentioned that...

    When I ran your code got

    Msg 207, Level 16, State 1, Line 1

    Invalid column name 'URL'.

    Msg 207, Level 16, State 1, Line 1

    Invalid column name 'URL'.

  • That wasn't a full select statement, it was only enough to demonstrate the syntax.. Add:

    FROM #URLTbl

    It should work..

    CEWII

  • Now read this ... after all November 2011 is close approaching.

    The change from IPV4 to IPV6

    http://www.bbc.co.uk/news/technology-12183098

    And then to test your own system readiness follow this link

    http://test-ipv6.com/

    If everything seems to be going well, you have obviously overlooked something.

    Ron

    Please help us, help you -before posting a question please read[/url]
    Before posting a performance problem please read[/url]

  • Thanks.. it seems to work and returns results, but upon completion also displays the error message...

    Msg 536, Level 16, State 5, Line 1

    Invalid length parameter passed to the SUBSTRING function.

    Elliott Whitlow (1/13/2011)


    That wasn't a full select statement, it was only enough to demonstrate the syntax.. Add:

    FROM #URLTbl

    It should work..

    CEWII

  • If you get that message, then somewhere in there you have a pattern that isn't being caught or isn't handled well. The Charindex doesn't find something and returns a 0.

  • Agreed Steve..

    The row is obviously violating one of the assumptions I made..

    My guess is https instead of http..

    CEWII

  • Yes... I am using Enterprise edition.. so I'll look at the indexed view suggestion.. and as this is the output from a webfilter, there are HTTPS entries, but we block FTP so that should not be a big issue.

    Thanks!

  • Its pretty limited and no pretty.. But it works for http/https/ftp:

    SELECT URLCnt = COUNT(URL),

    Domain = CASE SUBSTRING( URL, 1, 5 )

    WHEN 'http:'

    THEN SUBSTRING( URL, 8, CHARINDEX( '/', URL, 8 ) - 8 )

    WHEN 'https'

    THEN SUBSTRING( URL, 9, CHARINDEX( '/', URL, 9 ) - 9 )

    WHEN 'ftp:/'

    THEN SUBSTRING( URL, 7, CHARINDEX( '/', URL, 7 ) - 7 )

    ELSE URL

    END

    FROM #URLTbl

    GROUP BY CASE SUBSTRING( URL, 1, 5 )

    WHEN 'http:'

    THEN SUBSTRING( URL, 8, CHARINDEX( '/', URL, 8 ) - 8 )

    WHEN 'https'

    THEN SUBSTRING( URL, 9, CHARINDEX( '/', URL, 9 ) - 9 )

    WHEN 'ftp:/'

    THEN SUBSTRING( URL, 7, CHARINDEX( '/', URL, 7 ) - 7 )

    ELSE URL

    END

    GO

    Assumption: only the three types are handled.. I would not want to run this on a big table very often.

    CEWII

  • Only 60 million records as of today ... should be fine eh?

  • That is a lot of rows.. I would definitely look at the indexed view to persist the values otherwise it has to recalculate the values everytime you query them and 60m rows is just too painful to do every time..

    Also keep in mind the COUNT(URL) that is in that query CANNOT be in the view, I just put that there to show it working.. The indexed view cannot contain aggregates or non-deterministic functions.

    CEWII

  • Just got back to work where I could try this, and I get the error

    Msg 536, Level 16, State 5, Line 1

    Invalid length parameter passed to the SUBSTRING function.

    When I run it on real data.....

  • Forgot to quote the source code when I replied to you.. but when I run this I get the error below...

    Msg 536, Level 16, State 5, Line 1

    Invalid length parameter passed to the SUBSTRING function.

    Any thoughts on this would be appreciated....

    Thanks...

    Elliott Whitlow (1/13/2011)


    Its pretty limited and no pretty.. But it works for http/https/ftp:

    SELECT URLCnt = COUNT(URL),

    Domain = CASE SUBSTRING( URL, 1, 5 )

    WHEN 'http:'

    THEN SUBSTRING( URL, 8, CHARINDEX( '/', URL, 8 ) - 8 )

    WHEN 'https'

    THEN SUBSTRING( URL, 9, CHARINDEX( '/', URL, 9 ) - 9 )

    WHEN 'ftp:/'

    THEN SUBSTRING( URL, 7, CHARINDEX( '/', URL, 7 ) - 7 )

    ELSE URL

    END

    FROM #URLTbl

    GROUP BY CASE SUBSTRING( URL, 1, 5 )

    WHEN 'http:'

    THEN SUBSTRING( URL, 8, CHARINDEX( '/', URL, 8 ) - 8 )

    WHEN 'https'

    THEN SUBSTRING( URL, 9, CHARINDEX( '/', URL, 9 ) - 9 )

    WHEN 'ftp:/'

    THEN SUBSTRING( URL, 7, CHARINDEX( '/', URL, 7 ) - 7 )

    ELSE URL

    END

    GO

    Assumption: only the three types are handled.. I would not want to run this on a big table very often.

    CEWII

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply