January 12, 2011 at 3:37 pm
Is it even possible to query a table containing web URL strings as in the table built by the code below and return from the URL field a list of each seperate root domain listed in the table?
Example... the final output might be
msnbc.com - 200 or http://msnbc.com 200 ..
for root domain "msnbc.com" hit 200 times?
--To create #URLTbl
IF OBJECT_ID('TempDB..#URLTbl','U') IS NOT NULL DROP TABLE #URLTbl
CREATE TABLE #URLTbl
(MsgDate Varchar(1024),
Username Varchar (1024),
URL Varchar (1024),
Rating Varchar (1024))
--===== Setup any special required conditions especially where dates are concerned
SET DATEFORMAT YMD
INSERT INTO #URLTbl(MsgDate,Username,URL,Rating)
SELECT '2010-11-10','webuser\patr','http://view.atdmt.com/action/porn/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL
SELECT '2010-11-10','webuser\mhoag','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL
SELECT '2010-11-10','webuser\hlavaty3398n','http://content.murder.yieldmanager.com/ak/q.gif','internet services' UNION ALL
SELECT '2010-11-10','webuser\matth','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news' union all
SELECT '2010-11-10','webuser\patr','http://view.atdmt.com/action/kill/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL
SELECT '2010-11-10','webuser\mmhoag','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL
SELECT '2010-11-10','webuser\hmlavaty3398n','http://content.porn.yieldmanager.com/ak/q.gif','internet services' UNION ALL
SELECT '2010-11-10','webuser\mamtth','http://assets3b.msnbc.msn.com/fight/rendering/img/bing.png','news' union all
SELECT '2010-11-10','webuser\patmr','http://view.atdmt.com/action/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL
SELECT '2010-11-10','webuser\mhoamg','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL
SELECT '2010-11-10','webuser\hlavamty3398n','http://content.yieldmanager.com/ak/q.gif','internet services' UNION ALL
SELECT '2010-11-10','webuser\mattmh','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news'union all
SELECT '2010-11-10','webuser\patrm','http://view.atdmt.com/action/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL
SELECT '2010-11-12','webuser\mhoamg','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL
SELECT '2010-11-12','webuser\hlavmaty3398n','http://content.yieldmanager.com/ak/q.gif','internet services' UNION ALL
SELECT '2010-11-12','webuser\mattmh','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news' union all
SELECT '2010-11-15','webuser\patrm','http://view.atdmt.com/action/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL
SELECT '2010-11-15','webuser\mhoamg','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL
SELECT '2010-11-15','webuser\hlavmaty3398n','http://content.yieldmanager.com/ak/q.gif','internet services' UNION ALL
SELECT '2010-11-15','webuser\mattmh','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news'union all
SELECT '2010-11-17','webuser\patr','http://view.atdmt.com/action/porn/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL
SELECT '2010-11-10','webuser\mhoag','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL
SELECT '2010-11-17','webuser\hlavaty3398n','http://content.murder.yieldmanager.com/ak/q.gif','internet services' UNION ALL
SELECT '2010-11-10','webuser\matth','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news' union all
SELECT '2010-11-17','webuser\patr','http://view.atdmt.com/action/kill/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL
SELECT '2010-11-10','webuser\mmhoag','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL
SELECT '2010-11-21','webuser\hmlavaty3398n','http://content.yieldmanager.com/ak/q.gif','internet services' UNION ALL
SELECT '2010-11-21','webuser\mamtth','http://assets3b.msnbc.msn.com/fight/rendering/img/bing.png','news' union all
SELECT '2010-11-21','webuser\patmr','http://view.atdmt.com/action/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL
SELECT '2010-11-10','webuser\mhoamg','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL
SELECT '2010-11-10','webuser\hlavamty3398n','http://content.yieldmanager.com/ak/q.gif','internet services' UNION ALL
SELECT '2010-11-10','webuser\mattmh','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news'union all
SELECT '2010-11-21','webuser\patrm','http://view.atdmt.com/action/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL
SELECT '2010-11-21','webuser\mhoamg','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL
SELECT '2010-11-10','webuser\hlavmaty3398n','http://content.yieldmanager.com/ak/q.gif','internet services' UNION ALL
SELECT '2010-11-10','webuser\mattmh','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news' union all
SELECT '2010-11-10','webuser\patrm','http://view.atdmt.com/action/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL
SELECT '2010-11-10','webuser\mhoamg','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL
SELECT '2010-11-10','webuser\hlavmaty3398n','http://content.yieldmanager.com/ak/q.gif','internet services' UNION ALL
SELECT '2010-11-10','webuser\mattmh','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news'union all
SELECT '2010-11-10','webuser\patr','http://view.atdmt.com/action/porn/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL
SELECT '2010-11-10','webuser\mhoag','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL
SELECT '2010-11-10','webuser\hlavaty3398n','http://content.murder.yieldmanager.com/ak/q.gif','internet services' UNION ALL
SELECT '2010-11-10','webuser\matth','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news' union all
SELECT '2010-11-10','webuser\patr','http://view.atdmt.com/action/kill/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL
SELECT '2010-11-10','webuser\mmhoag','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL
SELECT '2010-11-10','webuser\hmlavaty3398n','http://content.yieldmanager.com/ak/q.gif','internet services' UNION ALL
SELECT '2010-11-10','webuser\mamtth','http://assets3b.msnbc.msn.com/fight/rendering/img/bing.png','news' union all
SELECT '2010-11-10','webuser\patmr','http://view.atdmt.com/action/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL
SELECT '2010-11-10','webuser\mhoamg','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL
SELECT '2010-11-25','webuser\hlavamty3398n','http://content.yieldmanager.com/ak/q.gif','internet services' UNION ALL
SELECT '2010-11-25','webuser\mattmh','http://assets3bomb.msnbc.msn.com/rendering/img/bing.png','news'union all
SELECT '2010-11-25','webuser\patrm','http://view.atdmt.com/action/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL
SELECT '2010-11-25','webuser\mhoamg','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL
SELECT '2010-11-25','webuser\hlavmaty3398n','http://content.yieldmanager.com/ak/q.gif','internet services' UNION ALL
SELECT '2010-11-25','webuser\mattmh','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news' union all
SELECT '2010-11-25','webuser\patrm','http://view.atdmt.com/action/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL
SELECT '2010-11-25','webuser\mhoamg','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL
SELECT '2010-11-25','webuser\hlavmaty3398n','http://content.yieldmanager.com/ak/q.gif','internet services' UNION ALL
SELECT '2010-11-29','webuser\mattmh','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news'union all
SELECT '2010-11-29','webuser\patr','http://view.atdmt.com/action/porn/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL
SELECT '2010-11-29','webuser\mhoag','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL
SELECT '2010-11-29','webuser\hlavaty3398n','http://content.murder.yieldmanager.com/ak/q.gif','internet services' UNION ALL
SELECT '2010-11-29','webuser\matth','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news' union all
SELECT '2010-11-29','webuser\patr','http://view.atdmt.com/action/kill/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL
SELECT '2010-11-29','webuser\mmhoag','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL
SELECT '2010-11-29','webuser\hmlavaty3398n','http://content.yieldmanagergun.com/ak/q.gif','internet services' UNION ALL
SELECT '2010-11-29','webuser\mamtth','http://assets3b.msnbc.msn.com/fight/gun/rendering/img/bing.png','news' union all
SELECT '2010-11-07','webuser\patmr','http://view.atdmt.com/action/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL
SELECT '2010-11-07','webuser\mhoamg','http://cdn.unicast.msn.com/porn/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL
SELECT '2010-11-07','webuser\hlavamty3398n','http://content.yieldmanager.com/ak/q.gif','internet services' UNION ALL
SELECT '2010-11-07','webuser\mattmh','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news'union all
SELECT '2010-11-07','webuser\patrm','http://view.atdmt.com/action/MSN_Homebombpage_Remessaging_111808/nc?a=1','web banners' UNION ALL
SELECT '2010-11-07','webuser\mhoamg','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C181757/large.jpg?0.5824313560398167','web search' UNION ALL
SELECT '2010-11-07','webuser\hlavmaty3398n','http://content.yieldmanager.com/ak/q.gif','internet services' UNION ALL
SELECT '2010-11-07','webuser\mattmh','http://assets3b.msnbc.msn.com/rendering/img/bing.png','news' union all
SELECT '2010-11-07','webuser\patrm','http://view.atdmt.com/action/nude/MSN_Homepage_Remessaging_111808/nc?a=1','web banners' UNION ALL
SELECT '2010-11-07','webuser\mhoamg','http://cdn.unicast.msn.com/assets/A1/N6/M8379/C18naked1757/large.jpg?0.5824313560398167','web search' UNION ALL
SELECT '2010-11-07','webuser\hlavmaty3398n','http://content.bomb.yieldmanager.com/ak/q.gif','internet services' UNION ALL
SELECT '2010-11-10','webuser\mattmh','http://assets3b.msnbc.msn.com/fight/rendering/img/bing.png','news'
January 12, 2011 at 3:44 pm
I would say yes, but the code to crack the root domain out is going to be slow. And since you are using 7 or 2000 you can't use persisted computed columns which would really help for later queries.
I will have to think about a cracking process to make it viable..
CEWII
January 12, 2011 at 3:59 pm
Very quick way to get the full domain out is:
SELECT Domain = SUBSTRING( URL, 8, CHARINDEX( '/', URL, 8 ) - 8 )
I'm still working on taking something like view.atdmt.com or cdn.unicast.msn.com down to atdmt.com or msn.com.. Thats a bit tougher..
The above code makes 2 assumptions, first that the URL starts with "http://" and that the domain portion ends at the first "/" following the "http://". As I said quick.. Additional checks like https:// or ftp:// or something else will cause it to run slower, but I'll look..
Also, just thought of this.. If you are using Enterprise edition you could build an indexed view and create an index on this calculated column, that would have the effect of materializing it. Basically you pay for the calculation ONCE and it gets stored as part of the index.. However, if you don't have Enterprise Edition I think you can still create indexed views but the Query Optimizer won't use them automatically, I think you have to give it query hints.. By materializing the data the calculation occurs once, gets stored, and then when you query for the count it uses the index and doesn't need to go to the base table to query it.
CEWII
January 12, 2011 at 4:04 pm
Actually.. I'm usinig SQL2005 .. should have mentioned that...
When I ran your code got
Msg 207, Level 16, State 1, Line 1
Invalid column name 'URL'.
Msg 207, Level 16, State 1, Line 1
Invalid column name 'URL'.
January 13, 2011 at 8:45 am
That wasn't a full select statement, it was only enough to demonstrate the syntax.. Add:
FROM #URLTbl
It should work..
CEWII
January 13, 2011 at 9:27 am
Now read this ... after all November 2011 is close approaching.
The change from IPV4 to IPV6
http://www.bbc.co.uk/news/technology-12183098
And then to test your own system readiness follow this link
January 13, 2011 at 10:23 am
Thanks.. it seems to work and returns results, but upon completion also displays the error message...
Msg 536, Level 16, State 5, Line 1
Invalid length parameter passed to the SUBSTRING function.
Elliott Whitlow (1/13/2011)
That wasn't a full select statement, it was only enough to demonstrate the syntax.. Add:FROM #URLTbl
It should work..
CEWII
January 13, 2011 at 10:39 am
If you get that message, then somewhere in there you have a pattern that isn't being caught or isn't handled well. The Charindex doesn't find something and returns a 0.
January 13, 2011 at 12:34 pm
Agreed Steve..
The row is obviously violating one of the assumptions I made..
My guess is https instead of http..
CEWII
January 13, 2011 at 1:06 pm
Yes... I am using Enterprise edition.. so I'll look at the indexed view suggestion.. and as this is the output from a webfilter, there are HTTPS entries, but we block FTP so that should not be a big issue.
Thanks!
January 13, 2011 at 8:22 pm
Its pretty limited and no pretty.. But it works for http/https/ftp:
SELECT URLCnt = COUNT(URL),
Domain = CASE SUBSTRING( URL, 1, 5 )
WHEN 'http:'
THEN SUBSTRING( URL, 8, CHARINDEX( '/', URL, 8 ) - 8 )
WHEN 'https'
THEN SUBSTRING( URL, 9, CHARINDEX( '/', URL, 9 ) - 9 )
WHEN 'ftp:/'
THEN SUBSTRING( URL, 7, CHARINDEX( '/', URL, 7 ) - 7 )
ELSE URL
END
FROM #URLTbl
GROUP BY CASE SUBSTRING( URL, 1, 5 )
WHEN 'http:'
THEN SUBSTRING( URL, 8, CHARINDEX( '/', URL, 8 ) - 8 )
WHEN 'https'
THEN SUBSTRING( URL, 9, CHARINDEX( '/', URL, 9 ) - 9 )
WHEN 'ftp:/'
THEN SUBSTRING( URL, 7, CHARINDEX( '/', URL, 7 ) - 7 )
ELSE URL
END
GO
Assumption: only the three types are handled.. I would not want to run this on a big table very often.
CEWII
January 13, 2011 at 11:34 pm
Only 60 million records as of today ... should be fine eh?
January 14, 2011 at 9:47 am
That is a lot of rows.. I would definitely look at the indexed view to persist the values otherwise it has to recalculate the values everytime you query them and 60m rows is just too painful to do every time..
Also keep in mind the COUNT(URL) that is in that query CANNOT be in the view, I just put that there to show it working.. The indexed view cannot contain aggregates or non-deterministic functions.
CEWII
January 18, 2011 at 3:59 pm
Just got back to work where I could try this, and I get the error
Msg 536, Level 16, State 5, Line 1
Invalid length parameter passed to the SUBSTRING function.
When I run it on real data.....
January 20, 2011 at 9:45 am
Forgot to quote the source code when I replied to you.. but when I run this I get the error below...
Msg 536, Level 16, State 5, Line 1
Invalid length parameter passed to the SUBSTRING function.
Any thoughts on this would be appreciated....
Thanks...
Elliott Whitlow (1/13/2011)
Its pretty limited and no pretty.. But it works for http/https/ftp:
SELECT URLCnt = COUNT(URL),
Domain = CASE SUBSTRING( URL, 1, 5 )
WHEN 'http:'
THEN SUBSTRING( URL, 8, CHARINDEX( '/', URL, 8 ) - 8 )
WHEN 'https'
THEN SUBSTRING( URL, 9, CHARINDEX( '/', URL, 9 ) - 9 )
WHEN 'ftp:/'
THEN SUBSTRING( URL, 7, CHARINDEX( '/', URL, 7 ) - 7 )
ELSE URL
END
FROM #URLTbl
GROUP BY CASE SUBSTRING( URL, 1, 5 )
WHEN 'http:'
THEN SUBSTRING( URL, 8, CHARINDEX( '/', URL, 8 ) - 8 )
WHEN 'https'
THEN SUBSTRING( URL, 9, CHARINDEX( '/', URL, 9 ) - 9 )
WHEN 'ftp:/'
THEN SUBSTRING( URL, 7, CHARINDEX( '/', URL, 7 ) - 7 )
ELSE URL
END
GO
Assumption: only the three types are handled.. I would not want to run this on a big table very often.
CEWII
Viewing 15 posts - 1 through 15 (of 15 total)
You must be logged in to reply to this topic. Login to reply