|
|
|
SSC-Addicted
      
Group: General Forum Members
Last Login: Tuesday, May 14, 2013 12:31 PM
Points: 437,
Visits: 883
|
|
I didn't want to muddy up someone else's topic, so here's the results of parsing URL data into discrete columns.
The tally table (from http://www.sqlservercentral.com/scripts/Advanced+SQL/62486/:
--===== Create and populate the Tally table on the fly SELECT TOP 11000 --equates to more than 30 years of dates IDENTITY(INT,1,1) AS N INTO dbo.Tally FROM Master.dbo.SysColumns sc1, Master.dbo.SysColumns sc2
--===== Add a Primary Key to maximize performance ALTER TABLE dbo.Tally ADD CONSTRAINT PK_Tally_N PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100
--===== Allow the general public to use it GRANT SELECT ON dbo.Tally TO PUBLIC The data table (provided by Jeff Moden):
drop table #SourceTable SELECT TOP 1000000 RowNum = IDENTITY(INT,1,1), u.Parameters INTO #SourceTable FROM (--===== Same two lines of data repeated SELECT CAST('Subject=Drilling&Category=Drill Header&Status=NO'+ '&IPPDM_Count=94356&Well_Count=84353&Zone_Status=NO'+ '&Prod_String_Count=95249&R_Master_W_Count=94353'+ '&BeginDate=2/3/2008&EndDate=2/5/2008' AS VARCHAR(8000)) AS Parameters UNION ALL SELECT 'Subject=Zone&Category=Zone Header&Status=YES'+ '&IPPDM_Count=94356&Well_Count=94356&Zone_Status=NO'+ '&Prod_String_Count=95249&R_Master_W_Count=94356&AddlZoneInterest=107674') u, Master.sys.All_Columns ac1, Master.sys.All_Columns ac2
ALTER TABLE #SourceTable ADD PRIMARY KEY CLUSTERED (RowNum) Both the tally table and data table were created in db named staging. Split technique (from Jeff Moden):
use staging go SET ANSI_WARNINGS OFF SET STATISTICS TIME ON go --===== Split the data into an EAV (Entity and Value) cte and reassemble as rows containing the desired columns ;WITH cteFirstSplit AS (--==== Splits data on the "&" character SELECT h.RowNum, ColNum = (ROW_NUMBER() OVER (ORDER BY Parameters)-1)%10, ColVal = SUBSTRING(h.Parameters, t.N+1, CHARINDEX('&', h.Parameters + '&', t.N+1)-t.N-1) FROM dbo.Tally t WITH (NOLOCK) RIGHT OUTER JOIN --Necessary in case Parameters is NULL SourceTable h ON SUBSTRING(h.Parameters, t.N, 1) = '=' --AND t.N < CHARINDEX('Zone_',h.Parameters) AND t.N < len(h.Parameters) --WHERE RowNum > 900000 --WHERE RowNum <= 10000 ) SELECT RowNum, MAX(CASE WHEN ColNum = 0 THEN ColVal ELSE NULL END) AS Subject, MAX(CASE WHEN ColNum = 1 THEN ColVal ELSE NULL END) AS Category, MAX(CASE WHEN ColNum = 2 THEN ColVal ELSE NULL END) AS Status, MAX(CASE WHEN ColNum = 3 THEN ColVal ELSE NULL END) AS IPPDM_Count, MAX(CASE WHEN ColNum = 4 THEN ColVal ELSE NULL END) AS Well_Count/*, MAX(CASE WHEN ColNum = 5 THEN ColVal ELSE NULL END) AS Zone_Status, MAX(CASE WHEN ColNum = 6 THEN ColVal ELSE NULL END) AS Prod_String_Count, MAX(CASE WHEN ColNum = 7 THEN ColVal ELSE NULL END) AS R_Master_W_Count, MAX(CASE WHEN ColNum = 8 THEN ColVal ELSE NULL END) AS BeginDate, MAX(CASE WHEN ColNum = 9 THEN ColVal ELSE NULL END) AS EndDate */ INTO #Results FROM cteFirstSplit GROUP BY RowNum print @@ROWCOUNT SET STATISTICS TIME OFF go
I modified the above script to suffix an ampersand to h.Parameters since the substring would fail if BeginDate was omitted.
XML technique:
use staging go SET ANSI_WARNINGS ON SET STATISTICS TIME ON SET STATISTICS IO ON SET QUOTED_IDENTIFIER ON go declare @xml XML set @xml = ( select cast('<parm RowNum="'+ cast(rowNum as varchar(12))+ '" '+ replace(replace(Parameters,'=','="'),'&','" ')+ '"/>' as XML) as data from SourceTable --where RowNum > 900000 -- where RowNum <= 10000 for xml auto, type)
select D.parm.value('@RowNum','int') as RowNum, D.parm.value('@Subject','varchar(255)') as [Subject], D.parm.value('@Category','varchar(255)') as Category, D.parm.value('@Status','varchar(255)') as [Status], D.parm.value('@IPPDM_Count','varchar(255)') as IPPDM_Count, D.parm.value('@Well_Count','varchar(255)') as Well_Count /*, D.parm.value('@Zone_Status','varchar(255)') as Zone_Status, D.parm.value('@Prod_String_Count','varchar(255)') as Prod_String_Count, D.parm.value('@R_Master_W_Count','varchar(255)') as R_Master_W_Count, D.parm.value('@BeginDate','varchar(255)') as BeginDate, D.parm.value('@EndDate','varchar(255)') as EndDate */ into #results from @xml.nodes('/SourceTable/data/parm') as D(parm) go SET STATISTICS TIME OFF go I ran both scripts twice for datasets of 10K and 100K rows and 5 fields and 10 fields (fields 6-10 are commented out as is the criteria for the various batch sizes). Results:
CPU Elapsed Rows Fields Run XML Split Diff XML Split Diff 10K 5 1 3,625 1,359 62.5% 3,618 1,369 62.2% 5 2 3,641 1,422 60.9% 3,631 1,409 61.2% 100K 5 1 35,719 41,985 -17.5% 35,729 42,960 -20.2% 5 2 35,719 41,686 -16.7% 35,733 42,655 -19.4% 10K 10 1 4,891 1,750 64.2% 4,882 1,748 64.2% 10 2 4,907 1,750 64.3% 4,910 1,741 64.5% 100K 10 1 48,578 45,030 7.3% 48,595 43,770 9.9% 10 2 48,609 45,563 6.3% 48,654 44,497 8.5% 1M 10 1 486,813 466,952 4.1% 486,900 482,407 0.9%
The Split technique wins 3 out of 4 tests, but it's margin of victory was never more than 4 seconds. So, I ran another test with all 1M rows and it's practically a dead heat.
This was my second cut at a query using XML and I'm pretty sure it can be sped up by eliminating the use of the @xml intermediate variable or extracting all attributes at once.
-- this doesn't work select D.parm ... into #results from (select cast(...) as XML) as data from SourceTable for xml auto, type).nodes('/SourceTable/data/parm') as D(parm)
-- this doesn't work either probably because it's an attribute collection rather than an element collection select * from @xml.nodes('/SourceTable/data/parm/@*') as D(RowNum,Subject,Category,...)
One huge difference between the two techniques is the fact that the XML technique is not position sensitive -- it will extract the fields regardless of order. Also, the XML query is easier to grasp and should be easier to maintain.
So Jeff, I will buy you a beer, but only a Bud Light (it's practically water anyway). 
|
|
|
|
|
SSCoach
         
Group: General Forum Members
Last Login: Yesterday @ 1:55 PM
Points: 15,442,
Visits: 9,571
|
|
Antonio:
My tests with XML vs Numbers/Tally table have shown similar CPU time results, but have had interesting IO differences. Did you include IO stats in your runs? (If not, is it something you can run again to test that?)
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
|
|
|
|
|
SSCoach
         
Group: General Forum Members
Last Login: Yesterday @ 1:55 PM
Points: 15,442,
Visits: 9,571
|
|
(I have to confess I wasn't as courteous as Antonio, and posted my tests on Numbers/Tally vs XML on the original thread. That's here: http://www.sqlservercentral.com/Forums/Topic478171-338-2.aspx)
- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC Property of The Thread
"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
|
|
|
|
|
SSC-Dedicated
           
Group: General Forum Members
Last Login: Yesterday @ 9:57 PM
Points: 32,906,
Visits: 26,790
|
|
Sorry folks... dunno how I missed this one.
Antonio... I sure do appreciate the attention to detail you put into the tests you did and the reports that followed. Absolutely awesome job!
--Jeff Moden "RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".
First step towards the paradigm shift of writing Set Based code: Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."
For better, quicker answers on T-SQL questions, click on the following... http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better answers on performance questions, click on the following... http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
|
|
|
|
|
SSCommitted
      
Group: General Forum Members
Last Login: Tuesday, May 29, 2012 11:22 AM
Points: 1,755,
Visits: 4,652
|
|
Another XML-style approach...
drop table #Results ; with a as (select RowNum, cast('<parm '+ replace(replace(Parameters,'=','="'),'&','" ')+ '"/>' as xml) as Xml from #SourceTable) select RowNum, Xml.value('(parm[1]/@Subject)[1]', 'varchar(255)') as Subject, Xml.value('(parm[1]/@Category)[1]', 'varchar(255)') as Category, Xml.value('(parm[1]/@Status)[1]', 'varchar(255)') as Status, Xml.value('(parm[1]/@IPPDM_Count)[1]', 'varchar(255)') as IPPDM_Count, Xml.value('(parm[1]/@Well_Count)[1]', 'varchar(255)') as Well_Count into #Results from a This seems to run much faster for me, but perhaps I'm misunderstanding the problem?
Ryan Randall
Solutions are easy. Understanding the problem, now, that's the hard part.
|
|
|
|
|
Valued Member
      
Group: General Forum Members
Last Login: Yesterday @ 9:21 AM
Points: 61,
Visits: 576
|
|
Jeff Moden (4/17/2008) Sorry folks... dunno how I missed this one.
Antonio... I sure do appreciate the attention to detail you put into the tests you did and the reports that followed. Absolutely awesome job! Actually, I like your method for figuring out the element names in the first place:
SELECT distinct ColName = substring( h.Parameters, t.N - (charindex('&', reverse('&' + left(h.Parameters, t.N - 1))) - 1), charindex('&', reverse('&' + left(h.Parameters, t.N- 1))) - 1 ) FROM dbo.Tally t WITH (NOLOCK) RIGHT OUTER JOIN --Necessary in case Parameters is NULL #SourceTable h ON SUBSTRING(h.Parameters, t.N, 1) = '=' AND t.N < len(h.Parameters)
This runs pretty quickly for me. (yes, it's very inelegant. If anyone knows how to more directly find the last occurring character *before* a string position, I'm very interested)
I tried making the XQuery value pull dynamic, but it insists on a literal string. So, if you want to be completist about it, I think you have to go dynamic SQL.
|
|
|
|
|
SSC-Dedicated
           
Group: General Forum Members
Last Login: Yesterday @ 9:57 PM
Points: 32,906,
Visits: 26,790
|
|
RyanRandall (5/22/2008)
Another XML-style approach... drop table #Results ; with a as (select RowNum, cast('<parm '+ replace(replace(Parameters,'=','="'),'&','" ')+ '"/>' as xml) as Xml from #SourceTable) select RowNum, Xml.value('(parm[1]/@Subject)[1]', 'varchar(255)') as Subject, Xml.value('(parm[1]/@Category)[1]', 'varchar(255)') as Category, Xml.value('(parm[1]/@Status)[1]', 'varchar(255)') as Status, Xml.value('(parm[1]/@IPPDM_Count)[1]', 'varchar(255)') as IPPDM_Count, Xml.value('(parm[1]/@Well_Count)[1]', 'varchar(255)') as Well_Count into #Results from a This seems to run much faster for me, but perhaps I'm misunderstanding the problem?
Nice... I'll throw that into a million row test and check. Thanks, Ryan.
--Jeff Moden "RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".
First step towards the paradigm shift of writing Set Based code: Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."
For better, quicker answers on T-SQL questions, click on the following... http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better answers on performance questions, click on the following... http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
|
|
|
|