Click here to monitor SSC
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


XML vs Split for parsing of URL type data


XML vs Split for parsing of URL type data

Author
Message
antonio.collins
antonio.collins
SSC-Addicted
SSC-Addicted (454 reputation)SSC-Addicted (454 reputation)SSC-Addicted (454 reputation)SSC-Addicted (454 reputation)SSC-Addicted (454 reputation)SSC-Addicted (454 reputation)SSC-Addicted (454 reputation)SSC-Addicted (454 reputation)

Group: General Forum Members
Points: 454 Visits: 921
I didn't want to muddy up someone else's topic, so here's the results of parsing URL data into discrete columns.

The tally table (from http://www.sqlservercentral.com/scripts/Advanced+SQL/62486/:
--===== Create and populate the Tally table on the fly
SELECT TOP 11000 --equates to more than 30 years of dates
IDENTITY(INT,1,1) AS N
INTO dbo.Tally
FROM Master.dbo.SysColumns sc1,
Master.dbo.SysColumns sc2

--===== Add a Primary Key to maximize performance
ALTER TABLE dbo.Tally
ADD CONSTRAINT PK_Tally_N
PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100

--===== Allow the general public to use it
GRANT SELECT ON dbo.Tally TO PUBLIC


The data table (provided by Jeff Moden):
drop table #SourceTable
SELECT TOP 1000000
RowNum = IDENTITY(INT,1,1),
u.Parameters
INTO #SourceTable
FROM (--===== Same two lines of data repeated
SELECT CAST('Subject=Drilling&Category=Drill Header&Status=NO'+
'&IPPDM_Count=94356&Well_Count=84353&Zone_Status=NO'+
'&Prod_String_Count=95249&R_Master_W_Count=94353'+
'&BeginDate=2/3/2008&EndDate=2/5/2008' AS VARCHAR(8000)) AS Parameters
UNION ALL
SELECT 'Subject=Zone&Category=Zone Header&Status=YES'+
'&IPPDM_Count=94356&Well_Count=94356&Zone_Status=NO'+
'&Prod_String_Count=95249&R_Master_W_Count=94356&AddlZoneInterest=107674') u,

Master.sys.All_Columns ac1,
Master.sys.All_Columns ac2

ALTER TABLE #SourceTable
ADD PRIMARY KEY CLUSTERED (RowNum)


Both the tally table and data table were created in db named staging.
Split technique (from Jeff Moden):
use staging
go
SET ANSI_WARNINGS OFF
SET STATISTICS TIME ON
go
--===== Split the data into an EAV (Entity and Value) cte and reassemble as rows containing the desired columns
;WITH
cteFirstSplit AS
(--==== Splits data on the "&" character
SELECT h.RowNum,
ColNum = (ROW_NUMBER() OVER (ORDER BY Parameters)-1)%10,
ColVal = SUBSTRING(h.Parameters, t.N+1, CHARINDEX('&', h.Parameters + '&', t.N+1)-t.N-1)
FROM dbo.Tally t WITH (NOLOCK)
RIGHT OUTER JOIN --Necessary in case Parameters is NULL
SourceTable h
ON SUBSTRING(h.Parameters, t.N, 1) = '='
   --AND t.N < CHARINDEX('Zone_',h.Parameters)
AND t.N < len(h.Parameters)
   --WHERE RowNum > 900000
   --WHERE RowNum <= 10000
)
SELECT RowNum,
MAX(CASE WHEN ColNum = 0 THEN ColVal ELSE NULL END) AS Subject,
MAX(CASE WHEN ColNum = 1 THEN ColVal ELSE NULL END) AS Category,
MAX(CASE WHEN ColNum = 2 THEN ColVal ELSE NULL END) AS Status,
MAX(CASE WHEN ColNum = 3 THEN ColVal ELSE NULL END) AS IPPDM_Count,
MAX(CASE WHEN ColNum = 4 THEN ColVal ELSE NULL END) AS Well_Count/*,
MAX(CASE WHEN ColNum = 5 THEN ColVal ELSE NULL END) AS Zone_Status,
MAX(CASE WHEN ColNum = 6 THEN ColVal ELSE NULL END) AS Prod_String_Count,
MAX(CASE WHEN ColNum = 7 THEN ColVal ELSE NULL END) AS R_Master_W_Count,
MAX(CASE WHEN ColNum = 8 THEN ColVal ELSE NULL END) AS BeginDate,
MAX(CASE WHEN ColNum = 9 THEN ColVal ELSE NULL END) AS EndDate */
INTO #Results
FROM cteFirstSplit
GROUP BY RowNum
print @@ROWCOUNT
SET STATISTICS TIME OFF
go


I modified the above script to suffix an ampersand to h.Parameters since the substring would fail if BeginDate was omitted.

XML technique:
use staging
go
SET ANSI_WARNINGS ON
SET STATISTICS TIME ON
SET STATISTICS IO ON
SET QUOTED_IDENTIFIER ON
go
declare @xml XML
set @xml = (
select cast('<parm RowNum="'+ cast(rowNum as varchar(12))+ '" '+
replace(replace(Parameters,'=','="'),'&','" ')+ '"/>' as XML) as data
   from SourceTable
   --where RowNum > 900000
   -- where RowNum <= 10000
   for xml auto, type)

select    D.parm.value('@RowNum','int') as RowNum,
   D.parm.value('@Subject','varchar(255)') as [Subject],
   D.parm.value('@Category','varchar(255)') as Category,
   D.parm.value('@Status','varchar(255)') as [Status],
   D.parm.value('@IPPDM_Count','varchar(255)') as IPPDM_Count,
   D.parm.value('@Well_Count','varchar(255)') as Well_Count /*,
   D.parm.value('@Zone_Status','varchar(255)') as Zone_Status,
   D.parm.value('@Prod_String_Count','varchar(255)') as Prod_String_Count,
   D.parm.value('@R_Master_W_Count','varchar(255)') as R_Master_W_Count,
   D.parm.value('@BeginDate','varchar(255)') as BeginDate,
   D.parm.value('@EndDate','varchar(255)') as EndDate */
into #results
from @xml.nodes('/SourceTable/data/parm') as D(parm)
go
SET STATISTICS TIME OFF
go


I ran both scripts twice for datasets of 10K and 100K rows and 5 fields and 10 fields (fields 6-10 are commented out as is the criteria for the various batch sizes).
Results:

CPU Elapsed
Rows Fields Run XML Split Diff XML Split Diff
10K 5 1 3,625 1,359 62.5% 3,618 1,369 62.2%
5 2 3,641 1,422 60.9% 3,631 1,409 61.2%
100K 5 1 35,719 41,985 -17.5% 35,729 42,960 -20.2%
5 2 35,719 41,686 -16.7% 35,733 42,655 -19.4%
10K 10 1 4,891 1,750 64.2% 4,882 1,748 64.2%
10 2 4,907 1,750 64.3% 4,910 1,741 64.5%
100K 10 1 48,578 45,030 7.3% 48,595 43,770 9.9%
10 2 48,609 45,563 6.3% 48,654 44,497 8.5%
1M 10 1 486,813 466,952 4.1% 486,900 482,407 0.9%

The Split technique wins 3 out of 4 tests, but it's margin of victory was never more than 4 seconds. So, I ran another test with all 1M rows and it's practically a dead heat.

This was my second cut at a query using XML and I'm pretty sure it can be sped up by eliminating the use of the @xml intermediate variable or extracting all attributes at once.

-- this doesn't work
select D.parm ... into #results
from (select cast(...) as XML) as data
from SourceTable
for xml auto, type).nodes('/SourceTable/data/parm') as D(parm)

-- this doesn't work either probably because it's an attribute collection rather than an element collection
select *
from @xml.nodes('/SourceTable/data/parm/@*') as D(RowNum,Subject,Category,...)


One huge difference between the two techniques is the fact that the XML technique is not position sensitive -- it will extract the fields regardless of order. Also, the XML query is easier to grasp and should be easier to maintain.

So Jeff, I will buy you a beer, but only a Bud Light (it's practically water anyway). ;-)
Attachments
results.xls (52 views, 40.00 KB)
results.jpg (100 views, 69.00 KB)
GSquared
GSquared
SSChampion
SSChampion (14K reputation)SSChampion (14K reputation)SSChampion (14K reputation)SSChampion (14K reputation)SSChampion (14K reputation)SSChampion (14K reputation)SSChampion (14K reputation)SSChampion (14K reputation)

Group: General Forum Members
Points: 14363 Visits: 9729
Antonio:

My tests with XML vs Numbers/Tally table have shown similar CPU time results, but have had interesting IO differences. Did you include IO stats in your runs? (If not, is it something you can run again to test that?)

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
GSquared
GSquared
SSChampion
SSChampion (14K reputation)SSChampion (14K reputation)SSChampion (14K reputation)SSChampion (14K reputation)SSChampion (14K reputation)SSChampion (14K reputation)SSChampion (14K reputation)SSChampion (14K reputation)

Group: General Forum Members
Points: 14363 Visits: 9729
(I have to confess I wasn't as courteous as Antonio, and posted my tests on Numbers/Tally vs XML on the original thread. That's here: http://www.sqlservercentral.com/Forums/Topic478171-338-2.aspx)

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
Jeff Moden
Jeff Moden
SSC-Forever
SSC-Forever (44K reputation)SSC-Forever (44K reputation)SSC-Forever (44K reputation)SSC-Forever (44K reputation)SSC-Forever (44K reputation)SSC-Forever (44K reputation)SSC-Forever (44K reputation)SSC-Forever (44K reputation)

Group: General Forum Members
Points: 44774 Visits: 39845
Sorry folks... dunno how I missed this one.

Antonio... I sure do appreciate the attention to detail you put into the tests you did and the reports that followed. Absolutely awesome job!

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
     Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
Although they tell us that they want it real bad, our primary goal is to ensure that we dont actually give it to them that way.
Although change is inevitable, change for the better is usually not.
Just because you can do something in PowerShell, doesnt mean you should. Wink

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
RyanRandall
RyanRandall
SSCommitted
SSCommitted (1.8K reputation)SSCommitted (1.8K reputation)SSCommitted (1.8K reputation)SSCommitted (1.8K reputation)SSCommitted (1.8K reputation)SSCommitted (1.8K reputation)SSCommitted (1.8K reputation)SSCommitted (1.8K reputation)

Group: General Forum Members
Points: 1761 Visits: 4652
Another XML-style approach...

drop table #Results
; with a as (select RowNum, cast('<parm '+ replace(replace(Parameters,'=','="'),'&','" ')+ '"/>' as xml) as Xml from #SourceTable)
select RowNum,
Xml.value('(parm[1]/@Subject)[1]', 'varchar(255)') as Subject,
Xml.value('(parm[1]/@Category)[1]', 'varchar(255)') as Category,
Xml.value('(parm[1]/@Status)[1]', 'varchar(255)') as Status,
Xml.value('(parm[1]/@IPPDM_Count)[1]', 'varchar(255)') as IPPDM_Count,
Xml.value('(parm[1]/@Well_Count)[1]', 'varchar(255)') as Well_Count
into #Results
from a



This seems to run much faster for me, but perhaps I'm misunderstanding the problem?

Ryan Randall

Solutions are easy. Understanding the problem, now, that's the hard part.
dfalso
dfalso
Valued Member
Valued Member (67 reputation)Valued Member (67 reputation)Valued Member (67 reputation)Valued Member (67 reputation)Valued Member (67 reputation)Valued Member (67 reputation)Valued Member (67 reputation)Valued Member (67 reputation)

Group: General Forum Members
Points: 67 Visits: 782
Jeff Moden (4/17/2008)
Sorry folks... dunno how I missed this one.

Antonio... I sure do appreciate the attention to detail you put into the tests you did and the reports that followed. Absolutely awesome job!

Actually, I like your method for figuring out the element names in the first place:

SELECT distinct
   ColName =
      substring(
         h.Parameters,
         t.N - (charindex('&', reverse('&' + left(h.Parameters, t.N - 1))) - 1),
         charindex('&', reverse('&' + left(h.Parameters, t.N- 1))) - 1
      Wink
FROM dbo.Tally t WITH (NOLOCK)
RIGHT OUTER JOIN --Necessary in case Parameters is NULL
#SourceTable h
ON
   SUBSTRING(h.Parameters, t.N, 1) = '='
   AND t.N < len(h.Parameters)



This runs pretty quickly for me.
(yes, it's very inelegant. If anyone knows how to more directly find the last occurring character *before* a string position, I'm very interested)

I tried making the XQuery value pull dynamic, but it insists on a literal string. So, if you want to be completist about it, I think you have to go dynamic SQL.
Jeff Moden
Jeff Moden
SSC-Forever
SSC-Forever (44K reputation)SSC-Forever (44K reputation)SSC-Forever (44K reputation)SSC-Forever (44K reputation)SSC-Forever (44K reputation)SSC-Forever (44K reputation)SSC-Forever (44K reputation)SSC-Forever (44K reputation)

Group: General Forum Members
Points: 44774 Visits: 39845
RyanRandall (5/22/2008)
Another XML-style approach...

drop table #Results
; with a as (select RowNum, cast('<parm '+ replace(replace(Parameters,'=','="'),'&','" ')+ '"/>' as xml) as Xml from #SourceTable)
select RowNum,
Xml.value('(parm[1]/@Subject)[1]', 'varchar(255)') as Subject,
Xml.value('(parm[1]/@Category)[1]', 'varchar(255)') as Category,
Xml.value('(parm[1]/@Status)[1]', 'varchar(255)') as Status,
Xml.value('(parm[1]/@IPPDM_Count)[1]', 'varchar(255)') as IPPDM_Count,
Xml.value('(parm[1]/@Well_Count)[1]', 'varchar(255)') as Well_Count
into #Results
from a



This seems to run much faster for me, but perhaps I'm misunderstanding the problem?


Nice... I'll throw that into a million row test and check. Thanks, Ryan.

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
     Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
Although they tell us that they want it real bad, our primary goal is to ensure that we dont actually give it to them that way.
Although change is inevitable, change for the better is usually not.
Just because you can do something in PowerShell, doesnt mean you should. Wink

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum

































































































































































SQLServerCentral


Search