SQLServerCentral is supported by Red Gate Software Ltd.
XML vs Split for parsing of URL type data
Posted Wednesday, April 2, 2008 1:16 PM
SSC-Addicted


I didn't want to muddy up someone else's topic, so here are the results of parsing URL data into discrete columns.

The tally table (from http://www.sqlservercentral.com/scripts/Advanced+SQL/62486/):
--===== Create and populate the Tally table on the fly
SELECT TOP 11000 --equates to more than 30 years of dates
IDENTITY(INT,1,1) AS N
INTO dbo.Tally
FROM Master.dbo.SysColumns sc1,
Master.dbo.SysColumns sc2

--===== Add a Primary Key to maximize performance
ALTER TABLE dbo.Tally
ADD CONSTRAINT PK_Tally_N
PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100

--===== Allow the general public to use it
GRANT SELECT ON dbo.Tally TO PUBLIC

The data table (provided by Jeff Moden):
drop table #SourceTable
SELECT TOP 1000000
RowNum = IDENTITY(INT,1,1),
u.Parameters
INTO #SourceTable
FROM (--===== Same two lines of data repeated
SELECT CAST('Subject=Drilling&Category=Drill Header&Status=NO'+
'&IPPDM_Count=94356&Well_Count=84353&Zone_Status=NO'+
'&Prod_String_Count=95249&R_Master_W_Count=94353'+
'&BeginDate=2/3/2008&EndDate=2/5/2008' AS VARCHAR(8000)) AS Parameters
UNION ALL
SELECT 'Subject=Zone&Category=Zone Header&Status=YES'+
'&IPPDM_Count=94356&Well_Count=94356&Zone_Status=NO'+
'&Prod_String_Count=95249&R_Master_W_Count=94356&AddlZoneInterest=107674') u,

Master.sys.All_Columns ac1,
Master.sys.All_Columns ac2

ALTER TABLE #SourceTable
ADD PRIMARY KEY CLUSTERED (RowNum)

Both the tally table and the data table were created in a database named staging.
Split technique (from Jeff Moden):
use staging
go
SET ANSI_WARNINGS OFF
SET STATISTICS TIME ON
go
--===== Split the data into an EAV (Entity and Value) cte and reassemble as rows containing the desired columns
;WITH
cteFirstSplit AS
(--==== Splits data on the "&" character
SELECT h.RowNum,
ColNum = (ROW_NUMBER() OVER (ORDER BY Parameters)-1)%10,
ColVal = SUBSTRING(h.Parameters, t.N+1, CHARINDEX('&', h.Parameters + '&', t.N+1)-t.N-1)
FROM dbo.Tally t WITH (NOLOCK)
RIGHT OUTER JOIN --Necessary in case Parameters is NULL
#SourceTable h
ON SUBSTRING(h.Parameters, t.N, 1) = '='
--AND t.N < CHARINDEX('Zone_',h.Parameters)
AND t.N < len(h.Parameters)
--WHERE RowNum > 900000
--WHERE RowNum <= 10000
)
SELECT RowNum,
MAX(CASE WHEN ColNum = 0 THEN ColVal ELSE NULL END) AS Subject,
MAX(CASE WHEN ColNum = 1 THEN ColVal ELSE NULL END) AS Category,
MAX(CASE WHEN ColNum = 2 THEN ColVal ELSE NULL END) AS Status,
MAX(CASE WHEN ColNum = 3 THEN ColVal ELSE NULL END) AS IPPDM_Count,
MAX(CASE WHEN ColNum = 4 THEN ColVal ELSE NULL END) AS Well_Count/*,
MAX(CASE WHEN ColNum = 5 THEN ColVal ELSE NULL END) AS Zone_Status,
MAX(CASE WHEN ColNum = 6 THEN ColVal ELSE NULL END) AS Prod_String_Count,
MAX(CASE WHEN ColNum = 7 THEN ColVal ELSE NULL END) AS R_Master_W_Count,
MAX(CASE WHEN ColNum = 8 THEN ColVal ELSE NULL END) AS BeginDate,
MAX(CASE WHEN ColNum = 9 THEN ColVal ELSE NULL END) AS EndDate */
INTO #Results
FROM cteFirstSplit
GROUP BY RowNum
print @@ROWCOUNT
SET STATISTICS TIME OFF
go
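For anyone new to the Tally approach: the join condition in cteFirstSplit just asks, for every character position N, whether the character at that position is an '='. Here's that idea sketched in Python (purely illustrative; the function name is mine, not from the script above):

```python
def eq_positions(params: str) -> list[int]:
    # Equivalent of joining Tally t ON SUBSTRING(Parameters, N, 1) = '=':
    # keep every 1-based position N whose character is '='.
    return [n for n in range(1, len(params) + 1)
            if params[n - 1] == "="]

# "Subject=Zone&Status=YES" has '=' at 1-based positions 8 and 20
```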

I modified the above script to append a trailing ampersand to h.Parameters, since the SUBSTRING would otherwise fail when BeginDate was omitted.
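To see why that trailing ampersand matters, here is the same find-next-delimiter logic sketched in Python (a hedged illustration; the helper name is mine):

```python
def split_pairs(params: str) -> list[str]:
    # Mirror of the T-SQL logic: append a sentinel '&' so the
    # CHARINDEX-style search always finds a terminator, even for
    # the final key=value pair.
    s = params + "&"
    out = []
    start = 0
    while start < len(params):
        end = s.index("&", start)  # always succeeds thanks to the sentinel
        out.append(s[start:end])
        start = end + 1
    return out
```

Without the sentinel, the search for the delimiter after the last pair finds nothing, which is exactly the failure mode when BeginDate (the trailing field) is absent.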

XML technique:
use staging
go
SET ANSI_WARNINGS ON
SET STATISTICS TIME ON
SET STATISTICS IO ON
SET QUOTED_IDENTIFIER ON
go
declare @xml XML
set @xml = (
select cast('<parm RowNum="'+ cast(rowNum as varchar(12))+ '" '+
replace(replace(Parameters,'=','="'),'&','" ')+ '"/>' as XML) as data
from SourceTable
--where RowNum > 900000
-- where RowNum <= 10000
for xml auto, type)

select D.parm.value('@RowNum','int') as RowNum,
D.parm.value('@Subject','varchar(255)') as [Subject],
D.parm.value('@Category','varchar(255)') as Category,
D.parm.value('@Status','varchar(255)') as [Status],
D.parm.value('@IPPDM_Count','varchar(255)') as IPPDM_Count,
D.parm.value('@Well_Count','varchar(255)') as Well_Count /*,
D.parm.value('@Zone_Status','varchar(255)') as Zone_Status,
D.parm.value('@Prod_String_Count','varchar(255)') as Prod_String_Count,
D.parm.value('@R_Master_W_Count','varchar(255)') as R_Master_W_Count,
D.parm.value('@BeginDate','varchar(255)') as BeginDate,
D.parm.value('@EndDate','varchar(255)') as EndDate */
into #results
from @xml.nodes('/SourceTable/data/parm') as D(parm)
go
SET STATISTICS TIME OFF
go

I ran both scripts twice each, for datasets of 10K and 100K rows and for 5 and 10 fields (fields 6-10 are commented out, as are the WHERE criteria for the various batch sizes).
Results:

                        CPU (ms)                    Elapsed (ms)
Rows  Fields  Run   XML      Split    Diff      XML      Split    Diff
10K   5       1     3,625    1,359    62.5%     3,618    1,369    62.2%
      5       2     3,641    1,422    60.9%     3,631    1,409    61.2%
100K  5       1     35,719   41,985   -17.5%    35,729   42,960   -20.2%
      5       2     35,719   41,686   -16.7%    35,733   42,655   -19.4%
10K   10      1     4,891    1,750    64.2%     4,882    1,748    64.2%
      10      2     4,907    1,750    64.3%     4,910    1,741    64.5%
100K  10      1     48,578   45,030   7.3%      48,595   43,770   9.9%
      10      2     48,609   45,563   6.3%      48,654   44,497   8.5%
1M    10      1     486,813  466,952  4.1%      486,900  482,407  0.9%

The Split technique wins 3 out of 4 tests, but its margin of victory was never more than 4 seconds. So I ran another test with the full 1M rows, and it's practically a dead heat.

This was my second cut at a query using XML, and I'm pretty sure it can be sped up by eliminating the @xml intermediate variable or by extracting all attributes at once.
-- this doesn't work
select D.parm ... into #results
from (select cast(...) as XML) as data
from SourceTable
for xml auto, type).nodes('/SourceTable/data/parm') as D(parm)

-- this doesn't work either, probably because it's an attribute collection rather than an element collection
select *
from @xml.nodes('/SourceTable/data/parm/@*') as D(RowNum,Subject,Category,...)


One huge difference between the two techniques is that the XML technique is not position-sensitive -- it extracts the fields regardless of their order. The XML query is also easier to grasp and should be easier to maintain.
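That position-insensitivity is easy to demonstrate outside of SQL. A small Python sketch (names are mine) that extracts by field name, the way the XML attribute lookups do:

```python
def parse_params(params: str) -> dict[str, str]:
    # Name-based extraction, like the D.parm.value('@Subject', ...) lookups:
    # the order of fields in the source string does not matter.
    return dict(pair.split("=", 1) for pair in params.split("&") if pair)

a = parse_params("Subject=Zone&Status=YES")
b = parse_params("Status=YES&Subject=Zone")
# a == b: same result regardless of field order
```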

So Jeff, I will buy you a beer, but only a Bud Light (it's practically water anyway).



  Post Attachments 
results.xls (51 views, 40.50 KB)
results.jpg (98 views, 69.84 KB)
Post #478799
Posted Wednesday, April 2, 2008 1:37 PM


SSChampion


Antonio:

My tests with XML vs Numbers/Tally table have shown similar CPU time results, but have had interesting IO differences. Did you include IO stats in your runs? (If not, is it something you can run again to test that?)


- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
Post #478812
Posted Wednesday, April 2, 2008 2:06 PM


SSChampion


(I have to confess I wasn't as courteous as Antonio, and posted my tests on Numbers/Tally vs XML on the original thread. That's here: http://www.sqlservercentral.com/Forums/Topic478171-338-2.aspx)

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
Post #478827
Posted Thursday, April 17, 2008 10:20 PM


SSC-Dedicated


Sorry folks... dunno how I missed this one.

Antonio... I sure do appreciate the attention to detail you put into the tests you did and the reports that followed. Absolutely awesome job!


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #486891
Posted Thursday, May 22, 2008 8:33 AM


SSCommitted


Another XML-style approach...

drop table #Results
; with a as (select RowNum, cast('<parm '+ replace(replace(Parameters,'=','="'),'&','" ')+ '"/>' as xml) as Xml from #SourceTable)
select RowNum,
Xml.value('(parm[1]/@Subject)[1]', 'varchar(255)') as Subject,
Xml.value('(parm[1]/@Category)[1]', 'varchar(255)') as Category,
Xml.value('(parm[1]/@Status)[1]', 'varchar(255)') as Status,
Xml.value('(parm[1]/@IPPDM_Count)[1]', 'varchar(255)') as IPPDM_Count,
Xml.value('(parm[1]/@Well_Count)[1]', 'varchar(255)') as Well_Count
into #Results
from a

This seems to run much faster for me, but perhaps I'm misunderstanding the problem?



Ryan Randall

Solutions are easy. Understanding the problem, now, that's the hard part.
Post #505212
Posted Thursday, May 22, 2008 2:47 PM
Valued Member


Jeff Moden (4/17/2008)
Sorry folks... dunno how I missed this one.

Antonio... I sure do appreciate the attention to detail you put into the tests you did and the reports that followed. Absolutely awesome job!

Actually, I like your method for figuring out the element names in the first place:
SELECT distinct 
ColName =
substring(
h.Parameters,
t.N - (charindex('&', reverse('&' + left(h.Parameters, t.N - 1))) - 1),
charindex('&', reverse('&' + left(h.Parameters, t.N- 1))) - 1
)
FROM dbo.Tally t WITH (NOLOCK)
RIGHT OUTER JOIN --Necessary in case Parameters is NULL
#SourceTable h
ON
SUBSTRING(h.Parameters, t.N, 1) = '='
AND t.N < len(h.Parameters)

This runs pretty quickly for me.
(yes, it's very inelegant. If anyone knows how to more directly find the last occurring character *before* a string position, I'm very interested)
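For what it's worth, the REVERSE/CHARINDEX trick generalizes: reversing the prefix turns "last occurrence before a position" into "first occurrence", which most languages can answer cheaply. A Python sketch of the same idea (helper name is mine):

```python
def last_delim_before(s: str, pos: int, delim: str = "&") -> int:
    # Mirrors the REVERSE('&' + LEFT(...)) trick: prepend a sentinel
    # delimiter, reverse the prefix, and the first match in the reversed
    # string is the last match in the original. Returns a 0-based index
    # into s, or -1 if only the sentinel matched (no delimiter before pos).
    prefix = delim + s[:pos]
    offset = prefix[::-1].index(delim)  # distance back from pos to delim
    return pos - offset - 1
```

In Python proper you would just call s.rfind(delim, 0, pos); the point is only to show that the reversed search does the same job as the T-SQL expression.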

I tried making the XQuery value pull dynamic, but it insists on a literal string. So, if you want to be completist about it, I think you have to go dynamic SQL.
Post #505489
Posted Thursday, May 22, 2008 6:33 PM


SSC-Dedicated


RyanRandall (5/22/2008)
Another XML-style approach...

drop table #Results
; with a as (select RowNum, cast('<parm '+ replace(replace(Parameters,'=','="'),'&','" ')+ '"/>' as xml) as Xml from #SourceTable)
select RowNum,
Xml.value('(parm[1]/@Subject)[1]', 'varchar(255)') as Subject,
Xml.value('(parm[1]/@Category)[1]', 'varchar(255)') as Category,
Xml.value('(parm[1]/@Status)[1]', 'varchar(255)') as Status,
Xml.value('(parm[1]/@IPPDM_Count)[1]', 'varchar(255)') as IPPDM_Count,
Xml.value('(parm[1]/@Well_Count)[1]', 'varchar(255)') as Well_Count
into #Results
from a

This seems to run much faster for me, but perhaps I'm misunderstanding the problem?


Nice... I'll throw that into a million row test and check. Thanks, Ryan.


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #505576