SQLServerCentral is supported by Red Gate Software Ltd.
XML vs Split for parsing of URL type data
Posted Wednesday, April 2, 2008 1:16 PM
SSC-Addicted


I didn't want to muddy up someone else's topic, so here are the results of parsing URL data into discrete columns.

The tally table (from http://www.sqlservercentral.com/scripts/Advanced+SQL/62486/):
--===== Create and populate the Tally table on the fly
SELECT TOP 11000 --equates to more than 30 years of dates
IDENTITY(INT,1,1) AS N
INTO dbo.Tally
FROM Master.dbo.SysColumns sc1,
Master.dbo.SysColumns sc2

--===== Add a Primary Key to maximize performance
ALTER TABLE dbo.Tally
ADD CONSTRAINT PK_Tally_N
PRIMARY KEY CLUSTERED (N) WITH FILLFACTOR = 100

--===== Allow the general public to use it
GRANT SELECT ON dbo.Tally TO PUBLIC

The data table (provided by Jeff Moden):
drop table #SourceTable
SELECT TOP 1000000
RowNum = IDENTITY(INT,1,1),
u.Parameters
INTO #SourceTable
FROM (--===== Same two lines of data repeated
SELECT CAST('Subject=Drilling&Category=Drill Header&Status=NO'+
'&IPPDM_Count=94356&Well_Count=84353&Zone_Status=NO'+
'&Prod_String_Count=95249&R_Master_W_Count=94353'+
'&BeginDate=2/3/2008&EndDate=2/5/2008' AS VARCHAR(8000)) AS Parameters
UNION ALL
SELECT 'Subject=Zone&Category=Zone Header&Status=YES'+
'&IPPDM_Count=94356&Well_Count=94356&Zone_Status=NO'+
'&Prod_String_Count=95249&R_Master_W_Count=94356&AddlZoneInterest=107674') u,

Master.sys.All_Columns ac1,
Master.sys.All_Columns ac2

ALTER TABLE #SourceTable
ADD PRIMARY KEY CLUSTERED (RowNum)

Both the tally table and the data table were created in a database named staging.
Split technique (from Jeff Moden):
use staging
go
SET ANSI_WARNINGS OFF
SET STATISTICS TIME ON
go
--===== Split the data into an EAV (Entity and Value) cte and reassemble as rows containing the desired columns
;WITH
cteFirstSplit AS
(--==== Splits data on the "&" character
SELECT h.RowNum,
ColNum = (ROW_NUMBER() OVER (ORDER BY Parameters)-1)%10,
ColVal = SUBSTRING(h.Parameters, t.N+1, CHARINDEX('&', h.Parameters + '&', t.N+1)-t.N-1)
FROM dbo.Tally t WITH (NOLOCK)
RIGHT OUTER JOIN --Necessary in case Parameters is NULL
#SourceTable h
ON SUBSTRING(h.Parameters, t.N, 1) = '='
--AND t.N < CHARINDEX('Zone_',h.Parameters)
AND t.N < len(h.Parameters)
--WHERE RowNum > 900000
--WHERE RowNum <= 10000
)
SELECT RowNum,
MAX(CASE WHEN ColNum = 0 THEN ColVal ELSE NULL END) AS Subject,
MAX(CASE WHEN ColNum = 1 THEN ColVal ELSE NULL END) AS Category,
MAX(CASE WHEN ColNum = 2 THEN ColVal ELSE NULL END) AS Status,
MAX(CASE WHEN ColNum = 3 THEN ColVal ELSE NULL END) AS IPPDM_Count,
MAX(CASE WHEN ColNum = 4 THEN ColVal ELSE NULL END) AS Well_Count/*,
MAX(CASE WHEN ColNum = 5 THEN ColVal ELSE NULL END) AS Zone_Status,
MAX(CASE WHEN ColNum = 6 THEN ColVal ELSE NULL END) AS Prod_String_Count,
MAX(CASE WHEN ColNum = 7 THEN ColVal ELSE NULL END) AS R_Master_W_Count,
MAX(CASE WHEN ColNum = 8 THEN ColVal ELSE NULL END) AS BeginDate,
MAX(CASE WHEN ColNum = 9 THEN ColVal ELSE NULL END) AS EndDate */
INTO #Results
FROM cteFirstSplit
GROUP BY RowNum
print @@ROWCOUNT
SET STATISTICS TIME OFF
go
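For anyone new to the Tally approach: the join condition in cteFirstSplit just asks, for every character position N, whether the character at that position is an '='. Here's that idea sketched in Python (purely illustrative; the function name is mine, not from the script above):

```python
def eq_positions(params: str) -> list[int]:
    # Equivalent of joining Tally t ON SUBSTRING(Parameters, N, 1) = '=':
    # keep every 1-based position N whose character is '='.
    return [n for n in range(1, len(params) + 1)
            if params[n - 1] == "="]

# "Subject=Zone&Status=YES" has '=' at 1-based positions 8 and 20
```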

I modified the above script to append a trailing ampersand to h.Parameters, since the SUBSTRING would otherwise fail when BeginDate was omitted.
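To see why that trailing ampersand matters, here is the same find-next-delimiter logic sketched in Python (a hedged illustration; the helper name is mine):

```python
def split_pairs(params: str) -> list[str]:
    # Mirror of the T-SQL logic: append a sentinel '&' so the
    # CHARINDEX-style search always finds a terminator, even for
    # the final key=value pair.
    s = params + "&"
    out = []
    start = 0
    while start < len(params):
        end = s.index("&", start)  # always succeeds thanks to the sentinel
        out.append(s[start:end])
        start = end + 1
    return out
```

Without the sentinel, the search for the delimiter after the last pair finds nothing, which is exactly the failure mode when BeginDate (the trailing field) is absent.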

XML technique:
use staging
go
SET ANSI_WARNINGS ON
SET STATISTICS TIME ON
SET STATISTICS IO ON
SET QUOTED_IDENTIFIER ON
go
declare @xml XML
set @xml = (
select cast('<parm RowNum="'+ cast(rowNum as varchar(12))+ '" '+
replace(replace(Parameters,'=','="'),'&','" ')+ '"/>' as XML) as data
from SourceTable
--where RowNum > 900000
-- where RowNum <= 10000
for xml auto, type)

select D.parm.value('@RowNum','int') as RowNum,
D.parm.value('@Subject','varchar(255)') as [Subject],
D.parm.value('@Category','varchar(255)') as Category,
D.parm.value('@Status','varchar(255)') as [Status],
D.parm.value('@IPPDM_Count','varchar(255)') as IPPDM_Count,
D.parm.value('@Well_Count','varchar(255)') as Well_Count /*,
D.parm.value('@Zone_Status','varchar(255)') as Zone_Status,
D.parm.value('@Prod_String_Count','varchar(255)') as Prod_String_Count,
D.parm.value('@R_Master_W_Count','varchar(255)') as R_Master_W_Count,
D.parm.value('@BeginDate','varchar(255)') as BeginDate,
D.parm.value('@EndDate','varchar(255)') as EndDate */
into #results
from @xml.nodes('/SourceTable/data/parm') as D(parm)
go
SET STATISTICS TIME OFF
go

I ran both scripts twice each, for datasets of 10K and 100K rows and for 5 and 10 fields (fields 6-10 are commented out, as are the WHERE criteria for the various batch sizes).
Results:

                        CPU (ms)                    Elapsed (ms)
Rows  Fields  Run   XML      Split    Diff      XML      Split    Diff
10K   5       1     3,625    1,359    62.5%     3,618    1,369    62.2%
      5       2     3,641    1,422    60.9%     3,631    1,409    61.2%
100K  5       1     35,719   41,985   -17.5%    35,729   42,960   -20.2%
      5       2     35,719   41,686   -16.7%    35,733   42,655   -19.4%
10K   10      1     4,891    1,750    64.2%     4,882    1,748    64.2%
      10      2     4,907    1,750    64.3%     4,910    1,741    64.5%
100K  10      1     48,578   45,030   7.3%      48,595   43,770   9.9%
      10      2     48,609   45,563   6.3%      48,654   44,497   8.5%
1M    10      1     486,813  466,952  4.1%      486,900  482,407  0.9%

The Split technique wins 3 out of 4 tests, but its margin of victory was never more than 4 seconds. So I ran another test with the full 1M rows, and it's practically a dead heat.

This was my second cut at a query using XML, and I'm pretty sure it can be sped up by eliminating the @xml intermediate variable or by extracting all attributes at once.
-- this doesn't work
select D.parm ... into #results
from (select cast(...) as XML) as data
from SourceTable
for xml auto, type).nodes('/SourceTable/data/parm') as D(parm)

-- this doesn't work either, probably because it's an attribute collection rather than an element collection
select *
from @xml.nodes('/SourceTable/data/parm/@*') as D(RowNum,Subject,Category,...)


One huge difference between the two techniques is that the XML technique is not position-sensitive -- it extracts the fields regardless of their order. The XML query is also easier to grasp and should be easier to maintain.
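That position-insensitivity is easy to demonstrate outside of SQL. A small Python sketch (names are mine) that extracts by field name, the way the XML attribute lookups do:

```python
def parse_params(params: str) -> dict[str, str]:
    # Name-based extraction, like the D.parm.value('@Subject', ...) lookups:
    # the order of fields in the source string does not matter.
    return dict(pair.split("=", 1) for pair in params.split("&") if pair)

a = parse_params("Subject=Zone&Status=YES")
b = parse_params("Status=YES&Subject=Zone")
# a == b: same result regardless of field order
```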

So Jeff, I will buy you a beer, but only a Bud Light (it's practically water anyway).



  Post Attachments 
results.xls (51 views, 40.50 KB)
results.jpg (98 views, 69.84 KB)
Post #478799
Posted Wednesday, April 2, 2008 1:37 PM


SSChampion


Antonio:

My tests with XML vs Numbers/Tally table have shown similar CPU time results, but have had interesting IO differences. Did you include IO stats in your runs? (If not, is it something you can run again to test that?)


- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
Post #478812
Posted Wednesday, April 2, 2008 2:06 PM


SSChampion


(I have to confess I wasn't as courteous as Antonio, and posted my tests on Numbers/Tally vs XML on the original thread. That's here: http://www.sqlservercentral.com/Forums/Topic478171-338-2.aspx)

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon
Post #478827
Posted Thursday, April 17, 2008 10:20 PM


SSC-Dedicated


Sorry folks... dunno how I missed this one.

Antonio... I sure do appreciate the attention to detail you put into the tests you did and the reports that followed. Absolutely awesome job!


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #486891
Posted Thursday, May 22, 2008 8:33 AM


SSCommitted


Another XML-style approach...

drop table #Results
; with a as (select RowNum, cast('<parm '+ replace(replace(Parameters,'=','="'),'&','" ')+ '"/>' as xml) as Xml from #SourceTable)
select RowNum,
Xml.value('(parm[1]/@Subject)[1]', 'varchar(255)') as Subject,
Xml.value('(parm[1]/@Category)[1]', 'varchar(255)') as Category,
Xml.value('(parm[1]/@Status)[1]', 'varchar(255)') as Status,
Xml.value('(parm[1]/@IPPDM_Count)[1]', 'varchar(255)') as IPPDM_Count,
Xml.value('(parm[1]/@Well_Count)[1]', 'varchar(255)') as Well_Count
into #Results
from a

This seems to run much faster for me, but perhaps I'm misunderstanding the problem?



Ryan Randall

Solutions are easy. Understanding the problem, now, that's the hard part.
Post #505212
Posted Thursday, May 22, 2008 2:47 PM
Valued Member


Jeff Moden (4/17/2008)
Sorry folks... dunno how I missed this one.

Antonio... I sure do appreciate the attention to detail you put into the tests you did and the reports that followed. Absolutely awesome job!

Actually, I like your method for figuring out the element names in the first place:
SELECT distinct 
ColName =
substring(
h.Parameters,
t.N - (charindex('&', reverse('&' + left(h.Parameters, t.N - 1))) - 1),
charindex('&', reverse('&' + left(h.Parameters, t.N- 1))) - 1
)
FROM dbo.Tally t WITH (NOLOCK)
RIGHT OUTER JOIN --Necessary in case Parameters is NULL
#SourceTable h
ON
SUBSTRING(h.Parameters, t.N, 1) = '='
AND t.N < len(h.Parameters)

This runs pretty quickly for me.
(yes, it's very inelegant. If anyone knows how to more directly find the last occurring character *before* a string position, I'm very interested)
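For what it's worth, the REVERSE/CHARINDEX trick generalizes: reversing the prefix turns "last occurrence before a position" into "first occurrence", which most languages can answer cheaply. A Python sketch of the same idea (helper name is mine):

```python
def last_delim_before(s: str, pos: int, delim: str = "&") -> int:
    # Mirrors the REVERSE('&' + LEFT(...)) trick: prepend a sentinel
    # delimiter, reverse the prefix, and the first match in the reversed
    # string is the last match in the original. Returns a 0-based index
    # into s, or -1 if only the sentinel matched (no delimiter before pos).
    prefix = delim + s[:pos]
    offset = prefix[::-1].index(delim)  # distance back from pos to delim
    return pos - offset - 1
```

In Python proper you would just call s.rfind(delim, 0, pos); the point is only to show that the reversed search does the same job as the T-SQL expression.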

I tried making the XQuery value pull dynamic, but it insists on a literal string. So, if you want to be completist about it, I think you have to go dynamic SQL.
Post #505489
Posted Thursday, May 22, 2008 6:33 PM


SSC-Dedicated


RyanRandall (5/22/2008)
Another XML-style approach...

drop table #Results
; with a as (select RowNum, cast('<parm '+ replace(replace(Parameters,'=','="'),'&','" ')+ '"/>' as xml) as Xml from #SourceTable)
select RowNum,
Xml.value('(parm[1]/@Subject)[1]', 'varchar(255)') as Subject,
Xml.value('(parm[1]/@Category)[1]', 'varchar(255)') as Category,
Xml.value('(parm[1]/@Status)[1]', 'varchar(255)') as Status,
Xml.value('(parm[1]/@IPPDM_Count)[1]', 'varchar(255)') as IPPDM_Count,
Xml.value('(parm[1]/@Well_Count)[1]', 'varchar(255)') as Well_Count
into #Results
from a

This seems to run much faster for me, but perhaps I'm misunderstanding the problem?


Nice... I'll throw that into a million row test and check. Thanks, Ryan.


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #505576