Importing Strange XML File

Question

Post reply

Importing Strange XML File

Mark Derryberry

SSC Enthusiast

Points: 124
More actions
February 24, 2011 at 12:59 pm

#104044

I'm new to using XML with SQL and I've been given a XML file that doesn't fit the normal examples, that I've read, for importing.
<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>
<wcsxml allrelations="false" batch="1" createtime="12/29/2010 11:21:30" delimiter="|" destination="MAS" source="145" version="4.04" whslrno="32711">
<app id="MOB" />
<app id="IBS" />
<relation allrows="true" columns="CustomerID|CustomerGroupID|InvoiceNum|InvoiceDate|InvoiceAmt" name="account" />
<relation allrows="true" columns="GroupCode|GroupDesc|SequenceNum" name="group">
<row values="10|MINIMUM 34 YRS|1" />
<row values="11|ENTRY LEVEL/CA|2" />
<row values="12|28-34 YEARS|3" />
<row values="20|35+ YEARS|4" />
<row values="21|35-49 YEARS|5" />
<row values="22|50+ YEARS|6" />
<row values="30|AGE NOT IDENT.|7" />
</relation>
<relation allrows="true" columns="ContentCode|ContentDesc|NonFlag|SequenceNbr" name="content">
<row values="AP000| .0 Percent|Y|1" />
</relation>
</wcsxml>
Any help on getting me started would be greatly appreciated.
Mark

Viewing 15 posts - 1 through 15 (of 18 total)

You must be logged in to reply to this topic. Login to reply

GSquared SSC Guru Points: 260824 More actions · Answer 1

Are you familiar with the T-SQL version of XQuery?

With that, it's pretty easy to shred this XML.

This should get you started:

DECLARE @XML XML = '<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>

<wcsxml allrelations="false" batch="1" createtime="12/29/2010 11:21:30" delimiter="|" destination="MAS" source="145" version="4.04" whslrno="32711">

<app id="MOB" />

<app id="IBS" />

<relation allrows="true" columns="CustomerID|CustomerGroupID|InvoiceNum|InvoiceDate|InvoiceAmt" name="account" />

<relation allrows="true" columns="GroupCode|GroupDesc|SequenceNum" name="group">

<row values="10|MINIMUM 34 YRS|1" />

<row values="11|ENTRY LEVEL/CA|2" />

<row values="12|28-34 YEARS|3" />

<row values="20|35+ YEARS|4" />

<row values="21|35-49 YEARS|5" />

<row values="22|50+ YEARS|6" />

<row values="30|AGE NOT IDENT.|7" />

</relation>

<relation allrows="true" columns="ContentCode|ContentDesc|NonFlag|SequenceNbr" name="content">

<row values="AP000| .0 Percent|Y|1" />

</relation>

</wcsxml>';

SELECT X.Y.query('.')

FROM @XML.nodes('/wcsxml/app') X(Y);

SELECT X.Y.query('.')

FROM @XML.nodes('/wcsxml/relation') X(Y);

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

Mark Derryberry SSC Enthusiast Points: 124 More actions · Answer 2

Awesome! Thank you for your assistance. I'm now able to shred table names out of that mess by modifying it a little so that it looks like:

SELECT T.C.value('@name','VARCHAR(64)')

FROM @XML.nodes('/wcsxml/relation') AS T(C)

I can go even further and use:

SELECT T.C.value('@name','VARCHAR(64)')

FROM @XML.nodes('/wcsxml/relation[1]') AS T(C)

This pulls out a single table name.

Now comes my last problem; the column names and values. I've read up on tsql xquery and have played around a bit in an attempt to shred out columns and values, but I've been unsuccessful so far.

Could I trouble you as to a hint on how to shred out the delimited column names? I've come up with a rather clunky way of doing it by use of a loop and parsing each line, but this is VERY slow. :crazy:

GSquared SSC Guru Points: 260824 More actions · Answer 3

Is this what you're looking for?

SELECT T.C.value('@name','VARCHAR(64)'),

T.C.value('local-name(.)','VARCHAR(64)')

FROM @XML.nodes('/wcsxml/relation') AS T(C)

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

Mark Derryberry SSC Enthusiast Points: 124 More actions · Answer 4

Unfortunately no. I was able to get that far, but the issue is how the file is designed. Forgive me if you already have figured out the following.

The columns are identified in a pipe-delimited format within the relation tag.

For instance the line:

This indicates the table name is "content" and the columns are, ContentCode, ContentDesc, NonFlag, and SequenceNbr.

The row tags indicated a pipe-delimited value set for those columns.

The optimal way it should have been designed is:

<ContentDesc> .0 Percent</ContentDesc>

</content>

I now have the un-enviable task of figuring out how to parse the delimters and import it correctly. As I said in my earlier email, I have this working through a loop, but it's slow as the actual file is much larger than my example.

Mark

LutzM SSC Guru Points: 107049 More actions · Answer 5

Here's my approach to shred the xml file:

I'd apply a fast split string function to the shredded data. It might help performance to insert the shredded data into a staging table before applying the split function. Needs to be tested and verified.

; WITH cte AS

(

SELECT

ROW_NUMBER() OVER(ORDER BY (SELECT 1) ) AS row,

X.Y.value('@columns[1]','varchar(200)') AS col,

U.V.value('@values[1]','varchar(200)') AS val

FROM @XML.nodes('/wcsxml/relation') X(Y)

CROSS APPLY X.Y.nodes('row') U(V)

)

SELECT cc.row, cols.item AS col, vals.item AS val

FROM cte cc

CROSS APPLY dbo.DelimitedSplit8K(col,'|') cols

CROSS APPLY dbo.DelimitedSplit8K(val,'|') vals

WHERE cols.ItemNumber = vals.ItemNumber

Once you have the data shredded as shown you could either use a CrossTab query (as described in the related link in my signature) if you know the column names or use a DynamicCrossTab (also referenced in my signature).

Depending on the data volume it might help to store the split results into another (indexed) temp table. Again, this needs to be tested.

Regarding the DelimitedSplit8K function: ~~please search this site for "DelimitedSplit8K" and you'll find the related code.~~ see the last link in my signature (totally forgot about it - shame on me...)

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

SQLRNNR SSC Guru Points: 281344 More actions · Answer 6

I think This will get you closer

SELECT X.Y.value('(../@name)[1]','varchar(20)') as TableName

,x.y.value('(../@columns)[1]','varchar(128)') as ColumnList

,x.y.value('(@values)[1]','varchar(128)') as ColumnValues

FROM @XML.nodes('//relation/row') X(Y);

I'm still working through it, but my thought was to use the pivot and a string shred to normalize the data.

This will create two columns with | delimited lists

Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw[/url]
Learn Extended Events

SQLRNNR SSC Guru Points: 281344 More actions · Answer 7

LutzM (2/25/2011)
Here's my approach to shred the xml file:
I'd apply a fast split string function to the shredded data. It might help performance to insert the shredded data into a staging table before applying the split function. Needs to be tested and verified.
; WITH cte AS
(
SELECT
ROW_NUMBER() OVER(ORDER BY (SELECT 1) ) AS row,
X.Y.value('@columns[1]','varchar(200)') AS col,
U.V.value('@values[1]','varchar(200)') AS val
FROM @XML.nodes('/wcsxml/relation') X(Y)
CROSS APPLY X.Y.nodes('row') U(V)
)
SELECT cc.row, cols.item AS col, vals.item AS val
FROM cte cc
CROSS APPLY dbo.DelimitedSplit8K(col,'|') cols
CROSS APPLY dbo.DelimitedSplit8K(val,'|') vals
WHERE cols.ItemNumber = vals.ItemNumber
Once you have the data shredded as shown you could either use a CrossTab query (as described in the related link in my signature) if you know the column names or use a DynamicCrossTab (also referenced in my signature).
Depending on the data volume it might help to store the split results into another (indexed) temp table. Again, this needs to be tested.
Regarding the DelimitedSplit8K function: ~~please search this site for "DelimitedSplit8K" and you'll find the related code.~~ see the last link in my signature (totally forgot about it - shame on me...)

Nix my response - didn't see Lutz's answer while working on a solution.

Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw[/url]
Learn Extended Events

Mark Derryberry SSC Enthusiast Points: 124 More actions · Answer 8

Thanks for the help! This looks promising and I'll play with it some more over the weekend. This has become less of a task and more of a personal challenge now!

SQLRNNR SSC Guru Points: 281344 More actions · Answer 9

I played with it some more. Using combo from what lutz and I did already.

; WITH cte AS

(

SELECT

ROW_NUMBER() OVER(ORDER BY (SELECT 1) ) AS row,

X.Y.value('(@name)[1]','varchar(20)') as TableName,

X.Y.value('@columns[1]','varchar(200)') AS col,

U.V.value('@values[1]','varchar(200)') AS val

FROM @XML.nodes('//relation') X(Y)

CROSS APPLY X.Y.nodes('row') U(V)

), split as (

SELECT cc.row,cc.TableName as TableName, cols.item AS col, vals.item AS val

FROM cte cc

CROSS APPLY admin.dbo.DelimitedSplit8K(col,'|') cols

CROSS APPLY admin.dbo.DelimitedSplit8K(val,'|') vals

WHERE cols.ItemNumber = vals.ItemNumber

)

Select (TableName), GroupCode,GroupDesc,SequenceNum,ContentCode,ContentDesc,NonFlag,SequenceNbr

From (Select TableName, val,row,col

From split) as SourceTable

Pivot

( min(val) for Col in (GroupCode,GroupDesc,SequenceNum,ContentCode,ContentDesc,NonFlag,SequenceNbr)

) as Pivt;

Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw[/url]
Learn Extended Events

LutzM SSC Guru Points: 107049 More actions · Answer 10

CirquedeSQLeil (2/25/2011)
I played with it some more. Using combo from what lutz and I did already.
; WITH cte AS
(
SELECT
ROW_NUMBER() OVER(ORDER BY (SELECT 1) ) AS row,
X.Y.value('(@name)[1]','varchar(20)') as TableName,
X.Y.value('@columns[1]','varchar(200)') AS col,
U.V.value('@values[1]','varchar(200)') AS val
FROM @XML.nodes('//relation') X(Y)
CROSS APPLY X.Y.nodes('row') U(V)
), split as (
SELECT cc.row,cc.TableName as TableName, cols.item AS col, vals.item AS val
FROM cte cc
CROSS APPLY admin.dbo.DelimitedSplit8K(col,'|') cols
CROSS APPLY admin.dbo.DelimitedSplit8K(val,'|') vals
WHERE cols.ItemNumber = vals.ItemNumber
)
Select (TableName), GroupCode,GroupDesc,SequenceNum,ContentCode,ContentDesc,NonFlag,SequenceNbr
From (Select TableName, val,row,col
From split) as SourceTable
Pivot
( min(val) for Col in (GroupCode,GroupDesc,SequenceNum,ContentCode,ContentDesc,NonFlag,SequenceNbr)
) as Pivt;

I would have used CrossTab instead of Pivot, but the concept would have been the same.

Nice job, Jason!

@mark-3:

As long as yon know the column names you'll need to deal with the solution Jason posted will work just fine. But if you need to deal with unknown column names you might want to consider using the CrossTab method. I, personally, think it's easier to convert at static CrossTab into a dynamic version than doing the same with the PIVOT operator (I'm not saying it's impossible. I'm just not used to it. And since there is very little -if any- benefit when using Pivot I've decided to stay with CrossTab...)

Lutz
A pessimist is an optimist with experience.

How to get fast answers to your question[/url]
How to post performance related questions[/url]
Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

SQLRNNR SSC Guru Points: 281344 More actions · Answer 11

LutzM (2/26/2011)
CirquedeSQLeil (2/25/2011)
I played with it some more. Using combo from what lutz and I did already.
; WITH cte AS
(
SELECT
ROW_NUMBER() OVER(ORDER BY (SELECT 1) ) AS row,
X.Y.value('(@name)[1]','varchar(20)') as TableName,
X.Y.value('@columns[1]','varchar(200)') AS col,
U.V.value('@values[1]','varchar(200)') AS val
FROM @XML.nodes('//relation') X(Y)
CROSS APPLY X.Y.nodes('row') U(V)
), split as (
SELECT cc.row,cc.TableName as TableName, cols.item AS col, vals.item AS val
FROM cte cc
CROSS APPLY admin.dbo.DelimitedSplit8K(col,'|') cols
CROSS APPLY admin.dbo.DelimitedSplit8K(val,'|') vals
WHERE cols.ItemNumber = vals.ItemNumber
)
Select (TableName), GroupCode,GroupDesc,SequenceNum,ContentCode,ContentDesc,NonFlag,SequenceNbr
From (Select TableName, val,row,col
From split) as SourceTable
Pivot
( min(val) for Col in (GroupCode,GroupDesc,SequenceNum,ContentCode,ContentDesc,NonFlag,SequenceNbr)
) as Pivt;
I would have used CrossTab instead of Pivot, but the concept would have been the same.
Nice job, Jason!
@mark-3:
As long as yon know the column names you'll need to deal with the solution Jason posted will work just fine. But if you need to deal with unknown column names you might want to consider using the CrossTab method. I, personally, think it's easier to convert at static CrossTab into a dynamic version than doing the same with the PIVOT operator (I'm not saying it's impossible. I'm just not used to it. And since there is very little -if any- benefit when using Pivot I've decided to stay with CrossTab...)

I would recommend going that route too frankly. I just didn't have the time to figure out the dynamic crosstab at the moment. Both methods are topics I would need to practice more with to get more comfortable in them.

Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw[/url]
Learn Extended Events

Mark Derryberry SSC Enthusiast Points: 124 More actions · Answer 12

Thanks loads guys!!

The whole process still takes about 2 1/2 hours, but it's a whole lot better than before! Most of the time is because of the structure of that file! There are actually 49 tables and over a million records associated with it..., so it's not the most efficient method, but it's the only method.

I've requested they change the XML format, but they are extremely reluctant.

Again, thanks for the help.

SQLRNNR SSC Guru Points: 281344 More actions · Answer 13

Mark Derryberry (3/1/2011)
Thanks loads guys!!
The whole process still takes about 2 1/2 hours, but it's a whole lot better than before! Most of the time is because of the structure of that file! There are actually 49 tables and over a million records associated with it..., so it's not the most efficient method, but it's the only method.
I've requested they change the XML format, but they are extremely reluctant.
Again, thanks for the help.

You are welcome. Glad to have been of some help.

How much of an improvement did you see?

Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw[/url]
Learn Extended Events

Mark Derryberry SSC Enthusiast Points: 124 More actions · Answer 14

Over 50% from my initial attempts! I was pushing about 6 hours and the methods you all proposed and some additional things I had to do given the scope. It's down to about 2 1/2 on a server. I say server because my desktop dev machine almost choked to death. 🙂