Importing Strange XML File

  • I'm new to using XML with SQL and I've been given a XML file that doesn't fit the normal examples, that I've read, for importing.

    <?xml version="1.0" encoding="UTF-8" standalone="yes" ?>

    <wcsxml allrelations="false" batch="1" createtime="12/29/2010 11:21:30" delimiter="|" destination="MAS" source="145" version="4.04" whslrno="32711">

    <app id="MOB" />

    <app id="IBS" />

    <relation allrows="true" columns="CustomerID|CustomerGroupID|InvoiceNum|InvoiceDate|InvoiceAmt" name="account" />

    <relation allrows="true" columns="GroupCode|GroupDesc|SequenceNum" name="group">

    <row values="10|MINIMUM 34 YRS|1" />

    <row values="11|ENTRY LEVEL/CA|2" />

    <row values="12|28-34 YEARS|3" />

    <row values="20|35+ YEARS|4" />

    <row values="21|35-49 YEARS|5" />

    <row values="22|50+ YEARS|6" />

    <row values="30|AGE NOT IDENT.|7" />

    </relation>

    <relation allrows="true" columns="ContentCode|ContentDesc|NonFlag|SequenceNbr" name="content">

    <row values="AP000| .0 Percent|Y|1" />

    </relation>

    </wcsxml>

    Any help on getting me started would be greatly appreciated.

    Mark

  • Are you familiar with the T-SQL version of XQuery?

    With that, it's pretty easy to shred this XML.

    This should get you started:

    DECLARE @XML XML = '<?xml version="1.0" encoding="UTF-8" standalone="yes" ?>

    <wcsxml allrelations="false" batch="1" createtime="12/29/2010 11:21:30" delimiter="|" destination="MAS" source="145" version="4.04" whslrno="32711">

    <app id="MOB" />

    <app id="IBS" />

    <relation allrows="true" columns="CustomerID|CustomerGroupID|InvoiceNum|InvoiceDate|InvoiceAmt" name="account" />

    <relation allrows="true" columns="GroupCode|GroupDesc|SequenceNum" name="group">

    <row values="10|MINIMUM 34 YRS|1" />

    <row values="11|ENTRY LEVEL/CA|2" />

    <row values="12|28-34 YEARS|3" />

    <row values="20|35+ YEARS|4" />

    <row values="21|35-49 YEARS|5" />

    <row values="22|50+ YEARS|6" />

    <row values="30|AGE NOT IDENT.|7" />

    </relation>

    <relation allrows="true" columns="ContentCode|ContentDesc|NonFlag|SequenceNbr" name="content">

    <row values="AP000| .0 Percent|Y|1" />

    </relation>

    </wcsxml>';

    SELECT X.Y.query('.')

    FROM @XML.nodes('/wcsxml/app') X(Y);

    SELECT X.Y.query('.')

    FROM @XML.nodes('/wcsxml/relation') X(Y);

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Awesome! Thank you for your assistance. I'm now able to shred table names out of that mess by modifying it a little so that it looks like:

    SELECT T.C.value('@name','VARCHAR(64)')

    FROM @XML.nodes('/wcsxml/relation') AS T(C)

    I can go even further and use:

    SELECT T.C.value('@name','VARCHAR(64)')

    FROM @XML.nodes('/wcsxml/relation[1]') AS T(C)

    This pulls out a single table name.

    Now comes my last problem; the column names and values. I've read up on tsql xquery and have played around a bit in an attempt to shred out columns and values, but I've been unsuccessful so far.

    Could I trouble you as to a hint on how to shred out the delimited column names? I've come up with a rather clunky way of doing it by use of a loop and parsing each line, but this is VERY slow. :crazy:

  • Is this what you're looking for?

    SELECT T.C.value('@name','VARCHAR(64)'),

    T.C.value('local-name(.)','VARCHAR(64)')

    FROM @XML.nodes('/wcsxml/relation') AS T(C)

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Unfortunately no. I was able to get that far, but the issue is how the file is designed. Forgive me if you already have figured out the following.

    The columns are identified in a pipe-delimited format within the relation tag.

    For instance the line:

    <relation allrows="true" columns="ContentCode|ContentDesc|NonFlag|SequenceNbr" name="content">

    This indicates the table name is "content" and the columns are, ContentCode, ContentDesc, NonFlag, and SequenceNbr.

    The row tags indicated a pipe-delimited value set for those columns.

    <row values="AP000| .0 Percent|Y|1" />

    The optimal way it should have been designed is:

    <content>

    <ContentCode>AP000</ContentCode>

    <ContentDesc> .0 Percent</ContentDesc>

    <NonFlag>Y</NonFlag>

    <SequenceNbr>1</SequenceNbr>

    </content>

    I now have the un-enviable task of figuring out how to parse the delimters and import it correctly. As I said in my earlier email, I have this working through a loop, but it's slow as the actual file is much larger than my example.

    Mark

  • Here's my approach to shred the xml file:

    I'd apply a fast split string function to the shredded data. It might help performance to insert the shredded data into a staging table before applying the split function. Needs to be tested and verified.

    ; WITH cte AS

    (

    SELECT

    ROW_NUMBER() OVER(ORDER BY (SELECT 1) ) AS row,

    X.Y.value('@columns[1]','varchar(200)') AS col,

    U.V.value('@values[1]','varchar(200)') AS val

    FROM @XML.nodes('/wcsxml/relation') X(Y)

    CROSS APPLY X.Y.nodes('row') U(V)

    )

    SELECT cc.row, cols.item AS col, vals.item AS val

    FROM cte cc

    CROSS APPLY dbo.DelimitedSplit8K(col,'|') cols

    CROSS APPLY dbo.DelimitedSplit8K(val,'|') vals

    WHERE cols.ItemNumber = vals.ItemNumber

    Once you have the data shredded as shown you could either use a CrossTab query (as described in the related link in my signature) if you know the column names or use a DynamicCrossTab (also referenced in my signature).

    Depending on the data volume it might help to store the split results into another (indexed) temp table. Again, this needs to be tested.

    Regarding the DelimitedSplit8K function: please search this site for "DelimitedSplit8K" and you'll find the related code. see the last link in my signature (totally forgot about it - shame on me...)



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • I think This will get you closer

    SELECT X.Y.value('(../@name)[1]','varchar(20)') as TableName

    ,x.y.value('(../@columns)[1]','varchar(128)') as ColumnList

    ,x.y.value('(@values)[1]','varchar(128)') as ColumnValues

    FROM @XML.nodes('//relation/row') X(Y);

    I'm still working through it, but my thought was to use the pivot and a string shred to normalize the data.

    This will create two columns with | delimited lists

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • LutzM (2/25/2011)


    Here's my approach to shred the xml file:

    I'd apply a fast split string function to the shredded data. It might help performance to insert the shredded data into a staging table before applying the split function. Needs to be tested and verified.

    ; WITH cte AS

    (

    SELECT

    ROW_NUMBER() OVER(ORDER BY (SELECT 1) ) AS row,

    X.Y.value('@columns[1]','varchar(200)') AS col,

    U.V.value('@values[1]','varchar(200)') AS val

    FROM @XML.nodes('/wcsxml/relation') X(Y)

    CROSS APPLY X.Y.nodes('row') U(V)

    )

    SELECT cc.row, cols.item AS col, vals.item AS val

    FROM cte cc

    CROSS APPLY dbo.DelimitedSplit8K(col,'|') cols

    CROSS APPLY dbo.DelimitedSplit8K(val,'|') vals

    WHERE cols.ItemNumber = vals.ItemNumber

    Once you have the data shredded as shown you could either use a CrossTab query (as described in the related link in my signature) if you know the column names or use a DynamicCrossTab (also referenced in my signature).

    Depending on the data volume it might help to store the split results into another (indexed) temp table. Again, this needs to be tested.

    Regarding the DelimitedSplit8K function: please search this site for "DelimitedSplit8K" and you'll find the related code. see the last link in my signature (totally forgot about it - shame on me...)

    Nix my response - didn't see Lutz's answer while working on a solution.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • Thanks for the help! This looks promising and I'll play with it some more over the weekend. This has become less of a task and more of a personal challenge now!

  • I played with it some more. Using combo from what lutz and I did already.

    ; WITH cte AS

    (

    SELECT

    ROW_NUMBER() OVER(ORDER BY (SELECT 1) ) AS row,

    X.Y.value('(@name)[1]','varchar(20)') as TableName,

    X.Y.value('@columns[1]','varchar(200)') AS col,

    U.V.value('@values[1]','varchar(200)') AS val

    FROM @XML.nodes('//relation') X(Y)

    CROSS APPLY X.Y.nodes('row') U(V)

    ), split as (

    SELECT cc.row,cc.TableName as TableName, cols.item AS col, vals.item AS val

    FROM cte cc

    CROSS APPLY admin.dbo.DelimitedSplit8K(col,'|') cols

    CROSS APPLY admin.dbo.DelimitedSplit8K(val,'|') vals

    WHERE cols.ItemNumber = vals.ItemNumber

    )

    Select (TableName), GroupCode,GroupDesc,SequenceNum,ContentCode,ContentDesc,NonFlag,SequenceNbr

    From (Select TableName, val,row,col

    From split) as SourceTable

    Pivot

    ( min(val) for Col in (GroupCode,GroupDesc,SequenceNum,ContentCode,ContentDesc,NonFlag,SequenceNbr)

    ) as Pivt;

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • CirquedeSQLeil (2/25/2011)


    I played with it some more. Using combo from what lutz and I did already.

    ; WITH cte AS

    (

    SELECT

    ROW_NUMBER() OVER(ORDER BY (SELECT 1) ) AS row,

    X.Y.value('(@name)[1]','varchar(20)') as TableName,

    X.Y.value('@columns[1]','varchar(200)') AS col,

    U.V.value('@values[1]','varchar(200)') AS val

    FROM @XML.nodes('//relation') X(Y)

    CROSS APPLY X.Y.nodes('row') U(V)

    ), split as (

    SELECT cc.row,cc.TableName as TableName, cols.item AS col, vals.item AS val

    FROM cte cc

    CROSS APPLY admin.dbo.DelimitedSplit8K(col,'|') cols

    CROSS APPLY admin.dbo.DelimitedSplit8K(val,'|') vals

    WHERE cols.ItemNumber = vals.ItemNumber

    )

    Select (TableName), GroupCode,GroupDesc,SequenceNum,ContentCode,ContentDesc,NonFlag,SequenceNbr

    From (Select TableName, val,row,col

    From split) as SourceTable

    Pivot

    ( min(val) for Col in (GroupCode,GroupDesc,SequenceNum,ContentCode,ContentDesc,NonFlag,SequenceNbr)

    ) as Pivt;

    I would have used CrossTab instead of Pivot, but the concept would have been the same.

    Nice job, Jason!

    @mark-3:

    As long as yon know the column names you'll need to deal with the solution Jason posted will work just fine. But if you need to deal with unknown column names you might want to consider using the CrossTab method. I, personally, think it's easier to convert at static CrossTab into a dynamic version than doing the same with the PIVOT operator (I'm not saying it's impossible. I'm just not used to it. And since there is very little -if any- benefit when using Pivot I've decided to stay with CrossTab...)



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • LutzM (2/26/2011)


    CirquedeSQLeil (2/25/2011)


    I played with it some more. Using combo from what lutz and I did already.

    ; WITH cte AS

    (

    SELECT

    ROW_NUMBER() OVER(ORDER BY (SELECT 1) ) AS row,

    X.Y.value('(@name)[1]','varchar(20)') as TableName,

    X.Y.value('@columns[1]','varchar(200)') AS col,

    U.V.value('@values[1]','varchar(200)') AS val

    FROM @XML.nodes('//relation') X(Y)

    CROSS APPLY X.Y.nodes('row') U(V)

    ), split as (

    SELECT cc.row,cc.TableName as TableName, cols.item AS col, vals.item AS val

    FROM cte cc

    CROSS APPLY admin.dbo.DelimitedSplit8K(col,'|') cols

    CROSS APPLY admin.dbo.DelimitedSplit8K(val,'|') vals

    WHERE cols.ItemNumber = vals.ItemNumber

    )

    Select (TableName), GroupCode,GroupDesc,SequenceNum,ContentCode,ContentDesc,NonFlag,SequenceNbr

    From (Select TableName, val,row,col

    From split) as SourceTable

    Pivot

    ( min(val) for Col in (GroupCode,GroupDesc,SequenceNum,ContentCode,ContentDesc,NonFlag,SequenceNbr)

    ) as Pivt;

    I would have used CrossTab instead of Pivot, but the concept would have been the same.

    Nice job, Jason!

    @mark-3:

    As long as yon know the column names you'll need to deal with the solution Jason posted will work just fine. But if you need to deal with unknown column names you might want to consider using the CrossTab method. I, personally, think it's easier to convert at static CrossTab into a dynamic version than doing the same with the PIVOT operator (I'm not saying it's impossible. I'm just not used to it. And since there is very little -if any- benefit when using Pivot I've decided to stay with CrossTab...)

    I would recommend going that route too frankly. I just didn't have the time to figure out the dynamic crosstab at the moment. Both methods are topics I would need to practice more with to get more comfortable in them.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • Thanks loads guys!!

    The whole process still takes about 2 1/2 hours, but it's a whole lot better than before! Most of the time is because of the structure of that file! There are actually 49 tables and over a million records associated with it..., so it's not the most efficient method, but it's the only method.

    I've requested they change the XML format, but they are extremely reluctant.

    Again, thanks for the help.

  • Mark Derryberry (3/1/2011)


    Thanks loads guys!!

    The whole process still takes about 2 1/2 hours, but it's a whole lot better than before! Most of the time is because of the structure of that file! There are actually 49 tables and over a million records associated with it..., so it's not the most efficient method, but it's the only method.

    I've requested they change the XML format, but they are extremely reluctant.

    Again, thanks for the help.

    You are welcome. Glad to have been of some help.

    How much of an improvement did you see?

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • Over 50% from my initial attempts! I was pushing about 6 hours and the methods you all proposed and some additional things I had to do given the scope. It's down to about 2 1/2 on a server. I say server because my desktop dev machine almost choked to death. 🙂

Viewing 15 posts - 1 through 15 (of 18 total)

You must be logged in to reply to this topic. Login to reply