XML parsing/shredding

  • cmoy

    Ten Centuries

    Points: 1029

    Hello, not sure if this is the right subform to post this in, but anyhow...

    I have a few tables that all has a column with an XML datatype in it, with lots of data this column. Each of these tables has about 10-20 million rows.

    The XML data structure is pretty simple but has about 100 nodes, i.e:

    <profile>

    <entity name="ADDRESS">

    <attr name="NAME1">WHITE HOUSE</attr>

    <attr name="STREET">1600 Pennsylvania</attr>

    <attr name="CITY1">Washington</attr>

    <attr name="REGION">DC</attr>

    <attr name="POST_CODE1">01234</attr>

    .

    .

    </entity>

    </profile>

    Anyhow, I would like to take this XML and move this into a new table that is extracted out into proper columns.  However, using conventional XQuery methodologies to extract out (querying using XMLColumnName.value), I'm averaging a scant 10 rows a second to write rows to a temp table, using read batches of 1000 records.  I would need to average about 200 rows a second at a bare minimum for this conversion to be able to run on time.

    Questions -

    1. can someone provide a sample of using OPENXML to query a table XML column?  Every example I've variable seen queries a variable.
    2. Any ideas on what other technique might be faster?  I'm guessing that would involve writing in C# or Python.  I've read where converting to JSON and using JSON_QUERY might be faster, but I'm afraid that the conversion process to JSON would eat up any gains.
  • Mike01

    SSChampion

    Points: 11282

    can you provide a little bigger sample of the XML and some DDL would help.  Also, can you provide the query you are running?  What does the execution plan say?  Are you exporting all to varchar(max) columns or are they sized properly?

    For better, quicker answers, click on the following...
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...
    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • cmoy

    Ten Centuries

    Points: 1029

    The XML is really simple - it would probably be another 100 nodes more than what's on there.

    Query:

    SELECT

    A.SOURCE_PKEY,

    XmlColumn.value('(profile/entity[@name="ADDRESS"]/attr[@name="NAME1"])[1]', 'varchar(255)') NAME1,

    XmlColumn.value('(profile/entity[@name="ADDRESS"]/attr[@name="STREET"])[1]', 'varchar(255)') STREET,

    XmlColumn.value('(profile/entity[@name="ADDRESS"]/attr[@name="CITY1"])[1]', 'varchar(255)') CITY1,

    XmlColumn.value('(profile/entity[@name="ADDRESS"]/attr[@name="REGION"])[1]', 'varchar(255)') REGION,

    XmlColumn.value('(profile/entity[@name="ADDRESS"]/attr[@name="POST_CODE1"])[1]', 'varchar(255)') POST_CODE1

    .

    FROM Table

     

    where XmlColumn is the name of the XML column in the table.

    The execution plan is fine - all the time involved is via the shredding itself.

    I'm about to go in a different direction now - just exporting the data to a flat file and using C# to shred it that way.  Very inconvenient.

    • This reply was modified 1 month, 1 week ago by  cmoy.
  • Jeff Moden

    SSC Guru

    Points: 996832

    Is there more than one "Address" element per XML or is there always just 1 (regardless of how many attributes that node contains)?

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".
    "If "pre-optimization" is the root of all evil, then what does the resulting no optimization lead to?"

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • cmoy

    Ten Centuries

    Points: 1029

    in this case, there's only one Address element per XML column row.

    The issue is that this is unstandard XML, it should have been all defined as all elements in this case, so I've more or less given up and just going to use write something custom in C# which I didn't particularly want to do.  Unfortunately, even after a few improvements to the SQL script, I was able to only insert 850K rows for a 24 hour period, running my extract out the columns from XML and load into a table using XQuery, and this table has almost 20 million rows with this XML column, and literally 99% of the execution plan time is on the shredding.

Viewing 5 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply