Extracting data from XML is slow and filling the tempdb transaction log

  • I am trying to extract data from an XML column within a table on SQL Server 2005. The table has 2 columns:

    (guid guid,
     LogXML XML)

    The XML contains data for an application; an application can consist of multiple applicants, and each applicant has data associated with it. A cut-down version of this data looks like:

    <applicant>
      <summaryData>
        <dataName>Age</dataName>
        <dataValue>45</dataValue>
      </summaryData>
      <summaryData>
        <dataName>LastName</dataName>
        <dataValue>Jones</dataValue>
      </summaryData>
    </applicant>
    <applicant>
      <summaryData>
        <dataName>Age</dataName>
        <dataValue>48</dataValue>
      </summaryData>
      <summaryData>
        <dataName>LastName</dataName>
        <dataValue>Jones</dataValue>
      </summaryData>
    </applicant>

    Note there are over 900 summaryData elements for each applicant.

    What I want to do is extract this data into a flat table with 900+ columns (yes, I know this is a bad idea, but bear with me as this is only for a staging area). The data will look like:

    id, ApplicantNo, Age, LastName, ...
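
    For illustration, a minimal sketch of the staging table this implies (the table name and types are assumptions; only two of the 900+ data columns are shown):

    CREATE TABLE dbo.ApplicantStaging -- hypothetical name
    (
        id UNIQUEIDENTIFIER,   -- the Guid from the source table
        ApplicantNo INT,       -- position of the applicant within the XML
        Age VARCHAR(50),
        LastName VARCHAR(50)
        -- ...remaining 900+ dataName columns omitted
    );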

    I've worked out how to do this; the problem is it's doing it very slowly and blowing up the tempdb transaction log.

    To summarise my solution:

    First I'm using a DENSE_RANK function to break the data up into applicants:

    INSERT INTO @applicant
    SELECT t1.Guid
          ,DENSE_RANK() OVER (ORDER BY b_node) - 1 AS arrayOffset
          ,b.b_node.query('.') AS app
    FROM
    (
        SELECT L.Guid AS Guid
              ,L.LogXML.query('(//*[local-name()="applicant"])') AS app
        FROM dbo.LOGFILEXML L
    ) AS t1
    CROSS APPLY t1.app.nodes('./*') AS b (b_node);

    Then I break the data into key/value pairs in another table variable:

    INSERT INTO @DerivedStats900 (
         [Guid]
        ,arrayOffset
        ,dataName
        ,dataValue
    )
    SELECT Guid
          ,arrayOffset
          ,c.value('dataName[1]', 'varchar(50)') AS dataName
          ,c.value('dataValue[1]', 'varchar(50)') AS dataValue
          -- ,c.value('.', 'varchar(50)') AS kvp
    FROM (
        SELECT Guid
              ,arrayOffset
              ,app.query('(//*[local-name()="summaryData"])') AS kvp
        FROM @applicant
    ) x
    CROSS APPLY kvp.nodes('//*') AS T(c)
    WHERE x.kvp IS NOT NULL;

    Once I have the data in this format I PIVOT to give me the 900-column table. The problem I'm having is with the second query.

    In Prod the first query will produce about 10,000 applicant records. It's not quick, but acceptable. The second query is taking forever and blowing up the tempdb transaction log. Is there a better way to do this?
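
    For reference, the final PIVOT step described above (but not shown) would look something like this minimal sketch, with only two of the 900+ pivoted columns listed:

    SELECT [Guid], arrayOffset, [Age], [LastName] -- ...remaining columns omitted
    FROM (
        SELECT [Guid], arrayOffset, dataName, dataValue
        FROM @DerivedStats900
    ) src
    PIVOT (MAX(dataValue) FOR dataName IN ([Age], [LastName])) p;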

  • I don't think you're doing yourself any favors doing a blind seek for nodes and then filtering them in the xml.query() method.

    I suspect you'd get a LOT more out of

    INSERT INTO @DerivedStats900 (
         [Guid]
        ,arrayOffset
        ,dataName
        ,dataValue
    )
    SELECT Guid
          ,arrayOffset
          ,c.value('dataName[1]', 'varchar(50)') AS dataName
          ,c.value('dataValue[1]', 'varchar(50)') AS dataValue
          -- ,c.value('.', 'varchar(50)') AS kvp
    FROM (
        SELECT Guid
              ,arrayOffset
              ,app.query('.') AS kvp
        FROM @applicant
    ) x
    CROSS APPLY kvp.nodes('/applicant/summaryData') AS T(c)
    WHERE x.kvp IS NOT NULL;

    You've already established the specific node names and the path to them from the previous queries, so no real sense in using a wildcard (i.e. //*).

    Also - since you're talking about 9 million entries from the 2nd query (900 * 10,000 entries), you should definitely consider using temp tables over table variables. I suspect you're not seeing any statistics against your table variables, so the query is presuming a rowcount of 1 (i.e. a bad exec plan). I'd frankly consider indexing the temp table once it's loaded - it should make the PIVOT MUCH faster.
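
    As a rough sketch of that suggestion, using the column names from the earlier queries (the index key is only an illustration; pick whatever supports the PIVOT):

    -- Temp table instead of a table variable: real statistics, better plans
    CREATE TABLE #DerivedStats900
    (
        [Guid] UNIQUEIDENTIFIER,
        arrayOffset INT,
        dataName VARCHAR(50),
        dataValue VARCHAR(50)
    );

    -- ...load it with the shredding query above, then index before pivoting
    CREATE CLUSTERED INDEX IX_DerivedStats900
        ON #DerivedStats900 ([Guid], arrayOffset, dataName);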


  • This is a very simple but effective approach using the nodes method to shred the XML.

    😎

    DECLARE @XML XML = N'<applicant>
        <summaryData>
            <dataName>Age</dataName>
            <dataValue>45</dataValue>
        </summaryData>
        <summaryData>
            <dataName>LastName</dataName>
            <dataValue>Jones</dataValue>
        </summaryData>
        <summaryData>
            <dataName>NotThis</dataName>
            <dataValue>Value</dataValue>
        </summaryData>
    </applicant>
    <applicant>
        <summaryData>
            <dataName>Age</dataName>
            <dataValue>48</dataValue>
        </summaryData>
        <summaryData>
            <dataName>LastName</dataName>
            <dataValue>Jones</dataValue>
        </summaryData>
    </applicant>';

    DECLARE @APPLICANT TABLE
    (
        guid INT IDENTITY(1,1) NOT NULL
       ,LogXML XML NOT NULL
    );

    INSERT INTO @APPLICANT(LogXML)
    SELECT @XML;

    ;WITH APPLICANT_ENTRY AS
    (
        SELECT
            AP.guid
           ,ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS APPL_RID
           ,APPL.ICANT.query('(.)') AS ANCHOR
        FROM @APPLICANT AP
        OUTER APPLY AP.LogXML.nodes('applicant') AS APPL(ICANT)
    )
    ,APPL_SUMMARY AS
    (
        SELECT
            APEN.APPL_RID
           ,SUMMARY.DATA.value('./dataName[1]','NVARCHAR(250)') AS SD_dataName
           ,SUMMARY.DATA.value('./dataValue[1]','NVARCHAR(250)') AS SD_dataValue
        FROM APPLICANT_ENTRY APEN
        OUTER APPLY APEN.ANCHOR.nodes('applicant/summaryData') AS SUMMARY(DATA)
        WHERE SUMMARY.DATA.value('./dataName[1]','NVARCHAR(250)') IN (N'Age',N'LastName')
    )
    SELECT *
    FROM APPL_SUMMARY APSU
    PIVOT (MAX(SD_dataValue)
    FOR SD_dataName IN ([Age],[LastName])) AS PTBL;

    Results

    APPL_RID  Age  LastName
    --------  ---  --------
    1         45   Jones
    2         48   Jones

  • I was going to go down the route of a CTE as well, but the OP mentioned 2005.


  • Matt Miller (#4) (4/23/2014)

    I was going to go down the route of a CTE as well, but the OP mentioned 2005.

    My bad; then the option is to nest it:

    😎

    DECLARE @XML XML = N'<applicant>
        <summaryData>
            <dataName>Age</dataName>
            <dataValue>45</dataValue>
        </summaryData>
        <summaryData>
            <dataName>LastName</dataName>
            <dataValue>Jones</dataValue>
        </summaryData>
        <summaryData>
            <dataName>NotThis</dataName>
            <dataValue>Value</dataValue>
        </summaryData>
    </applicant>
    <applicant>
        <summaryData>
            <dataName>Age</dataName>
            <dataValue>48</dataValue>
        </summaryData>
        <summaryData>
            <dataName>LastName</dataName>
            <dataValue>Jones</dataValue>
        </summaryData>
    </applicant>';

    DECLARE @APPLICANT TABLE
    (
        guid INT IDENTITY(1,1) NOT NULL
       ,LogXML XML NOT NULL
    );

    INSERT INTO @APPLICANT(LogXML)
    SELECT @XML;

    SELECT *
    FROM
    (
        SELECT
            APEN.APPL_RID
           ,SUMMARY.DATA.value('./dataName[1]','NVARCHAR(250)') AS SD_dataName
           ,SUMMARY.DATA.value('./dataValue[1]','NVARCHAR(250)') AS SD_dataValue
        FROM
        (
            SELECT
                AP.guid
               ,ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS APPL_RID
               ,APPL.ICANT.query('(.)') AS ANCHOR
            FROM @APPLICANT AP
            OUTER APPLY AP.LogXML.nodes('applicant') AS APPL(ICANT)
        ) AS APEN
        OUTER APPLY APEN.ANCHOR.nodes('applicant/summaryData') AS SUMMARY(DATA)
        WHERE SUMMARY.DATA.value('./dataName[1]','NVARCHAR(250)') IN (N'Age',N'LastName')
    ) AS APSU
    PIVOT (MAX(SD_dataValue)
    FOR SD_dataName IN ([Age],[LastName])) AS PTBL;

  • Ok, I've managed to get this performing reasonably well with a combination of changes. I've kept the 3 queries separate, as I wasn't getting anywhere combining them into a single query.

    I've kept the first query as is. I believe I can get further performance improvement out of it, but for now it will do.

    The first performance improvement was in relation to the data itself. It turns out not all applicants have summaryData (about 25% don't), so I was able to delete those before the next query, as in the sketch below.
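
    A sketch of that clean-up, assuming the @applicant table variable and app column from the first query, and using the XML exist() method for the check:

    -- Drop applicants that carry no summaryData before shredding them
    DELETE FROM @applicant
    WHERE app.exist('//*[local-name()="summaryData"]') = 0;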

    The next performance improvement came from Matt Miller's suggestion of not using the wildcard (i.e. //*). This improved performance quite a bit. Additionally, I am now loading the second query's output into a temp table rather than a table variable. This does not improve the performance of the load itself, but it has a great benefit in the final PIVOT query.

    I could not find a useful index to apply to the temp table; in fact, it seemed to hinder performance. I noticed that without an index the PIVOT query was choosing a parallel plan, but with one it was not. This is the crux of the final performance improvement: with a table variable you do not get the parallel plan, but with a temp table you do, taking the query from about 50 minutes down to 7 minutes.

    Thanks for the help,

    Paul
