Extracting data from XML is slow and filling the tempdb transaction log

  • I am trying to extract data from an XML column within a table on SQL Server 2005. The table has 2 columns:

    (guid guid,
     LogXML XML)

    The XML contains data for an application; an application can consist of multiple applicants, and each applicant has data associated with it. A cut-down version of this data looks like:

    <applicant>
      <summaryData>
        <dataName>Age</dataName>
        <dataValue>45</dataValue>
      </summaryData>
      <summaryData>
        <dataName>LastName</dataName>
        <dataValue>Jones</dataValue>
      </summaryData>
    </applicant>
    <applicant>
      <summaryData>
        <dataName>Age</dataName>
        <dataValue>48</dataValue>
      </summaryData>
      <summaryData>
        <dataName>LastName</dataName>
        <dataValue>Jones</dataValue>
      </summaryData>
    </applicant>

    Note there are over 900 summaryData elements for each applicant.

    What I want to do is extract this data into a flat table with 900+ columns (yes, I know this is a bad idea, but bear with me as this is only for a staging area). The data will look like:

    id, ApplicantNo, Age, LastName, ...
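
    For illustration, a minimal sketch of the staging table this implies (the table name and types are assumptions; only two of the 900+ data columns are shown):

    CREATE TABLE dbo.ApplicantStaging -- hypothetical name
    (
        id UNIQUEIDENTIFIER,   -- the Guid from the source table
        ApplicantNo INT,       -- position of the applicant within the XML
        Age VARCHAR(50),
        LastName VARCHAR(50)
        -- ...remaining 900+ dataName columns omitted
    );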

    I've worked out how to do this; the problem is it's doing it very slowly and blowing up the tempdb transaction log.

    To summarise my solution:

    First I'm using a DENSE_RANK function to break the data up into applicants:

    INSERT INTO @applicant
    SELECT t1.Guid
          ,DENSE_RANK() OVER (ORDER BY b_node) - 1 AS arrayOffset
          ,b.b_node.query('.') AS app
    FROM
    (
        SELECT L.Guid AS Guid
              ,L.LogXML.query('(//*[local-name()="applicant"])') AS app
        FROM dbo.LOGFILEXML L
    ) AS t1
    CROSS APPLY t1.app.nodes('./*') AS b (b_node);

    Then I break the data into key/value pairs in another table variable:

    INSERT INTO @DerivedStats900 (
         [Guid]
        ,arrayOffset
        ,dataName
        ,dataValue
    )
    SELECT Guid
          ,arrayOffset
          ,c.value('dataName[1]', 'varchar(50)') AS dataName
          ,c.value('dataValue[1]', 'varchar(50)') AS dataValue
          -- ,c.value('.', 'varchar(50)') AS kvp
    FROM (
        SELECT Guid
              ,arrayOffset
              ,app.query('(//*[local-name()="summaryData"])') AS kvp
        FROM @applicant
    ) x
    CROSS APPLY kvp.nodes('//*') AS T(c)
    WHERE x.kvp IS NOT NULL;

    Once I have the data in this format I PIVOT to give me the 900-column table. The problem I'm having is with the second query.

    In Prod the first query will produce about 10,000 applicant records. It's not quick, but acceptable. The second query is taking forever and blowing up the tempdb transaction log. Is there a better way to do this?
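
    For reference, the final PIVOT step described above (but not shown) would look something like this minimal sketch, with only two of the 900+ pivoted columns listed:

    SELECT [Guid], arrayOffset, [Age], [LastName] -- ...remaining columns omitted
    FROM (
        SELECT [Guid], arrayOffset, dataName, dataValue
        FROM @DerivedStats900
    ) src
    PIVOT (MAX(dataValue) FOR dataName IN ([Age], [LastName])) p;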

  • I don't think you're doing yourself any favors doing a blind seek for nodes and then filtering them in the xml.query() method.

    I suspect you'd get a LOT more out of

    INSERT INTO @DerivedStats900 (
         [Guid]
        ,arrayOffset
        ,dataName
        ,dataValue
    )
    SELECT Guid
          ,arrayOffset
          ,c.value('dataName[1]', 'varchar(50)') AS dataName
          ,c.value('dataValue[1]', 'varchar(50)') AS dataValue
          -- ,c.value('.', 'varchar(50)') AS kvp
    FROM (
        SELECT Guid
              ,arrayOffset
              ,app.query('.') AS kvp
        FROM @applicant
    ) x
    CROSS APPLY kvp.nodes('/applicant/summaryData') AS T(c)
    WHERE x.kvp IS NOT NULL;

    You've already established the specific node names and the path to them from the previous queries, so no real sense in using a wildcard (i.e. //*).

    Also - since you're talking about 9 million entries from the 2nd query (900 * 10,000 entries), you should definitely consider using temp tables over table variables. I suspect you're not seeing any statistics against your table variables, so the query is presuming a rowcount of 1 (i.e. a bad exec plan). I'd frankly consider indexing the temp table once it's loaded - it should make the PIVOT MUCH faster.
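
    As a rough sketch of that suggestion, using the column names from the earlier queries (the index key is only an illustration; pick whatever supports the PIVOT):

    -- Temp table instead of a table variable: real statistics, better plans
    CREATE TABLE #DerivedStats900
    (
        [Guid] UNIQUEIDENTIFIER,
        arrayOffset INT,
        dataName VARCHAR(50),
        dataValue VARCHAR(50)
    );

    -- ...load it with the shredding query above, then index before pivoting
    CREATE CLUSTERED INDEX IX_DerivedStats900
        ON #DerivedStats900 ([Guid], arrayOffset, dataName);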


  • This is a very simple but effective approach using the nodes method to shred the XML.

    😎

    DECLARE @XML XML = N'<applicant>
        <summaryData>
            <dataName>Age</dataName>
            <dataValue>45</dataValue>
        </summaryData>
        <summaryData>
            <dataName>LastName</dataName>
            <dataValue>Jones</dataValue>
        </summaryData>
        <summaryData>
            <dataName>NotThis</dataName>
            <dataValue>Value</dataValue>
        </summaryData>
    </applicant>
    <applicant>
        <summaryData>
            <dataName>Age</dataName>
            <dataValue>48</dataValue>
        </summaryData>
        <summaryData>
            <dataName>LastName</dataName>
            <dataValue>Jones</dataValue>
        </summaryData>
    </applicant>';

    DECLARE @APPLICANT TABLE
    (
        guid INT IDENTITY(1,1) NOT NULL
       ,LogXML XML NOT NULL
    );

    INSERT INTO @APPLICANT(LogXML)
    SELECT @XML;

    ;WITH APPLICANT_ENTRY AS
    (
        SELECT
            AP.guid
           ,ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS APPL_RID
           ,APPL.ICANT.query('(.)') AS ANCHOR
        FROM @APPLICANT AP
        OUTER APPLY AP.LogXML.nodes('applicant') AS APPL(ICANT)
    )
    ,APPL_SUMMARY AS
    (
        SELECT
            APEN.APPL_RID
           ,SUMMARY.DATA.value('./dataName[1]','NVARCHAR(250)') AS SD_dataName
           ,SUMMARY.DATA.value('./dataValue[1]','NVARCHAR(250)') AS SD_dataValue
        FROM APPLICANT_ENTRY APEN
        OUTER APPLY APEN.ANCHOR.nodes('applicant/summaryData') AS SUMMARY(DATA)
        WHERE SUMMARY.DATA.value('./dataName[1]','NVARCHAR(250)') IN (N'Age',N'LastName')
    )
    SELECT *
    FROM APPL_SUMMARY APSU
    PIVOT (MAX(SD_dataValue)
    FOR SD_dataName IN ([Age],[LastName])) AS PTBL;

    Results

    APPL_RID  Age  LastName
    --------  ---  --------
    1         45   Jones
    2         48   Jones

  • I was going to go down the route of a CTE as well, but the OP mentioned 2005.


  • Matt Miller (#4) (4/23/2014)

    I was going to go down the route of a CTE as well, but the OP mentioned 2005.

    My bad; then the option is to nest it:

    😎

    DECLARE @XML XML = N'<applicant>
        <summaryData>
            <dataName>Age</dataName>
            <dataValue>45</dataValue>
        </summaryData>
        <summaryData>
            <dataName>LastName</dataName>
            <dataValue>Jones</dataValue>
        </summaryData>
        <summaryData>
            <dataName>NotThis</dataName>
            <dataValue>Value</dataValue>
        </summaryData>
    </applicant>
    <applicant>
        <summaryData>
            <dataName>Age</dataName>
            <dataValue>48</dataValue>
        </summaryData>
        <summaryData>
            <dataName>LastName</dataName>
            <dataValue>Jones</dataValue>
        </summaryData>
    </applicant>';

    DECLARE @APPLICANT TABLE
    (
        guid INT IDENTITY(1,1) NOT NULL
       ,LogXML XML NOT NULL
    );

    INSERT INTO @APPLICANT(LogXML)
    SELECT @XML;

    SELECT *
    FROM
    (
        SELECT
            APEN.APPL_RID
           ,SUMMARY.DATA.value('./dataName[1]','NVARCHAR(250)') AS SD_dataName
           ,SUMMARY.DATA.value('./dataValue[1]','NVARCHAR(250)') AS SD_dataValue
        FROM
        (
            SELECT
                AP.guid
               ,ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS APPL_RID
               ,APPL.ICANT.query('(.)') AS ANCHOR
            FROM @APPLICANT AP
            OUTER APPLY AP.LogXML.nodes('applicant') AS APPL(ICANT)
        ) AS APEN
        OUTER APPLY APEN.ANCHOR.nodes('applicant/summaryData') AS SUMMARY(DATA)
        WHERE SUMMARY.DATA.value('./dataName[1]','NVARCHAR(250)') IN (N'Age',N'LastName')
    ) AS APSU
    PIVOT (MAX(SD_dataValue)
    FOR SD_dataName IN ([Age],[LastName])) AS PTBL;

  • Ok, I've managed to get this performing reasonably well with a combination of changes. I've kept the 3 queries separate, as I wasn't getting anywhere combining them into a single query.

    I've kept the first query as is. I believe I can get further performance improvement out of it, but for now it will do.

    The first performance improvement was in relation to the data itself. It turns out not all applicants have summaryData (about 25% don't), so I was able to delete those before the next query, as in the sketch below.
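
    A sketch of that clean-up, assuming the @applicant table variable and app column from the first query, and using the XML exist() method for the check:

    -- Drop applicants that carry no summaryData before shredding them
    DELETE FROM @applicant
    WHERE app.exist('//*[local-name()="summaryData"]') = 0;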

    The next performance improvement came from Matt Miller's suggestion of not using the wildcard (i.e. //*). This improved performance quite a bit. Additionally, I am now loading the second query's output into a temp table rather than a table variable. This does not improve the performance of the load itself, but it has a great benefit in the final PIVOT query.

    I could not find a useful index to apply to the temp table; in fact, it seemed to hinder performance. I noticed that without an index the PIVOT query was choosing a parallel plan, but with one it was not. This is the crux of the final performance improvement: with a table variable you do not get the parallel plan, but with a temp table you do, taking the query from about 50 minutes down to 7 minutes.

    Thanks for the help,

    Paul
