Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase

Extracting data from XML is slow and filling tempdb transaction log Expand / Collapse
Author
Message
Posted Tuesday, April 22, 2014 6:40 PM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Monday, June 23, 2014 3:59 PM
Points: 7, Visits: 45
I am trying to extract data from an XML column within a table on SQL Server 2005. The table as 2 columns:

(guid guid,
LogXML XML)

The XML contains data for application and an application can consist of multiple applicants and each applicant as data associated to them. The cut down version of this data looks like:

<applicant>
<summaryData>
<dataName>Age</dataName>
<dataValue>45</dataValue>
</summaryData>
<summaryData>
<dataName>LastName</dataName>
<dataValue>Jones</dataValue>
</summaryData>
</applicant>
<applicant>
<summaryData>
<dataName>Age</dataName>
<dataValue>48</dataValue>
</summaryData>
<summaryData>
<dataName>LastName</dataName>
<dataValue>Jones</dataValue>
</summaryData>
</applicant>

Note there are over 900 summaryData elements for each applicant.

What I want to do is extract this data in the flat table with 900+ columns (yes I know this is a bad idea but bear with me as this is only for a staging area) and the data will look like:

id, ApplicantNo, Age, Lastname, ...

I’ve worked out how to do this, the problem is it doing it very slowly and is blowing up the tempdb transaction log.
TO summarise my solution:

First I’m using a dense rank function to break up the data first in to applicants


INSERT INTO @applicant
SELECT t1.Guid
,DENSE_RANK() OVER (ORDER BY b_node)-1 AS arrayOffset
,b.b_node.query('.') as app
FROM
(
SELECT L.Guid as Guid
,L.LogXML.query('(//*[local-name()="applicant"])') as app
FROM dbo.LOGFILEXML L
) as t1
CROSS APPLY t1.app.nodes('./*') AS b (b_node);
Then I break the data in the Key value pairs into another table variable
INSERT INTO @DerivedStats900 (
[Guid]
,arrayOffset
,dataName
,dataValue
)
SELECT Guid
, arrayOffset
, c.value('dataName[1]', 'varchar(50)') as dataName
, c.value('dataValue[1]', 'varchar(50)') as dataValue
-- ,c.value('.', 'varchar(50)') as kvp
FROM (
SELECT Guid
, arrayOffset
,app.query('(//*[local-name()="summaryData"])') as kvp
FROM @applicant
) x
CROSS APPLY kvp.nodes('//*') as T(c)
WHERE x.kvp IS NOT NULL;
Once I have the data in this format I PIVOT to give me the 900 column table. The problem I’m having is with the second query
The first query in Prod will produce about 10,000 applicant record. It’s not quick but acceptable. The second query is taking forever and blowing up the tempdb transaction log. Is there a better way to do this
Post #1564068
Posted Tuesday, April 22, 2014 9:19 PM


SSCertifiable

SSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiable

Group: General Forum Members
Last Login: Yesterday @ 5:24 PM
Points: 7,121, Visits: 15,024
I don't think you're doing yourself any favors doing a blind seek for nodes and then filtering them in the xml.query() method.

I suspect you'd get a LOT more out of

INSERT INTO @DerivedStats900 ( 
[Guid]
,arrayOffset
,dataName
,dataValue
)
SELECT Guid
, arrayOffset
, c.value('dataName[1]', 'varchar(50)') as dataName
, c.value('dataValue[1]', 'varchar(50)') as dataValue
-- ,c.value('.', 'varchar(50)') as kvp
FROM (
SELECT Guid
, arrayOffset
,app.query('.') as kvp
FROM @applicant
) x
CROSS APPLY kvp.nodes('/applicant/summaryData') as T(c)
WHERE x.kvp IS NOT NULL;

You've already established the specific node names and the path to them from the previous queries, so no real sense in using a wildcard (i.e. //*).


Also - since you're talking about 9 million entries from 2nd query (900 * 10,000 entries) you should definitely consider using temp tables over table variables. I suspect you're not seeing any statistics against your table variables, so the query is presuming a rowcount=1 (i.e. bad exec plan). I'd frankly consider indexing the temp table once it's loaded - should make the pivot MUCH faster.


----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
Post #1564079
Posted Wednesday, April 23, 2014 12:20 AM
Ten Centuries

Ten CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen Centuries

Group: General Forum Members
Last Login: Today @ 4:15 AM
Points: 1,322, Visits: 3,781
This is a very simple but effective approach using the nodes method to shred the XML


DECLARE @XML XML = N'<applicant>
<summaryData>
<dataName>Age</dataName>
<dataValue>45</dataValue>
</summaryData>
<summaryData>
<dataName>LastName</dataName>
<dataValue>Jones</dataValue>
</summaryData>
<summaryData>
<dataName>NotThis</dataName>
<dataValue>Value</dataValue>
</summaryData>
</applicant>
<applicant>
<summaryData>
<dataName>Age</dataName>
<dataValue>48</dataValue>
</summaryData>
<summaryData>
<dataName>LastName</dataName>
<dataValue>Jones</dataValue>
</summaryData>
</applicant>';
DECLARE @APPLICANT TABLE
(
guid INT IDENTITY(1,1) NOT NULL
,LogXML XML NOT NULL
);
INSERT INTO @APPLICANT(LogXML)
SELECT @XML;
;WITH APPLICANT_ENTRY AS
(
SELECT
AP.guid
,ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS APPL_RID
,APPL.ICANT.query('(.)') AS ANCHOR
FROM @APPLICANT AP
OUTER APPLY AP.LogXML.nodes('applicant') AS APPL(ICANT)
)
,APPL_SUMMARY AS
(
SELECT
APEN.APPL_RID
,SUMMARY.DATA.value('./dataName[1]','NVARCHAR(250)') AS SD_dataName
,SUMMARY.DATA.value('./dataValue[1]','NVARCHAR(250)') AS SD_dataValue
FROM APPLICANT_ENTRY APEN
OUTER APPLY APEN.ANCHOR.nodes('applicant/summaryData') AS SUMMARY(DATA)
WHERE SUMMARY.DATA.value('./dataName[1]','NVARCHAR(250)') IN
(N'Age',N'LastName')
)
SELECT
*
FROM APPL_SUMMARY APSU
PIVOT (MAX(SD_dataValue)
FOR SD_dataName IN ([Age],[LastName])) AS PTBL
;

Results
APPL_RID  Age  LastName
--------- ---- ---------
1 45 Jones
2 48 Jones
Post #1564107
Posted Wednesday, April 23, 2014 6:30 AM


SSCertifiable

SSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiable

Group: General Forum Members
Last Login: Yesterday @ 5:24 PM
Points: 7,121, Visits: 15,024
I was going to go down the route of a CTE as well, but the OP mentioned 2005.

----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
Post #1564227
Posted Wednesday, April 23, 2014 7:08 AM
Ten Centuries

Ten CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen CenturiesTen Centuries

Group: General Forum Members
Last Login: Today @ 4:15 AM
Points: 1,322, Visits: 3,781
Matt Miller (#4) (4/23/2014)
I was going to go down the route of a CTE as well, but the OP mentioned 2005.


My bad, then the option is to nest it

DECLARE @XML XML = N'<applicant>
<summaryData>
<dataName>Age</dataName>
<dataValue>45</dataValue>
</summaryData>
<summaryData>
<dataName>LastName</dataName>
<dataValue>Jones</dataValue>
</summaryData>
<summaryData>
<dataName>NotThis</dataName>
<dataValue>Value</dataValue>
</summaryData>
</applicant>
<applicant>
<summaryData>
<dataName>Age</dataName>
<dataValue>48</dataValue>
</summaryData>
<summaryData>
<dataName>LastName</dataName>
<dataValue>Jones</dataValue>
</summaryData>
</applicant>';
DECLARE @APPLICANT TABLE
(
guid INT IDENTITY(1,1) NOT NULL
,LogXML XML NOT NULL
);
INSERT INTO @APPLICANT(LogXML)
SELECT @XML;

SELECT
*
FROM
(
SELECT
APEN.APPL_RID
,SUMMARY.DATA.value('./dataName[1]','NVARCHAR(250)') AS SD_dataName
,SUMMARY.DATA.value('./dataValue[1]','NVARCHAR(250)') AS SD_dataValue
FROM
(
SELECT
AP.guid
,ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS APPL_RID
,APPL.ICANT.query('(.)') AS ANCHOR
FROM @APPLICANT AP
OUTER APPLY AP.LogXML.nodes('applicant') AS APPL(ICANT)
) AS APEN
OUTER APPLY APEN.ANCHOR.nodes('applicant/summaryData') AS SUMMARY(DATA)
WHERE SUMMARY.DATA.value('./dataName[1]','NVARCHAR(250)') IN
(N'Age',N'LastName')
) AS APSU
PIVOT (MAX(SD_dataValue)
FOR SD_dataName IN ([Age],[LastName])) AS PTBL
;


Post #1564244
Posted Sunday, April 27, 2014 11:10 PM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Monday, June 23, 2014 3:59 PM
Points: 7, Visits: 45
Ok I've managed to get this performing reasonably well with a combination of changes. I've kept the 3 queries separate as I wasn't getting anywhere combining them in a single query.

I've kept the first query as is. I believe I can get further performance improvement out of this but for now it will do.

The first performance improvement was in relation to the data itself. It turns out not all Applicant have summaryData (about 25%) therefore I was able to delete these before the next query.

The next performance improvement was for Matt Millers suggestion of not using the wildcard (. //*). This improved performance quite a bit. Additionally I am now loading the second query into a temp table rather than a table variable. This does not improve performance of the load but as great benefit in the final PIVOT query.

I could not find a useful Index to apply to the temp table. In fact it seemed to hinder performance. I noticed the without a index the Pivot query was choosing a parallel plan, but with one it was not. This is the crux of the final performanance improvement. It turns out that if you run the query with a table variable you do not get the parallel plan, but you do with temp table, taking the query for about 50mins to 7mins.

Thanks for the help,
Paul
Post #1565457
« Prev Topic | Next Topic »

Add to briefcase

Permissions Expand / Collapse