Issues and performance of importing XML into SQL Server 2008
Posted Friday, January 20, 2012 9:52 AM
SSC Rookie

Group: General Forum Members
Last Login: Tuesday, September 30, 2014 4:48 AM
Points: 29, Visits: 108
I know this is long, but I think my question requires a lot of background information. The simple question is: what is the best way to bring complex XML documents into SQL Server? I am considering completely abandoning SSIS and the Microsoft stack because of the performance and development nightmares I've encountered.

I'm desperate for help in this area. Here is the story:

My team has spent the last 5 months taking XML data generated by one of our source systems and attempting to bring the data into SQL Server 2008 R2. Each step of the way we've been blocked by problems with SSIS and performance.

There are 5 different types of XML documents, each with its own schema. The schemas for these documents are complex - multiple parent/child nodes, with sometimes hundreds of attributes on each node.

The thought was to read each XML doc using an XML Source and store each output as a separate table (landing tables feeding our data warehouse). Going this route, we hit issues with the XML's complexity: as soon as you had more than 5 outputs from the XML Source, SSIS would crap out during development. It was simply too complex for SSIS to handle. We wrote an XSLT transform to simplify the XML a bit. Still no dice.

We broke down the XML further (two of the 5 types could be broken into 4 subtypes each). Even then, development was difficult. You couldn't actually run the packages through BIDS, and needed to run them using DTExec. Memory problems galore. The code was in a For Each Loop Container, which cycled through each XML doc, copied it to a specific location, and ran the data flow containing the XML Source. It looks like SSIS does a horrific job of memory cleanup after each XML doc is processed, and it constantly ran out of memory. We took the data flow task itself and put it into its own package, and that at least stopped the memory problems: each time one doc was finished, the data flow package finished and SSIS did a better job of memory cleanup.

For our initial load, there will be around 400k XML documents to load, though in our development/test systems we are controlling the number of docs we import each time. At an average of 5 seconds per doc (docs take between 3 and 10 seconds each, depending upon the schema), that is 400,000 x 5 = 2,000,000 seconds - roughly 23 days of data import. Around the clock. Against our production system.

We are now stuck. Back to the original question: what is the best way to bring complex XML documents into SQL Server? Other options I've been considering (with no guarantee that any of these will perform better than SSIS did when cycling through each XML document):

1. Use the XML bulk loader. However, the schemas we are using don't have annotations, which the bulk loader requires. The schemas were created by the source system that also creates the XML docs.
2. Hand-code it. No guarantees on performance, and I'd need a different team to make this happen.
3. Switch to Informatica, DataStage, or some other ETL tool. No guarantees there on performance either.
4. Write code so that the source system spits out relational tables directly instead of XML. Making this happen is a project-timeline nightmare, though still a possibility if XML simply won't perform well.
5. Take each XML doc and insert it into an XML data type field, and then write SQL Server code (OPENROWSET? BULK SINGLE_BLOB? .nodes()?) to take the data out of the XML and put it into relational tables. Again, performance?
6. Concatenate all XML docs of the same type into a single file and then process that single file (a rough sketch of what that might look like follows this list).
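
As a rough illustration of option 6 (the file path and element names are just placeholders, and it assumes the individual documents carry no XML declarations, otherwise the concatenated stream won't parse as one XML value):

-- Load a file holding many concatenated docs of one type, wrap them in a dummy
-- root element, and shred one row back out per original document.
DECLARE @raw varchar(max), @all_docs xml;

SELECT @raw = BulkColumn
FROM OPENROWSET(BULK 'C:\xml\policies_all.xml', SINGLE_CLOB) AS f;

SET @all_docs = CAST('<docs>' + @raw + '</docs>' AS xml);

SELECT d.doc.query('.') AS single_doc          -- one row per original document
FROM @all_docs.nodes('/docs/root') AS d(doc);  -- 'root' stands in for the real document element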

I'm up for other options, but what I really need is a guaranteed method/technology that will import the XML data quickly. I'm up for any and all ideas.

Thanks in advance for your advice and for reading through my lengthy post.

Jason
Vice President, Information Systems
PURE Insurance
Posted Tuesday, January 24, 2012 4:09 AM
SSC-Enthusiastic

Group: General Forum Members
Last Login: Thursday, September 11, 2014 7:49 AM
Points: 155, Visits: 12,021
Hi Jason,

How did the package perform when run from the command line?

There are overheads associated with running packages from BIDS and dtexec, which will impact performance. Before you throw in the towel on SSIS, I'd recommend adding the package to an Agent job and then running that job.

Alternatively, run from the command line using something like dtexec.exe.
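
If it helps, wrapping the package in an Agent job is only a few msdb calls; the job name, package path, and dtexec switches below are just placeholders:

-- Wrap the package in a SQL Agent job (names and paths are placeholders).
USE msdb;

EXEC dbo.sp_add_job       @job_name = N'Load XML docs';
EXEC dbo.sp_add_jobstep   @job_name  = N'Load XML docs',
                          @step_name = N'Run package',
                          @subsystem = N'SSIS',
                          @command   = N'/FILE "C:\packages\LoadXml.dtsx" /REPORTING E';
EXEC dbo.sp_add_jobserver @job_name = N'Load XML docs';

-- Kick it off on demand (or schedule it with sp_add_jobschedule)
EXEC dbo.sp_start_job     @job_name = N'Load XML docs';

-- Rough command-line equivalent:
-- dtexec /FILE "C:\packages\LoadXml.dtsx"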

I had a process which imported some pretty complex XML files (each file could be up to 100 MB), performed some manipulation using XSLTs, loaded them into SSIS and then parsed them out into relational tables; it struggled to run using dtexec.

Don't forget - it's called Business Intelligence DEVELOPMENT studio, not Business Intelligence OPERATIONAL Studio :)

An alternative is to load the XML files into an XML variable within SSIS, and then parse them out using C#.


_____________________________________________________________________
Disclaimer - The opinions expressed by the mouth are not necessarily those of the brain
Posted Tuesday, January 31, 2012 4:16 PM
SSC Rookie

Group: General Forum Members
Last Login: Tuesday, September 30, 2014 4:48 AM
Points: 29, Visits: 108
TedT,

It ran horribly from the command line as well. After putting the code that processes an XML doc into its own package and using a For Each container to call that package for each XML file, I was able to eliminate some of the memory issues. While I could get it to completion, the complex XML docs took upwards of 20 seconds each to process - not good when you've got hundreds of thousands to process. The command line shaved about 10% off the processing time. Good, but not good enough.

As an update - I eliminated the use of the XML Source altogether. We are parsing the XML in a stored procedure instead, using something like this:

DECLARE @xml_table TABLE (xml_data xml);
DECLARE @l_xml_path nvarchar(260), @sqlstmt nvarchar(max);  -- @l_xml_path holds the full path to the current doc

SET @sqlstmt = 'SELECT * FROM OPENROWSET ( BULK ''' + @l_xml_path + ''', SINGLE_CLOB ) AS XMLDATA';
INSERT INTO @xml_table EXEC (@sqlstmt);
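
For what it's worth, the BULK SINGLE_BLOB variant from option 5 would look roughly like this (same variables as above); reading the bytes and casting them to xml lets the parser honor each file's own encoding declaration, whereas SINGLE_CLOB reads the file with the server's default code page:

SET @sqlstmt = 'SELECT CAST(BulkColumn AS xml) FROM OPENROWSET ( BULK ''' + @l_xml_path + ''', SINGLE_BLOB ) AS XMLDATA';
INSERT INTO @xml_table EXEC (@sqlstmt);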

Then, once the XML doc is in the table:

INSERT INTO MyTable
(
 [Field1]
,[Field2]
,[Field3]
)
SELECT
 -- one row per matching node, pulling each field out as nvarchar
 x.value('(Field1/text())[1]', 'nvarchar(255)') Field1
,x.value('(Field2/text())[1]', 'nvarchar(255)') Field2
,x.value('(Field3/text())[1]', 'nvarchar(255)') Field3
FROM
 @xml_table tbl
-- shred one row per /root/Subnode/Subsubnode element
CROSS APPLY tbl.xml_data.nodes('/root/Subnode/Subsubnode') e(x)

This technique allows us to process each XML doc, almost regardless of size, in under 2 seconds.
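
Pulled together, the stored procedure is shaped roughly like this (the procedure, table, and node names are placeholders for the real ones):

CREATE PROCEDURE dbo.usp_load_xml_doc          -- placeholder name
    @l_xml_path nvarchar(260)                  -- full path to one XML doc
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @xml_table TABLE (xml_data xml);
    DECLARE @sqlstmt nvarchar(max);

    -- Step 1: bulk-read the file into the xml column
    SET @sqlstmt = 'SELECT * FROM OPENROWSET ( BULK ''' + @l_xml_path + ''', SINGLE_CLOB ) AS XMLDATA';
    INSERT INTO @xml_table EXEC (@sqlstmt);

    -- Step 2: shred the nodes into the relational landing table
    INSERT INTO MyTable ([Field1], [Field2], [Field3])
    SELECT
         x.value('(Field1/text())[1]', 'nvarchar(255)')
        ,x.value('(Field2/text())[1]', 'nvarchar(255)')
        ,x.value('(Field3/text())[1]', 'nvarchar(255)')
    FROM @xml_table tbl
    CROSS APPLY tbl.xml_data.nodes('/root/Subnode/Subsubnode') e(x);
END;

-- Called once per document, e.g. from the For Each loop:
-- EXEC dbo.usp_load_xml_doc @l_xml_path = N'C:\xml\doc0001.xml';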

We are still in development, but signs are positive. I am now convinced, after months of agony, that SSIS can't process XML efficiently.