Issues and performance of importing XML into SQL Server 2008

Sideout1972
I know this is long, but I think my question requires a lot of background. The simple question is: what is the best way to bring complex XML documents into SQL Server? I am considering completely abandoning SSIS and the Microsoft stack because of the performance and development nightmares I've encountered.

I'm desperate for help in this area. Here is the story:

My team has spent the last 5 months taking XML data generated by one of our source systems and attempting to bring it into SQL Server 2008 R2. At each step of the way we've been blocked by SSIS and performance problems.

There are 5 different types of XML documents, each with its own schema. The schemas are complex: multiple levels of parent/child nodes, sometimes with hundreds of attributes on each node.

The thought was to read each XML doc using an XML Source, take each output, and store it as a separate table (landing tables on the way to our data warehouse). Going this route, we hit issues where the XML was too complex. It became impossible to use SSIS: as soon as you had more than 5 outputs from the XML Source, SSIS would crap out during development. It was simply too complex for SSIS to handle. We wrote an XSLT transform to simplify the XML a bit. Still no dice.

We broke the XML down further (two of the 5 types could be broken into 4 subtypes each). Even then, development was difficult. You couldn't actually run the packages through BIDS and had to run them using DTExec. Memory problems galore. The code was in a For Each Loop container, which cycled through each XML doc, copied it to a specific location, and ran the data flow containing the XML Source. SSIS appears to do a horrific job of memory cleanup after each XML doc is processed, and it constantly ran out of memory. We took the data flow task itself and put it into its own package, and that at least stopped the memory problems: each time a doc finished, the data flow package finished, and SSIS did a better job of cleaning up.

For our initial load, there will be around 400k XML documents, though in our development/test systems we are controlling the number of docs we import each time. At an average of 5 seconds per doc (docs take between 3 and 10 seconds each, depending on the schema), we are talking 23 days of data import. Around the clock. Against our production system.

We are now stuck. Back to the original question: what is the best way to bring complex XML documents into SQL Server? Other options I've been considering (with no guarantee that any of them will cycle through each XML document any faster than SSIS did):

1. Use the XML Bulk Load component. However, the schemas we are using don't have annotations, which the bulk loader requires. The schemas were created by the same source system that creates the XML docs.
2. Code it by hand. No guarantees on performance, and I need a different team to make it happen.
3. Switch to Informatica, DataStage, or some other ETL tool. No performance guarantees there either.
4. Write code so that the source system spits out relational tables directly instead of XML. Making this happen is a project-timeline nightmare, though it's still a possibility if XML simply won't perform.
5. Insert each XML doc into an XML data type column, then write SQL Server code (OPENROWSET? BULK SINGLE_BLOB? .nodes()?) to shred the data into relational tables (rough sketch after this list). Again, performance?
6. Concatenate all XML docs of the same type into a single file and then process that single file.
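To make option 5 concrete, here's a minimal sketch of what I have in mind (the path, table, and node names are all invented; our real schemas are far more complex):

DECLARE @doc xml

-- Load one XML file into an xml variable (SINGLE_BLOB returns a single
-- varbinary column named BulkColumn)
SELECT @doc = CONVERT(xml, BulkColumn)
FROM OPENROWSET(BULK 'C:\xml_landing\doc0001.xml', SINGLE_BLOB) AS src

-- Shred one node level into a relational landing table
INSERT INTO dbo.Landing_Policy (PolicyNumber, EffectiveDate)
SELECT
     n.value('@PolicyNumber', 'nvarchar(50)')
    ,n.value('@EffectiveDate', 'datetime')
FROM @doc.nodes('/Policies/Policy') AS t(n)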

I'm open to other options as well, but what I really need is a method/technology that is guaranteed to import the XML data quickly. Any and all ideas are welcome.

Thanks in advance for your advice and for reading through my lengthy post.

Jason
Vice President, Information Systems
PURE Insurance
TedT
Hi Jason,

How did the package perform when run from the command line?

There is overhead associated with running packages from within BIDS, which will impact performance. Before you throw in the towel on SSIS, I'd recommend adding the package to an Agent job and then running that job.

Alternatively, run from the command line using something like dtexec.exe.
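For example, something along these lines would wrap the package in a job (a sketch; the job name and package path are placeholders):

USE msdb;

-- Create a one-step job that runs the package via the SSIS subsystem
EXEC dbo.sp_add_job @job_name = N'Load XML docs';

EXEC dbo.sp_add_jobstep
     @job_name  = N'Load XML docs'
    ,@step_name = N'Run package'
    ,@subsystem = N'SSIS'
    ,@command   = N'/FILE "C:\packages\LoadXml.dtsx"';

-- Register the job on the local server, then kick it off
EXEC dbo.sp_add_jobserver @job_name = N'Load XML docs';
EXEC dbo.sp_start_job @job_name = N'Load XML docs';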

I had a process which imported some pretty complex XML files (each file could be up to 100 MB), performed some manipulation using XSLTs, loaded the results into SSIS, and then parsed them out into relational tables; it struggled to run even using dtexec.

Don't forget - it's called Business Intelligence DEVELOPMENT Studio, not Business Intelligence OPERATIONAL Studio :-)

An alternative is to load the XML files into an XML variable within SSIS and then parse them out using C#.

_____________________________________________________________________
Disclaimer - The opinions expressed by the mouth are not necessarily those of the brain
Sideout1972
TedT,

It ran horribly from the command line as well. After putting the code that processes an XML doc into its own package and using a For Each Loop container to call that package for each XML file, I was able to eliminate some of the memory issues. While I could get it to run to completion, the complex XML docs took upwards of 20 seconds each to process - not good when you've got hundreds of thousands of them. The command line shaved about 10% off the processing time. Good, but not good enough.

As an update - I eliminated the use of the XML Source altogether. We are parsing the XML in a stored procedure instead, using something like this:

DECLARE @sqlstmt varchar(max)           -- dynamic SQL; OPENROWSET(BULK ...) needs a literal path
DECLARE @xml_table TABLE (xml_data xml) -- @l_xml_path is a parameter of the procedure

SET @sqlstmt = 'SELECT * FROM OPENROWSET ( BULK ''' + @l_xml_path + ''', SINGLE_CLOB ) AS XMLDATA'
INSERT INTO @xml_table EXEC (@sqlstmt)

Then, once the XML doc is in the table:

INSERT INTO MyTable
(
     [Field1]
    ,[Field2]
    ,[Field3]
)
SELECT
     x.value('(Field1/text())[1]', 'nvarchar(255)') AS Field1
    ,x.value('(Field2/text())[1]', 'nvarchar(255)') AS Field2
    ,x.value('(Field3/text())[1]', 'nvarchar(255)') AS Field3
FROM @xml_table tbl
CROSS APPLY tbl.xml_data.nodes('/root/Subnode/Subsubnode') e(x) -- one row per Subsubnode

This technique allows us to process each XML doc, almost regardless of size, in under 2 seconds.
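A caveat on SINGLE_CLOB, in case it bites anyone else: it reads the file as varchar, and converting that to xml can fail with "unable to switch the encoding" errors if a doc declares a Unicode encoding. Reading with SINGLE_BLOB instead lets SQL Server honor the declared encoding:

-- Same load, but via SINGLE_BLOB so the file's declared encoding is honored
SET @sqlstmt = 'SELECT CONVERT(xml, BulkColumn) FROM OPENROWSET ( BULK ''' + @l_xml_path + ''', SINGLE_BLOB ) AS XMLDATA'
INSERT INTO @xml_table EXEC (@sqlstmt)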

We are still in development, but signs are positive. I am now convinced, after months of agony, that SSIS can't process XML efficiently.
Kamran.shirazi
Did you have any luck improving the performance of XML processing in SSIS?
Jeff Moden
Sideout1972 (1/31/2012)
As an update - I eliminated the use of the XML Source altogether. We are parsing the XML in a stored procedure instead [OPENROWSET (BULK ..., SINGLE_CLOB) into an xml table variable, then CROSS APPLY .nodes() to shred it into relational tables]. This technique allows us to process each XML doc, almost regardless of size, in under 2 seconds. [snip] I am now convinced, after months of agony, that SSIS can't process XML efficiently.


Not that you need confirmation of your success, but I avoid SSIS entirely in favor of approaches like the one you've used above. Well done!

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
Although they tell us that they want it real bad, our primary goal is to ensure that we don't actually give it to them that way.
Although change is inevitable, change for the better is not.
Just because you can do something in PowerShell, doesn't mean you should. ;-)

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
Sideout1972
XML structures that go more than 2 levels deep perform horribly with SSIS during the development cycle. Drop-downs where you are picking a node can take minutes, or even hours, to load if the XML has a complex structure.

The only reasonable performance comes from using OPENROWSET, XPath, etc.

Using SSIS best practices isn't always the best-performing way of accomplishing a task.
Jeff Moden
Sideout1972 (9/28/2015)
[snip] The only reasonable performance comes from using OPENROWSET, XPath, etc. Using SSIS best practices isn't always the best-performing way of accomplishing a task.


Amen to that. Same goes for XML in general, IMHO. Between the tag bloat and the other problems associated with it, it's just easier to go with a nice, properly normalized, properly delimited flat file.

--Jeff Moden

Sideout1972
If you can take the XML and actually turn it into a well-formatted, normalized text file, you should have avoided XML in the first place.

With that said, when you need to transmit a complex data file with multiple node types nested multiple levels deep, XML is a reasonable format. Attempting to normalize it into flat files can require hundreds of flat files to support the structure.

Just stay away from the XML Source in SSIS for parsing the XML file.
Jeff Moden
Sideout1972 (10/7/2015)
With that said, when you need to transmit a complex data file with multiple node types nested multiple levels deep, XML is a reasonable format. Attempting to normalize it into flat files can require hundreds of flat files to support the structure.


Not really. You just need one file per level, and you can usually include any attributes either in the file for that level or in the file that represents the level above it.
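To illustrate with invented names, a doc that's three levels deep lands in three staging tables, one per level:

-- One flat file / staging table per XML level; each child row carries its parent's key,
-- and that level's attributes become ordinary columns
CREATE TABLE #Policy        (PolicyID int,   PolicyNumber varchar(50));
CREATE TABLE #Coverage      (CoverageID int, PolicyID int,   CoverageCode varchar(10));
CREATE TABLE #CoverageLimit (LimitID int,    CoverageID int, LimitAmount money);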

--Jeff Moden

Alan.B
Sideout1972 (1/31/2012)
As an update - I eliminated the use of the XML Source altogether. We are parsing the XML in a stored procedure instead [OPENROWSET (BULK ..., SINGLE_CLOB) into an xml table variable, then CROSS APPLY .nodes() to shred it into relational tables]. This technique allows us to process each XML doc, almost regardless of size, in under 2 seconds. [snip] I am now convinced, after months of agony, that SSIS can't process XML efficiently.


What you are doing here is exactly what I would have recommended. SSIS is not the fastest tool for this task. I would add that you should look at the query plan when running that query and see if you are getting a parallel plan. XML queries generally perform much faster (hundreds of times faster in some cases) with a parallel plan. If you are not getting a parallel plan, try Adam Machanic's make_parallel function and test your query with that.

I periodically have to load RDL (SSRS XML) data into a data mart. The XML and the logic to get what I need are complex. The queries will run for a couple of hours with a serial plan, five minutes with a parallel plan. That's my $0.02.
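For what it's worth, a sketch against the shredding query from upthread. I'm going from memory on make_parallel's usage, so treat the dbo.make_parallel() call as an assumption and check Adam's article for the real details:

-- Assumes Adam Machanic's TVF is installed as dbo.make_parallel (usage from memory):
-- seed the FROM clause with the function, then CROSS APPLY the original query
SELECT p.Field1, p.Field2, p.Field3
FROM dbo.make_parallel() AS mp
CROSS APPLY
(
    SELECT
         x.value('(Field1/text())[1]', 'nvarchar(255)') AS Field1
        ,x.value('(Field2/text())[1]', 'nvarchar(255)') AS Field2
        ,x.value('(Field3/text())[1]', 'nvarchar(255)') AS Field3
    FROM @xml_table tbl
    CROSS APPLY tbl.xml_data.nodes('/root/Subnode/Subsubnode') e(x)
) AS p;

To confirm which plan you're getting, run the query with SET STATISTICS XML ON and look for Parallelism operators in the actual plan.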

-- Alan Burstein



Best practices for getting help on SQLServerCentral
Need to split a string? Try DelimitedSplit8K or DelimitedSplit8K_LEAD (SQL 2012+)
Need a pattern-based splitter? Try PatternSplitCM
Need to remove or replace those unwanted characters? Try PatExclude8K and PatReplace8K.

"I can't stress enough the importance of switching from a 'sequential files' mindset to 'set-based' thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code. " -- Itzek Ben-Gan 2001