Issues and performance of importing XML into SQL Server 2008
Posted Friday, January 20, 2012 9:52 AM
SSC Rookie

Group: General Forum Members
Last Login: Tuesday, September 30, 2014 4:48 AM
Points: 29, Visits: 108
I know this is long, but I think my question requires a lot of background information. The simple question is: what is the best way to bring complex XML documents into SQL Server? I am considering completely abandoning SSIS and the Microsoft stack because of the performance and development nightmares I've encountered.

I'm desperate for help in this area. Here is the story:

My team has spent the last 5 months taking XML data generated by one of our source systems and attempting to bring the data into SQL Server 2008 R2. Each step of the way we've been blocked by problems with SSIS and performance.

There are 5 different types of XML documents, each with its own schema. The schemas for these documents are complex - multiple parent/child nodes, with sometimes hundreds of attributes on each node.

The thought was to read each XML doc using an XML Source and store each output as a separate table (landing tables feeding our data warehouse). Going this route, we hit issues with the XML's complexity: as soon as you had more than 5 outputs from the XML Source, SSIS would crap out during development. It was simply too complex for SSIS to handle. We wrote an XSLT transform to simplify the XML a bit. Still no dice.

We broke down the XML further (two of the 5 types could be broken into 4 subtypes each). Even then, development was difficult. You couldn't actually run the packages through BIDS, and needed to run them using DTExec. Memory problems galore. The code was in a For Each Loop Container, which cycled through each XML doc, copied it to a specific location, and ran the data flow containing the XML Source. It looks like SSIS does a horrific job of memory cleanup after each XML doc is processed, and it constantly ran out of memory. We took the data flow task itself and put it into its own package, and that at least stopped the memory problems: each time one doc was finished, the data flow package finished and SSIS did a better job of memory cleanup.

For our initial load, there will be around 400k XML documents to load, though in our development/test systems we are controlling the number of docs we import each time. At an average of 5 seconds per doc (docs take between 3 and 10 seconds each, depending upon the schema), that is 400,000 x 5 = 2,000,000 seconds - roughly 23 days of data import. Around the clock. Against our production system.

We are now stuck. Back to the original question: what is the best way to bring complex XML documents into SQL Server? Other options I've been considering (with no guarantee that any of these will perform better than SSIS did when cycling through each XML document):

1. Use the XML bulk loader. However, the schemas we are using don't have annotations, which the bulk loader requires. The schemas were created by the source system that also creates the XML docs.
2. Hand-code it. No guarantees on performance, and I'd need a different team to make this happen.
3. Switch to Informatica, DataStage, or some other ETL tool. No guarantees there on performance either.
4. Write code so that the source system spits out relational tables directly instead of XML. Making this happen is a project-timeline nightmare, though still a possibility if XML simply won't perform well.
5. Take each XML doc and insert it into an XML data type field, and then write SQL Server code (OPENROWSET? BULK SINGLE_BLOB? .nodes()?) to take the data out of the XML and put it into relational tables. Again, performance?
6. Concatenate all XML docs of the same type into a single file and then process that single file (a rough sketch of what that might look like follows this list).
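
As a rough illustration of option 6 (the file path and element names are just placeholders, and it assumes the individual documents carry no XML declarations, otherwise the concatenated stream won't parse as one XML value):

-- Load a file holding many concatenated docs of one type, wrap them in a dummy
-- root element, and shred one row back out per original document.
DECLARE @raw varchar(max), @all_docs xml;

SELECT @raw = BulkColumn
FROM OPENROWSET(BULK 'C:\xml\policies_all.xml', SINGLE_CLOB) AS f;

SET @all_docs = CAST('<docs>' + @raw + '</docs>' AS xml);

SELECT d.doc.query('.') AS single_doc          -- one row per original document
FROM @all_docs.nodes('/docs/root') AS d(doc);  -- 'root' stands in for the real document element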

I'm up for other options, but what I really need is a guaranteed method/technology that will import the XML data quickly. I'm up for any and all ideas.

Thanks in advance for your advice and for reading through my lengthy post.

Jason
Vice President, Information Systems
PURE Insurance
Posted Tuesday, January 24, 2012 4:09 AM
SSC-Enthusiastic

Group: General Forum Members
Last Login: Thursday, September 11, 2014 7:49 AM
Points: 155, Visits: 12,021
Hi Jason,

How did the package perform when run from the command line?

There are overheads associated with running packages from BIDS and dtexec, which will impact performance. Before you throw in the towel on SSIS, I'd recommend adding the package to an Agent job and then running that job.

Alternatively, run from the command line using something like dtexec.exe.
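
If it helps, wrapping the package in an Agent job is only a few msdb calls; the job name, package path, and dtexec switches below are just placeholders:

-- Wrap the package in a SQL Agent job (names and paths are placeholders).
USE msdb;

EXEC dbo.sp_add_job       @job_name = N'Load XML docs';
EXEC dbo.sp_add_jobstep   @job_name  = N'Load XML docs',
                          @step_name = N'Run package',
                          @subsystem = N'SSIS',
                          @command   = N'/FILE "C:\packages\LoadXml.dtsx" /REPORTING E';
EXEC dbo.sp_add_jobserver @job_name = N'Load XML docs';

-- Kick it off on demand (or schedule it with sp_add_jobschedule)
EXEC dbo.sp_start_job     @job_name = N'Load XML docs';

-- Rough command-line equivalent:
-- dtexec /FILE "C:\packages\LoadXml.dtsx"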

I had a process which imported some pretty complex XML files (each file could be up to 100 MB), performed some manipulation using XSLTs, loaded them into SSIS and then parsed them out into relational tables; it struggled to run using dtexec.

Don't forget - it's called Business Intelligence DEVELOPMENT studio, not Business Intelligence OPERATIONAL Studio :)

An alternative is to load the XML files into an XML variable within SSIS, and then parse them out using C#.


_____________________________________________________________________
Disclaimer - The opinions expressed by the mouth are not necessarily those of the brain
Posted Tuesday, January 31, 2012 4:16 PM
SSC Rookie

Group: General Forum Members
Last Login: Tuesday, September 30, 2014 4:48 AM
Points: 29, Visits: 108
TedT,

It ran horribly from the command line as well. After putting the code that processes an XML doc into its own package and using a For Each container to call that package for each XML file, I was able to eliminate some of the memory issues. While I could get it to completion, the complex XML docs took upwards of 20 seconds each to process - not good when you've got hundreds of thousands to process. The command line shaved about 10% off the processing time. Good, but not good enough.

As an update - I eliminated the use of the XML Source altogether. We are parsing the XML in a stored procedure instead, using something like this:

DECLARE @xml_table TABLE (xml_data xml);
DECLARE @l_xml_path nvarchar(260), @sqlstmt nvarchar(max);  -- @l_xml_path holds the full path to the current doc

SET @sqlstmt = 'SELECT * FROM OPENROWSET ( BULK ''' + @l_xml_path + ''', SINGLE_CLOB ) AS XMLDATA';
INSERT INTO @xml_table EXEC (@sqlstmt);
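
For what it's worth, the BULK SINGLE_BLOB variant from option 5 would look roughly like this (same variables as above); reading the bytes and casting them to xml lets the parser honor each file's own encoding declaration, whereas SINGLE_CLOB reads the file with the server's default code page:

SET @sqlstmt = 'SELECT CAST(BulkColumn AS xml) FROM OPENROWSET ( BULK ''' + @l_xml_path + ''', SINGLE_BLOB ) AS XMLDATA';
INSERT INTO @xml_table EXEC (@sqlstmt);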

Then, once the XML doc is in the table:

INSERT INTO MyTable
(
 [Field1]
,[Field2]
,[Field3]
)
SELECT
 -- one row per matching node, pulling each field out as nvarchar
 x.value('(Field1/text())[1]', 'nvarchar(255)') Field1
,x.value('(Field2/text())[1]', 'nvarchar(255)') Field2
,x.value('(Field3/text())[1]', 'nvarchar(255)') Field3
FROM
 @xml_table tbl
-- shred one row per /root/Subnode/Subsubnode element
CROSS APPLY tbl.xml_data.nodes('/root/Subnode/Subsubnode') e(x)

This technique allows us to process each XML doc, almost regardless of size, in under 2 seconds.
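
Pulled together, the stored procedure is shaped roughly like this (the procedure, table, and node names are placeholders for the real ones):

CREATE PROCEDURE dbo.usp_load_xml_doc          -- placeholder name
    @l_xml_path nvarchar(260)                  -- full path to one XML doc
AS
BEGIN
    SET NOCOUNT ON;

    DECLARE @xml_table TABLE (xml_data xml);
    DECLARE @sqlstmt nvarchar(max);

    -- Step 1: bulk-read the file into the xml column
    SET @sqlstmt = 'SELECT * FROM OPENROWSET ( BULK ''' + @l_xml_path + ''', SINGLE_CLOB ) AS XMLDATA';
    INSERT INTO @xml_table EXEC (@sqlstmt);

    -- Step 2: shred the nodes into the relational landing table
    INSERT INTO MyTable ([Field1], [Field2], [Field3])
    SELECT
         x.value('(Field1/text())[1]', 'nvarchar(255)')
        ,x.value('(Field2/text())[1]', 'nvarchar(255)')
        ,x.value('(Field3/text())[1]', 'nvarchar(255)')
    FROM @xml_table tbl
    CROSS APPLY tbl.xml_data.nodes('/root/Subnode/Subsubnode') e(x);
END;

-- Called once per document, e.g. from the For Each loop:
-- EXEC dbo.usp_load_xml_doc @l_xml_path = N'C:\xml\doc0001.xml';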

We are still in development, but signs are positive. I am now convinced, after months of agony, that SSIS can't process XML efficiently.