This is my attempt to explain something that I have wanted to figure out for a long time. I have read many articles on this very subject but they always seem to fall short for various reasons - generally by including way more detail or complexity than is necessary to answer the question or get someone going. Due to the work of others who have written similar articles on this subject in the past (particularly MVP Michael Coles), I have eventually figured this out. This is my attempt to present this material in an end-to-end yet bare-bones and hopefully easily-understandable format for any .NET or SQL developer who needs to do this.
Note: I've written a follow-up to this article based on the excellent feedback: Another Way To Insert Many Rows Very Fast From Your .NET Application
Recently, I was asked to troubleshoot the startup performance of a VB.NET application with a SQL 2008 back-end. The startup routine was taking up to a minute or so to complete in locations with a fast DB connection, and up to 15 minutes in locations with a slow DB connection.
By stepping through the application and inserting Debug.Print() statements liberally, I narrowed-down the "slow" part of the routine to a loop that was inserting product IDs into a table. Basically, the way the app worked was that when it was launched, it would figure out which products were relevant to the active user and then it would insert those into a table that was used with INNER JOINs in all the rest of the screens and reports.
There was a table in the SQL database that looked a bit like this (pretend there is only ever one user for this example):
CREATE TABLE ProductsAccess(
ProductID INT NOT NULL
PRIMARY KEY (ProductID));
The VB.NET code that initialized the product list for the user in the DB was something like this (using typical ORM stuff):
'Delete all records in the table
'now populate all products that the user can choose in a loop
For i = 0 To objProductList.GetUpperBound(0)
Behind the scenes, this meant at least 1 call to the database per row to be inserted, plus the first call to clear out any existing rows. Since it was typical for our users to have several thousand products available to them, this could take a very long time, particularly with remote users on high-latency connections.
It wasn't the amount of data that was the problem - it was the count of the calls that was the issue. Instead of saying "Insert product 1", "Insert Product 2", "Insert Product 3", etc., I needed to say "Insert Products: 1, 2, 3," etc.
What I did:
Here is the simple version of how I changed the insert routine to cut the number of database calls to insert "n" rows from "n+1" to "always 2":
I created this stored procedure:
CREATE PROC ProductsAccessXMLInsert(@ValidXMLInput XML)
INSERT INTO ProductsAccess (ProductID)
FROM @ValidXMLInput.nodes('//Products/Product') Tab(Col)
Then I modified my .NET code to look something like this (note that I left the DeleteAll method untouched):
'Delete all records in the table
'Build a valid XML string (you should use System.Xml in real code - this is just a demo)
Dim strBuffer As New Text.StringBuilder
For i = 0 To objProductList.GetUpperBound(0)
strBuffer.Append(" <Product ID=""" & objProductList.Item(i).ProductID.ToString & """ />")
'now populate all products that the user can choose by passing a single XML document
How it works:
The XML generated in the loop would look something like this:
<Product ID="1" />
<Product ID="2" />
<Product ID="3" />
The new InsertXML() method called the new ProductsAccessXMLInsert stored procedure, passing all of the generated XML in only a single call to the DB.
Public Sub InsertXML(TheXML As String)
Using objConn As New SqlConnection(DBConnectionString())
Using objCmd As New SqlCommand("dbo.ProductsAccessXMLInsert")
objCmd.CommandType = CommandType.StoredProcedure
objCmd.Connection = objConn
objCmd.Parameters(0).Value = TheXML
Inside the ProductsAccessXMLInsert stored procedure, SQL Server does a SELECT from the input XML parameter's built-in "nodes" table-valued function using the passed XPath query ("//Products/Product"). The XML "columns" are then accessible using the Col alias (this comes from the Tab(Col) tag after the XPath query). In the SELECT clause, I referenced the ID attribute of the XML nodes, and instructed SQL Server to cast it as an Integer (these are the two parameters passed to the Col.value(,) function in the SELECT clause). XML attribute names must be prefixed with an @ symbol. (For node names you can leave the @ off.) Then I'm simply doing an insert to the ProductsAccess table from the results of my SELECT.
I did four sets of tests using insert sets of 10 rows, 50 rows, 100 rows, and 504 rows (the number of products in the AdventureWorks Production.Product table which was the test data source). I did three sets of tests for each count of rows with doing the one-by-one inserts first, and then three series of tests doing the XML inserts first. All runs were measured using the .NET Stopwatch class.
So how much better was this single call using XML versus the using the standard row-by-row INSERT call? Here are the results:
(lower milliseconds to complete is better)
The XML insert method for our scenario of needing to insert many rows into our SQL database in one shot averaged more than 97% faster than looped inserts when processing 504 rows of data, and was still more than 81% faster with a mere 10 rows of data to insert. This was averaged over 6 runs of 50 tries of each method and by my eye, the results were fairly consistent.
I hope you found this a useful introduction on one way to efficiently handle multi-insert scenarios with SQL Server using an XML parameter. I would love to hear any comments or questions you might have.
Notes and Disclaimers:
- This is merely one way to insert many rows very fast from your .NET application.
- The term "many" may mean different things for you or your application.
- To keep the performance testing fair, I first refactored the existing insert method to use a stored procedure that took the product ID as a parameter so it wasn't using wonky ORM-generated code.:
CREATE PROC DoSingleInsert(@ProductID INT)
INSERT INTO ProductsAccess (ProductID) VALUES (@ProductID);
Also the inserts were processed in a tight loop using a single SqlCommand object with a strongly-typed parameter and just updating the parameter value and calling ExecuteNonQuery(). This is about as efficient as ADO.NET typically gets, though I would love to hear if there is a way to drive a SqlCommand object harder.
Testing was done with the application and SQL Server running off two ports of the same gigabit switch to minimize the factor of network latency.
To keep this post as simple as possible, I am pretending that there is only ever one user for this application - my real table also has a UserID field, among others. Additionally, in my real solution I do a MERGE against my XML parameter which allowed me to also eliminate the DELETE call - perhaps this can be a future blog post.
Dear fellow developer: please don't do what I did above and use a StringBuilder to generate XML. The moment you have a funny character such as <, >, or &, you will be in a world of pain. You should use the features of the System.XML namespace to generate XML, however that was more complicated than I wanted to get into for this blog post.
The slow part of the original code in this application was basically the last four lines of code in a massive subroutine. I had seen dozens of things earlier that I thought might be the culprit, but actually were OK. In doing this work, I was reminded of the lesson to not do premature optimization of code; I could have spent an hour or two making all the "kind-of fast" code "super fast" and the end result would have been completely unnoticeable due to the four lines of boat-anchor DB access code which were the main problem.