Streaming Data Into SQL Server 2008 From an Application

  • I enjoyed reading your article. Thanks for sharing it and providing the material.

    I downloaded your code and tried adding an option to do an import via SQLXML parameter into a similar stored procedure.

    The untyped xml approach is pretty simple:

    CREATE PROCEDURE dbo.ImportDataXml (
        @inputXml xml
    )
    AS
    SET NOCOUNT ON;

    INSERT INTO dbo.ImportTest (SillyIDField, MeaninglessText)
        SELECT  r.n.value('@id', 'int'),
                r.n.value('@value', 'varchar(50)')
        FROM    @inputXml.nodes('/Root/Node') AS r(n);
    GO
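
    The app side is then just a single xml-typed parameter. This is not the exact code from the attached project, but a simplified sketch of the calling side (assuming a tab-delimited input file; the proc and parameter names match the proc above) looks roughly like this:

    using System.Data;
    using System.Data.SqlClient;
    using System.IO;
    using System.Text;
    using System.Xml;

    public static class XmlImportSketch
    {
        public static void Import(string connectionString, string filePath)
        {
            // Build the entire <Root><Node id="..." value="..."/></Root> document up front;
            // unlike a streamed TVP, the whole payload sits in app memory before it is sent.
            StringBuilder xml = new StringBuilder();
            XmlWriterSettings settings = new XmlWriterSettings { OmitXmlDeclaration = true };

            using (XmlWriter writer = XmlWriter.Create(xml, settings))
            {
                writer.WriteStartElement("Root");

                using (StreamReader reader = new StreamReader(filePath))
                {
                    string line;
                    while ((line = reader.ReadLine()) != null)
                    {
                        string[] fields = line.Split('\t');   // assumes a tab-delimited file
                        writer.WriteStartElement("Node");
                        writer.WriteAttributeString("id", fields[0]);
                        writer.WriteAttributeString("value", fields[1]);
                        writer.WriteEndElement();
                    }
                }

                writer.WriteEndElement();
            }

            using (SqlConnection connection = new SqlConnection(connectionString))
            using (SqlCommand importProc = new SqlCommand("dbo.ImportDataXml", connection))
            {
                importProc.CommandType = CommandType.StoredProcedure;
                importProc.Parameters.Add("@inputXml", SqlDbType.Xml).Value = xml.ToString();

                connection.Open();
                importProc.ExecuteNonQuery();
            }
        }
    }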

    My testing showed it to be significantly slower using a small 30K-line file. Here are my results (roughly 250 ms vs. 1,250 ms):

    [09:28 UTC-7] PS>./streamteste Full Test.txt

    5/29/2009 9:28:03 AM -- Truncating the import table, clearing the caches, etc...

    5/29/2009 9:28:03 AM -- Done setting up for the test

    5/29/2009 9:28:03 AM -- Connecting to the Database...

    5/29/2009 9:28:03 AM -- Calling proc to read and import the data...

    5/29/2009 9:28:03 AM -- Proc is done reading and importing the data via Streamed TVP (0.241 Seconds)

    5/29/2009 9:28:03 AM -- Done!

    [09:28 UTC-7] PS>./streamteste xml Test.txt

    5/29/2009 9:28:08 AM -- Truncating the import table, clearing the caches, etc...

    5/29/2009 9:28:08 AM -- Done setting up for the test

    5/29/2009 9:28:08 AM -- Connecting to the Database...

    5/29/2009 9:28:08 AM -- Calling proc to Xml import the data...

    5/29/2009 9:28:09 AM -- Proc is done importing the data via single Xml (1.296 Seconds)

    5/29/2009 9:28:09 AM -- Done!

    I attached my VS project and simple (testing) code.

    Thanks,

    John

  • john hill (5/29/2009)


    I enjoyed reading your article. Thanks for sharing it and providing the material.

    I downloaded your code and tried adding an option to do an import via SQLXML parameter into a similar stored procedure.

    My testing showed it to be significantly slower using a small 30K-line file. Here are my results (roughly 250 ms vs. 1,250 ms):

    Hello John and thanks, I am glad that you liked the article.

    And thanks so much for running that test and providing the results and code. I was curious about the performance of the XML option, and I am not surprised that it takes longer given that SQL Server has to parse the XML string. Also, did you happen to watch the memory usage on the system? I am sure it was considerable, since the entire XML document has to be packaged up before it is sent to the database. That is the beauty of the IEnumerable interface with TVPs starting in SQL Server 2008: the application server does not need to hold large amounts of data in memory in order to send large amounts of data to the database.
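
    For anyone who has not tried the streaming pattern yet, here is a minimal sketch of what the application side can look like. The table type and procedure names (dbo.ImportTableType, dbo.ImportData), the parameter name, and the tab-delimited file layout are just placeholders for illustration, not necessarily what is in the article's download:

    using System;
    using System.Collections.Generic;
    using System.Data;
    using System.Data.SqlClient;
    using System.IO;
    using Microsoft.SqlServer.Server;   // SqlDataRecord / SqlMetaData

    public static class TvpStreamingSketch
    {
        // Reads the file one line at a time and yields each row as a SqlDataRecord,
        // so the full data set is never held in application memory.
        private static IEnumerable<SqlDataRecord> ReadFileRows(string filePath)
        {
            SqlMetaData[] rowDefinition =
            {
                new SqlMetaData("SillyIDField", SqlDbType.Int),
                new SqlMetaData("MeaninglessText", SqlDbType.VarChar, 50)
            };

            SqlDataRecord record = new SqlDataRecord(rowDefinition);

            using (StreamReader reader = new StreamReader(filePath))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                {
                    string[] fields = line.Split('\t');   // assumes a tab-delimited file
                    record.SetInt32(0, int.Parse(fields[0]));
                    record.SetString(1, fields[1]);
                    yield return record;                  // one row at a time
                }
            }
        }

        public static void Import(string connectionString, string filePath)
        {
            using (SqlConnection connection = new SqlConnection(connectionString))
            using (SqlCommand importProc = new SqlCommand("dbo.ImportData", connection))
            {
                importProc.CommandType = CommandType.StoredProcedure;

                // Declared as Structured and pointed at the user-defined table type;
                // the rows are streamed to SQL Server as the IEnumerable is enumerated.
                SqlParameter tvpParam = importProc.Parameters.Add("@ImportTVP", SqlDbType.Structured);
                tvpParam.TypeName = "dbo.ImportTableType";
                tvpParam.Value = ReadFileRows(filePath);

                connection.Open();
                importProc.ExecuteNonQuery();
            }
        }
    }

    The key point is that ReadFileRows is lazy: the file is read and the rows are sent as one continuous stream, with only a single row's worth of data in application memory at any time.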

    Take care,

    Solomon...

    SQL#: https://SQLsharp.com/ (SQLCLR library of over 340 functions and procedures)
    Sql Quantum Lift: https://SqlQuantumLift.com/ (company)
    Sql Quantum Leap: https://SqlQuantumLeap.com/ (blog)
    Info sites: Collations • Module Signing • SQLCLR

  • This is one of the main problems with mass data in ADO.NET.

    Most people don't know that for every command it sends, ADO.NET needs an open transaction.

    If there is none, it will open one and commit or roll back depending on the result.

    So if you open a transaction for those mass updates yourself, you save a lot of time.

    Instead of using 3,000,000 commands and therefore 3,000,000 transactions, just use one command and an explicit transaction.

    This alone cuts the time needed in this experiment down to about 25%.

    So change the old-school connection and command loop to:

    using (dbConnection = new SqlConnection("Server=.\\sqlexpress;Database=Test;Trusted_Connection=True;"))
    {
        dbConnection.Open();
        SqlTransaction trans = dbConnection.BeginTransaction();

        Console.WriteLine("{0} -- Calling the proc iteratively ...", DateTime.Now);

        using (SqlCommand importProc = new SqlCommand("INSERT INTO dbo.ImportTest (SillyIDField, MeaninglessText) VALUES (@SillyIDField, @MeaninglessText)", dbConnection))
        {
            // Create the two parameters once and reuse them for every row.
            SqlParameter sillyIDField = importProc.Parameters.AddWithValue("@SillyIDField", (int)tempTable.Rows[0][0]);
            SqlParameter meaninglessText = importProc.Parameters.AddWithValue("@MeaninglessText", (string)tempTable.Rows[0][1]);

            // Enlist the command in the explicit transaction.
            importProc.Transaction = trans;

            for (int rowNum = 0; rowNum < tempTable.Rows.Count; rowNum++)
            {
                sillyIDField.Value = (int)tempTable.Rows[rowNum][0];
                meaninglessText.Value = (string)tempTable.Rows[rowNum][1];
                importProc.ExecuteNonQuery();
            }
        }

        // One commit for all of the inserts instead of one transaction per row.
        trans.Commit();

        Console.WriteLine("{0} -- Done importing the data old-school style ({1} Seconds)", DateTime.Now, elapsedSeconds);
    }

    And by the way, DataRow.ItemArray is only needed if you want to get or set the row's values as an object[].

  • Interesting start to a good topic.

    I'm not sure I agree with the comment in the article that the "only way" to do things in the past has been with stored procedures that only take one row. It's been possible to pass XML to a stored procedure for some time now. I see someone has put up some code in the discussion to do precisely that. I believe that typed XML has some performance gains, so that might be a worthwhile comparison.

    If your application must "stream" data to SQL Server, have you considered using a queue of some sort (SQL Server Service Broker or MSMQ, for example)? The overhead of using queues is not inconsiderable, but in some cases it may well be a better approach than having to stream the data.

    Anyway, the point being that this is a new solution (and an excellent one) in SQL Server 2008, but by no means the only solution to the issue of having to pass in multiple rows of typed data to SQL Server.


  • @Solomon Rutzky

    Just wanted to let you know that your article is referenced in the current German SQL PASS newsletter! 😉

  • Florian Reischl (7/1/2009)


    @Solomon Rutzky

    Just wanted to let you know that your article is referenced in the current German SQL PASS newsletter! 😉

    Sweet! Danke for sharing :-).

    Zai Gezundt.

    SQL#: https://SQLsharp.com/ (SQLCLR library of over 340 functions and procedures)
    Sql Quantum Lift: https://SqlQuantumLift.com/ (company)
    Sql Quantum Leap: https://SqlQuantumLeap.com/ (blog)
    Info sites: Collations • Module Signing • SQLCLR

  • FZelle (6/10/2009)


    Most people don't know that for every command it sends, ADO.NET needs an open transaction.

    If there is none, it will open one and commit or roll back depending on the result.

    So if you open a transaction for those mass updates yourself, you save a lot of time.

    Instead of using 3,000,000 commands and therefore 3,000,000 transactions, just use one command and an explicit transaction.

    This alone cuts the time needed in this experiment down to about 25%.

    Hello FZelle. I am sorry it has taken so long to respond to this but shortly after this post we had our first child and many things got lost in the mix.

    Thank you for pointing out the explicit transaction. Yes, I had not thought of that and I suspect that you are correct in that most people do not try this method. It is great to see such a simple update gain so much efficiency.

    Contrasting that with the method I show in this article, I still see a benefit in the new streaming method: it takes less than 30 seconds, as opposed to the 8 or so minutes the iterative loop would take assuming it drops to 25% of the original time. Also, in certain situations the explicit transaction will lock the table for longer than is acceptable. At my current company we have a table that is updated in this manner (many calls to a single-row insert / update) with millions of rows, and while an explicit transaction would reduce that time greatly, it would also lock the table, which is not acceptable given that several processes need to write to the table at about the same time and cannot wait 8 minutes (assuming lock escalation occurs). So there are situations where the explicit transaction would indeed reduce the time but still not be a workable solution. However, as I said, this is still great information to know, and it helps a lot for people still on SQL Server 2005 who cannot use the streaming method.

    Lastly, to Bruce: Yes, the intention was to point out a new solution. I hope that I did not convey that this is the only solution but certainly one that should be considered.

    Take care,

    Solomon...

    SQL#: https://SQLsharp.com/ (SQLCLR library of over 340 functions and procedures)
    Sql Quantum Lift: https://SqlQuantumLift.com/ (company)
    Sql Quantum Leap: https://SqlQuantumLeap.com/ (blog)
    Info sites: Collations • Module Signing • SQLCLR

  • 1. Generally, you will find that 500-1000 rows per transaction gives the optimal performance; beyond that, you won't see much improvement (a rough sketch of batched commits follows below).

    2. Doing all 3 million rows in a single transaction adversely impacts server resources and performance.

    3. If the upload table is going to be shared, you might also consider doing individual uploads to separate tables (one per user) and then executing an INSERT statement from the upload table into the shared table -- a cached upload.
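
    For point 1, here is a rough sketch of what batched commits could look like, reusing the loop, the connection string, and the tempTable DataTable from the earlier post in this thread. The 1,000-row batch size is only an example and should be tuned by testing:

    using (SqlConnection dbConnection = new SqlConnection("Server=.\\sqlexpress;Database=Test;Trusted_Connection=True;"))
    {
        dbConnection.Open();

        const int batchSize = 1000;   // example only; find the best value by testing
        SqlTransaction trans = dbConnection.BeginTransaction();

        using (SqlCommand importProc = new SqlCommand(
            "INSERT INTO dbo.ImportTest (SillyIDField, MeaninglessText) VALUES (@SillyIDField, @MeaninglessText)",
            dbConnection, trans))
        {
            SqlParameter sillyIDField = importProc.Parameters.Add("@SillyIDField", SqlDbType.Int);
            SqlParameter meaninglessText = importProc.Parameters.Add("@MeaninglessText", SqlDbType.VarChar, 50);

            for (int rowNum = 0; rowNum < tempTable.Rows.Count; rowNum++)
            {
                sillyIDField.Value = (int)tempTable.Rows[rowNum][0];
                meaninglessText.Value = (string)tempTable.Rows[rowNum][1];
                importProc.ExecuteNonQuery();

                // Commit every batchSize rows and start a new transaction for the next batch.
                if ((rowNum + 1) % batchSize == 0)
                {
                    trans.Commit();
                    trans = dbConnection.BeginTransaction();
                    importProc.Transaction = trans;
                }
            }
        }

        trans.Commit();   // commit the final (partial) batch
    }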

  • mark hutchinson (5/17/2012)


    1. Generally, you will find that 500-1000 rows per transaction gives the optimal performance; beyond that, you won't see much improvement.

    Hi Mark and thanks for the reply.

    I would advise against having, or promoting, a set range of rows per transaction as there are far too many variables that play into overall performance. I have seen queries that work great at 80 rows and hang at 100 as well as other queries that are fine at the 5000 - 10,000 range. It all comes down to testing to find what is optimal for a particular situation.

    2. Doing all 3 million rows in a single transaction adversely impacts server resources and performance.

    Can you be more specific, please? My testing shows that using a streaming TVP for 3 million rows does not have much of an impact on CPU or memory, so which resource(s) are you thinking of? Sure, the transaction log might grow slightly faster/larger, but that might be worth the trade-off. Also keep in mind that the 3 million rows is a contrived example to show the disparity between the different mechanisms used to get the data into the DB. If locking a table for 20 seconds is too long and increases blocking, then the load can be batched, but that wasn't really the point of the article.

    3. If the upload table is going to be shared, you might also consider doing individual uploads to separate tables (one per user) and then executing an INSERT statement from the upload table into the shared table -- a cached upload.

    I don't see any particular gain here as the INSERT into the destination table should take the same amount of time either way. But the beauty of the TVP approach, even if not streaming, is that the incoming data can first go into a temporary table (if any massaging of the data needs to occur or if something needs to be indexed) and then into the shared table (possibly in batches), thus eliminating any need for additional schema (i.e. the per-user tables). Otherwise, assuming a more typical case of thousands, tens of thousands, or even several hundred thousand rows to be imported, the operation would likely happen quickly enough to not even need those additional steps. But again, testing is key and it is easy enough to add the intermediary temp table.
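
    And just for reference, even without the streaming IEnumerable method, passing an already-populated DataTable as a TVP is as simple as the sketch below, where tempTable is a populated DataTable and connectionString is defined elsewhere; the table type and procedure names are again only placeholders:

    using (SqlConnection connection = new SqlConnection(connectionString))
    using (SqlCommand importProc = new SqlCommand("dbo.ImportData", connection))
    {
        importProc.CommandType = CommandType.StoredProcedure;

        // tempTable is an ordinary DataTable whose columns match the user-defined table type.
        SqlParameter tvpParam = importProc.Parameters.Add("@ImportTVP", SqlDbType.Structured);
        tvpParam.TypeName = "dbo.ImportTableType";   // placeholder table type name
        tvpParam.Value = tempTable;                  // the whole DataTable is sent in one call

        connection.Open();
        importProc.ExecuteNonQuery();
    }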

    Take care,

    Solomon...

    SQL#: https://SQLsharp.com/ (SQLCLR library of over 340 functions and procedures)
    Sql Quantum Lift: https://SqlQuantumLift.com/ (company)
    Sql Quantum Leap: https://SqlQuantumLeap.com/ (blog)
    Info sites: Collations • Module Signing • SQLCLR

  • @Solomon

    My transaction range was based on some testing I'd done a long time ago. I'm surprised that a batch Insert transaction would slow/hang at 100. I agree that the transaction size can be much larger, but the performance levels off after a point and the memory cost becomes a consideration.

    My statement about a 3M-row transaction being ill-advised was also based on old testing. TVPs and explicit transactions cache the insert data differently, and since TVPs are a more recent feature, I would hope that they are inherently efficient. My experience with huge transactions is that there is a real memory hit, and that slows everything on the system. If you have a server with sufficient physical memory, you would not experience this hit. I would expect you to see memory use spike on a performance monitor.

    My suggestion about multiple destination staging tables was meant to address table locking for multiple simultaneous users' batch Insert activities.

    TVP is a beautiful thing. My original comment was inquiring whether any performance test had been done using transactions. If TVP is sufficiently faster or more efficient, it might cause some people to change their batch Insert code.

    Mark

