Transform Data Task performance question

  • I have a package that is scheduled to run any time a certain table in a certain database has its data refreshed.  In that package, I have a Transform Data task which transfers about 21 million rows from one table in my data build environment into an "inactive" table in my production environment.  Once that copy is done, I rename the "inactive" table so that it becomes the "active" table.
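
    For illustration only, the rename-based swap might look something like this in T-SQL (the table names here are placeholders, not the real ones):

        -- Hypothetical names: FactData is the "active" table, FactData_Inactive is the freshly loaded copy.
        BEGIN TRANSACTION;
            EXEC sp_rename 'dbo.FactData', 'FactData_Old';        -- retire the current active table
            EXEC sp_rename 'dbo.FactData_Inactive', 'FactData';   -- promote the newly loaded copy
        COMMIT TRANSACTION;
        -- The retired copy can then be truncated and reused as the next "inactive" load target.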

    The table is about 40 columns wide.  It is very complex data, so the 40 columns are necessary, and it is about as normalized as I can get it without making the queries I need to perform on it impossibly slow.  There are also several indexes on the table, without which querying would be virtually useless.

    My problem is that the package execution window (as it counts off the number of rows transferred to the new table in the "status" column) gets to within 1,000 rows of completion fairly quickly and then takes a couple more hours to report that the remaining 800 rows or so have completed, thereby finishing the task.  I want to know why it gets 99% done in a short amount of time and then takes 2 or more hours to complete the other 1%.  I have already considered the possibility that the package execution window is not reporting progress accurately; if that is the case, I would like to know that as well.  I would also like to know what I can do, if anything, to remedy it, either by setting up the package differently or by using a different method of copying within the package other than a Transform Data task.

    I have done some research and have tried the following solutions.  The Transform Data task uses the Copy Column transformation to move the data.  I have removed the default single-column-to-single-column transformations and replaced them with a single transformation for all columns.  The "Use Fast Load" option is selected, as is "Table Lock".
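
    As a point of reference for what those two options do: "Use Fast Load" pushes rows through the bulk-insert interface instead of inserting them one at a time, and "Table Lock" takes a single table-level lock rather than row locks.  A rough hand-written equivalent outside DTS would be something like the sketch below; the linked server and table names are invented.

        -- Rough stand-in for Fast Load + Table Lock (all names are hypothetical).
        -- BuildServer is assumed to be a linked server pointing at the data build environment.
        TRUNCATE TABLE dbo.BigTable_Inactive;

        INSERT INTO dbo.BigTable_Inactive WITH (TABLOCK)
        SELECT *
        FROM BuildServer.BuildDB.dbo.BigTable;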

    Is there anything else someone can suggest to help with my situation?

    Greg

  • Check "Insert batch size" on the Options tab of the Transform Data task properties.  I think the delay you see while watching the package progress is caused by the inserted rows being committed.  This would be really apparent if the batch size is set to 0 (the default), which loads all the rows as a single batch and commits them in one transaction at the end.
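
    The same knob exists on the other bulk-load paths, which may make its effect easier to picture; for example, BULK INSERT exposes it as BATCHSIZE (the file path and table name below are invented):

        -- A non-zero batch size commits every N rows as its own transaction
        -- instead of holding all 21 million rows in one transaction until the end.
        BULK INSERT dbo.BigTable_Inactive
        FROM '\\buildserver\extracts\bigtable.dat'
        WITH (BATCHSIZE = 50000, TABLOCK);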

    I don't know what else you can do in DTS to speed up the load.  If the destination table has indexes, you could try dropping them before loading and recreating them after.
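
    Wrapped in Execute SQL tasks before and after the load, that would look roughly like this (the index and column names are made up):

        -- Drop the indexes before the load...
        DROP INDEX BigTable_Inactive.IX_BigTable_Col1;
        DROP INDEX BigTable_Inactive.IX_BigTable_Col2;

        -- ...run the Transform Data task here...

        -- ...then rebuild the indexes afterwards.
        CREATE INDEX IX_BigTable_Col1 ON dbo.BigTable_Inactive (Col1);
        CREATE INDEX IX_BigTable_Col2 ON dbo.BigTable_Inactive (Col2);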

    Greg

  • Great response, Greg.  Thank you.

    I looked into the Insert Batch Size option and you were right.  I changed that value from zero to 50,000.  My baseline runtime for that copy was 3hrs 35mins.  After the change, I ran it again.  30 minutes into the run, 4.5 million rows out of 21 million had copied.  Extrapolating that out, I can expect a runtime of roughly 2hrs 50mins.  That shaves about 30 minutes off of the process.  Over the next couple of days, I am going to play with reducing the insert batch size to something smaller, like 10,000.  My hope is that making the batches smaller will allow SQL to perform faster.

    The next thing was the idea of dropping the indexes, doing the copy, then re-creating the indexes.  Does anyone have any metrics on the relative efficiency of these two approaches (a rough timing harness is sketched after the list)?

    1) time to truncate destination + time to insert 21 million rows with indexes

    2) time to truncate destination + time to drop indexes + time to insert 21 million rows + time to add indexes
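
    For anyone who wants to measure it, a rough end-to-end timing harness might look like this (object names are hypothetical, and the load step is whichever of the two options is being tested):

        DECLARE @t0 DATETIME;
        SET @t0 = GETDATE();

        TRUNCATE TABLE dbo.BigTable_Inactive;
        -- Option 1: run the load here with the indexes left in place.
        -- Option 2: DROP INDEX here, run the load, then CREATE INDEX afterwards.

        SELECT DATEDIFF(minute, @t0, GETDATE()) AS ElapsedMinutes;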

    Thanks again for the assistance,

    Greg

  • I am curious how your test went.  I would think that the batch size wouldn't change the loading performance that much, but that it affects how the DTS GUI reports back its progress.

    I am interested in people's responses to your question about dropping indexes before a DTS load and then recreating them, as this is something I am pondering myself.  I'm told the conventional wisdom is that the larger the destination table, the more the diminishing returns kick in, and since most of my tables are large, I've not experimented with it.

    In my case I've got 140 million rows and I need to insert about 50,000 each day.  It takes 2.5 hours on some days!!!  I don't know which of these it is (a rough sketch for timing each piece separately follows the list):

    a) Obtaining the results on the source

    b) Copying data across the network (I don't think this is it).

    c) The actual insert of the data

    d) Index maintenance on the insert-side (destination).
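
    One way to narrow it down is to time each piece in isolation.  A rough sketch, with the table names and the daily filter invented purely for illustration:

        DECLARE @t DATETIME;

        -- (a) Time the source query by itself, dumping the rows into a temp table.
        SET @t = GETDATE();
        SELECT * INTO #daily FROM dbo.SourceTable WHERE BatchId = 12345;   -- stand-in for "today's 50,000 rows"
        SELECT DATEDIFF(second, @t, GETDATE()) AS SourceQuerySeconds;

        -- (c) Time the raw insert by loading the same rows into an unindexed copy of the destination.
        SET @t = GETDATE();
        INSERT INTO dbo.DestTable_NoIndexes SELECT * FROM #daily;
        SELECT DATEDIFF(second, @t, GETDATE()) AS HeapInsertSeconds;

        -- (d) Time the insert into the real 140-million-row table; the difference is mostly index maintenance.
        SET @t = GETDATE();
        INSERT INTO dbo.DestTable SELECT * FROM #daily;
        SELECT DATEDIFF(second, @t, GETDATE()) AS IndexedInsertSeconds;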

  • I am actually wrapping up the process of changing the values to see what results I get.  As you can imagine, it is a tedious process.  The more I test, the more I believe your statement about batch size not changing performance is correct.  As a matter of fact, in each of my test cases so far, I've experienced longer total runtimes with batch sizes other than the default of zero (everything as one batch).

    I think I'm going to set the batch size back to one giant batch and then play with dropping the indexes beforehand and re-creating them afterwards.  The only reason I hesitate is that it takes so stinking long to test those types of things.  I can say that if I were to simply take the database offline and copy the MDF and LDF files across the network, it would take about 2 hours.  It stands to reason that I should be able to copy the same data within SQL Server in less time than 3-4 hours somehow.
