What is best practice for loading data into tables with 5 million records plus

  • This is for SQL 2000

    We have a table with over 5 million rows in it.

    There are indexes on this table that are in place to help with querying this table.

    On the 10th of each month we receive a file with approximately 40,000 rows of data in it.

    It currently takes about 15 to 20 hours to load all 40,000 rows of data into this table.

    Now, the upload process does a lot of things, like checking for invalid records and flagging them, along with other work.

    I have rebuilt the indexes in an effort to defragment them and speed up the process, but it did not help.

    I am considering that when the file is received, we drop the indexes, process the file, and then recreate the indexes.

    I am wondering what the smart folks on this forum think about this approach, and whether there is a best practice for this type of thing.

    Or am I totally up a creek without a paddle?

    Thanks for any insights you provide.

    Gary

  • I would start by creating a staging table to receive the data from the file. In SQL 2005, using BULK INSERT to load the table has amazing throughput, but I've never done it in SQL 2000. See http://technet.microsoft.com/en-us/library/aa225968%28v=SQL.80%29.aspx.
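
    Something like this is what I have in mind for the staging table and the load. This is only a rough sketch; the table name, column layout, file path and terminators are placeholders for whatever your file actually looks like:

        -- Staging table shaped exactly like the incoming file
        -- (BULK INSERT without a format file expects one column per field in the file)
        CREATE TABLE dbo.MonthlyFeed_Staging (
            AccountNo varchar(20) NULL,
            TranDate  varchar(10) NULL,   -- keep as text, convert after validation
            Amount    varchar(20) NULL
        )

        -- Bulk load the raw file; the terminators depend on how the file is produced
        BULK INSERT dbo.MonthlyFeed_Staging
        FROM '\\fileserver\feeds\monthly.txt'
        WITH (
            FIELDTERMINATOR = ',',
            ROWTERMINATOR   = '\n',
            TABLOCK
        )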

    From there, you can do your validation and eliminate the rows you aren't going to load. Separating the validation to work on your staging table means that you won't have any writes to your permanent table while you're validating, updating and massaging your incoming data. When you're done, you can fire a single INSERT INTO...SELECT from your staging table to populate your permanent table.
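
    As a sketch of that flow (the validation rules, column names and permanent table name here are invented just to show the shape of it):

        -- Add a flag column to the staging table for marking invalid rows
        ALTER TABLE dbo.MonthlyFeed_Staging ADD ErrorFlag char(1) NOT NULL DEFAULT 'N'

        -- Flag rows that fail validation (these rules are only examples)
        UPDATE dbo.MonthlyFeed_Staging
        SET ErrorFlag = 'Y'
        WHERE ISDATE(TranDate) = 0
           OR ISNUMERIC(Amount) = 0

        -- One set-based insert of the clean rows into the permanent table
        INSERT INTO dbo.BigTable (AccountNo, TranDate, Amount)
        SELECT AccountNo, CONVERT(datetime, TranDate), CONVERT(money, Amount)
        FROM dbo.MonthlyFeed_Staging
        WHERE ErrorFlag = 'N'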

    From there, evaluate whether your indexes need to be rebuilt or statistics updated. If you only write data to the permanent table once a month, you should be able to maintain pretty good stats on it.
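
    In SQL 2000 that would look something like this (the table name is a placeholder):

        -- Rebuild every index on the permanent table
        DBCC DBREINDEX ('dbo.BigTable')

        -- Or, if the indexes are in good shape, just refresh the statistics
        UPDATE STATISTICS dbo.BigTable WITH FULLSCAN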

    The only thing left is the 15 to 20 hours, which has me a little puzzled.

    1. If you have any cursors in your load process, try to get rid of them and replace them with set-based approaches. If you're doing things that require loops (like 40,000 seeks instead of 1 scan), replace them. If a single operation takes only 2 seconds but you're doing it 40,000 times, that's about 22 hours. There's a sketch of the set-based idea after this list.

    2. Optimize each step in your load process. Look at the execution plans for each statement if you need to. Run it statement by statement if you have to. There might be just a few things taking a long time, or you might have a lot of problems. It depends on what you're doing, but you have to identify the problems before you can solve them.

    3. I'm also wondering what indexes you have on your permanent table. If you have a lot of them, they all need to be updated when you do your insert. In some cases I've found it worthwhile to drop the indexes, do the load and then recreate the indexes. This isn't always true, but sometimes it can make a real difference. If you take this approach, they won't need to be rebuilt afterwards anyway, since they were just created.
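
    To illustrate the cursor point in item 1, here is the kind of rewrite I mean. The lookup against an Accounts table is invented purely to show the pattern:

        -- Row-by-row version: a cursor firing one statement per staging row,
        -- which is 40,000 round trips through the same plan. This is the shape
        -- of thing to hunt down and remove.

        -- Set-based version: a single statement flags every staging row
        -- that has no matching account
        UPDATE s
        SET ErrorFlag = 'Y'
        FROM dbo.MonthlyFeed_Staging s
        LEFT JOIN dbo.Accounts a ON a.AccountNo = s.AccountNo
        WHERE a.AccountNo IS NULL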

    HTH
