Using INSERT INTO…SELECT to Bulk Load Data with Minimal Logging

  • Please accept my apology if this post is in the wrong forum.

    I'm trying to bulk load two tables with mimimal logging. I started by setting the recovery model to Bulk-logged. Here's the pseudo-code for each table:

    TRUNCATE TABLE table_name

    INSERT INTO table_name WITH(TABLOCK)

    (Column_Names)

    SELECT

    (Column_Names)

    FROM OPENROWSET (

    BULK 'data_file_path',

    FORMATFILE = 'format_file_path') alias

    WHERE condition

    ORDER BY column_with_nonclustered_index

    I can backup the transaction log and shrink it down to 50 MB. After running the code for the two tables, the transaction log has grown to about 560 MB, which is about the same size as the primary data file (.mdf file).

    Each table has exactly one index, which is non-clustered. (There are no primary keys, and no unique constraints.) I tried dropping the index before the insert statement, and then recreating the index after the insert statement as follows:

    TRUNCATE TABLE table_name

    DROP nonclustered_index

    INSERT INTO table_name WITH(TABLOCK)

    (Column_Names)

    SELECT

    (Column_Names)

    FROM OPENROWSET (

    BULK 'data_file_path',

    FORMATFILE = 'format_file_path') alias

    WHERE condition

    ORDER BY column_with_nonclustered_index

    CREATE nonclustered_index

    Once again, I backup the transaction log and shrink it down to 50 MB. After running the code for the two tables, the transaction log stays at 50 MB.

    In my mind, the 2nd scenario achieved my goal of bulk loading the two tables with minimal logging. But I don't know why I had to drop/create the indexes. I'd prefer not to do this. This documentation got me headed in the right direction:

    http://msdn.microsoft.com/en-us/library/ms190422.aspx

    But the question remains: If the tables are truncated before the bulk load, why do I have to drop/create indexes? (I am using SQL 2005, and the documentation is for SQL 2008.)

    What am I missing?

  • Followup...

    I'm really reluctant to drop and recreate indexes. Let's say another dba or developer changes the index (add/remove columns to/from the index, change the FILLFACTOR, etc). My code would nullify those changes because I hard-coded the steps to create the index. Not good!

    A better option is to disable the index and rebuild it later. (Thanks to Andy Warren for his article about Disabling Indexes[/url]!)

    Here's my revised pseudo-code:

    TRUNCATE TABLE table_name

    ALTER INDEX nonclustered_index ON table_name DISABLE

    INSERT INTO table_name WITH(TABLOCK)

    (Column_Names)

    SELECT

    (Column_Names)

    FROM OPENROWSET (

    BULK 'data_file_path',

    FORMATFILE = 'format_file_path') alias

    WHERE condition

    ORDER BY column_with_nonclustered_index

    ALTER INDEX nonclustered_index ON table_name REBUILD

    If I get really motivated, I can programatically determine the non-clustered indexes and disable/rebuild them.

    Still, after reading the documentation on MSDN, I'm confused as to why any of this matters when the tables are truncated first...

  • Dave Mason (5/26/2010)


    Still, after reading the documentation on MSDN, I'm confused as to why any of this matters when the tables are truncated first...

    It's been a while, but in case you haven't figured it out, the reason that the log files stay so much smaller when you either DROP or DISABLE the non-clustered indexes is that if the indexes are active while you are loading data into the table SQL Server has to log all of the updates to the indexes. So as a record is inserted it has to log the page changes, and page splits. When you DROP or DISABLE the index first, and then CREATE or REBUILD it after the data is in, you are essentially creating a brand new index.

    The two main good side-effects of doing this are that:

    * the fragmentation in your indexes will be much less, making for better performance.

    * It is normally faster to insert the data and then index. (It is faster to create the index all at once than to modify it row by row to keep it up to date.)

    I hope that helps explain things.

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply