Inserting 800 million records from a Staging database to EDW: how to totally disable logging

  • I need to insert 800 million records (15 years' worth of data) from my staging database into my final EDW database. The transformation of the data in the table is straightforward and not a problem. The problem starts with the amount of time this query spends logging, and with the database going offline or network problems along the way.

    The insert query (INSERT INTO edw SELECT * FROM staging) takes about 2 hours to insert a year's worth of data, but it's happened before that something goes wrong and a rollback kicks in. And a rollback takes even longer!

    I've had the recovery model on SIMPLE all the while, but even this is limiting my style.

    What I would like to do is

    1) Set logging totally off.

    2) Even if a rollback did happen, I would like to be able to decide whether to run a sub-query to insert the missing data, or to delete the inserted data myself and run the query again. (I would think that my delete would be quicker than a rollback in any case.)

    3) Also the EDW table is partitioned but Staging is not, if that makes any difference.

    4) I would also like to cap the transaction log at maybe just a few gigs, with the space being reused and records being committed as and when needed, but I keep running into the error "The transaction log for database is full. To find out why space in the log cannot be reused, see the log_reuse_wait_desc column in sys.databases".

    I know that people will respond that the logging is needed, but the DBA does backups every day, and any data that's not in the EDW can easily be re-loaded from staging. The EDW should just be a container that receives data, not something that also tries to keep track of what data it has received.

    Any comments welcome

  • chris.stuart (4/3/2009)


    I need to insert 800 million records (15 years' worth of data) from my staging database into my final EDW database. The transformation of the data in the table is straightforward and not a problem. The problem starts with the amount of time this query spends logging, and with the database going offline or network problems along the way.

    The insert query (INSERT INTO edw SELECT * FROM staging) takes about 2 hours to insert a year's worth of data, but it's happened before that something goes wrong and a rollback kicks in. And a rollback takes even longer!

    I've had the recovery model on SIMPLE all the while, but even this is limiting my style.

    What I would like to do is

    1) Set logging totally off.

    You can't. Logging is inherent to operations within SQL Server. What you could do is use SSIS or SQLCMD to perform a bulk insert operation; that would be minimally logged, reducing the overhead. But a straight INSERT ... SELECT is going to be fully logged, no options.
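
    For what it's worth, here is a minimal sketch of the kind of bulk operation meant here, assuming the data can be staged to a flat file first; the table name, file path, and option values below are made up:

    -- Hypothetical bulk load of a pre-exported file.
    -- TABLOCK is one of the requirements for minimal logging
    -- (along with SIMPLE or BULK_LOGGED recovery and no nonclustered
    -- indexes on the target while it is loaded).
    BULK INSERT dbo.EdwFact
    FROM '\\fileserver\exports\fact_1995.dat'
    WITH
    (
        DATAFILETYPE = 'native',
        TABLOCK,
        BATCHSIZE = 1000000   -- commit every million rows
    );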

    2) Even if a rollback did happen, I would like to be able to decide whether to run a sub-query to insert the missing data, or to delete the inserted data myself and run the query again. (I would think that my delete would be quicker than a rollback in any case.)

    Again, no real options here. You could break the insert down into a series of smaller inserts, each of which, by itself, causes less logging. It's a valid technique, but it will extend the time it takes to do this operation, not reduce it.
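
    As a rough illustration of that series approach, assuming the staging table has a date column to slice on (all names here are invented):

    -- Hypothetical: load one year per statement so each transaction,
    -- and therefore each chunk of log, stays a manageable size.
    INSERT INTO dbo.EdwFact
    SELECT *
    FROM   staging.dbo.FactStage
    WHERE  TransactionDate >= '19950101'
      AND  TransactionDate <  '19960101';

    INSERT INTO dbo.EdwFact
    SELECT *
    FROM   staging.dbo.FactStage
    WHERE  TransactionDate >= '19960101'
      AND  TransactionDate <  '19970101';
    -- ...and so on, one year at a time, through the 15 years.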

    3) Also the EDW table is partitioned but Staging is not, if that makes any difference.

    Not in terms of logging, it doesn't. It might make a huge difference in terms of overall performance.

    4) I would also like to cap the transaction log at maybe just a few gigs, with the space being reused and records being committed as and when needed, but I keep running into the error "The transaction log for database is full. To find out why space in the log cannot be reused, see the log_reuse_wait_desc column in sys.databases".

    I know that people will respond that the logging is needed, but the DBA does backups every day, and any data that's not in the EDW can easily be re-loaded from staging. The EDW should just be a container that receives data, not something that also tries to keep track of what data it has received.

    Any comments welcome

    There's just no getting away from the log. Look up the various methods of using bulk operations.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Hi Grant

    Thanks for the reply. I think you hit the nail on the head with this one. I knew that if I just explained it, someone would spot what I'm missing.

    I'm actually using SSIS to populate staging from a flat file, and going from staging to EDW is also SSIS, with a SQL statement file to do the load into the EDW.

    I'll rather split the SQL statement file up into smaller pieces or SSIS components, and use the SSIS data flow's OLE DB Destination to control my Rows per batch and Maximum insert commit size.

    That will sort out all of my problems, with the added bonus that I can actually see how many records have been inserted (on the Data Flow path).

    I was feeling a bit uneasy about the SQL statement file in the SSIS package, and now I know why.

    And thanks for the other options and suggestions as well.

    Regards

    Chris

  • Cool. Glad I could help a little.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Have you tried bulk-logged recovery? And make sure auto create and auto update statistics are off during the load. I've done replication snapshots of 300 million row tables, and turning off statistics creation and updates speeds things up.
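
    For example, the settings being suggested look roughly like this (the database name is a placeholder):

    -- Switch to bulk-logged recovery for the duration of the load and
    -- stop statistics from being created/updated mid-load.
    ALTER DATABASE EDW SET RECOVERY BULK_LOGGED;
    ALTER DATABASE EDW SET AUTO_CREATE_STATISTICS OFF;
    ALTER DATABASE EDW SET AUTO_UPDATE_STATISTICS OFF;

    -- ...run the load...

    -- Put everything back afterwards and refresh statistics manually.
    ALTER DATABASE EDW SET AUTO_CREATE_STATISTICS ON;
    ALTER DATABASE EDW SET AUTO_UPDATE_STATISTICS ON;
    ALTER DATABASE EDW SET RECOVERY SIMPLE;  -- or FULL, per the DBA's backup plan
    EXEC sp_updatestats;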

  • Hi, I see that my auto stats were on. I've switched them off now and will see how that goes. Thanks

  • Hi,

    Not too sure if switching off the "auto stats" solved your problem.

    One of the techniques that we have found useful is to batch inserts, updates and deletes.

    Dean

  • chris.stuart (4/3/2009)


    4) I would also like to cap the transaction log at maybe just a few gigs, with the space being reused and records being committed as and when needed, but I keep running into the error "The transaction log for database is full. To find out why space in the log cannot be reused, see the log_reuse_wait_desc column in sys.databases".

    Lots of questions, and I suffer from ADD syndrome, so... let me answer the last one 😀

    Use a small transaction size, a.k.a. run your load in small batches.
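
    A minimal sketch of what "small batches" can look like in T-SQL, plus the check the error message points at; the database, table, and column names here are invented, and an ascending Id key is assumed:

    -- See what is actually holding the log up.
    SELECT name, log_reuse_wait_desc
    FROM   sys.databases
    WHERE  name = 'EDW';

    -- Batched load driven by a key range, so each INSERT commits on its
    -- own and the log space can be reused between batches
    -- (SIMPLE recovery assumed).
    DECLARE @lastId bigint, @maxId bigint;
    SET @lastId = 0;
    SELECT @maxId = MAX(Id) FROM staging.dbo.FactStage;

    WHILE @lastId < @maxId
    BEGIN
        INSERT INTO dbo.EdwFact (Id, TransactionDate, Amount)
        SELECT Id, TransactionDate, Amount
        FROM   staging.dbo.FactStage
        WHERE  Id >  @lastId
          AND  Id <= @lastId + 500000;

        SET @lastId = @lastId + 500000;
    END;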

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test, non-production environment, the advisor should not be held liable or responsible for any actions taken based on the given advice.

  • PaulB-TheOneAndOnly (4/14/2010)


    chris.stuart (4/3/2009)


    4) I would also like to cap the transaction log at maybe just a few gigs, with the space being reused and records being committed as and when needed, but I keep running into the error "The transaction log for database is full. To find out why space in the log cannot be reused, see the log_reuse_wait_desc column in sys.databases".

    Lots of questions, and I suffer from ADD syndrome, so... let me answer the last one 😀

    Use a small transaction size, a.k.a. run your load in small batches.

    I certainly would recommend small batches.



    Alvin Ramard
    Memphis PASS Chapter

    All my SSC forum answers come with a money back guarantee. If you didn't like the answer then I'll gladly refund what you paid for it.

    For best practices on asking questions, please read the following article: Forum Etiquette: How to post data/code on a forum to get the best help

  • I don't know if this is an option for you or not, but you can also use a SELECT ... INTO, which is minimally logged. We do that with some of our larger warehouse tables, and it is much faster than reloading the existing table. After the new table has been created, we just drop the old one, rename the new one, and recreate the indexes. You do seem to lose some of the table dependency info when you do it like that, but there is no doubt it is faster and gets around the logging issue. It may be different in the newer versions, but it was that way in SQL 2000. Another benefit of doing it that way is that you minimize the amount of downtime while the table reloads. Our table happens to be hit quite frequently around the clock, so this prevents issues with locking and blocking as well.
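
    A sketch of that swap, with all object names made up:

    -- SELECT ... INTO is minimally logged under the SIMPLE or
    -- BULK_LOGGED recovery model.
    SELECT *
    INTO   dbo.EdwFact_new
    FROM   staging.dbo.FactStage;

    -- Swap the tables, then recreate the indexes on the new copy.
    DROP TABLE dbo.EdwFact;
    EXEC sp_rename 'dbo.EdwFact_new', 'EdwFact';

    CREATE CLUSTERED INDEX CIX_EdwFact_TransactionDate
        ON dbo.EdwFact (TransactionDate);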

  • Well dawalker-1068762, thanks a mil, I didn't know that. When I do this again, I'll check it out!

  • If you have any indexes, disable or drop them, and recreate them afterwards.

    Also, do the same for any triggers.

    Furthermore, set the database recovery model to bulk-logged and reset it to full recovery after the data load...
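
    Roughly like this, with placeholder database, table, and index names:

    -- Bulk-logged recovery for the duration of the load.
    ALTER DATABASE EDW SET RECOVERY BULK_LOGGED;

    -- Disable nonclustered indexes so the load does not have to maintain them.
    ALTER INDEX IX_EdwFact_Date ON dbo.EdwFact DISABLE;

    -- ...run the load here...

    -- Rebuilding re-enables the index in a single pass.
    ALTER INDEX IX_EdwFact_Date ON dbo.EdwFact REBUILD;
    -- or, to rebuild everything on the table:
    -- ALTER INDEX ALL ON dbo.EdwFact REBUILD;

    ALTER DATABASE EDW SET RECOVERY FULL;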

    Hope this helps πŸ™‚

  • Jonathan Mallia (4/16/2010)


    If you have any indexes, disable or drop them, and recreate them afterwards.

    Also, do the same for any triggers.

    Furthermore, set the database recovery model to bulk-logged and reset it to full recovery after the data load...

    Hope this helps πŸ™‚

    I've found on one of my other tables, also with about 900 million records, that it's actually way faster to disable and rebuild the indexes on the table. I left the query running last week to check on an insert of about 4 million records without disabling the indexes, and it took more than 24 hours, after which I canned the insert. With the indexes disabled (in the script I disable the indexes and rebuild them when there are more than 50K records), it takes no more than 2 hours to bulk insert the data and rebuild the 3 indexes on the table. 2 hours versus 24+ hours? I'd rather take the 2 hours.

    And as for triggers, I stay as far away from them as possible.

    I've not thought about changing the recovery model, but I will give this a try as well; I'm sure it will give a huge improvement too.

    Thanks

  • You're welcome, mate... glad to help 🙂

    chris.stuart (4/19/2010)


    Jonathan Mallia (4/16/2010)


    If you have any indexes, disable or drop them, and recreate them afterwards.

    Also, do the same for any triggers.

    Furthermore, set the database recovery model to bulk-logged and reset it to full recovery after the data load...

    Hope this helps πŸ™‚

    I've found on one of my other tables, also with about 900 million records, that it's actually way faster to disable and rebuild the indexes on the table. I left the query running last week to check on an insert of about 4 million records without disabling the indexes, and it took more than 24 hours, after which I canned the insert. With the indexes disabled (in the script I disable the indexes and rebuild them when there are more than 50K records), it takes no more than 2 hours to bulk insert the data and rebuild the 3 indexes on the table. 2 hours versus 24+ hours? I'd rather take the 2 hours.

    And as for triggers, I stay as far away from them as possible.

    I've not thought about changing the recovery model, but I will give this a try as well; I'm sure it will give a huge improvement too.

    Thanks

  • Use SSIS and make sure that the destination table has no indexes. Use a data flow task and not an Execute SQL task; it will be much faster and will be minimally logged. My experience is that data flow tasks work about as fast as bulk loading, but without the headaches, and they have the added benefit of handling validation errors. Also, as other people have suggested, set the batch size and apply the indexes only after you have loaded the table.

    Are you using SQL Server Enterprise?

    If so, you can partition the table into intervals of month or year, whichever you feel is most appropriate based on the number of records. You create a table with the same schema and indexes, plus a check constraint that aligns with your partition function. You can then load each interval's table individually and switch it into the partitioned table (this is a metadata-only operation and occurs very quickly). That lets you perform the loads and switch them in while keeping the partitioned table available for whatever query processing you might need to perform (incremental loading).
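
    A rough sketch of that switch pattern, with made-up names and assuming a yearly partition function on TransactionDate where 1995 maps to partition 1:

    -- The standalone table must match the partitioned table's schema and
    -- indexes, sit on the same filegroup as the target partition, and carry
    -- a check constraint matching that partition's boundaries.
    CREATE TABLE dbo.EdwFact_1995
    (
        Id              bigint        NOT NULL,
        TransactionDate datetime      NOT NULL,
        Amount          decimal(18,2) NOT NULL,
        CONSTRAINT CK_EdwFact_1995_Date
            CHECK (TransactionDate >= '19950101' AND TransactionDate < '19960101')
    );

    -- Load and index dbo.EdwFact_1995 here while dbo.EdwFact stays online...

    -- Metadata-only switch of the loaded table into the empty 1995 partition.
    ALTER TABLE dbo.EdwFact_1995
        SWITCH TO dbo.EdwFact PARTITION 1;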
