Issue with deleting millions of rows of data

  • Hi all,

    A client I work with has a database that has been customised by a vendor, and the customisation is causing some serious performance issues. There is a group of 12 custom tables that are all "unrelated" - no FKs, indexes or anything - but they do relate logically. There are header, line and link tables as well as other "related" data tables. All the data is around journals: once a row is marked as processed in the main table, it is never referred to again, so everything relating to it can safely be deleted.

    The problem is the size of these tables: they are growing rapidly and the vendor has put no automatic cleanup tasks in place. When they check the long-running queries in the application, most of them hit one or more of these large tables. The largest three tables have over 300m rows each.

    The vendor sent a script that would delete 50k processed rows from the main table, then delete from the related tables. However, this rapidly filled up the transaction log. As they are on a hosted environment, this causes issues with alerting, disk space, etc. Even after reducing the batch size to 5k rows, the process took hours to run.

    My ideal solution would be to wait till everything is processed, back up the database, then truncate all the tables. Or, wait till a cut-off point... move anything unprocessed into holding tables... truncate the tables... then move from holding back into the original tables (roughly as sketched below). There is nervousness around doing this though, as they cannot test any external interfaces in the testing environment.
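
    For the second option, something like this per table is what I have in mind - the table name and the Processed flag are made up here just to show the shape of it:

        BEGIN TRY
            BEGIN TRANSACTION;

            -- Park the unprocessed rows in a holding table.
            SELECT *
            INTO   dbo.JournalHeader_Holding
            FROM   dbo.JournalHeader
            WHERE  Processed = 0;

            -- TRUNCATE is minimally logged, so the transaction log stays small.
            TRUNCATE TABLE dbo.JournalHeader;

            -- Move the unprocessed rows back into the original table.
            -- (If the table has an IDENTITY column, an explicit column list
            -- and SET IDENTITY_INSERT would be needed here.)
            INSERT INTO dbo.JournalHeader
            SELECT * FROM dbo.JournalHeader_Holding;

            COMMIT TRANSACTION;
        END TRY
        BEGIN CATCH
            IF @@TRANCOUNT > 0 ROLLBACK TRANSACTION;
            THROW;
        END CATCH;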

    So...I'm looking for any ideas or alternative solutions! It's a bit of a horrible situation, made harder by the fact that it's very difficult to organise any downtime for the application.

    Cheers in advance!

    -----------------------------------------------
    SQL Server posts at my blog, Grapefruitmoon.Net

  • grapefruitmoon - Friday, March 17, 2017 10:31 AM

    Two suggestions. Firstly, ensure any explicit transaction in the script is around each 50k batch so each completed batch is committed and logged, and not the whole run of batches in one go.
    Secondly, measure how many 50k batches you can safely commit in between log backups. You could then adjust your script with a WAITFOR DELAY so it plays nice with your log.
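
    Something along these lines, for example - the table name, flag and delay are placeholders to show the shape of it:

        DECLARE @BatchSize int = 50000,
                @Rows      int = 1;

        WHILE @Rows > 0
        BEGIN
            BEGIN TRANSACTION;

            -- Delete one batch of processed rows (dbo.JournalHeader / Processed are placeholder names).
            DELETE TOP (@BatchSize)
            FROM   dbo.JournalHeader
            WHERE  Processed = 1;

            SET @Rows = @@ROWCOUNT;

            -- Commit and log each batch on its own, not the whole run.
            COMMIT TRANSACTION;

            -- Pause so the scheduled log backups can keep up.
            WAITFOR DELAY '00:00:30';
        END;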

    β€œWrite the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY (I) and (II) - Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop - Jeff Moden

  • If you only need to keep relatively few of the billion+ rows in the combined tables, I like the idea of quiescing the database, copying off what you need into new objects, truncating the existing tables and moving the data back into them. You can potentially make it even more efficient by scripting out the indexes (make sure you get ALL the optional settings for them!), then putting the data in and creating the indexes fresh. Note that this keeps all column-level statistics in place, which can actually gum up the works of putting the data back in. You may wish to disable autostats updates, then do a full-scan update when done and turn autostats back on. Also beware of filling the tlog during any phase of this operation.
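
    Roughly this shape, for example - the object names are placeholders and the reload itself is elided:

        -- Switch off automatic statistics updates on the table while it is emptied and repopulated.
        EXEC sp_autostats 'dbo.JournalHeader', 'OFF';

        -- ... script out and drop the nonclustered indexes, truncate, reload the kept rows,
        -- ... then recreate the indexes (which builds their statistics with a full scan) ...

        -- Switch automatic stats back on and refresh everything with a full scan once the data is back.
        EXEC sp_autostats 'dbo.JournalHeader', 'ON';
        UPDATE STATISTICS dbo.JournalHeader WITH FULLSCAN;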

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Step 1 would be to actually add some meaningful indexes... I take that back.  Step 1 would be to fire the vendor. πŸ˜‰

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Thanks for the replies folks!

    TheSQLGuru - Friday, March 17, 2017 11:19 AM

    This is the solution I've been trying to push, although I hadn't thought of disabling autostats. The client is looking to reclaim the space from these tables, so a database shrink (I know, I know...!) and then a full re-index with update stats after that is likely.

    ChrisM@Work - Friday, March 17, 2017 10:55 AM

    There may be scope for them to test this option this week - thanks for the WAITFOR DELAY advice, that would make a lot of sense.

    Jeff Moden - Sunday, March 19, 2017 9:13 AM

    Firing the vendor is very much on the cards. πŸ™‚ This is only the tip of the iceberg when it comes to their incompetence...

    -----------------------------------------------
    SQL Server posts at my blog, Grapefruitmoon.Net

  • Note that it is pretty silly to REBUILD indexes and then update statistics on them! With REBUILD you get 100%-scan stats, the best you can have. With UPDATE STATISTICS, unless you specify FULLSCAN you will get sampled stats, which are less accurate (in addition to all that extra work).

    Also note that you still need to update non-index stats in any case after such a repopulation effort.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • TheSQLGuru - Monday, March 20, 2017 11:23 AM

    Ah... not quite.  Rebuilding indexes rebuilds the stats associated with the indexes.  It does NOT rebuild any column statistics.  You still have to do that separately.
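
    If you only want to hit the column stats and leave the freshly rebuilt index stats alone, an untested sketch like this would generate the commands (index stats are the rows in sys.stats where auto_created and user_created are both 0):

        DECLARE @sql nvarchar(max) = N'';

        SELECT @sql += N'UPDATE STATISTICS '
                     + QUOTENAME(SCHEMA_NAME(o.schema_id)) + N'.' + QUOTENAME(o.name)
                     + N' ' + QUOTENAME(s.name) + N' WITH FULLSCAN;' + CHAR(13)
        FROM   sys.stats   AS s
        JOIN   sys.objects AS o
          ON   o.object_id = s.object_id
        WHERE  o.is_ms_shipped = 0
          AND  (s.auto_created = 1 OR s.user_created = 1);  -- column statistics only

        EXEC sys.sp_executesql @sql;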

    Shifting gears, I've found that rebuilding/reorganizing indexes is generally a complete waste of time and resources, and such maintenance may actually work against you for both performance and storage.

    --Jeff Moden

  • Jeff Moden - Monday, March 20, 2017 12:18 PM

    Hence this statement: "Also note that you still need to update non-index stats in any case after such a repopulation effort." That was meant to apply to the index rebuild part (and it not covering non-index stats), but reading it back now I can see how it could be interpreted differently from how I intended.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • TheSQLGuru - Monday, March 20, 2017 12:28 PM

    Yeah... especially when I post prior to being sufficiently caffeinated for the day.  I didn't even see your good note.  My apologies for the unnecessary correction, Kevin.

    --Jeff Moden
