Approach to archiving tables with foreign key constraints

  • Hi,

    I'd like to get your thoughts on the approach to archiving data from a set of tables with foreign key constraints.

    The scenario: we have a main table holding around 50 million records, plus 10 child tables that have foreign key constraints to the main table or to one of the other child tables.

    There's a requirement for a one-off archive of data from before a certain date. This won't be an ongoing process because more recent data has to be retained for some years to come. We'll be archiving about 10 million of the 50 million records.

    The archive can be carried out during one or more maintenance windows of a few hours each, but outside of those windows the tables must not be locked, and I need to retain referential integrity throughout.

    My thought is to drop the constraints, then, within a transaction, copy a batch of, say, 1,000 records from the main table plus the associated records from the child tables into archive tables, delete the records that were copied, commit, and repeat until the end of the window.

    At the end of the multiple batches, I'd then recreate the constraints.
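
    In rough T-SQL terms, one batch might look something like the sketch below (schema, table, and column names are made up for illustration; the real set has ten child tables, so the copy/delete pairs would be repeated per table):

    DECLARE @ArchiveCutoff DATETIME = '20100101';   -- placeholder cutoff date
    DECLARE @Batch TABLE (MainID INT PRIMARY KEY);

    BEGIN TRANSACTION;

        -- pick the next batch of parent rows that are old enough to archive
        INSERT INTO @Batch (MainID)
        SELECT TOP (1000) MainID
        FROM   dbo.MainTable
        WHERE  DateAdded < @ArchiveCutoff
        ORDER BY MainID;

        -- copy the batch (parent and children) into the archive tables
        INSERT INTO archive.MainTable
        SELECT m.* FROM dbo.MainTable m JOIN @Batch b ON b.MainID = m.MainID;

        INSERT INTO archive.ChildTable
        SELECT c.* FROM dbo.ChildTable c JOIN @Batch b ON b.MainID = c.MainID;

        -- delete what was just copied (children first, then the parent rows)
        DELETE c FROM dbo.ChildTable c JOIN @Batch b ON b.MainID = c.MainID;
        DELETE m FROM dbo.MainTable  m JOIN @Batch b ON b.MainID = m.MainID;

    COMMIT;
    -- repeat until no qualifying rows remain or the window closes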

    I'd be very grateful for any feedback. Am I on the right track or should I be considering a different approach?

    TIA

    Having read through some of the articles on archiving on this site, it looks as though the preferred approach would be to (rough T-SQL sketch after the list):

    drop foreign key constraints on the target tables

    rename the existing tables as archive tables

    create new tables with the original name and structure

    copy the data that is to stay live from the archive tables back into the 'new' original tables

    recreate constraints on the 'new' original tables

    delete non-archive data from the archive tables
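
    Roughly, the steps above in T-SQL (names made up, one child table shown; the rename/recreate would be repeated for each table in the set):

    -- 1. drop the foreign keys that reference the table being renamed
    ALTER TABLE dbo.ChildTable DROP CONSTRAINT FK_ChildTable_MainTable;

    -- 2. rename the existing table so it becomes the archive
    EXEC sp_rename 'dbo.MainTable', 'MainTable_Archive';

    -- 3. recreate an empty table with the original name and structure
    --    (script the original DDL so identity, defaults, indexes, etc. come across)
    CREATE TABLE dbo.MainTable
    (
        MainID    INT IDENTITY(1,1) NOT NULL PRIMARY KEY,
        DateAdded DATETIME NOT NULL
        -- ...remaining columns per the original definition
    );

    -- 4. copy the rows to be kept back into the 'new' original table
    SET IDENTITY_INSERT dbo.MainTable ON;
    INSERT INTO dbo.MainTable (MainID, DateAdded)   -- plus the remaining columns
    SELECT MainID, DateAdded
    FROM   dbo.MainTable_Archive
    WHERE  DateAdded >= '20100101';                 -- placeholder cutoff
    SET IDENTITY_INSERT dbo.MainTable OFF;

    -- 5. recreate the constraints against the 'new' original table
    ALTER TABLE dbo.ChildTable ADD CONSTRAINT FK_ChildTable_MainTable
        FOREIGN KEY (MainID) REFERENCES dbo.MainTable (MainID);

    -- 6. delete the retained (non-archive) rows from the archive table
    DELETE FROM dbo.MainTable_Archive WHERE DateAdded >= '20100101';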

    This seems like a good approach. My concern is that, since I'm only archiving a minority of the data, I'll be copying back 80% of the data into the original tables, which might take longer than the usual maintenance window.

    Any thoughts please?

  • anni00 (11/28/2014)



    I don't know how big the rows to be moved are, but moving just 1,000 rows at a time will require 10,000 iterations. If the rows are huge, that might be the best thing for the log files, but I'd shoot for something between 100,000 and a million. With the constraints down, it should run right along, especially if the clustered index (which you should NOT disable) on these tables is temporally based, such as on an IDENTITY or "date added" column.

    I wouldn't go over a million rows. Every system has a "tipping point" on these types of things where it might take just a minute to copy a certain number of rows and if you double that number of rows, the system suddenly takes hours.

    Since you're only deleting 1/5th of the table, the old trick of renaming tables probably isn't worth it time-wise here.
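
    Just as an illustration (the table and column names are assumed, not your actual schema), a pass that seeks on a date-based clustered index and does the copy and delete in a single statement might look like this:

    DECLARE @ArchiveCutoff DATETIME = '20100101';   -- placeholder cutoff
    DECLARE @BatchSize     INT      = 100000;       -- tune this from your tipping-point tests

    WHILE 1 = 1
    BEGIN
        -- one statement copies and deletes the batch; the WHERE clause lets the engine
        -- seek on a DateAdded clustered index instead of scanning the whole table
        -- (OUTPUT...INTO requires the archive table to have no triggers, FKs, or CHECK constraints)
        DELETE TOP (@BatchSize)
        FROM   dbo.MainTable
        OUTPUT deleted.* INTO archive.MainTable
        WHERE  DateAdded < @ArchiveCutoff;

        IF @@ROWCOUNT = 0 BREAK;   -- nothing left; child tables would get the same treatment
    END;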

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
        Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hi Jeff,

    Thanks for the feedback; it is much appreciated. The table rows are not huge for the most part. All the tables within scope have between 3 and 16 columns of mostly numeric datatypes. Only one has a really big field, a VARCHAR(8000).

    I'm guessing I need to do some trials in our test environment to identify the optimal batch size. I'll start at 100,000 and work up.

    Presumably the advantage of copying and renaming whole tables is only there if you are planning on archiving more than half the data in the original tables?

  • Hi anni00,

    On the surface, it would appear that way but, like everything else in SQL Server, it really does depend. DELETEs can be brutal because a lot of people who do batched deletes, such as those you're trying to do, forget that SQL Server has to find the rows before they can be copied or deleted. If they've removed the clustered index, the clustered index isn't temporally based, or they've simply picked the wrong column to base the copy/delete on, you can end up with 2 full table scans (1 for the copy, 1 for the delete) for each and every iteration. As you know, that can take a huge amount of time and resources, especially in the area of memory and physical I/O, if the table isn't already cached.

    Inserts can actually work quite quickly and with little logging space if you shift to the BULK_LOGGED recovery model and use SELECT...INTO to build the new tables. Again, there's a tipping point involved. It truly "Depends".
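
    For illustration only (database and object names assumed), the minimally logged route would look something like this:

    -- switch the recovery model; take a log backup before and after, since
    -- point-in-time restore isn't available across minimally logged operations
    ALTER DATABASE MyDatabase SET RECOVERY BULK_LOGGED;

    -- SELECT...INTO creates the new table and is minimally logged under BULK_LOGGED
    SELECT *
    INTO   dbo.MainTable_Keep
    FROM   dbo.MainTable
    WHERE  DateAdded >= '20100101';   -- placeholder cutoff

    ALTER DATABASE MyDatabase SET RECOVERY FULL;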

    You're doing the right thing... do a test or two or three and decide from there.

    --Jeff Moden



    p.s. I strongly recommend that you take a full backup of your test environment before you do any delete tests. If the delete tests don't work out right, you may want to get the test environment back to its original state to start a different set of tests. Heh... that's some experience talking. 😛
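
    E.g., something like this (database name and backup path made up):

    -- full backup before the delete tests
    BACKUP DATABASE TestDb
        TO DISK = N'X:\Backups\TestDb_PreArchiveTests.bak'
        WITH INIT, CHECKSUM, COMPRESSION;

    -- and when a test run goes sideways, put things back the way they were
    RESTORE DATABASE TestDb
        FROM DISK = N'X:\Backups\TestDb_PreArchiveTests.bak'
        WITH REPLACE;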

    --Jeff Moden



  • Jeff Moden (11/30/2014)



    I couldn't agree more - again, based on experience. If you find that your approach isn't going to work and you don't have a backup to let you start fresh with another approach, you're left for a moment sitting and wondering why you didn't start the whole process by taking a backup. Then you get mad at yourself for not thinking of it sooner and vow to never do it again. You'll remember this moment just after the same thing happens again next time. After that, you'll remember it beforehand. Just don't forget to get rid of your old backups when you don't need them any more.

    I'm assuming you're deleting/archiving based on date here. I would try to build a clustered index on the table by date if it's not there already, and take into consideration the time it takes to build that index on a table this size. Ideally, clustered on your date column and nonclustered on your join columns.
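
    Something along these lines, with made-up names (note that if the primary key is currently the clustered index, it would have to be dropped or rebuilt as nonclustered first):

    -- cluster the parent table on the date you're archiving by
    CREATE CLUSTERED INDEX CIX_MainTable_DateAdded
        ON dbo.MainTable (DateAdded);

    -- nonclustered indexes on the child tables' join columns back to the parent
    CREATE NONCLUSTERED INDEX IX_ChildTable_MainID
        ON dbo.ChildTable (MainID);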

