Archiving

  • Hi

    I want to archive the data in a 400 GB table. What is the best way to archive it?

    My table has a clustered index on 2 fields (int, datetime). We only select from the last 2 years of data and insert into this table. We seldom select from the old data (for example, just 5 times a month) and never insert into it.

    I want to create a new table, insert the old data into it, and create indexes on both tables. Is that pointless?

    How can I archive the old data?

  • Big topic. You might want to start by looking up partitioning. Do you need to keep some of the records in the table "active"/editable while others are not? What kind of hardware are you working on? A single drive? Multiple drives?

  • mah_j (5/4/2013)


    I want to archive the data in a 400 GB table. What is the best way to archive it?

    What's the reason behind "archiving" if, as I understand it, all data would be online all the time?

    Either way, I agree with the previous poster: this might be a case for table partitioning, range partitioned by date, perhaps one partition per year.
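
    A minimal sketch of what "range partition by date, one partition per year" could look like in T-SQL. The table, column, function, and scheme names are hypothetical, the boundary dates are placeholders, and everything is mapped to the PRIMARY filegroup for brevity; treat it as an illustration rather than a tuned design.

    ```sql
    -- Hypothetical names throughout; boundary values are placeholders.
    -- RANGE RIGHT: each boundary date belongs to the partition on its right,
    -- so 3 boundaries give 4 partitions: < 2011, 2011, 2012, >= 2013.
    CREATE PARTITION FUNCTION pf_RowYear (datetime)
    AS RANGE RIGHT FOR VALUES ('20110101', '20120101', '20130101');

    -- For brevity every partition lands on PRIMARY; separate filegroups
    -- per partition are also possible.
    CREATE PARTITION SCHEME ps_RowYear
    AS PARTITION pf_RowYear ALL TO ([PRIMARY]);

    -- The partitioning column must be part of the unique clustered key.
    CREATE TABLE dbo.BigTable_Partitioned
    (
        Row_ID   int      NOT NULL,
        Row_Date datetime NOT NULL,
        CONSTRAINT PK_BigTable_Partitioned
            PRIMARY KEY CLUSTERED (Row_Date, Row_ID)
    ) ON ps_RowYear (Row_Date);
    ```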

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions in a non-production test environment, the advisor should not be held liable or responsible for any actions taken based on the given advice.
  • PaulB-TheOneAndOnly (5/4/2013)


    What's the reason behind "archiving" if as I understand it, all data would be online all the time?

    4 words. "Backups", "Restores", and "index maintenance".

    I'm going through this right now. We have a telephone system database where we're required to keep even several-years-old data and make it available online all the time. It's not a huge database (only 200GB) but the SLA to get it back online is much less than what a restore currently takes (yep, I test these things). The SLA also states (thanks to me pounding on people) that the system can be brought up with only 3 months' worth of history within the current recovery SLA and that the rest of the data can be added in an almost leisurely fashion (I'll likely do it by month).

    Table Partitioning (Enterprise Edition) would, of course, help a whole lot on index maintenance but won't do me any good for backups and restores because all of the partitions must be in the same database.

    Sooooo... to make a much longer story shorter, I'm going to use similar archiving techniques to move data out of the "active" database and into an "archive" database a month at a time (1 table per month). In this case, "archive" simply means "not active" and "read only". Since it's in a different database (there might be more than one: 1 database per year, 1 table per month seems logical for backup and restore purposes), I'll use the ol' partitioned view technique to make it all seem like a single table and to make it so I don't actually have to change the apps that are pointing at the current table.
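
    As a rough illustration of the partitioned view technique described above, the sketch below unions a current table in the active database with one monthly table in an archive database. All database, table, and column names (ActiveDB, Archive2012, CallLog_Current, CallLog_201201, CallLog_All) are hypothetical. It assumes both databases are on the same instance and that the current table carries a matching CHECK constraint on the date column, which is what lets the optimizer skip tables that cannot contain the requested dates.

    ```sql
    -- Hypothetical names; one archive table per month, one database per year.
    CREATE TABLE Archive2012.dbo.CallLog_201201
    (
        Call_ID   int      NOT NULL,
        Call_Date datetime NOT NULL
            CONSTRAINT CK_CallLog_201201_Date
            CHECK (Call_Date >= '20120101' AND Call_Date < '20120201'),
        CONSTRAINT PK_CallLog_201201
            PRIMARY KEY CLUSTERED (Call_Date, Call_ID)
    );
    GO

    USE ActiveDB;
    GO

    -- CREATE VIEW must be the first statement in its batch.
    -- Giving the view the name the applications already query would
    -- avoid application changes; a neutral name is used here.
    CREATE VIEW dbo.CallLog_All
    AS
    SELECT Call_ID, Call_Date FROM dbo.CallLog_Current
    UNION ALL
    SELECT Call_ID, Call_Date FROM Archive2012.dbo.CallLog_201201;
    GO
    ```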

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • mah_j (5/4/2013)


    I want to archive the data in a 400 GB table. What is the best way to archive it?

    Sometimes physically partitioning the rows into separate tables or partitions makes sense. However, indexing is also a form of partitioning. This problem could be solved by effective clustering of rows in the table and indexing on transaction date.

    Selecting a range of rows from a table filtered by clustered key is generally very fast, even with 100 million+ rows, unless the table is heavily fragmented. Check the level of fragmentation on the table. From what you've described, the table should be clustered based on a sequential ID or insert date/time for optimal querying and minimal fragmentation.
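
    One way to check fragmentation is the sys.dm_db_index_physical_stats function. The query below is a sketch that assumes the table is dbo.Sales_History (the dbo schema is an assumption) and uses the cheapest scan mode, which is usually the sensible starting point on a table this size.

    ```sql
    -- Assumes the table is dbo.Sales_History in the current database.
    -- 'LIMITED' reads only the upper index levels, so it is the cheapest mode.
    SELECT  i.name AS index_name,
            ps.index_id,
            ps.avg_fragmentation_in_percent,
            ps.page_count
    FROM    sys.dm_db_index_physical_stats(
                DB_ID(), OBJECT_ID(N'dbo.Sales_History'), NULL, NULL, 'LIMITED') AS ps
    JOIN    sys.indexes AS i
            ON  i.object_id = ps.object_id
            AND i.index_id  = ps.index_id
    ORDER BY ps.avg_fragmentation_in_percent DESC;
    ```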

    Also, experiment with a filtered index on whatever column is used for transaction date. You can even periodically drop / recreate this index using a different cutoff date.

    For example:

    create index ix_current on Sales_History ( Date_Of_Sale, Product_ID ) where Date_Of_Sale >= '20120101';
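
    Moving the cutoff periodically could be scripted along these lines. This is a sketch that assumes the dbo schema and a new cutoff of 1 January 2013; the filter predicate of a filtered index has to be a literal, not a variable, so a scheduled job would need to build the statement dynamically.

    ```sql
    -- Drop the old filtered index if it is there (dbo schema assumed).
    IF EXISTS (SELECT 1
               FROM   sys.indexes
               WHERE  name = N'ix_current'
                 AND  object_id = OBJECT_ID(N'dbo.Sales_History'))
        DROP INDEX ix_current ON dbo.Sales_History;

    -- Recreate it with the new cutoff; the date is a placeholder.
    CREATE INDEX ix_current
        ON dbo.Sales_History (Date_Of_Sale, Product_ID)
        WHERE Date_Of_Sale >= '20130101';
    ```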

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Eric M Russell (5/6/2013)


    Selecting a range of rows from a table filtered by clustered key is generally very fast, even with 100 million+ rows, unless the table is heavily fragmented. Check the level of fragmentation on the table. From what you've described, the table should be clustered based on a sequential ID or insert date/time for optimal querying and minimal fragmentation.

    To wit, the current clustered index on (int, datetime) should be reversed to (datetime, int), and it should be UNIQUE as well, to avoid the hidden uniquifier (a 4-byte value plus row overhead, up to about 8 bytes on rows with duplicate keys) that is added without it.
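
    A sketch of that change, assuming the existing clustered index is named CIX_Sales_History, that the int column is called Sale_ID, and that the (datetime, int) combination really is unique; all three are assumptions. DROP_EXISTING rebuilds the index in one operation, but on a table of this size it is still a long, log-heavy rebuild, and the statement fails if the index backs a PRIMARY KEY or UNIQUE constraint whose definition is being changed.

    ```sql
    -- Hypothetical index and column names; test on a copy first.
    -- Fails if (Date_Of_Sale, Sale_ID) contains duplicates.
    CREATE UNIQUE CLUSTERED INDEX CIX_Sales_History
        ON dbo.Sales_History (Date_Of_Sale, Sale_ID)
        WITH (DROP_EXISTING = ON);
    ```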

    --Jeff Moden



  • Jeff Moden (5/6/2013)


    To wit, the current clustered index on (int, datetime) should be reversed to (datetime, int), and it should be UNIQUE as well, to avoid the hidden uniquifier (a 4-byte value plus row overhead, up to about 8 bytes on rows with duplicate keys) that is added without it.

    And yet another addition... If you decide to move all of the old data to its own database and make it read-only, don't move it all at once. It will make your log very, very big, very, very fast 🙂 Do it in batches that make sense for the size of your log. I always start with batches of 10,000 rows for these things and do it in a loop (yes, one of the times a loop is helpful). We actually have our archives set up in 100GB databases on one server, where we simply create a new database at night, make it the "active" database, and then make the previous one read_only. I can explain more if you are interested, but I don't believe it is a good fit for most scenarios.
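
    The batching loop could look roughly like the sketch below. The archive database and table (ArchiveDB.dbo.Sales_History_Archive), the Sale_ID column, and the cutoff date are hypothetical, and the archive table is assumed to be on the same instance with no triggers, foreign keys, or CHECK constraints, since OUTPUT ... INTO does not allow those on its target.

    ```sql
    DECLARE @BatchSize int      = 10000,       -- starting point; tune to the log size
            @Cutoff    datetime = '20110101',  -- placeholder cutoff date
            @Rows      int      = 1;

    WHILE @Rows > 0
    BEGIN
        -- Each DELETE is its own transaction, so the log only has to hold
        -- one batch at a time (given regular log backups or SIMPLE recovery).
        DELETE TOP (@BatchSize)
        FROM   dbo.Sales_History
        OUTPUT deleted.Sale_ID, deleted.Date_Of_Sale, deleted.Product_ID
        INTO   ArchiveDB.dbo.Sales_History_Archive (Sale_ID, Date_Of_Sale, Product_ID)
        WHERE  Date_Of_Sale < @Cutoff;

        SET @Rows = @@ROWCOUNT;  -- stop once nothing older than the cutoff remains
    END;
    ```

    Moving and deleting in the same statement keeps each batch atomic: a row is either still in the active table or already in the archive, never in both.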

    Jared
    CE - Microsoft
