Partitioning a HUGE table

  • I'd like to get some input and thoughts on a partitioning project that I am working on. I've been reading a lot of articles and white papers, so I hope it will be a smooth execution once we try to implement it.

    We have a HUGE transaction table that is a good candidate for partitioning.

    Table Name: tblDetails

    Used Pages KB: 275,347,808 (Yeah, only for this single table)

    RowCount : 62,187,467

    This table has an XML column, which is most likely why the table is so huge. It also holds all current and old records, but there is really no date column in this table that can be used for partitioning; the date lives in another table (tblHeader). We are considering adding an int column (let's call it VaultID) that can have a value of 0 (default), 1, or 2, calculated from the date value in the header table. It is not a computed column, however; a job will just update it periodically.

    VaultID = 0 if the date in the header table is within the last 90 days.

    VaultID = 1 if the date is older than 90 days (by at least one second) but within 18 months.

    VaultID = 2 if the date is more than 18 months old.

    This is an existing table, so doing this may take a significant amount of time, and I am hoping we can execute it in the shortest time possible. (The steps outlined below may not be the shortest; if there's another way, please let me know. As a warning, I'm still a noob, but always eager to learn.)

    1. Prepare the database by adding the filegroups.

    2. Assign files to each filegroup.
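
    A minimal sketch of steps 1 and 2 (the database name, file path, and size below are placeholders, not our actual values):

    -- Step 1: one filegroup per vault
    ALTER DATABASE MyDatabase ADD FILEGROUP PartitionVault0;
    ALTER DATABASE MyDatabase ADD FILEGROUP PartitionVault1;
    ALTER DATABASE MyDatabase ADD FILEGROUP PartitionVault2;

    -- Step 2: at least one file in each filegroup
    ALTER DATABASE MyDatabase ADD FILE
        (NAME = Vault0_Data, FILENAME = 'E:\Data\Vault0_Data.ndf', SIZE = 10GB)
    TO FILEGROUP PartitionVault0;
    -- ...and likewise for PartitionVault1 and PartitionVault2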

    3. Create the partitioning function.

    CREATE PARTITION FUNCTION pf_VaultFunction(int)
    AS RANGE RIGHT FOR VALUES (0, 1, 2);

    4. Create the partitioning scheme.

    CREATE PARTITION SCHEME ps_VaultScheme
    AS PARTITION pf_VaultFunction
    TO (PartitionVault0, PartitionVault0, PartitionVault1, PartitionVault2);
    -- with RANGE RIGHT, the first partition holds values below 0 and stays empty,
    -- so it can share PartitionVault0; VaultID 0, 1, and 2 then land on their
    -- matching filegroups

    5. Drop the foreign keys that reference the primary key of tblDetails (to be re-added later).

    6. Drop the existing clustered constraint WITH (MOVE TO ...) so the data lands on the first filegroup, PartitionVault0.

    7. Add DetailID back as the primary key (non-clustered).

    8. Add the VaultID column and populate it based on the date in the header table.
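
    A rough sketch of step 8; the join key (HeaderID) and date column (TransactionDate) are stand-ins for the real column names:

    ALTER TABLE tblDetails ADD VaultID int NOT NULL
        CONSTRAINT DF_tblDetails_VaultID DEFAULT (0);

    -- single pass shown here; in production this would be batched to keep the log manageable
    UPDATE d
    SET d.VaultID = CASE
            WHEN h.TransactionDate >= DATEADD(DAY, -90, GETDATE()) THEN 0
            WHEN h.TransactionDate >= DATEADD(MONTH, -18, GETDATE()) THEN 1
            ELSE 2
        END
    FROM tblDetails AS d
    JOIN tblHeader AS h ON h.HeaderID = d.HeaderID;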

    9. Create the clustered index on DetailID and VaultID using the partitioning scheme:

    CREATE CLUSTERED INDEX IX_myIndexName ON tblDetails (DetailID, VaultID)
    ON ps_VaultScheme(VaultID);

    10. Re-add the foreign key constraints that were dropped in step 5.
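
    For step 10, each re-add would look roughly like this (the child table and constraint names are hypothetical):

    ALTER TABLE tblSomeChild WITH CHECK
        ADD CONSTRAINT FK_tblSomeChild_tblDetails
        FOREIGN KEY (DetailID) REFERENCES tblDetails (DetailID);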

    Any thoughts or ideas are welcome. Oh yeah, why am I doing this:

    1. Whenever we perform index maintenance, it bloats our log files big time and takes a long while to do. So we are hoping to make Vault1 and Vault2 read-only, and to perform index maintenance only on Vault0 (and PRIMARY as well); see the sketch after this list.

    2. Hoping to do only a partial restore to other environments (QA and DEV) and not have to worry about whether there's enough disk space to accommodate the whole database.

    3. For performance purposes. Included is a plan to change queries to include VaultID = 0 in the WHERE clause.
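
    A sketch of what reasons 1 and 2 could look like once the table is partitioned (MyDatabase and the backup path are placeholders; PARTITION = 2 assumes the four-partition layout above, where partition 2 holds VaultID = 0):

    -- make the older vaults read-only so they need no further maintenance or log activity
    ALTER DATABASE MyDatabase MODIFY FILEGROUP PartitionVault1 READ_ONLY;
    ALTER DATABASE MyDatabase MODIFY FILEGROUP PartitionVault2 READ_ONLY;

    -- rebuild only the active partition
    ALTER INDEX IX_myIndexName ON tblDetails REBUILD PARTITION = 2;

    -- piecemeal restore of just the active filegroups into QA/DEV
    RESTORE DATABASE MyDatabase
        FILEGROUP = 'PRIMARY', FILEGROUP = 'PartitionVault0'
    FROM DISK = 'E:\Backups\MyDatabase.bak'
    WITH PARTIAL, RECOVERY;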

    Any thought of why we shouldn't do it this way is also welcome, and I do appreciate and welcome suggestions.

    THANKS SO MUCH!

  • I'm not a fan of the idea that you will be partitioning the data on a column that will be updated. This will create overhead in the storage engine when it has to move rows of data between partitions.

    We tend to implement static partitions and switch out old partitions/create new ones using a scheduled job. I would consider converting the date in the header table into an int and partitioning on this value; the job can then add and remove partitions as you reach a threshold.
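
    To illustrate the idea (all names and boundary values here are made up, assuming a RANGE RIGHT function on an int date key like 201010 and a staging table on the same filegroup as the oldest partition):

    -- add next month's boundary at the hot end
    ALTER PARTITION SCHEME ps_DateScheme NEXT USED PartitionVaultNew;
    ALTER PARTITION FUNCTION pf_DateFunction() SPLIT RANGE (201011);

    -- retire the oldest partition: switch it out (a metadata-only operation), then drop the boundary
    ALTER TABLE tblDetails SWITCH PARTITION 1 TO tblDetails_Archive;
    ALTER PARTITION FUNCTION pf_DateFunction() MERGE RANGE (200904);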

  • RowCount : 62,187,467

    Our current rowcount is 1,169,466,138. It's a matter of scale. 😉

  • Creating the clustered index might take way too long and use a lot of space (you'll need at the very least the size of your table in extra free space); it might be quicker to do a bcp out, truncate, and bcp in.

    It would really help to get the date directly in the table instead of updating a column.

    With the date in the table you could use sliding partitions and require no row moving at all.

    You'll need a few partition maintenance jobs though (but that's easy to do).

    If you upgrade to SQL 2008 you could also use data compression (XML compresses very well); this would end up speeding things up (depending on the amount of reads/writes you do on the XML field).
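
    For reference, turning compression on in 2008 is a single rebuild per table or partition (Enterprise Edition only; this is an illustration, not something tested against this table):

    ALTER TABLE tblDetails REBUILD PARTITION = ALL
        WITH (DATA_COMPRESSION = PAGE);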

  • Oliiii (10/21/2010)


    Creating the clustered index might take way too long and use a lot of space (you'll need at the very least the size of your table in extra free space), it might be quicker to do a bcp out, truncate, bcp in.

    The creation of the clustered index shouldn't add any additional space. It appears to be the method of moving the data onto the partition scheme.

  • Of course it does use a ton of space; creating a clustered index is like having a temp copy of your table.

    If you have a 300GB table and you need to create a clustered index on it, it'll require at least an extra 300GB of free space.

  • I see what you mean as a temporary space requirement. I had to retest this to confirm my original interpretation on space was right 🙂

    -- build a heap where each 8,000-byte row fills a page
    create table temp2 (uid int identity(1,1), col char(8000))

    insert into temp2 (col) values ('aaaaaaaa')
    -- GO <n> repeats the preceding batch n times
    GO 100000

    -- space used before the clustered index exists
    exec sp_spaceused temp2
    GO

    ALTER TABLE dbo.temp2 ADD CONSTRAINT PK_temp2 PRIMARY KEY CLUSTERED (uid)
    WITH (STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF,
          ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    GO

    -- space used after the build completes
    exec sp_spaceused temp2

  • Thanks a lot for all the replies.

    I really like the idea of having the date column and using it as the partition column. The first idea was actually to use an existing date column for the partitioning, until I found, while creating the clustered index, that some records have a NULL value in that column. 🙁

    So it seems using the date is back on the drawing board, together with the sliding window strategy: either bring the date in from the header or use the existing column, just making sure there's no NULL value (instead of NULL, maybe a future date can be used?).

    Just a question, though, on updating the column versus having a date column and using the sliding strategy: don't both incur overhead in the storage engine when rows move between partitions? Or does sliding basically just change pointers, and is therefore faster without the same overhead?

    The upgrade to 2008 is already being looked at, but the size of the database is just a killer, especially when it needs to be replicated across different environments. 🙁 And then there's the size of the log that gets generated during index maintenance.

    THANKS A LOT!

  • I'm now testing the duration of the bcp out, truncate, and bcp in as well.

    Meanwhile, the drop constraint with MOVE TO another filegroup is just taking forever; it's been 17 hours and it's not done yet.
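
    For reference, the round trip being tested looks roughly like this (server, database, and path are placeholders; native mode, -n, preserves the XML column):

    bcp MyDatabase.dbo.tblDetails out E:\Export\tblDetails.dat -n -S MyServer -T

    (truncate and rebuild the table on the partition scheme, then)

    bcp MyDatabase.dbo.tblDetails in E:\Export\tblDetails.dat -n -S MyServer -T -b 50000 -h "TABLOCK"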

  • SqlN00bie (10/20/2010)


    We have a HUGE transaction table that is a good candidate for partitioning.

    RowCount : 62,187,467

    Firstly... and not trying to hurt your feelings, but as already pointed out by other posters, this does not really qualify as "huge".

    Secondly... why is it a "good candidate for partitioning"?

    Always remember: you partition a table for one or the other of these good reasons, or both:

    1- To help during data retrieval.

    2- To help during archiving/purging.

    Ask yourself: would partitioning help in either of these ways? Why?

    Then decide 🙂

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test, non-production environment, the advisor should not be held liable or responsible for any actions taken based on the given advice.
  • Hi Paul,

    No offense taken, any comment is really welcome.

    For the row count alone, there are probably other tables in the database that have more rows than this particular table I'm hoping to partition; the huge factor (for me) is not just the rows but the volume as well. This table alone is closing in on 300GB. Index maintenance, backup, and restore are taking a long time (even with third-party tools).

    As for the reasons I've already stated in the first posting, the archival, retrieval, and purging are probably the biggest factors. There's historical data that we want to keep available at all times, yet we can put it on a dormant partition (it doesn't need index maintenance or frequent backups because it doesn't change anymore, and it can be set to read-only). Hopefully, the table and indexes will be aligned so we can restore partially into another environment. (With that alignment said, it looks like recreating the table, bcp out, and bcp in may be the route.)

    Also, this database used to be mirrored to another site, and when index maintenance generated a big transaction log, we couldn't synchronize it easily. We tried to break the mirror and bring over the missing log to at least catch up... apparently, the bandwidth couldn't even handle a transfer of a 2GB file... and well, there's still a long story to that scenario, but you can already visualize it... 😀

    Given this scenario, is there any recommendation other than partitioning? I really welcome and appreciate any suggestions/comments.

    Thanks so much!

  • You certainly have a strong case for partitioning.

    As you already pointed out, partitioning is a great tool for archiving/purging purposes: the fact that you can switch a whole partition in or out of a table for archiving, added to the fact that you can effectively truncate a partition for purging, makes this kind of maintenance a walk in the park.

    For archiving purposes, read-only partitions or (if possible) a read-only archiving database may be of great help in terms of taking time off of backup processes, taking concurrency off the OLTP database and improving performance on archived data.

    I think you are on the right path.

    _____________________________________
    Pablo (Paul) Berzukov

  • When you create the non-clustered index first, it will immediately be invalidated by the creation of the clustered index and, as such, be useless.

    All your indexes should be created on the partition scheme. If you don't align them, you can't switch the partitions.

    A corrected order would look more like this:

    7. Add the clustered index.

    8. Add the non-clustered index.

    It's not clear to me that you are creating your clustered index to include the column on which you built your partitions; that's a requirement, because a clustered index contains the table's data and your partitions are built on a specific column, so the structures must match up.

    In other words, when you create a clustered index you order the table, and that means you set the order of rows in the partitions by the clustered index.
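
    To make the corrected order concrete (index names from the original post; UX_tblDetails_DetailID is a made-up name): a unique index placed on the partition scheme must include the partitioning column in its key, which is why VaultID appears in both indexes below, and why a primary key on DetailID alone would have to live off the scheme (unaligned).

    -- 7. clustered index first, built directly on the partition scheme
    CREATE CLUSTERED INDEX IX_myIndexName ON tblDetails (DetailID, VaultID)
    ON ps_VaultScheme(VaultID);

    -- 8. then the aligned non-clustered unique index
    CREATE UNIQUE NONCLUSTERED INDEX UX_tblDetails_DetailID
    ON tblDetails (DetailID, VaultID)
    ON ps_VaultScheme(VaultID);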

  • Thank you so much Paul. It helps a lot when someone (especially of your caliber) confirms that this is the right path.

    dg81328, thank you very much for pointing that out, regarding building the clustered index first, before the non-clustered primary key. That is a very important key to the process that I'm hoping to outline.

    And thanks to all the others who commented as well; it's very appreciated.

    THANKS!!!
