Cluster Index change on a table with 1 billion rows

  • We need to change the cluster index definition on a table with 1 billion rows. This will be performed on SQL2016 Enterprise edition.

    Initially , I wanted to do this via table migration, creating a secondary table, move data on the fly, and during a 1 hr downtime window, swap the tables, rename, add missing rows , done!

    Recently, I was advised by a coworker of using ONLINE option. So instead of creating a secondary table, then move 1 billion rows, sync everything and swap during maintenance windows, I will just drop existing Indexes online then create new versions online as well.

    Which scenario , same given database, will provide the best concurrency for the client and why? I will presume the table migration method may be faster? I'm afraid the ONLINE index creation and drop may temporarily leave the table as a heap.

    Of course, I will test all this, just ordering my thoughts.

  • sql-lover (7/20/2016)


    We need to change the cluster index definition on a table with 1 billion rows. This will be performed on SQL2016 Enterprise edition.

    Initially , I wanted to do this via table migration, creating a secondary table, move data on the fly, and during a 1 hr downtime window, swap the tables, rename, add missing rows , done!

    Recently, I was advised by a coworker of using ONLINE option. So instead of creating a secondary table, then move 1 billion rows, sync everything and swap during maintenance windows, I will just drop existing Indexes online then create new versions online as well.

    Which scenario , same given database, will provide the best concurrency for the client and why? I will presume the table migration method may be faster? I'm afraid the ONLINE index creation and drop may temporarily leave the table as a heap.

    Of course, I will test all this, just ordering my thoughts.

    How many GB in the table/CI and how many non-clustered indexes are there?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (7/20/2016)


    sql-lover (7/20/2016)


    We need to change the cluster index definition on a table with 1 billion rows. This will be performed on SQL2016 Enterprise edition.

    Initially , I wanted to do this via table migration, creating a secondary table, move data on the fly, and during a 1 hr downtime window, swap the tables, rename, add missing rows , done!

    Recently, I was advised by a coworker of using ONLINE option. So instead of creating a secondary table, then move 1 billion rows, sync everything and swap during maintenance windows, I will just drop existing Indexes online then create new versions online as well.

    Which scenario , same given database, will provide the best concurrency for the client and why? I will presume the table migration method may be faster? I'm afraid the ONLINE index creation and drop may temporarily leave the table as a heap.

    Of course, I will test all this, just ordering my thoughts.

    How many GB in the table/CI and how many non-clustered indexes are there?

    And how much space do you have available on your drives?

  • -1 CI

    -1 NCI

    -1 PK (which is non clustered Index and surrogate key)

    So a total of 3 indexes.

    The CI is 400GB which is 6 billion records so I was wrong, is not 1 billion records.

  • I have all the space ll I need. Expandable LUNs from a new SAN we bought. I'm not worried about space, space is not a problem. I am trying to decide what's the less invasive and fastest way to do this, to minimize client's disruptions as much as I can.

  • CREATE CLUSTERED INDEX ...

    WITH (DROP_EXISTING = ON)

    will probably be the least disruptive, but the log impact will not be small (estimate 2x the size of the data), you'll need probably the same free in the data file and it won't be quick

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (7/20/2016)


    CREATE CLUSTERED INDEX ...

    WITH (DROP_EXISTING = ON)

    will probably be the least disruptive, but the log impact will not be small (estimate 2x the size of the data), you'll need probably the same free in the data file and it won't be quick

    Thanks Gail

    Does above command will still apply if the new cluster index is different than the original one? Let's say... existing Index is on col1, col2. New one will be only on col1.

    Also... will the DROP_EXISTING allows clients to read the table during the process?

    Good point about the Tlog space utilization. I am testing all this with the actual table on Dev in order to get an estimate.

  • sql-lover (7/20/2016)


    Does above command will still apply if the new cluster index is different than the original one? Let's say... existing Index is on col1, col2. New one will be only on col1.

    Yes, that's the point of a CREATE .. WITH DROP EXISTING.

    Also... will the DROP_EXISTING allows clients to read the table during the process?

    No, drop_existing just does what it says. Add ONLINE = ON for a mostly online index recreation

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (7/20/2016)


    sql-lover (7/20/2016)


    Does above command will still apply if the new cluster index is different than the original one? Let's say... existing Index is on col1, col2. New one will be only on col1.

    Yes, that's the point of a CREATE .. WITH DROP EXISTING.

    Also... will the DROP_EXISTING allows clients to read the table during the process?

    No, drop_existing just does what it says. Add ONLINE = ON for a mostly online index recreation

    Thanks

    Then I am back to square 1 and my initial question. Which one will provide the best concurrency? The table migration or the CREATE INDEX with ONLINE clause ?

    This is a very important client. We just want clients to continue using , even when will be slow, the information in that table. The ONLINE clause may provide that during the DROP and CREATE phase, but what about during the time that the table becomes a heap? After successfully dropped, I guess the table will have no cluster and querying a table with 6 billion rows and no cluster will produce time outs. In my opinion, swapping the table and adding whatever missing rows to the staging table may be less intrusive, so that's what I wanted to get a 2nd opinion on.

  • sql-lover (7/20/2016)


    GilaMonster (7/20/2016)


    sql-lover (7/20/2016)


    Does above command will still apply if the new cluster index is different than the original one? Let's say... existing Index is on col1, col2. New one will be only on col1.

    Yes, that's the point of a CREATE .. WITH DROP EXISTING.

    Also... will the DROP_EXISTING allows clients to read the table during the process?

    No, drop_existing just does what it says. Add ONLINE = ON for a mostly online index recreation

    Thanks

    Then I am back to square 1 and my initial question. Which one will provide the best concurrency? The table migration or the CREATE INDEX with ONLINE clause ?

    This is a very important client. We just want clients to continue using , even when will be slow, the information in that table. The ONLINE clause may provide that during the DROP and CREATE phase, but what about during the time that the table becomes a heap? After successfully dropped, I guess the table will have no cluster and querying a table with 6 billion rows and no cluster will produce time outs. In my opinion, swapping the table and adding whatever missing rows to the staging table may be less intrusive, so that's what I wanted to get a 2nd opinion on.

    There's still more to consider.

    CREATE INDEX is fully logged in the FULL Recovery Model. If you switch to the BULK LOGGED recovery model, you might keep from blowing out your log file to more than 400GB because CREATE INDEX is minimally logged in that and the SIMPLE Recovery Model (and I don't recommend going to the SIMPLE recovery model... not worth breaking the log chain for this). The greatly reduced logging will help the speed quite a bit.

    Also, is your system setup for "Instant File Initialization"? That's going to help performance if it is. Related non-clustered indexes will auto-magically be rebuilt only once for this evolution if you use the WITH DROP EXISTING option Gail spoke of.

    As for the ONLINE option... it's going to make this evolution a whole lot slower, which exposes it to more opportunity to fail. I'd be tempted to do what you said and do it a million or two at a time with a final "lock the table and do a true-up". The disadvantage there is that the log file would suffer. No decision on this type of large table manipulation is going to be easy. Too many caveats. Take advantage of one thing and it kills another advantage.

    Shifting gears a bit, before you consider rebuilding the CI and, inherently, all of the NCIs, a 400GB table is starting to get into the realm of "BIG TABLE'. Especially if the table is a date based audit table or a table with mostly static data (which could also include the likes of Invoice/Invoice detail types of tables), consider using this opportunity to shift to either Partitioned Tables or Partitioned Views (even though a bit more work, Partitioned Views have some serious advantages that Partitioned Tables do not, IMHO).

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (7/20/2016)


    sql-lover (7/20/2016)


    GilaMonster (7/20/2016)


    sql-lover (7/20/2016)


    Does above command will still apply if the new cluster index is different than the original one? Let's say... existing Index is on col1, col2. New one will be only on col1.

    Yes, that's the point of a CREATE .. WITH DROP EXISTING.

    Also... will the DROP_EXISTING allows clients to read the table during the process?

    No, drop_existing just does what it says. Add ONLINE = ON for a mostly online index recreation

    Thanks

    Then I am back to square 1 and my initial question. Which one will provide the best concurrency? The table migration or the CREATE INDEX with ONLINE clause ?

    This is a very important client. We just want clients to continue using , even when will be slow, the information in that table. The ONLINE clause may provide that during the DROP and CREATE phase, but what about during the time that the table becomes a heap? After successfully dropped, I guess the table will have no cluster and querying a table with 6 billion rows and no cluster will produce time outs. In my opinion, swapping the table and adding whatever missing rows to the staging table may be less intrusive, so that's what I wanted to get a 2nd opinion on.

    There's still more to consider.

    CREATE INDEX is fully logged in the FULL Recovery Model. If you switch to the BULK LOGGED recovery model, you might keep from blowing out your log file to more than 400GB because CREATE INDEX is minimally logged in that and the SIMPLE Recovery Model (and I don't recommend going to the SIMPLE recovery model... not worth breaking the log chain for this). The greatly reduced logging will help the speed quite a bit.

    Also, is your system setup for "Instant File Initialization"? That's going to help performance if it is. Related non-clustered indexes will auto-magically be rebuilt only once for this evolution if you use the WITH DROP EXISTING option Gail spoke of.

    As for the ONLINE option... it's going to make this evolution a whole lot slower, which exposes it to more opportunity to fail. I'd be tempted to do what you said and do it a million or two at a time with a final "lock the table and do a true-up". The disadvantage there is that the log file would suffer. No decision on this type of large table manipulation is going to be easy. Too many caveats. Take advantage of one thing and it kills another advantage.

    Shifting gears a bit, before you consider rebuilding the CI and, inherently, all of the NCIs, a 400GB table is starting to get into the realm of "BIG TABLE'. Especially if the table is a date based audit table or a table with mostly static data (which could also include the likes of Invoice/Invoice detail types of tables), consider using this opportunity to shift to either Partitioned Tables or Partitioned Views (even though a bit more work, Partitioned Views have some serious advantages that Partitioned Tables do not, IMHO).

    We do use SIMPLE recovery model. Please don't ask me why, lol...

    It took me several years to convince upper management and get Enterprise. So yes, I am already discussing partitioning and compression.

    Yes, I've done the table swap/migration dozens of time at work for similar situations. It takes around 5 min to insert the missing rows. The table is heavily read, but insertions are not so frequent.

    This table is bloated because has many years of information.

    Just out of curiosity. Doing the drop online, create online, will temporary leave the table as a heap? Yes or not?

  • If you do an explicit DROP INDEX, then yes... it will cycle through a heap stage, which will cause all of the NCIs to rebuild and point at the heap. Then when you do a create, it will cause the NCIs to rebuild again to repoint to the CI.

    If you use Gail's suggestion of CREATE INDEX WITH DROP EXISTING, then you won't be made to suffer through a heap stage and the NCIs will only rebuild once (assuming you change the key columns of the new CI).

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Shifting gears, I've found that compression generally causes the queries that we use to be twice as slow. Not worth it for me even for the legacy data, which we use quite a bit. Backups should almost always be compressed, IMHO.

    Also, although you've got some great help on this thread, you should verify it all by looking up things like CREATE INDEX with all the different options. If nothing else, it's good to be able to point people to official documentation.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • sql-lover (7/20/2016)


    Then I am back to square 1 and my initial question. Which one will provide the best concurrency? The table migration or the CREATE INDEX with ONLINE clause ?

    The create with drop_existing will probably be the least disruptive, you just need to add the online clause that I mistakenly left out of my post.

    The ONLINE clause may provide that during the DROP and CREATE phase, but what about during the time that the table becomes a heap?

    What time that the table becomes a heap? I don't think anyone's suggesting dropping the cluster and then, as a second step, creating a new one.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • I am very curious to learn how you proceeded once you have completed it. Please get back to us on this.

    Regards

Viewing 15 posts - 1 through 15 (of 18 total)

You must be logged in to reply to this topic. Login to reply