Rebuilding large database tables and indexes

  • I have come across a database system which isn't designed to work optimally. It is fairly large (~400GB), and the performance of loading and querying is degrading (improper data types, fragmented indexes, a non-unique clustering key and other problems). So I have quite a task in front of me, but I am up for the challenge. I figure this is not a unique situation and many of us have come across it before, so I am looking for gotchas and advice. I have done this before too, but only for smaller databases; some of the operations here I expect to take a couple of hours or more to complete (depending on load, infrastructure speed etc., I know).

    My plan is thus:

    + Take a full backup of the database

    + Set the recovery model of the DB to simple

    + Drop non-clustered indexes

    + Drop clustered indexes

    + Remove PKs (wrong data types, too large!)

    + Narrow data types (add a new column, update it in batches to the old column's value, drop the old column, rename the new column to the old name; see the sketch after this list)

    + Add PKs, which will create clustered indexes automatically based on PK ID

    + Create non-clustered indexes

    + Run DBCC SHRINKDATABASE (under normal operations I would never do this, but this is a special case; ensure the files, especially the log, are brought back to a sensible size after all those table modifications...)

    + Set the recovery model of the DB to Full

    + Ensure everything works OK or better
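
    For the narrowing step, a minimal sketch of the batched approach, assuming a hypothetical dbo.Loads table with a BIGINT Id being narrowed to INT (names and batch size are illustrative only):

        -- Add the narrower column alongside the old one.
        ALTER TABLE dbo.Loads ADD IdNew INT NULL;
        GO

        -- Copy values across in small batches to keep each transaction,
        -- and therefore the log, small (assumes Id is NOT NULL).
        DECLARE @rows INT = 1;
        WHILE @rows > 0
        BEGIN
            UPDATE TOP (10000) dbo.Loads
            SET    IdNew = CAST(Id AS INT)
            WHERE  IdNew IS NULL;
            SET @rows = @@ROWCOUNT;
        END;
        GO

        -- Swap the columns over once everything is copied.
        ALTER TABLE dbo.Loads DROP COLUMN Id;
        EXEC sp_rename 'dbo.Loads.IdNew', 'Id', 'COLUMN';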

    How is my plan? Thoughts?

  • I'm not sure why you're choosing to put all the clustered indexes on the identity columns. I would only do this if that was also the primary path to the data. Otherwise, if you have a more common access path to the data (foreign keys, maybe), I might go with that as the clustered index instead of the PK (even if I did leave the identity on there as the PK). In short, the process, from what I can see, doesn't include investigation and evaluation; it looks like you're just applying a single standard across all the tables.
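
    As a rough illustration of that idea (table and column names are hypothetical, not from this system), the identity can stay as a nonclustered PK while the common access path gets the clustered index:

        -- Keep the identity as the PK, but nonclustered...
        CREATE TABLE dbo.OrderLines
        (
            OrderLineId INT IDENTITY(1,1) NOT NULL,
            OrderId     INT NOT NULL,  -- the FK most queries filter/join on
            Quantity    INT NOT NULL,
            CONSTRAINT PK_OrderLines PRIMARY KEY NONCLUSTERED (OrderLineId)
        );

        -- ...and cluster on the more common access path instead.
        CREATE CLUSTERED INDEX CIX_OrderLines_OrderId
            ON dbo.OrderLines (OrderId);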

    Also, I'd probably still leave the database in FULL recovery, but increase the frequency of log backups while I'm working on it. Otherwise you're basically putting the database into a state where you can't recover except from the backup taken at the start of the process. If we're talking very long-running processes, in the event of an emergency it would probably help to be able to recover some of the work rather than having to redo all of it.
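
    A minimal sketch of that approach, with a hypothetical database name and path (in practice this would run from an Agent job every few minutes, with unique file names):

        -- Stay in FULL recovery and keep the log chain short
        -- instead of switching to SIMPLE.
        BACKUP LOG BigDb
        TO DISK = N'X:\Backups\BigDb_log_20150708_1015.trn';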

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • While 'nuke from orbit' is a good strategy for some problems, it's not a good strategy when doing index changes, especially clustered index changes. Make such significant changes and you'll be hard pressed to fix any degradation that will happen.

    Take it one step at a time, identify the worst problems, fix the worst problems, repeat until no one's complaining. It'll take longer, but probably has a better chance of succeeding overall.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Thanks Grant. I am up to the investigation stage really; I have working tables structured the way I want (in test) now and am checking them out.

    Clustering on an FK on one table actually is a great idea; I may just do that... thanks! Although currently the relationship is not enforced, only implied.

    More info re clustered index...

    The clustered index was previously on a single date field. The DB itself is loaded with ~100k records every day, which are then manipulated, sometimes reloaded (occasionally days later, hence data landing in the middle of the index), then used in select queries and complex self-joins. The DB is a kind of cross between a DW and an OLTP system: used mostly as a DW, with a sprinkling of random updates/inserts/deletes, deleting data (mostly from the end, but sometimes from the middle) and reloading.

    I figured the clustered index didn't fit the normal design guidelines (unique/ever-increasing/narrow/static): it isn't unique, so a uniqueifier gets added to all the non-clustered indexes (there are about five, which are also considerably fragmented). It appeared that outlier loads (which sometimes took a few hours to finish, straying into the time the business would want to use the system) were caused by inserting data at earlier dates. This, I figure, requires moving a whole lot of data to ensure the data with a later date comes after the newly inserted rows. My thinking is that with an ever-increasing clustering key, the data will never have to be reordered when 100k loads are dumped into the middle of the index (2k at a time, as data comes from different systems). Am I wrong?

    Once loaded, most of the data is searched within a single date, hence I plan to put a non-clustered index on the date field. Range scans should still be possible, I figure, as all the data for a single date will be next to each other, hopefully not affecting IO so much. Having structured the data the way I want in test and checked the before/after performance, the DB is much smaller and much faster. I'm saving around 70GB at the moment, with potential for another 20-30GB; about 1/3 of this is index space savings. A 25% reduction in space may account for the performance difference alone, though...
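
    For reference, a minimal sketch of the structure I'm describing (dbo.Loads and the column names are placeholders, not the real schema):

        -- Ever-increasing surrogate key as the clustering key...
        CREATE TABLE dbo.Loads
        (
            LoadId   INT IDENTITY(1,1) NOT NULL,
            LoadDate DATE              NOT NULL,
            Payload  VARCHAR(100)      NULL,
            CONSTRAINT PK_Loads PRIMARY KEY CLUSTERED (LoadId)
        );

        -- ...plus a non-clustered index to support the single-date queries.
        CREATE NONCLUSTERED INDEX IX_Loads_LoadDate
            ON dbo.Loads (LoadDate);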

  • Hi Gail, thanks, normally I would agree! This is something I am working on in "downtime" between being handed more important work. I won't go further than I need, but would like to get things in a state where they will keep working for the next 5 years or so 🙂

    People are starting to grumble and I want to set it right. This is only occurring on a couple of tables out of about 10 that could be worked on, so I am biting them off bit by bit 😀

  • Oh, one last thing I will do:

    + Add a rebuild/reorganise maintenance task which will detect and rebuild (occasionally) or reorganise indexes as needed. Current fragmentation on some of these indexes is 95%!!!!
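
    A minimal sketch of the detection side of such a task; the 5%/30% thresholds are common rules of thumb, not anything mandated here:

        SELECT OBJECT_NAME(s.object_id)          AS table_name,
               i.name                            AS index_name,
               s.avg_fragmentation_in_percent,
               CASE WHEN s.avg_fragmentation_in_percent > 30 THEN 'REBUILD'
                    WHEN s.avg_fragmentation_in_percent > 5  THEN 'REORGANIZE'
                    ELSE 'OK' END                AS suggested_action
        FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS s
        JOIN sys.indexes AS i
          ON i.object_id = s.object_id
         AND i.index_id  = s.index_id
        WHERE s.avg_fragmentation_in_percent > 5
          AND s.page_count > 1000;  -- ignore tiny indexes; they fragment harmlessly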

  • If you have lots of queries which filter on that date, it may well be a very good place for the clustered index.

    Narrow, unique, ever-increasing, unchanging are guidelines, not set-in-stone rules and, of the four, the unique one is the one I break most often. I'm quite fond of clustered indexes on date columns when most of the queries filter by that date.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • My two cents: start with your biggest tables. 400GB databases usually have just a few tables that contribute to the bulk of the issues.

    Try partitioning the large tables; it will improve OLTP workloads, and you can compress or archive old data if that's how the system works.
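
    A minimal sketch of date partitioning (hypothetical names, and monthly boundaries chosen purely for illustration):

        CREATE PARTITION FUNCTION pf_LoadDate (DATE)
        AS RANGE RIGHT FOR VALUES ('2015-01-01', '2015-02-01', '2015-03-01');

        CREATE PARTITION SCHEME ps_LoadDate
        AS PARTITION pf_LoadDate ALL TO ([PRIMARY]);

        -- Rebuilding the clustered index onto the scheme partitions the table:
        -- CREATE CLUSTERED INDEX CIX_Loads ON dbo.Loads (LoadDate)
        --     WITH (DROP_EXISTING = ON) ON ps_LoadDate (LoadDate);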

    Have you considered clustered columnstore indexes? You might not need any other indexes on the tables in that case.

    Approach the problem just like SQL would: take a big problem and split it into smaller problems. Just addressing issues on a few tables might give you a more noticeable difference than trying to do all the stuff mentioned earlier.

    If it's just to keep end users happy, run a trace, find the long-running queries, and go from there.
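
    As a lighter-weight alternative to a trace, the plan cache can surface the same candidates (a sketch; works on 2008, and elapsed times are in microseconds):

        SELECT TOP (20)
               qs.execution_count,
               qs.total_elapsed_time / qs.execution_count AS avg_elapsed_us,
               SUBSTRING(st.text, qs.statement_start_offset / 2 + 1, 200) AS statement_start
        FROM sys.dm_exec_query_stats AS qs
        CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
        ORDER BY avg_elapsed_us DESC;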

    Jayanth Kurup

  • Hi Jayanth,

    Unfortunately we're on 2008 🙁 so can't use columnstore indexes, damn. These big tables are often updated too, so columnstore wouldn't work anyway (it's read-only once put on a table). The partitioning and archiving of old data are good ideas for the future though, something I had thought about but will have to implement later, when the users begin indicating they don't need old data or when the DB grows too big.

    Regarding the clustered indexes... sometimes the data is reinserted into the middle of the index, bits at a time. So my clustered index currently looks like this (each of these dates is repeated 100k times or so!):

    2015-05-01
    2015-05-02
    2015-05-03
    2015-05-04
    2015-05-06
    2015-05-07
    2015-05-08

    2015-05-05 is then loaded, meaning all the data for 06 to 08 needs to be shifted. Not just once, but for every insert that occurs, maybe 20 batches of 5k each, as data comes from different sources for the same date. This is slow and annoying. Instead I figure having a clustered index on a surrogate key only (dates provided for reference) will lead to better loading performance:

    1 2015-05-01
    2 2015-05-02
    3 2015-05-03
    4 2015-05-04
    6 2015-05-06
    7 2015-05-07
    8 2015-05-08

    2015-05-05 was deleted (a single operation), then is reloaded, meaning it should just be appended to the end:

    9 2015-05-05

    Later dates don't have to be shifted, so performance should be better.

    An NC index on the date will then store the clustering key where the date starts and ends. When a query has a where condition by date, it should scan this NC index, identifying the start/end location of the data, and scan within it (note to self: check this does happen!).

    Is my thinking correct? I am looking for better performance when loading dates into odd places in the data set, lower storage requirements (a non-unique clustering key means a uniqueifier is added to all NC indexes; without it I have fewer pages to read, hence higher performance) AND good query performance within a single date.

  • It sounds like you're largely on track.

    On the implied relationship, I'd get that enforced if possible. The more tools you give the optimizer, the better the results.
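
    A minimal sketch of what enforcing it looks like (hypothetical table and column names); WITH CHECK validates the existing rows, which is what lets the optimizer trust the constraint:

        ALTER TABLE dbo.Loads WITH CHECK
        ADD CONSTRAINT FK_Loads_Batches
            FOREIGN KEY (BatchId) REFERENCES dbo.Batches (BatchId);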

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Brent Leslie (7/8/2015):

    2015-05-05 is then loaded, meaning all the data for 06 to 08 needs to be shifted.

    Err, no. You will probably get some page splits, but SQL doesn't re-insert anything greater than the currently inserted value. Index keys enforce logical order of the rows in the index, not physical.

    Instead I figure having a clustered index on a surrogate key only (dates provided for reference) will lead to better loading performance:

    Maybe slightly, because of fewer page splits. But if you improve loading performance by 10% and degrade queries by 50%, is that a success?

    An NC index on the date will then store the clustering key where the date starts and ends. When a query has a where condition by date, it should scan this NC index, identifying the start/end location of the data, and scan within it (note to self: check this does happen!).

    No, that's not how indexes work. Firstly, you don't want an index scan; that's inefficient. You'll get an index seek to locate the dates which match the query. You'll then get a key lookup (a single-row clustered index seek) for each row returned by the index seek. These can get very inefficient very fast. In fact, if the optimiser decides that the query will be doing key lookups on over about 1% of the table, it'll choose to *scan* the clustered index instead.
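
    An easy way to see which path the optimiser actually takes (table and index names are placeholders):

        SET STATISTICS IO ON;

        SELECT *
        FROM dbo.Loads
        WHERE LoadDate = '2015-05-05';
        -- The actual plan shows either a seek on the date index plus key
        -- lookups, or a clustered index scan once the date covers enough
        -- of the table for the lookups to be the more expensive option.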

    (a non-unique clustering key means a uniqueifier is added to all NC indexes; without it I have fewer pages to read, hence higher performance)

    While that is true, the uniqueifier is NULL unless there actually is a duplicate value. It's also only 4 bytes if it is present.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • I am wondering about the benefit of adding a clustered index to make sure data is sequential and then using a non-clustered index for all the queries. Since you mentioned that the date column is frequently updated, the NC index will need to be rebuilt frequently as well. So why not adjust the fill factor and simply rebuild the clustered index? In fact, you could improve ETL performance with partition switching, so you could still get all the benefits of the clustered index without the impact on write performance. I don't think rebuilding indexes by partition is an online operation in 2008 R2, but try it out anyway.
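
    A minimal sketch of both ideas (index, table, and partition numbers are hypothetical):

        -- Leave free space on pages so mid-index inserts split less often.
        ALTER INDEX CIX_Loads_LoadDate ON dbo.Loads
        REBUILD WITH (FILLFACTOR = 80);

        -- Load a date into a staging table with an identical structure,
        -- then switch it into the partitioned table as a metadata-only op.
        ALTER TABLE dbo.Loads_Staging SWITCH TO dbo.Loads PARTITION 5;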

    Jayanth Kurup
