Partitioning and Indexes needed on huge table

  • I have a table with 24,000,000 records and 140 fields. The records can be divided into 3 years for partitioning. I'm still quite new to indexes and partitioning, so any tips would help. I'm assuming, of course, that I need to do this at night when no one is on. Also, I'm using SQL Server 2016. Thanks!

    I must add that every 15 minutes a stored procedure uses a MERGE routine to add new records to the table. Records are rarely updated, and when they are, it's in a partition with recent records. I'm not sure how this fact may affect my choice of indexing.

  • Before you partition, you need to know that if you get partition scans, performance will be worse than before you partitioned. So only partition if you can guarantee that EVERY query that runs against the system will ALWAYS use the partitioning key in the WHERE clause (ON or HAVING counts too; whatever filters the data). If you can't guarantee that, I would recommend against partitioning.
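
    As a minimal sketch of what that means (all of the object and column names below are hypothetical, not from the original post):

    -- Hypothetical yearly partitioning and two queries against it.
    CREATE PARTITION FUNCTION pfByYear (datetime2(0))
        AS RANGE RIGHT FOR VALUES ('2022-01-01', '2023-01-01');   -- 3 yearly partitions

    CREATE PARTITION SCHEME psByYear
        AS PARTITION pfByYear ALL TO ([PRIMARY]);

    CREATE TABLE dbo.BigTable_demo
    (
        RecordDate datetime2(0) NOT NULL,
        Id         int          NOT NULL
    ) ON psByYear (RecordDate);

    -- Partition elimination is possible: the filter is on the partitioning column.
    SELECT COUNT(*) FROM dbo.BigTable_demo
    WHERE RecordDate >= '2023-01-01' AND RecordDate < '2024-01-01';

    -- No elimination: nothing filters on RecordDate, so every partition is touched.
    SELECT COUNT(*) FROM dbo.BigTable_demo
    WHERE Id = 42;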

    Certainly, do it during off hours. It's going to move all the data, so plan for that. You'll also need a pretty big log file and a big tempdb, so plan for that too. Otherwise, it's really straightforward.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • I think there are multiple approaches to this, but if it were me, I would start by defining the problem you are trying to solve rather than starting with the solution.

    If you are 100% certain that partitioning and better/more indexes are the way to go, then for partitioning I would look at partitioning by date, based on what you said. That said, I don't know how your data is distributed, or whether partitioning by week, month, quarter, or year would be the best approach. Basically, partitions allow different storage locations for the data. So you can (for example) put the recent data on super-fast SSD, as it is accessed most frequently, and put the old data on spinning disk, as it doesn't get accessed as often and doesn't need the performance boost that comes from SSD; a rough sketch of that filegroup mapping is below. You do need to be in control of your disk subsystem for that to be beneficial, though. Mind you, partitioning can also help with blocking if people are mostly looking at older data: inserts can happen on one partition while reads happen on another. But again, it depends on the problem you are trying to solve.
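
    Something like this is the shape of that idea (the filegroup and object names are made up, and the filegroups would already have to exist in the database):

    -- Sketch only: the newest yearly partition goes to a filegroup on fast SSD,
    -- older partitions go to a filegroup on slower spinning disk.
    CREATE PARTITION FUNCTION pfByYearStorage (datetime2(0))
        AS RANGE RIGHT FOR VALUES ('2022-01-01', '2023-01-01');

    CREATE PARTITION SCHEME psByYearStorage
        AS PARTITION pfByYearStorage
        TO (FG_Archive_Slow, FG_Archive_Slow, FG_Current_SSD);   -- oldest .. newest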

    My opinion: 24 million rows is not a "huge" table. 140 columns is a pretty wide table, but if most of those are bit fields, that is only 140 bits per row, which is roughly 3.4 billion bits, or 420 million bytes, or roughly 420 MB. I expect each column isn't a bit field, but I also expect that each row is not 8 KB (the max row size).

    As for indexes, you want to make sure that your clustered index is aligned with your partitions. The reason is that you don't want an insert into the newest partition to cause a page split in the middle of the index, as that will get very messy very quickly.
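
    A minimal sketch of an aligned clustered index (names are hypothetical; the leading key matches the partitioning column, so new rows land at the end of the newest partition):

    -- Clustered index keyed on the partitioning column first, created on the
    -- partition scheme so the index is aligned with the partitions.
    CREATE UNIQUE CLUSTERED INDEX CIX_BigTable
        ON dbo.BigTable (RecordDate, Id)
        ON psByYear (RecordDate);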

    Now, an alternative to partitioning and indexing could be archiving data. What I mean is: if the majority of the data is never looked at and is only kept around for historical purposes, why not move that data into an archive or historical table? These are just terms for a table, but I know we have some tables like "Inventory_Historical" where we keep old inventory data. Nobody really uses the data, but sometimes auditors like to review our historical data to ensure we can produce data from previous years, up to 7 years ago. The table gets one big yearly dump of data that can be archived (usually data from 3 years ago and older), and we don't bother with indexes on it (it is a heap), as nobody really looks at it and we want fast insert performance.
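
    If archiving turns out to be the answer, the yearly dump might look something like this (the table names, column names, and cutoff are all placeholders):

    -- Move rows older than the cutoff into the heap archive table in one statement;
    -- DELETE ... OUTPUT makes the move atomic.  For a very large move you would
    -- normally batch this with DELETE TOP (n) in a loop instead.
    DECLARE @Cutoff datetime2(0) = '2021-01-01';

    DELETE src
    OUTPUT DELETED.* INTO dbo.BigTable_Historical
    FROM dbo.BigTable AS src
    WHERE src.RecordDate < @Cutoff;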

    In your case, you may need some of that data and may want to keep more records.  But it brings up the initial question - what problem are you trying to solve and are you 100% certain that partitions and indexing will solve it?

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • Try different clustering for the table.  If that doesn't help enough by itself, then you can also look into partitioning the table.

    The key thing for determining the best clustered index is "How do you search the table?"  That is, what is in the WHERE clause on the table?

    Do you (almost) always search by date, as it sounds?  If so, the first key in the clustering key(s) should be date.
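
    For example, if queries almost always filter on a date range, the clustering might look like this (the column names are assumptions, not from the original post):

    -- Date first so that date-range searches become clustered index seeks;
    -- an identity/key column (Id here) is added to keep the key unique.
    -- Assumes the table does not already have a clustered index; if it does,
    -- the existing one has to be dropped or rebuilt first.
    CREATE UNIQUE CLUSTERED INDEX CIX_BigTable_ByDate
        ON dbo.BigTable (RecordDate, Id);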

    Data compression (PAGE compression) would also likely be very helpful here; since you're on SQL 2016, it's available even on Standard Edition (as of SP1).  You can check how much page compression would save by running this command:

    EXEC sys.sp_estimate_data_compression_savings 'dbo', '<your_table_name>', 1, NULL, 'PAGE'
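
    The 1 there is the index_id (1 = the clustered index).  If the estimate looks good, actually turning compression on is a rebuild; a sketch only, with the placeholder table name carried over from the command above:

    -- Rebuild the table (i.e. its heap or clustered index) with PAGE compression.
    -- Nonclustered indexes have to be rebuilt separately if you want them compressed too.
    ALTER TABLE dbo.<your_table_name>
        REBUILD WITH (DATA_COMPRESSION = PAGE);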

    As to off hours only, it depends.  If you have an identity column or a datetime column that lets you know the order of INSERTs, you can create a new version of the table, temporarily with a different name (adding "_new" to the end of the original table name is typical).  Then copy the existing data to that new table in batches.  The new table will eventually replace the old table.  Create just the clustered index before the initial load; you can create any nonclustered index(es) later.

    Once all the data is copied, you briefly lock the table, do a final copy of any most-recent rows that aren't in the new table yet, then rename the original table to an "old" name and rename the new table to the original table name.  Of course you'll need enough disk space to hold both copies of the table using this method.  And if foreign keys are involved, you'll need to adjust those.
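
    A rough sketch of that batch copy and swap (everything here, including the column names, batch size, and the @MaxLoadedId bookmark, is illustrative only):

    -- Copy existing rows to the new table in batches, tracking progress by Id.
    -- If Id is an identity column, you'd also need SET IDENTITY_INSERT ... ON
    -- and explicit column lists in the INSERTs.
    DECLARE @MaxLoadedId int = 0, @BatchSize int = 500000, @Rows int = 1;

    WHILE @Rows > 0
    BEGIN
        INSERT INTO dbo.BigTable_new WITH (TABLOCK)
        SELECT TOP (@BatchSize) *
        FROM dbo.BigTable
        WHERE Id > @MaxLoadedId
        ORDER BY Id;

        SET @Rows = @@ROWCOUNT;
        IF @Rows > 0
            SELECT @MaxLoadedId = MAX(Id) FROM dbo.BigTable_new;
    END;

    -- Final cutover: copy any stragglers, then swap the names.
    BEGIN TRAN;
        INSERT INTO dbo.BigTable_new
        SELECT * FROM dbo.BigTable WHERE Id > @MaxLoadedId;

        EXEC sp_rename 'dbo.BigTable', 'BigTable_old';
        EXEC sp_rename 'dbo.BigTable_new', 'BigTable';
    COMMIT;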

    Using data compression will reduce the disk space needed for the table considerably.  In theory you could instead just delete the original table after you've copied all the old rows to the new table, but I'd be too nervous to do that.

    I don't have enough info on your current table to provide any more details than that.


    SQL DBA, SQL Server MVP (07, 08, 09)
    A socialist is someone who will give you the shirt off *someone else's* back.

  • The big thing about partitioning, for me, has nothing to do with queries.  As Grant indicates, partitioning almost never improves performance.  Even so-called "partition elimination" is usually slower than a monolithic table with the correct indexes, which is why Scott brings up the idea of simply re-examining what is being used as the clustered index.  In fact, when partitioning appears to improve performance, the reason usually has nothing to do with partitioning itself.  It usually has everything to do with the way the indexes were changed to handle the partitioning.

    To wit, the first reason I partition tables (and I DO like partitioned VIEWS better than partitioned TABLES, for more reasons than can be covered in a short forum post) is to make index maintenance shorter, faster, and less resource intensive.  Notice that I didn't use the word "easier" anywhere in that, because it actually is a bit more difficult, although not seriously so.

    The other reason is to seriously improve backups.  If, as the OP states, the table is "append only" temporally speaking, then the "old" partitions can each be placed in their own single-file filegroup, and those filegroups can be set to READ_ONLY.  One final backup in that state and you will no longer have to back up data that will never change again.  That has allowed me to back up a 2.2 TB database in less than 15 minutes, because about 2.0 TB of that 2.2 TB is old data that never changes.
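
    For illustration, closing out one of those old filegroups might look like this (the database, filegroup, and path names are made up):

    -- Once a partition's data will never change again, make its filegroup
    -- read-only and take one last backup of just that filegroup.
    ALTER DATABASE BigDB MODIFY FILEGROUP FG_2021 READ_ONLY;

    BACKUP DATABASE BigDB
        FILEGROUP = 'FG_2021'
        TO DISK = 'X:\Backups\BigDB_FG_2021_final.bak';

    -- From then on, routine backups only need the read/write filegroups.
    BACKUP DATABASE BigDB
        READ_WRITE_FILEGROUPS
        TO DISK = 'X:\Backups\BigDB_rw.bak';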

    As a bit of a sidebar, it contains 1 file/filegroup per month for the last 11 years.  It's currently a partitioned TABLE (yeah... I drank the purple Kool-Aid there) and I'm in the process of converting it to a partitioned VIEW.
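
    For anyone unfamiliar with them, a partitioned view over yearly tables is roughly this shape (the table names and date column are hypothetical):

    -- Each yearly table carries a CHECK constraint on the date column, e.g. on dbo.BigTable_2023:
    --   CHECK (RecordDate >= '2023-01-01' AND RecordDate < '2024-01-01')
    -- which is what lets the optimizer skip the tables a query can't touch.
    CREATE VIEW dbo.BigTable_All
    AS
    SELECT * FROM dbo.BigTable_2021
    UNION ALL
    SELECT * FROM dbo.BigTable_2022
    UNION ALL
    SELECT * FROM dbo.BigTable_2023;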

     

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
