Purging data from Master and 5 child tables

  • Hi All,

    I have 6 tables that are very large in row count, and records older than 8 days need to be deleted.

    A little background: every day, 300 million records are inserted into the tables below, and we should keep only 8 days' worth of data in them. Please help me implement a purge script that can delete records from all of the tables at the same time with optimized parallelism (a rough sketch of the straight-delete approach I had in mind is shown after the table list).

    Master table, which has [ID] and [Timestamp]:

    Table Name: Sample - 2,578,106

    Child tables: the foreign key [ID] is common to all of the tables, and there is no timestamp column in the child tables, so records need to be deleted based on Min(ID) from Sample. (The numbers next to each table name are current row counts.)

    dbo.ConnectionDB - 1,147,578,048

    dbo.ConnectionSS - 876,458,321

    dbo.ConnectionRT - 118,133,857

    dbo.ConnectionSample - 100,038,535

    dbo.Command - 100,032,235
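
    For reference, this is the kind of batched delete I was picturing. It's only a sketch: it assumes every child table really does carry the [ID] foreign key described above, the batch size is arbitrary, and the same loop would be repeated for each child table before purging Sample itself.

    -- Minimal batched-delete sketch (ASSUMPTION: each child table has an [ID]
    -- column referencing dbo.Sample.[ID]; batch size is a placeholder value).
    DECLARE @CutoffID    BIGINT,
            @BatchSize   INT = 50000,
            @RowsDeleted INT = 1;

    -- Oldest ID that must be kept: the smallest Sample row younger than 8 days.
    SELECT @CutoffID = MIN([ID])
    FROM   dbo.Sample
    WHERE  [Timestamp] >= DATEADD(DAY, -8, GETDATE());

    -- Delete child rows in batches to keep lock duration and log growth manageable.
    WHILE @RowsDeleted > 0
    BEGIN
        DELETE TOP (@BatchSize) FROM dbo.ConnectionDB WHERE [ID] < @CutoffID;
        SET @RowsDeleted = @@ROWCOUNT;
    END;

    -- ...repeat the loop for dbo.ConnectionSS, dbo.ConnectionRT,
    -- dbo.ConnectionSample and dbo.Command, and finally for dbo.Sample itself.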

  • The first thing that comes to mind is to partition the tables. Have you looked into that?

    😎

  • I'm trying to partition those tables with a 4-hour window, meaning the records that come in during each 4-hour period would go into one partition. I hope that's a good idea. Is it?

  • If each partition holds 4 hours' worth, you will need 48 partitions (+1) to cover the eight days; maybe a day's worth per partition would be fine? It all depends on the environment, usage, and so on, but this is probably the best approach.

    😎

  • You have double-posted this question. What version of SQL Server are you using, SQL Server 2012 or SQL Server 2014?

  • Yes, I agree I have two posts about work on the same tables. I thought I'd keep this post purely for the purge and the other one for partitioning. However, if you want to cover both in the same thread, I won't mind closing out the other one.

    I'm using SQL Server 2014 RTM version.

  • Partitioning will help with the purging.

  • Can you modify the child tables to have a datetime column?

  • Here is the problem: the inserts happen through a service that was developed 6 years ago. The person who wrote the tool is no longer on the team, and no one on the team knows how the inserts are designed. I joined the team a month ago and was handed this SQL instance to optimize. I have taken care of all the other best practices and am left with these two things: partitioning and data purge.

    I can try asking someone who can edit that service to also insert a timestamp.

  • Sounds like it may not be doable.

    Tell us more about the IDs in the tables. In fact, it may help if you could post the DDL for the tables, including the currently defined indexes.

  • I've attached the DDL for all 6 tables.

  • I believe that, before I did anything else with performance in mind, I'd clean up and consolidate some indexes. For example, the ID column of the Sample table is the first column of the clustered index (perhaps inappropriately, it is also the PK). The ID column also has a UNIQUE constraint (UniqueKey_Sample) on it, and that creates another index whose leading column is the ID column. Then, it appears that an additional non-unique index ([NonClusteredIndex-20140110-131210]) has been added with an INCLUDE on the Time_Started column. That last index is certainly and fully covered by the current clustered index and can probably be dropped. That will increase the speed of both INSERTs and DELETEs on that table a bit. I don't know how much quicker they would be; you'd have to test that.

    The Sample table also has an index (IX_Sample2) on the IsHistorical column, which is a BIT column, with an INCLUDE on the Time_Started column. Indexes with a leading column of BIT (or another such low-selectivity type) are notorious for causing timeouts (due to page/extent splits) during INSERTs from the GUI (or web service), especially at high rates of one-row-at-a-time inserts.

    The IX_Sample1 index is a massive covering index (lots of INCLUDE columns) which, among other things, is a huge duplication of data, and its leading column of DepotID would seem to have relatively low selectivity. The width of this index at the leaf level is huge, and it is probably out of order relative to the order of INSERTs, which also makes it a possible INSERT performance problem. The sheer width makes it a bit of a performance problem for both INSERTs and DELETEs.

    Partitioning these tables based on an ID won't be the cakewalk that most would make it out to be but, owing to the brief lifetime of the data, it'll be simpler than what I've had to do in the past (removing unused space from a file). There will be a further expansion of disk space required because every unique index will have the partitioning column added to it. For all but the Sample table, I recommend ConnectionSampleID as the partitioning column. That means the column will be added to the UNIQUE (including PK) indexes on the ID columns of those other tables, which means those indexes no longer enforce uniqueness of the ID alone. It may be better to just drop the UNIQUE indexes from the ID columns of all but the Sample table and, of course, the ConnectionSample table.

    Because you'd be partitioning on an IDENTITY column, you should definitely use LEFT-based boundaries for everything except the Sample table, because you won't know what the new boundary value is until it happens. Using LEFT-based boundaries will prevent huge amounts of data movement in this case and still allow you the freedom to be accurate about what the boundary value will be. The only problem is that you have to start thinking in terms of an "end value" instead of a "start value" when maintaining/moving your partitions, but it doesn't take long to settle into that train of thought.
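
    Just to make the LEFT-based idea concrete, a bare-bones sketch might look like the following. The function/scheme names, data type, filegroup, and boundary values are all made up; in reality each boundary would be the last ConnectionSampleID generated on a given day.

    -- Hypothetical RANGE LEFT function/scheme for the ConnectionSampleID-partitioned
    -- tables.  With LEFT, each boundary value is the LAST (end) ID that belongs to
    -- its partition.  Boundary values below are illustrative only.
    CREATE PARTITION FUNCTION pf_ConnectionSampleID (BIGINT)
    AS RANGE LEFT FOR VALUES (100000000, 200000000, 300000000);   -- one "end ID" per day

    CREATE PARTITION SCHEME ps_ConnectionSampleID
    AS PARTITION pf_ConnectionSampleID ALL TO ([PRIMARY]);        -- single filegroup for simplicity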

    The Sample table should probably be partitioned using the TimeStarted column IF it includes both date and time. If so, use RIGHT partitioning on it; that will allow you to use a "whole" date for the "border" values instead of having to mess around with "end of day" values. The ConnectionSampleID "border" values would be based on the row containing the MAX SampleID for each day represented by the TimeStarted column in the Sample table.
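
    As a sketch of that (again, the object names and the exact column names such as TimeStarted and SampleID are assumptions based on this thread, not your real DDL):

    -- Hypothetical RANGE RIGHT function/scheme for dbo.Sample on TimeStarted.
    -- With RIGHT, each boundary is the START of a new day, so "whole" dates work.
    CREATE PARTITION FUNCTION pf_SampleTime (DATETIME)
    AS RANGE RIGHT FOR VALUES ('20140601', '20140602', '20140603');

    CREATE PARTITION SCHEME ps_SampleTime
    AS PARTITION pf_SampleTime ALL TO ([PRIMARY]);

    -- Deriving a ConnectionSampleID "border" for a given day from the Sample table
    -- (assumes ConnectionSample.SampleID references Sample.ID):
    SELECT MAX(cs.ConnectionSampleID) AS BorderValue
    FROM   dbo.ConnectionSample cs
    JOIN   dbo.Sample s ON s.ID = cs.SampleID
    WHERE  s.TimeStarted < '20140602';   -- start of the day AFTER the one being closed out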

    If you decide to use LEFT partitioning on the Sample table, instead, that would be OK with the understanding that you'll leave a row or two behind when you SWITCH out a day and that the "borders" might not be as easy to understand. That shouldn't be a problem for the kind of data that you're working with, though. To be honest, I've not thought about which row in the ConnectionSample table would represent the "border" value in this case because I probably wouldn't use LEFT partitioning on the Sample table.

    An additional benefit of partitioning will be easier maintenance of all those bloody indexes. Since ConnectionSampleID is an IDENTITY column, it IS somewhat temporal in nature, and only the latest partition should need any index maintenance. Once a partition holds only "older" data, you should be able to do just one more reorganization of the indexes on that partition and never have to do it again (unless there are UPDATEs that you haven't made us aware of).
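
    That one-time touch-up of a partition that has just gone "cold" could be as simple as the following; the index name and partition number are placeholders:

    -- Reorganize only the partition that just stopped receiving INSERTs.
    ALTER INDEX PK_ConnectionDB
        ON dbo.ConnectionDB
        REORGANIZE PARTITION = 8;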

    The real keys to partitioning being successful are simple. If you don't have a plan, your chances of first-time success could become pretty dismal. A major key is to make this a scheduled job so that you can "set it and forget it". That brings us to the most important key of them all: test EVERYTHING with a subset copy of the data before putting it into production. Make sure the creation of new boundaries causes no data movement, or it'll be as bad as doing deletes. Make sure that, if a failure occurs, the code is "self-healing" so that all you have to do is re-execute the job (once you've determined what went wrong).
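
    A rough outline of what that scheduled job would do for one of the ID-partitioned tables is below. The object names and boundary values are placeholders, and the real job still needs the boundary lookups, logging, and error handling described above.

    -- Nightly sliding-window sketch for one ID-partitioned table.
    -- 1) Add the next boundary at the "empty" end so no rows have to move.
    ALTER PARTITION SCHEME ps_ConnectionSampleID NEXT USED [PRIMARY];
    ALTER PARTITION FUNCTION pf_ConnectionSampleID() SPLIT RANGE (400000000);

    -- 2) Switch the oldest partition into an empty, identically structured staging
    --    table on the same filegroup (metadata-only, effectively instant), then purge it.
    ALTER TABLE dbo.ConnectionDB SWITCH PARTITION 1 TO dbo.ConnectionDB_Purge;
    TRUNCATE TABLE dbo.ConnectionDB_Purge;

    -- 3) Remove the now-empty oldest boundary.
    ALTER PARTITION FUNCTION pf_ConnectionSampleID() MERGE RANGE (100000000);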

    Oh yeah... almost forgot. Make sure that your indexes are "aligned" so that you can manage them by partition, as well.
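
    In practice, "aligned" just means the non-clustered indexes are created on the same partition scheme with the same partitioning column, along these lines (index and column names are illustrative):

    -- An "aligned" non-clustered index: built on the partition scheme with the same
    -- partitioning column, so it can be switched and maintained per partition.
    CREATE NONCLUSTERED INDEX IX_ConnectionDB_ConnectionSampleID
        ON dbo.ConnectionDB (ConnectionSampleID)
        ON ps_ConnectionSampleID (ConnectionSampleID);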

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Point taken. My plan is to drop all the non-clustered indexes on all 6 tables and only add them back on a case-by-case basis.

    Could you possibly give me the scripts needed for the partition functions and partition schemes for all 6 tables so that I can start further testing?

    Sample table with TimeStarted as the partitioning key.

    dbo.ConnectionSample with SampleID as the partitioning key.

    dbo.ConnectionDB with ConnectionSampleID as the partitioning key.

    dbo.ConnectionSS with ConnectionSampleID as the partitioning key.

    dbo.ConnectionRT with ConnectionSampleID as the partitioning key.

    dbo.Command with ConnectionSampleID as the partitioning key.

  • http://msdn.microsoft.com/en-us/library/ms179854.aspx explains how partition schemes and functions work. Read that first. Then maybe create a dummy table in a dummy DB to play with until you understand everything. When you're sure you have it down, try the same on a copy of your database.
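
    Something like this tiny throwaway script (all names made up) is enough to play with and see which partition each row lands in:

    -- Playground: a tiny partitioned table plus a check of where each row lands
    -- using the $PARTITION function.
    CREATE PARTITION FUNCTION pf_Play (INT) AS RANGE LEFT FOR VALUES (10, 20, 30);
    CREATE PARTITION SCHEME  ps_Play AS PARTITION pf_Play ALL TO ([PRIMARY]);

    CREATE TABLE dbo.Play (ID INT NOT NULL) ON ps_Play (ID);

    INSERT INTO dbo.Play (ID) VALUES (5), (10), (11), (25), (99);

    SELECT ID, $PARTITION.pf_Play(ID) AS PartitionNumber
    FROM   dbo.Play;     -- 5 and 10 -> partition 1, 11 -> 2, 25 -> 3, 99 -> 4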

  • A good place to start is the white paper by Kimberly Tripp and her posts on partitioning.

    😎
