Purging data from Master and 5 child tables

  • Hi All,

    I have 6 tables that are very large in row count, and records older than 8 days need to be deleted.

    A little background: every day, 300 million records are inserted into the tables below, and we should keep only 8 days' worth of data in them. Please help me implement a purge script that can delete records from all of the tables at the same time with optimized parallelism (a rough sketch of the straight-delete approach I had in mind is shown after the table list).

    Master table, which has [ID] and [Timestamp]:

    Table Name: Sample - 2,578,106

    Child tables: the foreign key [ID] is common to all of the tables, and there is no timestamp column in the child tables, so records need to be deleted based on Min(ID) from Sample. (The numbers next to each table name are current row counts.)

    dbo.ConnectionDB - 1,147,578,048

    dbo.ConnectionSS - 876,458,321

    dbo.ConnectionRT - 118,133,857

    dbo.ConnectionSample - 100,038,535

    dbo.Command - 100,032,235
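
    For reference, this is the kind of batched delete I was picturing. It's only a sketch: it assumes every child table really does carry the [ID] foreign key described above, the batch size is arbitrary, and the same loop would be repeated for each child table before purging Sample itself.

    -- Minimal batched-delete sketch (ASSUMPTION: each child table has an [ID]
    -- column referencing dbo.Sample.[ID]; batch size is a placeholder value).
    DECLARE @CutoffID    BIGINT,
            @BatchSize   INT = 50000,
            @RowsDeleted INT = 1;

    -- Oldest ID that must be kept: the smallest Sample row younger than 8 days.
    SELECT @CutoffID = MIN([ID])
    FROM   dbo.Sample
    WHERE  [Timestamp] >= DATEADD(DAY, -8, GETDATE());

    -- Delete child rows in batches to keep lock duration and log growth manageable.
    WHILE @RowsDeleted > 0
    BEGIN
        DELETE TOP (@BatchSize) FROM dbo.ConnectionDB WHERE [ID] < @CutoffID;
        SET @RowsDeleted = @@ROWCOUNT;
    END;

    -- ...repeat the loop for dbo.ConnectionSS, dbo.ConnectionRT,
    -- dbo.ConnectionSample and dbo.Command, and finally for dbo.Sample itself.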

  • The first thing that comes to mind is to partition the tables. Have you looked into that?

    😎

  • I'm trying to partition those tables with a 4-hour window, meaning the records that come in during each 4-hour period would go into one partition. I hope that's a good idea. Is it?

  • If each partition holds 4 hours' worth, you will need 48 partitions (+1) to cover the eight days; maybe a day's worth per partition would be fine? It all depends on the environment, usage, and so on, but this is probably the best approach.

    😎

  • You have double-posted this question. What version of SQL Server are you using, SQL Server 2012 or SQL Server 2014?

  • Yes, I agree I have two posts about work on the same tables. I thought I'd keep this post purely for the purge and the other one for partitioning. However, if you want to cover both in the same thread, I won't mind closing out the other one.

    I'm using SQL Server 2014 RTM version.

  • Partitioning will help with the purging.

  • Can you modify the child tables to have a datetime column?

  • Here is the problem: the inserts happen through a service that was developed 6 years ago. The person who wrote the tool is no longer on the team, and no one on the team knows how the inserts are designed. I joined the team a month ago and was handed this SQL instance to optimize. I have taken care of all the other best practices and am left with these two things: partitioning and data purge.

    I can try asking someone who can edit that service to also insert a timestamp.

  • Sounds like it may not be doable.

    Tell us more about the IDs in the tables. In fact, it may help if you could post the DDL for the tables, including the currently defined indexes.

  • I've attached the DDL for all 6 tables.

  • I believe that, before I did anything else with performance in mind, I'd clean up and consolidate some indexes. For example, the ID column of the Sample table is the first column of the clustered index (perhaps inappropriately, it is also the PK). The ID column also has a UNIQUE constraint (UniqueKey_Sample) on it, and that creates another index whose leading column is the ID column. Then, it appears that an additional non-unique index ([NonClusteredIndex-20140110-131210]) has been added with an INCLUDE on the Time_Started column. That last index is certainly and fully covered by the current clustered index and can probably be dropped. That will increase the speed of both INSERTs and DELETEs on that table a bit. I don't know how much quicker they would be; you'd have to test that.

    The Sample table also has an index (IX_Sample2) on the IsHistorical column, which is a BIT column, with an INCLUDE on the Time_Started column. Indexes with a leading column of BIT (or another such low-selectivity type) are notorious for causing timeouts (due to page/extent splits) during INSERTs from the GUI (or web service), especially at high rates of one-row-at-a-time inserts.

    The IX_Sample1 index is a massive covering index (lots of INCLUDE columns) which, among other things, is a huge duplication of data, and its leading column of DepotID would seem to have relatively low selectivity. The width of this index at the leaf level is huge, and it is probably out of order relative to the order of INSERTs, which also makes it a possible INSERT performance problem. The sheer width makes it a bit of a performance problem for both INSERTs and DELETEs.

    Partitioning these tables based on an ID won't be the cakewalk that most would make it out to be but, owing to the brief lifetime of the data, it'll be simpler than what I've had to do in the past (removing unused space from a file). There will be a further expansion of disk space required because every unique index will have the partitioning column added to it. For all but the Sample table, I recommend ConnectionSampleID as the partitioning column. That means the column will be added to the UNIQUE (including PK) indexes on the ID columns of those other tables, which means those indexes no longer enforce uniqueness of the ID alone. It may be better to just drop the UNIQUE indexes from the ID columns of all but the Sample table and, of course, the ConnectionSample table.

    Because you'd be partitioning on an IDENTITY column, you should definitely use LEFT-based boundaries for everything except the Sample table, because you won't know what the new boundary value is until it happens. Using LEFT-based boundaries will prevent huge amounts of data movement in this case and still allow you the freedom to be accurate about what the boundary value will be. The only problem is that you have to start thinking in terms of an "end value" instead of a "start value" when maintaining/moving your partitions, but it doesn't take long to settle into that train of thought.
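
    Just to make the LEFT-based idea concrete, a bare-bones sketch might look like the following. The function/scheme names, data type, filegroup, and boundary values are all made up; in reality each boundary would be the last ConnectionSampleID generated on a given day.

    -- Hypothetical RANGE LEFT function/scheme for the ConnectionSampleID-partitioned
    -- tables.  With LEFT, each boundary value is the LAST (end) ID that belongs to
    -- its partition.  Boundary values below are illustrative only.
    CREATE PARTITION FUNCTION pf_ConnectionSampleID (BIGINT)
    AS RANGE LEFT FOR VALUES (100000000, 200000000, 300000000);   -- one "end ID" per day

    CREATE PARTITION SCHEME ps_ConnectionSampleID
    AS PARTITION pf_ConnectionSampleID ALL TO ([PRIMARY]);        -- single filegroup for simplicity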

    The Sample table should probably be partitioned using the TimeStarted column IF it includes both date and time. If so, use RIGHT partitioning on it; that will allow you to use a "whole" date for the "border" values instead of having to mess around with "end of day" values. The ConnectionSampleID "border" values would be based on the row containing the MAX SampleID for each day represented by the TimeStarted column in the Sample table.
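
    As a sketch of that (again, the object names and the exact column names such as TimeStarted and SampleID are assumptions based on this thread, not your real DDL):

    -- Hypothetical RANGE RIGHT function/scheme for dbo.Sample on TimeStarted.
    -- With RIGHT, each boundary is the START of a new day, so "whole" dates work.
    CREATE PARTITION FUNCTION pf_SampleTime (DATETIME)
    AS RANGE RIGHT FOR VALUES ('20140601', '20140602', '20140603');

    CREATE PARTITION SCHEME ps_SampleTime
    AS PARTITION pf_SampleTime ALL TO ([PRIMARY]);

    -- Deriving a ConnectionSampleID "border" for a given day from the Sample table
    -- (assumes ConnectionSample.SampleID references Sample.ID):
    SELECT MAX(cs.ConnectionSampleID) AS BorderValue
    FROM   dbo.ConnectionSample cs
    JOIN   dbo.Sample s ON s.ID = cs.SampleID
    WHERE  s.TimeStarted < '20140602';   -- start of the day AFTER the one being closed out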

    If you decide to use LEFT partitioning on the Sample table, instead, that would be OK with the understanding that you'll leave a row or two behind when you SWITCH out a day and that the "borders" might not be as easy to understand. That shouldn't be a problem for the kind of data that you're working with, though. To be honest, I've not thought about which row in the ConnectionSample table would represent the "border" value in this case because I probably wouldn't use LEFT partitioning on the Sample table.

    An additional benefit of partitioning will be easier maintenance of all those bloody indexes. Since ConnectionSampleID is an IDENTITY column, it IS somewhat temporal in nature, and only the latest partition should need any index maintenance. Once a partition holds only "older" data, you should be able to do just one more reorganization of the indexes on that partition and never have to do it again (unless there are UPDATEs that you haven't made us aware of).
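
    That one-time touch-up of a partition that has just gone "cold" could be as simple as the following; the index name and partition number are placeholders:

    -- Reorganize only the partition that just stopped receiving INSERTs.
    ALTER INDEX PK_ConnectionDB
        ON dbo.ConnectionDB
        REORGANIZE PARTITION = 8;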

    The real keys to partitioning being successful are simple. If you don't have a plan, your chances of first-time success could become pretty dismal. A major key is to make this a scheduled job so that you can "set it and forget it". That brings us to the most important key of them all: test EVERYTHING with a subset copy of the data before putting it into production. Make sure the creation of new boundaries causes no data movement, or it'll be as bad as doing deletes. Make sure that, if a failure occurs, the code is "self-healing" so that all you have to do is re-execute the job (once you've determined what went wrong).
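
    A rough outline of what that scheduled job would do for one of the ID-partitioned tables is below. The object names and boundary values are placeholders, and the real job still needs the boundary lookups, logging, and error handling described above.

    -- Nightly sliding-window sketch for one ID-partitioned table.
    -- 1) Add the next boundary at the "empty" end so no rows have to move.
    ALTER PARTITION SCHEME ps_ConnectionSampleID NEXT USED [PRIMARY];
    ALTER PARTITION FUNCTION pf_ConnectionSampleID() SPLIT RANGE (400000000);

    -- 2) Switch the oldest partition into an empty, identically structured staging
    --    table on the same filegroup (metadata-only, effectively instant), then purge it.
    ALTER TABLE dbo.ConnectionDB SWITCH PARTITION 1 TO dbo.ConnectionDB_Purge;
    TRUNCATE TABLE dbo.ConnectionDB_Purge;

    -- 3) Remove the now-empty oldest boundary.
    ALTER PARTITION FUNCTION pf_ConnectionSampleID() MERGE RANGE (100000000);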

    Oh yeah... almost forgot. Make sure that your indexes are "aligned" so that you can manage them by partition, as well.
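
    In practice, "aligned" just means the non-clustered indexes are created on the same partition scheme with the same partitioning column, along these lines (index and column names are illustrative):

    -- An "aligned" non-clustered index: built on the partition scheme with the same
    -- partitioning column, so it can be switched and maintained per partition.
    CREATE NONCLUSTERED INDEX IX_ConnectionDB_ConnectionSampleID
        ON dbo.ConnectionDB (ConnectionSampleID)
        ON ps_ConnectionSampleID (ConnectionSampleID);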

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Point taken. My plan is to drop all the non-clustered indexes on all 6 tables and only add them back on a case-by-case basis.

    Could you possibly give me the scripts needed for the partition functions and partition schemes for all 6 tables so that I can start further testing?

    Sample table with TimeStarted as the partitioning key.

    dbo.ConnectionSample with SampleID as the partitioning key.

    dbo.ConnectionDB with ConnectionSampleID as the partitioning key.

    dbo.ConnectionSS with ConnectionSampleID as the partitioning key.

    dbo.ConnectionRT with ConnectionSampleID as the partitioning key.

    dbo.Command with ConnectionSampleID as the partitioning key.

  • http://msdn.microsoft.com/en-us/library/ms179854.aspx explains how partition schemes and functions work. Read that first. Then maybe create a dummy table in a dummy DB to play with until you understand everything. When you're sure you have it down, try the same on a copy of your database.
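
    Something like this tiny throwaway script (all names made up) is enough to play with and see which partition each row lands in:

    -- Playground: a tiny partitioned table plus a check of where each row lands
    -- using the $PARTITION function.
    CREATE PARTITION FUNCTION pf_Play (INT) AS RANGE LEFT FOR VALUES (10, 20, 30);
    CREATE PARTITION SCHEME  ps_Play AS PARTITION pf_Play ALL TO ([PRIMARY]);

    CREATE TABLE dbo.Play (ID INT NOT NULL) ON ps_Play (ID);

    INSERT INTO dbo.Play (ID) VALUES (5), (10), (11), (25), (99);

    SELECT ID, $PARTITION.pf_Play(ID) AS PartitionNumber
    FROM   dbo.Play;     -- 5 and 10 -> partition 1, 11 -> 2, 25 -> 3, 99 -> 4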

  • A good place to start is the white paper by Kimberly Tripp and her posts on partitioning.

    😎
