Partitioning - Part 2

,

In Part 1 we did an overview of partitioning and related techniques such as

archiving. This time we'll dig deeper into archiving which is perhaps the

easiest technique to use regardless of the platform.

The common archiving strategy is to do so by a date column. To illustrate a

simple archiving strategy we'll use the Person.Contact table in Adventureworks.

Start by creating an archive table with the same structure as Person.Contact and

I like to add two additional columns; ArchiveID which will be an identity and

ArchivedDate which will be a datetime set to a default of GetUTCDate() when the

row is added to the table. You can use the technique of your choice for building

this table but I like the simplicity of select into because I can get a plain

table with no indexes.

select * into person.contactarchive from person.contact where 1=0

I then use the SSMS designer to add ArchiveID and DateAdded, and I make ArchiveID the primary key. This is the final result:

CREATE TABLE [Person].[contactarchive](
	[ContactID] [int] NOT NULL,
	[NameStyle] [dbo].[NameStyle] NOT NULL,
	[Title] [nvarchar](8) NULL,
	[FirstName] [dbo].[Name] NOT NULL,
	[MiddleName] [dbo].[Name] NULL,
	[LastName] [dbo].[Name] NOT NULL,
	[Suffix] [nvarchar](10) NULL,
	[EmailAddress] [nvarchar](50) NULL,
	[EmailPromotion] [int] NOT NULL,
	[Phone] [dbo].[Phone] NULL,
	[PasswordHash] [varchar](128) NOT NULL,
	[PasswordSalt] [varchar](10) NOT NULL,
	[rowguid] [uniqueidentifier] NOT NULL,
	[ModifiedDate] [datetime] NOT NULL,
	[ArchiveID] [int] IDENTITY(1,1) NOT NULL,
	[DateAdded] [smalldatetime] NOT NULL CONSTRAINT [DF_contactarchive_DateAdded]  DEFAULT (getutcdate()),
 CONSTRAINT [PK_contactarchive] PRIMARY KEY CLUSTERED 
(
	[ArchiveID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

A few points worth mentioning before we continue. One is that it's worthwhile

to be consistent in your naming convention. I like adding the 'archive' postfix

so that I can tell at a glance if a table has an archive relative in the

database. I also like using archive rather than history as history is often used

for different purposes - auditing being the most common, where you use triggers

to capture some or all changes to a table. I've also

elected to lightly index, which makes sense if you won't be accessing the table

at all, but in many cases you may be hitting the archive table frequently and in

that case you should index to the same degree or more than the base table, since

the only time the indexes will be affected will be during the archive operation.

Another point is that if you follow

my example and use an identity column on the archive table you have to be sure

that all upstream code is using Scope_Identity() rather than @@Identity or bad

things will happen as the incorrect ID is returned to the caller.  There

are alternatives to avoid that possibility. One is to just use a

uniqueidentifier, the other is to just continue to use ContactID as the primary

key. The latter works fine if you know you'll be archiving each record once and

only once (which is the scenario we are implementing next and is most common).

Now let's build the archiving code and for this example we'll use

ModifiedDate as our key column.

alter proc usp_PersonContactArchive @ThruDate datetime
as
set nocount on
set xact_abort on
begin transaction
--first we stabilize the set of records we want to move
create table #temp (ContactID int)
create unique index ndxMain on #Temp (ContactID)
insert into #temp (
	ContactID)
select
	ContactId
from Person.Contact
where
	ModifiedDate 

Pretty vanilla code, the goal is to make sure that we don't delete records from the source table without them being in the archive table. An alternate, and perhaps slightly more robust, implementation would be to join back to the archive table to positively identify records to delete, like this:

create proc usp_PersonContactArchive2 @ThruDate datetime
as
set nocount on
set xact_abort on
begin transaction
--put the records in the archive table
INSERT Person.contactarchive
	(ContactID
	,NameStyle
	,Title
	,FirstName
	,MiddleName
	,LastName
	,Suffix
	,EmailAddress
	,EmailPromotion
	,Phone
	,PasswordHash
	,PasswordSalt
	,rowguid
	,ModifiedDate
	)
select
   C.ContactID
   ,NameStyle
   ,Title
   ,FirstName
   ,MiddleName
   ,LastName
   ,Suffix
   ,EmailAddress
   ,EmailPromotion
   ,Phone
   ,PasswordHash
   ,PasswordSalt
   ,rowguid
   ,ModifiedDate
from Person.Contact C
where
	ModifiedDate 

All that remains is to test it and to set up a job to run it. The nice thing about this implementation is that if the job fails, the next run will catch all the records the previous run should have with no extra tracking needed. The job just needs a single step, though in practice I put all my archiving code into a single job with a separate step for each table. Our calling syntax is simple:

declare @ThruDate datetime
set @ThruDate = DateAdd(yy, -9, getutcdate())
exec usp_PersonContactArchive @ThruDate

Because we're using Adventureworks and their is no new data being added, I've

elected to archive anything more than a nice odd 9 years old each time it runs.

We can run it at whatever internal makes sense. It's less work to do it daily so

that we move smaller chunks of data, but we might have a business situation

where we want to archive once a year on Jan 1.

This is simple archiving and ignores two possibilities. One is that you'll

have foreign key constraints and the other is that the number of records to be

archived is so large that doing it in a single transaction is prohibitive.

Foreign keys can range from mild pain to severe. Let's look at a not quite so

simple situation where we our target table is part of a foreign key

relationship. I've created a table called TestFKey and populated it with data

from Person.Contact, and then added the foreign key constraint to person.contact.

CREATE TABLE [dbo].[TestFkey](
	[contactid] [int] IDENTITY(1,1) NOT NULL,
	[modifieddate] [datetime] NOT NULL,
 CONSTRAINT [PK_TestFkey] PRIMARY KEY CLUSTERED 
(
	[contactid] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]
ALTER TABLE [dbo].[TestFkey]  WITH CHECK ADD  CONSTRAINT [FK_TestFkey_Contact] FOREIGN KEY([contactid])
REFERENCES [Person].[Contact] ([ContactID])

To test, I ran this statement:

exec usp_PersonContactArchive '12/30/1998'

And this is the resulting error:

A check of our archive table shows that we're still transactionally

consistent. When the delete failed our transaction failed along with it. This

will happen if even one foreign key relationship is still valid within our

target set. The only way to resolve is to either only move records without a

foreign key relationship (not incredibly useful), or we have to archive our way

from the bottom up, starting with records in the TestFkey table. With a good

number of foreign keys and add in some business rules that are any more complex

than date based and it gets pretty complex. To resolve this one we would just

mirror our archiving process pointed at the TestFkey table, put that step ahead

of our Person.Contact step and let it run.

Let's revisit our other potential issue which is that we may have too many

records to process in one pass. Only experience can tell you what that number

is, but it's easy enough to implement. First we need to modify our archiving

proc a bit (changes are highlighted):

alter proc usp_PersonContactArchive2 @ThruDate datetime, @ArchiveCount int output

as
set nocount on
set xact_abort on
begin transaction
--put the records in the archive table
INSERT Person.contactarchive
	(ContactID
	,NameStyle
	,Title
	,FirstName
	,MiddleName
	,LastName
	,Suffix
	,EmailAddress
	,EmailPromotion
	,Phone
	,PasswordHash
	,PasswordSalt
	,rowguid
	,ModifiedDate
	)
select top 1
   C.ContactID
   ,NameStyle
   ,Title
   ,FirstName
   ,MiddleName
   ,LastName
   ,Suffix
   ,EmailAddress
   ,EmailPromotion
   ,Phone
   ,PasswordHash
   ,PasswordSalt
   ,rowguid
   ,ModifiedDate
from Person.Contact C
where
	ModifiedDate order by 
	ModifiedDate
set @ArchiveCount = @@rowcount
--delete from source
delete from person.contact
where
	ContactId in (select c.contactid from person.contact c inner join person.contactarchive ca on c.contactid = ca.contactid)
commit transaction

Then we add a higher level calling proc that will call our original proc

multiple times:

alter proc usp_PersonContactArchiveLoop as
set nocount on
declare @ThruDate datetime
declare @RecCount int
declare @MaxLoop int
set @ThruDate = DateAdd(yy, -8, getutcdate())
set @maxLoop = 10
while @MaxLoop > 0
	begin
		exec usp_PersonContactArchive2 '1/1/1999', @RecCount output
		print 'Archived ' + convert(varchar(10), @RecCount)
		set @MaxLoop = @MaxLoop - 1
		if @RecCount = 0
			break
	end		

I've designed this one to loop up to 10 times and to archive one record on

each pass. In practice we might want to let it run until completion, or have it

break once a certain time is reached so that it doesn't continue to run into

production hours.

Hopefully that will get you started thinking about how you might go about

archiving in your current situation. Archiving can lead to some performance

gains and it can delay having to purchase more expensive disk space if you move

the archive tables to cheaper storage. The downside is that the data is not

quite as easily available as it used to be. Next time we'll dig into partitioned

views!

Rate

3.25 (8)

Share

Share

Rate

3.25 (8)