Technology advances allow old-fashioned housekeeping tasks
to be put off for ever longer periods. For example, increases in storage capacity and machine speed let you get away without an archiving strategy for an extended period of time, perhaps even beyond your tenure with a company.
Personally I look on such a practice as being the computing equivalent of leaving your underwear on your bedroom floor for someone else to pick up. Eventually such practice will
return to haunt you. Andy Warren identified not having an archiving strategy as a
worst practice in his article Worst Practice - Not Having An Archive Plan For Each Table. This article will look at some of the problems and potential solutions to implementing an archival strategy.
Problems identifying archive material
Identifying archive material simply wasn't a consideration in the old days because the limits on what could be retained were dictated by the technology of the day. Those barriers have largely evaporated, so one of the key difficulties is actually getting the data owners to agree on what data should be archived.
This political battle has not been made easier by legislation such as the (UK) Freedom of Information Act, which requires data to be easily accessible for an extended period of time. As ever, even when the technical side is straightforward, the political side can be a Machiavellian rat's nest.
The anti-archiving camp will marshal their arguments around the negatives:
- Loss of data and the resulting impact on the business
- Failure to comply with legislation
- Slow response to requests for old data
Before you counter these arguments you should sell your case by arguing the positive aspects. The main ones are as follows:
- Reduced demand on the hardware due to smaller database size, possibly leading to increases in performance
- Increased hardware longevity as capacity will be adequate for longer
- The possibility of increasing the range of applications that can be run due to the spare capacity.
Loss of data and the resulting impact on the business
We know that archiving data does not mean data loss. These days we can archive multiple copies off onto a variety of media. If you point this out your opponent will probably counter by saying that the data is in use and should not therefore be archived.
Short of demonstrating that the data in question has not been touched in years, this can be a tricky one to argue. If you are going to win this battle then you need to have your facts and figures ready to hand, not wait until an action is put on you to go away and gather evidence. The evidence will be gone by the time you have to look for it.
Failure to comply with legislation
In the UK we have something called the Data Protection Act and one of the many things that it states is that personal data must not be held beyond the lifetime of the purpose for which it was collected.
In other words you can argue that failure to archive material is in itself failure to comply with legislation.
In the case of Freedom of Information the law does not say that information should be instantly available on-line. Such legislation tends to say that data must be made available on application within a reasonable timescale and in an unaltered state.
Your archiving argument is not damaged by such legislation,
in fact it may even be strengthened if the driver for your archiving need is
stretched capacity. Is your capacity
under such stress that providing information within a reasonable
timescale is difficult?
Slow response to requests for old data
This circles back to the question of whether the data to be archived was being used in any case. It
is a safe bet that as soon as people know you intend to archive data they will suddenly develop a pressing need for it.
You can counter that faster response to requests for current data is a direct, and more significant, result of the archiving. For this to hold water you are going to need before and after archive benchmarks for your application. It doesn’t hurt to have a few influential application users notice the performance benefits. If there is a genuine need for old data then you need to have procedures in place to bring the data back on-line with the minimum of delay.
The easiest way is to archive the data off into a separate database, preferably onto a separate, albeit lower spec, server. The data will have always been available, but the triage of resources will have put the most current data on the most powerful equipment.
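The separate-database approach can be sketched in T-SQL. This is a minimal illustration only, assuming hypothetical database, table and column names; archiving to a second server would need a linked server or a transfer job rather than the cross-database INSERT shown here.

```sql
-- Sketch only: LiveDB, ArchiveDB and the OrderHistory table are
-- invented names for illustration.
DECLARE @cutoff datetime;
SET @cutoff = DATEADD(month, -6, GETDATE());

BEGIN TRANSACTION;

-- Copy the old rows into the archive database...
INSERT INTO ArchiveDB.dbo.OrderHistory (OrderID, CustomerID, OrderDate, Amount)
SELECT OrderID, CustomerID, OrderDate, Amount
FROM   LiveDB.dbo.OrderHistory
WHERE  OrderDate < @cutoff;

-- ...then remove them from the live database in the same transaction
-- so a failure part-way through leaves both databases consistent.
DELETE FROM LiveDB.dbo.OrderHistory
WHERE  OrderDate < @cutoff;

COMMIT TRANSACTION;
```

Capturing the cut-off date in a variable once, rather than calling GETDATE() in each statement, guarantees the INSERT and DELETE act on exactly the same rows.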
You need to understand exactly why someone needs to look at old data. I worked on a multi-gigabyte transactional database that had several million records added to it each month. People looking at older data never needed to see the transactions themselves, only an aggregate of that data. Understanding this requirement meant that not only could the old data be archived to an off-line repository but that the relevant aggregate tables could be built to store current and historic data. This benefited the users of this data because such an approach gave an immense performance boost.
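The aggregate-table idea might look something like the following sketch, again with invented table and column names; the point is that the summary is built before the detail rows leave the live database.

```sql
-- Hypothetical schema: dbo.Transactions holds the detail rows,
-- dbo.MonthlyTransactionSummary holds the aggregates users actually query.
INSERT INTO dbo.MonthlyTransactionSummary (AccountID, MonthStart, TxnCount, TxnTotal)
SELECT AccountID,
       -- Truncate each transaction date to the first of its month
       DATEADD(month, DATEDIFF(month, 0, TxnDate), 0) AS MonthStart,
       COUNT(*),
       SUM(Amount)
FROM   dbo.Transactions
WHERE  TxnDate < DATEADD(month, -6, GETDATE())
GROUP BY AccountID,
       DATEADD(month, DATEDIFF(month, 0, TxnDate), 0);
```

Once the summary rows exist, the detail rows can be archived off-line and the reporting queries pointed at the far smaller aggregate table.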
Archiving home grown application data vs vendor application data
If your application has been built in-house then you have at least a fighting chance of building an archiving solution, though even so you will come away from the archiving project wishing that an archive facility had been built into the application from day one. With a vendor application it may not even be possible to build an archive solution due to the proprietary nature of the application. It is unlikely that the vendor will want to release commercially sensitive information about the internal workings of their application.
Even if you have the technical knowledge then, unless your home built archiving solution is built within the framework of the vendor application, you are likely to be in breach of any support contract you may have with the vendor.
Let us assume that we have
- The technical knowledge to build an archiving solution
- The political backing to build such a solution.
You will need to know the internal workings of the application like the back of your hand. You will also need to know how the users actually use the application. You will then face the challenge of defining business rules for archiving data. The problem is that these rules are likely to be general guidelines rather than hard and fast rules that you can apply directly to a database.
An example from the field
I visited a customer's site to look at a number of problems with their database-driven web content management system. The customer authored web site content in one geographic location, but the public-facing part of the web site infrastructure was hosted at the other end of the country. This meant that authored content was replicated 400 miles to a remote location. Replication was done using the web CMS's own facilities rather than SQL Server's, and the CMS facilities were the equivalent of snapshot replication. The web site was huge, with several thousand documents created every week. Unfortunately they had reached the stage where replication could only take place over a weekend, and even then it was frequently unsuccessful due to the sheer size of the snapshot.
The need for archiving database content was being driven by the limits of the technology: it was either archive or the system failed. Fortunately the nature of the content was such that it had a reasonably short shelf life, though old content was used as a guideline for creating new content. The CMS didn't have an archive facility, so to get around this I had to use my knowledge of the system to come up with a general set of rules that would allow me to archive content.
Within the CMS I could identify the following pieces of information:
- Whether or not a web page sits within the web site structure
- When it was created/updated and when it was due to become "Active"
- Whether a page had a live revision. The CMS had versioning and revision facilities, so it was quite possible to have several revisions of a page, or indeed of many other objects.
- What CMS objects were embedded in the page
Working with the customer I was able to establish a set of base rules that could be used to archive content.
- If a page was more than six months old and had no live instances within the site structure then I could archive it.
- If an object was more than six months old and either had no live instances or was not embedded in a template used to make a page then I could get rid of it.
The CMS did allow the use of T-SQL and stored procedures to SELECT objects from the repository; however, UPDATE, INSERT and DELETE against that repository would violate the support terms for the application. Fortunately the CMS had evolved a comprehensive API that legitimately allowed creation, manipulation and deletion of content. An archiving solution could therefore be built using T-SQL to select archive material and the internal API to archive it.
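The selection half of that split might look like the sketch below. The real repository schema is proprietary, so every table and column name here is invented; only the shape of the query reflects the actual rules.

```sql
-- Hypothetical tables: dbo.Pages for content objects, dbo.SiteStructure
-- for live placements of pages within the site.
DECLARE @cutoff datetime;
SET @cutoff = DATEADD(month, -6, GETDATE());

-- Candidate pages: more than six months old with no live instance
-- anywhere in the site structure.
SELECT p.PageID
FROM   dbo.Pages AS p
WHERE  p.CreatedDate < @cutoff
  AND NOT EXISTS (SELECT 1
                  FROM   dbo.SiteStructure AS s
                  WHERE  s.PageID = p.PageID
                    AND  s.IsLive = 1);
```

Each PageID returned is then handed to the CMS API's own delete routine, never to a direct T-SQL DELETE, so the support terms stay intact.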
The CMS had the concept of "template pages". These pages were effectively the rubber stamps for creating the new content pages. These pages had the following characteristics:
- They did not sit in the site structure and hence there were no live instances in the site structure.
- They were more than six months old
Result: The all-important template pages were purged from the system. Restore development database and start again.
Although the CMS had a specific table for describing how a page was built up of templates and how objects slotted into those templates, there was a special class of object used to define hyperlinks. These objects could simply point to another object within the CMS repository, such as a physical page, a Shockwave file or virtually any content object. In effect the object functioned as an indirect reference to an object, whereas the main page/template structure table only recorded direct references.
Result: A large number of objects were purged on grounds of age and not being directly embedded in a page/template. Restore development database and start again.
The API was designed for scripts that would be run from a web page such as those to render links to pages below the current page in the site structure. It was not designed for heavy duty scripts that would take several hours to run. As a result the sheer size of the initial archiving task resulted in timeouts, memory overloads and crashing of the server.
Result: When the server crashed the deletion process was often incomplete, resulting in corrupted objects. We rewrote the process to batch up deletions and run periodic garbage collection. Restore development database and start again.
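The batching pattern itself is worth sketching. In our case the deletions went through the CMS API rather than T-SQL, but the same idea expressed in SQL Server 2000-era T-SQL looks like this (table and column names invented):

```sql
-- Generic batched-delete pattern: process a fixed number of rows per
-- pass so no single operation monopolises the server.
SET ROWCOUNT 500;               -- limit each DELETE to 500 rows

DECLARE @rows int;
SET @rows = 1;

WHILE @rows > 0
BEGIN
    DELETE FROM dbo.ArchiveCandidates
    WHERE  ReadyToPurge = 1;

    SET @rows = @@ROWCOUNT;     -- zero when nothing is left to delete

    -- Breathing space between batches; in the CMS version this is
    -- where the periodic garbage collection ran.
    WAITFOR DELAY '00:00:05';
END

SET ROWCOUNT 0;                 -- always restore the default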
Some content for the site, in particular documents concerning strategy, was written months in advance. More than six months in advance, as it turned out. We later found that such documents had a clearly identifiable security context that could be included in the WHERE clause of the T-SQL as an exclusion.
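Folding that lesson back into the candidate query gives something like the following, with the security-context column and its value invented for illustration:

```sql
-- Hypothetical names throughout; @StrategyContextID stands in for the
-- security context that marked forward-dated strategy documents.
DECLARE @cutoff datetime, @StrategyContextID int;
SET @cutoff = DATEADD(month, -6, GETDATE());
SET @StrategyContextID = 42;    -- invented value for illustration

SELECT p.PageID
FROM   dbo.Pages AS p
WHERE  p.CreatedDate < @cutoff
  AND  p.SecurityContextID <> @StrategyContextID   -- the exclusion clause
  AND NOT EXISTS (SELECT 1
                  FROM   dbo.SiteStructure AS s
                  WHERE  s.PageID = p.PageID
                    AND  s.IsLive = 1);
```

The point is that each hard-won exception became one more predicate in the selection query, rather than a change to the deletion mechanism.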
Result: Restore development database and start again.
Simply removing data from the database did not result in a reduced database file size, or a reduced data size within the file.
A full DBCC DBREINDEX on all tables, followed by a DBCC SHRINKFILE, took care of this; however, it took a full weekend to run. This required careful scheduling in the production environment and had to be run in stages. Fortunately the production servers were considerably more powerful than the development servers, so nail biting was kept to a minimum.
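The space-reclamation step looks roughly like this, with a hypothetical database and logical file name. The undocumented sp_MSforeachtable procedure saves writing a cursor over every table; a cursor over sysobjects would do the same job if you prefer to stay on documented ground.

```sql
USE CMSRepository;  -- hypothetical database name

-- Rebuild every index in the database; '?' is replaced with each
-- table name in turn.
EXEC sp_MSforeachtable 'DBCC DBREINDEX (''?'')';

-- Then shrink the data file back towards its new, smaller contents.
-- CMSRepository_Data is a hypothetical logical file name; the target
-- is in megabytes.
DBCC SHRINKFILE (CMSRepository_Data, 10240);
```

Rebuilding the indexes first compacts the pages so the shrink has contiguous free space to release; shrinking before reindexing tends to fragment the very indexes you have just rebuilt.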
Success at last - concluding thoughts
The ultimate result was as follows:
- A reliable replication process that could be run nightly vs. an unreliable process that struggled to complete over a weekend
- A noticeable improvement in site performance
- Shorter backup times due to the reduced database size, and therefore a larger available down-time window
The initial archive process carried out on several years of content was slow, but once complete the ongoing use of the archive process had a low performance overhead, including the DBCC routines afterwards.
As you can see, getting this result made building the archiving process a time-consuming and iterative exercise with many test restores along the way. The development CMS database was a full-sized copy of the live system, so execution times could be extrapolated to the live servers.
My knowledge of the internal workings of the CMS had been gleaned from five years of hard graft with the product; even so, I made mistakes that look obvious in hindsight but were anything but at the time. The problems that I have listed should give a big hint that a thorough test plan should be adhered to for testing the application whilst your archiving routine is being developed. Testing of the archive routine and its effects needs to be extremely thorough.
In the early days of the CMS I would have had to write all of the archiving routines in standard T-SQL. Even though the CMS has undergone years of development, T-SQL would still have had vastly faster execution times. This is because the API of the CMS was designed to deal with content objects on a record-by-record basis, whereas T-SQL would have used a set-based DELETE or UPDATE. Performance aside, however, using the API has the following advantages:
- Support terms for the CMS were not violated
- The correct referential integrity between objects in the system is maintained by the API
- The CMS API has matured and therefore calls to its methods will continue to work in future versions even if the underlying database schema changes
Finally, I hope this article has convinced you of the importance of defining and building an archive facility as one of the cornerstones of your database applications.