SQLServerCentral is supported by Redgate

Never Delete Data

Should you ever delete data?  In a production environment, do the benefits of deleting old data outweigh the possible risks?

Data quality is important.  Whether you refer to it as data integrity, permanent retention, or simply maintaining a complete audit trail, it can be effectively argued that deleting data from a production database diminishes the big picture of your data.  After all, any data that is worthy of storing, backing up, optimizing, and mining is worth storing permanently.  Deleting data affects the ability to thoroughly research historical activity, and can impact reports and aggregations on the remaining data.  Storing only rolled-up data, such as end-of-year financial reports, is often not sufficient, because auditors or financial personnel may need to drill down to the lowest level of detail.  Other information, including certain healthcare data, is best kept forever (and in some cases, is legislated so) to ensure a proper legal record should it be necessary for judicial or civil proceedings.

The need to routinely delete data was far more critical when storage was more expensive, in terms of dollars and system time.  Purchasing disks for storage has never been cheaper, and with modern 15000 RPM drives and solid state disks, data access times continue to improve.  Removing data simply for the sake of saving bytes on a platter is not as critical as it was just a few years ago.  Data can be retained indefinitely, in the original store or in a separate archive (another table or a different database altogether).

To be clear, I’m not taking aim at DBAs who use the DELETE statement to eliminate data.  A proper data retention policy would involve all levels of an organization, from the CXOs to the technical staff and end users.  And a competent retention policy doesn’t have to mandate that data remains in the RDBMS – information can be stored in the database, database backups, the filesystem, magnetic tape or optical disk, or a combination of several of these.  The specifics of permanent data storage should be dictated by how frequently or quickly the data would need to be accessed.
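When the archive store is simply another table, the move can be done atomically in T-SQL by capturing the deleted rows with an OUTPUT clause.  This is only a sketch – the table names, columns, and seven-year cutoff below are hypothetical, and the archive table must have a compatible column layout:

```sql
-- Move closed orders older than the cutoff into an archive table.
-- OUTPUT deleted.* writes each removed row into the archive as part
-- of the same statement, so no row is deleted without being archived.
DECLARE @Cutoff date = DATEADD(YEAR, -7, GETDATE());

BEGIN TRANSACTION;

DELETE FROM dbo.Orders
OUTPUT deleted.*           -- every removed row...
INTO dbo.OrdersArchive     -- ...lands in the archive first
WHERE OrderDate < @Cutoff
  AND OrderStatus = 'Closed';

COMMIT TRANSACTION;
```

Because the archive insert and the delete are one statement, a failure rolls both back together – the data is moved, never merely deleted.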

There are times when deleting data is expected and even commonplace.  When staging data in temp tables or table variables, one would expect deletion of data during that processing.  Any process that writes data out to an archive store would naturally need to delete data from the original location, though this could better be considered a move rather than a delete.  Deleting sensitive data that will never be reported on or reused is also expected, for the protection of customers or clients – removing credit card numbers after a charge is successfully posted would fall into this category.
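Note that even in the credit card case, it is usually the sensitive column that gets cleared, not the row.  A hedged sketch, with hypothetical table and column names:

```sql
-- Blank out the card number once the charge has posted, keeping the
-- rest of the payment row for history and reporting.  In T-SQL, all
-- SET expressions see the pre-update values, so RIGHT(CardNumber, 4)
-- captures the last four digits even as CardNumber is set to NULL.
UPDATE dbo.Payments
SET    CardNumber   = NULL,
       CardLastFour = RIGHT(CardNumber, 4)
WHERE  ChargeStatus = 'Posted'
  AND  CardNumber IS NOT NULL;
```

The transaction history survives intact; only the data that must not be retained is destroyed.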

Unfortunately, this decision does not reside with database administrators alone, or even with their employing organizations.  Some vendor applications will routinely delete older, less-often-used data as part of a purge intended to improve performance or decrease storage requirements.  I recently experienced this with a healthcare vendor during a conversion from their product to a newer system.  It was discovered during the planning phase of the conversion project that this vendor’s system was hard-coded to purge the detail data from old accounts.  Although we were able to reconstruct some of the data using other means, the ability to thoroughly report on that historical data has been permanently and irreversibly diminished.

The bottom line is that you should ask yourself whether you could ever need the data you are deleting.  You shouldn’t just ask whether it is likely that you will need the data again – approaching from this angle will eventually come back to bite you.  A more appropriate question would be whether you can imagine any scenario, however unlikely, that would require you to reference the data in the future.  Eventually your boss/the board/the CFO/the auditors will come calling, and you’ll be glad you have your safety net.

Tim Mitchell

Tim Mitchell is a business intelligence consultant, author, trainer, and Microsoft Data Platform MVP with over thirteen years of data management experience. He is the founder and principal of Tyleris Data Solutions.

Tim has spoken at international and local events including the SQL PASS Summit, SQLBits, and SQL Connections, along with dozens of tech fests, code camps, and SQL Saturday events. He is coauthor of the book SSIS Design Patterns, and is a contributing author on MVP Deep Dives 2.

You can visit his website and blog at TimMitchell.net or follow him on Twitter at @Tim_Mitchell.


Posted by Steve Jones on 8 April 2009

Nice post and good points. You should always be careful about permanently getting rid of data that might be used for decisions, drill down, or CYA. We had one place where we even backed up monthly data onto Excel XLS sheets and stored those on the monthly tape.

The only time I had to actually restore something more than a few days old was a legal issue where we had to go back like 5 years. Glad we had those tapes!

Posted by Wesley Brown on 11 April 2009

Having worked in the health care industry and the financial sector, never ever throwing away data is just par for the course. Usually, they find new and interesting ways to track changes to data and do analysis on it.

Hearing that you had to deal with a health care system that deleted data just scares the crap out of me.

Great post!

Posted by Dark Fat on 28 May 2009

How costly is it to delete a row of data if it is a big and busy table with quite a few indexes?

How costly is a delete in terms of locking on transactional tables that are often updated with HOLDLOCK?

I am trying to find a cost comparison between Select/Insert/Update/Delete; any help would be very much appreciated.

I tend to not delete any data purely out of performance concerns, but maybe I am doing the wrong thing and actually reducing the performance of the DB?

Posted by Tim Mitchell on 28 May 2009

DF, thanks for your comment and I welcome the questions. However, you'll probably have better luck, and will certainly receive more responses, if questions such as this are posted in the forums on this site.

As to your questions about the performance impact, the answer - as it so often is - is "It Depends".  The size of the table, the number of indexes, the number and types of locks will all play a role.  You can model this in your testing or dev environment, if you have one, though you'll have to get creative to emulate the user load on the table.
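If large deletes do prove necessary, one common way to soften the locking impact is to delete in small batches, so each transaction stays short and lock escalation to a table lock is less likely.  A sketch only – the table, column, batch size, and six-month cutoff are illustrative:

```sql
-- Delete old audit rows in batches of 5000.  Each DELETE is its own
-- short transaction, so concurrent readers and writers are blocked
-- only briefly, and the log does not grow with one huge delete.
DECLARE @BatchSize int = 5000;

WHILE 1 = 1
BEGIN
    DELETE TOP (@BatchSize)
    FROM dbo.AuditLog
    WHERE LoggedAt < DATEADD(MONTH, -6, GETDATE());

    IF @@ROWCOUNT < @BatchSize
        BREAK;  -- last (partial) batch removed; nothing left to do
END;
```

As always, measure in a test environment first – batch size is a trade-off between transaction duration and total run time.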

To be clear, my original post advocates the retention of data for retention's sake, and not to avoid the performance impact of deleting data.

Posted by Anonymous on 1 July 2009

Have you ever noticed unexpected gaps in sequences in IDENTITY columns?  Even though you’ve got
