SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
Search:  
 
 

Tim Mitchell

Tales of my travels through SQL Server
Add to Technorati Favorites Add to Google
Author Bio
Tim Mitchell is a Microsoft SQL Server consultant, developer, speaker, and trainer. He has been working with SQL Server for over 6 years, working primarily in database development, business intelligence, ETL/SSIS, and reporting. You can find his complete profile at TimMitchell.net.
Browse by Tag : Data Quality (RSS)

Never Delete Data

By Tim Mitchell in Tim Mitchell | 04-07-2009 11:01 PM | Categories: Filed under: ,
Rating: (not yet rated) Rate this |  Discuss | 1,721 Reads | 121 Reads in Last 30 Days |5 comment(s)

Should you ever delete data?  In a production environment, do the benefits of deleting old data outweigh the possible risks?

Data quality is important.  Whether you refer to it as data integrity, permanent retention, or simply maintaining a complete audit trail, it can be effectively argued that deleting data from a production database diminishes the big picture of your data.  After all, any data that is worthy of storing, backing up, optimizing, and mining is worth storing permanently.  Deleting data affects the ability to thoroughly research historical activity, and can impact reports and aggregations on the remaining data.  Storing only the rolled-up data, such as end-of-year financial reports are often not sufficient, because auditors or financial personnel may need to drill down to the lowest level of detail.  Other information, including certain healthcare data, is best kept forever (and in some cases, is legislated so) to ensure a proper legal record should it be necessary for judicial or civil proceedings.

The need to routinely delete data was far more critical when storage was more expensive, in terms of dollars and system time.  Purchasing disks for storage has never been cheaper, and with modern 15000 RPM drives and solid state disks, data access times continue to improve.  Removing data simply for the sake of saving bytes on a platter is not as critical as it was just a few years ago.  Data can be retained indefinitely, in the original store or in a separate archive (another table or a different database altogether).

To be clear, I’m not taking on DBAs who use the DELETE functionality to eliminate data.  A proper data retention policy would involve all levels of an organization, from the CXOs to the technical staff and end users.  And a competent retention policy doesn’t have to mandate that data remains in the RDBMS – information can be stored in the database, database backups, the filesystem, magnetic tape or optical disk, or a combination of several of these.  The specifics of permanent data storage should be dictated by how frequently or quickly the data would need to be accessed.

There are times when deleting data is expected and even commonplace.  When staging data in temp tables or table variables, one would expect deletion of data during that processing.  Any process that writes data out to an archive store would naturally need to delete data from the original location, though this could better be considered a move rather than a delete.  Sensitive data which would never be reported on or reused is expected for the protection of customers or clients – the deleting of credit card numbers after a charge is successfully posted would fall into this category.

Unfortunately, this decision does not reside with database administrators alone, or even with their employing organizations.  Some vendor applications will routinely delete older, less-often used data as part of a purge to better performance or decrease storage requirements.  I recently experienced this with a healthcare vendor during a conversion from their product to a newer system.  It was discovered during the planning phase of the conversion project that this vendor’s system was hard-coded to purge the detail data from old accounts.  Although we were able to reconstruct some of the data using other means, the ability to thoroughly report on that historical data has been permanently and irreversibly diminished.

The bottom line is that you should ask yourself whether you could ever need the data you are deleting.  You shouldn’t just ask whether it is likely that you will need the data again – approaching from this angle will eventually come back to bite you.  A more appropriate question would be whether you can imagine any scenario, however unlikely, that would require you to reference the data in the future.  Eventually your boss/the board/the CFO/the auditors will come calling, and you’ll be glad you have your safety net.


Data Quality, Part 1: The Discovery

By Tim Mitchell in Tim Mitchell | 09-28-2008 10:42 AM | Categories: Filed under: ,
Rating: (not yet rated) Rate this |  Discuss | 2,395 Reads | 214 Reads in Last 30 Days |no comments

I wrote a post a few months back about a healthcare data conversion project that I’ve been working on for the better part of 2 years.  My task on this project is to convert data from an old UNIX-based Universe database to a SQL Server-based application; the database we are extracting from is quite old, both in terms of technology as well as the length of time this application has been in use at this facility.  Universe, and IBM product, is a multivalue database, which in my understanding is like the anti-RDBMS, allowing for multiple values within each row/column intersection.  Getting data stored in this manner extracted and imported to SQL Server has been the quite challenging, but it pales in comparison with the difficulty we have had with the quality of the data itself. 

We discovered early in the conversion spec process that we would need to make some significant changes to the data to allow it to fit the data model expected by our new software vendor.  For example, the current application allows registration staff to enter a patient’s employer information by either direct text entry or through a lookup of existing employers in a lookup table.  This was not a problem in itself, as we could reference the lookup table during the conversion to retrieve the employer information.  However, a “feature” of the application is that a person can delete records from the lookup table, and since the patient’s visit information stored only the employer code, that information became orphaned; Bob’s Lumber Yard might be BLMYD or similar jibberish, and unless a tenured employee knew the code by heart, that data was lost to the bit bucket.

Another oddity we found was that the application allowed for the deletion of general ledger accounts, even if those accounts had transactions attached to them.  Now I’m not an accounting guy (at least I wasn’t before this project – I’ve since gotten a crash course) but even I can tell you that orphaning those transactions can throw off your bottom line, and the effects could be felt for many quarters to come.  Even though the transactions in question were several years old and I was able to address them relatively easily in the conversion, the overall quality of the data briefly came into question simply due to this little oddity.

Not all of the data quality issues we found were related to shortcomings of the system.  The most dangerous threat to data quality – user input – remains the most unpredictable wildcard in the entire process.  Staff turnover, training, and administrative policy changes through the years all contributed to inconsistencies in the data, causing us to create complex and often case-specific algorithms to correct.

Through the course of this conversion, we found a number of other similar issues that caused us grief.  However, this old application has been a workhorse and has served its purpose well despite being very difficult to learn and expensive to maintain.  For years, users have been able to use the system effectively on a day-to-day basis, and there is very little downtime.  Reporting, on the other hand, has been the key shortcoming of this dinosaur.  These data quality issues, along with the lack of an intuitive reporting tool, have made reporting from this system all but impossible.  This organization has had to spend a painful amount of staff time creating reports because of the manual work required to do so.  More often than not, the reporting process goes like this:

 The user generates a text-based report using one of two integrated reporting tools (neither of which is user-friendly or intuitive).
 Said user runs a copy/paste operation from the text report into Excel.
 Extraneous text such as page headers, etc., must be manually removed from the Excel file.
 The user creates extra rows or columns and copies/pastes the data from one cell to another to properly align the data as required for reporting.
 The user then performs all of the summarizing/sorting on the Excel document.

This is done for most of the critical reports each month.  Some reports must be generated weekly, which of course further increases staff workload.  The most discouraging thing, though, was that many users simply accepted this and didn’t know how much easier reporting could – and should – be.

In this first of what will hopefully be many posts about data quality, I’ve painted sort of a bleak picture – I generally start out on a more positive note, so please bear with me for now.  In future installments I’ll break down some of the particulars I ran into and some common (and a few uncommon) methods that can be used to improve the quality of data.


Slacking and data quality (in that order)

By Tim Mitchell in Tim Mitchell | 06-30-2008 10:10 PM | Categories: Filed under:
Rating: (not yet rated) Rate this |  Discuss | 1,645 Reads | 106 Reads in Last 30 Days |no comments

Well, I almost missed blogging for the entire month of June.  I'm sure that this fact didn't go unnoticed by both of the people who read my blog...  I'm working on a major data conversion and am in a mad dash to finish converting and validating years of healthcare and financial data, and unfortunately my free time (including the time allocated for blogging) has been scarce.  The good news is that the project - at least the data conversion piece - will be over in late September and perhaps life will return to some semblance of normalcy.

The aforementioned project has been an interesting exercise in data quality.  The system from which I am extracting data is quite old, in technology years anyway, and the application design lacks some of the keystones of modern systems - not the least of which is relational integrity.  The de facto standard for data entry was free text, which made for many (in some cases, tens of thousands) of duplicates.  Fortunately, the system to which I am converting has a well designed SQL Server backend, and in spite of a few disagreements, the vendor has been open to modifying the system to suite or needs.  As to the quality of our data, I've had lots of opportunities to expand my SSIS skills to gently (most of the time) massage the data into the target system.  I've even been able to write some code, which I don't do that much any more, for some advanced text parsing and manipulation.

Once this project is complete, I'll write a more comprehensive - and coherent - post to discuss in more detail my travels through this conversion and some of the data quality lessons I've learned.