Slacking and data quality (in that order)

By Tim Mitchell, 2008/06/30

Well, I almost missed blogging for the entire month of June.  I'm sure that this fact didn't go unnoticed by both of the people who read my blog...  I'm working on a major data conversion and am in a mad dash to finish converting and validating years of healthcare and financial data, and unfortunately my free time (including the time allocated for blogging) has been scarce.  The good news is that the project - at least the data conversion piece - will be over in late September and perhaps life will return to some semblance of normalcy.

The aforementioned project has been an interesting exercise in data quality.  The system from which I am extracting data is quite old, in technology years anyway, and the application design lacks some of the keystones of modern systems - not the least of which is relational integrity.  The de facto standard for data entry was free text, which made for many duplicates (in some cases, tens of thousands).  Fortunately, the system to which I am converting has a well-designed SQL Server backend, and in spite of a few disagreements, the vendor has been open to modifying the system to suit our needs.  As to the quality of our data, I've had lots of opportunities to expand my SSIS skills to gently (most of the time) massage the data into the target system.  I've even been able to write some code, which I don't do that much anymore, for some advanced text parsing and manipulation.

Once this project is complete, I'll write a more comprehensive - and coherent - post to discuss in more detail my travels through this conversion and some of the data quality lessons I've learned.

