
Human Data Quality

Disclaimer: I read and speak one language, having failed pretty well at learning Latin, Spanish, and then Japanese in my schooling. I'm sure there are more than a few people who would say I'm not doing too well with English, either!

I've got a few examples here of "data quality" issues that I've seen in emails and posts lately. I don't intend to make fun of anyone, and I'm sure I would make much worse mistakes if I were to attempt to post on a non-English site. Instead, I think these highlight some great challenges in the data world. First, my examples:

"Greet" in response to fixing something.

"I'm thinning about the best way to ..." - A post wondering about a T-SQL query.

"sintax error"

That last one might be easily corrected, and I've seen other errors that are worse (though I can't find them right now). But how smart does a routine need to be to decode these types of spelling and grammatical issues?
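As a rough illustration of how far a simple routine gets with the three examples, here is a minimal Python sketch that uses the standard library's difflib to suggest the closest word from a known list. The vocabulary, the suggest function, and the 0.8 cutoff are assumptions made up for this example, not anything from a real product.

```python
from difflib import get_close_matches

# Hypothetical vocabulary for illustration; a real checker would need a far larger one.
VOCABULARY = ["syntax", "error", "great", "greet", "thinking", "thinning"]

def suggest(word, vocabulary=VOCABULARY, cutoff=0.8):
    """Return the closest vocabulary word, or None if nothing is similar enough."""
    matches = get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
    return matches[0] if matches else None

if __name__ == "__main__":
    for word in ["sintax", "Greet", "thinning"]:
        print(word, "->", suggest(word))
```

With this list, "sintax" comes back as "syntax", but "Greet" and "thinning" come back unchanged because they are valid words in their own right. Similarity matching alone cannot tell that the writer meant "great" or "thinking"; that takes context.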

You might think a grammar checker can handle things, but I've written a lot of sentences that Word flags as having an issue without being sure what to do with them. And Word is a free-form application. Imagine trying to do some type of parsing or clean-up of data that isn't constrained with look-up tables.

Data quality is becoming a bigger and bigger issue in our world, and I'm not even sure that we realize it. More and more systems exchange data, and in greater amounts. As companies seek to work together and partner to develop new applications, they are merging data between them, depending on employees who aren't always DBAs to somehow match up data. Or they depend on automated systems to "guess" what should go where. I'm not always sure they do a good job matching up data.

And then information is lost.

Not that DBAs do a better job, but I think a human has a better chance of learning from past mistakes and correcting them in the future.
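One way to picture an automated system "guessing" while still leaving room for a person is to score each candidate match and only accept the confident ones. The Python sketch below is a minimal illustration under stated assumptions: the classify and normalize functions, the two thresholds, and the company names are all hypothetical, and difflib's similarity ratio stands in for whatever matching logic a real tool uses.

```python
from difflib import SequenceMatcher

# Hypothetical thresholds; real values would have to be tuned against real data.
AUTO_ACCEPT = 0.9
NEEDS_REVIEW = 0.6

def normalize(name):
    """Lower-case the name and collapse runs of whitespace."""
    return " ".join(name.lower().split())

def classify(incoming, known_names):
    """Return (decision, best_match) for an incoming name.

    decision is "auto", "review", or "unmatched".
    """
    target = normalize(incoming)
    best_name, best_score = None, 0.0
    for candidate in known_names:
        score = SequenceMatcher(None, normalize(candidate), target).ratio()
        if score > best_score:
            best_name, best_score = candidate, score
    if best_score >= AUTO_ACCEPT:
        return ("auto", best_name)
    if best_score >= NEEDS_REVIEW:
        return ("review", best_name)  # a person decides
    return ("unmatched", None)

if __name__ == "__main__":
    known = ["Acme Corp", "Globex Inc"]
    for name in ["ACME  corp ", "Acme Corporation", "Zenith Ltd"]:
        print(repr(name), "->", classify(name, known))
```

Here "ACME  corp " is accepted automatically, "Acme Corporation" is close enough to be queued for a human, and "Zenith Ltd" matches nothing. The middle band is where the human eyes come in.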

I'm not an ETL expert, but I think there is a tremendous amount of flexibility and power in the SSIS programming model to help you figure out how best to match up data from disparate sources and clean it before it infects your system.

The Voice of the DBA

Steve Jones is the editor of SQLServerCentral.com and visits a wide variety of data-related topics in his daily editorial. Steve has spent years working as a DBA and general-purpose Windows administrator, primarily working with SQL Server since it was ported from Sybase in 1990. You can follow Steve on Twitter at twitter.com/way0utwest


Posted by Jack Corbett on 22 April 2009

It's a serious issue in my opinion. I don't know if it's been improved in 2008, but the Fuzzy Lookup in SSIS 2005 doesn't even come close to removing the need for human eyes to get it right, or as close to right as you can.

Posted by Steve Jones on 23 April 2009

I tend to think so as well. My guess is if we were to continue adding rules for correction, at some point the system would not only be too complex to understand, but it would also be colliding with itself.
