When Should We Be Cleaning Data?

  • Comments posted to this topic are about the item When Should We Be Cleaning Data?

  • It is not data per se that causes the problem. Most data is good.

    Some data is bad because an application allows users to input bad data. This data can be cleansed as it comes in, according to business rules that can be determined beforehand.

    Create a process issue, implement a workaround, and fix the application to follow the business rules.

    However, in a large enough data pool there is also ugly data.

    This data looks right, follows all the business rules, is consistent - and is still plain wrong. It arises from typos and incorrect information supplied; it is caused by people.

    The only way to find these anomalies is to expose the data to the right audience - then have a process to fix the data.
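    To sketch the "cleanse on input against known business rules" idea in code (the field names and rules below are made up for illustration, not from any real system):

```python
# Hypothetical business-rule validation on input: "bad" data that
# violates known rules is caught before it reaches the database.
from datetime import date

def validate_customer(record: dict) -> list[str]:
    """Return a list of rule violations; an empty list means the record passes."""
    errors = []
    if not record.get("name", "").strip():
        errors.append("name is required")
    dob = record.get("birth_date")
    if dob is not None and dob > date.today():
        errors.append("birth_date cannot be in the future")
    if record.get("country") not in {"US", "GB", "DE"}:  # assumed rule
        errors.append("country not in approved list")
    return errors

print(validate_customer({"name": "Ann", "birth_date": date(1990, 1, 1), "country": "US"}))  # → []
```

    Note that "ugly" data sails through checks like these - a plausible but mistyped birth date violates no rule - which is exactly why exposing the data to the right audience is still needed.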

  • Steve poses a good question. It's one that has exercised my mind from time to time. A purist (that's me) would argue GIGO (Garbage In, Garbage Out) and say that we should ensure we get the right data in the first place. On the other hand, a system user (that's me too) who has encountered mandatory fields which I do not understand knows the frustration that can cause. I've dealt with the consequences of real data corruption - we can all agree we don't want that!

    I think Steve points to one good solution, and one I'm not sure I agree with. The one I agree with is making what is "necessary for action" mandatory, but not the things which are just a convenience or "more of a burden than a necessity". The problem with that is: what do you do about the optional fields? The solution I'm dubious about is "building in defaults". That is acceptable, but only so long as we recognise that we are changing what the user entered and probably introducing bias into the data (the default is quite likely to become a common, maybe the most common, category).

    It seems to me that people are simply not spending enough effort deciding whether particular pieces of data are mandatory or not. I understand why it happens, but the consequences appear later on: either a user interface which people do not like (inappropriate mandatory fields), or a system with NULL columns which become inconvenient, or a system that "second guesses what people want" and gets it wrong!

    The problem is not really a technical one, but one of understanding what the system should be trying to achieve. The problem is one of "requirements" and "understanding".

    Tom Gillies - LinkedIn Profile - www.DuhallowGreyGeek.com

  • The kind of bad data I deal with is data about a person such as birth date, name, SSN. When we bring in data for an existing person that is slightly off in multiple areas, we may create what we later discover are duplicate records for the same person. Then we have to merge the data back together and clean up the problems that multiple accounts have caused.
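    A rough sketch of how "slightly off" person records can be flagged as probable duplicates before they cause a merge headache - the field names, similarity measure, and threshold here are all assumptions, not a production matching rule:

```python
# Score similarity across several identity fields; records that are
# nearly identical (typos, transposed characters) are flagged for review.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def likely_same_person(r1: dict, r2: dict, threshold: float = 0.85) -> bool:
    score = (
        similarity(r1["name"], r2["name"])
        + similarity(r1["ssn"], r2["ssn"])
        + similarity(r1["birth_date"], r2["birth_date"])
    ) / 3
    return score >= threshold

a = {"name": "Jon Smith",  "ssn": "123-45-6789", "birth_date": "1980-01-31"}
b = {"name": "John Smith", "ssn": "123-45-6789", "birth_date": "1980-01-31"}
print(likely_same_person(a, b))  # → True
```

    Flagging at load time is far cheaper than merging accounts after the duplicate has accumulated its own history.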

  • A couple things. In terms of defaults, I think defaults work well, if we know they're defaults. That means that downstream systems and analysis need to know the user allowed a default to flow through and didn't necessarily make a decision. That's an important thing to know.

    Some of this is also human nature. I can't force a salesman to input extra data if they don't have it, don't have time, or don't want to. I also have to allow for saving some data when I lose connections or have issues.

    The big thing I want to do with bad (or missing data) is have a way to follow up. Ask the user again later to input more, or fix things.

    I'm completely in agreement with Steve about the "if we know they're defaults". If we know, then the thinking and implementation downstream can and _should_ take that into account. As Steve says - "it's important".

    I agree there is no point in trying to get a salesman to fill in data they don't have, etc. Trying to do that just results in a clunky user interface which people try to circumvent with "workarounds" which are usually worse than whatever you might do with defaults, because at the very least defaults are consistent and predictable.

    Having something which salvages "work in progress" is a good idea too. It complicates matters, but it is still a good idea. Which leads on to Steve's last suggestion - knowing that something/someone needs to follow up on a particular set of data. I've thought about that idea before (I once toyed with the idea of an application deliberately designed to handle incomplete or suspect data) but I don't think I've ever come across a situation where it was implemented. Or perhaps I have, and didn't recognise it for what it was.

    Tom Gillies - LinkedIn Profile - www.DuhallowGreyGeek.com

  • If you have a default that means "not specified" then all well and good. If you have a default that presumes someone filled something in then that is dangerous, particularly if that data plays a large part in making a business decision.

    Much as I don't like to see loads of nullable columns in a schema intended for an RDBMS, I'm not sure that nullable fields aren't the lesser of two evils when the second is an ill-thought-out default.

    I think Phil Factor's recent "Spinach" blog post made a good point. If you have a batch of data with dodgy values in it, you cannot simply cull the records with dodgy values. How do you know that the "good" values are genuine rather than bad data that looks plausible?
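    The danger of a presumed default versus an explicit "not specified" marker can be sketched in a few lines (the column and value names are invented for illustration):

```python
# With a silently presumed default, real data and defaulted data are
# indistinguishable; with an explicit None ("not specified"), a business
# calculation can separate what is known from what is missing.
orders_presumed = [              # qty silently defaulted to 1
    {"sku": "A", "qty": 1},      # a real order of 1? or just the default?
    {"sku": "B", "qty": 1},
]
orders_explicit = [              # None means "not specified"
    {"sku": "A", "qty": None},
    {"sku": "B", "qty": 1},
]

total_presumed = sum(o["qty"] for o in orders_presumed)  # possibly wrong total
known = [o["qty"] for o in orders_explicit if o["qty"] is not None]
missing = len(orders_explicit) - len(known)
print(total_presumed, sum(known), missing)  # → 2 1 1
```

    The explicit version forces the decision about what to do with the unknowns out into the open instead of burying it in the data.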

    I think that sometimes the amount of required data is simply out of our hands. I now work for a State agency, which needs to report data to different Federal agencies. There's a boatload of data we've got to collect and report on, and it isn't optional. And sometimes, at least to me, one question seems very close to another, and yet we've got to collect data on both. I just bring this up to say that, although I agree with Steve in principle that it's better to have as few mandatory fields as possible in a table, sometimes you just don't have a choice.

    Kindest Regards, Rod Connect with me on LinkedIn.

