Database professionals soon learn, through experience, a deep respect for data quality. I remember vividly the first time I learned this lesson, while building applications for dealers on the London Metal Exchange. One of my SUM aggregations returned the wrong answer (double-entry bookkeeping caught the error). Due to a slight problem with the BCD math package I was using, the grand total was a couple of pennies out in totals of around five million pounds. Jokingly, I offered to pay the stockbroker the difference, but he was horrified: "Either data is right or it is wrong. There is no in-between. Get it right!"
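The penny-level drift is easy to reproduce. A minimal Python sketch (not the original BCD package, just an illustration of the same class of error) shows how binary floating point loses pennies that exact decimal arithmetic keeps:

```python
from decimal import Decimal

# Ten payments of 0.10 pounds should total exactly 1.00 pound
# (illustrative figures, not the original Metal Exchange data).
payments = ["0.10"] * 10

# Binary floating point cannot represent 0.10 exactly, so the tiny
# representation error compounds with every addition...
float_total = sum(float(p) for p in payments)

# ...whereas decimal arithmetic keeps the sum exact to the penny.
decimal_total = sum(Decimal(p) for p in payments)

print(float_total == 1.0)                 # False
print(decimal_total == Decimal("1.00"))   # True
```

Scale those ten payments up to a ledger of millions of pounds and the accumulated error surfaces as exactly the kind of "couple of pennies out" total that so horrified my stockbroker.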
It's a lesson that stayed with me. In that case, we were dealing with exact measurements and, if the data is 'good', we can trust the judgments we base on its analysis, provided we get the calculations right. More generally, we live in a world of uncertainty and have to be clear about the level of uncertainty when we present figures. Moreover, if data is 'bad', it is very difficult to 'cleanse' it in a way that lets us rely on the calculations and decisions we derive from it. There is no magic cleansing agent in statistics. Unfortunately, there are cases where important decisions based on 'bad', or at least 'uncertain', data can cost lives, as in the recent scandal that hit the UK's Mid-Staffordshire NHS Foundation Trust.
In the UK, hospitals are ranked, on behalf of the government, on their Hospital Standardized Mortality Ratios (HSMRs). In short, and as described in more detail here, hospitals attribute "diagnostic codes" to their patients based on the disorders and diseases from which they are suffering. The HSMRs derived from these codes aim to account for every important variable that determines whether a patient admitted to hospital lives or dies, so that what remains is a direct comparison of the quality of care across hospitals. The idea is that low-ranking hospitals gain an incentive to improve their quality of care, and the public can select the best hospital in their area.
It's a good idea, but reality got in the way. Firstly, the recording of the diagnosis for a patient is not always accurate. Last year, for example, the Hospital Episode Statistics (HES) data, which converts hospitals' records into internationally recognized ICD or OPCS coding, recorded that 16,992 of the 785,263 patients coded as having had "in-patient Obstetrics episodes" were male. Hmm. Wrong.
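Impossible combinations like this are exactly what automated consistency checks exist to catch. A minimal sketch of such a check, in Python with invented field names ("sex", "episode_code" are not the actual HES schema), might look like this:

```python
# Hypothetical patient episode records; the field names and values are
# illustrative only, not the real HES/ICD coding.
records = [
    {"patient_id": 1, "sex": "F", "episode_code": "obstetric"},
    {"patient_id": 2, "sex": "M", "episode_code": "obstetric"},  # impossible
    {"patient_id": 3, "sex": "M", "episode_code": "cardiology"},
]

# Flag any record whose coded episode is inconsistent with the
# recorded sex, for human review rather than silent 'cleansing'.
suspect = [
    r for r in records
    if r["episode_code"] == "obstetric" and r["sex"] != "F"
]

for r in suspect:
    print(f"Flag for review: patient {r['patient_id']}")
```

Note that the check only flags records for review; as argued above, silently 'correcting' them would be just another form of unreliable cleansing.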
Even more worrying is what happens if a hospital decides that a low rank is not a problem with its care, but with its coding. For example, the "palliative care" code can have a significant impact in reducing a hospital's HSMR. If a patient is assigned this code, allowances are made in the HSMR calculation to prevent hospitals from being blamed in cases where a patient's life cannot be saved. The use of this code has increased, for valid reasons in many cases, but the fear is that hospitals can respond to poor rankings not with proper inspections and improved care procedures, but by disguising the true mortality rates with data 'cleansing' (recoding), so putting the lives of patients at risk.
The problem in Mid-Staffordshire seems to have been one of managers, monitoring the quality of care at their hospitals, putting too much faith in data that was divorced from reality. The data said that mortality rates were low, in direct contradiction of the testimony of relatives who believed their loved ones had died unnecessarily; that testimony went unheeded. "They must be wrong, because we have the data." When the Francis Report was published in February, the government identified a culture of 'metrics and league tables' in the way hospitals are judged as a key factor in the scandal.
As database professionals, we are all too familiar with the concept of Bad Data, and we have the experience to spot it and prevent its misuse. Indeed, perhaps it is time we took the lead in ensuring that the specialism of 'Data Scientist' is founded on the responsible use of data and respect for data quality.