As a data person, I'm appalled at the errors made by many journalists in reporting the spread of the Covid-19 pandemic. I'm aghast at the way they can publish fantastic nonsense that would have caused apoplexy in any peer reviewer at a scientific journal. What worries me most, however, is how uncertain data, initially provided with caveats, is eventually presented to the public as certainty. This brings me to a more general point: how should we, as responsible data people, represent uncertainty in data?
Obviously, we need to track the course of any pandemic as accurately as possible, and to do this we need whatever figures are available. Once a figure is recorded, however, it takes on an undeserved aura of truth. It could have been a rough estimate, or a best effort in the face of contradictory evidence. Most sources of data on the incidence of Covid-19 are collated and published with caveats and warnings for the unwary; they are intended for a scientific audience that understands the constraints and uncertainties. When this data reaches the press, the temptation is to compare figures across regions or countries and use the results to further almost any political cause; a temptation that is seldom resisted.
The data was usually published as a daily accumulation, from zero, to a running total. When figures are uncertain, this makes sense: if you find, in retrospect, that certain facts weren't known at the time, you merely bump the total. Easy, but if this happens, the intermediate figures are invalidated. Why wouldn't deaths, for example, be certain? The major problem is attributing the cause of death. It might seem obvious, but the great majority of these deaths occur in people with other severe underlying medical conditions, or who are in generally frail health. This results in huge variation in reporting: Belgium, for example, counts all coronavirus deaths outside hospitals in its daily statistics, and deaths in care homes account for 53% of the total. Belgium's official toll also includes people suspected of having died of coronavirus without a confirmed diagnosis. Other countries count only confirmed diagnoses, even though a post-mortem can be delayed by days. Deaths in care homes tend to be under-reported in Europe, and ignored in some other countries, sometimes for political or economic reasons. What about the figures for 'confirmed cases'? A minefield, according to the EU's European Centre for Disease Prevention and Control.
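A minimal sketch makes the problem with bumping the total concrete. All numbers below are invented for illustration: we derive daily new-case counts from a cumulative series, then revise only the final total, as a reporting authority might after an audit.

```python
# Why revising a cumulative total invalidates the daily figures
# derived from it. All numbers are invented for illustration.

def daily_increments(cumulative):
    """Derive daily new-case counts from a running cumulative total."""
    return [today - yesterday
            for yesterday, today in zip([0] + cumulative, cumulative)]

# Cumulative totals as first reported, day by day.
reported = [10, 25, 40, 70]
print(daily_increments(reported))   # [10, 15, 15, 30]

# A later audit finds 12 earlier cases. The easy fix is to bump the
# final total, leaving the earlier entries untouched.
revised = [10, 25, 40, 82]
print(daily_increments(revised))    # [10, 15, 15, 42]

# The whole excess lands on the last day, so the daily series no
# longer reflects when those cases actually occurred.
```

The corrected total is right, but every per-day figure derived from the series is now wrong in a way the published data does not reveal.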
If reported deaths are so unreliable, why not just track the course of the pandemic via the excess mortality over and above the normal? The problem is accuracy: the normal variance in the death rate is too great to be useful. The scale of deaths due to the Covid-19 pandemic is within the normal variance. Certainty is elusive.
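The author's claim can be expressed as a simple signal-to-noise check: is this period's death count distinguishable from ordinary year-to-year variation? A minimal sketch, using invented weekly figures and the standard library's `statistics` module:

```python
# Is an apparent excess in deaths distinguishable from normal
# year-to-year variation? All figures are invented for illustration.
import statistics

# All-cause deaths for the same calendar week in five prior years.
baseline = [10_120, 9_870, 10_450, 9_990, 10_300]
mean = statistics.mean(baseline)
sd = statistics.stdev(baseline)

observed = 10_600   # this year's count for the same week
z = (observed - mean) / sd

# A z-score below ~2 means the excess sits inside the normal spread:
# the pandemic signal is lost in the baseline noise.
print(f"baseline mean {mean:.0f}, sd {sd:.0f}, z-score {z:.2f}")
```

With these invented numbers the z-score comes out just under 2, so a naive test would not flag the excess; only much larger excesses, or longer accumulation windows, rise clearly above the baseline variance.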
How should one represent uncertainty? We used to have a clever convention where, if the data was "sketchy", the graph was drawn in sketch form to indicate the uncertainty. Some graphing packages still allow this. We should, of course, report the variance of normally distributed data, but data like this has a high level of bias and inherent noise. This needs to be represented, because decisions based on bad data are likely to be bad decisions.
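Short of sketch-style charts (matplotlib's `plt.xkcd()` context is one surviving example of that convention), the simplest habit is to publish an interval rather than a bare number. A minimal sketch, with invented daily figures, of reporting a mean together with a rough two-standard-error band:

```python
# Report a point estimate with an uncertainty interval instead of a
# bare number. All figures are invented for illustration.
import statistics

def estimate_with_interval(samples, k=2.0):
    """Return (mean, half_width), where half_width is k standard errors."""
    mean = statistics.mean(samples)
    std_err = statistics.stdev(samples) / len(samples) ** 0.5
    return mean, k * std_err

daily_reports = [112, 98, 135, 120, 101, 90, 140]
mean, half_width = estimate_with_interval(daily_reports)

# An interval honestly conveys the spread; "114 cases/day" alone
# would imply false precision.
print(f"{mean:.0f} +/- {half_width:.0f} cases/day")
```

Note that an interval like this captures only sampling noise; the systematic bias described above (under-reporting, inconsistent attribution) is not reflected in it and needs to be stated separately.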