• I've advocated for years that many of our metrics for data quality should be based on statistical analysis rather than brute-force counting of every single value in every single row.

    On one of my contracts, we were collecting data from a consumer "internet streaming" device that "called home" constantly while in use in millions of consumer locations.

    There were times when a firmware update could - and did - unintentionally break "part" of the data collection but not all of it. For example, the volume of data coming in might be normal, but a particular data value might be completely broken.

    I implemented a sampling algorithm in T-SQL that, on a daily basis, dynamically calculated the standard deviation for the occurrence and range of selected values in a table, and could then determine whether the current/latest data load contained any value ranges that exceeded that standard deviation.
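
    A minimal sketch of that kind of check, assuming a hypothetical telemetry table and column names (dbo.DeviceTelemetry, LoadDate, VolumeLevel) and a 30-day baseline, might look something like this in T-SQL:

        -- Compare today's row count and non-null count for one metric against
        -- the mean of the prior daily loads, flagging anything more than
        -- @NumStdDev standard deviations away. Table/column names are hypothetical.
        DECLARE @Today DATE = CAST(GETDATE() AS DATE);
        DECLARE @NumStdDev FLOAT = 3.0;   -- alert threshold

        WITH DailyStats AS (
            SELECT LoadDate,
                   COUNT(*)           AS RowCnt,      -- overall load volume
                   COUNT(VolumeLevel) AS NonNullCnt   -- rows where this value was actually populated
            FROM   dbo.DeviceTelemetry
            WHERE  LoadDate >= DATEADD(DAY, -30, @Today)
            GROUP  BY LoadDate
        ),
        Baseline AS (
            SELECT AVG(CAST(RowCnt AS FLOAT))       AS AvgRows,
                   STDEV(CAST(RowCnt AS FLOAT))     AS StdRows,
                   AVG(CAST(NonNullCnt AS FLOAT))   AS AvgNonNull,
                   STDEV(CAST(NonNullCnt AS FLOAT)) AS StdNonNull
            FROM   DailyStats
            WHERE  LoadDate < @Today                 -- exclude today from the baseline
        )
        SELECT d.LoadDate, d.RowCnt, d.NonNullCnt
        FROM   DailyStats d
        CROSS  JOIN Baseline b
        WHERE  d.LoadDate = @Today
          AND ( ABS(d.RowCnt     - b.AvgRows)    > @NumStdDev * b.StdRows
             OR ABS(d.NonNullCnt - b.AvgNonNull) > @NumStdDev * b.StdNonNull );

    The same pattern extends to the min/max or average of the value itself, and the daily scan can be kept cheap by sampling the table (e.g. TABLESAMPLE) rather than reading every row, which is where the statistical approach pays off.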

    So we could quickly detect any unplanned/unexpected loss of data in the collection system. Without an automated system, it would have been prohibitively expensive to manually verify the data on a daily basis...

    And without statistical analysis it would have been too expensive in computing resources and time.