Comments posted to this topic are about the item A Lightweight, Self-adjusting, Baseline-less Data Monitor
I've advocated for years that many of our metrics for data quality should be based on statistical analysis rather than brute force counting of every single value in every single row.
At one of my contracts we were collecting data from a consumer "internet streaming" device that "called home" constantly while in use in millions of consumer locations.
There were times when a firmware update could - and did - unintentionally break "part" of the data collection but not all of it. For example the volume of data coming in might be normal, but a particular data value might be completely broken.
I implemented a sampling algorithm in TSQL that dynamically calculated the standard deviation for occurrence and range of selected values in a table on a daily basis and could then determine if the current/latest data load contained any value ranges that exceeded the standard deviation.
So we could quickly detect any unplanned/unexpected loss of data in the collection system. Without an automated system it would have been prohibitively expensive to manually verify the data on a daily basis...
And without statistical analysis it would have been too expensive in computing resources and time.
Excellent article and some very cogent comments by kenambcrose.
...One of the symptoms of an approaching nervous breakdown is the belief that ones work is terribly important.... Bertrand Russell
Excellent article, well written and an easy read. I need to look at/play around with your code a little more to understand what you're doing but I love the overall approach.
-- Itzik Ben-Gan 2001
Very interesting article...having done a little reading recently on Kaggle and analysing massive datasets with Python and R, I wonder if the recent R extensions to SQL Server would not be best suited for this type of analysis? Any thoughts?
The R or Python stuff is useful if the statistical work is more complex. Most of what can be done in R can easily work in Python, which is a great language for linking systems or data together.
The R Services will allow you to run R in SQL Server without the data movement in/out of SQL, which is nice, but that's only 2016. For most people, and most systems, this won't be reality for years.
Follow me on Twitter: http://www.twitter.com/way0utwest
Forum Etiquette: How to post data/code on a forum to get the best help
My Blog: www.voiceofthedba.com
Great article, thanks.
Very interesting article, that I tucked away and recently had a reason to revisit. I am not sure I am understanding it completely. Do you have any other references to the technique you are using that you could share? Is this just a standard statistical method of forming a probability distribution from actual data?
Thanks for any pointers you can provide.
I just spent 30 minutes reading and re-reading this article. It is well thought out, noting it's strong points and where it can be weak. The writer presents the ideas in a clear and concise manner. Bravo! I wish more articles discussing an interesting idea were like this one.
I am sorry that I didn't see this last year - great article!
Viewing 10 posts - 1 through 9 (of 9 total)