A Lightweight, Self-adjusting, Baseline-less Data Monitor

  • Alex Chamchourine

    SSC Eights!

    Points: 911

    Comments posted to this topic are about the item A Lightweight, Self-adjusting, Baseline-less Data Monitor

  • kenambrose


    Points: 2483

    I've advocated for years that many of our metrics for data quality should be based on statistical analysis rather than brute force counting of every single value in every single row.

    At one of my contracts we were collecting data from a consumer "internet streaming" device that "called home" constantly while in use in millions of consumer locations.

    There were times when a firmware update could - and did - unintentionally break "part" of the data collection but not all of it. For example the volume of data coming in might be normal, but a particular data value might be completely broken.

    I implemented a sampling algorithm in TSQL that dynamically calculated the standard deviation for the occurrence and range of selected values in a table on a daily basis, and could then flag any value ranges in the current/latest data load that fell outside that standard deviation.

    So we could quickly detect any unplanned/unexpected loss of data in the collection system. Without an automated system it would have been prohibitively expensive to manually verify the data on a daily basis...

    And without statistical analysis it would have been too expensive in computing resources and time.
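    The approach kenambrose describes can be sketched in a few lines. This is a hypothetical illustration in Python rather than the original T-SQL (which isn't shown in the thread): track a value's daily occurrence counts, compute the mean and standard deviation over a trailing window, and flag the latest load if it falls more than k standard deviations from the mean.

    ```python
    # Hedged sketch of the idea, not the original implementation:
    # flag the latest daily load when a value's occurrence count
    # deviates from its recent history by more than k standard deviations.
    from statistics import mean, stdev

    def is_anomalous(daily_counts, latest_count, k=3.0):
        """Return True if latest_count is more than k standard
        deviations away from the mean of the historical counts."""
        mu = mean(daily_counts)
        sigma = stdev(daily_counts)
        if sigma == 0:
            return latest_count != mu
        return abs(latest_count - mu) > k * sigma

    # Example: a firmware update silently breaks collection of one field,
    # so its daily count collapses while total data volume looks normal.
    history = [1010, 987, 1003, 995, 1021, 998, 1005]
    assert not is_anomalous(history, 1002)  # ordinary day, no alert
    assert is_anomalous(history, 140)       # partial breakage, alert
    ```

    Because only per-value counts and ranges are sampled and aggregated, the check stays cheap even when verifying every row daily would be prohibitively expensive.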

  • GPO

    SSCarpal Tunnel

    Points: 4556

    Excellent article and some very cogent comments by kenambrose.

    ...One of the symptoms of an approaching nervous breakdown is the belief that one's work is terribly important... -- Bertrand Russell

  • Alan Burstein

    SSC Guru

    Points: 61088

    Excellent article, well written and an easy read. I need to look at/play around with your code a little more to understand what you're doing but I love the overall approach.

    "I can't stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • David McKinney


    Points: 10449

    Very interesting article...having done a little reading recently on Kaggle and analysing massive datasets with Python and R, I wonder whether the recent R extensions to SQL Server might be well suited for this type of analysis? Any thoughts?

  • Steve Jones - SSC Editor

    SSC Guru

    Points: 720963

    The R or Python stuff is useful if the statistical work is more complex. Most of what can be done in R can easily work in Python, which is a great language for linking systems or data together.

    The R Services will allow you to run R in SQL Server without the data movement in/out of SQL, which is nice, but that's only 2016. For most people, and most systems, this won't be reality for years.

  • akljfhnlaflkj

    SSC Guru

    Points: 76202

    Great article, thanks.

  • zintp

    SSChasing Mays

    Points: 608

    Very interesting article, that I tucked away and recently had a reason to revisit. I am not sure I am understanding it completely. Do you have any other references to the technique you are using that you could share? Is this just a standard statistical method of forming a probability distribution from actual data?

    Thanks for any pointers you can provide.

  • logitestus


    Points: 2878

    I just spent 30 minutes reading and re-reading this article. It is well thought out, noting its strong points and where it can be weak. The writer presents the ideas in a clear and concise manner. Bravo! I wish more articles discussing an interesting idea were like this one.

  • John N Hick

    Ten Centuries

    Points: 1315

    I am sorry that I didn't see this last year - great article!
