Data Vision

  • Comments posted to this topic are about the item Data Vision

  • Do we not do that already to at least some rudimentary level?

    For example, we have Burn Charts in Scrum that can be calculated automatically and generate emails when thresholds are reached, MTTF (Mean Time To Failure) analysis built into support systems, and even event log analysers.

    I know that these are simpler than, perhaps, what Steve is suggesting, but I think that at the simplest of levels we are already on this path. My interpretation of what Steve is saying is that we are marching down this road, may be further down it than many of us realise, and that each of us should consider taking us further.

    Gaz

    -- Stop your grinnin' and drop your linen...they're everywhere!!!

  • I think the use of Machine Learning in ETL systems needs to be a bit more subtle than catching formatting errors. Formatting and data errors are easy enough to pick up and should be coded into the ETL process by anyone who knows what they're doing.

    Other errors can be much more difficult to predict, and it is these that the system should be trained to look for: data which deviates from the "normal". For a data warehouse load, examples might be:

    - Unusually high volumes of dimension updates

    - Historic sales appearing in the feed when yesterday's sales were expected

    - All customer names are suddenly a maximum of 5 characters long

    - Sales codes have jumped from "FGH000342" to "FGH010453"

    - Average sales values are much higher/lower than normal

    Any of the above cases should raise a red flag and cause the ETL process to stop immediately, before any damage is done to the contents of the data warehouse, but at the same time none of those examples would necessarily cause an error due to the format or data type being incorrect.

    By data mining the incoming data up front, you can build a "yardstick" against which new sets of incoming data can be measured automatically, along the lines of the sketch below.
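
    As a very rough illustration of that "yardstick" idea - just a sketch, with made-up column names (customer_name, sale_value) and arbitrary thresholds, not any particular ETL tool's API - the baseline could be profiled from historical loads and each new batch compared against it before the load proceeds:

    ```python
    # Minimal sketch of the "yardstick": profile historical loads once, then
    # compare each new batch against that baseline before letting the load run.
    # Column names and thresholds are illustrative assumptions only.
    from statistics import mean, stdev

    def build_baseline(historical_batches):
        """Capture 'normal' ranges for a few simple metrics from past batches."""
        row_counts   = [len(batch) for batch in historical_batches]
        sale_values  = [row["sale_value"] for batch in historical_batches for row in batch]
        name_lengths = [len(row["customer_name"]) for batch in historical_batches for row in batch]
        return {
            "rows_mean": mean(row_counts),  "rows_sd": stdev(row_counts),
            "sale_mean": mean(sale_values), "sale_sd": stdev(sale_values),
            "name_len_mean": mean(name_lengths),
        }

    def check_batch(batch, baseline, sigmas=3.0):
        """Return a list of red flags; an empty list means the batch looks normal."""
        flags = []
        if abs(len(batch) - baseline["rows_mean"]) > sigmas * baseline["rows_sd"]:
            flags.append("unusual volume of incoming rows")
        if abs(mean(r["sale_value"] for r in batch) - baseline["sale_mean"]) > sigmas * baseline["sale_sd"]:
            flags.append("average sale value far from normal")
        if mean(len(r["customer_name"]) for r in batch) < 0.5 * baseline["name_len_mean"]:
            flags.append("customer names suspiciously short")
        return flags

    # flags = check_batch(todays_feed, build_baseline(past_feeds))
    # if flags: stop the ETL process and alert someone before the warehouse is touched
    ```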

  • Or, we could do what we are already supposed to do, and check the data being entered before saving it.

    Your idea is a good one, but the example just struck me wrong. It is kind of like passing more gun control laws when we already have far too many in place. Passing a new law doesn't fix the root issue, nor does coming up with a workaround for poor programming. The programmers who are too lazy to check the data they accept are the same ones who won't use your idea correctly either!

    Not sure why, but I sure seem to be stuck in a root cause analysis rut at this point!

    Dave

  • The emerging data science field is already making this a reality. There are very well-known statistical principles and methods that can be employed to analyze just about anything.

    These can be used not only to analyze the incoming data for errors, but also to examine various facets of the ETL process itself and alert the appropriate people to potential issues. It could be as simple as monitoring the mean processing time weighted by the amount of data processed (or other factors) to indicate possible performance issues. Time-series analysis could be used to detect and troubleshoot those troublesome "intermittent" ETL problems that never seem to show up when someone from the IT department is looking. A time-series graph of the process could be compared against other data about the network, server, other infrastructure, or even non-IT data for possible correlations to investigate further (which may even help in performing an RCA).
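
    To make the processing-time example a bit more concrete, here is a hedged sketch - the run history is assumed to be a simple list of (rows_processed, elapsed_seconds) pairs, with no particular monitoring tool implied - that flags runs whose per-row throughput drifts well away from the historical norm:

    ```python
    # Sketch: flag ETL runs whose processing time, weighted by volume, drifts
    # from the historical norm. The (rows, seconds) run-history structure is an
    # assumption for illustration, not any specific tool's format.
    from statistics import mean, stdev

    def flag_slow_runs(history, recent, sigmas=2.0):
        """history and recent are lists of (rows_processed, elapsed_seconds) pairs."""
        per_row_history = [secs / rows for rows, secs in history if rows]
        baseline, spread = mean(per_row_history), stdev(per_row_history)
        alerts = []
        for run_no, (rows, secs) in enumerate(recent):
            per_row = secs / rows if rows else float("inf")
            if per_row > baseline + sigmas * spread:
                alerts.append((run_no, per_row))  # candidates for a closer look / RCA
        return alerts
    ```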

    Point being: we very often get into the "if it's not throwing an error, don't fix it" mode (which has a great deal of wisdom to it, for certain), but there is also a place and time to proactively evaluate our systems and look at ways things could improve.

    ____________
    Just my $0.02 from over here in the cheap seats of the peanut gallery - please adjust for inflation and/or your local currency.

  • lshanahan (6/12/2013)


    Point being: ...but there is also a place and time to proactively evaluate our systems and look at ways things could improve.

    Which is exactly my point. Too many people in this field belong elsewhere; maybe they can be politicians or something equally useless!

    Frequently I get told I am too picky about doing things right, to which I respond "that is what I am paid to do!" We need to follow proper testing methodologies PRIOR TO RELEASE, not wait for our customers to find our stupidity or errors. We all make mistakes, and may never catch all our errors in testing, but we damn sure better catch our stupidity. Not checking the input data before processing it is just stupid. We all know it can lead to all kinds of problems, not just incorrect data.

    Dave

  • We all make mistakes, and may never catch all our errors in testing, but we damn sure better catch our stupidity. Not checking the input data before processing it is just stupid. We all know it can lead to all kinds of problems, not just incorrect data.

    Agreed, though I was going in a slightly different direction. I was also thinking of situations where statistical models and predictive analytics may help not only catch errors (and the stupid stuff) but also alert us to potential issues that are very hard to monitor and don't involve errors per se, so that action can be taken ahead of time to minimize or even completely avoid problems.

    ____________
    Just my $0.02 from over here in the cheap seats of the peanut gallery - please adjust for inflation and/or your local currency.

  • lshanahan (6/12/2013)


    We all make mistakes, and may never catch all our errors in testing, but we damn sure better catch our stupidity. Not checking the input data before processing it is just stupid. We all know it can lead to all kinds of problems, not just incorrect data.

    Agreed, though I was going in a slightly different direction. I was also thinking of situations where statistical models and predictive analytics may help not only catch errors (and the stupid stuff) but also alert us to potential issues that are very hard to monitor and don't involve errors per se, so that action can be taken ahead of time to minimize or even completely avoid problems.

    I agree with that, but I'm also thinking that we might find that tools can learn to suggest better options as well, such as looking at new features in the current version when editing an older package, or noting that the volume of data going through an SCD task is too large and that it will be slower than an alternative approach.

  • djackson 22568 (6/12/2013)


    lshanahan (6/12/2013)


    Point being: ...but there is also a place and time to proactively evaluate our systems and look at ways things could improve.

    Which is exactly my point. Too many people in this field belong elsewhere, maybe they can be politicians or something equally useless!

    Frequently I get told I am too picky about doing things right, to which I respond "that is what I am paid to do!" We need to follow proper testing methodologies PRIOR TO RELEASE, not wait for our customers to find our stupidity or errors. We all make mistakes, and may never catch all our errors in testing, but we damn sure better catch our stupidity. Not checking the input data before processing it is just stupid. We all know it can lead to all kinds of problems, not just incorrect data.

    That is a good goal to have. I believe our industry suffers mostly from poor design and testing, and that we really need to worry about those things instead of focusing on "see, my application is so much better than my competitors', so please pay me my commission!" Marketing and sales want profit, and seem not to care a whit about quality. Quality drives profits, though, and maybe one day that will become apparent.

    Dave

  • Good topic! I agree that creating systems to identify anomalous data is a great idea, and it's actually something that just about everyone reading this blog is already familiar with on some level. Any alert system that flags erroneous data or failed systems falls generally under this umbrella. With larger and more sophisticated data systems, we can create algorithms to flag more subtle irregularities.

    While a good visualization can allow humans to see irregularities in seconds that would take hours of poring over raw data to spot, when it comes to automating this it may be helpful to remember that visualizations are simply something we humans create to leverage our visual skills. Computers don't necessarily need the actual visualization to spot the pattern, just some clever algorithms -- algorithms that may or may not lend themselves to revealing visualizations. We may be able to come up with algorithms that can identify anomalous data we'd be hard pressed to visualize, if we free ourselves from the constraint of "visualization".
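
    As one small example of an algorithm that spots an outlier without ever drawing the picture, a robust z-score based on the median and MAD is a common choice (the 3.5 threshold here is just the conventional rule of thumb, not a recommendation):

    ```python
    # A modified (robust) z-score: flags outliers directly from the numbers,
    # no visualization required. Threshold 3.5 is the usual rule of thumb.
    from statistics import median

    def robust_outliers(values, threshold=3.5):
        med = median(values)
        mad = median(abs(v - med) for v in values) or 1e-9  # guard against zero MAD
        # 0.6745 rescales the MAD so the score is roughly comparable to a z-score
        return [v for v in values if abs(0.6745 * (v - med) / mad) > threshold]

    # robust_outliers([10, 11, 9, 10, 12, 250])  ->  [250]
    ```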

    In the meantime, let's hear it for all of those folks who have found ways to visualize large amounts of data in simple form so that users can rapidly spot anomalies. And a "better late than never" blank square of participation award to MS for providing the PowerView visualization tool set.

  • Steve Jones - SSC Editor (6/12/2013)


    lshanahan (6/12/2013)


    We all make mistakes, and may never catch all our errors in testing, but we damn sure better catch our stupidity. Not checking the input data before processing it is just stupid. We all know it can lead to all kinds of problems, not just incorrect data.

    Agreed, though I was going in a slightly different direction. I was also thinking of situations where statistical models and predictive analytics may help not only catch errors (and the stupid stuff) but also alert us to potential issues that are very hard to monitor and don't involve errors per se, so that action can be taken ahead of time to minimize or even completely avoid problems.

    I agree with that, but I'm also thinking that we might find that tools can learn to suggest better options as well, such as looking at new features in the current version when editing an older package, or noting that the volume of data going through an SCD task is too large and that it will be slower than an alternative approach.

    Now that is intriguing, because I don't see humans being suited to that task, at least not very efficiently.

    I solved a performance/stability issue once by searching source code for instances where we made calls to C++ memory allocation functions. It would be somewhat trivial to produce a tool to handle this, or reconfigure the compiler to look for times when we allocated but did not delete.
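
    Purely as an illustration of how small such a first-pass tool can be - this naive sketch just counts raw new/delete occurrences per file, which a real static analyzer or instrumented build would do far more accurately:

    ```python
    # Naive first-pass scan: count raw new/delete occurrences per C++ source
    # file and report files where the counts don't line up. Only a hint of a
    # possible leak, not a substitute for a real analyzer.
    import re
    from pathlib import Path

    NEW_RE    = re.compile(r"\bnew\b")
    DELETE_RE = re.compile(r"\bdelete\b")

    def allocation_imbalance(src_dir):
        report = {}
        for path in Path(src_dir).rglob("*.cpp"):
            text = path.read_text(errors="ignore")
            news, deletes = len(NEW_RE.findall(text)), len(DELETE_RE.findall(text))
            if news != deletes:
                report[str(path)] = (news, deletes)
        return report  # {file: (new_count, delete_count)} where they differ
    ```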

    Using a tool to do those things which we aren't very efficient at is a great use of resources.

    Dave

  • djackson 22568 (6/12/2013)


    Now that is intriguing, because I don't see humans being suited to that task, at least not very efficiently.

    I solved a performance/stability issue once by searching source code for instances where we made calls to C++ memory allocation functions. It would be somewhat trivial to produce a tool to handle this, or reconfigure the compiler to look for times when we allocated but did not delete.

    Using a tool to do those things which we aren't very efficient at is a great use of resources.

    Hi Dave, I won't argue with your current train of thought, but I might add something. Not only should we control the memory leakage created by a "use, leave, and move on" strategy, but we should also monitor the number of calls made to certain routines.

    Years back, CICS used to track this in the IBM environs, and we could see what was happening even at the system subroutine level. With those numbers in hand we could determine whether the system was actually reusing the allocated mapped instance or was, for some reason, not sharing an instance and always creating a new one. That left us at times with memory blocks all over the place and a massive cleanup effort going on regularly at the OS level. When we were able to identify those routines, we determined whether they were really necessary, whether we were abusing a routine unnecessarily, or whether it was simply popular. If it was popular, we could load it early in the CICS region and make it stay resident, almost forcing programs to use the one instance. That saved a lot of execution time.

    M....

    Not all gray hairs are Dinosaurs!

  • Steve, like you I find visualizations of data very helpful, and they are excellent tools for data quality validation and checking as well. Those charts and maps with odd spikes, or locations mapped way out in the water, are very obvious if you take the time to visualize, chart, or map your data.

    Visual presentations of data, if they can be prepared quickly, can be a nice tool for spotting trends in the data. This is not news, and I know that millions have said it already, but it still remains true. If we take the time to do the preliminary analysis, even after the data is captured or uploaded, we can identify oddities or trends we might want to investigate. In doing so we might find, for example, that constant-monitoring devices calibrated by a new employee were not set right before they went into the field, and the data is a couple of decimal places to the right or the left. And the neat thing is that if we use the visualizations in a dashboard, we will quickly see any variation in the data we monitor.
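
    A tiny sketch of how even that decimal-place slip could be caught automatically before it ever hits a dashboard - assuming positive readings, with the function name and sample data purely illustrative:

    ```python
    # Sketch: detect an order-of-magnitude (decimal place) shift between
    # historical readings and a new batch, e.g. from a miscalibrated device.
    # Assumes positive readings; names and sample data are illustrative.
    from math import log10
    from statistics import median

    def decimal_shift(historical_readings, new_readings):
        """Approximate power-of-ten shift between old and new data (0 = none)."""
        return round(log10(median(new_readings)) - log10(median(historical_readings)))

    # decimal_shift([101.2, 98.7, 102.4], [1.01, 0.99, 1.03])  ->  -2
    ```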

    I have built OLAP cubes over datasets just to see what is going on, and have been rewarded with information and interesting facts concerning the data collections by doing so.

    Loved the post!

    M.

    Not all gray hairs are Dinosaurs!

  • Visualizations are a great place to start, but I'm not sure they're often good for decisions. It's better to dig further into the data itself and verify with the details. However, they do get us to think about data in different ways.
