SQLServerCentral forum thread: Data Vision
Posted Wednesday, June 12, 2013 12:07 AM


SSC-Dedicated


Group: Administrators
Last Login: Today @ 3:03 PM
Points: 31,276, Visits: 15,728
Comments posted to this topic are about the item Data Vision






Follow me on Twitter: @way0utwest

Forum Etiquette: How to post data/code on a forum to get the best help
Post #1462468
Posted Wednesday, June 12, 2013 2:20 AM


SSCertifiable


Group: General Forum Members
Last Login: Today @ 12:23 PM
Points: 5,734, Visits: 3,644
Do we not do that already to at least some rudimentary level?

For example, we have burn charts in Scrum that can be calculated automatically and generate emails when thresholds are reached, MTTF (Mean Time To Failure) analysis built into support systems, and even event log analysers.

I know that these are simpler than, perhaps, what Steve is suggesting, but I think that, even at the simplest of levels, we are already on this path. My interpretation of what Steve is saying is that we are marching down this road, could be further down it than many of us realise, and should each consider taking us further.


Gaz

-- Stop your grinnin' and drop your linen...they're everywhere!!!
Post #1462497
Posted Wednesday, June 12, 2013 2:51 AM
Old Hand


Group: General Forum Members
Last Login: 2 days ago @ 5:51 AM
Points: 328, Visits: 2,001
I think the use of machine learning in ETL systems needs to be a bit more subtle than formatting errors. Formatting and data errors are easy enough to pick up and should be coded into the ETL process by anyone who knows what they're doing.

Other errors can be much more difficult to predict, and it is for these that the system should be trained to look: data which deviates from the "normal". For a data warehouse load, examples might be:
- Unusually high volumes of dimension updates
- Historic sales appearing in the feed when yesterday's sales were expected
- All customer names are suddenly a maximum of 5 characters long
- Sales codes have jumped from "FGH000342" to "FGH010453"
- Average sales values are much higher/lower than normal

Any of the above cases should raise a red flag and cause the ETL process to stop immediately, before any damage is done to the contents of the data warehouse, yet none of these examples would necessarily cause an error due to an incorrect format or data type.

By data mining the incoming data initially, you can build a "yardstick" by which to measure new sets of incoming data automatically.
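The yardstick idea can be sketched in a few lines. This is only a hypothetical illustration (the metric names, feed values, and the three-sigma threshold are all made up, not from any particular product): mine historical loads for per-metric means and standard deviations, then flag any new batch whose metrics drift too far from that baseline.

```python
import statistics

def build_baseline(history):
    """Compute (mean, stdev) for each tracked metric across
    historical loads, given a list of {metric: value} dicts."""
    metrics = history[0].keys()
    return {
        m: (statistics.mean([h[m] for h in history]),
            statistics.stdev([h[m] for h in history]))
        for m in metrics
    }

def flag_anomalies(batch, baseline, threshold=3.0):
    """Return the metrics whose value deviates from the baseline
    mean by more than `threshold` standard deviations."""
    flags = []
    for metric, value in batch.items():
        mean, stdev = baseline[metric]
        if stdev > 0 and abs(value - mean) / stdev > threshold:
            flags.append(metric)
    return flags

# Historical loads: row counts and average sale value per feed.
history = [
    {"rows": 10000, "avg_sale": 49.50},
    {"rows": 10250, "avg_sale": 51.20},
    {"rows": 9900,  "avg_sale": 50.10},
    {"rows": 10100, "avg_sale": 48.90},
]
baseline = build_baseline(history)

# Today's feed: average sale value has jumped well outside the norm.
today = {"rows": 10050, "avg_sale": 175.00}
print(flag_anomalies(today, baseline))  # prints ['avg_sale']
```

In a real warehouse load the same pattern would extend to dimension-update volumes, maximum string lengths, code ranges and so on, with the ETL process halting (rather than just printing) when anything is flagged.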
Post #1462505
Posted Wednesday, June 12, 2013 6:49 AM
SSC-Addicted


Group: General Forum Members
Last Login: Monday, November 10, 2014 7:42 AM
Points: 492, Visits: 814
Or, we could do what we are already supposed to do, and check the data being entered before saving it.

Your idea is a good one, but the example just struck me wrong. It is kind of like passing more gun control laws when we already have far too many in place: passing a new law doesn't fix the root issue, nor does coming up with a workaround for poor programming. The programmers who are too lazy to check the data they accept are the same ones who won't use your idea correctly either!

Not sure why, but I sure seem to be stuck in a root cause analysis rut at this point!


Dave
Post #1462602
Posted Wednesday, June 12, 2013 7:31 AM
SSC-Enthusiastic


Group: General Forum Members
Last Login: Wednesday, October 22, 2014 10:33 AM
Points: 140, Visits: 260
The emerging data science field is already making this a reality. There are very well-known statistical principles and methods that can be employed to analyze just about anything.

These can be used not only to analyze the incoming data for errors, but also various facets of the ETL process itself, looking for potential issues and alerting the appropriate people. It could be as simple as monitoring the mean processing time weighted by the amount of data processed (or other factors) to indicate possible performance issues. Time-series analysis could be used to detect and troubleshoot those troublesome "intermittent" ETL problems that just never seem to show up when someone from the IT department is looking. A time-series graph of the process could be compared against other data about the network, server, other infrastructure or even non-IT data for possible correlations to investigate further (which may even help in performing an RCA).
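As a rough sketch of the weighted-processing-time idea (all class names, numbers, and thresholds here are hypothetical, not from any real monitoring tool): normalise each run's duration by the rows processed, keep a rolling window of recent values, and flag any run that falls outside a simple control band.

```python
from collections import deque

class ThroughputMonitor:
    """Track ETL run duration normalised by rows processed and
    flag runs that fall outside a simple control band."""

    def __init__(self, window=20, threshold=3.0):
        self.samples = deque(maxlen=window)  # recent seconds-per-row values
        self.threshold = threshold

    def record(self, duration_s, rows):
        """Record one run; return True if it looks anomalous."""
        per_row = duration_s / rows
        anomalous = False
        if len(self.samples) >= 5:  # need some history before judging
            mean = sum(self.samples) / len(self.samples)
            var = sum((s - mean) ** 2 for s in self.samples) / len(self.samples)
            std = var ** 0.5
            if std > 0 and abs(per_row - mean) > self.threshold * std:
                anomalous = True
        self.samples.append(per_row)
        return anomalous

monitor = ThroughputMonitor()
# Normal runs: roughly 1 ms per row with slight jitter.
for d, r in [(10.0, 10000), (10.4, 10000), (9.8, 10000),
             (10.1, 10000), (10.2, 10000), (9.9, 10000)]:
    monitor.record(d, r)

# A run taking three times as long per row should be flagged.
print(monitor.record(30.0, 10000))  # True
```

A production version would persist the window between runs and alert rather than return a flag, but the statistical core, a baseline plus a deviation test, is the same.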

Point being: we very often get into the "if it's not throwing an error, don't fix it" mode (which has a great deal of wisdom to it, for certain), but there is also a place and time to proactively evaluate our systems and look at ways things could improve.


____________
Just my $0.02 from over here in the cheap seats of the peanut gallery - please adjust for inflation and/or your local currency.
Post #1462627
Posted Wednesday, June 12, 2013 7:38 AM
SSC-Addicted


Group: General Forum Members
Last Login: Monday, November 10, 2014 7:42 AM
Points: 492, Visits: 814
lshanahan (6/12/2013)
Point being: ...but there is also a place and time to proactively evaluate our systems and look at ways things could improve.


Which is exactly my point. Too many people in this field belong elsewhere; maybe they can be politicians or something equally useless!

Frequently I get told I am too picky about doing things right, to which I respond "that is what I am paid to do!" We need to follow proper testing methodologies PRIOR TO RELEASE, not wait for our customers to find our stupidity or errors. We all make mistakes, and may never catch all our errors in testing, but we damn sure better catch our stupidity. Not checking the input data before processing it is just stupid. We all know it can lead to all kinds of problems, not just incorrect data.


Dave
Post #1462630
Posted Wednesday, June 12, 2013 7:54 AM
SSC-Enthusiastic


Group: General Forum Members
Last Login: Wednesday, October 22, 2014 10:33 AM
Points: 140, Visits: 260
We all make mistakes, and may never catch all our errors in testing, but we damn sure better catch our stupidity. Not checking the input data before processing it is just stupid. We all know it can lead to all kinds of problems, not just incorrect data.


Agreed, though I was going in a slightly different direction. I was also thinking of situations where statistical models and predictive analytics may help catch not only errors (and the stupid stuff) but also alert us to potential issues that are very hard to monitor and don't involve errors per se, so action can be taken ahead of time to minimize or even completely avoid problems.


____________
Just my $0.02 from over here in the cheap seats of the peanut gallery - please adjust for inflation and/or your local currency.
Post #1462645
Posted Wednesday, June 12, 2013 8:49 AM


SSC-Dedicated


Group: Administrators
Last Login: Today @ 3:03 PM
Points: 31,276, Visits: 15,728
lshanahan (6/12/2013)
We all make mistakes, and may never catch all our errors in testing, but we damn sure better catch our stupidity. Not checking the input data before processing it is just stupid. We all know it can lead to all kinds of problems, not just incorrect data.


Agreed, though I was going in a slightly different direction. I was also thinking of situations where statistical models and predictive analytics may help catch not only errors (and the stupid stuff) but also alert us to potential issues that are very hard to monitor and don't involve errors per se, so action can be taken ahead of time to minimize or even completely avoid problems.


I agree with that, but I'm also thinking we might find that tools can learn to suggest better options as well, such as pointing out new features in the current version when editing an older package, or noting that the volume of data through an SCD (Slowly Changing Dimension) task is too large and that another approach would be faster.







Follow me on Twitter: @way0utwest

Forum Etiquette: How to post data/code on a forum to get the best help
Post #1462672
Posted Wednesday, June 12, 2013 8:51 AM
SSC-Addicted


Group: General Forum Members
Last Login: Monday, November 10, 2014 7:42 AM
Points: 492, Visits: 814
djackson 22568 (6/12/2013)
lshanahan (6/12/2013)
Point being: ...but there is also a place and time to proactively evaluate our systems and look at ways things could improve.


Which is exactly my point. Too many people in this field belong elsewhere, maybe they can be politicians or something equally useless!

Frequently I get told I am too picky about doing things right, to which I respond "that is what I am paid to do!" We need to follow proper testing methodologies PRIOR TO RELEASE, not wait for our customers to find our stupidity or errors. We all make mistakes, and may never catch all our errors in testing, but we damn sure better catch our stupidity. Not checking the input data before processing it is just stupid. We all know it can lead to all kinds of problems, not just incorrect data.


That is a good goal to have. I believe our industry suffers mostly from poor design and testing, and that we really need to worry about those things, instead of focusing on "see, my application is so much better than my competitors, so please pay me my commission!" Marketing and sales want profit, and seem to not care a whit about quality. Quality drives profits, though, and maybe one day that will become apparent.


Dave
Post #1462677
Posted Wednesday, June 12, 2013 8:53 AM
Valued Member


Group: General Forum Members
Last Login: Thursday, November 13, 2014 11:33 AM
Points: 66, Visits: 326
Good topic! I agree that creating systems to identify anomalous data is a great idea, and it's actually something that just about everyone reading this blog is already familiar with on some level. Any alert system that flags erroneous data or failed systems falls generally under this umbrella, and with larger and more sophisticated data systems, we can create algorithms to flag more subtle irregularities.

While a good visualization can allow humans to see in seconds irregularities that would take hours of poring over raw data to spot, when it comes to automating this, it may be helpful to remember that visualizations are simply something we humans create to leverage our visual skills. Computers don't necessarily need the actual visualization to spot the pattern, just some clever algorithms -- algorithms that may or may not lend themselves to revealing visualizations. We may be able to come up with algorithms that can identify anomalous data we'd be hard pressed to visualize, if we free ourselves from the constraint of "visualization".

In the meantime, let's hear it for all of those folks who have found ways to visualize large amounts of data in simple form so that users can rapidly spot anomalies. And a "better late than never" blank square of participation award to MS for providing the PowerView visualization tool set.
Post #1462680