Data Science Sanity Checks


Phil Factor
Comments posted to this topic are about the item Data Science Sanity Checks


Best wishes,

Phil Factor
Simple Talk
Steph Locke
This was a very thought-provoking editorial. I'm still cogitating, but immediately...

I see the value of having someone whose job it is to check these things out and know the gritty details of the data. I think this is a crucial role, but I'd also encourage everyone involved in the production, storage, and use of the data to try to understand it at a fine level.
That being said, I'm not sold on the idea that there can be sufficient resources to identify all the trends emerging within the data and then investigate them before the rest of the business picks up on them. The other option is 'releasing' data to folk only once it has been checked, and that seems like an impediment to work that still couldn't guarantee success.
Robert.Sterbal
Remember that investment management is about risk, not return. In contemplating which opportunities to pursue, you have to understand what risks you are taking, and then acknowledge whether you were right or not.
krowley
The big problem with this is that business is about making money, and sometimes there is money to be made even from bad data. Take the recent hack of the AP's Twitter account. Businesses that reacted quickly could short sell the stock market and make a bundle off this bad data, even though it was absolutely off base and a quick check could have shown that.
dliabenow
Awesome subject, Phil, thanks. I relate it to driving a car: you have to keep your eyes on the road, the mirrors, and the dashboard/control panel (throw in some breakfast and a smartphone too for added difficulty), and there's your business system. There are crazy drivers to deal with (the other drivers, naturally), road hazards, accidents, and so on. It's a constant barrage, and it requires constant vigilance in monitoring input and output to get where you need to go.
jay-h
One of the real things that must be checked when confronted by an anomalous item is: is it an artifact? The way data is collected, the questions asked, the context of its selection can sometimes cause very subtle distortions, not noticeable in the small scale but visible when trying to pull signal out of a lot of noise.

At times (often) the researcher doesn't have access to the context of the original data acquisition and there is plenty of room for serious errors.

...

-- FORTRAN manual for Xerox Computers --
charles.wong
So the whole business should stop seeing crucial data that supports their daily decision-making, and wait for a data scientist to sanitise the data (however long it takes)?

The fact is the end users of those data are the domain experts who can tell what is rogue data and what is real trend better than anybody else, including the data scientist who has generic data knowledge but not necessarily the domain knowledge.

IMHO, we should just give the data to the business, and give them tools that highlight abnormal trends and help them do the analysis. That way you don't stop them seeing the data, but you also help them identify rogue data.
krowley
I agree with Charles. At some point we have to trust our users.
jay-h
charles.wong (5/13/2013)
So the whole business should stop seeing crucial data that supports their daily decision-making, and wait for a data scientist to sanitise the data (however long it takes)?
...


The point was caution, especially with market research and other determinations that don't come down to dollars and cents.

Rogue data and artifacts are not the same thing. Polling organizations (at least the good ones) have learned the pitfalls of categorization. The majority of potential customers may say they prefer product B, but unless you know what A, C, and D were, or whether other options were missing from the list (the old 'have you stopped beating your wife?' conundrum), you don't know what they would actually buy. Even priming questions, that is, seemingly unrelated questions asked before the choice, have been proven to make a big difference in the answers given.

Disastrous business and political decisions have been made by not understanding the data. When dealing with large amounts of data from disparate and uncontrolled sources, the risk is higher. By all means listen to those close to the issue, but remember that everyone, including those close to the issue, can unintentionally bring in their own preferences and biases (remember 'New Coke'?).

...

-- FORTRAN manual for Xerox Computers --
Robert.Sterbal
charles.wong (5/13/2013)
So the whole business should stop seeing crucial data that supports their daily decision-making, and wait for a data scientist to sanitise the data (however long it takes)?

The fact is the end users of those data are the domain experts who can tell what is rogue data and what is real trend better than anybody else, including the data scientist who has generic data knowledge but not necessarily the domain knowledge.

IMHO, we should just give the data to the business, and give them tools that highlight abnormal trends and help them do the analysis. That way you don't stop them seeing the data, but you also help them identify rogue data.


That really expresses where the efforts of data managers should be expended. Thanks for saying it so well.