Flying high on the Big Data hot-air
Posted Tuesday, July 30, 2013 2:57 PM


SSC Journeyman


Group: General Forum Members
Last Login: Wednesday, July 23, 2014 5:49 AM
Points: 76, Visits: 231
chrisn-585491 (7/30/2013)
In terms of implementing (cutting-edge) stats, R has no peer.


The Python combo of pandas and numpy begs to disagree...


I expect they would. Dueling pistols at dawn????
Post #1479173
Posted Tuesday, July 30, 2013 7:43 PM


SSC-Enthusiastic


Group: General Forum Members
Last Login: Tuesday, August 26, 2014 8:56 PM
Points: 178, Visits: 571
Well put, Phil. I've been getting 'Big Data' emails in my inbox for years, and not one of them has had any real practical content in it.
Post #1479253
Posted Wednesday, July 31, 2013 2:36 AM


Mr or Mrs. 500


Group: General Forum Members
Last Login: Today @ 6:30 AM
Points: 587, Visits: 2,524
But we also need to look very closely at what we have and determine whether the tools we used yesterday can manage horizontally scaled data correctly, with a large enough sample to do basic analysis, while also meeting the demands of investigations where every element matching certain criteria is required to be presented. If the tools of yesterday cannot do it and those being developed today will also fall short, it might be good to get involved, as you have, in trying to scope and define the tools of the future.


With 'Big Data', it isn't the quantity of data, it is the way you deal with it. After all, Nate Silver's spectacular predictions of the result of the US election were done on a spreadsheet. The first 'big data' applications I came across were in analyzing the test data for automobiles, in the days of Sybase and DECs. The trick is that, once you've extracted the 'juice' from the raw data, you archive it if you can or need to, or else throw it away. You usually don't let it anywhere near the database doing the analysis. Think hierarchically. Nowadays we have StreamInsight and Hadoop to do the low-level drudgery for us. Sure, it is easier, but these techniques were developed in the eighties, when engineering industries were awash with test data and had to develop ways of dealing with it.
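
As a rough illustration of the "extract the juice, then discard the raw data" idea, here is a minimal Python sketch. It is only an assumption about what such a reduction step might look like, not a description of any particular system: the names `RunningSummary` and `summarise` and the (sensor_id, value) record shape are hypothetical. Each reading is folded into a small per-sensor summary and then dropped, so only the summaries would ever need to reach the analysis database.

```python
import math
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class RunningSummary:
    """Per-sensor summary; the raw readings themselves are never stored."""
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # running sum of squared deviations (Welford's method)
    minimum: float = math.inf
    maximum: float = -math.inf

    def add(self, value: float) -> None:
        # Update count, mean and squared deviations in a single pass.
        self.count += 1
        delta = value - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (value - self.mean)
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def variance(self) -> float:
        # Sample variance; defined as 0.0 when fewer than two readings exist.
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0


def summarise(readings):
    """Consume (sensor_id, value) pairs one at a time and return summaries."""
    summaries = defaultdict(RunningSummary)
    for sensor_id, value in readings:
        summaries[sensor_id].add(value)
    return dict(summaries)


# Illustrative readings only; a real feed would be a file or stream iterator.
result = summarise([("a", 1.0), ("a", 3.0), ("b", 5.0)])
```

Because `summarise` accepts any iterable, the raw feed can be a generator over a file or a stream, and memory use depends on the number of sensors rather than the number of readings.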



Best wishes,

Phil Factor
Simple Talk
Post #1479345
Posted Wednesday, July 31, 2013 7:12 AM
SSC Rookie


Group: General Forum Members
Last Login: Monday, August 4, 2014 11:40 PM
Points: 33, Visits: 82
Nice read! There does seem to be an overwhelming amount of buzz about big data, and I'm skeptical about how much of it is pure hype: the newest thing that sales-driven organisations can use to make a quick buck.

But we shouldn't let the marketers have all the fun :)
Post #1479465
Posted Wednesday, July 31, 2013 7:28 AM


SSC Journeyman


Group: General Forum Members
Last Login: Wednesday, July 23, 2014 5:49 AM
Points: 76, Visits: 231
Phil Factor (7/31/2013)
But we also need to look very closely at what we have and determine whether the tools we used yesterday can manage horizontally scaled data correctly, with a large enough sample to do basic analysis, while also meeting the demands of investigations where every element matching certain criteria is required to be presented. If the tools of yesterday cannot do it and those being developed today will also fall short, it might be good to get involved, as you have, in trying to scope and define the tools of the future.


With 'Big Data', it isn't the quantity of data, it is the way you deal with it. After all, Nate Silver's spectacular predictions of the result of the US election were done on a spreadsheet. The first 'big data' applications I came across were in analyzing the test data for automobiles, in the days of Sybase and DECs. The trick is that, once you've extracted the 'juice' from the raw data, you archive it if you can or need to, or else throw it away. You usually don't let it anywhere near the database doing the analysis. Think hierarchically. Nowadays we have StreamInsight and Hadoop to do the low-level drudgery for us. Sure, it is easier, but these techniques were developed in the eighties, when engineering industries were awash with test data and had to develop ways of dealing with it.


What Silver did is technically a meta-analysis, which in English means an analysis of other analyses. Drug companies have been pushing such "analysis" for some time, as a way to get approval when a bunch of trials miss their endpoints. In some cases, the aggregated data can be massaged into success.

In Silver's case, he took existing *sample* survey data, added his own weights (based upon his political acuity), and spit out a new number. There was nothing big about his data.
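
To make "took existing sample survey data, added his own weights, and spit out a new number" concrete, here is a minimal sketch of a weighted poll average. It is not a reconstruction of Silver's actual model; the function name `weighted_average`, the poll shares, and the weights are all made up for illustration.

```python
def weighted_average(polls):
    """Combine poll results into one number.

    polls: iterable of (candidate_share_percent, weight) pairs, where the
    weight expresses how much the analyst trusts that poll.
    """
    polls = list(polls)  # allow generators; the data is read twice below
    total_weight = sum(weight for _, weight in polls)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(share * weight for share, weight in polls) / total_weight


# Hypothetical polls: (share for one candidate in percent, analyst's weight).
example_polls = [(52.0, 1.0), (49.0, 0.5), (51.0, 2.0)]
combined = weighted_average(example_polls)
```

Three rows and a handful of multiplications fit comfortably in a spreadsheet, which is the point being made: the input is a small set of sample surveys, not a large volume of raw data.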

Big data, whether commercial, political, or military, is an effort to find some needle in some haystack. The NSA vacuuming of communications, which it's been doing from its inception, is just the latest public airing. The big data practitioners argue that the speed and volume of data make separating the wheat from the chaff with a legacy RDBMS impractical. They also argue that transactional guarantees aren't meaningful, so let's all use some cobbled-together file spec. And so it was.

On the stats side, similar conflict. Big data is census data, and thus the source of merely descriptive statistics (which aren't technically statistics in the first place). All that inferential machinery (frequentist or Bayes) doesn't matter much if you're not sampling, or endlessly re-sampling for the Bayesians.
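
One way to see why the inferential machinery matters less when you hold the whole population is the textbook finite population correction for the standard error of a mean under sampling without replacement. The sketch below is only an illustration; the function name `standard_error` and the numbers are assumptions, not anything taken from a real data set.

```python
import math


def standard_error(sample_sd, n, population_size):
    """Standard error of a sample mean, with finite population correction.

    Assumes simple random sampling without replacement of n units from a
    population of population_size units.
    """
    if not 1 <= n <= population_size:
        raise ValueError("n must be between 1 and population_size")
    if population_size == 1:
        return 0.0  # a single-unit population has no sampling error
    # The correction shrinks towards 0 as the sample approaches a census.
    fpc = math.sqrt((population_size - n) / (population_size - 1))
    return sample_sd / math.sqrt(n) * fpc


# Illustrative numbers only.
small_sample = standard_error(10.0, 100, 1_000_000)      # just under 1.0
full_census = standard_error(10.0, 1_000_000, 1_000_000)  # exactly 0.0
```

With a small sample the correction is negligible and the usual uncertainty applies; with a full census the sampling error vanishes and what remains is description rather than inference.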

So, spend gobs of money looking for a needle or two. Sometimes the expected value of the needle is worth the cost. Mostly, not so much. But lemmings will be lemmings.
Post #1479477