Flying high on the Big Data hot-air


RobertYoung
chrisn-585491 (7/30/2013)
In terms of implementing (cutting-edge) stats, R has no peer.


The Python combo of pandas and numpy begs to disagree...


I expect they would. Dueling pistols at dawn?
nick.mcdermaid
Well put, Phil. I've been getting 'Big Data' emails in my inbox for years and not one of these emails actually had any real practical content in them.
Phil Factor
But we also need to look very closely at what we have and determine whether the tools we used yesterday can manage horizontally scaled data correctly, with a large enough sample to do basic analysis, while also meeting the demands of investigations where every element matching certain criteria is required to be presented. If the tools of yesterday cannot do it and those being developed today will also fall short, it might be good to get involved, as you have, in trying to scope and define the tools of the future.


With 'Big Data', it isn't the quantity of data, it is the way you deal with it. After all, Nate Silver's spectacular predictions of the result of the US election were done on a spreadsheet. The first 'big data' applications I came across were in analyzing the test data for automobiles, in the days of Sybase and DECs. The trick is that, once you've extracted the 'juice' from the raw data, you archive it if you can or need to, or else throw it away. You usually don't let it anywhere near the database doing the analysis. Think hierarchically. Nowadays we have StreamInsight and Hadoop to do the low-level drudgery for us. Sure, it is easier, but these techniques were developed in the eighties, when engineering industries were awash with test data and had to develop ways of dealing with it.
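
To make the 'extract the juice, then discard' idea concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the `RunSummary` fields, the `summarise` helper and the sample readings are all hypothetical, not taken from any real test rig. Raw readings are consumed once as a stream, and only a small per-run summary is kept for the analysis side.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass
class RunSummary:
    """Compact per-run summary kept for analysis (hypothetical schema)."""
    count: int = 0
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def add(self, value: float) -> None:
        # Fold one raw reading into the summary; the reading itself is not stored.
        self.count += 1
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)

    @property
    def mean(self) -> float:
        return self.total / self.count


def summarise(readings: Iterable[Tuple[str, float]]) -> Dict[str, RunSummary]:
    """Reduce a stream of (run_id, value) readings to one summary per run."""
    summaries: Dict[str, RunSummary] = {}
    for run_id, value in readings:
        summaries.setdefault(run_id, RunSummary()).add(value)
    return summaries


# Made-up readings standing in for a raw feed that would be archived or dropped.
raw = [("run-1", 10.0), ("run-1", 14.0), ("run-2", 7.5), ("run-1", 12.0)]
summaries = summarise(iter(raw))

for run_id, s in sorted(summaries.items()):
    print(run_id, s.count, s.minimum, s.maximum, s.mean)
```

Only the contents of `summaries` would go anywhere near the analysis database; the raw feed is read once and then archived or thrown away.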


Best wishes,

Phil Factor
Simple Talk
chris.ross 34852
Nice read! There does seem to be an overwhelming amount of buzz about big data, and I'm skeptical about how much of it is pure hype: the newest thing for sales-driven organisations to make a quick buck on.

But we shouldn't let the marketers have all the fun :)
RobertYoung
Phil Factor (7/31/2013)
But we also need to look very closely at what we have and determine whether the tools we used yesterday can manage horizontally scaled data correctly, with a large enough sample to do basic analysis, while also meeting the demands of investigations where every element matching certain criteria is required to be presented. If the tools of yesterday cannot do it and those being developed today will also fall short, it might be good to get involved, as you have, in trying to scope and define the tools of the future.


With 'Big Data', it isn't the quantity of data, it is the way you deal with it. After all, Nate Silver's spectacular predictions of the result of the US election were done on a spreadsheet. The first 'big data' applications I came across were in analyzing the test data for automobiles, in the days of Sybase and DECs. The trick is that, once you've extracted the 'juice' from the raw data, you archive it if you can or need to, or else throw it away. You usually don't let it anywhere near the database doing the analysis. Think hierarchically. Nowadays we have StreamInsight and Hadoop to do the low-level drudgery for us. Sure, it is easier, but these techniques were developed in the eighties, when engineering industries were awash with test data and had to develop ways of dealing with it.


What Silver did is technically a meta-analysis, which in English means an analysis of other analyses. Drug companies have been pushing such "analysis" for some time, as a way to get approval when a bunch of trials miss their endpoints. In some cases, the aggregated data can be massaged into success.

In Silver's case, he took existing *sample* survey data, added his own weights (based upon his political acuity), and spit out a new number. There was nothing big about his data.
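
For what it's worth, the arithmetic behind that kind of aggregation is tiny. The sketch below is a plain weighted average and nothing more; it is not Silver's actual model, and the poll shares, the weights and the `weighted_poll_average` name are all made up for illustration.

```python
from typing import List, Tuple


def weighted_poll_average(polls: List[Tuple[float, float]]) -> float:
    """Combine (share_percent, weight) pairs into one weighted estimate."""
    total_weight = sum(weight for _, weight in polls)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(share * weight for share, weight in polls) / total_weight


# Hypothetical polls: (candidate's share in percent, analyst-assigned weight).
polls = [(52.0, 1.0), (49.0, 0.5), (51.0, 2.0)]
estimate = weighted_poll_average(polls)
print(round(estimate, 2))
```

Three rows in, one number out, which is rather the point: the judgment lives in the weights, not in the volume of data.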

Big data, whether commercial, political, or military, is an effort to find some needle in some haystack. The NSA vacuuming of communications, which it's been doing since its inception, is just the latest public airing. The big data practitioners argue that the speed and volume of data make separating the wheat from the chaff with a legacy RDBMS impractical. They also argue that transactional guarantees aren't meaningful, so let's all use some cobbled-together file spec. And so it was.

On the stats side, similar conflict. Big data is census data, and thus the source of merely descriptive statistics (which aren't technically statistics in the first place). All that inferential machinery (frequentist or Bayes) doesn't matter much if you're not sampling, or endlessly re-sampling for the Bayesians.

So, spend gobs of money looking for a needle or two. Sometimes the expected value of the needle is worth the cost. Mostly, not so much. But lemmings will be lemmings.