• Phil Factor (7/31/2013)


    But we also need to look very closely at what we have and determine whether the tools we used yesterday can manage horizontally scaled data correctly, with a large enough sample to do basic analysis, and still meet the demands of investigations where every element matching certain criteria is required to be presented. If the tools of yesterday cannot do it, and those being developed today will also fall short, it might be good to get involved, as you have, in trying to scope and define the tools of the future.

    With 'Big Data', it isn't the quantity of data, it is the way you deal with it. After all, Nate Silver's spectacular predictions of the result of the US election were done on a spreadsheet. The first 'big data' applications I came across were in analyzing the test data for automobiles, in the days of Sybase and DEC. The trick is that, once you've extracted the 'juice' from the raw data, you archive it if you can or need to, or else throw it away. You usually don't let it anywhere near the database doing the analysis. Think hierarchically. Nowadays we have StreamInsight and Hadoop to do the low-level drudgery for us. Sure, it is easier, but these techniques were developed in the eighties, when engineering industries were awash with test data and had to develop ways of dealing with it.
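    As a rough illustration of that 'extract the juice, then archive or discard' workflow, here is a minimal sketch. The file names, column names, and grouping keys are invented for illustration; the point is only that the compact summary, not the raw readings, is what goes anywhere near the analysis database.

    # Hypothetical sketch: reduce raw automobile test readings to per-test
    # summaries before loading anything into the analysis database.
    import csv
    import statistics
    from collections import defaultdict

    readings = defaultdict(list)

    # Raw data: one row per sensor reading, e.g. test_id,sensor,value
    with open("raw_test_readings.csv", newline="") as f:
        for row in csv.DictReader(f):
            readings[(row["test_id"], row["sensor"])].append(float(row["value"]))

    # The 'juice': compact per-test, per-sensor aggregates.
    with open("test_summaries.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["test_id", "sensor", "n", "mean", "stdev", "min", "max"])
        for (test_id, sensor), values in readings.items():
            writer.writerow([
                test_id, sensor, len(values),
                statistics.mean(values),
                statistics.stdev(values) if len(values) > 1 else 0.0,
                min(values), max(values),
            ])

    # The raw file can now be archived or thrown away; only the summary is kept.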

    What Silver did is technically a meta-analysis, which in plain English means an analysis of other analyses. Drug companies have been pushing such "analysis" for some time as a way to get approval when a bunch of trials miss their endpoints. In some cases, the aggregated data can be massaged into success.

    In Silver's case, he took existing *sample* survey data, added his own weights (based upon his political acuity), and spit out a new number. There was nothing big about his data.
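    The arithmetic behind that kind of pooling is just a weighted average of other people's sample estimates. A minimal sketch, with invented poll numbers and weights (Silver's actual model is considerably more elaborate, but the shape is the same):

    # Hypothetical poll aggregation: pool existing sample-survey results with
    # subjective weights. Shares and weights are invented for illustration.

    # (candidate_share, weight) -- a weight might reflect sample size, recency,
    # or the aggregator's judgment about the pollster.
    polls = [
        (0.52, 3.0),
        (0.49, 1.5),
        (0.51, 2.0),
    ]

    weighted_estimate = sum(share * w for share, w in polls) / sum(w for _, w in polls)
    print(f"pooled estimate: {weighted_estimate:.3f}")  # -> pooled estimate: 0.510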

    Big data, whether commercial, political, or military, is an effort to find some needle in some haystack. The NSA's vacuuming of communications, which it has been doing since its inception, is just the latest public airing. The big data practitioners argue that the speed and volume of the data make separating the wheat from the chaff with a legacy RDBMS impractical. They also argue that transactional guarantees aren't meaningful, so let's all use some cobbled-together file spec. And so it was.

    On the stats side, there's a similar conflict. Big data is census data, and thus the source of merely descriptive statistics (which aren't technically statistics in the first place). All that inferential machinery (frequentist or Bayesian) doesn't matter much if you're not sampling, or endlessly re-sampling for the Bayesians.
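    To make the census-versus-sample point concrete, here is a toy sketch with invented data: when you hold the whole population, the mean is simply a description and there is nothing to infer; a sample is what forces the standard errors and confidence intervals into the picture.

    # Toy contrast between descriptive (census) and inferential (sample) statistics.
    import math
    import random
    import statistics

    random.seed(0)
    population = [random.gauss(100, 15) for _ in range(100_000)]  # the "census"

    # Descriptive: with every record in hand, the mean just is the mean.
    print(f"population mean: {statistics.mean(population):.2f}")

    # Inferential: a sample forces you to quantify uncertainty about that mean.
    sample = random.sample(population, 400)
    m = statistics.mean(sample)
    se = statistics.stdev(sample) / math.sqrt(len(sample))
    print(f"sample mean: {m:.2f} +/- {1.96 * se:.2f} (approx. 95% CI)")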

    So, spend gobs of money looking for a needle or two. Sometimes the expected value of the needle is worth the cost. Mostly, not so much. But lemmings will be lemmings.