• There are a number of flaws in the starry-eyed optimism of that article.

    Of course there are the serious privacy threats (as addressed above). Australia has recently been working on a project to release anonymized medical data as open data. After the first trial batch was posted, researchers showed how easy it was to de-anonymize it with a few cross-checks against other sources.
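    To make that concrete, here is a toy sketch of a linkage attack in Python. Every name, column and record below is invented, but the join on quasi-identifiers (birth year, postcode, sex) is essentially how those cross-checks work:

        # Toy linkage attack: re-identify "anonymized" records by joining on
        # quasi-identifiers. All data and column names here are invented.
        anonymized = [  # names stripped, quasi-identifiers retained
            {"birth_year": 1978, "postcode": "2600", "sex": "F", "diagnosis": "asthma"},
            {"birth_year": 1991, "postcode": "3141", "sex": "M", "diagnosis": "diabetes"},
        ]
        public = [      # e.g. an electoral roll, social media, news reports
            {"name": "A. Citizen", "birth_year": 1978, "postcode": "2600", "sex": "F"},
            {"name": "B. Someone", "birth_year": 1991, "postcode": "3141", "sex": "M"},
        ]

        key = lambda r: (r["birth_year"], r["postcode"], r["sex"])
        lookup = {key(r): r["name"] for r in public}
        for record in anonymized:
            name = lookup.get(key(record))
            if name:  # a unique combination of quasi-identifiers is enough
                print(f"{name} -> {record['diagnosis']}")

    No decryption, no hacking: just a join on fields the dataset never treated as sensitive.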

    But it goes beyond that. The article assumes the data will be of consistent quality and known origin ... not likely the case for data gathered and 'given away' by a variety of sources, sources that have a vested interest in withholding some of their best data. Even small biases in how the data is sourced can have significant effects on statistical validity. Good research requires that all of those effects be accounted for and balanced before conclusions are drawn.
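    A throwaway simulation of that point (all numbers made up): if one sub-population is over-represented in the feed, the estimate stays wrong no matter how much data you pile on.

        import random

        random.seed(0)

        # Two sub-populations with different true means (values are invented).
        pop_a = lambda: random.gauss(50, 10)   # 30% of the real population
        pop_b = lambda: random.gauss(70, 10)   # 70% of the real population
        true_mean = 0.3 * 50 + 0.7 * 70        # = 64.0

        def sample_mean(n, share_a):
            """Draw n points where sub-population A makes up share_a of the sample."""
            xs = [pop_a() if random.random() < share_a else pop_b() for _ in range(n)]
            return sum(xs) / n

        for n in (1_000, 1_000_000):
            unbiased = sample_mean(n, share_a=0.3)   # sources match the population
            biased   = sample_mean(n, share_a=0.6)   # one source over-contributes A
            print(f"n={n:>9,}  unbiased={unbiased:5.1f}  biased={biased:5.1f}  true={true_mean}")

    The biased estimate sits about six points off the true mean at a thousand rows and at a million rows alike. Volume does not cure bias.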

    There is NO magic in crazy large data. Data size has an asymptotic effect on results: past a certain point, piling on more data does not necessarily yield more insight, and it may well introduce confounding factors.
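    For the simplest possible statistic, the mean, the error bar shrinks roughly as 1/sqrt(n), so each tenfold jump in data buys only about a threefold improvement. A back-of-the-envelope sketch (the standard deviation here is an arbitrary placeholder):

        import math

        sigma = 15.0                       # assumed population standard deviation
        for n in (1_000, 10_000, 100_000, 1_000_000, 10_000_000):
            se = sigma / math.sqrt(n)      # standard error of the sample mean
            print(f"n={n:>12,}  standard error ~ {se:.4f}")

    Going from one million rows to ten million barely moves the needle, while every extra source is another chance to drag in bias.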

    What we call machine learning (or, even more incorrectly, 'artificial intelligence') is essentially statistical pattern matching. It does not 'know' whether the sunrise causes the rooster to crow or the rooster causes the sunrise. It has no notion of forming a theory and then devising tests to confirm or refute it. Machine learning cannot look at a correlation and conclude that there must be an additional factor because the result does not 'make sense' under its current interpretation.
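    The rooster in miniature, with entirely synthetic data: time of day drives both events, and the correlation comes out near-perfect no matter which way you try to read the causation.

        import random

        random.seed(1)

        # Synthetic mornings: being near dawn drives both events; neither causes the other.
        mornings = []
        for _ in range(10_000):
            near_dawn = random.random() < 0.5
            rooster_crows = 1 if near_dawn and random.random() < 0.95 else 0
            sun_rises     = 1 if near_dawn else 0
            mornings.append((rooster_crows, sun_rises))

        def correlation(pairs):
            n = len(pairs)
            mx = sum(x for x, _ in pairs) / n
            my = sum(y for _, y in pairs) / n
            cov = sum((x - mx) * (y - my) for x, y in pairs) / n
            sx = (sum((x - mx) ** 2 for x, _ in pairs) / n) ** 0.5
            sy = (sum((y - my) ** 2 for _, y in pairs) / n) ** 0.5
            return cov / (sx * sy)

        print(f"corr(rooster, sunrise) = {correlation(mornings):.2f}")

    A pattern matcher sees a strong association, roughly 0.95 here; the causal story, and the hidden third factor, are simply not in the numbers.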

    We look at some of the successes statistical analysis has had, and some extrapolate that into a vast future... with 10x the data we'll have 10x the knowledge.
    It doesn't work that way.

    ...

    -- FORTRAN manual for Xerox Computers --