Problems with Big Data

  • I wrote a Big Data article for Simple-Talk in a similar vein.

    Basically: what are the business requirements, and how is it going to make money?

    Two of the traditional 3Vs of Big Data (Volume, Velocity) are moving targets.

    Teradata gets its name because the goal was to be capable of processing 1TB of data back in 1979!

    The first actual live 1TB Teradata installation was for Walmart in 1992. As a reference point, the HP3000 Series 70 mini-computer I worked on had 670MB of disk and 8MB of RAM. Teradata WAS Big Data.

    These days an iPod Nano is massively more powerful. A mid-range desktop has 2,000x the RAM.

    The only one of the Vs that presents a constant challenge is "Variety". There are over 50,000 file formats out there containing data, and that is before you have to face the challenge of file layouts.

  • I find that, like most things, it is not the tools; it is how they are used.

    A good mechanic with basic tools can do great things, while someone with less understanding of the problem and needs (aka a beginner, novice, overconfident sophomore, etc.) will be limited in their success regardless of the amazing tools they have.

    Tools are great, but as Steve and others have pointed out, one must understand the problems.

    YouTube has many videos on how to set up a Hadoop cluster in 30 minutes.

    And with the ease that just about anyone can install SQL Server, we know that being able to install the tools is far different from being able to use them wisely.

    I watched a webinar on how Microsoft has worked with Hortonworks to integrate Hadoop with the Parallel Data Warehouse (PDW). I believe the feature was called PolyBase. Very nice.

    I hope we can see more of PolyBase, and have access to it for those of us who do not have the resources to acquire PDW.

    The more you are prepared, the less you need it.

    Big amounts of garbage data are still garbage. Throwing more data at a flawed analysis does not make it better. With enough data you can draw all sorts of correlations that are absolutely meaningless.

    You need to understand the problem first, then understand the weaknesses and reliability issues in all your sources. When you finally do get a result, you need to back-test it against independent data sets to see if it still holds up (see the sketch after this post).

    Unfortunately many managers (as well as some IT people) are so excited by the prospect of magical extraction that they fail to take a hard, critical look at their processes.

    ...

    -- FORTRAN manual for Xerox Computers --
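
    To make that concrete, here is a minimal sketch with purely synthetic data (plain NumPy, nothing to do with any real data set): screen enough unrelated series against a target and one of them will look impressively correlated on the data you fitted it to, and a back-test on an independent window shows the "signal" for what it is.

    import numpy as np

    # 5,000 candidate series that have nothing whatsoever to do with the target.
    rng = np.random.default_rng(42)
    n_days, n_candidates = 200, 5000
    target = rng.normal(size=n_days)
    candidates = rng.normal(size=(n_candidates, n_days))

    # First half is the data we "mine"; second half is the independent back-test.
    train, test = slice(0, 100), slice(100, 200)

    # Pick the candidate that best "explains" the target in-sample.
    train_corr = np.array([np.corrcoef(c[train], target[train])[0, 1] for c in candidates])
    best = int(np.argmax(np.abs(train_corr)))

    # Back-test the winner on data it has never seen.
    test_corr = np.corrcoef(candidates[best][test], target[test])[0, 1]

    print(f"best in-sample correlation : {train_corr[best]:+.2f}")  # looks impressive
    print(f"same series, out of sample : {test_corr:+.2f}")         # no longer impressive

    With that many candidates and only 100 in-sample points, the winning correlation typically lands somewhere around 0.4, which would look like a finding; on the hold-out window it collapses toward zero.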

  • I don't know, it still walks and quacks like a duck to me. Sure, everybody's got more data and guess what? There's valuable information in there! I think we used to call it Data Mining. We had to validate our models then too. I'm failing to see what has fundamentally changed. If the term 'Big Data' gets the people who sign the checks to be more likely to invest in harvesting their data assets then I think that's a good end result. From a career perspective as a DBA, you don't want to find yourself going from managing small databases to big ones and not be prepared. IMO, it's a whole different ball game.

  • Steve, you said in your editorial:

    "Researchers need to be willing to evolve their algorithms as they learn more about a problem. Probably they should also assume their algorithms are not correct until they've proven their ability to predict actual trends for some period of time. "

    I agree strongly with this. The CIA has put this into practice, pitting their own analysts using classified information against various other scientifically tracked methods of probability analysis. You can read about one of these projects here: http://www.npr.org/blogs/parallels/2014/04/02/297839429/-so-you-think-youre-smarter-than-a-cia-agent

    This is a years-long project being tested for reproducible results over time using different groups of analysts. It is hard to convince a company needing a stock boost right now to invest the time and money in scientific accountability. But the long game seems to be paying off handsomely for the CIA.

  • ahperez (4/9/2014) wrote the post above.

    I read that. It's interesting to see the wisdom of the crowd doing so well.

  • For me, Big Data is slightly different. The name refers to a larger quantity of data, but larger than what? Basically, in my opinion, Big Data is the pooling of data from multiple, even numerous, sources. What counts as Big Data for one organisation may be a smaller quantity than another was already handling before the term existed. So Big Data is more data, but only relative to one's own prior experience, and it comes from a wider range of sources.

    or

    Big Data = Big Bucks!!! 😛

    Gaz

    -- Stop your grinnin' and drop your linen...they're everywhere!!!

  • At SQL Saturday 279 I felt like I finally understood Big Data. Carlos Bossy presented in a clear fashion that cut through the hype to show us what Big Data is, what it can do well and what it can't.

    Here is my summary of Carlos's presentation (http://carlosbossy.wordpress.com/downloads/):

    Big Data = Large volumes, Complex and unstructured

    The secret to getting anything useful from Big Data is MapReduce.

    You, as the developer, must write a Map function: this is where you map the unstructured data into a structure. Your function must parse through the unstructured data, deciding which data to include and then imposing a structure on it. So it is kind of like a parsing function plus the SELECT, FROM and WHERE clauses.

    Then you must write a Reduce function, where you group and aggregate your data (see the sketch at the end of this post).

    What makes this possible is the architecture of Hadoop: HDFS blocks of data are replicated to local storage on three nodes (which also eliminates the need for backups), and the work runs in parallel across those nodes.

    But this architecture also means that Hadoop is slow at running a query compared to standard SQL Server queries against structured data. Things that SQL Server can query in seconds can take Hadoop minutes. On the other hand, queries against really large unstructured data sets that would take standard SQL Server days can be answered by Hadoop in hours or even minutes.
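
    As a concrete illustration of those Map and Reduce steps, here is a minimal sketch in Python, written so it can be run locally. It is my own toy example, not from Carlos's talk: the log lines and their pipe-delimited layout are made up, and on a real cluster the same two functions would typically be wired up through Hadoop Streaming, with Hadoop doing the shuffle/sort between them.

    from itertools import groupby
    from operator import itemgetter

    RAW_LOG_LINES = [                      # stand-in for unstructured HDFS blocks
        "2014-04-09 10:01:03|/products/42|200|0.31",
        "2014-04-09 10:01:04|/checkout|500|1.90",
        "not a log line at all",           # garbage the mapper must tolerate
        "2014-04-09 10:01:07|/products/42|200|0.12",
    ]

    def map_fn(line):
        # Impose structure on one raw line and decide whether to keep it
        # (the "parse plus SELECT/WHERE" part). Yields (key, value) pairs.
        parts = line.split("|")
        if len(parts) != 4:                # skip anything that does not parse
            return
        _ts, url, status, seconds = parts
        if status == "200":                # WHERE clause: successful requests only
            yield url, float(seconds)

    def reduce_fn(url, durations):
        # Group and aggregate the mapper output (the "GROUP BY" part).
        durations = list(durations)
        return url, len(durations), sum(durations) / len(durations)

    # Local stand-in for the shuffle/sort Hadoop performs between map and reduce.
    mapped = sorted(kv for line in RAW_LOG_LINES for kv in map_fn(line))
    for url, group in groupby(mapped, key=itemgetter(0)):
        url, hits, avg_seconds = reduce_fn(url, (v for _, v in group))
        print(f"{url}\t{hits} hits\tavg {avg_seconds:.2f}s")

    The split is exactly what Carlos described: the Map side is where structure gets imposed and rows get filtered, the Reduce side is where the grouping and aggregation happen, and Hadoop's job is to run many copies of each against the replicated blocks in parallel.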
