SQLServerCentral is supported by Red Gate Software Ltd.
 

Problems with Big Data
Author
Message
Posted Tuesday, April 8, 2014 8:38 PM


SSC-Dedicated


Group: Administrators
Last Login: Today @ 5:59 PM
Points: 31,082, Visits: 15,529
Comments posted to this topic are about the item Problems with Big Data






Follow me on Twitter: @way0utwest

Forum Etiquette: How to post data/code on a forum to get the best help
Post #1559758
Posted Wednesday, April 9, 2014 1:32 AM
SSCrazy


Group: General Forum Members
Last Login: Yesterday @ 3:23 PM
Points: 2,907, Visits: 1,830
I wrote a Big Data article for Simple-Talk in a similar vein.

Basically: what are the business requirements, and how is it going to make money?

Two of the traditional 3Vs of Big Data (Volume, Velocity) are moving targets.

Teradata gets its name because the goal was to be capable of processing 1TB of data back in 1979!
The first actual live 1TB Teradata installation was for Walmart in 1992. As a reference point, the HP3000 Series 70 minicomputer I worked on had 670MB of disk and 8MB of RAM. Teradata WAS Big Data.

These days an iPod Nano is massively more powerful. A mid-range desktop has 2,000x the RAM.

The only one of the Vs that presents a constant challenge is "Variety". There are over 50,000 file formats out there containing data, and that is before you face the challenge of file layouts.


LinkedIn Profile
Newbie on www.simple-talk.com
Post #1559807
Posted Wednesday, April 9, 2014 9:12 AM
SSC-Enthusiastic


Group: General Forum Members
Last Login: Wednesday, September 24, 2014 9:23 AM
Points: 186, Visits: 365
I find that, like most things, it is not the tools; it is how they are used.

A good mechanic with basic tools can do great things, while someone with less understanding of the problem and the needs (aka a beginner, novice, overconfident sophomore, etc.) will be limited in their success regardless of the amazing tools they have.

Tools are great, but as Steve and others have pointed out, one must understand the problems.

YouTube has many videos on how to set up a Hadoop cluster in 30 minutes.
And with the ease that just about anyone can install SQL Server, we know that being able to install the tools is far different from being able to use them wisely.

I watched a webinar on how Microsoft has worked with Hortonworks to integrate Hadoop with the Parallel Data Warehouse. I believe it was called PolyBase. Very nice.
I hope we can see more of PolyBase, and have access to it for those of us who do not have the resources to acquire PDW.


The more you are prepared, the less you need it.
Post #1560017
Posted Wednesday, April 9, 2014 9:17 AM
Right there with Babe


Group: General Forum Members
Last Login: Tuesday, September 2, 2014 8:37 AM
Points: 751, Visits: 1,917
Big amounts of garbage data are still garbage. Throwing more data at a flawed analysis does not make it better. With enough data you can draw all sorts of correlations that are absolutely meaningless.

You need to understand the problem first, then understand the weaknesses and reliability issues in all your sources. When you finally do get a result you need to back test it against independent data sets to see if it still holds up.
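To illustrate the point about meaningless correlations, here is a quick sketch in plain Python with made-up random data: generate enough unrelated series and some of them will correlate with your target purely by chance, which is exactly why back-testing against independent data matters.

```python
import random

random.seed(42)

# A target series and 1,000 candidate "signals", all pure noise.
# By construction, none of the signals has any real relationship to the target.
n_obs, n_signals = 50, 1000
target = [random.gauss(0, 1) for _ in range(n_obs)]
signals = [[random.gauss(0, 1) for _ in range(n_obs)] for _ in range(n_signals)]

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length series."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

# Pick the "best" signal the way a careless analyst might: by searching.
best = max(abs(pearson(s, target)) for s in signals)
print(f"best |r| among {n_signals} meaningless signals: {best:.2f}")
```

With only 50 observations per series, the strongest of a thousand coincidental correlations typically looks impressively large, even though every signal is noise. A hold-out data set exposes it immediately.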

Unfortunately many managers (as well as some IT people) are so excited by the prospect of magical extraction that they fail to take a hard, critical look at their processes.


...

-- FORTRAN manual for Xerox Computers --
Post #1560021
Posted Wednesday, April 9, 2014 11:25 AM
SSC Rookie


Group: General Forum Members
Last Login: Today @ 8:23 AM
Points: 33, Visits: 276
I don't know; it still walks and quacks like a duck to me. Sure, everybody's got more data, and guess what? There's valuable information in there! I think we used to call it Data Mining. We had to validate our models then too. I'm failing to see what has fundamentally changed.

If the term 'Big Data' makes the people who sign the checks more likely to invest in harvesting their data assets, then I think that's a good end result. From a career perspective as a DBA, you don't want to find yourself going from managing small databases to big ones and not be prepared. IMO, it's a whole different ball game.
Post #1560112
Posted Wednesday, April 9, 2014 12:55 PM


SSC-Addicted


Group: General Forum Members
Last Login: Monday, September 15, 2014 3:08 PM
Points: 440, Visits: 595
Steve, you said in your editorial:
"Researchers need to be willing to evolve their algorithms as they learn more about a problem. Probably they should also assume their algorithms are not correct until they've proven their ability to predict actual trends for some period of time. "

I agree strongly with this. The CIA has put this into practice, pitting their own analysts using classified information against various other scientifically tracked methods of probability analysis. You can read about one of these projects here: http://www.npr.org/blogs/parallels/2014/04/02/297839429/-so-you-think-youre-smarter-than-a-cia-agent

This is a years-long project being tested for reproducible results over time using different groups of analysts. It is hard to convince a company needing a stock boost right now to invest the time and money in scientific accountability. But the long game seems to be paying off handsomely for the CIA.
Post #1560148
Posted Wednesday, April 9, 2014 3:03 PM


SSC-Dedicated


Group: Administrators
Last Login: Today @ 5:59 PM
Points: 31,082, Visits: 15,529
ahperez (4/9/2014)
Steve, you said in your editorial:
"Researchers need to be willing to evolve their algorithms as they learn more about a problem. Probably they should also assume their algorithms are not correct until they've proven their ability to predict actual trends for some period of time. "

I agree strongly with this. The CIA has put this into practice, pitting their own analysts using classified information against various other scientifically tracked methods of probability analysis. You can read about one of these projects here: http://www.npr.org/blogs/parallels/2014/04/02/297839429/-so-you-think-youre-smarter-than-a-cia-agent

This is a years-long project being tested for reproducible results over time using different groups of analysts. It is hard to convince a company needing a stock boost right now to invest the time and money in scientific accountability. But the long game seems to be paying off handsomely for the CIA.


I read that. It's interesting, the wisdom of the crowds doing well.








Follow me on Twitter: @way0utwest

Forum Etiquette: How to post data/code on a forum to get the best help
Post #1560194
Posted Thursday, April 10, 2014 2:15 AM


SSCertifiable


Group: General Forum Members
Last Login: Today @ 4:41 PM
Points: 5,471, Visits: 3,258
For me, Big Data is slightly different. The name refers to a larger quantity of data, but larger than what? Basically, in my opinion, Big Data is the pooling of data from multiple, even numerous, sources. For some, Big Data amounts to a quantity of data that, for others, is of a smaller scale than what they handled before Big Data. So Big Data is more data, but only relative to one's own prior experience, drawn from a wider range of sources.

or

Big Data = Big Bucks!!!


Gaz

-- Stop your grinnin' and drop your linen...they're everywhere!!!
Post #1560282
Posted Thursday, April 10, 2014 10:58 AM
Forum Newbie


Group: General Forum Members
Last Login: Thursday, April 10, 2014 10:58 AM
Points: 3, Visits: 18
At SQL Saturday 279 I felt like I finally understood Big Data. Carlos Bossy presented in a clear fashion that cut through the hype to show us what Big Data is, what it can do well and what it can't.

Here is my summary of Carlos's presentation (http://carlosbossy.wordpress.com/downloads/):
Big Data = large volumes, complex and unstructured.
The secret to getting anything useful from Big Data is MapReduce.

You, as the developer, must write a Map function: this is where you map the unstructured data into a structure. Your function must parse through the unstructured data, deciding what data to include and then imposing a structure on it. So it is kind of like a parsing function plus the SELECT, FROM, and WHERE clauses.

Then you must write a Reduce function, where you group and aggregate your data.

What makes this possible is the architecture of Hadoop: HDFS blocks of data are replicated to local storage on 3 nodes (which also eliminates the need for backups), and the processing runs in parallel.

But this architecture also means that Hadoop is slow at running a query compared to standard SQL Server queries against structured data. Things that SQL Server can query in seconds can take Hadoop minutes. But queries against really large unstructured data that would take standard SQL Server days can be done by Hadoop in hours or even minutes.
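To make the Map/Reduce split concrete, here is a miniature sketch in plain Python, with toy made-up log lines and a dictionary standing in for Hadoop's shuffle phase. The Map function parses and filters (the SELECT/FROM/WHERE part); the Reduce function groups and aggregates.

```python
from collections import defaultdict

# Toy "unstructured" input: free-text web log lines, one of which is malformed.
lines = [
    "10.0.0.1 - GET /index.html 200 512",
    "10.0.0.2 - GET /missing 404 0",
    "garbage line that will not parse",
    "10.0.0.1 - POST /api/orders 200 2048",
]

def map_fn(line):
    """Impose structure: parse each line, discard what doesn't fit,
    and emit (key, value) pairs -- here (http_status, bytes_served)."""
    parts = line.split()
    try:
        ip, _, verb, path, status, size = parts  # fails if the layout differs
        yield (status, int(size))                # fails if size isn't numeric
    except ValueError:
        return  # skip lines that don't match the expected layout

def reduce_fn(status, sizes):
    """Group and aggregate: total bytes served per HTTP status code."""
    return status, sum(sizes)

# The shuffle step Hadoop performs between map and reduce:
grouped = defaultdict(list)
for line in lines:
    for key, value in map_fn(line):
        grouped[key].append(value)

results = dict(reduce_fn(k, v) for k, v in grouped.items())
print(results)  # {'200': 2560, '404': 0}
```

In real Hadoop these two functions would run as Mapper and Reducer tasks across many nodes, with HDFS holding the input blocks; the logic is the same, just distributed.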



Post #1560545