• At SQL Saturday 279 I felt like I finally understood Big Data. Carlos Bossy presented in a clear fashion that cut through the hype to show us what Big Data is, what it can do well and what it can't.

    Here is my summary of Carlos's presentation (http://carlosbossy.wordpress.com/downloads/):

    Big Data = large volumes of complex and unstructured data

    The secret to getting anything useful from Big Data is MapReduce.

    You as the developer must write a Map function: this is where you map the unstructured data into a structure. Your function must parse the unstructured data, decide which data to include, and then impose a structure on it. So it is kind of like a parsing function plus the SELECT, FROM and WHERE clauses of a SQL query.

    Then you must write a Reduce function, where you group and aggregate your data (roughly the GROUP BY and aggregate functions of a SQL query).
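
    To make the Map and Reduce steps concrete, here is a minimal sketch in plain Python. It is not Hadoop code and not taken from Carlos's slides: the log format, the sample lines and the names map_line, reduce_counts and run_job are all made up for illustration. The grouping step in the middle is what the framework would normally do for you between the two phases.

```python
from collections import defaultdict

# Hypothetical raw log lines; the format is invented for this example.
RAW_LINES = [
    "2014-03-01 10:02:11 GET /products/42 200",
    "garbage line that does not parse",
    "2014-03-01 10:02:15 GET /products/42 200",
    "2014-03-01 10:03:40 GET /cart 500",
    "2014-03-01 10:04:02 POST /cart 200",
]


def map_line(line):
    """Map: parse one raw line, filter it, and emit (key, value) pairs."""
    parts = line.split()
    if len(parts) != 5:      # drop input that does not fit the expected shape
        return
    _date, _time, method, path, _status = parts
    if method != "GET":      # plays the role of a WHERE clause
        return
    yield path, 1            # the imposed structure: (path, count)


def reduce_counts(key, values):
    """Reduce: aggregate every value that shares a key (GROUP BY + SUM)."""
    return key, sum(values)


def run_job(lines):
    # Group mapped values by key; a real framework does this between phases.
    grouped = defaultdict(list)
    for line in lines:
        for key, value in map_line(line):
            grouped[key].append(value)
    return dict(reduce_counts(k, v) for k, v in grouped.items())


result = run_job(RAW_LINES)
print(result)
```

    The point of the split is that map_line only ever looks at one line, so many copies of it can run at the same time on different pieces of the data.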

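    What makes this possible is the architecture of Hadoop: the Hadoop Distributed File System (HDFS) splits the data into blocks and replicates each block to local storage on 3 nodes (which, according to the presentation, also eliminates the need for backups), and the Map and Reduce work runs in parallel across those nodes.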
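
    The replication idea can be sketched as a toy model in Python. This is only an illustration under simplifying assumptions: the round-robin placement, the node names, the sizes and the functions place_blocks and surviving_blocks are all invented here, and real HDFS chooses replica locations with its own, more involved policy.

```python
def place_blocks(file_size, block_size, nodes, replication=3):
    """Return {block_index: [node, ...]}, each block on distinct nodes."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    num_blocks = -(-file_size // block_size)  # ceiling division
    placement = {}
    for b in range(num_blocks):
        # Simplified round-robin placement (an assumption, not HDFS policy).
        placement[b] = [nodes[(b + r) % len(nodes)] for r in range(replication)]
    return placement


def surviving_blocks(placement, failed_node):
    """Blocks that still have at least one replica after one node fails."""
    return [
        b for b, replicas in placement.items()
        if any(node != failed_node for node in replicas)
    ]


NODES = ["node1", "node2", "node3", "node4"]  # hypothetical cluster
placement = place_blocks(file_size=500, block_size=128, nodes=NODES)
for block, replicas in placement.items():
    print(block, replicas)
print(surviving_blocks(placement, "node2"))
```

    Because every block lives on several nodes, losing one node does not lose any data, and the Map work for a block can be scheduled on whichever node already holds a copy.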

    But this architecture also means that Hadoop is slow at running a query compared to standard SQL Server queries against structured data. Things that SQL Server can query in seconds can take Hadoop minutes. On the other hand, for really large unstructured data, results that would take standard SQL Server queries days to produce can come back from Hadoop in hours or even minutes.