Dr. DeWitt has spoken to the PASS Summit for the past two years – turning what was otherwise a very forgettable “day 3” keynote slot into a must-see event for anyone going to the Summit.
The past two years Dr. DeWitt took us through the internals of columnstore databases and query optimization. They were fantastically content-rich and dug right to the bottom of the technology. On top of explaining the mechanics of the query optimizer and theory of column-based storage, Dr. DeWitt demonstrated understandable applications of the change in technology to common queries and functions we all enjoy today in SQL Server. He's also a pretty funny presenter.
That brings us to today and the changing world of data. We’re going to get the concept of “big data” funneled into our brains faster than we can process it – but that’s part of why I’m here. I hope to capture some of the content that comes flying at us this morning about Hadoop and distributed data processing. But my other job is to comment on how this movement will affect our industry going forward, changing your data, your job, your server architecture, and even SQL Server internals. I have no doubt that I’ll only scratch the surface of what Dr. DeWitt conveys this morning – and for that reason alone, I recommend you don’t miss his keynote (if they can get him) next year at the PASS Community Summit.
According to Dr. DeWitt, big data is a large collection of data – petabytes – on a large collection of machines. To others, “big data” means NoSQL, but that's not a requirement to Dr. DeWitt. For some sophisticated commentary on Dr. DeWitt’s presentation (and some inspired discussion), I expect to see something on Jeremiah Peschka’s blog – I’d encourage you to go read it if you end up interested in this type of technology.
The data explosion is largely due to automation, a sea change in collecting data. In the past, it was all manually entered, but now cell phones collect location data automatically, clickstreams rule advertising, and on and on. The “old guard” (and Dr. DeWitt freely considers himself part of that) would handle these petabyte data warehouses on a parallel database system. The new guard wants to use NoSQL. In his observations, current implementations of NoSQL take about 5 times the server hardware as parallel technologies. For example, EBay stores 10PB on 256 nodes vs. Facebook’s 20PB on 2700 nodes, and Bing’s 150PB on 27,000 nodes! Of course, parallel DB technologies are typically (for good reason) delivered as appliances of high-quality hardware compared to NoSQL usage of commodity type (possibly-heterogeneous) servers. As with anything to do with databases - "it depends."
NoSQL isn’t all the same – it’s a big tent of technology – but there are a couple major types: Key/value stores (CouchBase, MongoDB, …) and Hadoop. The former are intended primarily for single-record updates/retrievals – you can think of them as the “OLTP NoSQL” technology. Hadoop is designed for processing massive data sets, and is roughly expressible as “data warehouse NoSQL.”
The claim that NoSQL works on unstructured data, and RDBMSs are for structured data is a misconception. NoSQL databases do have structure – it’s just that the structure is applied to the data later in the process than for your traditional RDBMS. Instead of the metadata being stored with the data – in fact being a requirement to actually store the data – NoSQL metadata is “applied” when the analysis function processes the data. I think of it as “compiled” vs. “interpreted” code.
Only eight years ago, Google started in on the big data problem in order to analyze clickstream data. Hadoop is an open-source interpretation of Google’s Map/Reduce+GFS system. The general structure has two parts – a fault-tolerant data storage layer (HDFS) and a fault tolerant processing layer.
The data storage layer provides fault tolerant storage using file redundancy across different nodes in the cluster of servers, protecting against individual server failure as well as entire racks failing. You can think of this like RAID – data is smartly distributed over the cluster. If one node fails, the data exists on other nodes, so it isn’t lost. The main difference here is that RAID only addresses single disk failures, whereas the HDFS protects against query, disk, server, switch, rack, and datacenter failures without necessarily having to restart or recover… using only one “redundancy” algorithm.
The layer that sits on top of the storage is the processing engine – Map/Reduce. Map/Reduce is best at performing aggregations, but it isn’t designed to do joins. Joins take a ton of code to achieve – four full screens of multiple “programs” compared to a quarter-screen of SQL (HiveQL). Dr. DeWitt compared performance of a few TPC-H “structured” queries between Hive and PDW – all showing significant wins by parallel database “old guard” technology. The reason is primarily due to “old technology” being intentionally designed to handle these specific type of queries – even though they’re aggregations that he’d just said perform typically better in NoSQL scenarios. Why? Because of the inherent optimizations possible when the data used is a TPC-H workload, which is already schema’d and loaded. If the data structure wasn’t known ahead of time, a parallel database would not be capable of servicing the query at all – until metadata was “applied” by performing ETL.
There still is a strong use case for NoSQL – and his reasoning follows how I think about the relationship between SSAS (cubes) and PowerPivot (tabular) analysis technologies. In Map/Reduce, the schema is stored in the application that’s doing the analysis itself – not within the data like in SQL. So it’s a lot “easier” and better to use when you’re doing exploratory analysis of data. Much like PowerPivot, you only need to define enough schema to get your particular job done – not enough schema to get any type of query done, or “explain” each piece of data in the set. But at some point, if your exploratory analysis finds value in the data, you’re going to want to move it into an RDBMS to get the query performance out of it, and have a consistent schema applied to avoid two different analysis applications interpreting the data in inconsistent ways. Currently, there are techniques and technologies for moving data between these platforms, but they can be pretty difficult and involved. One of those is called Sqoop.
Dr.DeWitt envisions a better way to share data between SQL and NoSQL worlds than Sqoop. He’s investigating the concept of an “Enterprise Data Manager” that can relate and use unstructured and structured data so that the appropriate engine can traverse and analyze whichever type of data it’s presented with.
A few ideas Dr. DeWitt had for this year that “didn’t make the cut”, but could be a topic for next year’s Summit:
- What the “Cloud” means for databases
- The impact of Solid State Drives and Phase Change Memory
- ... whatever you might want to see – send your ideas to PASS!
If you want more, visit David DeWitt and his team on the web, there are links to the PASS talks on the left side. My thanks go to Dr. DeWitt and the team that got him up on stage - I really hope they can do so again next year.