The past two years Dr. DeWitttook us through the internals of columnstore databases and queryoptimization. They were fantasticallycontent-rich and dug right to the bottom of the technology. On top of explaining the mechanics of thequery optimizer and theory of column-based storage, Dr. DeWitt demonstratedunderstandable applications of the change in technology to common queries andfunctions we all enjoy today in SQL Server. He's also a pretty funny presenter.
That brings us totoday and the changing world of data. We’re going to get the concept of “big data” funneled into our brainsfaster than we can process it – but that’s part of why I’m here. I hope to capture some of the content thatcomes flying at us this morning about Hadoop and distributed data processing. But my other job is to comment on how thismovement will affect our industry going forward, changing your data, your job,your server architecture, and even SQL Server internals. I have no doubt that I’ll only scratch thesurface of what Dr. DeWitt conveys this morning – and for that reason alone, Irecommend you don’t miss his keynote (if they can get him) next year at thePASS Community Summit.
According to Dr.DeWitt, big data is a large collection of data – petabytes – on a largecollection of machines. To others, “bigdata” means NoSQL, but that's not a requirement to Dr. DeWitt. For some sophisticated commentary on Dr. DeWitt’s presentation (and some inspired discussion), I expectto see something on Jeremiah Peschka’s blog – I’d encourage you to go read itif you end up interested in this type of technology.
The data explosion islargely due to automation, a sea change in collecting data. In the past, it was all manually entered, butnow cell phones collect location data automatically, clickstreams ruleadvertising, and on and on. The “oldguard” (and Dr. DeWitt freely considers himself part of that) would handlethese petabyte data warehouses on a parallel database system. The new guard wants to use NoSQL. In his observations, current implementationsof NoSQL take about 5 times the server hardware as parallel technologies. For example, EBay stores 10PB on 256 nodes vs.Facebook’s 20PB on 2700 nodes, and Bing’s 150PB on 27,000 nodes! Of course, parallel DB technologies aretypically (for good reason) delivered as appliances of high-quality hardwarecompared to NoSQL usage of commodity type (possibly-heterogeneous) servers. As with anything to do with databases - "it depends."
NoSQL isn’t all thesame – it’s a big tent of technology – but there are a couple major types: Key/value stores (CouchBase, MongoDB, …) andHadoop. The former are intendedprimarily for single-record updates/retrievals – you can think of them as the“OLTP NoSQL” technology. Hadoop isdesigned for processing massive data sets, and is roughly expressible as “datawarehouse NoSQL.”
The claim that NoSQLworks on unstructured data, and RDBMSs are for structured data is amisconception. NoSQL databases do havestructure – it’s just that the structure is applied to the data later in theprocess than for your traditional RDBMS. Instead of the metadata being stored with the data – in fact being arequirement to actually store the data – NoSQL metadata is “applied” when theanalysis function processes the data. I think of it as “compiled” vs. “interpreted” code.
Only eight years ago, Google started in on thebig data problem in order to analyze clickstream data. Hadoop is an open-source interpretation ofGoogle’s Map/Reduce+GFS system. Thegeneral structure has two parts – a fault-tolerant data storage layer (HDFS)and a fault tolerant processing layer.
The data storage layerprovides fault tolerant storage using file redundancy across different nodes inthe cluster of servers, protecting against individual server failure as well asentire racks failing. You can think ofthis like RAID – data is smartly distributed over the cluster. If one node fails, the data exists on othernodes, so it isn’t lost. The maindifference here is that RAID only addresses single disk failures, whereas theHDFS protects against query, disk, server, switch, rack, and datacenterfailures without necessarily having to restart or recover… using only one “redundancy”algorithm.
The layer that sits ontop of the storage is the processing engine – Map/Reduce. Map/Reduce is best at performingaggregations, but it isn’t designed to do joins. Joins take a ton of code to achieve – four fullscreens of multiple “programs” compared to a quarter-screen of SQL (HiveQL). Dr. DeWitt compared performanceof a few TPC-H “structured” queries between Hive and PDW – all showingsignificant wins by parallel database “old guard” technology. The reason is primarily due to “oldtechnology” being intentionally designed to handle these specific type ofqueries – even though they’re aggregations that he’d just said performtypically better in NoSQL scenarios. Why? Because of the inherentoptimizations possible when the data used is a TPC-H workload, which is alreadyschema’d and loaded. If the datastructure wasn’t known ahead of time, a parallel database would not be capableof servicing the query at all – until metadata was “applied” by performing ETL.
There still is astrong use case for NoSQL – and his reasoning follows how I think about therelationship between SSAS (cubes) and PowerPivot (tabular) analysistechnologies. In Map/Reduce, the schemais stored in the application that’s doing the analysis itself – not within thedata like in SQL. So it’s a lot “easier”and better to use when you’re doing exploratory analysis of data. Much like PowerPivot, you only need to defineenough schema to get your particular job done – not enough schema to get anytype of query done, or “explain” each piece of data in the set. But at some point, if your exploratoryanalysis finds value in the data, you’re going to want to move it into an RDBMSto get the query performance out of it, and have a consistent schema applied toavoid two different analysis applications interpreting the datain inconsistent ways. Currently, there aretechniques and technologies for moving data between these platforms, but they canbe pretty difficult and involved. One ofthose is called Sqoop.
Dr.DeWitt envisions abetter way to share data between SQL and NoSQL worlds than Sqoop. He’s investigating the concept of an“Enterprise Data Manager” that can relate and use unstructured and structureddata so that the appropriate engine can traverse and analyze whichever type ofdata it’s presented with.A few ideas Dr. DeWitthad for this year that “didn’t make the cut”, but could be a topic for nextyear’s Summit:
- What the “Cloud” meansfor databases
- The impact of Solid StateDrives and Phase Change Memory
- ... whatever you mightwant to see – send your ideas to PASS!