Two quick points: I’m putting this blog post together using the Surface (ooh), and this isn’t a keynote, but a spotlight session at the Summit. Still, I thought I would live blog my thoughts because I’ve done it every time Dr. Dewitt has spoken at the Summit.
Right off, he has a slide with a little brain character representing himself.
But we’re talking PolyBase and futures. This is basically a way to combine unstructured NoSQL data in Hadoop with structured storage within SQL Server. Mostly this is within the new Parallel Data Warehouse (PDW). But it’s coming to all of SQL Server, so we need to learn this. The information ties directly back to what was presented at yesterday’s keynote.
HDFS is the file system. On top of that sits MapReduce, a framework for executing distributed, fault-tolerant algorithms. Hive & Pig are the query languages. Sqoop is the package for moving data, and Dr. Dewitt says it’s awful and he’s going to tell us why.
HDFS was based on the Google File System. It supports 1000s of nodes and it assumes hardware failure. It’s aimed at small numbers of large files: write once, read multiple times. The limitations on it are caused by the replication of the files, which makes querying the information from a data warehouse more difficult. He covers all the types of nodes that manage HDFS.
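To make the replication point concrete for myself, here is a rough Python sketch of a file being cut into blocks and each block being copied to several nodes. None of this is from the talk: the 64 MB block size, the three replicas, the node names and the round-robin placement are all my own assumptions for illustration, and real HDFS placement is smarter than this (it is rack-aware, for one).

```python
MB = 1024 * 1024

def split_into_blocks(file_size, block_size):
    """Return the size of each block a file of file_size bytes is cut into."""
    full, remainder = divmod(file_size, block_size)
    return [block_size] * full + ([remainder] if remainder else [])

def place_replicas(num_blocks, nodes, replication=3):
    """Assign each block to `replication` distinct nodes (naive round-robin).

    Assumption: real HDFS uses a rack-aware policy, not round-robin.
    """
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        block: [nodes[(block + r) % len(nodes)] for r in range(replication)]
        for block in range(num_blocks)
    }

def readable_after_failure(placement, failed_node):
    """True if every block still has a copy on some node other than failed_node."""
    return all(
        any(node != failed_node for node in replicas)
        for replicas in placement.values()
    )

# Hypothetical 200 MB file on a hypothetical four-node cluster.
blocks = split_into_blocks(200 * MB, 64 * MB)
placement = place_replicas(len(blocks), ["n1", "n2", "n3", "n4"])
survives = readable_after_failure(placement, "n2")
```

Because every block lives on three different machines, losing any one of them still leaves the whole file readable, which is the "assumes hardware failure" part.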
MapReduce is used as a framework for accessing the data. It splits the big problem into several small problems and puts the work out onto the nodes. That’s Map. Then the partial results from all the nodes are combined back together through Reduce. MapReduce uses a master, the JobTracker, and slaves, multiple TaskTrackers.
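The Map and Reduce split is easier to see in a toy example. This is a minimal, single-process Python sketch of the classic word count, not Hadoop code: the function names are mine, and each input string just stands in for the chunk of data that would live on one node.

```python
from collections import defaultdict

def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of input."""
    return [(word.lower(), 1) for word in chunk.split()]

def shuffle(pairs):
    """Group the intermediate values by key so each key is reduced once."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(key, values):
    """Reduce: combine the partial results for one key."""
    return key, sum(values)

def word_count(chunks):
    # Each chunk stands in for the piece of data held on one node.
    intermediate = []
    for chunk in chunks:
        intermediate.extend(map_phase(chunk))
    return dict(reduce_phase(k, v) for k, v in shuffle(intermediate).items())

counts = word_count(["big data big", "data nodes", "big nodes nodes"])
```

In the real thing the map calls run in parallel on the nodes holding the data, and the JobTracker hands the work out to the TaskTrackers. Here it is all one loop.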
Hive is a data warehouse solution for Hadoop. It supports SQL-like queries. It has somewhat performant queries. By “somewhat,” he means that PDW is 10 times faster.
Sqoop is the library and framework for moving data between HDFS and a relational DBMS. It serializes access to Hadoop. That’s the purpose of PolyBase: to get parallel access across all of HDFS. Sqoop breaks up a query through the Map process. It runs two queries: first a count, and then a reworked version of the original that is pretty scary and includes an ORDER BY clause. This causes multiple scans against the tables.
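Here is my attempt at sketching that count-then-ORDER BY pattern in Python against an in-memory SQLite table. To be clear, this is not Sqoop’s actual SQL. The LIMIT/OFFSET form is my guess at what the "scary" query looks like, and the table, column and function names are made up.

```python
import sqlite3

def sqoop_style_export(conn, table, key, num_mappers):
    """Illustrative only: split a table across mappers using a count plus
    ORDER BY ... LIMIT/OFFSET. Not the SQL that Sqoop really generates."""
    # Query 1: count the rows so the work can be divided.
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    chunk = -(-total // num_mappers)  # ceiling division
    parts = []
    for mapper in range(num_mappers):
        # Query 2, once per mapper: the table is ordered again every time
        # just to skip to this mapper's slice, hence the repeated scans.
        rows = conn.execute(
            f"SELECT * FROM {table} ORDER BY {key} LIMIT ? OFFSET ?",
            (chunk, mapper * chunk),
        ).fetchall()
        parts.append(rows)
    return total, parts

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)", [(i, i * 1.5) for i in range(1, 11)]
)
total, parts = sqoop_style_export(conn, "orders", "id", 3)
```

With three mappers the table gets counted once and then ordered three times. That is fine for ten rows and painful for ten billion.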
Dr. Dewitt talks through the choices for figuring out how to put together the two data sets, structured and unstructured. The approach taken by PolyBase is to work directly into HDFS, ignoring where the nodes are stored. Because it’s all going through their own code, they’re also set up to handle text and other data streams.
They’re parallelizing access to HDFS and supporting multiple file types. Further, they’re putting “structure” on “unstructured” data.
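What putting “structure” on a plain file means, as I understand it, is that column names and types get applied when the file is read rather than when it is written. A small Python sketch of that idea follows. The schema, the pipe delimiter and the sample rows are all invented, and this says nothing about how PolyBase actually does it internally.

```python
import csv
import io

# Hypothetical schema: (column name, type to cast to), in file order.
SCHEMA = [("user_id", int), ("page", str), ("seconds", float)]

def read_with_schema(text, schema, delimiter="|"):
    """Yield one dict per well-formed line, typed according to schema."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    for fields in reader:
        if len(fields) != len(schema):
            continue  # skip lines that do not match the expected shape
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}

raw = "1|/home|3.5\n2|/search|12.0\nbad line\n"
rows = list(read_with_schema(raw, SCHEMA))
```

The file itself never changes. A different schema over the same bytes would give a different "table". A value that cannot be cast (say, text in the user_id column) would raise a ValueError in this sketch; a real system would need a policy for rejected rows.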
By the way, I’m trying to capture some of this information, but I have to pay attention. This is great stuff.
How the DMS (Data Movement Service), the stuff used by Microsoft to manage the jump between HDFS and SQL Server, works is just flat-out complicated. But the goal was to address the issues above, and it does.
He’s showing the direction that they’re heading in. You can create nodes and objects within the nodes through SQL-like syntax. Same thing with the queries. They’ll be using the PDW optimizer. Phase 2 modifies the methods used.
I’m frankly having a little trouble keeping up.
It’s pretty clear that the PDW in combination with HDFS allows for throwing lots and lots of machines at the problem. If I were in the situation of needing to collect & process seriously huge data, I’d be checking this out. The concept is to use MapReduce directly, but without requiring the user to do that work; the user writes T-SQL instead. It’s seriously slick.
By the way, this is also making yesterday’s keynote more exciting. That did get a bad rap yesterday, but I’m convinced it was a great presentation spoiled by some weak presentation skills.
All the work in Phase 1 is done on PDW. Phase 2 moves the work, optionally, to HDFS directly, but still allows for that to be through a query.
Dr. Dewitt’s explanation of how the queries are moved in and out of PDW and HDFS is almost understandable, not because he’s not explaining it well, but because I’m not understanding it well. But seeing how the structures are logically handling the information does make me more comfortable with what’s going on over there in HDFS.
I’m starting to wonder if I’m making buggy whips and this is an automobile driving by. The problem is, how on earth do you get your hands on PDW to start learning this?