PASS Summit Day 3: Dr. David Dewitt

Two quick points, I’m putting this blog together using the Surface.. ooh… and this isn’t a keynote, but a spotlight session at the Summit. Still, I thought I would live blog my thoughts because I’ve done it for every time Dr. Dewitt has spoken at the Summit.

Right off, he has a slide with a little brain character representing himself.

But, we’re talking PolyBase, and futures. This is basically a way to combine hadoop unstructured nosql data with structured storage within SQL Server. Mostly this is within the new Parallel Datawarehouse. But it’s coming to all of SQL Server, so we need to learn this. The information ties directly back to what was presented at yesterday’s keynote.

HDFS is the file system. On top of that a framework for executing distributed fault-tolerant algorithms. Hive & Pig are the SQL languages. Sqooop is the package for moving data and Dr Dewitt says it’s awful and he’s going to tell us why.

HDFS was based on a google file system. It supports 1000s of nodes and it assumes hardware failure. It’s aimed at small numbers of large files. Write once, read multiple times. The limitations on it are caused by the replication of the files which makes querying the information from a datawarehouse more difficult. He covers all the types of nodes that manage HDFS.

MapReduce is used as a framework for accessing the data. It splits the big problem into several small problems. It puts the work out into the nodes. That’s Map, Then the partial results from all the nodes is combined back together through Reduce. MapReduce uses a master, JobTracker and slaves, multiple TaskTrackers.

Hive, a datawarehouse solution for Hadoop. Supports SQL-like queries.It has somewhat performant queries. By somewhat he says that the PDW is 10 times faster.

Sqoop is the library and framework for moving data between HDFS and a relational DBMS. It seriealizes access to hadoop. That’s the purpose of PolyBase to get parallel execution access all the Hadoop hdfs. Sqoop breaks up a query through Map process. Then Sqooop runs two queries a count, and then reworks the query into a pretty scary query including an ORDER BY statement. This causes multiple scans against the tables.

Dr. Dewitt talks through the choices for figuring out how to put together the two data sets, structured and unstructured. The approach taken by Polybase is to work directly into HDFS, ignoring where the nodes are stored. Because it’s all going through their own code, they’re also setting up to text and other data streams.

They’re parallelizing access to HDFS and supporting multiple file types. Further, putting “structure” on “unstructured data”

By the way, I’m trying to capture some of this information, but I have to pay attention. This is great stuff.

How the DMS,the stuff used by Microsoft to manage the jump between HDFS and SQL Server is just flat out complicated. But the goal was to address the issues above and it does it.

He’s showing the direction that they’re heading in. You can create nodes and objects within the nodes through sql-like syntax. Same thing with the queries. They’ll be using the PDW optimizer. Phase 2 modifies the methods used.

I’m frankly having a little trouble keeping up.

It’s pretty clear that the PDW in combination with the HDFS allows for throwing lots and lots of machines at the problem. If I was in the situation of needing to collect & process seriously huge data, I’d be checking this out. The concepts are to use MapReduce directly, but without requiring the user to do that work, but instead using TSQL. It’s seriously slick.

By the way, this is also making yesterday’s keynote more exciting. That did get a bad rap yesterday, but I’m convinced it was a great presentation spoiled by some weak presentation skills.

All the work in Phase 1 is done on PDW. Phase 2 moves the work, optionally, to HDFS directly, but still allows for that to be through a query.

Dr. Dewitt’s explanation of how the queries are moved in and out of PDW and HDFS are almost understandable, not because he[s not explaining it well, but because I’m not understanding it well. But seeing how the structures are logically handling the information does make me more comfortable with what’s going on over there in HDFS.

I’m starting to wonder if I’m making buggy whips and this is an automobile driving by. The problem is, how on earth do you get your hands on PDW to start learning this?

Book Review: Big Red - Voyage of a Trident Submarine

by Andy Warren

SQLServerCentral.com

Blogs

I've grown up reading Tom Clancy and probably most of you have at least seen Red October, so this book caught my eye when browsing used books for a recent trip. It's a fairly human look at what's involved in sailing on a Trident missile submarine...

★ ★ ★ ★ ★ ★ ★ ★ ★ ★

You rated this post out of 5. Change rating

2009-03-10

1,439 reads

Database Mirroring FAQ: Can a 2008 SQL instance be used as the witness for a 2005 database mirroring setup?

by Robert Davis

SQLServerCentral.com

Blogs

Question: Can a 2008 SQL instance be used as the witness for a 2005 database mirroring setup? This question was sent to me via email. My reply follows. Can a 2008 SQL instance be used as the witness for a 2005 database mirroring setup? Databases to be mirrored are currently running on 2005 SQL instances but will be upgraded to 2008 SQL in the near future.

★ ★ ★ ★ ★ ★ ★ ★ ★ ★

You rated this post out of 5. Change rating

2009-02-23

1,567 reads

Inserting Markup into a String with SQL

by Phil Factor

SQLServerCentral.com

T-SQL

In which Phil illustrates an old trick using STUFF to intert a number of substrings from a table into a string, and explains why the technique might speed up your code...

★ ★ ★ ★ ★ ★ ★ ★ ★ ★

You rated this post out of 5. Change rating

2009-02-18

1,631 reads

Networking - Part 4

by Andy Warren

SQLServerCentral.com

Blogs

You may want to read Part 1 , Part 2 , and Part 3 before continuing. This time around I'd like to talk about social networking. We'll start with social networking. Facebook, MySpace, and Twitter are all good examples of using technology to let...

★ ★ ★ ★ ★ ★ ★ ★ ★ ★

You rated this post out of 5. Change rating

2009-02-17

1,530 reads

Speaking at Community Events - More Thoughts

by Andy Warren

SQLServerCentral.com

Blogs

Last week I posted Speaking at Community Events - Time to Raise the Bar?, a first cut at talking about to what degree we should require experience for speakers at events like SQLSaturday as well as when it might be appropriate to add additional focus/limitations on the presentations that are accepted. I've got a few more thoughts on the topic this week, and I look forward to your comments.

★ ★ ★ ★ ★ ★ ★ ★ ★ ★

You rated this post out of 5. Change rating

2009-02-13

360 reads

PASS Summit Day 3: Dr. David Dewitt

Rate

Share

Share

Rate

PASS Summit Day 3: Dr. David Dewitt

Rate

Share

Share

Rate

Related content

Book Review: Big Red - Voyage of a Trident Submarine

Database Mirroring FAQ: Can a 2008 SQL instance be used as the witness for a 2005 database mirroring setup?

Inserting Markup into a String with SQL

Networking - Part 4

Speaking at Community Events - More Thoughts