It’s the million dollar question. Literally.
I stopped by the SQL Server area in the TechEd expo to talk with Val Fontana, a Microsoftie I’ve known for a few years, about the largest, and yet-to-be-released, edition of SQL Server 2008 R2, known as Parallel Data Warehouse. It actually sits above Datacenter edition, and depending on the configuration, the appliance can run up to around $2 million, with roughly 90+ TB of data.
Literally a million dollar question if you have a large data warehouse.
I also went to a deep dive on this product from Matt Peebles, formerly of DATAllegro, which was purchased by Microsoft. Supposedly all of the Linux/Ingres/other technology has been completely ported to Windows now, and they are almost ready for a V1 release.
A few people had MPP/Shared nothing servers in their environments, but not many. A few definitions to start.
A core concept is shared nothing: resource and data independence are maintained in each DBMS instance. Google does this, breaking a problem into smaller parts, each handled on a separate node.
Here, each instance has its own dedicated resources (CPU, memory, disk) that are for its use only. You can then scale out continually, adding new resources with another box/instance.
MPP is a separate concept: massively parallel processing. Multiple concurrent resources are used to solve SQL queries, with each instance working in parallel on its own distribution of the data. PDW supports up to 10 parallel instances per data rack (HP appliance, 8 with Dell), limited to 4 racks (40 instances). The limit exists only because they can currently test 40 nodes; there is no architectural limitation at 40, and they will go larger.
Comparing this with SMP, there’s this slide:
A couple of interesting points. With SMP, HA has to be architected; with PDW’s MPP design, it’s built in. This should also scale out to more than 1PB.
Funny to see them listing this as relatively low cost. I guess at $1M+, relatively is the key word.
There’s also a note about limited SQL Server functionality. I had heard this earlier: no stored procedures, no text/image/binary data, no spatial, no XML, just character and numeric types and queries, but those are the critical ones. Partitioning is supported.
This is also somewhat based on the Fast Track program, where systems are basically configured for you to drop an application on. The workload matters more than the capacity, and it is hard for people to design systems. So the Fast Track and PDW systems are almost turnkey. If you need more SQL Server capabilities, they recommend Fast Track. They can also publish data for smaller marts to Fast Track architectures if you need that. I guess if you have a seven figure budget, what’s a few more DW servers?
Ultra Shared Nothing – Never heard of that, but this is an extension of shared nothing. What’s different?
- IO and CPU affinity within SMP nodes.
- Multiple physical instances of a table
- Distribution of large tables
- Smaller tables replicated
- Rows are redistributed on the fly when needed.
This means the system is balanced, with the right ratio of drives to CPU for maximum throughput. Each node is siloed, dedicated to its own chunk of the data, and each node is maxed out with the best CPU/RAM/disk possible.
This gives more predictable results on the system. As you add more nodes, it scales fairly linearly.
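To make the two table classes concrete, here’s a minimal DDL sketch. The table and column names are mine, and the WITH (DISTRIBUTION = ...) syntax is the form Microsoft has shown for PDW, so the shipping v1 syntax may differ:

```sql
-- Hypothetical star-schema tables; DISTRIBUTION syntax assumed from
-- what Microsoft has shown for PDW and may change before release.

-- Small dimension: a full copy lives on every compute node, so joins
-- against it never require data movement.
CREATE TABLE DimCustomer
(
    CustomerKey  INT          NOT NULL,
    CustomerName VARCHAR(100) NOT NULL,
    Region       VARCHAR(50)  NOT NULL
)
WITH (DISTRIBUTION = REPLICATE);

-- Large fact: rows are hashed on CustomerKey and spread across all
-- compute nodes (and the distributions within each node).
CREATE TABLE FactSales
(
    CustomerKey  INT           NOT NULL,
    OrderDateKey INT           NOT NULL,
    SalesAmount  DECIMAL(18,2) NOT NULL
)
WITH (DISTRIBUTION = HASH(CustomerKey));
```

Picking a distribution column with high cardinality and an even spread, ideally one you join or aggregate on, is what keeps one slow node from dragging down the whole appliance.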
This technology has been converted from DATAllegro to the Microsoft platform, locked down on Windows Server 2008 SP2 and SQL Server 2008 SP1 CU5. This is a software plus hardware solution, pre-tuned and optimized, with users seeing a single server. Specific architecture appliances come from HP, Dell, IBM, EMC, and Bull.
This is designed for DW workloads, sequential in nature, with read-only query access. It does do loads, and they have a fast loader, but the workloads should be DW in nature. They support normalized and star schemas. It’s built for large volumes, 10TB to 500TB.
It’s an R2 release because it integrates with the R2 products: PowerPivot and the SQL Server 2008 R2 versions of SSAS/SSRS/SSIS.
They are working with many third-party BI solutions (Informatica, MicroStrategy, etc.).
The client connection uses the DataDirect drivers, not the native SQL Server drivers, and this is likely what will ship. They will ship ADO.NET, OLE DB, and ODBC providers; DataDirect will provide JDBC drivers.
These drivers talk to the control node, which stores authorization info and metadata, provides workspace for final aggregations, and so on. It hosts the main PDW engine.
The control node talks to the underlying compute nodes over a dual 2GB/s InfiniBand bus, which is very fast. The compute nodes also run a service called DMS, which redistributes data among nodes when needed.
There are also storage nodes, basically SAN storage, but dedicated to each compute node, connected over a dual 4GB/s network that is private to the storage.
There is a passive, spare compute node for failover, so you get N+1 redundancy. The control node also has a spare, set up as a Windows cluster.
There is also a pair of management nodes that handles AD/DNS info, deals with setup and patching, and can leverage the Windows HPC technologies for provisioning.
There is a landing zone node, an ETL interface for loading data. It has an SSIS instance, a loader tool, and a file staging environment with a lot of disk. You can load from an external SSIS server directly if needed.
The backup node is there to do backups, with some tests showing 5TB/hour. It does full and differential backups.
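As a rough sketch, an appliance backup looks like a plain T-SQL statement pointed at a share on the backup node. The database name and UNC path here are invented, and I’m assuming the backup syntax Microsoft has described for PDW, which may not match the v1 build exactly:

```sql
-- Full backup of a (hypothetical) PDW database to a share on the
-- backup node; database name and path are made up.
BACKUP DATABASE SalesDW
    TO DISK = '\\BackupNode\Backups\SalesDW_Full';

-- Differential backup, capturing only changes since the last full.
BACKUP DATABASE SalesDW
    TO DISK = '\\BackupNode\Backups\SalesDW_Diff'
    WITH DIFFERENTIAL;
```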
This system is delivered to you in multiple racks. All the control nodes (control, management, backup, ETL) are in one rack, and then you have data racks with the compute and storage nodes. To expand beyond the 8 or 10 compute nodes, you add more data racks.
This is a major system, and not for most people, but for those that need it, it sounds very cool.
Initial reference architectures from HP are DL360 G6s with 8 hyperthreaded cores, 72GB RAM, and six 300GB disks. Or you can go with the Dell R610: 8 hyperthreaded cores, 96GB RAM, and four 300GB disks. Those are some hefty nodes, and you get 8 or 10 of them!
Depending on storage, you can get up to 100TB in a rack, so potentially a 400TB capacity with 4 racks. You can choose performance (smaller, faster disks), or capacity (larger disks). These are basically going with 450GB or 1TB disks.
There are DMVs provided, mostly the same as SQL Server’s, with more information added where needed. There will be a SCOM pack released after v1 comes out, and there are web-based admin tools as well. Custom monitoring tools can pull information from the DMVs.
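For example, a monitoring tool could poll the appliance-wide DMVs the same way you would on a normal instance. The PDW-specific names below are the ones Microsoft has documented, so treat them as an assumption for the v1 preview:

```sql
-- Requests currently running anywhere on the appliance
-- (PDW-specific DMV; column list trimmed).
SELECT request_id, status, submit_time, total_elapsed_time, command
FROM   sys.dm_pdw_exec_requests
WHERE  status NOT IN ('Completed', 'Failed', 'Cancelled');

-- The nodes that make up the appliance (control, compute, etc.).
SELECT pdw_node_id, type, name
FROM   sys.dm_pdw_nodes;
```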
All components are redundant (disk, network, power, storage processors), and Windows Failover Clustering is used. Each rack is a separate cluster, and the management nodes use AD failover technology.
Under the Hood
Under the hood, you decide how to lay the data out. There are two classes of tables: big and small. Smaller tables are replicated on every compute node. Large tables get distributed across the appliance: you choose a column that lays the data out evenly across all the nodes, because the architecture is only as fast as the slowest node. The column you pick gets hashed, and the hash determines which node each row lands on.
Within each node, the data is broken up further. They get some affinity with soft-NUMA and Resource Governor to balance the node, and they actually have separate ports for each core in the processor. With the 8 cores, they separate the data into 8 physical tables on the node, each one aligned with a core.
The data sits on RAID 1 pairs, with 8 LUNs corresponding to the cores. This is different from the Fast Track configuration, where that separation is managed by the DBA; here it’s done automatically. They also ensure that replicated tables are striped across the LUNs. tempdb is on dedicated LUNs, and transaction logs are on separate LUNs as well.
The query processing flow was shown next. As you might expect, it works like this:
- Connect with SQL auth. No Windows auth supported right now.
- The query is parsed.
- Validate the user has rights
- Build an MPP plan next. This is the big deal. They have to build a plan that optimally reads sequentially from the disks with minimal data movement.
- The plan is executed
- Data is returned (either through the control node or not) and streamed to the client.
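To see why the MPP plan step is the big deal, here’s the kind of query this layout favors, reusing the hypothetical tables from the DDL sketch earlier. Because FactSales is hash-distributed on CustomerKey and DimCustomer is replicated, each compute node can do its share of the join and aggregation locally, and only small partial results flow back to the control node for the final aggregation:

```sql
-- Each compute node joins its local FactSales rows to its local copy
-- of DimCustomer and pre-aggregates; DMS never has to shuffle fact
-- rows because the join key is the distribution column and the
-- dimension is replicated on every node.
SELECT   c.Region,
         SUM(f.SalesAmount) AS TotalSales
FROM     FactSales   AS f
JOIN     DimCustomer AS c
         ON c.CustomerKey = f.CustomerKey
GROUP BY c.Region;
```

A join or GROUP BY on something other than the distribution column would force DMS to redistribute rows over InfiniBand before the nodes could finish, which is exactly the data movement the plan builder tries to minimize.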
There were some interesting questions, and lots of places where Matt mentioned they would do something in v2 to fix, extend, or optimize things. That makes me nervous, but I had heard great things about DATAllegro before.
This is a good session to examine if you want to see how things work in detail. There are some good notes on the flow that hopefully make it into the recording.
Loads also take place in parallel. Data might be moved twice, but each compute node can process, convert, and hash rows, then shoot them across InfiniBand to the correct node.
What are technology preview customers saying? One that does credit card processing is seeing better-than-linear performance under concurrency, with virtually no degradation as concurrent loads and queries are added together.
There is a DW comparison with a 24-core SMP server. PDW is 18x faster on average, with a range of 2x to 166x; multi-hour queries ran in seconds. Note that these queries were run with only a clustered index on each table, no nonclustered indexes.
Very cool technology. Can you use it? That’s literally the million dollar question.