Data growth is related to Moore’s law. As computers get faster and more powerful, we are using them to process more data in more complex applications. Traditional sources such as ERP systems are processing more transactions as our organizations grow. Newer sources such as web browsing activity, mobile devices, and social media are creating data sets that are orders of magnitude larger than what we’ve dealt with in the past.
We call these large scale data sets “big data.” Large organizations have always had to deal with big data, but the boundaries that define big data have expanded with the raw power of computers. The quantity of data you have to deal with only becomes a problem when it becomes difficult to work within your existing systems environment. For some organizations, this could be as small as 100s of gigabytes. For larger, more experienced organizations, the cross-over point is more in the tens of terabytes.
The term big data is a major buzzword in IT today precisely because big data holds the potential for big value. However, your ability to extract value from large data sets in your transaction systems and social media interactions is predicated on your ability to actually manage the data. Companies such as Google and Yahoo! have helped pioneer new technologies such as MapReduce and Hadoop to help deal with the massive quantities of unstructured data they collect. At the same time, mainstream technologies for dealing with more structured big data sets, such as massively parallel processing (MPP) systems and column store databases, are experiencing significant growth.
The common big data challenge in most organizations is to figure out how to make the large-scale volume of transactions in their corporate systems available and usable in an analytic environment. Dealing with newer sources of data, such as social networking, is important, but it is not the number-one problem in most companies today.
In this section we begin with a brief exploration of the strengths and limits of single-server systems when it comes to big data. We then dig into the MPP approach and explore the Parallel Data Warehouse system architecture.
Most of the “servers” you work with are single servers with shared resources. Each CPU core can work with any section of memory or disk, and all memory and disk is available to each core. This all-in-one architecture is known as symmetric multi processing (SMP). As shown in Figure 6, the CPU cores connect to the memory and disk via a system bus. This shared connection supports high speed inter-process communication, memory, and data sharing, and is easier to manage because it is a single physical unit.
Figure 6: SMP architecture
However, SMP systems have a limit when it comes to big data; they can scale only to the point where their system bus gets overloaded. Too many CPUs making simultaneous requests for data on the system bus creates a traffic jam. As usage grows, the system bus becomes a bottleneck and limits the total amount of processing that can take place on an SMP system. There are ways to mitigate this contention by creating localized subsets of CPUs and memory, but this only extends the limit.
One time-tested strategy for handling large amounts of data is to avoid this bottleneck by distributing data and processing across many servers, or nodes, each of which has its own memory and disk so they can share the workload. This approach, known as massively parallel processing (MPP), has been around for several decades and is the basis for many of the largest super computers in existence today. Due to their high cost and complexity, MPP systems have historically been used by the largest companies and governmental organizations.
This massively parallel architecture lies at the heart of Microsoft’s Parallel Data Warehouse system. Parallel Data Warehouse is a Microsoft SQL Server product designed to scale data warehouses from tens to hundreds of terabytes of data. It delivers the MPP architecture using an “appliance” model, providing preconfigured, optimized commodity hardware and software and a single point of support.
Figure 7 shows an abbreviated architecture for a SQL Server PDW MPP system. A user query request would come into the control node, which breaks the SQL into multiple parallel operations and distributes them out to the compute nodes where the actual data resides. A special module called Data Movement Services coordinates any needed data movement among nodes taking place between and handles any functions that need to be resolved centrally. When the compute nodes are finished, the control node handles post-processing and re-integration of results sets for delivery back to the users.
Figure 7: The SQL Server PDW massively parallel architecture
Each compute node is a separate SMP server running SQL Server. Compute nodes in current SQL Server PDW configurations ship with dual, hex-core CPUs, 96 GB of memory, and local tempdb workspace. They are connected together using dual InfiniBand network to support high-speed node-to-node data sharing for cross-node computations known as data shuffling. This network also connects the compute nodes to the control and administrative nodes to support high-speed data loading, extraction of query results, backup, and other administrative functions.
The disk subsystems for the compute nodes are managed by a storage area network (SAN) component with high-speed dual Fibre Channel connectivity. This data bus supports high-speed I/O, and failover redundancy. The compute nodes and disk drives are physically housed in the same rack, called a data rack.
There are three types of administrative service nodes that share the control rack with the control node. These include:
- Management Nodes, which provide the DBA or data center operations interface to access and manage the overall solution and support the system’s internal network.
- A Landing Zone Node, where cleansed data is staged and prepped before loading into the data warehouse.
- A Backup Node and the appropriate associated storage. The Backup Node provides high-speed integrated backup at the database level. This is tied to the organization’s overall backup strategy and systems.
The SQL Server PDW is a large-scale enterprise class system and has built-in redundancies:
- Primary data is stored as RAID1.
- Hardware redundancy includes redundant power supplies, spare disks, compute nodes, control nodes, and management servers, mostly designed to support automatic failover.
SQL Server Parallel Data Warehouse is sold as a data warehouse appliance: a set of commodity hardware and Microsoft software pre-configured to meet the needs of a range of data sizes and performance. This makes sense because configuring the individual components, network and connectivity throughput, and disk subsystem performance is a significant effort, more than most IT shops would care to take on. With the appliance, all components and network connectivity are carefully designed, configured, and balanced for optimal performance, and necessary software on all nodes is pre-installed and pre-configured.
The MPP architecture can be scaled up by adding racks of compute nodes. The base system starts with one rack. On an HP appliance, for example, a full rack holds 10 nodes, and additional 10-node racks can be added up to a total of 40 nodes. The 40-node limit is more due to the definition of the product and not an inherent limit of the system design. SQL Server PDW uses its backup and restore facility to make expanding a SQL Server PDW is fairly straight forward: back up the database, add the new rack, reconfigure, and restore. The database restoration automatically redistributes the data across all nodes.
Microsoft is working with several hardware vendors to offer SQL Server PDW systems. HP is the first to market with a publicly available product at this writing.
The physical architecture of distributed nodes with local data means the large data sets have to be distributed across the nodes in a way that will support both data load and query processes. The goal is to get each node and CPU core working as hard as possible on every query. In the data warehouse, fact tables are distributed evenly across nodes so each node will have work to do.
Efficient processing on nodes results when local fact table subsets can join to local dimensions tables, which can be achieved if dimension tables are replicated to all nodes. SQL Server PDW allows you to specify distributed or replicated tables at time of creation, and then transparently manages placing the appropriate data on the appropriate compute nodes at load time.
How does SQL Server PDW fit with the Kimball approach? When you compare it with our principles, it fits quite well. It provides good usability and flexibility because in most cases, you can build a set of atomic-level dimensional models with conformed dimensions. It performs well because the workload is distributed across all the compute nodes rather than bottlenecked on a single server. SQL Server PDW gets an additional performance boost at the node level because SQL Server has functions to support dimensional models, including star-join optimization. And SQL Server PDW’s support for replicated dimension tables allows many common query scenarios to be satisfied without more expensive data-shuffling operations.