In Part I we looked at the definition of Big Data and how it has evolved as a phenomenon. We also learned why Hadoop has become the de facto non-relational Big Data engine, along with its major selling points and limitations.
In this installment I will continue the review of Big Data technologies by first looking at Massively Parallel Processing (MPP) and how relational Big Data technologies employ it in their engines to address Big Data challenges. We will then look at a hybrid approach, in which adaptors allow the relational and non-relational sets of Big Data technologies to converge, using SQL Server as an example.
We will also look at trends, such as how MPP technology is currently being employed natively in Hadoop and its implications for the Hadoop ecosystem. Finally, we will explore some cloud-based Big Data solutions and an emerging data management approach referred to as Logical Data Warehousing (LDW).
MPP technology has been around for decades and has been employed to some extent in some database systems and supercomputing efforts. MPP systems have the management capabilities to divide jobs and data across many disks, each managed by its own CPU, spread across one or more servers. Typically, MPP systems coordinate the work of the various processors by having them communicate through some form of messaging interface.
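The divide-and-coordinate idea described above can be illustrated with a minimal sketch. It is not how any particular MPP product is built: threads stand in for separate servers, the function names are made up for illustration, and the list of partial results plays the role of the messages sent back to a coordinator.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor


def partition_rows(rows, num_nodes):
    """Distribute rows across nodes by hashing the key (hash distribution)."""
    partitions = [[] for _ in range(num_nodes)]
    for key, value in rows:
        partitions[hash(key) % num_nodes].append((key, value))
    return partitions


def local_sum(partition):
    """Work done independently on one node: aggregate only its own slice."""
    totals = defaultdict(int)
    for key, value in partition:
        totals[key] += value
    return dict(totals)


def parallel_sum(rows, num_nodes=4):
    """Scatter rows to nodes, aggregate locally, then merge the partial results."""
    partitions = partition_rows(rows, num_nodes)
    with ThreadPoolExecutor(max_workers=num_nodes) as pool:
        partials = list(pool.map(local_sum, partitions))
    merged = defaultdict(int)
    for partial in partials:  # partial results returned to the coordinator
        for key, total in partial.items():
            merged[key] += total
    return dict(merged)


# Hypothetical sample data: (region, sale amount)
sales = [("east", 10), ("west", 5), ("east", 7), ("north", 3), ("west", 1)]
print(parallel_sum(sales))
```

Because rows are distributed by key, every row for a given key lands on the same node, so each node can aggregate without talking to the others and the final merge stays cheap.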
Big Data Warehouse Appliances
Database systems that used MPP technology often came with tailored hardware and thus were referred to as Appliances. Earlier Appliances were primarily pre-optimized as OLAP-oriented Enterprise Data Warehouse solutions, so they were sometimes referred to as Data Warehouse Appliances. The latest generation of Appliances is being designed with Big Data solutions in mind and may be referred to as Big Data Warehouse Appliances, but for an Appliance to qualify as a Big Data engine, it must first satisfy the two major Big Data technological challenges:
- The ability to scale to storing petabytes of data.
- The ability to run large-scale parallel computations.
In the rest of this discussion, the word Appliance will mean a Big Data Warehouse Appliance, i.e. an Appliance that qualifies as a Big Data engine.
Big Data Warehouse Appliances are a combination of hardware and software designed with the ability to scale to data volumes in the multi-petabyte range. Unlike the non-relational Big Data Engines, Appliances come as an integrated set of servers, storage, operating system(s), DBMS and software specifically pre-installed and pre-optimized for Enterprise and Big Data warehousing.
The major selling points that make Appliances attractive are:
- They are relatively easy to deploy and use.
- They can be scaled out by simply purchasing additional plug-and-play components.
- They are SQL-based and employ existing relational model designs.
- They are high-performing engines suited for low-latency analytical processing.
Evolution of Appliances
Almost all of the major first-generation MPP product vendors were pure-play companies. With the exception of Teradata and ParAccel, most of them have been absorbed by, or have become subsidiaries of, a mega BI vendor. Some of the notable purchases were IBM’s acquisition of Netezza, EMC’s acquisition of Greenplum and HP’s acquisition of Vertica. Microsoft’s acquisition of DATAllegro also resulted in an MPP version of SQL Server, called the Parallel Data Warehouse Edition (PDW).
The good news is that the assimilation helped to bring Appliances to the fore. Generally, Appliances are now trending towards less expensive, high-performance, scalable virtualized data warehouse implementations that use regular hardware and, in some cases, open source software. The smaller Appliance vendors try to differentiate themselves by providing specific functionality, such as extreme performance or in-memory analytics.
Before we look at SQL Server Parallel Data Warehouse (PDW), let’s try to understand why Appliances have become more popular in recent times.
The “Hybrid” Big Data Solution
Up to this point we’ve looked at two solutions mounting the attack against Big Data, namely non-relational (Hadoop) and relational (MPP) engines. It should be apparent by now that each of these two technologies has, by itself, some limitations in addressing the Big Data challenges that were outlined earlier in this discussion. Let’s recap why:
- The non-relational engine (Hadoop) is able to handle unstructured data and large batch processing, but it is limited in performance when it comes to the low-latency analytical processing required for some near real-time predictive and granular analytics, such as customer segmentation and market basket analysis.
- MPPs are capable of such low-latency analytical processing and, boosted by in-memory technology, in many cases allow jobs to run about 200x faster than on previous data warehousing systems. However, they lack the capability to process large volumes of unstructured data the way Hadoop can on its HDFS clusters with MapReduce jobs.
However, because the major strengths of the two systems together seem to address the entire Big Data dilemma thoroughly, the immediate logical solution has been to build connectors and adapters to bridge these two technologies, and that is what Appliance vendors are mostly doing. The use of the two technologies together is what is referred to here as the hybrid approach. Even though there are no standard open connectivity protocols, vendors are building connectors for Hadoop either through collaborations or through their own efforts. They are also making these products more appealing by providing SQL-like environments for running MapReduce jobs in Hadoop directly from their Appliances, without the need for developers to learn MapReduce.
In most use cases, early adopters of the hybrid solution archive large amounts of existing and extracted data at low cost on Hadoop and load the data needed for analytics into the Appliance via the vendor’s proprietary connectors and adapters. They are then able to selectively push data analytics to the platform purpose-built for analytics or keep it on Hadoop.
SQL Server Big Data
The SQL Server Parallel Data Warehouse (PDW) Appliance comes as scale-out, pre-built hardware from HP and Dell, with an operating system, storage, a database management system (DBMS) and software. PDW is equipped with a system called PolyBase, which offers organizations the hybrid Big Data solution discussed above.
PolyBase is the data processing system that enables open connectivity between PDW and Hadoop, allowing integrated queries across Hadoop and relational data. It introduces the concept of external tables, where table schemas are metadata that reside in the context of a SQL Server database and are applied, when needed, to the actual table data that resides in HDFS.
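The external table idea can be pictured with a small sketch. It does not use PolyBase or any HDFS client; the file contents, column names and helper function are all hypothetical. It only shows the principle that the schema lives apart from the data and is applied when the data is read.

```python
import csv
import io

# Hypothetical raw file contents as they might sit in HDFS: no schema attached.
RAW_HDFS_FILE = "1,alice,34.50\n2,bob,12.00\n3,carol,99.99\n"

# The "external table" definition: metadata only, kept apart from the data.
EXTERNAL_TABLE_SCHEMA = [("order_id", int), ("customer", str), ("amount", float)]


def read_external_table(raw_text, schema):
    """Apply the schema to raw delimited text at read time and yield typed rows."""
    for fields in csv.reader(io.StringIO(raw_text)):
        yield {name: cast(value) for (name, cast), value in zip(schema, fields)}


rows = list(read_external_table(RAW_HDFS_FILE, EXTERNAL_TABLE_SCHEMA))
total = sum(row["amount"] for row in rows)
print(rows[0], round(total, 2))
```

Changing the schema list changes how the same bytes are interpreted, without rewriting or reloading the underlying file.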
The Data Movement Service (DMS), a component of PDW, can be used to make parallel reads and imports of data from the Hadoop Distributed File System (HDFS), as well as to export PDW query results to HDFS files. The Hadoop Bridge in PolyBase is a managed interface that enables DMS to communicate directly between HDFS data nodes and PDW compute nodes. With DMS, end users can perform high-speed analysis on unstructured data without having to learn MapReduce. Apart from performing simple SELECT queries, users can also perform operations such as JOINs, GROUP BYs and more on millions of rows stored in a Hadoop cluster. In addition, users don’t have to depend on other processes to pre-load data into the warehouse first. PolyBase is said to be capable of fully leveraging the larger compute and I/O power of the Hadoop cluster by moving work to Hadoop for processing, even for queries that only reference PDW-resident data.
With PolyBase, data needed for the lowest-latency analytics can be loaded in memory for high performance. Data that is accessed with a moderate degree of frequency can be stored in a data warehouse, and data that is accessed infrequently can be processed or stored on inexpensive Hadoop clusters.
Microsoft plans other computational capabilities and expansion for PolyBase. You can read more about PDW and PolyBase here.
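As a rough illustration of this kind of tiering, the sketch below assigns datasets to a storage tier by how often they are read. The thresholds, dataset names and function are invented for the example and do not come from PDW or PolyBase.

```python
def choose_tier(reads_per_day, hot_threshold=1000, warm_threshold=10):
    """Pick a storage tier from access frequency; thresholds are illustrative only."""
    if reads_per_day >= hot_threshold:
        return "in-memory"
    if reads_per_day >= warm_threshold:
        return "data warehouse"
    return "hadoop"


# Hypothetical datasets with their average reads per day.
datasets = {"clickstream_today": 5000, "monthly_sales": 50, "archive_2009": 1}
placement = {name: choose_tier(reads) for name, reads in datasets.items()}
print(placement)
```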
Hadoop Native MPPs (Real-time Query in Hadoop)
Even though Appliances are making real-time analytics on Hadoop data possible, the Hadoop ecosystem itself is evolving very quickly from batch processing to providing real-time query capabilities without the need for other systems. Because of some of these developments, it seems the most beneficial uses of Hadoop are yet to come.
One major approach that we are going to look at is the use of MPP technology natively in Hadoop. In this approach, MPP technologies are implemented directly in the Hadoop ecosystem from the ground up. What this means is that, when these systems are up and running, Hadoop is capable of large-scale data processing via MapReduce and also of real-time interactive queries on the same system, using the same data and metadata. This approach therefore eliminates the need for Appliances in the hybrid approach we discussed above. Top of the list of vendors leading this approach is Cloudera.
Cloudera claims that its open source Massively Parallel Processing (MPP) query engine, Impala, runs natively on Apache Hadoop, bringing scalable MPP database technology directly to Hadoop. Impala enables users to issue low-latency SQL queries against data stored in HDFS and Apache HBase without requiring data movement. Whereas a MapReduce job might take minutes or hours to complete, an Impala MPP-based query might return in milliseconds, allowing internal or external users to query HDFS or HBase in real time. Because Impala is integrated from the ground up as part of the Hadoop ecosystem, it is able to leverage the same flexible file and data formats, metadata, security and resource management frameworks used by MapReduce, Apache Hive, Apache Pig and other components of the Hadoop stack. Even though there are no well-documented use cases (probably because it was only recently released into production), many of the well-known BI vendors like Pentaho, QlikView, MicroStrategy and Tableau are known to be currently employing Impala.
Infrastructure as a Service (IaaS) covers the various infrastructure services provided in the cloud, but the term Big Data IaaS as used here refers to services that provide hardware, servers and networking components with petabyte-scalable storage and compute in the cloud. Big Data IaaS offerings are flexible, on-demand services that typically allow clients to pay on a per-use basis, with some allowing them to rent computing power by the minute. In this approach, clients are able to access a large amount of computing power without having to pre-commit by buying and managing any Big Data infrastructure. This solution seems to work for many clients because, besides the other hassle-free benefits, the vendors providing the services keep up with the best Big Data technologies available and make them immediately available to their clients. The leading vendors and services in this domain are Amazon Web Services (AWS) and Microsoft Windows Azure.
Amazon Web Services (AWS)
Amazon is recognized as the leading vendor in this space and provides the most comprehensive set of Big Data solutions in the cloud. AWS’s Elastic MapReduce service provides a managed, easy-to-use analytics platform built around the Hadoop framework. It is integrated with their name-your-price supercomputing system, known as the Spot Market, which enables clients to choose their own price for the computing resources they need, up to thousands of instances. For instance, their Elastic Compute Cloud (EC2) web service offers virtual machines and disk space, which can be allocated and deallocated quickly. AWS also offers on-demand access to terabytes of solid state storage through the High I/O instance class, which is accessible to other NoSQL data stores such as Cassandra or MongoDB, in addition to DynamoDB, their proprietary NoSQL data store that uses this service by default.
Microsoft Windows Azure HDInsight
Microsoft provides Big Data IaaS through Windows Azure HDInsight, a Windows service that provides Windows Azure with a 100% Apache-compatible Hadoop distribution. It must be noted that HDInsight enables Big Data solutions to run on-premises on Windows Server or Linux.
It must also be noted that HDInsight can be used with SQL Server to import data from Hadoop using Sqoop, but this is not to be confused with PDW or the hybrid solutions above, which qualify as Big Data solutions in themselves.
HDInsight, however, offers a lot as a Windows Hadoop framework. For instance, it provides Hive and other abstractions for running MapReduce jobs. All other functionality provided natively by Hadoop is made available here, maintaining compatibility with existing Hadoop tools such as Pig, Hive, and Java. For instance, you can submit MapReduce jobs as a JAR file directly in that ecosystem, a task you can’t yet accomplish through PDW PolyBase.
Logical Data Warehouse and Data Virtualization
Closely related to some of the concepts in the current hybrid and cloud-based Big Data solutions is an emerging data management approach that has come to be known as the Logical Data Warehouse (LDW), first noted by Gartner. Classified as data virtualization in some circles, it is the idea of providing unified data services over Big Data, traditional data warehouses and other available distributed technologies. In this approach, the idea is that not all data needs to be physically moved, and any data repository can be part of the LDW as long as a transparent logical layer is defined for it. This completely "schema on read" approach does not require pre-defined data models implemented as tables.
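A toy sketch may help make the logical layer concrete. The class, method names and sample repositories below are hypothetical and far simpler than any real data virtualization product; the point is only that sources are registered rather than copied, and are read on demand.

```python
class LogicalDataWarehouse:
    """A toy logical layer: data stays in place; only readers are registered."""

    def __init__(self):
        self._sources = {}

    def register(self, name, reader):
        """Register a callable that returns dict rows from some repository."""
        self._sources[name] = reader

    def query_all(self, predicate=lambda row: True):
        """Federated read across every registered source, tagging row origin."""
        results = []
        for name, reader in self._sources.items():
            for row in reader():
                if predicate(row):
                    results.append({**row, "_source": name})
        return results


# Two hypothetical repositories: a relational warehouse and a Hadoop-style store.
def warehouse_rows():
    return [{"customer": "alice", "amount": 120}, {"customer": "bob", "amount": 40}]


def hadoop_rows():
    return [{"customer": "alice", "amount": 15}, {"customer": "carol", "amount": 75}]


ldw = LogicalDataWarehouse()
ldw.register("warehouse", warehouse_rows)
ldw.register("hadoop", hadoop_rows)
alice = ldw.query_all(lambda row: row["customer"] == "alice")
print(alice)
```

Adding a new repository here means registering one more reader; nothing already in place has to be moved or remodeled.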
Besides treating all these federated physical data assets as a unified virtual DW, LDWs should eventually also be able to maximize throughput through load balancing across multiple workloads, whether the data integrated logically is relational, non-relational, NoSQL, structured, semi-structured or unstructured.
Currently, two LDW approaches have been noted. In the first approach, vendors like Composite Software and Palantir Technologies provide specialized data virtualization software and services for such purposes. In the second approach, database management system vendors like Teradata, IBM, Oracle and others have introduced external table capabilities (e.g. as employed by PolyBase in PDW). An external table system is a form of federation framework that lets them access data that is outside their databases and incorporate it into analyses by defining logical schemas over the data.
Both approaches are known to be gaining ground. In particular, vendors using the first approach have some successful use cases in governmental agencies and the financial sector. Cisco’s recent acquisition of Composite Software may suggest a lot of potential for this approach, if Cisco’s savviness in acquisitions and distributed technologies is anything to go by.
Even though LDWs and some of these new data management trends are in their infancy, emerging technologies often start new debates. One that I came across, and that seemed compelling and worth discussing, is the future of ETL. The question is, will ETL be necessary when these technologies are full-fledged?
Is the End Near for ETL as We Know It?
LDW and the new trends in database management have started a debate about Extraction, Transformation, Load (ETL), with proponents contending that the end is indeed near for ETL. At first thought it is easy to dismiss them, but if you understand the essence of ETL you may begin to ponder the question more.
ETL is a three-stage sequential process, E -> T -> L, with the transformation occurring in the middle, but there is more to it. The idea is to increase performance and reduce overall processing time by performing the transformation not just in the middle, but in the pipeline between the extraction (source) and loading (destination) points. What this means is that all three processes of extraction, transformation and loading can be performed in that order, but if the transformation does not take advantage of the pipeline between the source and the destination, then the process may not be considered a true ETL process.
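The notion of transforming "in the pipeline" can be sketched with Python generators, where each row is cleaned while in flight between source and destination rather than staged in full first. The function names and sample rows are hypothetical and stand in for a real ETL tool.

```python
def extract(source_rows):
    """Extraction: stream rows out of the source one at a time."""
    for row in source_rows:
        yield row


def transform(rows):
    """Transformation in the pipeline: clean each row while it is in flight."""
    for name, amount in rows:
        if amount is None:  # drop incomplete records
            continue
        yield (name.strip().title(), round(float(amount), 2))


def load(rows, destination):
    """Loading: append the already-transformed rows to the destination."""
    for row in rows:
        destination.append(row)
    return destination


# Hypothetical source rows: (customer name, amount as text or None).
source = [(" alice ", "10.5"), ("BOB", None), ("carol", "3")]
warehouse = load(transform(extract(source)), [])
print(warehouse)
```

Because the stages are chained generators, no intermediate copy of the full dataset is built; each row passes through extraction, transformation and loading before the next row is read.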
As discussed above, in full-fledged LDW systems data will not be moved physically to other repositories, so the data transformation activities that are common in ETL processes may potentially occur at the source. This is what is leading to the argument that the push for ETL tools may die out. LDWs are currently in their infancy, so this is something to ponder. What are your thoughts?
Even though many companies acknowledge the need to leverage their Big Data as a valuable and strategic asset in order to become competitive, the major roadblock has been the strategic decision of choosing systems with the right scalability, performance and business-grade reliability.
We have seen that the Big Data solutions out there are not one-size-fits-all. The technologies are also evolving very quickly, making it a quagmire for most of those charged with the responsibility of Big Data implementations, especially on the heels of well-publicized initial Hadoop adoption challenges. While many have cold feet, others have jumped in, taking the hassle out of this dilemma by paying for everything Big Data in the cloud and letting the service provider worry about infrastructure and the evolving technologies and solutions.
Even though new approaches are emerging, Appliances are still in play since they are currently the go-to for enterprise data warehousing of structured data supporting reporting and OLAP needs. There is, however, a general consensus that Hadoop is the choice for the efficient and cost-effective storage, analysis, and interpretation of massive amounts of unstructured data. Despite Hadoop’s initial limitations, its ecosystem has evolved to provide access protocols and other packages that cover most of them. Experts believe the initial hesitation in adoption is about to change, making way for the next phase of Hadoop adoption. This, many believe, is going to be made possible by various technologies and setups, the most anticipated one being the idea of building MPP technology directly into Hadoop. For now it is clear that the Hadoop ecosystem is going to be around for some time. It is also clear that this ecosystem is going to be the realm of most advanced analytics driven by collective intelligence. The earlier you get on board the better, either through the cloud or on-premises infrastructure.
The evolution of how data will eventually be managed continues, leaning more towards "schema on read" and virtualization approaches like the LDWs discussed above, and it remains to be seen how they pan out. For now there is no excuse for firms not to start looking at the best of these approaches for leveraging their Big Data.