Of course, your architecture depends on your business requirements along with technical, historical, and political factors. While business requirements are usually business-specific, we can start with a list of broad requirements, or architecture goals, that most organizations would support:
1. Excellent query performance for users
2. High ease of use
3. Flexibility
4. Enterprise-wide usage and value
5. Good maintainability
From these broad goals, we can derive the primary components of an enterprise data warehouse architecture as follows:
Excellent query performance is a given requirement. What counts as excellent depends on your user expectations; obviously response times in the seconds are desirable when possible, though in some cases, response times in the minutes or even hours may be considered excellent, given the data volumes and query complexities. Just to be clear, excellent performance should be accomplished in a fashion that is entirely transparent to the user. There should be no need for the user to learn which aggregates to use, or which data mart has the required data, or how to add an optimizer hint.
Ease of use from a business user perspective is ultimately determined by the BI tools that front-end the data warehouse database. Ease of use from a developer's perspective is driven mostly by the physical database model, because developers, or any users writing reports and applications that directly access the data warehouse database, generally end up writing their access code in SQL. In both cases, the DW/BI industry generally agrees that the dimensional model is the easiest to work with.
It is much easier to present a dimensional model in the BI tool metadata layer if the underlying atomic data model is already dimensional.
The first determinant of flexibility stems from the grain of the fact tables you create. If your fact tables are captured at the lowest level of detail available, known as the atomic level, you can always aggregate the data up to any attribute of any dimension. Therefore, your DW/BI system must capture the atomic detail to provide maximum flexibility.
Conformed dimensions are also a contributor to flexibility. They allow users to query data from separate business processes such as sales and inventory, and correctly combine the results on shared dimension attributes, such as product or region. In effect, they allow users to compare apples to apples across the enterprise.
Note that a correctly defined dimensional model has the exact same flexibility from an analytic query perspective as a normalized model; they are relationally equivalent. A query to sum up Sales by Region will give the exact same answer in either model.
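The roll-up flexibility of an atomic-grain fact table can be sketched with a toy example. This is a minimal sketch with a hypothetical schema (`fact_sales`, `dim_store`, and their columns are illustrative, not from the source), using SQLite as a stand-in for the warehouse database:

```python
import sqlite3

# Hypothetical atomic sales fact: one row per product, store, and day.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_store (store_key INTEGER PRIMARY KEY, region TEXT);
CREATE TABLE fact_sales (store_key INTEGER, product_key INTEGER,
                         date_key TEXT, sales_amount REAL);
INSERT INTO dim_store VALUES (1, 'East'), (2, 'West');
INSERT INTO fact_sales VALUES (1, 10, '2011-01-01', 100.0),
                              (1, 11, '2011-01-02', 50.0),
                              (2, 10, '2011-01-01', 75.0);
""")

# Because the fact rows are captured at the atomic grain, they can be
# aggregated up to any attribute of any dimension -- here, by region.
rows = con.execute("""
    SELECT s.region, SUM(f.sales_amount)
    FROM fact_sales f JOIN dim_store s ON f.store_key = s.store_key
    GROUP BY s.region ORDER BY s.region
""").fetchall()
print(rows)  # [('East', 150.0), ('West', 75.0)]
```

The same atomic rows could just as easily be grouped by product, date, or any other dimension attribute, which is the point: no summary level is baked in.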
A true enterprise information resource has three main components: all data is available to all users, data is aligned across disparate business processes, and there is one analytic system of record for each data element. Let’s examine each of these statements in turn.
All data must be available to any analyst who might need it because all data is useful to everyone in the organization. The analyst in Logistics needs to know sales by geography and distribution center. The analyst in Sales needs to know sales by customer and region. The analyst in Marketing needs to know sales by product. These are all different summary queries on the same atomic sales fact data. Do not let these departmental differences lead you to think departmental data marts might be a good solution. As soon as you limit Marketing to product summaries, they will insist on customer detail to support a customer segmentation analysis. Every analyst ends up needing access to all the data at the atomic level at some point.
Data must be aligned across business processes because that’s what allows users to combine data from multiple, disparate sources across the enterprise in a fast and correct manner. The data in the enterprise data warehouse needs to be integrated via the enterprise set of conformed dimensions identified in the bus matrix. Conformed dimensions and the work that goes into creating and maintaining them is a major component of this enterprise resource; conformed dimensions are the struts that hold the enterprise data framework together.
A single source reduces the confusion and wasted time that results from having multiple data marts with overlapping data content. Having a single analytic system of record may involve multiple physical copies for performance reasons, but this is a compromise. If multiple copies of the data are needed, these copies must be built from a single, central data warehouse database. While there may be transformations for analytic reasons, if the same tables and attributes exist in multiple places, they must be presented with the same names and definitions to avoid error and confusion.
The large-scale server power of a product like SQL Server PDW allows you to provide a true enterprise information resource: a single version of the truth without the extra time, resources, and maintenance required to copy data out to multiple data marts.
The simpler your architecture, the easier it will be to operate and maintain. A single, high-performance database with atomic-level detail and fast summarization based on a dimensional model is the simplest way to meet the broad enterprise requirements.
You may need to adjust your ideal DW/BI system architecture if it cannot meet the enterprise requirements. Performance is the most common area requiring compromise; if the ideal architecture is not working, it’s not so ideal. Before throwing in the towel on performance, it is important to make sure your ideal architecture is properly tuned. If it still doesn’t work, the most common compromise is data distribution.
Performance is platform-dependent. Indexes and aggregates are the two standard performance tools in the DW/BI system, and these vary widely across database product and platform. For example, in an SMP environment, it can take a long time to run a query that asks for total sales for the last five years. In this case, it makes sense to create aggregated tables once during the ETL process that can be used over and over to answer summary level queries. (Note that these aggregate tables need to be transparent to the user to maintain ease of use.)
However, the MPP environment offers a third performance tool: parallel processing. Distributing query tasks across multiple nodes may allow summary level queries to be answered on the fly. This greatly simplifies the design, tuning, and maintenance of indexes and aggregates. You can generally rely on the brute force power of the underlying parallel processing architecture for excellent query performance in the MPP environment.
If performance tuning or parallel processing isn’t enough, you may have to create separate subsets of the data warehouse and host them on downstream servers. These data marts may be departmental in focus; data is often limited to a few subject areas and summarized. (If the data marts contained atomic-level data from all business processes, you would be back where you started with the enterprise data warehouse.) From a performance point of view, the idea is to offload a subset of users and queries to a dedicated platform. This is a crude form of distributed processing, and is probably less effective than simply adding another rack to the SQL Server PDW machine where it could be used by the entire organization when needed.
There are times when this distributed strategy makes sense. Certain data may be useful or interesting only to a small analytic community. Other data may be sensitive and require strict physical access limitations. In some cases, the desire for a separate server is organizational; a certain department may insist on having its data on its own server. As we will describe in the implementation section, you may have existing data marts with extensive reports and applications built on them. In this case, it’s much easier to initially populate these downstream marts from the SQL Server PDW rather than rewrite the reports and applications to work directly from the SQL Server PDW.
In these cases, the SQL Server PDW can act as the central source of the distributed data warehouse. SQL Server PDW has a Remote Table Copy feature that propagates tables to these downstream SQL Server systems at high speed. The target systems need to be physically located close enough to the SQL Server PDW to connect to its InfiniBand network, since this is part of what makes the transfer fast. If the downstream systems are designed based on Microsoft's Fast Track architecture, the data transfer rates can be significant. The downstream systems can also be any data mart running SQL Server 2008 or above. This includes the new HP Business Data Warehouse, optimized for SQL Server 2008 R2, and the HP Business Decision Appliance, a BI appliance also optimized for SQL Server 2008 R2.
There are several ways to extend the functionality of the core data warehouse within the SQL Server platform. SQL Server Analysis Services online analytic processing (OLAP) brings more advanced analytic functionality and improved performance for complex queries. Analysis Services data mining brings predictive analytics that can leverage the ability of the parallel processing server to draw out valuable patterns and relationships from vast quantities of transactions.
Most organizations that have been working with DW/BI systems for several years have advanced beyond simple reporting. They are building complex analytic applications using predictive analytics and multi-faceted dashboards that draw key performance indicators from across the enterprise. It can be difficult to create the queries that populate these advanced BI tools because they need to make multiple passes against multiple fact tables at different levels of summary. In this environment, it is usually easier to create additional data sets that pre-integrate the data and pre-calculate most of the analytics.
This can be done either as tables in the SQL Server PDW, or as a separate OLAP data mart. These types of calculated data sets in the relational data warehouse are often called snapshot fact tables or accumulating fact tables. Common snapshot tables include inventory balances at a point in time, or month end account balances in financial services.
The OLAP option is particularly compelling because OLAP databases, such as SQL Server Analysis Services, are designed to perform more advanced analytic calculations, and provide a performance boost through the creation and management of aggregates. The language used to access Analysis Services, called multidimensional expressions (MDX), was created to support analytics. It has a built in understanding of date relationships such as current month, year to date, and prior year. It also can navigate hierarchies, such as moving from district to region to country.
In either case, the SQL Server PDW core data warehouse would serve as the data foundation, and these analytic tables or marts would be built from its cleaned and conformed data store.
In an interesting combination of functionality, Analysis Services can also be used as a query management layer for SQL Server PDW in what is known as relational OLAP (ROLAP) mode. In this mode, Analysis Services retrieves data directly from SQL Server PDW at query time, rather than using a pre-loaded OLAP database within Analysis Services. It also provides full access to the advanced analytic capabilities offered by the MDX language. User queries are submitted to Analysis Services from the BI tool layer, translated into SQL, and submitted to the SQL Server PDW database.
Microsoft’s data mining functionality can drive a range of interesting predictive analytics including forecasting, recommendation engines, and customer segmentation. The data mining component itself is an Analysis Services feature, and runs on an Analysis Services server.
Parallel Data Warehouse can serve as the data source that feeds the data mining engine, thus enabling models based on the vast amounts of transaction level detail stored in SQL Server PDW, in combination with the richly attributed dimensions. Generating the input data sets is often the hard part of data mining because these data sets typically involve multiple full-table scans to identify behaviors and changes in behavior over time.
One of the original principles of the Kimball approach listed in the first section of this white paper is to provide a complete DW/BI solution. This includes providing user access for ad-hoc exploration and BI reports and applications that deliver value to the business that was identified in the requirements gathering process. Microsoft offers a set of reporting and analysis tools as part of its overall DW/BI product stack, and Parallel Data Warehouse is a fully participating member of this ecosystem. Reporting Services and Report Builder queries and reports, and third-party BI tools, can draw from SQL Server PDW like any other SQL Server database. The same is true for Microsoft Office tools such as Microsoft Excel and Microsoft PowerPivot. All of these user access methods can be hosted in Microsoft SharePoint and delivered in the context of a rich BI portal experience. .NET applications can access SQL Server PDW via ADO.NET drivers, and third-party tools can communicate with SQL Server PDW using OLE DB and ODBC. All of these drivers accompany the SQL Server PDW product.
In summary, the above goals lead us to the following components of the ideal enterprise data warehouse architecture:
| Architecture component | Goals supported |
| --- | --- |
| Atomic data | Flexibility, enterprise resource |
| Single data store | Enterprise resource, maintainability |
| Parallel processing and/or aggregates | Performance |
| Dimensional model | Ease of use for all user communities |
| Conformed dimensions | Enterprise resource, integration |
| Attribute change tracking | Enterprise resource, ease of use, accurate history |

Table 1: Architecture components and goals
We include attribute change tracking even though it is more a function of the ETL process because it is mandatory from a business perspective, and its ease of implementation depends on the underlying data model. Therefore, we list it as part of the core data warehouse architecture. A graphical model of this architecture can be depicted as follows:
Figure 8 – High-level enterprise data warehouse in the DW/BI system architecture
Parallel Data Warehouse occupies center stage in this architecture. In many cases it can provide a no-compromise solution: a single set of atomic-level data stored in dimensional models, using parallel processing to provide performance, and organized as an enterprise resource based on the bus matrix and conformed dimensions. SQL Server PDW can also take on much of the heavy ETL processing if need be, a function we will discuss in the upcoming implementation section.