There are a lot of misconceptions about dimensional modeling and the Kimball approach to building a DW/BI system. It’s worth reading this section even if you are already familiar with the Kimball approach; you might be surprised at what you learn. The Kimball approach to creating an enterprise data warehouse has several core principles:
1. Follow a proven methodology; I recommend the Kimball Lifecycle.
2. Understand business requirements so you can engage the business, prioritize your efforts, and deliver business value.
3. Design the data warehouse data sets for flexibility, usability, and performance.
4. Build and deliver quick, business process-based increments within an enterprise data framework known as the data warehouse bus matrix.
5. Design and build a DW/BI system architecture based on your business requirements, data volumes, and IT systems environment.
6. Build out the extract, transformation, and loading (ETL) system with standard components to deal with common design patterns found in the analytic data environment.
7. Provide the complete solution, including reports, query tools, applications, portals, documentation, training, and support.
All of these principles are explored at length in The Data Warehouse Lifecycle Toolkit book, Second Edition (Wiley, 2008). We will examine a few of them here in detail.
The Kimball Lifecycle is a detailed methodology for designing, developing, and deploying data warehouse/business intelligence systems, as described in The Data Warehouse Lifecycle Toolkit, Second Edition. The diagram in Figure 1 summarizes the key steps in the Lifecycle.
The Lifecycle is an iterative approach, with each pass delivering a coherent set of data and an initial set of associated reports and applications. Each pass can typically be completed in 6 to 9 months, depending on the data complexity. Building out the full DW/BI system takes multiple iterations, each one loading a new data subject area, which plugs into the overall enterprise data framework called the bus matrix.
Figure 1: The Kimball DW/BI Lifecycle
The Kimball approach starts with understanding business requirements and determining how best to add value to the organization. The organization must agree on what the value of this data is before deciding to build a data warehouse to hold it. For example, capturing web browsing activity may allow you to gain deep insight into your customers’ behaviors and preferences, opening up new ways to better meet their needs. If you clearly identify and deliver business value, the resulting impact on the business should easily justify your SQL Server PDW investment.
The ideal starting point for most organizations is to perform an initial set of interviews to gather and prioritize enterprise-wide high level business requirements for information. The result is a priority ordered list of business processes that generate data, along with high value analytic opportunities supported by that data.
Once the list of business processes and associated opportunities has been identified and prioritized, the next step is to take the highest priority business process and gather detailed business requirements related to it. This second pass at requirements is much more focused on understanding the specifics around the required data source, including attributes, definitions, business rules, data quality, and the range of analytics and applications that will be built on top of this data set.
Once these detailed requirements are in place, the Lifecycle moves into the implementation phase beginning with design steps across three different tracks. The top track in Figure 1 is the technology track. The main goal here is to identify the functionality and associated tools needed to meet the identified business requirements.
The middle track in Figure 1 is the data track. The initial step is to define the logical data model needed to support the analytic requirements. In the Kimball approach, this is a dimensional model. Once the logical model is in place, the team can build the target database in the database environment. The nature of the physical model depends on the target platform. Many database products work best with a physical dimensional model, although a more normalized model may make sense on a few platforms. The last data step is to create the ETL system that will populate the target database as required. The ETL system is a significant effort, often consuming a majority of the initial project resources.
The bottom track in Figure 1 is concerned with the BI applications: the initial set of reports and analyses that will deliver business value to the organization. This track is split into two steps; the first is the design step where a small set of high value applications and reports are identified and specified in detail. The second step is the actual implementation where these applications and reports are built. This step often has to wait until near the end of the ETL development when data is actually available in the database. Note that these reports and analyses only serve as a starting point that helps solve a high-value problem. The dimensional model is not limited in any way to this subset of reports.
Once the three implementation tracks are complete, the Lifecycle comes back together to deploy the query tools, reports, and applications to the user community. This involves extensive communication, training, documentation, and support.
The next Lifecycle iteration usually begins during the deployment of the previous iteration, when the business analysts and designers can gather detailed requirements for the next highest priority business process, create the associated dimensional model, and start the process all over again. The Lifecycle’s incremental approach is a fundamental element that delivers business value in a short timeframe, while building a long-term, enterprise information resource.
The Enterprise Data Warehouse Bus Matrix is the data framework for the enterprise data warehouse. Figure 2 shows a simplified version of a bus matrix for a retail organization with a customer affinity program.
Figure 2: Example bus matrix
The row headers down the left side of the bus matrix define the organization’s primary business processes. A good way to think about these business processes is to think about the value chain of the organization. What are the activities in which your organization engages to provide your customers with the goods or services they need?
The column headers of the bus matrix represent the primary objects that participate in those business processes. Typical examples include customer, account, product, store, employee, patient, and date. These objects are called dimensions, and they must be pre-integrated to work with all the relevant business processes.
This pre-integration is called conforming, and it involves the hard organizational data governance work of deciding the standard names, descriptions, mappings, hierarchies, and business rules that will apply across the DW/BI system. This is essentially what master data management (MDM) is meant to do, and the DW/BI system benefits greatly from a separate MDM effort. In the absence of an MDM system, the DW/BI team must shoulder this dimension conforming effort. Once this definitional work is done, these dimensions become reusable components that can be applied to every associated business process. Most importantly, the conformed dimensions are the necessary framework for integration, where the results from two or more business process can be combined into a single BI deliverable.
Each row on the bus matrix is a business process data set that corresponds to a unit of work for the ETL system developers. Each business process data set needs a dedicated ETL module to extract the transaction facts, associate them with the conformed dimensions, and tie them together into a flexible dimensional model.
The Kimball approach to data modeling takes a pragmatic look at the underlying database platform and chooses the appropriate physical model based on usability, flexibility, performance, and maintenance on that platform.
Almost all dimensional models are classic star schemas, as shown in Figure 3. The numeric measurements (“facts”) of a business process are concentrated in the central fact table, and the context of the measurement is represented as a set of denormalized dimension tables, which surround the fact table. They keys that implement the joins between the dimension tables and the fact table should be anonymous integer keys. We call these surrogate keys.
Figure 3: An orders business process star schema
All camps are in agreement that the most user-accessible data model in the data warehouse is the dimensional model. For example, a 2006 study in the journal Decision Support Systems found that dimensional models were significantly easier to understand and remember how to use than other more normalized models.
There is a school of thought that calls for a normalized, third normal form model at the atomic level of the data warehouse. Its proponents argue that this gives the most flexibility. While this may be true from a transaction processing perspective, it is important to remember we are building an analytic database. Most transaction systems are based on third normal form data models with the atomic level detail transactions captured in normalized fact tables. The third normal form school keeps this model as the data foundation of the enterprise data warehouse. This then requires additional transformation steps to get the data into its presentation form for user consumption, often involving another physical layer of departmental data marts.
There is a commonly held belief that dimensional models are based on a set of reports or analyses and are therefore less flexible. This is false and has never been part of the Kimball approach. The normalized model and a properly designed, atomic-level dimensional model are relationally equivalent. They can answer the exact same set of analytic queries.
Flexibility comes in part from the level of detail captured in the model. Another common misconception is that dimensional models are summary only. In fact, a strong design goal in the dimensional model is to always capture data at the lowest level of detail available, called the atomic level. The presence of atomic-level data allows users to roll the data up to any level of summarization required. Any aggregation prior to inclusion in the enterprise data warehouse means some detail will not be available, thus reducing flexibility.
The dimensional model keeps the atomic-level fact tables in their normalized form (by normalizing the dimension tables out of the fact table) for smaller size and better performance, but keeps each dimension in denormalized (flat) form. Note that such flat dimension tables contain exactly the same information as fully normalized (snowflaked) dimension tables but do not implement the separate tables and extra keys required to complete the normalization process. The dimensional model simplifies the physical design by dramatically reducing the number of tables and joins required for a given analytic query, which improves performance on most market leading database products running on single servers. In fact, all the major SMP database products, including SQL Server 2008, have built-in performance optimizations that leverage the dimensional model (search the web for “star join optimization” for more information on this). Using a dimensional model at the physical level is also easier to manage than a normalized model. Because it is already dimensional there is no need for a translation layer or separate data marts to make it user-presentable.
As we’ll explore in the architecture section, parallel processing platforms such as the Parallel Data Warehouse work a bit differently. Data is distributed across many independent query nodes across the server. Each of these nodes may hold a subset of the fact data, which may need to join to all of the dimensions. On SQL Server PDW, the standard approach is to replicate all of the dimensions out to each node so the node can perform local joins, thus preserving the physical dimensional model. However, in rare cases it may make sense to normalize and/or distribute very large dimensions on SQL Server PDW to save time in the replication process and to save space on each compute node.
Let’s be clear on what we are saying. In the best of all possible worlds, you would load the atomic-level detail into the data warehouse for flexibility. The user data model would be dimensional for usability, and the physical data model would also be dimensional for simplicity and performance. Our experience in the real world backs this up. We have found the dimensional model to be the most usable, flexible, best performing, and most maintainable data structure for analytic purposes on most platforms. We only compromise this design when the platform requires it for performance and the users can be transparently shielded from any increase in complexity.
As we described in the bus matrix discussion, dimensions are the objects that participate in an organization’s business processes. We generally model these as one table per object. Building the dimension in the ETL system involves joining the various normalized description and hierarchy tables that populate the dimension attributes and writing the results into a single table.
Figure 4 shows an example of typical product-related attributes in a normalized model.
Figure 4: The normalized source tables for Product attributes
The base table is called Product and it connects to the Sales fact table at the individual product key level. From this set of tables, it’s possible to create analytic calculation such as SUM([Sales $ Amount]) by CategoryName, or by ProductColorGroupDescr, or any other attribute in any of the normalized tables that describe the product. It’s possible, but it is not easy.
In the dimensional version of the Product table, we would join the product-related tables from Figure 4 once, during the ETL process, to produce a single Product dimension table. Figure 5 shows the resulting Product dimension based on the tables and attributes in Figure 4.
Figure 5: The denormalized Product dimension
Obviously, it is still possible to calculate SUM([Sales $ Amount]) by CategoryName, or by ProductColorGroupDescr, or any other attribute in the Product dimension, which includes all the product related attributes from the normalized model. Note that the two models are equivalent from an analytic perspective.
Usability is significantly improved for BI application developers and ad-hoc users with the dimensional version. In this simple example, the ten tables that contain the 12 product attributes are combined into a single table. This 10 to 1 reduction in the number of tables the user (and optimizer) must deal with makes a big difference in usability and performance. When you apply this across the 15 or 20 dimensions you might typically find associated with a Sales business process, the benefits are enormous.
The main difference between the two approaches is that the normalized version is easier to build if the source system is already normalized; but the dimensional version is easier to use and will generally perform better for analytic queries.
Every analytic data store must provide a means to accurately track dimension attributes as they change over time. Tracking attribute changes allows the business to report on the state of the world as it was at any point in time, answering question like “What were sales by territory as of December 31st last year?” It also supports accurate causal analysis by associating the attribute values that were in effect when an event occurred with the event itself. For example, what postal code did a customer live in when they bought a certain product two years ago?
The most efficient way to capture these changes from both an ease of use and performance perspectives is to add a row to the dimension whenever an attribute changes by assigning a new surrogate key and capturing the effective date and end date for each row. These are commonly referred to as slowly changing Type 2 dimensions. You can see these control columns at the bottom of the Product dimension in Figure 5.
While tracking attribute changes over time places a burden on the ETL process, it improves performance for user queries because the joins between the facts and dimensions are simple equijoins on integer keys. This also improves ease of use because the BI semantic layer does not have to handle more complex, multi-column unequal joins in order to retrieve the correct dimension row for any given historical fact event.
Tracking changes over time is a mandatory business requirement, regardless of the underlying data model you use. It is possible to track changes in a normalized model, but the complexity of keeping multiple versions across dozens of tables associated with a single dimension is much greater than dealing with changes in a single, denormalized dimension table.
For a basic description of slowly changing dimensions, search www.kimballgroup.com for an article titled Many Alternate Realities (http://www.kimballgroup.com/html/articles_search/articles2000/0002IE.htmll).
For a discussion of more advanced change tracking techniques, see the article titled Slowly Changing Dimensions Are Not Always as Easy as 1, 2, 3 (http://www.kimballgroup.com/html/articles_search/articles%202005/0503IE.html).
For more information on tracking changes in a normalized model, please see Design Tip #90 Slowly Changing Entities at www.kimballgroup.com (http://www.kimballgroup.com/html/07dt/KU90SlowlyChangingEntities.pdf).
While the dimensional model provides the best performance and usability in a majority of standard scale DW/BI systems, it is not a one-size-fits-all answer. The underlying physical data structure needed to get the best performance on a given platform is the right choice, as long as it does not compromise the ease-of-use principle.
As we explore the Parallel Data Warehouse system, we will discuss the design and performance tradeoffs, and hopefully end up with the best of all possible worlds.