Blog Post

Consolidating Multiple Data Marts

,

One of the challenges that I've worked on in the past months involved adapting our technical architecture to address issues exposed by business re-engineering.  The BI group at CHC used to assign specific people to specific business units in order to deliver solutions to business needs.  The upside of that arrangement was that each business unit developed a strong relationship with a member of the BI team, and always felt forward motion on their issues.  The downside was that the BI team was fractured internally, technical implementations varied wildly, larger projects weren't staffed, and high-priority projects didn't attract the team's attention away from lower priority projects.
Our internal BI delivery model has changed to become more of a single team knocking down projects together, and as such, the disconnected architecture of the existing solutions becomes glaringly apparent.  The most obvious way it's visible is that it's very difficult to re-use work already done, and integrate "better" parts of one solution with the others.  One aspect of that is how the relational data is stored and (not) shared.

The Debt

Typically in a multi-purpose data warehouse (Kimball style), you'll have a bus matrix that identifies the subject areas (facts) and the business entities that apply to each of them (dimensions).  Also quite typically, the business entity tables (dimensions) are shared between the subject areas - conformed.  This can be a tricky part of the project, getting two parts of your organization to agree (precisely!) on a common (or compatible) definition of a shared entity - but assuming you can get that done, the next step is to technically build and maintain only one table.
Unfortunately, given the fractured nature of the past team, this wasn't the case.  One of the ETL developers here counted eighteen (18) different date dimension tables.  Eighteen.  Dates.  Yeah.  Even with a pretty dysfunctional organization, you'd think you could come to agreement that a day is a day - even if you might not agree on which attributes are in common between subject areas or what they look like.  Even if you couldn't do that, you'd think there wouldn't be 18 ways to represent dates.  And you'd be right.
This problem needed addressing - and it is a problem, because maintaining 18 different date tables is, quite frankly, ridiculous.  But trying to collapse eighteen tables into one isn't easy.  It's not quite as simple as making one common table and modifying all places that touch that table to refer to the common table.  We could make sweeping changes all at once, or work one BI application at a time.  If we're making changes to 18 BI applications at once, we could break them all.
If we attempt to implement incrementally we could just be adding a 19th date table, and not really get rid of any existing ones.
Of course, in a real-world scenario, things aren't even as simple as I've described.  Sometimes a dimension is constructed by application A, but application B also uses it.  So if we go and change application A to point at a new common dimension, we'll end up breaking B - perhaps silently.  It's this silent dependency that really paralyzes attempts to clean things up.

The Goals

So all I've exposed so far is how unhappy and hard to please that I am.  Now it's time to figure out what I'd be happy with!  Here's the goals for cleaning up:

  • Remove duplication of work by having one (1) dimension table per entity - where the business thinks it's feasible
  • Allow two or more dimension tables about the same entity in cases where the business can't/won't come to agreement
  • Make dependencies clear
  • Make breaking dependencies obvious as soon as possible
  • Permit changes to a central dimension table without breaking dependent entities
  • Design in the ability to start and stop refactoring at any time without having made the situation worse

Not a short, easy list of goals, is it?

Views to the Rescue

Views are a common weapon in the fight to add "abstraction layers" in systems.  These layers don't stop changes from breaking things, but they can help make fixing them faster.  They allow this by essentially making a "translation layer" that is easier and faster to reconfigure than the larger parts that they connect.  Of course, one of the main traps to avoid in using views in this way is making them contain too much translation, turning them into part of the ETL, and a shadow code base.
What we're going to do is:
  • Take the ETL out of application A and/or B, and make a new ETL for the shared, conformed dimension.  We'll call this "C" for now.
  • Replace A's dimension table with a view to C's table
  • Replace B's dimension table with a view to C's table

Now A and B are really referring to C.  This achieves the "remove duplication of work" goal.
To achieve the goal of making dependencies clear and remove the possibility of altering them unintentionally, we're going to use a special clause when we make the view: WITH SCHEMABINDING.  Read up in MSDN (look for SCHEMABINDING) - I'll wait.  To translate that into our specific situation, this means that C can't change it's table definition if it would break the view in A or B.  Dependency clarity.

A Specific Example (with Identical Tables)

Application A and B have "date" tables.  They're both defined exactly the same, and have a set of identical columns.  Critically, the KEY of both tables is defined the same.  A's table has some columns that B doesn't have, and vice versa.  In code (briefly - you'd have PKs, NOT NULL set appropriately... right?):
CREATE TABLE DimDate_A (SK_Date INT, RealDate DATE, CalendarYear INT)
CREATE TABLE DimDate_B (SK_Date INT, RealDate DATE, FiscalYear INT)
Step 1 is to create a new table for the conformed date dimension:
CREATE TABLE DimDate_C (SK_Date INT, RealDate DATE, CalendarYear INT, FiscalYear INT)
Step 2 is to combine the ETL for A and B into a process for C - use your tool of choice (such as SSIS)
Step 3 is to remove the ETL from A and B, drop A and B's tables, and create views instead:
DROP TABLE DimDate_A
DROP TABLE DimDate_B
CREATE VIEW Dim_Date_A WITH SCHEMABINDING AS SELECT SK_Date, RealDate, CalendarYear FROM DimDate_C
CREATE VIEW Dim_Date_B WITH SCHEMABINDING AS SELECT SK_Date, RealDate, FiscalYear FROM DimDate_C

More Complex Examples

This technique does work for more complex examples - I showed that at a presentation at last month's SQL Saturday.  If you missed SQL Saturday, Scott Stauffer is setting up a PASS chapter meeting in early August where I'll present it again, and have time for more interaction.

Original post (opens in new tab)

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating