
Data Warehouse Loading

By Vincent Rainardi

Loading data into a data warehouse should not be that difficult, right? A data warehouse consists of database tables, so, to put it simply, all we need to do is put some records into these tables, right? Well, that is generally the idea, but there are several things that we need to consider as well.

Some of the points we may want to consider are listed below; we will discuss them one by one.

  1. When And How To Load
    • When not used, restrict access
    • Flipping
    • After disconnect from source
    • Many ways and source types
    • Bulk upload
    • Incremental
  2. The Stage
    • Database or files
    • Timestamp column
    • Source system column
  3. Data Firewall
    • Data Firewall / Data Quality / Information Quality
    • Bad/good data
    • Reject/correct/allow
    • Allowing bad data
    • Reporting and correction
    • Static values and statistical comparison
    • No data goes into DW unless …
    • If the data is crap the DW is useless.
  4. ODS
    • Operational data store or source
    • Key management
    • ODS firewall
  5. Loading Dimension Tables
    • Unknown records
    • Date and time of day dimension
    • Fiscal dates
    • Natural keys
    • Surrogate key management
    • Referential integrity
    • Static dimensions: are they really?
    • Nobody touches the DW data directly
    • Mr. DWA not DBA
    • Upsert operation: by example.
    • Deletion
    • DDS dimension
    • Overlay: more than 1 source
    • SCD 2: validity range, active flag, insert and update
    • SCD 3: shifting the columns
  6. Loading fact tables:
    • Key mgt: slim dim table – customer
    • Natural key changes: examples
    • Unknown records
    • Upsert
    • Deletion
    • Source PK: if source allows dup? Dedup
    • Index: take them off (10% rule)
    • Internal (really?) DW entity: targets
    • Transactional and snapshot fact table
    • Purging/pruning: parameterize
    • Reloading history
  7. Logging: audit
    • Number of rows, timestamp: for performance analysis
    • Log the unknown
    • DQ log/audit
  8. Control: failure and restart.
    • Invocation and cycle dependency
    • Stage ETL invokes DW ETL: sync – minimise time lost
    • Two independent cycles – flexibility
    • Restart modes: from point of failure or beginning, dates
    • Simulation: power failure, source connectivity
    • Scenario: ERP down, backup overrun
  9. Restoring DW
    • Real life: Disk failure – RAID
    • Save stage
    • Backup to disk for 3 days
    • Tape > 3 days: test restore!
    • Offsite
  10. Multidimensional Database
    • Reprocess dimensions
    • Reprocess cubes - incremental
  11. EAI and EII

Looking at the items above, this is going to be a very long article, so I have split it into several parts. Please bear with me as we go through it. If you have any comments or questions or want to discuss anything with me, just click the link above and put them in the forum; I will be happy to reply or discuss them.

1. When And How To Load

Ideally we want to load the data warehouse when it is not being used, i.e. when the users are not running queries or reports against it. This is necessary to minimize the performance degradation experienced by the users: user queries may run significantly slower than normal while we are loading the warehouse. Sometimes it is also necessary to prevent anomalies or errors, i.e. users may experience data inconsistencies during the load. Because of this, it is sometimes necessary to prevent users from accessing the data warehouse while we are populating it.

There is a technique that enables a data warehouse to be used while we are populating it. Basically we have 2 copies of the data warehouse tables. While we are populating copy A, all user queries are directed to copy B. Then, when we finish populating copy A, all user queries are redirected to copy A and we populate copy B. This technique is quite popular in custom data warehouse development and it is used in some off-the-shelf data warehouse applications such as E.piphany CRM (now acquired by SSA). There is one caveat when using this technique, apart from the obvious requirement of extra disk space: because of the arbitrary table names we need to use dynamic SQL, which impacts both user query performance and loading performance a little.
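One way to implement this flip in SQL Server 2005 onwards is with synonyms, which avoids dynamic SQL because queries always reference the same name. The sketch below assumes hypothetical tables dbo.FactSales_A and dbo.FactSales_B and a synonym dbo.FactSales that all user queries go through:

```sql
-- Flipping technique sketch. Assumed objects: dbo.FactSales_A and
-- dbo.FactSales_B (the two copies) and a synonym dbo.FactSales that
-- all user queries reference.

-- Users are currently reading copy B, so load copy A...
TRUNCATE TABLE dbo.FactSales_A;
-- ... populate dbo.FactSales_A from stage here ...

-- ... then flip the synonym so user queries hit the fresh copy.
BEGIN TRANSACTION;
DROP SYNONYM dbo.FactSales;
CREATE SYNONYM dbo.FactSales FOR dbo.FactSales_A;
COMMIT TRANSACTION;
```

On the next cycle the roles are reversed: load copy B, then point the synonym at it.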

To minimise the load on the source systems, we want to load our data warehouse after we have disconnected from the source. So we get the data out from the source systems as quickly as we possibly can into stage, then disconnect from the source systems, then start loading into the warehouse. If we do the loading at the same time as doing the extracting, the overall elapsed time could be shorter but the load on the source systems could be higher.

There are many ways to load a data warehouse. We can load from a message queue, Changed Data Capture (CDC) files, web services or Enterprise Application Integration (EAI) server events. If we use ETL, to get good loading performance we ideally want to populate the data warehouse using bulk upload. In SQL Server 2005 we want to use the raw file format, which will give us the best performance. Also, ideally we want to populate the data warehouse incrementally, i.e. load only the data which has changed since the last extraction.
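As a concrete illustration of bulk upload, a flat-file extract can be loaded into a stage table with BULK INSERT (the file path, table name and delimiters below are assumptions):

```sql
-- Hypothetical bulk load of a pipe-delimited extract file into a stage table.
-- TABLOCK and a batch size help the engine use minimally logged bulk inserts.
BULK INSERT stage.dbo.CustomerExtract
FROM '\\etlserver\extracts\customer_20061017.dat'
WITH
(
    FIELDTERMINATOR = '|',
    ROWTERMINATOR   = '\n',
    TABLOCK,
    BATCHSIZE       = 50000
);
```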

2. The Stage

If your stage is a database rather than a file system, you would probably want to add a few columns to each of your stage tables. Columns such as a timestamp (the time when the record was loaded into stage) and a source system ID (which source system the data came from) can be quite useful when loading the stage data into the data warehouse.
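A stage table with these two housekeeping columns might look like this (the table and business columns are illustrative):

```sql
-- Stage table sketch with the two housekeeping columns discussed above.
CREATE TABLE stage.CustomerExtract
(
    customer_id      int           NOT NULL,  -- business columns from the source
    customer_name    varchar(100)  NULL,
    load_timestamp   datetime      NOT NULL DEFAULT GETDATE(), -- when the row landed in stage
    source_system_id int           NOT NULL                    -- which source system it came from
);
```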

General considerations about the benefits and the disadvantages of having a stage are discussed in this article.

3. Data Firewall

Also known as data quality or information quality. Some people would argue about the differences between these 3 terms, but that discussion is for another article. For now, all we need to know is that when loading data into the data warehouse we need to have some kind of data filter: bad data goes into a quarantine area, good data goes into the warehouse. That is why this filter is called a data firewall, because of its similarity to IP filtering.

There are 3 possible things we can do with bad data: reject it, correct it, or allow it through without correction. Rejecting it means we don't put it into the warehouse. Correcting it means modifying the data based on certain predefined rules before putting it into the warehouse. The third option is to allow the bad data into the warehouse as-is. In all three cases we need to log the bad data.

We allow bad data into the warehouse when it is within the tolerance limit: it is a warning rather than an error. For example, say the normal value is between 5000 and 6000. If the value is between 4000 and 5000 or between 6000 and 7000, it is still acceptable, so we put it in the warehouse. If the value is less than 4000 or more than 7000 we reject it. If the value is 5500 (good data) we allow it through and we don't log it, but if the value is 6500 (bad data but within tolerance) we allow it through and we log it.
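The three-way classification above can be expressed as a simple firewall rule; the stage table and column names here are assumptions:

```sql
-- Classify incoming values against the example limits above:
-- 5000-6000 is good, 4000-5000 and 6000-7000 is within tolerance
-- (allow and log), anything else is rejected into quarantine.
SELECT measure_value,
       CASE
           WHEN measure_value BETWEEN 5000 AND 6000 THEN 'good'    -- allow, no log
           WHEN measure_value BETWEEN 4000 AND 7000 THEN 'warn'    -- allow and log
           ELSE 'reject'                                           -- quarantine and log
       END AS firewall_action
FROM stage.IncomingMeasure;
```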

Examples of corrections that we make automatically to the data before allowing it into the warehouse are straightening abbreviations (such as changing "Herts" to "Hertfordshire" and "Adv." or "Adv" to "Advanced") and upper/lower case conversion (such as replacing "JOHN CUNNINGHAM", "john Cunningham" or "John CUnningham" with "John Cunningham"). Another example is changing "MR" and "mr" to "Mr". You may have heard people using the term scrubbing or data scrubbing. Well, this is it: the above are real examples of data scrubbing. It may be boring and not sexy, but keeping the data tidy and neat is core to data quality.
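In T-SQL, scrubbing rules like these are often simple updates applied to the stage table before the warehouse load (table and column names below are illustrative):

```sql
-- Illustrative scrubbing updates run against stage before the warehouse load.

-- Straighten a county abbreviation.
UPDATE stage.CustomerExtract
SET county = 'Hertfordshire'
WHERE county IN ('Herts', 'HERTS', 'herts');

-- Standardise the salutation.
UPDATE stage.CustomerExtract
SET title = 'Mr'
WHERE title IN ('MR', 'mr');
```

In practice the from/to pairs are usually kept in a rule table rather than hard-coded, so the business can maintain them without an ETL change.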

The quarantine area contains rejected data, which is then processed (say, daily) and reported to the individual data owners who are responsible for making sure that these errors are corrected in the source system. The next time the data is extracted from the source system, the ETL will pull the corrected data into the warehouse.

The easiest type of data firewall rule is a comparison to a static number or a range, such as the example above. But in the data firewall we can also do a statistical comparison, i.e. if the value is within the last 90 days' average plus or minus 10%, allow it through, otherwise reject it. We can also compare the incoming data with the data that is already in the warehouse. We could also compare one part of the incoming data with another part of the incoming data, for example in the case of header and detail tables. But we have to be careful here: the other part that we are referencing may not be correct or complete, especially if the two parts are in different extraction cycles.
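A statistical rule of this kind might be sketched as below; the fact table, stage table and column names are assumptions:

```sql
-- Statistical firewall rule: allow the incoming value only if it is within
-- plus or minus 10% of the last 90 days' average already in the warehouse.
DECLARE @avg90 decimal(18,2);

SELECT @avg90 = AVG(measure_value)
FROM dw.FactDailySales
WHERE snapshot_date >= DATEADD(day, -90, GETDATE());

SELECT s.natural_key,
       s.measure_value,
       CASE
           WHEN s.measure_value BETWEEN @avg90 * 0.9 AND @avg90 * 1.1
           THEN 'allow'
           ELSE 'reject'
       END AS firewall_action
FROM stage.DailySales s;
```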

By far the most common type of firewall rule is the database integrity rule, such as null/not null, foreign key / orphan checking, and character length. Why would we want to check database rules if they are enforced at the database level anyway? Because if it is done in the DQ system, it can be reported and corrected, and we can prevent the bad data from entering the warehouse, or correct the data and allow it through. A DQ system with its data firewall rules is much more flexible than the database referential integrity instruments. And it is all logged, so it is auditable at any time later on.
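For example, an orphan check can be done in the DQ system before the load, rather than letting a foreign key reject the rows at insert time. The log table and its columns here are assumptions:

```sql
-- Orphan check done in the DQ system: log stage fact rows whose
-- customer does not exist in the customer dimension, so they can be
-- reported and corrected instead of failing the load.
INSERT INTO dq.RejectLog (table_name, natural_key, rule_name, logged_at)
SELECT 'stage.SalesExtract',
       s.customer_code,
       'orphan customer code',
       GETDATE()
FROM stage.SalesExtract s
LEFT JOIN dw.DimCustomer d
       ON d.customer_code = s.customer_code
WHERE d.customer_code IS NULL;
```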

It really pays to spend the time building good, solid firewall rules, because at the end of the day, if the quality of the data in the warehouse is crap, then the data warehouse is useless. It is very important to ensure that the only way to load data into the warehouse is through the data firewall. No data should go into the warehouse without going through the data firewall. I can't stress this enough. If there is only 1 thing we learn today from this article, I hope it is data quality: never allow data into the warehouse unless it goes through the firewall.

4. Operational Data Store (ODS)

Some people mix up the operational data store (ODS) with the operational data source. The former is part of the data warehouse and the latter is the source system(s). The ODS is the normalised database between the stage and the dimensional store. The general advantages and disadvantages of having an ODS have been discussed in this article.

With regard to data warehouse loading, one advantage of having an ODS is that it makes key management in the dimensional store easier. The second advantage is that it enables us to have an ODS firewall, i.e. a set of data quality rules which checks the incoming data from stage against the data in the ODS. It is easier and faster to do these checks against the ODS than against the dimensional store, because the ODS is in (at least) 3rd normal form whilst the dimensional store is in denormalised form.

5. Loading Dimension Tables

In each of the dimension tables there should be a record with a key value of 0 (or -1 if you want) whose attributes are set to 0 (for numeric columns), "UNKNOWN" (for character columns such as Description), "" or a blank string (for other character columns) and a low-value date such as 1/1/1900 (for date columns). These records are called the unknown records. They are not loaded in the normal daily load, but are created as part of the data warehouse installation / setup.
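Creating an unknown record for an identity-keyed dimension looks something like this (the dimension and its columns are illustrative):

```sql
-- Create the unknown record at installation time. IDENTITY_INSERT is
-- switched on temporarily so we can force the surrogate key to 0.
SET IDENTITY_INSERT dw.DimProduct ON;

INSERT INTO dw.DimProduct (product_key, product_code, description, unit_price, created_date)
VALUES (0, '', 'UNKNOWN', 0, '19000101');

SET IDENTITY_INSERT dw.DimProduct OFF;
```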

There is also other data that is preloaded as part of the data warehouse installation, such as the date dimension and the time of day dimension. Some people think it is good to populate the date dimension (one row per day) for 100 or 200 years, so we never have to extend it. This is generally not good practice, as the date dimension will be referenced by (almost) every single fact table in the warehouse and the extra rows would slow down query performance. Some people take the opposite approach: populate the date dimension for only 1 or 2 years (to maximize query performance) and extend it every (other) year. This is also not a good approach, because of the administration overhead and because of the risk: the date dimension is used everywhere, and if we mess around with it every year, chances are we could break the whole warehouse if we make a mistake. A better approach is to agree with the business users how far back we want to load history. Normally this is 2-3 years, and in extreme cases 10 years. Then we populate the date dimension from that year until 5 years ahead. Every 5 years we extend it for another 5 years.
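Populating the date dimension for such a range can be done with a simple loop at installation time; the column names are illustrative, and a real date dimension would carry many more attributes (day of week, month name, fiscal columns, and so on):

```sql
-- Sketch: populate the date dimension from the agreed history start
-- up to five years after installation.
DECLARE @d datetime, @end datetime;
SET @d   = '20040101';   -- history start agreed with the business
SET @end = '20111231';   -- installation year plus 5 years

WHILE @d <= @end
BEGIN
    INSERT INTO dw.DimDate (date_key, calendar_date, calendar_year, calendar_month, calendar_day)
    VALUES (CONVERT(int, CONVERT(char(8), @d, 112)),  -- smart key, e.g. 20040101
            @d, YEAR(@d), MONTH(@d), DAY(@d));
    SET @d = DATEADD(day, 1, @d);
END;
```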

Fiscal dates are generally known for the next 5 years because they are constant. For example, a fiscal period could be defined as one of 13 equal portions of a fiscal year, consisting of 4 fiscal weeks each, with the fiscal year running from 1st August until 31st July. Or the fiscal year may be divided into fiscal periods using a pattern such as 445445445445, which means the first and second fiscal periods consist of 4 fiscal weeks, the third period consists of 5 fiscal weeks, the fourth and fifth periods consist of 4 fiscal weeks, the sixth period consists of 5 fiscal weeks, and so on.
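Under a plain 445 pattern, the fiscal period can be derived from the fiscal week number, since each quarter of 13 weeks splits as 4 + 4 + 5. This sketch assumes the date dimension already carries a fiscal_week column:

```sql
-- Derive the fiscal period from the fiscal week under a 445 pattern:
-- weeks 1-4 -> period 1, weeks 5-8 -> period 2, weeks 9-13 -> period 3,
-- weeks 14-17 -> period 4, and so on (integer arithmetic throughout).
SELECT fiscal_week,
       ((fiscal_week - 1) / 13) * 3                  -- 3 periods per completed quarter
       + CASE ((fiscal_week - 1) % 13) / 4           -- position within the quarter
             WHEN 0 THEN 1
             WHEN 1 THEN 2
             ELSE 3
         END AS fiscal_period
FROM dw.DimDate;
```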

If we know the fiscal calendar for the next 5 years, we load the fiscal calendar into the date dimension when preloading the warehouse at installation time. If the fiscal calendar changes every year and we only know it one year in advance, we load the fiscal calendar into the date dimension every year but we should still populate the date dimension for the next 5 years at installation time, leaving the fiscal date columns as their default or assumed value.

Each source table should ideally have a natural key, i.e. the column that we use to identify duplicates when we update the data warehouse dimension table. But in reality they don't always have natural keys. In those cases, when we upsert (update and insert) into the dimension table, we compare all columns.

Surrogate key management has always been at the heart of loading dimension tables. The simplest way (and probably the best way) is to use an identity column (or a sequence for those who use Oracle) and let the RDBMS manage the surrogate key. I've seen cases where the key is not surrogate/meaningless, and in those cases it is managed manually (using a subroutine), which is not good practice.
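Letting the RDBMS manage the surrogate key is as simple as declaring an identity column on the dimension (this dimension and its columns are illustrative):

```sql
-- The RDBMS manages the surrogate key via an identity column;
-- the ETL never has to compute or look up the next key value.
CREATE TABLE dw.DimCustomer
(
    customer_key  int IDENTITY(1,1) NOT NULL PRIMARY KEY, -- surrogate key
    customer_code varchar(20)       NOT NULL,             -- natural key from the source
    customer_name varchar(100)      NULL
);
```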

In most cases a normal 4-byte integer (2 billion, ish) is enough for surrogate keys, but there are cases where we need to use bigint (if you use SQL Server), especially for a large dimension with SCD 2 in place. In any case, the surrogate key should not be varchar (or varchar2 in Oracle), and it should not be decimal (or number in Oracle, or decimal in Teradata). But time and time again I have found intelligent varchars used as surrogate keys. Not only does this impact both ETL and end user performance, but it also makes things more complicated.

Referential Integrity (RI) has always been a long-standing debate in key management. One group says that we need to impose RI for the sake of data integrity. The other group says that we should not impose RI because it slows down performance a lot and RI is managed within the ETL anyway. RI is normally implemented using foreign keys, but sometimes also in the form of cascade delete / cascade update, (after) trigger constraints and, in SQL Server, check constraints. My view is that we do need to implement RI, because in many cases the performance degradation is not that great and because data integrity is paramount in data warehousing.
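The foreign key form of RI between a fact table and its dimension looks like this (table and column names are illustrative):

```sql
-- Enforce RI between the fact table and the customer dimension:
-- every fact row must reference an existing dimension row
-- (the unknown record with key 0 covers unmatched source rows).
ALTER TABLE dw.FactSales
ADD CONSTRAINT FK_FactSales_DimCustomer
    FOREIGN KEY (customer_key)
    REFERENCES dw.DimCustomer (customer_key);
```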

Static dimensions are dimensions that we never update. They are set up at DW installation and stay constant throughout the year. Examples are some of the decode tables, for instance order type codes (O for order, Q for quote, etc.) and delivery codes (S for scheduled, OD for on demand, etc.). These dimensions normally contain very few rows, e.g. fewer than 10. Larger decode tables containing 25 rows or more are not really that static; they tend to change a few times a year. In those cases I'd advise setting up an ad hoc ETL to bring the updates into the warehouse when required. Remember to put DQ on the ETL import. Remember also not to put these ad hoc ETL packages in the main batch or jobs; separate them and do not schedule them. They should be ad hoc (run on demand), not scheduled!

It is a matter of principle that nobody touches the data warehouse directly. Nobody updates the data inside the warehouse directly! This includes Mr. DWA (Data Warehouse Administrator). Hmmm, that's a new role that has developed in recent years: attention, Mr. DBAs! All updates must be done through ETL, which is data firewalled (and therefore quality assured), logged (and therefore auditable), and under a control system (and therefore recoverable on failure). Even those 'nearly static' dimensions absolutely need to be put on the ETL build. Yes, without a doubt. Even if the changes only happen once a year on these dimensions. It's worth repeating: no changes go into the warehouse without going through the data firewall. If the data is crap, the data warehouse is useless.

In part 2 we will examine the basic steps of the update and insert operation (upsert) into a dimension table.

Vincent Rainardi
17th October 2006
