Upsert Dimension Table
(Data Warehouse Loading – Part 2)

By Vincent Rainardi

In part 1 we went through when and how to load, staging, the ODS, and some considerations for loading dimension tables. Now let's get down to the real work: the basic steps to update and insert (upsert) into a dimension table. Then we will finish the rest of dimension table loading.

The dimension table we want to load has 5 columns: surrogate_key, natural_key, attribute1, attribute2 and load_time. The surrogate_key column is the dimension's primary key and is an identity(1,1) column. The natural_key is the primary key in the source system. attribute1 and attribute2 are the dimension's attributes. load_time is the timestamp when the row was last changed; this will be useful later for loading the changes into a further dimensional store or into data marts.
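For reference, here is a minimal sketch of what the DDL for these two tables might look like. The column data types are assumptions, as the article does not specify them:

-- Dimension table in the warehouse; types are assumed for illustration
CREATE TABLE dw.dbo.dimension
( surrogate_key int IDENTITY(1,1) NOT NULL PRIMARY KEY
, natural_key   int NOT NULL
, attribute1    varchar(10) NULL
, attribute2    varchar(10) NULL
, load_time     datetime NULL
)

-- Source table in the staging database
CREATE TABLE stg.dbo.source
( natural_key  int NOT NULL PRIMARY KEY
, attribute1   varchar(10) NULL
, attribute2   varchar(10) NULL
, created      datetime NULL
, last_updated datetime NULL
)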

Source table

natural_key  attribute1  attribute2  created     last_updated
10           A           AAA         21/11/2005  02/10/2006
20           B           BB          02/10/2006  02/10/2006

Dimension table

surrogate_key  natural_key  attribute1  attribute2  load_time
1              10           A           AA          21/11/2005

Note:
1. In this example load_time, created and last_updated are all dates. In practice they should ideally be datetimes, e.g. 02/10/2006 15:17:02.2341.
2. attribute2 on the source table for natural_key 10 is 'AAA'. Yes, it is different from the dimension record, where the attribute is 'AA'.

Step 1. Create The Temp Table

SELECT * INTO #dim FROM dw.dbo.dimension WHERE 1 = 0

The temp table after it is created:

surrogate_key  natural_key  attribute1  attribute2  load_time
(no rows)

Note:
1. By deliberately specifying a false condition, we create an empty temp table that takes its structure from the target dimension table.
2. If the table is big, it may be better for performance to use a permanent table rather than a temporary table.
3. If you use SQL Server, you may want or need to increase the size of tempdb, depending on the size of your load.
4. SELECT INTO is better than CREATE TABLE because it automatically adapts to structural changes in the dimension table.
5. Note that SELECT INTO also copies the identity property of surrogate_key to #dim, which is why step 2 below wraps the insert in SET IDENTITY_INSERT.

Step 2. Populate The Temp Table

SET IDENTITY_INSERT #dim ON

INSERT INTO #dim (surrogate_key, natural_key, attribute1, attribute2, load_time)
SELECT ISNULL(dim.surrogate_key, 0), src.natural_key, src.attribute1, src.attribute2, @current_load_time
FROM stg.dbo.source src
LEFT JOIN dw.dbo.dimension dim ON src.natural_key = dim.natural_key

SET IDENTITY_INSERT #dim OFF

The temp table after it is populated:

surrogate_key  natural_key  attribute1  attribute2  load_time
1              10           A           AAA         17/10/2006
0              20           B           BB          17/10/2006

Source table

natural_key  attribute1  attribute2  created     last_updated
10           A           AAA         21/11/2005  02/10/2006
20           B           BB          02/10/2006  02/10/2006

Dimension table

surrogate_key  natural_key  attribute1  attribute2  load_time
1              10           A           AA          21/11/2005

Note:
1. Here we populate the temp table from the source table in the staging area.
2. Notice that we take the surrogate key from the dimension table, by joining to the source table on the natural key.
3. Because it is a left join, dim.surrogate_key will be NULL if the row doesn't exist in the dimension table.
4. The NULL surrogate keys are then replaced with 0, using the ISNULL function.
5. Notice that we don't populate load_time with getdate(), but with a variable named @current_load_time. This variable is populated with getdate() at the beginning of the loading batch and is used by all processes in the batch (see the sketch after this list). This is necessary so that, in the event of failure, we know the point in time from which to restart the process.
6. If you use a snowflake schema rather than a star schema, in this step you also need to look up the foreign keys on the child dimension tables.
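As a sketch, the @current_load_time variable mentioned in note 5 might be set like this at the start of the batch. The logging table dw.dbo.batch_log is a hypothetical name, used only for illustration:

DECLARE @current_load_time datetime
SET @current_load_time = GETDATE()  -- captured once, at the start of the batch

-- Record the batch start so that, after a failure, we know
-- the point in time from which to restart the load.
-- (batch_log and its columns are hypothetical names.)
INSERT INTO dw.dbo.batch_log (start_time, status)
VALUES (@current_load_time, 'started')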

Step 3. Update The Dimension Table

UPDATE dim
SET dim.attribute1 = tmp.attribute1,
dim.attribute2 = tmp.attribute2,
dim.load_time = tmp.load_time
FROM dw.dbo.dimension dim
INNER JOIN #dim tmp
ON tmp.surrogate_key = dim.surrogate_key
WHERE tmp.surrogate_key <> 0
AND (dim.attribute1 <> tmp.attribute1 OR dim.attribute2 <> tmp.attribute2)

Temp table

surrogate_key  natural_key  attribute1  attribute2  load_time
1              10           A           AAA         17/10/2006
0              20           B           BB          17/10/2006

Dimension table

surrogate_key  natural_key  attribute1  attribute2  load_time
1              10           A           AAA         17/10/2006

Note:
1. Here we update the dimension table based on the data in the temp table. In this case attribute2 was updated from 'AA' to 'AAA'.
2. We only update the rows where the temp table's surrogate key is not 0, i.e. the rows that already exist in the dimension table. The rows where the surrogate key is 0 (not yet in the dimension table) will be inserted later.
3. Notice that when updating rows we update the load_time column as well.
4. The last line specifies which changes we want to pick up. If the attributes are nullable, bear in mind that <> does not detect a change to or from NULL; wrap the comparison in ISNULL if necessary.

Step 4. Insert Into Dimension Table

INSERT INTO dw.dbo.dimension (natural_key, attribute1, attribute2, load_time)
SELECT natural_key, attribute1, attribute2, load_time
FROM #dim
WHERE surrogate_key = 0

Temp table

surrogate_key  natural_key  attribute1  attribute2  load_time
1              10           A           AAA         17/10/2006
0              20           B           BB          17/10/2006

Dimension table

surrogate_key  natural_key  attribute1  attribute2  load_time
1              10           A           AAA         17/10/2006
2              20           B           BB          17/10/2006

Note:
1. Notice that we don't specify the surrogate_key column when inserting, i.e. we let the RDBMS generate it.
2. When creating / setting up the data warehouse, we set the surrogate key column on the dimension table to identity(1,1).
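As an aside, on SQL Server 2008 and later, steps 2 to 4 can be combined into a single MERGE statement. This is a minimal sketch using the same table and column names as above; it is not part of the SQL Server 2005-era approach described in this article:

MERGE dw.dbo.dimension AS dim
USING stg.dbo.source AS src
  ON dim.natural_key = src.natural_key
WHEN MATCHED AND (dim.attribute1 <> src.attribute1 OR dim.attribute2 <> src.attribute2)
  THEN UPDATE SET dim.attribute1 = src.attribute1,
                  dim.attribute2 = src.attribute2,
                  dim.load_time  = @current_load_time
WHEN NOT MATCHED BY TARGET
  THEN INSERT (natural_key, attribute1, attribute2, load_time)
       VALUES (src.natural_key, src.attribute1, src.attribute2, @current_load_time);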

In every step above we need to do error handling and logging. Error handling is important because if the loading fails at any step, we need to be able to recover from the failure. Logging is important so that we know exactly what happened at each step, e.g. how many records were processed.

At the end of the program, we should not forget to clean everything up: drop the temp table(s), follow the control protocol (e.g. set the process to the completed state), and so on.

The above code shows how to do an upsert with native SQL code, which is very fast. It is worth noting, though, that good dedicated ETL tools such as Informatica and Data Integrator can do in-memory lookups, which also perform very well. Disk-based lookups are definitely not the way to go here, as they are slow, mainly because there are a lot of SQL statements to execute, i.e. one for each lookup, and each of these statements 'touches' the disk, which is a costly operation.

Deletion

If a row is deleted in the source table, then before we delete the row in the dimension table we must check that the surrogate key of that deleted row is not used in any of the fact tables. And if we use a snowflake schema, we should also check any dimension tables that refer to this dimension table. If the surrogate key is used anywhere else, we cannot delete the row.

The above causes an ambiguity: if a dimension row exists in the data warehouse, how can we tell whether it still exists in the source system? Because of that we don't usually delete rows from dimension tables. Instead, we mark the row as 'inactive'. Some people call this a 'soft delete'. In each dimension table we have several standard columns at the end: three timestamp columns (last updated time, created time and deleted time) and an 'IsActive' flag column. It is this IsActive column that we update from 1 to 0 when the row is deleted in the source system.
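A minimal sketch of such a soft delete, assuming the control columns are named is_active and deleted_time (the names are assumptions; use whatever your standard columns are called):

-- Soft delete: flag dimension rows whose natural key
-- no longer exists in the source, instead of deleting them
UPDATE dim
SET dim.is_active    = 0,
    dim.deleted_time = @current_load_time
FROM dw.dbo.dimension dim
WHERE dim.is_active = 1
AND NOT EXISTS
  ( SELECT 1 FROM stg.dbo.source src
    WHERE src.natural_key = dim.natural_key )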

DDS Dimension

The Dimensional Data Store (DDS) is a star schema (or snowflake) relational database which is populated from the ODS. The DDS is then used to populate multidimensional databases such as Cognos PowerPlay, Hyperion Essbase, Business Objects universes or SQL Server Analysis Services. The DDS is also used by reporting tools (such as Reporting Services) and dashboard applications. The Excel BI Add-On and ProClarity Desktop also connect to the DDS to get dimensional data.

When we populate a dimension table in the DDS based on the data in the ODS, we have two options: a) empty and reload, or b) incremental. Empty and reload is literally that: empty the DDS dimension table and then reload all records from the ODS. Incremental means that we load only the changed data from the ODS. The changed data is identified using the three timestamp columns mentioned earlier.

When emptying a DDS table we should not use DELETE FROM (i.e. delete all records); instead, we should use the TRUNCATE TABLE command. This applies to both SQL Server 2005 and Oracle 10g. Truncate is near-instantaneous because it is only minimally logged, whereas delete takes a lot longer because every row is logged. Teradata does not have truncate; your best bet there is DELETE ... ALL. A MultiLoad delete step can also be used, but it could be a bit slower. When you use the incremental approach, try to get the three timestamp columns indexed if possible, as this will significantly reduce the loading time, especially for a large dimension.
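A sketch of the incremental selection, assuming @last_load_time holds the timestamp of the previous successful load and the ODS table carries the three timestamp columns (the names here are assumptions):

-- Pick up only the rows that changed since the last successful load
SELECT *
FROM ods.dbo.dimension
WHERE created_time      > @last_load_time
   OR last_updated_time > @last_load_time
   OR deleted_time      > @last_load_time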

Generally speaking, truncate and reload is faster than incremental. If the number of rows is below 10,000 and the dimension width (total bytes in a row) is less than 1,000, truncate and reload is faster. If your dimension has more than 10,000 rows and the width is more than 1,000 bytes, then there is a significant chance that the incremental approach can beat truncate and reload.

Overlay

Consider a case in the airline industry, where you have a dimension called crew, which contains the pilots, co-pilots, navigators and cabin crew. There are two source systems that can be used to populate this dimension: the flight information system (FIS) and the human resources information system (HRIS). The FIS has an air_crew table, which is keyed on crew_id. The HRIS has an employee table, which is keyed on employee_id. Fortunately, the air_crew table also contains employee_id. The employee table is richer than the air_crew table: it contains 36 attributes, including title, name, address, employment band, benefits, employment start date, normal working hours, salary, normal position, secondary position, rank, employment status, date of birth, bank details, date created and date last modified. The air_crew table, on the other hand, only contains title, name and positions, plus two timestamp columns. The employee table contains only permanent employees (700 records); it does not contain non-employees or temps, whereas the air_crew table contains all crew on every flight (1,000 records), both temp and permanent.

The crew dimension in the data warehouse contains title, name, address, working hours, positions, date of birth, employment status and rank, plus the standard dimensional control columns. In this case we need to load from two sources. This is called an overlay. How do you load with an overlay? Which table is the base table and which one is the top-up? The employee table is richer by columns but not complete by rows; air_crew is not rich but is complete by rows.

As a principle, we take the table that is complete by rows as the base (primary) table, and the table that is richer by columns as the top-up (secondary) table. Here is how it works: load the base table first, using an upsert, setting the missing attributes to their default values or blank. After we have the complete set of rows in the dimension table, we then load the secondary table, using an update. We don't do inserts or deletes when loading the secondary, unless we are using SCD 2 – see below. Don't forget to use ISNULL (if you use SQL Server) to change NULL into 0, a low-value date or a blank string. A sketch of the top-up update follows.
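A minimal sketch of the top-up (secondary) update, assuming the crew dimension joins to the HRIS employee table on employee_id as described above (dim_crew, hris and the individual column names are assumptions):

-- Second pass of the overlay: update only, no insert or delete.
-- The base rows were already upserted from the air_crew table.
UPDATE crew
SET crew.address           = ISNULL(emp.address, ''),
    crew.date_of_birth     = ISNULL(emp.date_of_birth, '19000101'),
    crew.employment_status = ISNULL(emp.employment_status, ''),
    crew.load_time         = @current_load_time
FROM dw.dbo.dim_crew crew
INNER JOIN hris.dbo.employee emp
  ON emp.employee_id = crew.employee_id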

SCD 2

Slowly Changing Dimension (SCD) type 2 is basically a technique for keeping history on a dimension. This is done by marking the existing dimension record as 'outdated' or 'no longer valid', then creating a new record containing the new attributes and stamping it as 'valid', as illustrated below:

Before the change

surrogate_key  natural_key  attribute  effective_date     expired_date         most_recent
1              10           A          1/1/1900 00:00:00  31/12/2200 00:00:00  1

After the change

surrogate_key  natural_key  attribute  effective_date      expired_date         most_recent
1              10           A          1/1/1900 00:00:00   6/10/2006 10:21:19   0
2              10           B          6/10/2006 10:21:20  31/12/2200 00:00:00  1

The concept of SCD was first described by Ralph Kimball and Margy Ross in their book The Data Warehouse Toolkit, chapter 4. They later wrote several articles on SCD: SCD Are Not Always As Easy As 123 (Margy Ross and Ralph Kimball, March 2005), Delivering Both Historical and Current Perspectives (Margy Ross, May 2004), Combining SCD Techniques (Margy Ross, October 2000) and Processing Slowly Changing Dimensions During Initial Data Load (Lawrence Corr, June 2000). If SCD is new to you, it would help to familiarize yourself with the concept by reading these materials.

Now let's talk about loading data into an SCD type 2 dimension table. Each SCD 2 table should have effective date and expiry date columns; otherwise, if we were asked what the state of the dimension was at a certain date in the past, we could not answer. Using these two date columns we know the state (attribute values) at any point in time.

When a dimension record is created, the Effective Date column is set to a low-value date (such as 1/1/1900) and the Expired Date is set to a high-value date, such as 31/12/2200. This indicates that the record is valid at any time, i.e. there is no time limit. When we have an update, we set the Expired Date column of the existing row to the current time. Then we insert a new row, setting its Effective Date to the current time and its Expired Date to the high-value date. This way, we explicitly declare the validity period of both the old row and the new row. Please see the table above (the one labelled "After the change") for an illustration.

The Effective Date and Expired Date columns are very useful, but unfortunately they are slow to query. The most frequent use of an SCD 2 dimension is to support fact table roll-ups with the current attribute values. To do this using Effective Date and Expired Date, our query would be something like WHERE Effective_Date < @todays_date AND Expired_Date > @todays_date. Hence the idea of the most_recent column: most queries are about the active row, not the history. most_recent is a binary or bit column which indicates which row is the active one. Now our query becomes WHERE most_recent = 1, which executes a lot more quickly, especially if the most_recent column is indexed. The most_recent column should be indexed anyway. A sketch of the expire-and-insert logic follows.
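A minimal sketch of the SCD 2 expire-and-insert, using the column names from the tables above. The change detection against the staged source is simplified here to a single attribute column, and both statements use the same @current_load_time:

-- 1. Expire the current version of each changed row
UPDATE dim
SET dim.expired_date = @current_load_time,
    dim.most_recent  = 0
FROM dw.dbo.dimension dim
INNER JOIN stg.dbo.source src
  ON src.natural_key = dim.natural_key
WHERE dim.most_recent = 1
AND dim.attribute <> src.attribute

-- 2. Insert the new version, valid from now until the high-value date
INSERT INTO dw.dbo.dimension
  (natural_key, attribute, effective_date, expired_date, most_recent)
SELECT src.natural_key, src.attribute, @current_load_time, '22001231', 1
FROM stg.dbo.source src
INNER JOIN dw.dbo.dimension dim
  ON dim.natural_key = src.natural_key
WHERE dim.expired_date = @current_load_time  -- the rows expired in step 1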

In practice SCD 2 is used in conjunction with SCD 1. Please read the article by Ralph Kimball and Margy Ross titled "SCD Are Not Always As Easy As 123".

SCD 3

In Slowly Changing Dimension type 3 we keep the old attribute value in a separate column.

Before the change

surrogate_key  natural_key  attribute1  effective_date1      attribute2  effective_date2
1              10           A           26/12/2005 00:00:00

After the change

surrogate_key  natural_key  attribute1  effective_date1     attribute2  effective_date2
1              10           B           6/10/2006 00:00:00  A           26/12/2005 00:00:00

The effective date columns help us determine the value of the attribute at a specific point in time. Yes, we need an effective date column for each attribute column, and yes, we can have as many columns as we like per attribute; for example, if we want to keep 5 previous values, we need 6 pairs of columns. In practice this is a painful and slow process, both for updating and for querying. For every updated row we need to move attribute4 to attribute5 (overwriting it), attribute3 to attribute4, 2 to 3, 1 to 2, and put the new value into attribute1 – and the same for the effective date columns. See why it's slow? A sketch of this shifting update follows this paragraph.
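As a sketch, the shifting update might look like this in a single statement, following the column numbering described above. In T-SQL all column references on the right-hand side of SET use the pre-update values, so the shift is safe within one statement:

-- Shift the history columns down one slot and put the new value in attribute1
UPDATE dim
SET dim.attribute5 = dim.attribute4, dim.effective_date5 = dim.effective_date4,
    dim.attribute4 = dim.attribute3, dim.effective_date4 = dim.effective_date3,
    dim.attribute3 = dim.attribute2, dim.effective_date3 = dim.effective_date2,
    dim.attribute2 = dim.attribute1, dim.effective_date2 = dim.effective_date1,
    dim.attribute1 = src.attribute,  dim.effective_date1 = @current_load_time
FROM dw.dbo.dimension dim
INNER JOIN stg.dbo.source src
  ON src.natural_key = dim.natural_key
WHERE dim.attribute1 <> src.attribute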

The query is fast only if we want to know the current values, because the current values are kept in the attribute1 column. But if we want to know the value of the attribute at a particular point in time in the past, it is complicated and slow. In practical implementations this structure is not effective for answering questions like "give me the value of the attributes on a certain date", or for rolling up historical facts to current attribute values, so in practice SCD 3 is used in conjunction with SCD 2, as illustrated by Margy Ross in her article Combining SCD Techniques (October 2000), particularly her last paragraph, quoted here:
"We've described a hybrid approach that combines the three fundamental SCD techniques. We're creating new rows to capture change (Type 2), adding attributes to reflect an alternative view of the world (Type 3), which are overwritten for all earlier dimension rows for a given product (Type 1). As a student recently suggested, perhaps we should refer to this as Type 6 (2+3+1)…"

In the next part we will continue with loading the fact tables.

Vincent Rainardi
20th October 2006
