Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 

Generating Surrogate-Keys for Type 1 and Type 2 dimensions using SSIS

By Harsh Bhaiya,

Generating Surrogate-Keys for Type 1 and Type 2 dimensions using SSIS

Surrogate-keys are imperative to data warehousing. The performance advantages of implementing surrogate-keys along with their usefulness in deploying slowly changing dimensions are well documented in many industry standard books. Yet there are complexities in implementing them such as maintaining referential integrity without hurting the speed of the surrogate-key replacement pipeline and correctly sequencing the surrogate-keys for the data coming in at irregular intervals. These challenges can impede the data warehouse availability timelines.

SQL Server Integration Services provides a framework for efficiently developing such essential data-warehousing ETL mechanisms. This article provides an innovative approach for generating surrogate-keys for Type 1 and Type 2 slowly changing dimensions and then implementing a highly scalable, high-throughput surrogate-key replacement pipeline for an EDW using SSIS that scales on enterprise-class hardware for a highly parallelized, large scale data warehouse load.

Prerequisites

         Introductory knowledge of fact tables, dimensions and star schema is useful.

What is a Type 1 or Type 2 dimension?

As found in some popular articles, Type 1 dimension is the most effective way of presenting the AS-IS view of the business in a data-warehouse but it does not maintain any history of the dimensional changes. Type 2 dimension, on the other hand, helps in tracking history. It not only provides the AS-IS view, but also the AS-WAS view of the business for any point in time. Reference and additional information on slowly changing dimensions is provided on the link: http://www.dbmsmag.com/9604d05.html

What are Surrogate-Keys?

A standard data warehouse implementation consists of generating surrogate-keys for all members in the dimensions. Natural keys or codes coming from a source OLTP system can be quite large in size and/ or inconsistent in nature. Data warehouse architects like to replace these source keys with a nice four byte integer in order to ensure consistency, as well as reduce the size of the tables. These keys also help in the creation of Type 1 and Type 2 slowly changing dimensions.

Process Overview

The process involves generating unique surrogate-keys for every dimension member in the data warehouse, managing referential integrity, and then constructing a surrogate-key pipe-line for replacing all the incoming fact data with the newly generated surrogate keys. Reference and additional information on surrogate-keys is provided on the following link: http://www.dbmsmag.com/9806d05.html

The techniques in this article tie together some of the industry standard concepts and ideas available in popular books, blogs and articles with the technological features of SSIS 2005. The example used here has been simplified for easy understanding, and presented in the most granular way so that it can be leveraged for any complex and large-scale objectives.

The slowly changing dimension task (wizard) that comes with SSIS does not generate surrogate keys. Also, using an identity column is not a scalable approach for generating surrogate-keys because it requires constant reading and writing to the database. In order to make a scalable solution, we need to read from the database just once to get the highest value of existing surrogate keys, and then programmatically generate new keys without having to read from the database again. The following link serves as a reference for this concept: http://sqljunkies.com/WebLog/sqlbi/archive/2005/05/30/15684.aspx

The tables below show example columns for Type 1 and Type 2 dimension tables. These tables contain one natural key column (CODE), one surrogate-key column (SK) and one attribute column (DESC). The Type 2 dimension table additionally contains a creation timestamp (TS) and a current record identifier (CURR).

Figure : Sample Type 1 dimension

Figure : Sample Type 2 dimension

Note that dimension member with code C has multiple records in the Type 2 dimension table. Every time a member incurs a change, a new record is created to track the history of changes. Type 1 dimension, in contrast, only keeps the last incurred change. Also note that dimension member with code Z in Figure 2 was at one point unknown (see SK = 6). It means that the code was first encountered in the fact data before it showing up in the definition lists feeding the dimension tables. Code Z was therefore considered an Early Fact and its attributes were assigned an unknown status. The correct attribute definitions were imparted to it in the next ETL runs as they became available (see SK = 8).

In order to implement the surrogate-key mechanism, three Data Flow tasks are created on the Control Flow window enumerated for updating attribute changes, inserting new dimension members and inserting early facts. For a rapidly changing dimension, the updates should precede the inserts. But for relatively slower changing dimension, running inserts prior to updates will prove quicker.

Figure : Data Flow Task 1: Update Attrib Changes

The attribute updating process starts with querying the source (staging data) for the dimensions, and filtering out the existing dimension members through a Lookup task based on the natural keys, and then writes the list of new (or non-existing) dimension members to a Raw File. These Raw Files must be placed on high performance storage subsystems since they would require fast I/O.

From the existing dimension members, a subsequent Lookup task picks only the dimension records where the attribute changes have occurred and then feeds them to an OLE DB command task that updates dimension table with changes (see Figure 3).

The lookups for attribute changes can be further optimized by using Checksum values instead of column to column comparisons. OLE DB Command Transform uses a Native OLE DB\SQL Native Client connection for updating the attributes. The following SQL command updates the dimension data by taking changed attributes as parameters.

UPDATE [dbo].[DIM_dimensionname_Type1]

SET [dimensionname_DESC] = ? WHERE [dimensionname_CODE] = ?

Clustered index on the dimension tables might be helpful in such update tasks, if used tactically and dropped during the insert tasks.

Figure 4: Data Flow Task 2: Insert New Dims

The previous data flow task (1) had detected the new dimension members and written them to a Raw File, eliminating the need for repeating the lookup.

These new dimension members stored in the Raw File (see Figure 4) are now passed through a Script task that generates surrogate-keys and then inserts them into the dimension table via an OLE DB destination task. The inserting process is sped up by avoiding check constraints.

A sample script for generating the surrogate-keys for new dimensions are displayed below. Please note that there is an internal connection formed for retrieving the highest existing surrogate-key value from within the script through the .NET Provider\SqlClient Data Provider (see Sub AcquireConnection on line 14).

Script Transform Script

Imports System

Imports System.Data

Imports System.Data.SqlClient

Imports System.Math

Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper

Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

 

Declare the variables

Public Class ScriptMain

Inherits UserComponent

Dim Counter As Integer = 0

Dim connMgr As IDTSConnectionManager90

Dim sqlConn As SqlConnection

 

Form .net database connection

Public Overrides Sub AcquireConnections(ByVal Transaction As Object)

connMgr = Me.Connections.PLAYDWNET

sqlConn = CType(connMgr.AcquireConnection(Nothing), SqlConnection)

End Sub

 

Initialize the counter

Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)

Counter = Counter + 1

Row.dimensionnameSK = Counter

End Sub

 

Preexecute to fetch the highest existing sk

Public Overrides Sub PreExecute()

MyBase.PreExecute()

Dim sqlComm As New SqlCommand("select max(dimensionname_SK) as LAST_SK from DIM_dimensionname_Type1", sqlConn)

Dim r As SqlDataReader = sqlComm.ExecuteReader()

While r.Read()

Counter = CInt(r("LAST_SK"))

End While

r.Close()

End Sub

 

End Class

Variations for the Type 2 Dimension

In contrast to a Type 1 dimension, a Type 2 dimension design inserts a new record for attribute changes hence retaining the history records instead of updating them. The current record identifier for each existing dimension record where the change took place is changed to false or 0 and the same dimension member is inserted again with a new surrogate-key along with the changed attributes, the creation timestamp and a true or 1 value for the current record identifier.

Prior to a lookup, the existing dimension members are filtered down to the current records only using the query below in order to economize the operations on a multi-million member Type 2 dimension.

SELECT

dimensionname_CODE

, dimensionname_DESC

, dimensionname_CURR

FROM DIM_dimensionname_Type2

WHERE dimensionname_CURR = 1

The resulting list is used to identify the new and the changed members, and accordingly, the current record identifiers for the expiring records are updated to 0 via the following statement:

UPDATE [dbo].[DIM_dimensionname_Type2]

SET [dimensionname_CURR] = 0 WHERE [dimensionname_CODE] = ?

New surrogate-keys are generated along with the current creation timestamps and the current record identifiers are set to 1 using the following snippet:

Script Transform Script

Counter = Counter + 1

Row.dimensionnamesurrogate-key = Counter

Row.dimensionnameCURR = 1

Row.dimensionnameTS = Now()

Figure 5: Data Flow Task 3: Insert Unknown for Early Fact

 

The incoming fact records typically land in staging tables or in flat files depending on the architecture. A group by operation on the fact records can get a unified list of the incoming dimension members. This list can then be used in a lookup to identify and handle all the dimension members that are still unknown to the dimension tables and can cause referential integrity problems.

Script Transform Script

Counter = Counter + 1

Row.dimensionnamesurrogate-key = Counter

Row.dimensionnameDESC = "Unknown"

Row.dimensionnameCURR = 1

Row.dimensionnameTS = Now()

There can be three options for handling referential integrity:

  1. During extraction, using an outer-join with the dimension tables;
  2. Before insert, using the aggregate transform to obtain a list of early facts;
  3. During insert, using a special lookup table updated in cache to track the unknowns.

This article describes the No. 2 approach, which has an overhead of aggregation (group by) prior to the pipeline but has the advantage of simplicity later.

Even if there are multiple unknown natural keys at one time, a separate surrogate-key is assigned to each of them they can be differentiated later. Blanks and nulls must be explicitly handled in a dimensional design by specifying a valid description such as Blank and ensuring that it accounts for the business rules or anomalies in the transactional systems that generated blank values in the first place.

How to Use the Type 2 Dimension

Subselect SQL Query

         AS-IS view of the data:

The following query brings all the active records.

select

dimensionname_SK

, dimensionname_CODE

, dimensionname_DESC

 

from dbo.DIM_dimensionname_Type2

 

where dimensionname_CURR = 1

order by dimensionname_CODE

         AS-WAS view of the data:

The following query brings all valid records, as of a particular date:

With Tempquery (WAS_SK, MAX_TS) AS

(select

max(dimensionname_surrogate key) as dimensionname_WASSK

,max(dimensionname_TS) as dimensionname_MAXTS

from dbo.DIM_dimensionname_Type2

where dimensionname_TS <= ('2005-11-19')

group by dimensionname_CODE)

 

select

dimensionname_SK

, dimensionname_CODE

, dimensionname_DESC

from dbo.DIM_dimensionname_Type2

 

where

dimensionname_SK in (select WAS_SK from Tempquery)

 

order by dimensionname_CODE

More information on using Type 2 dimenesion is provided on the following link:

http://www.kimballgroup.com/html/designtipsPDF/DesignTips2000%20/KimballDT8Perfectly.pdf

Further Enriching the Type 2 Dimension

Other things can be added to enrich the Type 2 dimension:

  • Expiration Date: This article uses only the Creation Date as an example. Storing Expiration Date can be helpful in AS-WAS queries.
  • Change Version Number: A change number to the dimension records can also be assigned to quickly track how many times a record has been versioned, starting from 0.

 

Figure 6: Surrogate-Key Pipeline

In the pipeline seen in Figure 6, all the surrogate-key replacements on each fact data row for every conformed dimension occur uninterruptedly. As soon as a row gets a replacement for one lookup and moves on to the next lookup, the subsequent row follows suit. A cluster of rows are written to the disk only after they go though their surrogate key replacements in memory. The dimension lookup tables are cached for faster access. Moreover, both incoming fact data and lookup tables can be presorted for efficient replacements.

We can even have multiple surrogate-key replacement pipelines running in parallel using conformed dimensions to meet an aggressive service level agreement. Since referential integrity was already resolved, the replacement process then became very effective.

Download the code

Conclusion

This article covered an uncomplicated yet highly scalable method of generating surrogate-keys for Type 1 and Type 2 dimensions, and then replacing the natural keys in the fact data with the appropriate surrogate-keys by using a high performance surrogate-key pipeline. SSIS provides a graciously scalable framework for implementing such data warehousing procedures that can take advantage of multi-processor servers in scaling up.

Total article views: 11442 | Views in the last 30 days: 6
 
Related Articles
SCRIPT

Enforcing surrogate key uniqueness in slowly changing dimension

Enforcing surrogate key uniqueness regards natural key in slowly changing dimension

BLOG

Surrogate Keys

When building a data warehouse, it is important that primary keys of dimension tables remain stable....

ARTICLE

Slowly changing dimensions using T-SQL MERGE

Of all the technical solutions to the problem of slowly changing dimensions, the T-SQL MERGE stateme...

FORUM

Change Dimension ID. HOW?

How do I change dimension ID?

FORUM

Slowly Changing dimensions Design and ETL Help

Slowly Changing dimensions Design and ETL Help

 
Contribute

Join the most active online SQL Server Community

SQL knowledge, delivered daily, free:

Email address:  

You make SSC a better place

As a member of SQLServerCentral, you get free access to loads of fresh content: thousands of articles and SQL scripts, a library of free eBooks, a weekly database news roundup, a great Q & A platform… And it’s our huge, buzzing community of SQL Server Professionals that makes it such a success.

Join us!

Steve Jones
Editor, SQLServerCentral.com

Already a member? Jump in:

Email address:   Password:   Remember me: Forgotten your password?
Steve Jones