Generating Surrogate-Keys for Type 1 and Type 2 dimensions using SSIS

,

Generating Surrogate-Keys for Type 1 and Type 2 dimensions using SSIS

Surrogate-keys are imperative

to data warehousing. The performance advantages of implementing surrogate-keys along

with their usefulness in deploying slowly changing dimensions are well

documented in many industry standard books. Yet there are complexities in

implementing them such as maintaining referential integrity without hurting the

speed of the surrogate-key replacement pipeline and correctly sequencing the

surrogate-keys for the data coming in at irregular intervals. These challenges can

impede the data warehouse availability timelines.

SQL Server Integration Services

provides a framework for efficiently developing such essential data-warehousing

ETL mechanisms. This article provides an innovative approach for generating surrogate-keys

for Type 1 and Type 2 slowly changing dimensions and then implementing a highly

scalable, high-throughput surrogate-key replacement pipeline for an EDW using

SSIS that scales on enterprise-class hardware for a highly parallelized, large

scale data warehouse load.

Prerequisites

        

Introductory knowledge of fact tables, dimensions

and star schema is useful.

What is a Type 1 or Type 2 dimension?

As found in some popular

articles, Type 1 dimension is the most effective way of presenting the AS-IS

view of the business in a data-warehouse but it does not maintain any history

of the dimensional changes. Type 2 dimension, on the other hand, helps in tracking

history. It not only provides the AS-IS view, but also the AS-WAS view of the

business for any point in time. Reference and additional information on slowly

changing dimensions is provided on the link: http://www.dbmsmag.com/9604d05.html

What are Surrogate-Keys?

A standard data warehouse implementation

consists of generating surrogate-keys for all members in the dimensions. Natural keys or codes coming from a source

OLTP system can be quite large in size and/ or inconsistent in nature. Data warehouse architects like to replace

these source keys with a nice four byte integer in order to ensure consistency,

as well as reduce the size of the tables.

These keys also help in the creation of Type 1 and Type 2 slowly

changing dimensions.

Process Overview

The process involves generating

unique surrogate-keys for every dimension member in the data warehouse, managing

referential integrity, and then constructing a surrogate-key pipe-line for

replacing all the incoming fact data with the newly generated surrogate keys. Reference

and additional information on surrogate-keys is provided on the following link:

http://www.dbmsmag.com/9806d05.html

The techniques in this article

tie together some of the industry standard concepts and ideas available in

popular books, blogs and articles with the technological features of SSIS 2005.

The example used here has been simplified for easy understanding, and presented

in the most granular way so that it can be leveraged for any complex and

large-scale objectives.

The slowly changing dimension

task (wizard) that comes with SSIS does not generate surrogate keys. Also,

using an identity column is not a scalable approach for generating surrogate-keys

because it requires constant reading and writing to the database. In order to

make a scalable solution, we need to read from the database just once to get

the highest value of existing surrogate keys, and then programmatically

generate new keys without having to read from the database again. The following

link serves as a reference for this concept:

http://sqljunkies.com/WebLog/sqlbi/archive/2005/05/30/15684.aspx

The tables below show example

columns for Type 1 and Type 2 dimension tables. These tables contain one

natural key column (CODE), one surrogate-key column (SK) and one attribute column

(DESC). The Type 2 dimension table additionally contains a creation timestamp

(TS) and a current record identifier (CURR).

Figure : Sample Type 1

dimension

Figure : Sample Type 2

dimension

Note that dimension member with

code C has multiple records in the Type 2 dimension table. Every time a member incurs a change, a new

record is created to track the history of changes. Type 1 dimension, in contrast, only keeps the

last incurred change. Also note that dimension

member with code Z in Figure 2 was at one point unknown (see SK = 6). It means that the code was first encountered

in the fact data before it showing up in the definition lists feeding the

dimension tables. Code Z was therefore considered an Early Fact and its

attributes were assigned an unknown status. The correct attribute definitions were

imparted to it in the next ETL runs as they became available (see SK = 8).

In order to implement the surrogate-key

mechanism, three Data Flow tasks are created on the Control Flow window enumerated

for updating attribute changes, inserting new dimension members and inserting

early facts. For a rapidly changing dimension, the updates should precede the

inserts. But for relatively slower changing dimension, running inserts prior to

updates will prove quicker.

Figure : Data Flow Task 1: Update Attrib Changes

The attribute updating process

starts with querying the source (staging data) for the dimensions, and filtering

out the existing dimension members through a Lookup task based on the natural keys, and then writes the list of new

(or non-existing) dimension members to a Raw

File. These Raw Files must be

placed on high performance storage subsystems since they would require fast

I/O.

From the existing dimension

members, a subsequent Lookup task

picks only the dimension records where the attribute changes have occurred and

then feeds them to an OLE DB command

task that updates dimension table with changes (see Figure 3).

The lookups for attribute

changes can be further optimized by using Checksum

values instead of column to column comparisons.

OLE DB Command Transform uses a Native

OLE DB\SQL Native Client connection for updating the attributes. The

following SQL command updates the dimension data by taking changed attributes

as parameters.

UPDATE [dbo].[DIM_dimensionname_Type1]

SET [dimensionname_DESC] = ? WHERE

[dimensionname_CODE] = ?

Clustered index on the

dimension tables might be helpful in such update tasks, if used tactically and

dropped during the insert tasks.

Figure 4: Data Flow Task 2: Insert New Dims

The previous

data flow task (1) had detected the new dimension members and written them to a

Raw File, eliminating the need for repeating

the lookup.

These new dimension members stored

in the Raw File (see Figure 4) are now

passed through a Script task that

generates surrogate-keys and then inserts them into the dimension table via an OLE DB destination task. The inserting

process is sped up by avoiding check constraints.

A sample script for generating the

surrogate-keys for new dimensions are displayed below. Please note that there

is an internal connection formed for retrieving the highest existing surrogate-key

value from within the script through the .NET

Provider\SqlClient Data Provider (see Sub AcquireConnection on line 14).

Script Transform Script

Imports System

Imports System.Data

Imports System.Data.SqlClient

Imports System.Math

Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper

Imports Microsoft.SqlServer.Dts.Runtime.Wrapper

 

Declare the variables

Public Class

ScriptMain

Inherits UserComponent

Dim Counter As Integer = 0

Dim connMgr As

IDTSConnectionManager90

Dim sqlConn As

SqlConnection

 

Form .net database connection

Public Overrides Sub AcquireConnections(ByVal

Transaction As Object)

connMgr = Me.Connections.PLAYDWNET

sqlConn = CType(connMgr.AcquireConnection(Nothing), SqlConnection)

End Sub

 

Initialize the counter

Public Overrides Sub Input0_ProcessInputRow(ByVal

Row As Input0Buffer)

Counter =

Counter + 1

Row.dimensionnameSK = Counter

End Sub

 

Preexecute to fetch the highest existing sk

Public Overrides Sub PreExecute()

MyBase.PreExecute()

Dim sqlComm As New SqlCommand("select

max(dimensionname_SK) as LAST_SK from DIM_dimensionname_Type1",

sqlConn)

Dim r As

SqlDataReader = sqlComm.ExecuteReader()

While r.Read()

Counter = CInt(r("LAST_SK"))

End While

r.Close()

End Sub

 

End Class

Variations for the Type 2 Dimension

In contrast to a Type 1

dimension, a Type 2 dimension design inserts a new record for attribute changes

hence retaining the history records instead of updating them. The current record

identifier for each existing dimension record where the change took place is

changed to false or 0 and the same dimension member is inserted again with a

new surrogate-key along with the changed attributes, the creation timestamp and

a true or 1 value for the current record identifier.

Prior to a lookup, the existing

dimension members are filtered down to the current records only using the query

below in order to economize the operations on a multi-million member Type 2

dimension.

SELECT

dimensionname_CODE

, dimensionname_DESC

, dimensionname_CURR

FROM DIM_dimensionname_Type2

WHERE dimensionname_CURR = 1

The resulting list is used to

identify the new and the changed members, and accordingly, the current record identifiers

for the expiring records are updated to 0 via the following statement:

UPDATE [dbo].[DIM_dimensionname_Type2]

SET [dimensionname_CURR] = 0 WHERE [dimensionname_CODE] = ?

New surrogate-keys are

generated along with the current creation timestamps and the current record identifiers

are set to 1 using the following snippet:

Script Transform Script

Counter = Counter + 1

Row.dimensionnamesurrogate-key =

Counter

Row.dimensionnameCURR = 1

Row.dimensionnameTS = Now()

Figure 5: Data Flow Task 3: Insert

Unknown for Early Fact

 

The incoming fact records typically

land in staging tables or in flat files depending on the architecture. A group by operation on the fact records can

get a unified list of the incoming dimension members. This list can then be used in a lookup to

identify and handle all the dimension members that are still unknown to the

dimension tables and can cause referential integrity problems.

Script Transform Script

Counter = Counter + 1

Row.dimensionnamesurrogate-key =

Counter

Row.dimensionnameDESC = "Unknown"

Row.dimensionnameCURR = 1

Row.dimensionnameTS = Now()

There can be three options for

handling referential integrity:

  1. During extraction, using an outer-join with

    the dimension tables;

  2. Before insert, using the aggregate transform to

    obtain a list of early facts;

  3. During insert, using a special lookup table

    updated in cache to track the unknowns.

This article describes the No.

2 approach, which has an overhead of aggregation (group by) prior to the

pipeline but has the advantage of simplicity later.

Even if there are multiple

unknown natural keys at one time, a separate surrogate-key is assigned to each

of them they can be differentiated later. Blanks and nulls must be explicitly

handled in a dimensional design by specifying a valid description such as

Blank and ensuring that it accounts for the business rules or anomalies in

the transactional systems that generated blank values in the first place.

How to Use the Type 2 Dimension

Subselect SQL Query

        

AS-IS view of the data:

The following query brings all

the active records.

select

dimensionname_SK

, dimensionname_CODE

, dimensionname_DESC

 

from dbo.DIM_dimensionname_Type2

 

where dimensionname_CURR = 1

order by dimensionname_CODE

        

AS-WAS view of the data:

The following query brings all valid

records, as of a particular date:

With Tempquery (WAS_SK, MAX_TS) AS

(select

max(dimensionname_surrogate

key) as

dimensionname_WASSK

,max(dimensionname_TS) as dimensionname_MAXTS

from dbo.DIM_dimensionname_Type2

where dimensionname_TS <=

('2005-11-19')

group by

dimensionname_CODE)

 

select

dimensionname_SK

, dimensionname_CODE

, dimensionname_DESC

from dbo.DIM_dimensionname_Type2

 

where

dimensionname_SK

in (select WAS_SK from Tempquery)

 

order by dimensionname_CODE

More information on using Type

2 dimenesion is provided on the following link:

http://www.kimballgroup.com/html/designtipsPDF/DesignTips2000%20/KimballDT8Perfectly.pdf

Further Enriching the Type 2 Dimension

Other things can be added to

enrich the Type 2 dimension:

  • Expiration Date: This article uses only the Creation Date

    as an example. Storing Expiration

    Date can be helpful in AS-WAS queries.

  • Change Version Number: A change number to the dimension records

    can also be assigned to quickly track how many times a record has been

    versioned, starting from 0.

 

Figure 6: Surrogate-Key Pipeline

In the pipeline seen in Figure

6, all the surrogate-key replacements on each fact data row for every conformed

dimension occur uninterruptedly. As soon as a row gets a replacement for one

lookup and moves on to the next lookup, the subsequent row follows suit. A

cluster of rows are written to the disk only after they go though their surrogate

key replacements in memory. The dimension lookup tables are cached for faster

access. Moreover, both incoming fact data and lookup tables can be presorted

for efficient replacements.

We can even have multiple surrogate-key

replacement pipelines running in parallel using conformed dimensions to meet an

aggressive service level agreement. Since

referential integrity was already resolved, the replacement process then became

very effective.

Download the code

Conclusion

This article covered an

uncomplicated yet highly scalable method of generating surrogate-keys for Type

1 and Type 2 dimensions, and then replacing the natural keys in the fact data with

the appropriate surrogate-keys by using a high performance surrogate-key pipeline.

SSIS provides a graciously scalable framework for implementing such data

warehousing procedures that can take advantage of multi-processor servers in

scaling up.

Rate

Share

Share

Rate