A surrogate key is an auto generated value, usually integer, in the dimension table. It is made the primary key of the table and is used to join a dimension to a fact table. Among other benefits, surrogate keys allow you to maintain history in a dimension table. Despite of the their popularity, SSIS doesn’t have a built in solution for generating surrogate keys. Let’s take a look at a few alternatives in this post.
First, create the following table. We will import data from Person.Person table (AdventureWorks sample) into this table. Note PersonSK is the surrogate key.
CREATE TABLE [dbo].Person
[PersonSK] INT IDENTITY(1,1) NOT NULL
, [FirstName] NVARCHAR(50) NULL
, [LastName] NVARCHAR(50) NULL
, CONSTRAINT PK_PersonSK PRIMARY KEY [PersonSK]
Drag a data flow task on to the control flow and configure the OLE DB source.
Next, drag OLE DB destination and connect it to the source. Specify the connection manager, and choose Table or view – fast load as the data access mode. This performs a bulk insert and is the fastest of all.
Destination table has three columns, but source has only two columns. In the mappings page, map input and output columns for FirstName and LastName and ignore the mapping for PersonSK.
When you run the package, becasue PersonSK is an identity column in the table, SQL Server will automatically generate values for you. This solution is easy and fast, but sometimes depending on your ETL methodology, you can’t rely on IDENTITY().
I frequently use Script Transformation. The steps are nicely written in this post by Phil Brammer (b). This is simply a script used as a transformation. The script generates a row number for each row passed through the data flow.
SSIS (data flow) engine generated the new row number when using script transformation. Instead, you can use ROW_NUMBER() when working with SQL Server data source to let the database engine do the work for you. This can be faster.
If you’re doing an incremental load, first find the maximum key value from the destination. I’ll use the following query. It’ll return zero if there were no rows, else it returns the maximum value.
SELECT ISNULL(MAX(PersonSK),0) SK
Add an Execute SQL Task to the control flow and set the result property to single row. Then, add a variable to hold the return value.
Next, connect a data flow task to the execute sql task. We will use the following SQL statement in the OLE DB source editor. In addition to the LastName and FirstName columns, we are using ROW_NUMBER() function to generate a unique number for every row.
ROW_NUMBER() OVER(ORDER BY [LastName]) SK
The query will generate numbers starting from 1 for each row, but while loading to destination we don’t want to start from 1. We want to find the maximum value in the destination and start from the next highest value. So, I’m adding the max value to every row number using parameters in the OLE DB source.
In the OLE DB destination, check the box that says Keep Identity. By do this we are asking SSIS to keep the identity values that are generated by the source. In the mappings page, you’ll see a new input that was created in the OLE DB source. Map it to PersonSK surrogate key.
Go ahead and run the package. If it’s all good you will see cute little green tick marks like below.
In this example, we looked at different options to generate surrogate keys while loading dimensions using SSIS. We used IDENTITY() constraint in SQL Server. We talked about Script Component. Finally, we saw making use ROW_NUMBER() function. Last option is twice as fast as using the Script Component with around 20,000 rows and an index on LastName column.