Data warehouse load processes typically begin with steps to load dimension tables with master data and create surrogate dimension keys. These steps are then followed by fact table loading procedures, which lookup the surrogate dimension keys created in the previous dimension loading steps. The diagram below illustrates typical logic used in fact loading.
Figure 1 - Sample Fact Load Logic
But what happens when a lookup does not have a match, as shown in the diagram? In data warehousing, this scenario occurs with early arriving facts, where a fact contains master dimension data that has not yet been loaded in a dimension. Common suggested solutions to this problem include:
- Assigning the dimension key field a pre-determined "unknown member" value (i.e. -1 or 0)
- Redirecting the non-matching rows to an error file or table
- Creating and assigning a new surrogate dimension key on the fly
The first two options require further offline analysis as well as re-processing or reloading of the early arriving facts. This process can become cumbersome and difficult to manage. The third case would not require further offline effort, but it is difficult to find an explanation of exactly how to do this.
This article describes a solution built with Microsoft's SQL Server 2008 Integration Services (SSIS) to do the following:
- Create inferred dimension members and new surrogate key values during fact table loading
- Process inferred dimension members with the SSIS Slowly Changing Dimension transform
The article begins by explaining key concepts used. It then continues with an illustrated example of how the solution works, followed by step by step instructions on how to build and test the solution. The solution described is similar to a process outlined in Microsoft's Project Real but with key enhancements for flexibility and reusability.
How Inferred Members Work
The concept behind this solution is to create inferred members, or placeholder rows, in a dimension table for items that appear in transactional data but do not yet appear in the dimension. These inferred member rows are loaded with the minimum amount of information needed to support lookup transformations during fact table loading, with all other dimension attributes loaded with default values or NULL. Once master data becomes available for these inferred members, our dimension loading task will automatically update these rows with values for all additional dimension attributes.
The inferred member solution requires the following primary components:
- Fact-load SSIS Data Flow task with reusable SSIS Script Component
- SQL Server stored procedure to insert inferred dimension members
- Dimension table including Boolean SCDInferredMember_Ind field
- Dimension-load SSIS task using Inferred Member Support
In the fact load task, the Script Component passes parameters from the data stream to the stored procedure. The stored procedure inserts a new dimension row, generates a new surrogate key value, and returns the newly created surrogate key to the Script Component. The Script Component then assigns this value to the dimension key field in the fact processing data stream.
SSIS tasks used to load dimensions are configured to update the inferred member records when master is available.
To demonstrate this solution, we have created a database called DataWarehouse with a simple star schema consisting of one fact table (fact_Sales) and two dimension tables (dim_Product and dim_Date). dim_Product is a Type 1 Slowly Changing Dimension and its business key is Product_Nbr. We also have staging tables for product master data (stg_Product) and sales transactions (stg_Sales). The diagram below illustrates the tables used in this solution.
Figure 2 - Sample Data Warehouse Tables
The diagrams below illustrate sample data used in the initial data warehouse load. Note the initial stg_Sales data highlighted in red includes Products that are not available in the initial product master (stg_Product) table.
Figure 3 - stg_Product Sample Data
Figure 4 - stg_Sales Sample Data
Executing the SSIS dimension task for the first time loads the initial stg_Product data into dim_Product. Note in the results shown below that all attributes of dim_Product are loaded, and SCDInferredMember_Ind is set to 0 (false) for all rows. The SCDInferredMember_Ind field will be used later by the dimension processing task.
Figure 5 - Initial dim_Product Load
Load fact_Sales and Create Inferred Dimension Members
Once the dimension table is loaded, the SSIS fact load task is executed to load fact_Sales with data from stg_Sales. The diagram below shows the rows that were inserted into the fact_Sales table. Note that all fact rows have been loaded with surrogate Product_Key values even though the sales rows for products EE505 and FF606 did not have corresponding rows in dim_Product. The Product_Key values highlighted in red are new surrogate key values generated by the fact load task.
Figure 6 - Initial fact_Sales Load
Looking at dim_Product in the diagram below, we see that new rows were inserted for products EE505 and FF606. Notice for the new rows highlighted in red that the Product_Name field contains a default value(UNKNOWN), the SCDInferredMember_Ind is set to 1 (True), and EE505 is only inserted once even though it appears more than once in the stg_Sales data.
Figure 7 - Newly Created Inferred Members in dim_Product
Load dim_Product Once New Master Data is Available
After updated product master data is available in stg_Product, running the dimension load SSIS task inserts rows for new products and updates the inferred member rows created previously.
The diagram below illustrates the updated product master data in stg_Product. The products highlighted in red are newly available.
Figure 8 - New stg_Product Data
The diagram below shows the data in dim_Product after the dimension loading process has been executed again. Note that for products EE505 and FF606, the Product_Name field has been updated, SCDInferredMember_Ind is set to 0 (False), and the LastUpdated_Tms has been modified.
Figure 9 - New and Updated dim_Product Data
This illustration has shown how the solution creates inferred dimension members during the process of loading the fact table. It has also shown how these inferred members are handled once their master data becomes available.
How to Build the Solution
The following sections describe in detail how the solution works, and include step by step instructions on how to create the objects used.
Step 1 - Create Database Objects
The SQL script below creates the data warehouse tables, creates the stored procedure used to insert inferred dimension members, and inserts static data into dim_Date.
The stored procedure called InsertInferredDimensionMember accepts the input parameters listed below. A script component inside our fact load data flow task will pass these parameters and call the stored procedure.
- DimTableName - Name of dimension table into which inferred member rows are inserted
- ColumnNameList - Comma separated list of dimension table fields to be populated. This list will include business key field names, as well as any other fields to be populated with default values.
- ColumnValueList - Comma separated list of values to load into fields listed above
The stored procedure builds and executes a dynamic SQL statement that inserts one row into the specified table with the values of the fields specified. Since the SQL statement is dynamic, this single stored procedure can be called by any fact load process to create dimension members in any dimension table. The identity property of the surrogate key field assigns surrogate key values for new rows. The stored procedure uses an output parameter to return the identity value (surrogate key) created with the insert.
CREATE TABLE dim_Date( Date_Key INT NOT NULL, [Date] DATETIME NOT NULL, Inserted_Tms DATETIME DEFAULT GETDATE() NOT NULL, InsertedBy_Name VARCHAR(50) DEFAULT SYSTEM_USER NOT NULL, CONSTRAINT PK_dim_Date PRIMARY KEY CLUSTERED (Date_Key), CONSTRAINT UK_dim_Date UNIQUE ([Date]) ) GO CREATE TABLE dim_Product( Product_Key INT IDENTITY(1,1), Product_Nbr VARCHAR(10) NOT NULL, Product_Name VARCHAR(50) NULL, SCDInferredMember_Ind BIT DEFAULT 0 NOT NULL, Inserted_Tms DATETIME DEFAULT GETDATE() NOT NULL, InsertedBy_Name VARCHAR(50) DEFAULT SYSTEM_USER NOT NULL, LastUpdated_Tms DATETIME DEFAULT GETDATE() NULL, LastUpdatedBy_Name VARCHAR(50) DEFAULT SYSTEM_USER NULL, CONSTRAINT PK_dim_Product PRIMARY KEY CLUSTERED (Product_Key), CONSTRAINT UK_dim_Product UNIQUE (Product_Nbr) ) GO CREATE TABLE fact_Sales( Product_Key INT NOT NULL, OrderDate_Key INT NOT NULL, Order_Nbr INT NOT NULL, OrderLine_Nbr INT NOT NULL, Sales_Qty INT NULL, Sales_Amt MONEY NULL, InsertedBy_Name VARCHAR(50) DEFAULT SYSTEM_USER NOT NULL, Inserted_Tms DATETIME DEFAULT GETDATE() NOT NULL, LastUpdatedBy_Name VARCHAR(50) DEFAULT SYSTEM_USER NULL, LastUpdated_Tms DATETIME DEFAULT GETDATE() NULL, CONSTRAINT PK_fact_Sales PRIMARY KEY NONCLUSTERED (Product_Key, OrderDate_Key, Order_Nbr, OrderLine_Nbr), CONSTRAINT FK_fact_Sales_dim_Product FOREIGN KEY (Product_Key) REFERENCES dbo.dim_Product(Product_Key), CONSTRAINT FK_fact_Sales_dim_Date FOREIGN KEY (OrderDate_Key) REFERENCES dbo.dim_Date(Date_Key) ) GO CREATE TABLE stg_Product( Product_Nbr VARCHAR(10) NOT NULL, Product_Name VARCHAR(50) NULL ) GO CREATE TABLE stg_Sales( Order_Nbr INT NULL, OrderLine_Nbr INT NULL, Order_Dt DATETIME NULL, Product_Nbr VARCHAR(10) NULL, Sales_Qty INT NULL, Sales_Amt MONEY NULL ) GO CREATE PROCEDURE [dbo].[InsertInferredDimensionMember] @DimTableName VARCHAR(1000), @ColumnNameList VARCHAR(1000), @ColumnValueList VARCHAR(1000), @SurrogateKey INT OUT AS BEGIN DECLARE @SQLString VARCHAR(MAX) /*Build SQL INSERT query*/ SET @SQLString = 'INSERT INTO ' + RTRIM(@DimTableName) + '(' + @ColumnNameList +') VALUES (' + @ColumnValueList + ')' -- PRINT (@SQLString) /*Execute SQL INSERT to create new dimension row*/ EXEC (@SQLString) /*Retrieve new value of the identity field and return it in an output parameter*/ SET @SurrogateKey = @@IDENTITY END GO /*Insert dim_Date rows*/ INSERT INTO dim_Date (Date_Key, [Date]) SELECT 20100802, '2010-08-02' UNION ALL SELECT 20100803, '2010-08-03' UNION ALL SELECT 20100804, '2010-08-04' UNION ALL SELECT 20100805, '2010-08-05' UNION ALL SELECT 20100806, '2010-08-06'
Step 2 - Create Fact Table Load SSIS Task including Script Component
The process of loading fact_Sales is contained in a single SSIS Data Flow task. The diagram below shows the components used in the Data Flow task, including the components used to create the inferred dimension members. The numbered components are explained in further detail below.
Figure 10 - Fact Load Data Flow Components
1. The Lookup Product_Key component is set to Ignore Failure if no matching dimension record is found, meaning that all rows will continue in the same stream regardless if the Lookup returned a match. The diagram below illustrates how the Ignore Failure option of the lookup is set.
Figure 11 - Lookup Transform Configuration
2. A Derived Column task creates and assigns values to the fields listed below. These fields will be consumed by the upcoming Script Component. Since these column names are not specific to the dimension being loaded, the code inside the script component is entirely reusable.
- SurrogateKey - Value of dimension key field being looked up and created (Product_Key in our example)
- DimensionTableName - Name of dimension table where inferred members are created
- BKColumnList - Comma separated list of field names that make up the business key of the dimension
- BKColumnValues - Comma separated list of business key value fields
- InsertColumnList - Comma separated list of fields to be inserted by stored procedure. This list will include the business key fields, as well as any other required fields, or fields for which default values are supplied.
- InsertColumnValues - Comma separated list of values to be inserted by stored procedure, including single quotes
The diagram below illustrates how these fields are set. During processing, the SurrogateKey value will be NULL for the rows that did not have a Lookup match. Notice that we have supplied a value of 1 (True) for the SCDInferredMember_Ind field. A value of True for the SCDInferredMember_Ind is crucial for the dimension table processing that will occur later. Notice also that we have supplied a value of "UNKNOWN" for the Product_Name field. Since the fields in dimension table may be used in queries, reports, or OLAP dimensions, it is preferable to load a default value for these fields, rather than leaving them NULL.
Figure 12 - Assigning Values to New Fields
The Data Viewer below shows an example of the values used in these fields during processing. Note that single quotes are not required for the BKColumnValues fields, but are required for the InsertColumnValues field if required by the target SQL Server data type.
Figure 13 - Data Passed to Script Component
3. All rows in the Data Flow task pass through the Script Component. For rows that have NULL SurrogateKey value, the Script Component calls the stored procedure to create and return a SurrogateKey value. Rows with a non-NULL SurrogateKey value simply pass through the component unchanged.
The Script Component requires the fields shown below for processing. Note that all fields are set to ReadOnly except for the SurrogateKey field, which is set to ReadWrite.
Figure 14 - Script Component Configuration
The code in the Script Component requires a reference to an OLE DB connection manager named "Connection". This non-specific name is a key aspect of the reusability of the code in the Script Component. The diagram below shows the DataWarehouse connection manager is selected and named "Connection" for use in the code.
Figure 15 - Script Component Connection Manager
The script uses the code shown below. This example is illustrated with Visual Basic, but with SSIS 2008 C# can also be used. The code is written so that it is entirely reusable in other script components to create inferred members for other dimensions during any fact table load process.
As mentioned earlier, every row in the Data Flow passes through this component, and rows with a non-NULL SurrogateKey value pass through unchanged. The main portion of the script is only called to action for rows that have a NULL SurrogateKey value. The script uses a hash table to store values of business key / surrogate key combinations for inferred members created during the execution of the task. The script first checks the hash table to see if a new surrogate key has already been generated for the BusinessKey during this execution. If a match is found in the hash table, the script returns the surrogate key value from the hash table, and continues on to process the next row. If no matching row is found in the hash table, only then is the stored procedure executed to create a new dimension row and return a surrogate key. The business key value and new surrogate key value are then added to hash table. This method ensures that the stored procedure is only called once for each unique business key that appears in the stream.
Using the sample data from above as an example, the stg_Sales data contains two rows with Product_Nbr of EE505 that does not exist in dim_Product. When the first row passes through the script component, the stored procedure will be executed to create a dimension row for Product_Nbr EE505. When the second row passes through, the script will find a matching record in the hash table, and not execute the stored procedure again.
Imports System Imports System.Data Imports System.Data.OleDb Imports System.Collections Imports System.Math Imports Microsoft.SqlServer.Dts.Pipeline.Wrapper Imports Microsoft.SqlServer.Dts.Runtime.Wrapper <Microsoft.SqlServer.Dts.Pipeline.SSISScriptComponentEntryPointAttribute()> _ <CLSCompliant(False)> _ Public Class ScriptMain Inherits UserComponent 'Stored procedure to insert inferred members Private oCommand As New OleDbCommand("dbo.InsertInferredDimensionMember") 'Declare parameters to be used by stored procedure Private parmDimTableName, parmColumnName, parmColumnValue, parmSurrogateKey As OleDbParameter 'Declare hash table used to store surrogate keys created during this execution Private htDimCache As New Generic.SortedDictionary(Of String, Integer) Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer) 'If lookup did not have a match, SurrogateKey value will be NULL If Row.SurrogateKey_IsNull Then Dim sDimTableName As String = Row.DimensionTableName Dim sBusinessKeyColumnNameCache As String = Row.BKColumnList Dim sBusinessKeyValueCache As String = Row.BKColumnValues Dim iSurrogateKey As Integer = 0 Dim sColumnInsertName As String = Row.InsertColumnList Dim sColumnInsertValue As String = Row.InsertColumnValues 'Get the surrogate key for the business key if it has already been created during this execution 'If a surrogate key has not been created, execute the stored procedure to create a new surrogate key 'and add business key and surrogate key to cache If Not htDimCache.TryGetValue(sBusinessKeyValueCache, iSurrogateKey) Then iSurrogateKey = Me.ExecuteProcedure(sDimTableName, sColumnInsertName, sColumnInsertValue) htDimCache.Add(sBusinessKeyValueCache, iSurrogateKey) End If Row.SurrogateKey = iSurrogateKey End If 'ELSE: row passes through unchanged End Sub Private Function ExecuteProcedure(ByVal TableName As String, ByVal ColumnNameList As String, ByVal ColumnValueList As String) As Integer 'Retrieve parameter values to pass to stored procedure, execute the stored procedure, ' and return newly created surrogate key value parmDimTableName.Value = TableName parmColumnName.Value = ColumnNameList parmColumnValue.Value = ColumnValueList parmSurrogateKey.Value = 0 oCommand.ExecuteNonQuery() Return CInt(parmSurrogateKey.Value) End Function Public Overrides Sub PreExecute() MyBase.PreExecute() With oCommand .CommandType = CommandType.StoredProcedure .Connection = New OleDbConnection(Connections.Connection.ConnectionString) .Connection.Open() With .Parameters parmDimTableName = .Add("@DimTableName", OleDbType.VarChar, 1000) parmColumnName = .Add("@ColumnNameList", OleDbType.VarChar, 1000) parmColumnValue = .Add("@ColumnValueList", OleDbType.VarChar, 1000) parmSurrogateKey = .Add("@SurrogateKey", OleDbType.Integer) parmSurrogateKey.Direction = ParameterDirection.InputOutput End With .Prepare() End With End Sub Public Overrides Sub PostExecute() MyBase.PostExecute() 'Empty hash table htDimCache = Nothing 'Close connection oCommand.Connection.Close() End Sub End Class
4. A Derived Column transform sets Product_Key value to the SurrogateKey value, which includes newly created surrogate key values.
Figure 16 - Assign Surrogate Key Value to Dimension Key
Step 3 - Create Dimension Load SSIS Package with Inferred Member Support
The process of loading the dim_Product table is contained in a single Data Flow task, shown in the diagram below. The Data Flow task contains an OLE DB source to read from stg_Product, followed by a Slowly Changing Dimension (SCD) transform. When creating this task, simply connect the output of the OLE DB Source to the SCD transform. The components shown under the SCD transform are generated automatically after configuring the SCD transform.
Figure 17 - dim_Product Data Flow Task
Begin the configuration of the SCD by selecting table to be loaded (dim_Product), and mapping the fields as shown below. Note that in this case, the Dimension Columns that do not have Input Columns all have default values defined in the table and do not require inputs to be mapped to them.
Figure 18 - SCD Configuration
The dimension in this example is a Type 1 SCD. Choose Product_Name to be a Changing Attribute, as shown below.
Figure 19 - SCD Attribute Configuration
Next, select the "Enable inferred member support" box, choose the option for "Use a Boolean column to indicate whether the current record is an inferred member", and choose the SCDInferredMember_Ind field from the drop down. These selections are shows in the diagram below.
Figure 20 - Enable SCD Inferred Member Support
By selecting a Boolean indicator field, the SCD transform will update existing dimension rows where the selected Boolean field has a value of True. Using the "All columns with a change type are null" choice will not work in our case because we have chosen to supply non-NULL default values for our change column (i.e. we supplied the value of "UNKNOWN" for the Product_Name field).
Using a single field to flag inferred members (and using the same field name in all dimension tables) is the preferred method for a number of reasons. First, it is clean and simple to implement. Secondly, most dimension tables have attributes that do not allow NULL values, and if they do allow NULLs, it is often desirable to provide non-NULL default values as we have done in this example. Finally, it streamlines the auditing of dimension tables. For example using a query such as "SELECT COUNT(*) FROM <<TABLE_NAME>> WHERE SCDInferredMember_Ind = 1" is much simpler and cleaner than queries like "SELECT COUNT(*) FROM dim_Product WHERE Product_Name IS NULL" for each dimension.
It is important to note that when using Inferred Member support for a Type 2 SCD, the SSIS SCD transform will identify existing inferred dimension rows and update those rows with master data. It will not insert new rows if there is a matching inferred member row in the table.
Finally, we modify the two OLE DB Update commands to update the LastUpdatedBy_Name and LastUpdated_Tms fields, as show in red in the diagrams below. Alternatively, these fields could be updated through a database trigger. Note the query on the left is used by the SCD transform to update inferred member rows with master data, and was automatically configured to set the SCDInferredMember_Ind to 0 (False) for inferred members being updated. The query on the right is used to update existing "real" dimension rows with Type 1 changes.
Figure 21 - OLE DB Updates from SCD Transform
How to Run and Test the Solution
The following steps describe in detail how to create sample data and run and test the components just created.
1. Load Initial Sample Data
The following SQL script inserts the initial set of stg_Product and stg_Sales data.
/*Insert stg_Product rows*/ INSERT INTO stg_Product (Product_Nbr, Product_Name) SELECT 'AA101', 'Apple' UNION ALL SELECT 'BB202', 'Banana' UNION ALL SELECT 'CC303', 'Cherry' UNION ALL SELECT 'DD404', 'Doughnut' /*Insert stg_Sales rows*/ INSERT INTO stg_Sales (Order_Nbr, OrderLine_Nbr, Order_Dt, Product_Nbr, Sales_Qty, Sales_Amt) SELECT 10001, 1, '2010-08-02', 'AA101', 10, 1000.00 UNION ALL SELECT 10001, 2, '2010-08-02', 'BB202', 6, 150.00 UNION ALL SELECT 10002, 1, '2010-08-03', 'AA101', 8, 800.00 UNION ALL SELECT 10003, 1, '2010-08-03', 'CC303', 5, 75.00 UNION ALL SELECT 10003, 2, '2010-08-04', 'DD404', 10, 1000.00 /*Insert early arriving fact rows*/ INSERT INTO stg_Sales (Order_Nbr, OrderLine_Nbr, Order_Dt, Product_Nbr, Sales_Qty, Sales_Amt) SELECT 10004, 1, '2010-08-05', 'EE505', 12, 96.00 UNION ALL SELECT 10005, 1, '2010-08-05', 'EE505', 1, 8.00 UNION ALL SELECT 10005, 2, '2010-08-05', 'FF606', 5, 30.00
2. Execute dim_Product Load
Execute the dim_Product load task to create rows in dim_Product for the initial stg_Product data.
3. Execute fact_Sales Load
Execute the fact_Sales load task to load the fact_Sales table, and create inferred dimension records for products in the stg_Sales data that do not exist in dim_Product.
4. Load New Product Master Data
Execute the following SQL script to insert newly available product master data into stg_Product.
INSERT INTO stg_Product (Product_Nbr, Product_Name) SELECT 'EE505', 'Egg' UNION ALL SELECT 'FF606', 'Frankfurter' UNION ALL SELECT 'GG707', 'Gorgonzola'
5. Re-execute dim_Product Load
Run the dim_Product load task again to insert new dimension rows and to use the inferred member support to update inferred member rows with newly available product master data.
This article has illustrated a complete SSIS solution for the common data warehouse loading problem of processing fact rows that do not have matching dimension members. It uses components that are entirely reusable by any fact load process to load any dimension table.
As mentioned earlier, this solution is modeled after a similar solution used in Microsoft's Project Real. Jake Kremer and Abdul Wajeed Mohammed from Project Leadership Associates (PLA) contributed to this article.