SQLServerCentral Article

Implementing, loading and optimizing a Type 2 Slowly Changing Fact Table

,

Recently I was asked to see if I could optimize a debt collection report system for a medium sized company (100k customers). The problem was that the daily overnight data processing procedure sometimes ran for hours, and it would not finish before the next morning. Attempts to get the data online usually caused even more problems, sometimes even invalid data records being inserted.

The system worked perfect, and fast, when it was first implemented, but that was when there was no, or nearly no, data. Now, after being in use for nearly two years, the main table contained 200 million rows. We quickly found some essential indexes were missing. Nobody had been checking on the database. What was more disturbing was that the system contained much more data than one would expect. Apparently the data processing procedure had created many useless, duplicate rows.

We decided to redesign the whole database, and reconsider the data processing procedure. In this article I will describe one of the essential elements of this new design: the implementation of a Type 2 Slowly Changing Fact (SCF) table. Although information on implementing such a table is available on the Internet, we ran into problems and had to combine several solutions. Therefore, I decided to write this article. I hope it helps you when you have to tackle similar problems.

The samples are written for SQL Server 2008 R2, but it should also work for other versions supporting the MERGE statement.

Requirements

Our debt collection system is required to store all unpaid invoices that are overdue and order them based on the number of days the invoice is overdue. The system should store the data of the individual invoices, provide a source for consolidated summary data, and supply data to generate reminder letters. Invoice data is provided by an ETL system overnight, where we have a limited time window to process the data.

As stated before we will discuss the implementation of a Slowly Changing Fact table in this article. Although interesting, the details of the debt processing procedure won't be discussed. The essential requirements for this part of the process are:

  • The system should provide daily overdue invoice data for all customers.
  • For each invoice there should be one record (one day state) defining it as paid.
  • There is no need to further store information about paid invoices
  • However, since errors in the ETL data frequently occur, it should be possible to 'reopen' an invoice once recorded as paid.
  • Customers should be classified according to the maximum number of days their invoices are overdue.
  • Data loads for the same day should be allowed to repeat, overwriting records inserted/updated for that day.

The existing implementation simply stores a copy of all overdue invoices for every day data processed in one large table. This table contains now 200 million, mostly identical, rows. In addition, our existing system contains too many records for paid invoices, which were incorrectly inserted by our data processing algorithm. This typically happens when the process was killed and restarted.

Implement the Invoices table as a Slowly Changing Fact 

A SCF stores, similar to a Slowly Changing Dimension, a new record in the database when the data changes. In addition validity data is stored, indicating when a certain set of data is valid. Data for any given date can be retrieved as follows:

SELECT *
 FROM SCF_Invoices
 WHERE OnDate BETWEEN ValidFrom AND ValidTo

An excellent article discussing the different possible implementations of Slowly Changing Dimensions using the MERGE statement has been written by Adam Aspin in 2011. I will use his implementation of SCD type 2 as a starting point. But before that, I will introduce the tables used in this article

DI_Customer:

Contains Customer definitions

CREATE TABLE DI_Customer (
  CustomerID int IDENTITY(1,1) NOT NULL,
  CustomerIDO int NOT NULL,  -- Customer ID from the ETL system
  CustomerName varchar(50) NOT NULL,
  CustomerAddress varchar(500) OT NULL
)

DI_Bucket:

This dimension is used to classify Customers based on the maximum number of days their invoices are overdue

CREATE TABLE DI_Bucket (
  BucketID int IDENTITY(1,1) NOT NULL,
  BucketName varchar(50) NOT NULL,
  BucketFromDays int NOT NULL,
  BucketToDays int NOT NULL
)

Some sample data

INSERT INTO [dbo].[DI_Bucket]
  ([BucketName], [BucketFromDays], [BucketToDays])
VALUES ('In time', 0, 30),
  ('OverDue', 31, 45),
  ('OverDue2', 46, 60),
  ('OverDue', 61, 70),
  ('OverDue4', 71, 89),
  ('Enforcement', 81, 999999)

SCF_Invoices:

The Slowly Changing Fact table.

CREATE TABLE dbo.SCF_Invoices(
( InvoiceID int IDENTITY(1,1) NOT NULL,
  InvoiceIDO int NOT NULL,  -- Invoice ID from the ETL system
  CustomerID int NOT NULL,
  DueDate date NOT NULL,
  IsOpen bit NOT NULL DEFAULT 1,  --Indicates that the Invoice was not (fully) paid yet
  BucketID int NOT NULL,
  OrigAmnt numeric(16,2) NOT NULL,  --The original Invoice amount
  RemAmnt numeric(16,2) NOT NULL,  --The remaining, unpaid, Invoice amount
  ValidFrom date NOT NULL, 
  ValidTo DATE NOT NULL DEFAULT '99991231', 
  IsCurrent BIT NOT NULL DEFAULT 0, 
  CONSTRAINT [FK_SCF_Invoices_Customer] FOREIGN KEY (CustomerID) REFERENCES DI_Customer(CustomerID), 
  CONSTRAINT [FK_SCF_Invoices_Bucket] FOREIGN KEY (BucketID) REFERENCES DI_Bucket(BucketID),
) 

In addition we will define a table valued function returning fact data for any given day. This TVF replaces the original fact table for retrieving daily data.

CREATE FUNCTION tvf_F_Invoices(@OnDate date) 
RETURNS TABLE 
AS RETURN 
( SELECT 
    InvoiceID, 
    InvoiceIDO, 
    CustomerID, 
    DueDate, 
    IsOpen, 
    BucketID, 
    OrigAmnt, 
    RemAmnt, 
    @OnDate AS OnDate, 
    ValidFrom, 
    ValidTo, 
    IsCurrent 
  FROM SCF_Invoices WHERE @OnDate BETWEEN ValidFrom AND ValidTo 
)

Note that we did not define a primary key for the SCF table yet. This will be discussed later. I did, however, define two foreign key constraints on the Dimension tables. We will later see why.

Daily Invoice data will be supplied in a staging table. This table supplies information about all outstanding Invoices on a given day as well as Invoices paid during the previous month. This will make sure the procedure will get at least once a closing record for every paid invoice.

CREATE TABLE dbo.ST_Invoices(
( OnDate date NOT NULL,
  InvoiceIDO int NOT NULL,  -- Invoice ID from the ETL system
  CustomerID int NOT NULL,
  DueDate date NOT NULL,
  OrigAmnt numeric(16,2) NOT NULL,  --The original Invoice amount
  RemAmnt numeric(16,2) NOT NULL,  --The remaining, unpaid, Invoice amount
) 

Define the SCF processing statement

Now we have all the information and can write our SCF processing procedure. First of all we have to define the select statement returning the source data as it should be merged in the SCF table.

SELECT st.InvoiceIDO,
    ISNULL(cus.CustomerID,-1) AS CustomerID,  -- Return Unknown Customer (-1) when not matched
    st.DueDate,
    CASE WHEN st.RemAmnt > 0 THEN 1 ELSE 0 END AS IsOpen,
    ISNULL(buc.BucketID,-1) as BucketID,  --Return Unknown (-1) when not matched
    st.OrigAmnt,
    st.RemAmnt
  FROM ST_Invoices st
  LEFT JOIN DI_Customer cus
    ON cus.CustomerIDO = st.CustomerIDO
  LEFT JOIN DI_Bucket buc
    ON DATEDIFF(dd,st.DueDate,st.OnDate) BETWEEN buc.BucketFromDays AND buc.BucketToDays
 WHERE st.OnDate=@OnDate

Implementing the SCF MERGE statement seems now straightforward with the instructions given by Adam Aspin:

CREATE PROCEDURE LOAD_SCF_Invoices 
AS
BEGIN
  SET NOCOUNT ON
  DECLARE @Today date, @Yesterday date
  SELECT @Today = MAX(OnDate) FROM ST_Invoices
  SELECT @Yesterday = DATEADD(dd, -1, @Today)
  --The OUTER insert inserts a new valid record for all records flagged as no longer active
  INSERT INTO SCF_Invoices (InvoiceIDO, CustomerID, DueDate, IsOpen, BucketID, OrigAmnt, RemAmnt, ValidFrom, ValidTo, IsCurrent)
  SELECT InvoiceIDO, CustomerID, DueDate, IsOpen, BucketID, OrigAmnt, RemAmnt, @Today AS ValidFrom, CASE WHEN IsOpen=0 THEN @Today ELSE '99991231' END AS ValidTo, 1
  FROM
  ( -- Merge statement
    MERGE INTO SCF_Invoices AS DST
    USING 
      ( SELECT st.InvoiceIDO,
               ISNULL(cus.CustomerID,-1) AS CustomerID,   -- Return Unknown Customer (-1) when not matched
               st.DueDate,
               CASE WHEN st.RemAmnt > 0 THEN 1 ELSE 0 END AS IsOpen,
               ISNULL(buc.BucketID,-1) as BucketID,    --Return Unknown (-1) when not matched
               st.OrigAmnt,
               st.RemAmnt
          FROM ST_Invoices st
          LEFT JOIN DI_Customer cus
            ON cus.CustomerIDO = st.CustomerIDO
          LEFT JOIN DI_Bucket buc
            ON DATEDIFF(dd,st.DueDate,st.OnDate) BETWEEN buc.BucketFromDays AND buc.BucketToDays
         WHERE st.OnDate = @Today
      ) AS SRC
      ON (SRC.InvoiceIDO = DST.InvoiceIDO)
     AND (DST.IsCurrent = 1)    -- This conditrion limits the MERGE to Current invoice records only, and can make use of a conditional index
      -- Insert a new record if the Invoice is not already present
     WHEN NOT MATCHED THEN 
   INSERT (InvoiceIDO, CustomerID, DueDate, IsOpen, BucketID, OrigAmnt, RemAmnt, ValidFrom, ValidTo, IsCurrent)
   VALUES (SRC.InvoiceIDO, SRC.CustomerID, SRC.DueDate, SRC.IsOpen, SRC.BucketID, SRC.OrigAmnt, SRC.RemAmnt, @Today, '99991231', 1)
     -- Check if an existing Invoice record has changed data
     WHEN MATCHED 
      AND (SRC.IsOpen = 1 OR DST.IsOpen = 1) --Don't process changes on closed Invoices
      AND (DST.CustomerID <> SRC.CustomerID
       OR DST.DueDate <> SRC.DueDate
       OR DST.IsOpen <> SRC.IsOpen
       OR DST.BucketID <> SRC.BucketID
       OR DST.OrigAmnt <> SRC.OrigAmnt
       OR DST.RemAmnt <> SRC.RemAmnt
      )
      -- When Invoice data changed, set the existing record as no longer active, and the validity period to end Yesterday
     THEN UPDATE 
      SET DST.IsCurrent = 0, 
     DST.ValidTo = @Yesterday
    OUTPUT SRC.InvoiceIDO, SRC.CustomerID, SRC.DueDate, SRC.IsOpen, SRC.BucketID, SRC.OrigAmnt, SRC.RemAmnt, $Action AS MergeAction
  ) AS MRG
  WHERE MRG.MergeAction = 'UPDATE';  
END

Note the small differences compared with the procedure given by Adam Asppn

  • The (IsCurrent = 1) condition should be moved from the IS MATCHED condition to the main merge search condition, this will enable the merge to use a conditional index
  • Since we require that closed Invoices are not visible after the date they are recorded as IsOpen=0, ValidTo should be set to @Today, and not 9999-12-31 (infinite) as is usual for  a Slowly Changing Dimension
  • Because we are interested in Open documents mostly, there is no need to check for changes when both source and destination records have IsOpen=0, hence the additional IS MATCHED condition: AND (SRC.IsOpen = 1 OR DST.IsOpen = 1)

However, running this query will result in an error:

The target table 'dbo.SCF_Invoices' of the INSERT statement cannot be on either side of a (primary key, foreign key) relationship when the FROM clause contains a nested INSERT, UPDATE, DELETE, or MERGE statement. Found reference constraint 'FK_SCF_Invoices_Customer'. 

The error is caused by is a known bug in SQL Server. The problem is using the OUTPUT clause of a DML statement with the INSERT... SELECT syntax. Workarounds are either to not have foreign keys referencing the dimension table or to INSERT the results of the MERGE... OUTPUT into a temp table then INSERT into the actual dimension table.

I decided to use the latter workaround and create a temporary table. Temporarily dropping the Foreign Key constraints is not acceptable. The final load procedure is as follows:

CREATE PROCEDURE LOAD_SCF_Invoices
AS
BEGIN
  SET NOCOUNT ON
  DECLARE @Today date, @Yesterday date
  SELECT @Today = MAX(OnDate) FROM ST_Invoices
  SELECT @Yesterday = DATEADD(dd, -1, @Today)
  -- To avoid the error: The target table 'dbo.SCF_CASE_DETAILS' of the INSERT statement cannot be on either side of a (primary key, foreign key) relationship when the FROM clause contains a nested INSERT, UPDATE, DELETE, or MERGE statement. Found reference constraint 'FK_SCF_CASE_DETAILS_CUSTOMER'.
  -- Insert in two steps. The outer insert will firs insert in a temporary table, then the temporary data is insert in [dbo].[SCF_CASE_DETAILS] 
  -- Create the temporyry table 
  CREATE TABLE #SCF_Invoices
  ( InvoiceIDO int NOT NULL,    -- Invoice ID from the ETL system
    CustomerID int NOT NULL,
    DueDate date NOT NULL,
    IsOpen bit NOT NULL DEFAULT 1,    --Indicates that the Invoice was not (fully) paid yet
    BucketID int NOT NULL,
    OrigAmnt numeric(16,2) NOT NULL,    --The original Invoice amount
    RemAmnt numeric(16,2) NOT NULL,        --The remaining, unpaid, Invoice amount
    ValidFrom date NOT NULL, 
    ValidTo DATE NOT NULL DEFAULT '99991231', 
    IsCurrent BIT NOT NULL DEFAULT 0, 
  )    
  -- The OUTER insert inserts a new valid record for all records flagged as no longer active
  -- INSERT in the temporary table
  INSERT INTO #SCF_Invoices (InvoiceIDO, CustomerID, DueDate, IsOpen, BucketID, OrigAmnt, RemAmnt, ValidFrom, ValidTo, IsCurrent)
  SELECT InvoiceIDO, CustomerID, DueDate, IsOpen, BucketID, OrigAmnt, RemAmnt, @Today AS ValidFrom, CASE WHEN IsOpen=0 THEN @Today ELSE '99991231' END AS ValidTo, 1
  FROM
  ( -- Merge statement
    MERGE INTO SCF_Invoices AS DST
    USING 
      ( SELECT st.InvoiceIDO,
               ISNULL(cus.CustomerID,-1) AS CustomerID,   -- Return Unknown Customer (-1) when not matched
               st.DueDate,
               CASE WHEN st.RemAmnt > 0 THEN 1 ELSE 0 END AS IsOpen,
               ISNULL(buc.BucketID,-1) as BucketID,    --Return Unknown (-1) when not matched
               st.OrigAmnt,
               st.RemAmnt
          FROM ST_Invoices st
          LEFT JOIN DI_Customer cus
            ON cus.CustomerIDO = st.CustomerIDO
          LEFT JOIN DI_Bucket buc
            ON DATEDIFF(dd,st.DueDate,st.OnDate) BETWEEN buc.BucketFromDays AND buc.BucketToDays
         WHERE st.OnDate = @Today
      ) AS SRC
      ON (SRC.InvoiceIDO = DST.InvoiceIDO)
     AND (DST.IsCurrent = 1)    -- This condition limits the MERGE to Current invoice records only, and can make use of a conditional index
      -- Insert a new record if the Invoice is not already present
     WHEN NOT MATCHED THEN 
   INSERT (InvoiceIDO, CustomerID, DueDate, IsOpen, BucketID, OrigAmnt, RemAmnt, ValidFrom, ValidTo, IsCurrent)
   VALUES (SRC.InvoiceIDO, SRC.CustomerID, SRC.DueDate, SRC.IsOpen, SRC.BucketID, SRC.OrigAmnt, SRC.RemAmnt, @Today, '99991231', 1)
     -- Check if an existing Invoice record has changed data
     WHEN MATCHED 
      AND (SRC.IsOpen = 1 OR DST.IsOpen = 1) --Don't process changes on closed Invoices
      AND (DST.CustomerID <> SRC.CustomerID
       OR DST.DueDate <> SRC.DueDate
       OR DST.IsOpen <> SRC.IsOpen
       OR DST.BucketID <> SRC.BucketID
       OR DST.OrigAmnt <> SRC.OrigAmnt
       OR DST.RemAmnt <> SRC.RemAmnt
      )
      -- When Invoice data changed, set the existing record as no longer active, and set the validity period to end Yesterday
     THEN UPDATE 
      SET DST.IsCurrent = 0, 
     DST.ValidTo = @Yesterday
    OUTPUT SRC.InvoiceIDO, SRC.CustomerID, SRC.DueDate, SRC.IsOpen, SRC.BucketID, SRC.OrigAmnt, SRC.RemAmnt, $Action AS MergeAction
  ) AS MRG
  WHERE MRG.MergeAction = 'UPDATE';  
  -- Finally insert the data from the temporary table
  INSERT INTO SCF_Invoices (InvoiceIDO, CustomerID, DueDate, IsOpen, BucketID, OrigAmnt, RemAmnt, ValidFrom, ValidTo, IsCurrent)
  SELECT InvoiceIDO, CustomerID, DueDate, IsOpen, BucketID, OrigAmnt, RemAmnt, ValidFrom, ValidTo, IsCurrent
    FROM #SCF_Invoices
END

Deleting data from the SCF table

Another requirement of the system is that data loads for the same day should be allowed to repeat. That means we have to delete, or invalidate, all changes made by the previous load. Of course, we cannot simply delete records from our SCF table. We also have to update the previous valid document and set IsCurrent=1, as well as change ValidTo.

To make things even more complicated the newly generated InvoiceID's could be referenced in another table. Deleting the newly inserted records would in that case result in a REFERENTIAL ERROR. To avoid referential errors I decided not to delete the records, but set their validity interval to a negative value, i.e. ValidTo < ValidFrom using the following code:

--DON'T DELETE, to avoid REFERENCE constraint conflicts, instead set the Validity of the record to an impossible value
  UPDATE SCF_Invoices
     SET ValidTo = @Yesterday,
         IsCurrent =0
   WHERE ValidFrom >= @Today

 --Correct SCF records
  UPDATE SCF_Invoices
     SET ValidTo = CASE WHEN IsOpen = 1 THEN '99991231' ELSE ValidFrom END,
         IsCurrent =1
   WHERE ValidFrom < @Today
     AND ValidTo >= @Yesterday

This piece of code should be inserted in the procedure LOAD_SCF_Invoices before the MERGE statement.

Indexes

The next thing to consider is which indexes to create. The standard approach is to create a PRIMARY KEY on the surrogate key InvoiceID. This is an ideal PRIMARY KEY, inserted records are always appended at the end of the table. It does, however, not provide and special advantages when querying the Slowly Changing Fact table.

In our case, we are usually querying the old Invoice table by Date:

SELECT * FROM F_Invoices WHERE OnDate=@OnDate

These references will typically be replaced by querying the table valued function we created before

SELECT * FROM tvf_F_Invoices(@OnDate)

The SCF condition - WHERE OnDate BETWEEN ValidFrom AND ValidTo - strongly benefits from a covering index on ValidFrom. Especially when ValidFrom is close to the last Date for which data is available. This corresponds also with our business interest. We will usually query the latest Invoice data. Therefore I decided to create the following Index:

ALTER TABLE SCF_Invoices ADD CONSTRAINT PK_SCF_Invoices PRIMARY KEY CLUSTERED 
( ValidFrom ASC,
  InvoiceIDO ASC
)

Since data is added for consecutive days, this PRIMARY KEY also will ensure new records are always appended at the end of the table.

I also wanted to create an index to aid the MERGE statement. Since every Invoice in the SCF table always has one record with (IsCurrent = 1) and this record is theone which should be updated, it is enough to consider only records which fulfill this condition. Furthermore, since the Source and Destination are joined on InvoiceIDO the following conditional covering index strongly optimizes the MERGE since it will evaluate only Current records.

CREATE NONCLUSTERED INDEX IX_IsCurrent_InvoiceIDO_InclData ON SCF_Invoices
( ValidFrom ASC,
  InvoiceID ASC
)
INCLUDE ( IsCurrent, ValidTo, InvoiceIDO, CustomerID, DueDate, IsOpen, BucketID, OrigAmnt, RemAmnt)
 WHERE(IsCurrent = 1)

Probably these indexes are not ideal. The optimal indexes also depend strongly on how the database is queried.  I would be happy to hear if any of you has any better suggestions.

This concludes my story about the implementation of a Slowly Changing Fact table. And the result? I was able to reduce the Invoice table from the original 200 million records to 5 million. A substantial reduction; which also had its effects on the responsiveness of the database. The loading process for example, was reduced from 1.5 hours, originally, or exceptionally even several hours, to a mere 5 minutes in average!

A nice result I would say. I hope my story will help you to tackle similar problems. 

Resources

Rate

4.07 (15)

You rated this post out of 5. Change rating

Share

Share

Rate

4.07 (15)

You rated this post out of 5. Change rating