Best practice approach to surrogate keys for new records in SCD2 processing

  • I'd like your opinion on the best practice for SCD2 processing with respect to surrogate keys. A typical scenario would be a customer number.

    Say we have this schema (simplified, I'm not including valid from/valid to dates)

    PK (identity column) CustomerNumber (surrogate key) BusinessKey (say FirstName LastName DateOfBirth) SatelliteVariables (whatever triggers an SCD2 change) CurrentRecord

    And say an initial load:

    1 1001 John A Y
    2 1002 Mary B Y
    3 1003 Jack C Y

    So, the identity column generated the PK, and "some process" generated the customer number (SK) based on the BusinessKey (name) for SCD2 "new" records.

    Say the next load has:

    Mary D (update)
    Mark E (new)
    John F (update)
    Suri G (new)

    The new data could have (at least) two approaches:

    1) CustomerNumber (surrogate key) is based on the max previous Customer Number:

    1 1001 John A N
    2 1002 Mary B N
    3 1003 Jack C Y
    4 1002 Mary D Y
    5 1004 Mark E Y <<< max (1003) + 1
    6 1001 John F Y
    7 1005 Suri G Y <<< max (now 1004) + 1

    So, the previously assigned CustomerNumber is carried forward on the change record, and new records get assigned the "max+1" CustomerNumber

    2) CustomerNumber (surrogate key) is based on the PK (resulting in gaps in the CustomerNumber)

    1 1001 John A N
    2 1002 Mary B N
    3 1003 Jack C Y
    4 1002 Mary E Y
    5 1005 Mark D Y <<< PK (5) + offset (1000)
    6 1001 John F Y
    7 1007 Suri G Y <<< PK (7) + offset (1000)

    So, the previously assigned CustomerNumber is carried forward on the change record, and new records get derived from the PK

    Questions:

    1) What is the best practice? Or is each approach equivalent? Should I even care about "gaps" in the CustomerNumber?
    2) For a very large table, would #2 perform better than #1?
    3) Based on your best practice, what approach do you use to implement this? Insert trigger? Function+Default Value? Something else?
    4) Do you have any example code to get me started?

    (Note: I'm trying to adopt the code from https://www.sqlservercentral.com/Forums/Topic1851256.aspx to implement #1, although this use case is a bit more complex. #2 should be simple to implement. Regardless, I wanted to post this to get your thoughts on best practice.)

    Apologies if this is a noob question. I Googled "sql server scd2 surrogate key customer number" before posting. I'm happy to be pointed to a link and educate myself, rather than being spoon fed, if you'll point me in the right direction.

    (The Customer Number analogy is just an example. My actual scenario is a single table, not a full fledged dimensional data warehouse)

    Thanks...

  • One thing i will recommend is to grab a copy of The Data Warehouse Toolkit.
    This should answer your questions around this.
    Its a great book and well worth it.

  • This sort of works for #1, but I'm not sure it's best practice.  It's moving some of my application logic into the database layer - I'm not sure if that's a good thing or not.

    ALTER TABLE [tmp].[test] DROP CONSTRAINT [DF_test_cr]
    GO

    ALTER TABLE [tmp].[test] DROP CONSTRAINT [DF_test_SK]
    GO

    DROP TABLE [tmp].[test]
    GO

    SET ANSI_NULLS ON
    GO

    SET QUOTED_IDENTIFIER ON
    GO

    SET ANSI_PADDING ON
    GO

    CREATE TABLE [tmp].[test](
    [PK] [int] IDENTITY(1,1) NOT NULL,
    [SK] [int] NOT NULL,
    [BK] [varchar](10) NOT NULL,
    [sv] [varchar](10) NULL,
    [cr] [char](1) NOT NULL,
    CONSTRAINT [PK_test] PRIMARY KEY CLUSTERED
    (
    [PK] ASC
    )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
    ) ON [PRIMARY]

    GO

    ALTER TABLE [tmp].[test] ADD CONSTRAINT [DF_test_SK] DEFAULT ((-1)) FOR [SK]
    GO

    ALTER TABLE [tmp].[test] ADD CONSTRAINT [DF_test_cr] DEFAULT ('Y') FOR [cr]
    GO

    CREATE TRIGGER tmp.trgTest_UpdateSK
    ON tmp.test
    INSTEAD OF INSERT
    AS
    BEGIN
    DECLARE @MaxSK INT
    SET @MaxSK = (SELECT COALESCE(MAX(SK),0) FROM tmp.test)

    ;WITH Inserts AS (
    SELECT
    @MaxSK + ROW_NUMBER() OVER (ORDER BY t.PK) AS SK
    ,i.BK
    ,i.sv
    FROM
    INSERTED i
    LEFT JOIN
    tmp.test t
    ON
    i.BK=t.BK
    WHERE
    t.BK IS NULL
    )
    ,Updates AS (
    SELECT
    t.SK
    ,i.BK
    ,i.sv
    FROM
    INSERTED i
    LEFT JOIN
    tmp.test t
    ON
    i.BK=t.BK
    WHERE
    t.BK IS NOT NULL AND t.cr='Y'
    )
    INSERT INTO tmp.test
    (SK,BK,sv)
    SELECT * FROM Updates
    UNION
    SELECT * FROM Inserts
    END
    GO

    INSERT INTO tmp.test
    (BK,sv)
    VALUES
    ('John','A'),('Mary','B'),('Jack','C')
    GO
    SELECT * FROM tmp.test t ORDER BY t.BK,t.PK

    INSERT INTO tmp.test
    (BK,sv)
    VALUES
    ('Mary','D'),('Mark','E'),('John','F'),('Suri','G')
    GO
    SELECT * FROM tmp.test t ORDER BY t.BK,t.PK

    INSERT INTO tmp.test
    (BK,sv)
    VALUES
    ('John','H'),('Sara','I'),('Mark','J'),('Walt','K')
    GO
    SELECT * FROM tmp.test t ORDER BY t.BK,t.PK

    But I don't understand why this fails?  (code excerpt only):

    ;WITH Inserts AS (
    SELECT
    1000 + @MaxSK + ROW_NUMBER() OVER (ORDER BY t.PK) AS SK
    ,i.BK
    ,i.sv
    FROM
    INSERTED i
    LEFT JOIN
    tmp.test t
    ON
    i.BK=t.BK
    WHERE
    t.BK IS NULL
    )

  • From the paucity of replies, perhaps the original question belonged in a data warehousing forum rather than here.

    In my actual application, I've got an application layer (SAS) conducting ETL against the database layer (SQL Server) via ODBC.  This is analogous to any ETL tool or .Net providing the application layer against the SQL Server database layer.

    I've gone with approach #2 from my original post, with gaps in the CustomerNumber, primarily because it was easier to implement, but also because when CustomerNumber = PK, I know that was the initial record (a nice to have only - it could also be derived based on minimum ValidFrom date).  

    The bulk of the logic is in the application layer.  The application layer code knows the difference between a new record (no existing CustomerNumber) and an update record (existing CustomerNumber), and can inject the existing CustomerNumber into the INSERT statement.

    The AFTER INSERT trigger is simply:

    CREATE TRIGGER [content].[trgUpdateRecnum]
    ON [content].[COD_URF]
    AFTER INSERT
    AS
    BEGIN
    SET NOCOUNT ON
    UPDATE t
    SET
    t.RECNUM = t.COD_URF_ID
    FROM
    [content].[COD_URF] t
    INNER JOIN
    INSERTED i
    ON
    t.COD_URF_ID = i.COD_URF_ID
    WHERE
    t.RECNUM = -1
    END


    I've read some blogs recommending against the use of triggers, but in this case it seems the easiest approach.

    (I've got "The Data Warehouse Toolkit, Second Edition" from a colleague, but a review of the book's index on "Slowly Changing Dimensions" does not indicate a best practice approach to my OP).

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply