Using a view to implement SCD2 processing (preventing deadlocks)

  • Hi,

    Question #1:

    Scenario:
    * I have a series of SCD2 source tables which are updated twice a week.
    * I'm creating a downstream series of SCD2 target tables which are updated monthly.
    * I am using a delta feed for the staging tables, reading rows changed in the source table between start and end date, where that window slides forward each update.
    * The SCD2 change columns are named replica_valid_from_date, replica_valid_to_date (datetime2), and replica_current_record.
    *The source tables logically delete a row by setting replica_valid_to_date to "now" and replica_current_record='N'.
    * I'm only interested in the current row from the upstream SCD2 table at the time of my data refresh.  IOW my "temporal granularity" (not sure if that's a good term?) is greater for my SCD2 tables than the source SCD2 tables.

    I am using a view to determine the SCD2 changes required.  For example:

    CREATE VIEW [dwh].[vwDIAGNOSIS]
    AS
    -- We only care about the most recent row from the delta extract
    WITH cteLastRow AS (
      SELECT src.FACILITY_STAY_EPISODE_ID
        ,src.FACILITY_STAY_ID
        ,src.FACILITY_ID
        ,src.diagnosis_type
        ,src.diagnosis_code
        ,src.clinical_codeset
        ,src.clinical_codeset_curr
        ,src.diagnosis_code_curr
        ,src.diagnosis_code_ccl
        ,src.condition_onset_flag
        ,src.replica_valid_from_date
        ,src.replica_valid_to_date
        ,src.replica_current_record
        ,src.replica_uuid
        ,src.CHECKSUM
      FROM (
      SELECT *
        ,RowNum = ROW_NUMBER() OVER (
         PARTITION BY src.FACILITY_STAY_EPISODE_ID, src.diagnosis_type
         ORDER BY replica_valid_from_date DESC
        )
      FROM trn.DIAGNOSIS src
      ) src
      WHERE RowNum=1
    ),
    -- We only care about current rows from the target table
    cteCurrentRow AS(
      SELECT *
      FROM dwh.DIAGNOSIS tgt
      WHERE tgt.CURRENT_ROW='Y'
    )
    SELECT *
    FROM (
    SELECT CASE
        WHEN (src.FACILITY_STAY_EPISODE_ID IS NOT NULL AND tgt.FACILITY_STAY_EPISODE_ID IS NULL)  THEN 'INSERT'
    -- WHEN (src.FACILITY_STAY_EPISODE_ID IS NULL  AND tgt.FACILITY_STAY_EPISODE_ID IS NOT NULL) THEN 'DELETE' -- This does not work with a delta feed
        WHEN (src.FACILITY_STAY_EPISODE_ID IS NOT NULL AND tgt.FACILITY_STAY_EPISODE_ID IS NOT NULL) THEN
        CASE
         WHEN (src.replica_valid_to_date < '5999-01-01') THEN 'DELETE'
         ELSE
         CASE
           WHEN (src.CHECKSUM = tgt.CHECKSUM) THEN 'NOCHANGE'
           ELSE 'CHANGE'
         END
        END
       END AS ACTION
      ,src.FACILITY_STAY_EPISODE_ID
      ,src.FACILITY_STAY_ID
      ,src.FACILITY_ID
      ,src.diagnosis_type
      ,src.diagnosis_code
      ,src.clinical_codeset
      ,src.clinical_codeset_curr
      ,src.diagnosis_code_curr
      ,src.diagnosis_code_ccl
      ,src.condition_onset_flag
      ,src.replica_valid_from_date
      ,src.replica_valid_to_date
      ,src.replica_current_record
      ,src.replica_uuid
      ,src.CHECKSUM
    FROM cteLastRow src
    FULL JOIN cteCurrentRow tgt
    ON  src.FACILITY_STAY_EPISODE_ID=tgt.FACILITY_STAY_EPISODE_ID
    AND  src.diagnosis_type=tgt.diagnosis_type
    ) scd
    WHERE scd.ACTION IN ('INSERT','CHANGE','DELETE')
    GO

    (For now ignore the SELECT *, I'll clean up the code later).

    FACILITY_STAY_EPISODE_ID is an SK representing the natural keys between the source (staging) table and target table.

    What's sort of cool is, even after inserting rows, this view will still return the same results until the old rows are closed out (post processing, see below).  So, if the target is 10M rows, but the staging only has 1K inserts and changes, even after inserting the 1K new/changed rows, the view returns the same original 1K rows (since the new rows are NOCHANGE and filtered out).

    HOWEVER, when I use this view as a source in my SSDT job, I MUST ensure Table Lock is unticked in the target table, since the view is reading from the same table that is being updated.  Regardless, this seems to work well.  I like that I can query the view in SSMS during development to see the changes that will be inserted.

    Question:  Is this a good approach?  Will the view perform as well as embedding the code in the SSDT source table?  Any ideas on making this perform better?  (I'm not using the lookup transformation to detect changes due to target table data volumes).

    Question #2:

    I'm also using another view to close out old rows after the SCD2 inserts & changes are inserted:

    CREATE VIEW [scd].[vwDIAGNOSIS]
    AS

    -- Use the LEAD() windowing function over the ENTIRE TABLE to set VALID_TO and CURRENT_ROW
    WITH cteUpdateSCD2Columns
    AS (
    SELECT FACILITY_STAY_EPISODE_ID
      ,diagnosis_type
      ,LEAD(VALID_FROM,1,{D '9999-12-31'}) OVER (
        PARTITION BY FACILITY_STAY_EPISODE_ID, diagnosis_type
        ORDER BY VALID_FROM
       ) AS VALID_TO
      ,CASE
        WHEN LEAD(VALID_FROM,1,NULL) OVER (
          PARTITION BY FACILITY_STAY_EPISODE_ID, diagnosis_type
          ORDER BY VALID_FROM
        ) IS NULL THEN 'Y' ELSE 'N'
       END AS CURRENT_ROW
    FROM dwh.DIAGNOSIS
    --WHERE CURRENT_ROW='Y' -- This fails for previously deleted rows
    )

    -- Join with view to only update changed or deleted rows
    SELECT src.ACTION
      ,src.FACILITY_STAY_EPISODE_ID
      ,src.diagnosis_type
      ,scd.VALID_TO
      ,scd.CURRENT_ROW
    FROM cteUpdateSCD2Columns scd
    INNER JOIN dwh.vwDIAGNOSIS src
    ON  src.FACILITY_STAY_EPISODE_ID=scd.FACILITY_STAY_EPISODE_ID
    AND  src.diagnosis_type=scd.diagnosis_type
    WHERE src.ACTION IN ('CHANGE','DELETE')
    GO

    The CTE will close out all rows except the latest one (within the keys), and set the latest one as the current row.

    But it does this for the entire table.  This would be very inefficient for a huge table (or maybe not...perhaps the join is inefficient...that's why I'm posting!).  Again, if my target is 10M rows, my changes were only 1K rows, and my previous view still knows those changes even after the insert, then I can use that view to limit the updated rows.

    My post-processing code then becomes:

    -- Execute this as two separate batches
    -- Post-processing for changes
    UPDATE  dwh.DIAGNOSIS
    SET   VALID_TO=scd.VALID_TO
        ,CURRENT_ROW=scd.CURRENT_ROW
    FROM   scd.vwDIAGNOSIS scd
    WHERE  DIAGNOSIS.FACILITY_STAY_EPISODE_ID=scd.FACILITY_STAY_EPISODE_ID
    AND   DIAGNOSIS.diagnosis_type=scd.diagnosis_type
    AND   scd.ACTION='CHANGE'

    -- Post-processing for deletes
    UPDATE  dwh.DIAGNOSIS
    SET   VALID_TO=GETDATE()
        ,CURRENT_ROW='N'
    FROM   scd.vwDIAGNOSIS scd
    WHERE  DIAGNOSIS.FACILITY_STAY_EPISODE_ID=scd.FACILITY_STAY_EPISODE_ID
    AND   DIAGNOSIS.diagnosis_type=scd.diagnosis_type
    AND   scd.ACTION='DELETE'

    This post-processing is not working correctly due to deadlocks:  The view is using the scd.vwDIAGNOSIS view to limit the scope of the updates, but that view is reading the target table that is being updated.

    Question:  Same as before, is this a good approach?  If so, is there a setting I can use to prevent the deadlocks (i.e. TRANSACTION ISOLATION LEVEL ???).  Table hints?

    I did check out https://docs.microsoft.com/en-us/sql/t-sql/statements/set-transaction-isolation-level-transact-sql but TBH it went over my head.

    Appreciate any advice before I pull the pin and convert all this to a stored procedure using temp tables.

    Thanks...

  • Thanks for all the replies...nah, just kidding.  Damn, I must ask the really hard questions 😉

    The first view is now:

    CREATE VIEW [dwh].[vwDIAGNOSIS]
    AS
    -- We only care about current rows from the target table
    WITH cteCurrentRow AS(
      SELECT FACILITY_STAY_EPISODE_ID, diagnosis_type, CHECKSUM
      FROM dwh.DIAGNOSIS tgt WITH (NOLOCK)
      WHERE tgt.CURRENT_ROW='Y'
    )
    SELECT *
    FROM (
    SELECT CASE
        WHEN (src.FACILITY_STAY_EPISODE_ID IS NOT NULL AND tgt.FACILITY_STAY_EPISODE_ID IS NULL)  THEN 'INSERT'
        WHEN (src.FACILITY_STAY_EPISODE_ID IS NOT NULL AND tgt.FACILITY_STAY_EPISODE_ID IS NOT NULL) THEN
        CASE
         WHEN (src.replica_valid_to_date < (SELECT DataExtract_EndTime FROM meta.vwDataExtract_EndTime)) THEN 'DELETE'
         ELSE
         CASE
           WHEN (src.CHECKSUM = tgt.CHECKSUM) THEN 'NOCHANGE'
           ELSE 'CHANGE'
         END
        END
       END AS ACTION
      ,src.FACILITY_STAY_EPISODE_ID
      ,src.FACILITY_STAY_ID
      ,src.FACILITY_ID
      ,src.diagnosis_type
      ,src.diagnosis_code
      ,src.clinical_codeset
      ,src.clinical_codeset_curr
      ,src.diagnosis_code_curr
      ,src.diagnosis_code_ccl
      ,src.condition_onset_flag
      ,src.replica_valid_from_date
      ,src.replica_valid_to_date
      ,src.replica_current_record
      ,src.replica_uuid
      ,src.CHECKSUM
    FROM trn.DIAGNOSIS src
    LEFT JOIN cteCurrentRow tgt
    ON  src.FACILITY_STAY_EPISODE_ID=tgt.FACILITY_STAY_EPISODE_ID
    AND  src.diagnosis_type=tgt.diagnosis_type
    ) scd
    WHERE scd.ACTION IN ('INSERT','CHANGE','DELETE')
    GO

    Again, this view is used in SSDT as an OLEDB data source, writing to an OLEDB data target, and that target is dwh.DIAGNOSIS, which is read by the view.  The properties of the target table are Keep identity; Keep nulls; Table lock; and Check constraints are all unchecked, and the Maximum insert commit size 2147483647

    The WITH (NOLOCK) hint means it's a dirty read.  The processing only inserts rows, either new inserts or change rows.

    So, is the dirty read and/or the Table lock (unticked) a Really Bad Idea?

    If so, can anyone enlighten me on 1) database properties Allow Snapshot Isolation & Is Read Committed Snapshot On, and 2) SET TRANSACTION ISOLATION LEVEL SNAPSHOT???  

    The post-processing is now:

    CREATE VIEW [scd].[vwDIAGNOSIS]
    AS

    -- Use the LEAD() windowing function over the ENTIRE TABLE to set VALID_TO and CURRENT_ROW
    WITH cteUpdateSCD2Columns
    AS (
    SELECT FACILITY_STAY_EPISODE_ID
      ,diagnosis_type
      ,LEAD(VALID_FROM,1,{D '9999-12-31'}) OVER (
        PARTITION BY FACILITY_STAY_EPISODE_ID, diagnosis_type
        ORDER BY VALID_FROM
       ) AS VALID_TO
      ,CASE
        WHEN LEAD(VALID_FROM,1,NULL) OVER (
          PARTITION BY FACILITY_STAY_EPISODE_ID, diagnosis_type
          ORDER BY VALID_FROM
        ) IS NULL THEN 'Y' ELSE 'N'
       END AS CURRENT_ROW
    FROM dwh.DIAGNOSIS WITH (NOLOCK)
    )

    -- Join with view to only update changed or deleted rows
    SELECT src.ACTION
      ,src.FACILITY_STAY_EPISODE_ID
      ,src.diagnosis_type
      ,scd.VALID_TO
      ,scd.CURRENT_ROW
    FROM cteUpdateSCD2Columns scd
    INNER JOIN dwh.vwDIAGNOSIS src WITH (NOLOCK)
    ON  src.FACILITY_STAY_EPISODE_ID=scd.FACILITY_STAY_EPISODE_ID
    AND  src.diagnosis_type=scd.diagnosis_type
    WHERE src.ACTION IN ('CHANGE','DELETE')
    GO

    And this stored procedure:

    ALTER PROCEDURE [scd].[sp_MoH_PostProcess_DIAGNOSIS]
    AS
    BEGIN

      SET NOCOUNT ON

      -- Only process current rows - once a row is closed out it must never be changed again
      BEGIN TRANSACTION

       -- Execute this as two separate batches
       -- Post-processing for changes
       PRINT 'Processing CHANGES'

       UPDATE  dwh.DIAGNOSIS
       SET   VALID_TO=scd.VALID_TO
          ,CURRENT_ROW=scd.CURRENT_ROW
       FROM   scd.vwDIAGNOSIS scd WITH (NOLOCK)
       WHERE  DIAGNOSIS.FACILITY_STAY_EPISODE_ID=scd.FACILITY_STAY_EPISODE_ID
       AND   DIAGNOSIS.diagnosis_type=scd.diagnosis_type
       AND   DIAGNOSIS.CURRENT_ROW='Y'
       AND   scd.ACTION='CHANGE'

       PRINT CONCAT(@@rowcount,' rows affected
       ')

       -- Post-processing for deletes
       PRINT 'Processing DELETES'

       UPDATE  dwh.DIAGNOSIS
       SET   VALID_TO=GETDATE()
          ,CURRENT_ROW='N'
       FROM   scd.vwDIAGNOSIS scd WITH (NOLOCK)
       WHERE  DIAGNOSIS.FACILITY_STAY_EPISODE_ID=scd.FACILITY_STAY_EPISODE_ID
       AND   DIAGNOSIS.diagnosis_type=scd.diagnosis_type
       AND   DIAGNOSIS.CURRENT_ROW='Y'
       AND   scd.ACTION='DELETE'

       PRINT CONCAT(@@rowcount,' rows affected
       ')

      COMMIT

    END

    Again, any enlightenment on why WITH (NOLOCK) is a bad idea - and suitable alternative approaches - would be much appreciated.

    I could change the stored procedure to read from the scd.vwDIAGNOSIS view and insert the SK's into a temp table, if that would be a better design???

    Thanks...

  • tl;dr:  Is WITH (NOLOCK) always bad practice?

    Well kind of disappointing that there's no comments or advice on this post, but it happens.  I acknowledge that could certainly be my fault; perhaps too long and too much detail???  Anyway, there's always StackOverflow 😉

    This post is just to wrap up with what appears to be working OK, in case others find this post in a future search, and perhaps find some of the ideas or approaches useful to their particular scenarios.

    1) I use this view to determine what action I need to take with respect to SCD2 processing.  It is a separate view since management would like metrics on INSERT/CHANGE/(Logical)DELETE/NOCHANGE.

    ALTER VIEW [scd].[vwDIAGNOSIS_Action]
    AS
    -- We only care about current rows from the target table
    -- Return the SK, NK, and CHECKSUM
    WITH cteTarget AS(
      SELECT tgt.DIAGNOSIS_ID, tgt.FACILITY_STAY_EPISODE_ID, tgt.diagnosis_type, tgt.[CHECKSUM]
      FROM dwh.DIAGNOSIS tgt WITH (NOLOCK)
      WHERE tgt.CURRENT_ROW='Y'
    )
    SELECT CASE
        WHEN (src.FACILITY_STAY_EPISODE_ID IS NOT NULL AND tgt.FACILITY_STAY_EPISODE_ID IS NULL)  THEN 'INSERT'
        WHEN (src.FACILITY_STAY_EPISODE_ID IS NOT NULL AND tgt.FACILITY_STAY_EPISODE_ID IS NOT NULL) THEN
        CASE
         WHEN (src.replica_valid_to_date < (SELECT DataExtract_EndTime FROM meta.vwDataExtract_EndTime)) THEN 'DELETE'
         ELSE
         CASE
           WHEN (src.CHECKSUM = tgt.CHECKSUM) THEN 'NOCHANGE'
           ELSE 'CHANGE'
         END
        END
       END AS ACTION
      ,src.DIAGNOSIS_ID AS SRC_DIAGNOSIS_ID
      ,tgt.DIAGNOSIS_ID AS TGT_DIAGNOSIS_ID
    FROM trn.DIAGNOSIS src
    LEFT JOIN cteTarget tgt
    ON  src.FACILITY_STAY_EPISODE_ID=tgt.FACILITY_STAY_EPISODE_ID
    AND  src.diagnosis_type=tgt.diagnosis_type

    Note the WITH (NOLOCK) hint on the target table.  Also note that, during this ETL processing, no one else will be accessing this table. 

    I don't know if the dirty read would allow reading new rows that the processing is inserting as the reading takes place - I'm not an expert on these SQL Server internals.  But the processing appears to be working OK.

    Both the source and target tables have an identity column named <TABLENAME>_ID as the PK, i.e. DIAGNOSIS_ID.  However, they are different values in the two tables.  So, I have to join on the NK's between the source and target.  However, once the join has happened, I can use the PK's from the two tables to constrain the affected rows and improve performance (see below).

    2) I use this view as the datasource in SSDT to load the existing target table dwh.DIAGNOSIS.  I only need to inject new or changed rows.  A logical delete implies that the row already exists in the target table, I just need to close it out during post-processing.  And I ignore NOCHANGEs to columns of non-interest via the CHECKSUM above.

    ALTER VIEW [dwh].[vwDIAGNOSIS]
    AS
    SELECT scd.ACTION
      ,src.FACILITY_STAY_EPISODE_ID
      ,src.FACILITY_STAY_ID
      ,src.FACILITY_ID
      ,src.diagnosis_type
      ,src.diagnosis_code
      ,src.clinical_codeset
      ,src.clinical_codeset_curr
      ,src.diagnosis_code_curr
      ,src.diagnosis_code_ccl
      ,src.condition_onset_flag
      ,src.replica_uuid
      ,src.CHECKSUM
    FROM trn.DIAGNOSIS src
    INNER JOIN scd.vwDIAGNOSIS_Action scd
    ON  src.DIAGNOSIS_ID=scd.SRC_DIAGNOSIS_ID
    WHERE scd.ACTION IN ('INSERT','CHANGE')

    I'm using an INNER JOIN between the source table and the view + WHERE clause to constrain rows to only INSERTs. and CHANGEs.

    Note that Table Lock must be unchecked in the OLEDB target table in SSDT.

    Note also that, even after the processing has finished, the action view (first view above) still returns the exact same results for as before for INSERT/CHANGE/DELETE, since 1) the source data is still in place (it hasn't been truncated), and 2) any new rows are NOCHANGE rows.

    3)  I use this view in post-processing to close out affected rows in the target table:

    ALTER VIEW [scd].[vwDIAGNOSIS_Close]
    AS

    -- Use the LEAD() windowing function over the ENTIRE TABLE to set VALID_TO and CURRENT_ROW
    WITH cteTarget
    AS (
    SELECT DIAGNOSIS_ID AS TGT_DIAGNOSIS_ID
      ,LEAD(VALID_FROM,1,{D '9999-12-31'}) OVER (
        PARTITION BY FACILITY_STAY_EPISODE_ID, diagnosis_type
        ORDER BY VALID_FROM
       ) AS VALID_TO
      ,CASE
        WHEN LEAD(VALID_FROM,1,NULL) OVER (
          PARTITION BY FACILITY_STAY_EPISODE_ID, diagnosis_type
          ORDER BY VALID_FROM
        ) IS NULL THEN 'Y' ELSE 'N'
       END AS CURRENT_ROW
    FROM dwh.DIAGNOSIS WITH (NOLOCK)
    )

    -- Join with view to only update changed or deleted rows
    SELECT scd.ACTION
      ,scd.SRC_DIAGNOSIS_ID
      ,tgt.TGT_DIAGNOSIS_ID
      ,tgt.VALID_TO
      ,tgt.CURRENT_ROW
    FROM cteTarget tgt
    INNER JOIN scd.vwDIAGNOSIS_Action scd
    ON  tgt.TGT_DIAGNOSIS_ID=scd.TGT_DIAGNOSIS_ID
    WHERE scd.ACTION IN ('CHANGE','DELETE')

    Note the WITH (NOLOCK) hint on the target table.

    Note also that I only need to close out rows that were in the target table previously, i.e. those target SK's captured in the action view.  So, the CTE looks ahead across the entire table to get the VALID_FROM date of the next row, and sets VALID_TO to that date on the current row.  And the INNER JOIN constrains the affected rows to only CHANGEs and DELETEs of the old rows in the target table.

    4) I use this stored procedure to actually implement the processing.  It is in an SQL Task in SSDT:

    ALTER PROCEDURE [scd].[sp_MoH_PostProcess_DIAGNOSIS]
    AS
    BEGIN

      SET NOCOUNT ON

      -- Only process current rows - once a row is closed out it must never be changed again
      BEGIN TRANSACTION

       -- Execute this as two separate batches
       -- Post-processing for changes
       PRINT 'Processing CHANGES'

       UPDATE  dwh.DIAGNOSIS
       SET   VALID_TO=scd.VALID_TO
          ,CURRENT_ROW=scd.CURRENT_ROW
       FROM   dwh.DIAGNOSIS tgt
       INNER JOIN scd.vwDIAGNOSIS_Close scd
       ON   tgt.DIAGNOSIS_ID=scd.TGT_DIAGNOSIS_ID
       WHERE  tgt.CURRENT_ROW='Y'
       AND   scd.ACTION='CHANGE'

       PRINT CONCAT(@@rowcount,' rows affected
       ')

       -- Post-processing for deletes
       PRINT 'Processing DELETES'

       UPDATE  dwh.DIAGNOSIS
       SET   VALID_TO=GETDATE()
          ,CURRENT_ROW='N'
       FROM   dwh.DIAGNOSIS tgt
       INNER JOIN scd.vwDIAGNOSIS_Close scd
       ON   tgt.DIAGNOSIS_ID=scd.TGT_DIAGNOSIS_ID
       WHERE  tgt.CURRENT_ROW='Y'
       AND   scd.ACTION='DELETE'

       PRINT CONCAT(@@rowcount,' rows affected
       ')

      COMMIT

    END

    Note that, after the close out processing, the action view no longer returns INSERTS or CHANGES, since 1) the old rows have been closed out, 2) we're only concerned with current rows in the target table, 3) any inserted rows are now NOCHANGEs.

    I can now truncate the source table if I wish, in preparation for the next load.

Viewing 3 posts - 1 through 2 (of 2 total)

You must be logged in to reply to this topic. Login to reply