tl;dr: Is WITH (NOLOCK) always bad practice?
Well kind of disappointing that there's no comments or advice on this post, but it happens. I acknowledge that could certainly be my fault; perhaps too long and too much detail??? Anyway, there's always StackOverflow 😉
This post is just to wrap up with what appears to be working OK, in case others find this post in a future search, and perhaps find some of the ideas or approaches useful to their particular scenarios.
1) I use this view to determine what action I need to take with respect to SCD2 processing. It is a separate view since management would like metrics on INSERT/CHANGE/(Logical)DELETE/NOCHANGE.
ALTER VIEW [scd].[vwDIAGNOSIS_Action]
AS
-- We only care about current rows from the target table
-- Return the SK, NK, and CHECKSUM
WITH cteTarget AS(
SELECT tgt.DIAGNOSIS_ID, tgt.FACILITY_STAY_EPISODE_ID, tgt.diagnosis_type, tgt.[CHECKSUM]
FROM dwh.DIAGNOSIS tgt WITH (NOLOCK)
WHERE tgt.CURRENT_ROW='Y'
)
SELECT CASE
WHEN (src.FACILITY_STAY_EPISODE_ID IS NOT NULL AND tgt.FACILITY_STAY_EPISODE_ID IS NULL) THEN 'INSERT'
WHEN (src.FACILITY_STAY_EPISODE_ID IS NOT NULL AND tgt.FACILITY_STAY_EPISODE_ID IS NOT NULL) THEN
CASE
WHEN (src.replica_valid_to_date < (SELECT DataExtract_EndTime FROM meta.vwDataExtract_EndTime)) THEN 'DELETE'
ELSE
CASE
WHEN (src.CHECKSUM = tgt.CHECKSUM) THEN 'NOCHANGE'
ELSE 'CHANGE'
END
END
END AS ACTION
,src.DIAGNOSIS_ID AS SRC_DIAGNOSIS_ID
,tgt.DIAGNOSIS_ID AS TGT_DIAGNOSIS_ID
FROM trn.DIAGNOSIS src
LEFT JOIN cteTarget tgt
ON src.FACILITY_STAY_EPISODE_ID=tgt.FACILITY_STAY_EPISODE_ID
AND src.diagnosis_type=tgt.diagnosis_type
Note the WITH (NOLOCK) hint on the target table. Also note that, during this ETL processing, no one else will be accessing this table.
I don't know if the dirty read would allow reading new rows that the processing is inserting as the reading takes place - I'm not an expert on these SQL Server internals. But the processing appears to be working OK.
Both the source and target tables have an identity column named <TABLENAME>_ID as the PK, i.e. DIAGNOSIS_ID. However, they are different values in the two tables. So, I have to join on the NK's between the source and target. However, once the join has happened, I can use the PK's from the two tables to constrain the affected rows and improve performance (see below).
2) I use this view as the datasource in SSDT to load the existing target table dwh.DIAGNOSIS. I only need to inject new or changed rows. A logical delete implies that the row already exists in the target table, I just need to close it out during post-processing. And I ignore NOCHANGEs to columns of non-interest via the CHECKSUM above.
ALTER VIEW [dwh].[vwDIAGNOSIS]
AS
SELECT scd.ACTION
,src.FACILITY_STAY_EPISODE_ID
,src.FACILITY_STAY_ID
,src.FACILITY_ID
,src.diagnosis_type
,src.diagnosis_code
,src.clinical_codeset
,src.clinical_codeset_curr
,src.diagnosis_code_curr
,src.diagnosis_code_ccl
,src.condition_onset_flag
,src.replica_uuid
,src.CHECKSUM
FROM trn.DIAGNOSIS src
INNER JOIN scd.vwDIAGNOSIS_Action scd
ON src.DIAGNOSIS_ID=scd.SRC_DIAGNOSIS_ID
WHERE scd.ACTION IN ('INSERT','CHANGE')
I'm using an INNER JOIN between the source table and the view + WHERE clause to constrain rows to only INSERTs. and CHANGEs.
Note that Table Lock must be unchecked in the OLEDB target table in SSDT.
Note also that, even after the processing has finished, the action view (first view above) still returns the exact same results for as before for INSERT/CHANGE/DELETE, since 1) the source data is still in place (it hasn't been truncated), and 2) any new rows are NOCHANGE rows.
3) I use this view in post-processing to close out affected rows in the target table:
ALTER VIEW [scd].[vwDIAGNOSIS_Close]
AS
-- Use the LEAD() windowing function over the ENTIRE TABLE to set VALID_TO and CURRENT_ROW
WITH cteTarget
AS (
SELECT DIAGNOSIS_ID AS TGT_DIAGNOSIS_ID
,LEAD(VALID_FROM,1,{D '9999-12-31'}) OVER (
PARTITION BY FACILITY_STAY_EPISODE_ID, diagnosis_type
ORDER BY VALID_FROM
) AS VALID_TO
,CASE
WHEN LEAD(VALID_FROM,1,NULL) OVER (
PARTITION BY FACILITY_STAY_EPISODE_ID, diagnosis_type
ORDER BY VALID_FROM
) IS NULL THEN 'Y' ELSE 'N'
END AS CURRENT_ROW
FROM dwh.DIAGNOSIS WITH (NOLOCK)
)
-- Join with view to only update changed or deleted rows
SELECT scd.ACTION
,scd.SRC_DIAGNOSIS_ID
,tgt.TGT_DIAGNOSIS_ID
,tgt.VALID_TO
,tgt.CURRENT_ROW
FROM cteTarget tgt
INNER JOIN scd.vwDIAGNOSIS_Action scd
ON tgt.TGT_DIAGNOSIS_ID=scd.TGT_DIAGNOSIS_ID
WHERE scd.ACTION IN ('CHANGE','DELETE')
Note the WITH (NOLOCK) hint on the target table.
Note also that I only need to close out rows that were in the target table previously, i.e. those target SK's captured in the action view. So, the CTE looks ahead across the entire table to get the VALID_FROM date of the next row, and sets VALID_TO to that date on the current row. And the INNER JOIN constrains the affected rows to only CHANGEs and DELETEs of the old rows in the target table.
4) I use this stored procedure to actually implement the processing. It is in an SQL Task in SSDT:
ALTER PROCEDURE [scd].[sp_MoH_PostProcess_DIAGNOSIS]
AS
BEGIN
SET NOCOUNT ON
-- Only process current rows - once a row is closed out it must never be changed again
BEGIN TRANSACTION
-- Execute this as two separate batches
-- Post-processing for changes
PRINT 'Processing CHANGES'
UPDATE dwh.DIAGNOSIS
SET VALID_TO=scd.VALID_TO
,CURRENT_ROW=scd.CURRENT_ROW
FROM dwh.DIAGNOSIS tgt
INNER JOIN scd.vwDIAGNOSIS_Close scd
ON tgt.DIAGNOSIS_ID=scd.TGT_DIAGNOSIS_ID
WHERE tgt.CURRENT_ROW='Y'
AND scd.ACTION='CHANGE'
PRINT CONCAT(@@rowcount,' rows affected
')
-- Post-processing for deletes
PRINT 'Processing DELETES'
UPDATE dwh.DIAGNOSIS
SET VALID_TO=GETDATE()
,CURRENT_ROW='N'
FROM dwh.DIAGNOSIS tgt
INNER JOIN scd.vwDIAGNOSIS_Close scd
ON tgt.DIAGNOSIS_ID=scd.TGT_DIAGNOSIS_ID
WHERE tgt.CURRENT_ROW='Y'
AND scd.ACTION='DELETE'
PRINT CONCAT(@@rowcount,' rows affected
')
COMMIT
END
Note that, after the close out processing, the action view no longer returns INSERTS or CHANGES, since 1) the old rows have been closed out, 2) we're only concerned with current rows in the target table, 3) any inserted rows are now NOCHANGEs.
I can now truncate the source table if I wish, in preparation for the next load.