What is the best strategy to check for changes to a record in DWH ETL processes

  • Hi All,

    We have a current ETL process that takes a delta and uses MERGE to decide whether the data needs to be updated.  The current process uses an audit table (effectively a hand-built CDC) to drive the delta but this has proved to be unreliable.  Most tables do have a ROWSTAMP field so we are going to use this to drive the delta instead. (applying CDC is not an option unfortunately)

    We have about 120 tables to migrate each run, The current process has a separate stored procedure for each table that has a hard-coded list of fields doing a null = null or value - value comparison E.g. (and this is a fairly narrow table...)

    AND
    (
    SRC.[Asset_ID] <> TGT.[Asset_ID] OR (SRC.[Asset_ID] IS NULL AND TGT.[Asset_ID] IS NOT NULL) OR (SRC.[Asset_ID] IS NOT NULL AND TGT.[Asset_ID] IS NULL)
    OR SRC.[Attribute_ID] <> TGT.[Attribute_ID] OR (SRC.[Attribute_ID] IS NULL AND TGT.[Attribute_ID] IS NOT NULL) OR (SRC.[Attribute_ID] IS NOT NULL AND TGT.[Attribute_ID] IS NULL)
    OR SRC.[IsDeleted_BT] <> TGT.[IsDeleted_BT] OR (SRC.[IsDeleted_BT] IS NULL AND TGT.[IsDeleted_BT] IS NOT NULL) OR (SRC.[IsDeleted_BT] IS NOT NULL AND TGT.[IsDeleted_BT] IS NULL)
    OR SRC.[Location_ID] <> TGT.[Location_ID] OR (SRC.[Location_ID] IS NULL AND TGT.[Location_ID] IS NOT NULL) OR (SRC.[Location_ID] IS NOT NULL AND TGT.[Location_ID] IS NULL)
    OR SRC.[Position_ID] <> TGT.[Position_ID] OR (SRC.[Position_ID] IS NULL AND TGT.[Position_ID] IS NOT NULL) OR (SRC.[Position_ID] IS NOT NULL AND TGT.[Position_ID] IS NULL)
    OR SRC.[Type_ID] <> TGT.[Type_ID] OR (SRC.[Type_ID] IS NULL AND TGT.[Type_ID] IS NOT NULL) OR (SRC.[Type_ID] IS NOT NULL AND TGT.[Type_ID] IS NULL)
    OR SRC.[Condition_ID] <> TGT.[Condition_ID] OR (SRC.[Condition_ID] IS NULL AND TGT.[Condition_ID] IS NOT NULL) OR (SRC.[Condition_ID] IS NOT NULL AND TGT.[Condition_ID] IS NULL)
    OR SRC.[Condition_AsAt_DT] <> TGT.[Condition_AsAt_DT] OR (SRC.[Condition_AsAt_DT] IS NULL AND TGT.[Condition_AsAt_DT] IS NOT NULL) OR (SRC.[Condition_AsAt_DT] IS NOT NULL AND TGT.[Condition_AsAt_DT] IS NULL)
    OR SRC.[Priority_ID] <> TGT.[Priority_ID] OR (SRC.[Priority_ID] IS NULL AND TGT.[Priority_ID] IS NOT NULL) OR (SRC.[Priority_ID] IS NOT NULL AND TGT.[Priority_ID] IS NULL)
    OR SRC.[Manufacturer_ID] <> TGT.[Manufacturer_ID] OR (SRC.[Manufacturer_ID] IS NULL AND TGT.[Manufacturer_ID] IS NOT NULL) OR (SRC.[Manufacturer_ID] IS NOT NULL AND TGT.[Manufacturer_ID] IS NULL)
    OR SRC.[Pattern_ID] <> TGT.[Pattern_ID] OR (SRC.[Pattern_ID] IS NULL AND TGT.[Pattern_ID] IS NOT NULL) OR (SRC.[Pattern_ID] IS NOT NULL AND TGT.[Pattern_ID] IS NULL)
    OR SRC.[Material_ID] <> TGT.[Material_ID] OR (SRC.[Material_ID] IS NULL AND TGT.[Material_ID] IS NOT NULL) OR (SRC.[Material_ID] IS NOT NULL AND TGT.[Material_ID] IS NULL)
    OR SRC.[Colour_ID] <> TGT.[Colour_ID] OR (SRC.[Colour_ID] IS NULL AND TGT.[Colour_ID] IS NOT NULL) OR (SRC.[Colour_ID] IS NOT NULL AND TGT.[Colour_ID] IS NULL)
    OR SRC.[Comment_Lines_VC] <> TGT.[Comment_Lines_VC] OR (SRC.[Comment_Lines_VC] IS NULL AND TGT.[Comment_Lines_VC] IS NOT NULL) OR (SRC.[Comment_Lines_VC] IS NOT NULL AND TGT.[Comment_Lines_VC] IS NULL)
    OR SRC.[Risk_ID] <> TGT.[Risk_ID] OR (SRC.[Risk_ID] IS NULL AND TGT.[Risk_ID] IS NOT NULL) OR (SRC.[Risk_ID] IS NOT NULL AND TGT.[Risk_ID] IS NULL)
    OR SRC.[Group_ID] <> TGT.[Group_ID] OR (SRC.[Group_ID] IS NULL AND TGT.[Group_ID] IS NOT NULL) OR (SRC.[Group_ID] IS NOT NULL AND TGT.[Group_ID] IS NULL)
    OR SRC.[Image_Reference_VC] <> TGT.[Image_Reference_VC] OR (SRC.[Image_Reference_VC] IS NULL AND TGT.[Image_Reference_VC] IS NOT NULL) OR (SRC.[Image_Reference_VC] IS NOT NULL AND TGT.[Image_Reference_VC] IS NULL)
    OR SRC.[Media_Reference_VC] <> TGT.[Media_Reference_VC] OR (SRC.[Media_Reference_VC] IS NULL AND TGT.[Media_Reference_VC] IS NOT NULL) OR (SRC.[Media_Reference_VC] IS NOT NULL AND TGT.[Media_Reference_VC] IS NULL)
    OR SRC.[Current_Value_VC] <> TGT.[Current_Value_VC] OR (SRC.[Current_Value_VC] IS NULL AND TGT.[Current_Value_VC] IS NOT NULL) OR (SRC.[Current_Value_VC] IS NOT NULL AND TGT.[Current_Value_VC] IS NULL)
    OR SRC.[Quantity_DC] <> TGT.[Quantity_DC] OR (SRC.[Quantity_DC] IS NULL AND TGT.[Quantity_DC] IS NOT NULL) OR (SRC.[Quantity_DC] IS NOT NULL AND TGT.[Quantity_DC] IS NULL)
    OR SRC.[Fitted_Renewed_Date_DT] <> TGT.[Fitted_Renewed_Date_DT] OR (SRC.[Fitted_Renewed_Date_DT] IS NULL AND TGT.[Fitted_Renewed_Date_DT] IS NOT NULL) OR (SRC.[Fitted_Renewed_Date_DT] IS NOT NULL AND TGT.[Fitted_Renewed_Date_DT] IS NULL)
    OR SRC.[Default_Expected_Life_IN] <> TGT.[Default_Expected_Life_IN] OR (SRC.[Default_Expected_Life_IN] IS NULL AND TGT.[Default_Expected_Life_IN] IS NOT NULL) OR (SRC.[Default_Expected_Life_IN] IS NOT NULL AND TGT.[Default_Expected_Life_IN] IS NULL)
    OR SRC.[Expected_Life_IN] <> TGT.[Expected_Life_IN] OR (SRC.[Expected_Life_IN] IS NULL AND TGT.[Expected_Life_IN] IS NOT NULL) OR (SRC.[Expected_Life_IN] IS NOT NULL AND TGT.[Expected_Life_IN] IS NULL)
    OR SRC.[Default_Replacement_Type_ID] <> TGT.[Default_Replacement_Type_ID] OR (SRC.[Default_Replacement_Type_ID] IS NULL AND TGT.[Default_Replacement_Type_ID] IS NOT NULL) OR (SRC.[Default_Replacement_Type_ID] IS NOT NULL AND TGT.[Default_Replacement_Type_ID] IS NULL)
    OR SRC.[Replacement_Type_ID] <> TGT.[Replacement_Type_ID] OR (SRC.[Replacement_Type_ID] IS NULL AND TGT.[Replacement_Type_ID] IS NOT NULL) OR (SRC.[Replacement_Type_ID] IS NOT NULL AND TGT.[Replacement_Type_ID] IS NULL)
    OR SRC.[Default_Replacement_Pattern_ID] <> TGT.[Default_Replacement_Pattern_ID] OR (SRC.[Default_Replacement_Pattern_ID] IS NULL AND TGT.[Default_Replacement_Pattern_ID] IS NOT NULL) OR (SRC.[Default_Replacement_Pattern_ID] IS NOT NULL AND TGT.[Default_Replacement_Pattern_ID] IS NULL)
    OR SRC.[Replacement_Pattern_ID] <> TGT.[Replacement_Pattern_ID] OR (SRC.[Replacement_Pattern_ID] IS NULL AND TGT.[Replacement_Pattern_ID] IS NOT NULL) OR (SRC.[Replacement_Pattern_ID] IS NOT NULL AND TGT.[Replacement_Pattern_ID] IS NULL)
    OR SRC.[Default_Replacement_Expected_Life_IN] <> TGT.[Default_Replacement_Expected_Life_IN] OR (SRC.[Default_Replacement_Expected_Life_IN] IS NULL AND TGT.[Default_Replacement_Expected_Life_IN] IS NOT NULL) OR (SRC.[Default_Replacement_Expected_Life_IN] IS NOT NULL AND TGT.[Default_Replacement_Expected_Life_IN] IS NULL)
    OR SRC.[Replacement_Expected_Life_IN] <> TGT.[Replacement_Expected_Life_IN] OR (SRC.[Replacement_Expected_Life_IN] IS NULL AND TGT.[Replacement_Expected_Life_IN] IS NOT NULL) OR (SRC.[Replacement_Expected_Life_IN] IS NOT NULL AND TGT.[Replacement_Expected_Life_IN] IS NULL)
    OR SRC.[Initial_Override_Replacement_Date_DT] <> TGT.[Initial_Override_Replacement_Date_DT] OR (SRC.[Initial_Override_Replacement_Date_DT] IS NULL AND TGT.[Initial_Override_Replacement_Date_DT] IS NOT NULL) OR (SRC.[Initial_Override_Replacement_Date_DT] IS NOT NULL AND TGT.[Initial_Override_Replacement_Date_DT] IS NULL)
    OR SRC.[Subsequent_Override_Replacement_Date_DT] <> TGT.[Subsequent_Override_Replacement_Date_DT] OR (SRC.[Subsequent_Override_Replacement_Date_DT] IS NULL AND TGT.[Subsequent_Override_Replacement_Date_DT] IS NOT NULL) OR (SRC.[Subsequent_Override_Replacement_Date_DT] IS NOT NULL AND TGT.[Subsequent_Override_Replacement_Date_DT] IS NULL)
    OR SRC.[Default_Replacement_Cost_DC] <> TGT.[Default_Replacement_Cost_DC] OR (SRC.[Default_Replacement_Cost_DC] IS NULL AND TGT.[Default_Replacement_Cost_DC] IS NOT NULL) OR (SRC.[Default_Replacement_Cost_DC] IS NOT NULL AND TGT.[Default_Replacement_Cost_DC] IS NULL)
    OR SRC.[Replacement_Cost_DC] <> TGT.[Replacement_Cost_DC] OR (SRC.[Replacement_Cost_DC] IS NULL AND TGT.[Replacement_Cost_DC] IS NOT NULL) OR (SRC.[Replacement_Cost_DC] IS NOT NULL AND TGT.[Replacement_Cost_DC] IS NULL)
    OR SRC.[Default_Replacement_Cost_AsAt_DT] <> TGT.[Default_Replacement_Cost_AsAt_DT] OR (SRC.[Default_Replacement_Cost_AsAt_DT] IS NULL AND TGT.[Default_Replacement_Cost_AsAt_DT] IS NOT NULL) OR (SRC.[Default_Replacement_Cost_AsAt_DT] IS NOT NULL AND TGT.[Default_Replacement_Cost_AsAt_DT] IS NULL)
    OR SRC.[Replacement_Cost_AsAt_DT] <> TGT.[Replacement_Cost_AsAt_DT] OR (SRC.[Replacement_Cost_AsAt_DT] IS NULL AND TGT.[Replacement_Cost_AsAt_DT] IS NOT NULL) OR (SRC.[Replacement_Cost_AsAt_DT] IS NOT NULL AND TGT.[Replacement_Cost_AsAt_DT] IS NULL)
    OR SRC.[Default_Cyclical_Cost_DC] <> TGT.[Default_Cyclical_Cost_DC] OR (SRC.[Default_Cyclical_Cost_DC] IS NULL AND TGT.[Default_Cyclical_Cost_DC] IS NOT NULL) OR (SRC.[Default_Cyclical_Cost_DC] IS NOT NULL AND TGT.[Default_Cyclical_Cost_DC] IS NULL)
    OR SRC.[Cyclical_Cost_DC] <> TGT.[Cyclical_Cost_DC] OR (SRC.[Cyclical_Cost_DC] IS NULL AND TGT.[Cyclical_Cost_DC] IS NOT NULL) OR (SRC.[Cyclical_Cost_DC] IS NOT NULL AND TGT.[Cyclical_Cost_DC] IS NULL)
    OR SRC.[Default_Cyclical_Cost_AsAt_DT] <> TGT.[Default_Cyclical_Cost_AsAt_DT] OR (SRC.[Default_Cyclical_Cost_AsAt_DT] IS NULL AND TGT.[Default_Cyclical_Cost_AsAt_DT] IS NOT NULL) OR (SRC.[Default_Cyclical_Cost_AsAt_DT] IS NOT NULL AND TGT.[Default_Cyclical_Cost_AsAt_DT] IS NULL)
    OR SRC.[Cyclical_Cost_AsAt_DT] <> TGT.[Cyclical_Cost_AsAt_DT] OR (SRC.[Cyclical_Cost_AsAt_DT] IS NULL AND TGT.[Cyclical_Cost_AsAt_DT] IS NOT NULL) OR (SRC.[Cyclical_Cost_AsAt_DT] IS NOT NULL AND TGT.[Cyclical_Cost_AsAt_DT] IS NULL)
    OR SRC.[Cyclical_Date_DT] <> TGT.[Cyclical_Date_DT] OR (SRC.[Cyclical_Date_DT] IS NULL AND TGT.[Cyclical_Date_DT] IS NOT NULL) OR (SRC.[Cyclical_Date_DT] IS NOT NULL AND TGT.[Cyclical_Date_DT] IS NULL)
    OR SRC.[Override_Cyclical_Date_DT] <> TGT.[Override_Cyclical_Date_DT] OR (SRC.[Override_Cyclical_Date_DT] IS NULL AND TGT.[Override_Cyclical_Date_DT] IS NOT NULL) OR (SRC.[Override_Cyclical_Date_DT] IS NOT NULL AND TGT.[Override_Cyclical_Date_DT] IS NULL)
    OR SRC.[Default_Cyclical_Period_IN] <> TGT.[Default_Cyclical_Period_IN] OR (SRC.[Default_Cyclical_Period_IN] IS NULL AND TGT.[Default_Cyclical_Period_IN] IS NOT NULL) OR (SRC.[Default_Cyclical_Period_IN] IS NOT NULL AND TGT.[Default_Cyclical_Period_IN] IS NULL)
    OR SRC.[Cyclical_Period_IN] <> TGT.[Cyclical_Period_IN] OR (SRC.[Cyclical_Period_IN] IS NULL AND TGT.[Cyclical_Period_IN] IS NOT NULL) OR (SRC.[Cyclical_Period_IN] IS NOT NULL AND TGT.[Cyclical_Period_IN] IS NULL)
    OR SRC.[Last_Cyclical_Cost_ID] <> TGT.[Last_Cyclical_Cost_ID] OR (SRC.[Last_Cyclical_Cost_ID] IS NULL AND TGT.[Last_Cyclical_Cost_ID] IS NOT NULL) OR (SRC.[Last_Cyclical_Cost_ID] IS NOT NULL AND TGT.[Last_Cyclical_Cost_ID] IS NULL)
    OR SRC.[Last_Planned_Cost_ID] <> TGT.[Last_Planned_Cost_ID] OR (SRC.[Last_Planned_Cost_ID] IS NULL AND TGT.[Last_Planned_Cost_ID] IS NOT NULL) OR (SRC.[Last_Planned_Cost_ID] IS NOT NULL AND TGT.[Last_Planned_Cost_ID] IS NULL)
    OR SRC.[User_ID] <> TGT.[User_ID] OR (SRC.[User_ID] IS NULL AND TGT.[User_ID] IS NOT NULL) OR (SRC.[User_ID] IS NOT NULL AND TGT.[User_ID] IS NULL)
    OR SRC.[Created_Date_DT] <> TGT.[Created_Date_DT] OR (SRC.[Created_Date_DT] IS NULL AND TGT.[Created_Date_DT] IS NOT NULL) OR (SRC.[Created_Date_DT] IS NOT NULL AND TGT.[Created_Date_DT] IS NULL)
    OR SRC.[Feature_ID] <> TGT.[Feature_ID] OR (SRC.[Feature_ID] IS NULL AND TGT.[Feature_ID] IS NOT NULL) OR (SRC.[Feature_ID] IS NOT NULL AND TGT.[Feature_ID] IS NULL)
    OR SRC.[Damage_ID] <> TGT.[Damage_ID] OR (SRC.[Damage_ID] IS NULL AND TGT.[Damage_ID] IS NOT NULL) OR (SRC.[Damage_ID] IS NOT NULL AND TGT.[Damage_ID] IS NULL)
    OR SRC.[Accessibility_ID] <> TGT.[Accessibility_ID] OR (SRC.[Accessibility_ID] IS NULL AND TGT.[Accessibility_ID] IS NOT NULL) OR (SRC.[Accessibility_ID] IS NOT NULL AND TGT.[Accessibility_ID] IS NULL)
    OR SRC.[Access_Restrictions_VC] <> TGT.[Access_Restrictions_VC] OR (SRC.[Access_Restrictions_VC] IS NULL AND TGT.[Access_Restrictions_VC] IS NOT NULL) OR (SRC.[Access_Restrictions_VC] IS NOT NULL AND TGT.[Access_Restrictions_VC] IS NULL)
    OR SRC.[Special_Instructions_VC] <> TGT.[Special_Instructions_VC] OR (SRC.[Special_Instructions_VC] IS NULL AND TGT.[Special_Instructions_VC] IS NOT NULL) OR (SRC.[Special_Instructions_VC] IS NOT NULL AND TGT.[Special_Instructions_VC] IS NULL)
    OR SRC.[Default_Subsequent_Replacement_Cost_DC] <> TGT.[Default_Subsequent_Replacement_Cost_DC] OR (SRC.[Default_Subsequent_Replacement_Cost_DC] IS NULL AND TGT.[Default_Subsequent_Replacement_Cost_DC] IS NOT NULL) OR (SRC.[Default_Subsequent_Replacement_Cost_DC] IS NOT NULL AND TGT.[Default_Subsequent_Replacement_Cost_DC] IS NULL)
    OR SRC.[Default_Subsequent_Replacement_Cost_AsAt_DT] <> TGT.[Default_Subsequent_Replacement_Cost_AsAt_DT] OR (SRC.[Default_Subsequent_Replacement_Cost_AsAt_DT] IS NULL AND TGT.[Default_Subsequent_Replacement_Cost_AsAt_DT] IS NOT NULL) OR (SRC.[Default_Subsequent_Replacement_Cost_AsAt_DT] IS NOT NULL AND TGT.[Default_Subsequent_Replacement_Cost_AsAt_DT] IS NULL)
    OR SRC.[Max_Smooth_Date_DT] <> TGT.[Max_Smooth_Date_DT] OR (SRC.[Max_Smooth_Date_DT] IS NULL AND TGT.[Max_Smooth_Date_DT] IS NOT NULL) OR (SRC.[Max_Smooth_Date_DT] IS NOT NULL AND TGT.[Max_Smooth_Date_DT] IS NULL)
    )

    The business is unwilling to allow me to use a 3rd party SSIS plugins like Konesan's Hash task so I am looking for alternatives that mean we don't have to maintain a complex and labour intensive set of stored procedures.  I have two broad strategies which I am considering.

    1. A function that will take the table name, PK field name (I have single field PKs thank goodness) and the PK identifier  that looks at the table schema, builds a dynamic query to pull out all of the columns except the ROWSTAMP, consistently cast every data type to VARCHAR, will COALESCE() NULL values, CONCAT() them together with a separator then  run CHECKSUM_BINARY() on the resulting string.  This means I can compare the values in the source OLTP system with the values in the DWH current record to decide whether it goes down the add, update or ignore route.
    2. Write code with a similar logic but use it instead to generate the stored procedure for that particular table.  We would still need to apply a different stored procedure to each SSIS package but at least when the database schema changes or new tables are added to the DWH the change is relatively painless.

    I like 1 but I think it could be a real performance hog as for each row in the ETL process it is effectively going to have to rebuild the dynamic SQL but means the process is more robust because it is not sensitive to changes in schema for the source tables  (They would however give a different HASH result so all would get updated on first delta)

    We are currently running on-prem VMs on SQL 2017 for the OLTP and Azure hosted VMs on SQL 2016 for the DWH, however there is a strong possibility that the DWH will move to SQL Azure in the near future just in case that changes the proposed solution.

  • Not answering your question, but a quick note about your existing SQL, as it can be made considerably more concise, using the following structure:

    SELECT *
    FROM src
    JOIN trg
    ON src.Id = trg.Id
    WHERE NOT EXISTS
    (
    SELECT src.col1, src.col2 INTERSECT SELECT trg.col1, trg.col2
    );

    If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

  • #1 is (way) too much overhead.

    #2 is better.  You want static code that is generated dynamically.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • check the following link(s) if it helps:

    1) https://michaeljswart.com/2012/05/t-sql-merge-statement-syntax-generator/

    2) https://www.pigeonsql.com/single-post/Dynamic-SQL-Merge-Script-for-Generate-Merge-statement

    CDC would have been better as there are only 4 operation code to find which rows are altered viz. Insert, Delete, Before Update and After Update.  Definitely, each table needs a separate data flow in SSIS.

    Good luck.

    =======================================================================

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply