What is the best strategy to check for changes to a record in DWH ETL processes

  • Hi All,

    We have a current ETL process that takes a delta and uses MERGE to decide whether the data needs to be updated.  The current process uses an audit table (effectively a hand-built CDC) to drive the delta but this has proved to be unreliable.  Most tables do have a ROWSTAMP field so we are going to use this to drive the delta instead. (applying CDC is not an option unfortunately)

    We have about 120 tables to migrate each run, The current process has a separate stored procedure for each table that has a hard-coded list of fields doing a null = null or value - value comparison E.g. (and this is a fairly narrow table...)

    AND
    (
    SRC.[Asset_ID] <> TGT.[Asset_ID] OR (SRC.[Asset_ID] IS NULL AND TGT.[Asset_ID] IS NOT NULL) OR (SRC.[Asset_ID] IS NOT NULL AND TGT.[Asset_ID] IS NULL)
    OR SRC.[Attribute_ID] <> TGT.[Attribute_ID] OR (SRC.[Attribute_ID] IS NULL AND TGT.[Attribute_ID] IS NOT NULL) OR (SRC.[Attribute_ID] IS NOT NULL AND TGT.[Attribute_ID] IS NULL)
    OR SRC.[IsDeleted_BT] <> TGT.[IsDeleted_BT] OR (SRC.[IsDeleted_BT] IS NULL AND TGT.[IsDeleted_BT] IS NOT NULL) OR (SRC.[IsDeleted_BT] IS NOT NULL AND TGT.[IsDeleted_BT] IS NULL)
    OR SRC.[Location_ID] <> TGT.[Location_ID] OR (SRC.[Location_ID] IS NULL AND TGT.[Location_ID] IS NOT NULL) OR (SRC.[Location_ID] IS NOT NULL AND TGT.[Location_ID] IS NULL)
    OR SRC.[Position_ID] <> TGT.[Position_ID] OR (SRC.[Position_ID] IS NULL AND TGT.[Position_ID] IS NOT NULL) OR (SRC.[Position_ID] IS NOT NULL AND TGT.[Position_ID] IS NULL)
    OR SRC.[Type_ID] <> TGT.[Type_ID] OR (SRC.[Type_ID] IS NULL AND TGT.[Type_ID] IS NOT NULL) OR (SRC.[Type_ID] IS NOT NULL AND TGT.[Type_ID] IS NULL)
    OR SRC.[Condition_ID] <> TGT.[Condition_ID] OR (SRC.[Condition_ID] IS NULL AND TGT.[Condition_ID] IS NOT NULL) OR (SRC.[Condition_ID] IS NOT NULL AND TGT.[Condition_ID] IS NULL)
    OR SRC.[Condition_AsAt_DT] <> TGT.[Condition_AsAt_DT] OR (SRC.[Condition_AsAt_DT] IS NULL AND TGT.[Condition_AsAt_DT] IS NOT NULL) OR (SRC.[Condition_AsAt_DT] IS NOT NULL AND TGT.[Condition_AsAt_DT] IS NULL)
    OR SRC.[Priority_ID] <> TGT.[Priority_ID] OR (SRC.[Priority_ID] IS NULL AND TGT.[Priority_ID] IS NOT NULL) OR (SRC.[Priority_ID] IS NOT NULL AND TGT.[Priority_ID] IS NULL)
    OR SRC.[Manufacturer_ID] <> TGT.[Manufacturer_ID] OR (SRC.[Manufacturer_ID] IS NULL AND TGT.[Manufacturer_ID] IS NOT NULL) OR (SRC.[Manufacturer_ID] IS NOT NULL AND TGT.[Manufacturer_ID] IS NULL)
    OR SRC.[Pattern_ID] <> TGT.[Pattern_ID] OR (SRC.[Pattern_ID] IS NULL AND TGT.[Pattern_ID] IS NOT NULL) OR (SRC.[Pattern_ID] IS NOT NULL AND TGT.[Pattern_ID] IS NULL)
    OR SRC.[Material_ID] <> TGT.[Material_ID] OR (SRC.[Material_ID] IS NULL AND TGT.[Material_ID] IS NOT NULL) OR (SRC.[Material_ID] IS NOT NULL AND TGT.[Material_ID] IS NULL)
    OR SRC.[Colour_ID] <> TGT.[Colour_ID] OR (SRC.[Colour_ID] IS NULL AND TGT.[Colour_ID] IS NOT NULL) OR (SRC.[Colour_ID] IS NOT NULL AND TGT.[Colour_ID] IS NULL)
    OR SRC.[Comment_Lines_VC] <> TGT.[Comment_Lines_VC] OR (SRC.[Comment_Lines_VC] IS NULL AND TGT.[Comment_Lines_VC] IS NOT NULL) OR (SRC.[Comment_Lines_VC] IS NOT NULL AND TGT.[Comment_Lines_VC] IS NULL)
    OR SRC.[Risk_ID] <> TGT.[Risk_ID] OR (SRC.[Risk_ID] IS NULL AND TGT.[Risk_ID] IS NOT NULL) OR (SRC.[Risk_ID] IS NOT NULL AND TGT.[Risk_ID] IS NULL)
    OR SRC.[Group_ID] <> TGT.[Group_ID] OR (SRC.[Group_ID] IS NULL AND TGT.[Group_ID] IS NOT NULL) OR (SRC.[Group_ID] IS NOT NULL AND TGT.[Group_ID] IS NULL)
    OR SRC.[Image_Reference_VC] <> TGT.[Image_Reference_VC] OR (SRC.[Image_Reference_VC] IS NULL AND TGT.[Image_Reference_VC] IS NOT NULL) OR (SRC.[Image_Reference_VC] IS NOT NULL AND TGT.[Image_Reference_VC] IS NULL)
    OR SRC.[Media_Reference_VC] <> TGT.[Media_Reference_VC] OR (SRC.[Media_Reference_VC] IS NULL AND TGT.[Media_Reference_VC] IS NOT NULL) OR (SRC.[Media_Reference_VC] IS NOT NULL AND TGT.[Media_Reference_VC] IS NULL)
    OR SRC.[Current_Value_VC] <> TGT.[Current_Value_VC] OR (SRC.[Current_Value_VC] IS NULL AND TGT.[Current_Value_VC] IS NOT NULL) OR (SRC.[Current_Value_VC] IS NOT NULL AND TGT.[Current_Value_VC] IS NULL)
    OR SRC.[Quantity_DC] <> TGT.[Quantity_DC] OR (SRC.[Quantity_DC] IS NULL AND TGT.[Quantity_DC] IS NOT NULL) OR (SRC.[Quantity_DC] IS NOT NULL AND TGT.[Quantity_DC] IS NULL)
    OR SRC.[Fitted_Renewed_Date_DT] <> TGT.[Fitted_Renewed_Date_DT] OR (SRC.[Fitted_Renewed_Date_DT] IS NULL AND TGT.[Fitted_Renewed_Date_DT] IS NOT NULL) OR (SRC.[Fitted_Renewed_Date_DT] IS NOT NULL AND TGT.[Fitted_Renewed_Date_DT] IS NULL)
    OR SRC.[Default_Expected_Life_IN] <> TGT.[Default_Expected_Life_IN] OR (SRC.[Default_Expected_Life_IN] IS NULL AND TGT.[Default_Expected_Life_IN] IS NOT NULL) OR (SRC.[Default_Expected_Life_IN] IS NOT NULL AND TGT.[Default_Expected_Life_IN] IS NULL)
    OR SRC.[Expected_Life_IN] <> TGT.[Expected_Life_IN] OR (SRC.[Expected_Life_IN] IS NULL AND TGT.[Expected_Life_IN] IS NOT NULL) OR (SRC.[Expected_Life_IN] IS NOT NULL AND TGT.[Expected_Life_IN] IS NULL)
    OR SRC.[Default_Replacement_Type_ID] <> TGT.[Default_Replacement_Type_ID] OR (SRC.[Default_Replacement_Type_ID] IS NULL AND TGT.[Default_Replacement_Type_ID] IS NOT NULL) OR (SRC.[Default_Replacement_Type_ID] IS NOT NULL AND TGT.[Default_Replacement_Type_ID] IS NULL)
    OR SRC.[Replacement_Type_ID] <> TGT.[Replacement_Type_ID] OR (SRC.[Replacement_Type_ID] IS NULL AND TGT.[Replacement_Type_ID] IS NOT NULL) OR (SRC.[Replacement_Type_ID] IS NOT NULL AND TGT.[Replacement_Type_ID] IS NULL)
    OR SRC.[Default_Replacement_Pattern_ID] <> TGT.[Default_Replacement_Pattern_ID] OR (SRC.[Default_Replacement_Pattern_ID] IS NULL AND TGT.[Default_Replacement_Pattern_ID] IS NOT NULL) OR (SRC.[Default_Replacement_Pattern_ID] IS NOT NULL AND TGT.[Default_Replacement_Pattern_ID] IS NULL)
    OR SRC.[Replacement_Pattern_ID] <> TGT.[Replacement_Pattern_ID] OR (SRC.[Replacement_Pattern_ID] IS NULL AND TGT.[Replacement_Pattern_ID] IS NOT NULL) OR (SRC.[Replacement_Pattern_ID] IS NOT NULL AND TGT.[Replacement_Pattern_ID] IS NULL)
    OR SRC.[Default_Replacement_Expected_Life_IN] <> TGT.[Default_Replacement_Expected_Life_IN] OR (SRC.[Default_Replacement_Expected_Life_IN] IS NULL AND TGT.[Default_Replacement_Expected_Life_IN] IS NOT NULL) OR (SRC.[Default_Replacement_Expected_Life_IN] IS NOT NULL AND TGT.[Default_Replacement_Expected_Life_IN] IS NULL)
    OR SRC.[Replacement_Expected_Life_IN] <> TGT.[Replacement_Expected_Life_IN] OR (SRC.[Replacement_Expected_Life_IN] IS NULL AND TGT.[Replacement_Expected_Life_IN] IS NOT NULL) OR (SRC.[Replacement_Expected_Life_IN] IS NOT NULL AND TGT.[Replacement_Expected_Life_IN] IS NULL)
    OR SRC.[Initial_Override_Replacement_Date_DT] <> TGT.[Initial_Override_Replacement_Date_DT] OR (SRC.[Initial_Override_Replacement_Date_DT] IS NULL AND TGT.[Initial_Override_Replacement_Date_DT] IS NOT NULL) OR (SRC.[Initial_Override_Replacement_Date_DT] IS NOT NULL AND TGT.[Initial_Override_Replacement_Date_DT] IS NULL)
    OR SRC.[Subsequent_Override_Replacement_Date_DT] <> TGT.[Subsequent_Override_Replacement_Date_DT] OR (SRC.[Subsequent_Override_Replacement_Date_DT] IS NULL AND TGT.[Subsequent_Override_Replacement_Date_DT] IS NOT NULL) OR (SRC.[Subsequent_Override_Replacement_Date_DT] IS NOT NULL AND TGT.[Subsequent_Override_Replacement_Date_DT] IS NULL)
    OR SRC.[Default_Replacement_Cost_DC] <> TGT.[Default_Replacement_Cost_DC] OR (SRC.[Default_Replacement_Cost_DC] IS NULL AND TGT.[Default_Replacement_Cost_DC] IS NOT NULL) OR (SRC.[Default_Replacement_Cost_DC] IS NOT NULL AND TGT.[Default_Replacement_Cost_DC] IS NULL)
    OR SRC.[Replacement_Cost_DC] <> TGT.[Replacement_Cost_DC] OR (SRC.[Replacement_Cost_DC] IS NULL AND TGT.[Replacement_Cost_DC] IS NOT NULL) OR (SRC.[Replacement_Cost_DC] IS NOT NULL AND TGT.[Replacement_Cost_DC] IS NULL)
    OR SRC.[Default_Replacement_Cost_AsAt_DT] <> TGT.[Default_Replacement_Cost_AsAt_DT] OR (SRC.[Default_Replacement_Cost_AsAt_DT] IS NULL AND TGT.[Default_Replacement_Cost_AsAt_DT] IS NOT NULL) OR (SRC.[Default_Replacement_Cost_AsAt_DT] IS NOT NULL AND TGT.[Default_Replacement_Cost_AsAt_DT] IS NULL)
    OR SRC.[Replacement_Cost_AsAt_DT] <> TGT.[Replacement_Cost_AsAt_DT] OR (SRC.[Replacement_Cost_AsAt_DT] IS NULL AND TGT.[Replacement_Cost_AsAt_DT] IS NOT NULL) OR (SRC.[Replacement_Cost_AsAt_DT] IS NOT NULL AND TGT.[Replacement_Cost_AsAt_DT] IS NULL)
    OR SRC.[Default_Cyclical_Cost_DC] <> TGT.[Default_Cyclical_Cost_DC] OR (SRC.[Default_Cyclical_Cost_DC] IS NULL AND TGT.[Default_Cyclical_Cost_DC] IS NOT NULL) OR (SRC.[Default_Cyclical_Cost_DC] IS NOT NULL AND TGT.[Default_Cyclical_Cost_DC] IS NULL)
    OR SRC.[Cyclical_Cost_DC] <> TGT.[Cyclical_Cost_DC] OR (SRC.[Cyclical_Cost_DC] IS NULL AND TGT.[Cyclical_Cost_DC] IS NOT NULL) OR (SRC.[Cyclical_Cost_DC] IS NOT NULL AND TGT.[Cyclical_Cost_DC] IS NULL)
    OR SRC.[Default_Cyclical_Cost_AsAt_DT] <> TGT.[Default_Cyclical_Cost_AsAt_DT] OR (SRC.[Default_Cyclical_Cost_AsAt_DT] IS NULL AND TGT.[Default_Cyclical_Cost_AsAt_DT] IS NOT NULL) OR (SRC.[Default_Cyclical_Cost_AsAt_DT] IS NOT NULL AND TGT.[Default_Cyclical_Cost_AsAt_DT] IS NULL)
    OR SRC.[Cyclical_Cost_AsAt_DT] <> TGT.[Cyclical_Cost_AsAt_DT] OR (SRC.[Cyclical_Cost_AsAt_DT] IS NULL AND TGT.[Cyclical_Cost_AsAt_DT] IS NOT NULL) OR (SRC.[Cyclical_Cost_AsAt_DT] IS NOT NULL AND TGT.[Cyclical_Cost_AsAt_DT] IS NULL)
    OR SRC.[Cyclical_Date_DT] <> TGT.[Cyclical_Date_DT] OR (SRC.[Cyclical_Date_DT] IS NULL AND TGT.[Cyclical_Date_DT] IS NOT NULL) OR (SRC.[Cyclical_Date_DT] IS NOT NULL AND TGT.[Cyclical_Date_DT] IS NULL)
    OR SRC.[Override_Cyclical_Date_DT] <> TGT.[Override_Cyclical_Date_DT] OR (SRC.[Override_Cyclical_Date_DT] IS NULL AND TGT.[Override_Cyclical_Date_DT] IS NOT NULL) OR (SRC.[Override_Cyclical_Date_DT] IS NOT NULL AND TGT.[Override_Cyclical_Date_DT] IS NULL)
    OR SRC.[Default_Cyclical_Period_IN] <> TGT.[Default_Cyclical_Period_IN] OR (SRC.[Default_Cyclical_Period_IN] IS NULL AND TGT.[Default_Cyclical_Period_IN] IS NOT NULL) OR (SRC.[Default_Cyclical_Period_IN] IS NOT NULL AND TGT.[Default_Cyclical_Period_IN] IS NULL)
    OR SRC.[Cyclical_Period_IN] <> TGT.[Cyclical_Period_IN] OR (SRC.[Cyclical_Period_IN] IS NULL AND TGT.[Cyclical_Period_IN] IS NOT NULL) OR (SRC.[Cyclical_Period_IN] IS NOT NULL AND TGT.[Cyclical_Period_IN] IS NULL)
    OR SRC.[Last_Cyclical_Cost_ID] <> TGT.[Last_Cyclical_Cost_ID] OR (SRC.[Last_Cyclical_Cost_ID] IS NULL AND TGT.[Last_Cyclical_Cost_ID] IS NOT NULL) OR (SRC.[Last_Cyclical_Cost_ID] IS NOT NULL AND TGT.[Last_Cyclical_Cost_ID] IS NULL)
    OR SRC.[Last_Planned_Cost_ID] <> TGT.[Last_Planned_Cost_ID] OR (SRC.[Last_Planned_Cost_ID] IS NULL AND TGT.[Last_Planned_Cost_ID] IS NOT NULL) OR (SRC.[Last_Planned_Cost_ID] IS NOT NULL AND TGT.[Last_Planned_Cost_ID] IS NULL)
    OR SRC.[User_ID] <> TGT.[User_ID] OR (SRC.[User_ID] IS NULL AND TGT.[User_ID] IS NOT NULL) OR (SRC.[User_ID] IS NOT NULL AND TGT.[User_ID] IS NULL)
    OR SRC.[Created_Date_DT] <> TGT.[Created_Date_DT] OR (SRC.[Created_Date_DT] IS NULL AND TGT.[Created_Date_DT] IS NOT NULL) OR (SRC.[Created_Date_DT] IS NOT NULL AND TGT.[Created_Date_DT] IS NULL)
    OR SRC.[Feature_ID] <> TGT.[Feature_ID] OR (SRC.[Feature_ID] IS NULL AND TGT.[Feature_ID] IS NOT NULL) OR (SRC.[Feature_ID] IS NOT NULL AND TGT.[Feature_ID] IS NULL)
    OR SRC.[Damage_ID] <> TGT.[Damage_ID] OR (SRC.[Damage_ID] IS NULL AND TGT.[Damage_ID] IS NOT NULL) OR (SRC.[Damage_ID] IS NOT NULL AND TGT.[Damage_ID] IS NULL)
    OR SRC.[Accessibility_ID] <> TGT.[Accessibility_ID] OR (SRC.[Accessibility_ID] IS NULL AND TGT.[Accessibility_ID] IS NOT NULL) OR (SRC.[Accessibility_ID] IS NOT NULL AND TGT.[Accessibility_ID] IS NULL)
    OR SRC.[Access_Restrictions_VC] <> TGT.[Access_Restrictions_VC] OR (SRC.[Access_Restrictions_VC] IS NULL AND TGT.[Access_Restrictions_VC] IS NOT NULL) OR (SRC.[Access_Restrictions_VC] IS NOT NULL AND TGT.[Access_Restrictions_VC] IS NULL)
    OR SRC.[Special_Instructions_VC] <> TGT.[Special_Instructions_VC] OR (SRC.[Special_Instructions_VC] IS NULL AND TGT.[Special_Instructions_VC] IS NOT NULL) OR (SRC.[Special_Instructions_VC] IS NOT NULL AND TGT.[Special_Instructions_VC] IS NULL)
    OR SRC.[Default_Subsequent_Replacement_Cost_DC] <> TGT.[Default_Subsequent_Replacement_Cost_DC] OR (SRC.[Default_Subsequent_Replacement_Cost_DC] IS NULL AND TGT.[Default_Subsequent_Replacement_Cost_DC] IS NOT NULL) OR (SRC.[Default_Subsequent_Replacement_Cost_DC] IS NOT NULL AND TGT.[Default_Subsequent_Replacement_Cost_DC] IS NULL)
    OR SRC.[Default_Subsequent_Replacement_Cost_AsAt_DT] <> TGT.[Default_Subsequent_Replacement_Cost_AsAt_DT] OR (SRC.[Default_Subsequent_Replacement_Cost_AsAt_DT] IS NULL AND TGT.[Default_Subsequent_Replacement_Cost_AsAt_DT] IS NOT NULL) OR (SRC.[Default_Subsequent_Replacement_Cost_AsAt_DT] IS NOT NULL AND TGT.[Default_Subsequent_Replacement_Cost_AsAt_DT] IS NULL)
    OR SRC.[Max_Smooth_Date_DT] <> TGT.[Max_Smooth_Date_DT] OR (SRC.[Max_Smooth_Date_DT] IS NULL AND TGT.[Max_Smooth_Date_DT] IS NOT NULL) OR (SRC.[Max_Smooth_Date_DT] IS NOT NULL AND TGT.[Max_Smooth_Date_DT] IS NULL)
    )

    The business is unwilling to allow me to use a 3rd party SSIS plugins like Konesan's Hash task so I am looking for alternatives that mean we don't have to maintain a complex and labour intensive set of stored procedures.  I have two broad strategies which I am considering.

    1. A function that will take the table name, PK field name (I have single field PKs thank goodness) and the PK identifier  that looks at the table schema, builds a dynamic query to pull out all of the columns except the ROWSTAMP, consistently cast every data type to VARCHAR, will COALESCE() NULL values, CONCAT() them together with a separator then  run CHECKSUM_BINARY() on the resulting string.  This means I can compare the values in the source OLTP system with the values in the DWH current record to decide whether it goes down the add, update or ignore route.
    2. Write code with a similar logic but use it instead to generate the stored procedure for that particular table.  We would still need to apply a different stored procedure to each SSIS package but at least when the database schema changes or new tables are added to the DWH the change is relatively painless.

    I like 1 but I think it could be a real performance hog as for each row in the ETL process it is effectively going to have to rebuild the dynamic SQL but means the process is more robust because it is not sensitive to changes in schema for the source tables  (They would however give a different HASH result so all would get updated on first delta)

    We are currently running on-prem VMs on SQL 2017 for the OLTP and Azure hosted VMs on SQL 2016 for the DWH, however there is a strong possibility that the DWH will move to SQL Azure in the near future just in case that changes the proposed solution.

  • Not answering your question, but a quick note about your existing SQL, as it can be made considerably more concise, using the following structure:

    SELECT *
    FROM src
    JOIN trg
    ON src.Id = trg.Id
    WHERE NOT EXISTS
    (
    SELECT src.col1, src.col2 INTERSECT SELECT trg.col1, trg.col2
    );

    If the answer to your question can be found with a brief Google search, please perform the search yourself, rather than expecting one of the SSC members to do it for you.
    See https://www.sqlservercentral.com/articles/forum-etiquette-how-to-post-datacode-on-a-forum-to-get-the-best-help/ for details of how to post T-SQL code-related questions.

  • #1 is (way) too much overhead.

    #2 is better.  You want static code that is generated dynamically.

    SQL DBA,SQL Server MVP(07, 08, 09) Prosecutor James Blackburn, in closing argument in the Fatal Vision murders trial: "If in the future, you should cry a tear, cry one for them [the murder victims]. If in the future, you should say a prayer, say one for them. And if in the future, you should light a candle, light one for them."

Viewing 3 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply