Join on two columns (using OR) for MERGE. Brain fog killing me!

  • Im sure I'm missing a simple trick here, but suffering post covid brain fog.

    Unsual situation, long story as short as I can, for context mainly.

    Currently have an on-prem database, which is current source of truth, which has been been doing a one-way export/update to a cloud based application using a scheduled pipeline in Azure Data factory for some months. Records are created and edited on prem, ADF upserts cloud system.

    Business has decided that we are migtrating all functions of the on-prem system to the cloud system.

    There will be a long transition period, for reasons I won't go into. However, at start of transition, the cloud system will become source of truth, with all new records created and edited there. At this point, export/update will be reversed. ADF pipeline disabled and we switch to using Azure Logic Apps to make real-time row by row updates from cloud to on-prem.

    I've been tasked with creating sProcs which the logic apps will call, on any insert, update or delete. There will only ever be a single row of data sent from a single cloud table. I then need to normalise in the sProc and upsert a number of tables on-prem. So far it's been so good, apart from one thing: the join for the merge, there will not be a relationship I can use for all scenarios.

    Contacts created prior to transition, have their on-prem (bigint) ID stored in the cloud system.

    Contacts created post start of transition will NOT, instead, they will have their cloud (guid) ID, stored in the on-prem system.

    If the developer of the logic apps can write back the onprem ID to the cloud, in the same execution as it creates the new rows on-prem then that would be best. I've not had an answer if that's possible so am hedging my bets.

    I'm putting my testbed in consumable code here. I don't like the OR in the join of the MERGE statement. If a GUID is not completely globally unique, then I could have problems. And this is where the brain fog is affecting me - I'd like to:

    1) JOIN on mig_contid = contID if mig_contid is NOT NULL and ignore cloudid = contguid

    2) But use cloudid = contguid only if mig_contid IS NULL.

    And I can't decide if I'm worrying too much, because even in 2) if I get a guid clash, I have problems.

    Appreciate this has now got very long, so thanks if you read this far. I can't shake the feeling I'm missing something extremely basic!

    /***************************************************************************************************
    02/02/2022SSC Sample

    OLD contact = Created on prem
    NEW contact = Created in cloud

    OLD: has cloud.mig_contid already populated with OnPrem:contID however Cloud.cloudid <> OnPrem.contGuid

    NEW: cloud.mig_contid IS NULL, however cloud.cloudid = OnPrem.contGUID

    Scenario1
    Old contact is updated in cloud
    Merge should join on mig_contid = contID
    Name is UPDATEd


    Scenario2
    New contact created in cloud
    Merge should insert new row

    Scenario3
    New contact is updated in cloud
    Merge should join on Cloud.cloudid = OmPrem.contGUID
    mig_contid will be NULL
    Name is UPDATEd

    But, What happens if (unlikely?) GUIDs clash on OLD contacts? - contID should win, ONLY one row updated
    onPrem contGUID previously populated by Default Constraint NEWID()
    cloud system, unsure how cloudid GUID generated.
    ***************************************************************************************************/


    ----------------------------------------------------------------------------------------------------
    -- START -- Very stripped down analogue for the on-prem table
    ----------------------------------------------------------------------------------------------------
    CREATE TABLE #contact(
    contID INT NOT NULL PRIMARY KEY
    ,contGuid UNIQUEIDENTIFIER NOT NULL
    ,contName NVARCHAR(50) NULL
    )

    INSERT INTO #contact (
    contID
    , contGuid
    , contName
    )
    VALUES
    (1, 'B6DE2996-69C3-4BB1-B201-003BF7AFB4CC', N'Tester1')
    ,(2, '7214A497-F6D7-43E7-A20D-34C5438CBD1A', N'Tester2')
    ,(3, '7BC9CFD8-C7E2-439D-8D2D-5D37596FF4B3', N'Tester3')
    ----------------------------------------------------------------------------------------------------
    -- END -- Very stripped down analogue for the on-prem table
    ----------------------------------------------------------------------------------------------------

    SELECT * FROM #contact AS c --Compare result 1

    ----------------------------------------------------------------------------------------------------
    -- START -- test bed for portion of sProc
    ----------------------------------------------------------------------------------------------------
    DECLARE
    -- sProc parameters populated by the Logic App
    @mig_contID AS BIGINT = NULL--column in cloud system storing "old" on-prem contID. Will be NULL for contacts created in cloud. ##Change Me##
    ,@cloudid AS UNIQUEIDENTIFIER= NEWID() --'7BC9CFD8-C7E2-439D-8D2D-5D37596FF4B3'##Change Me##
    ,@contname AS NVARCHAR(50)= N'Mr Bloggs'

    -- Insert parameters to table variable (used throughout sproc)
    DECLARE @record AS TABLE(
    mig_contid BIGINT NULL --column in cloud system storing "old" onprem contID. Will be NULL for contacts created in cloud.
    ,cloudid UNIQUEIDENTIFIER NOT NULL--ID of contact in cloud
    ,contName NVARCHAR(50) NOT NULL--dc.contSalut
    )

    INSERT INTO @record (
    mig_contid
    ,cloudid
    ,contName
    )
    VALUES (
    @mig_contid
    ,@cloudid
    ,@contName
    )

    -- Do the work
    MERGE INTO #contact AS t
    USING @record AS s
    ON s.mig_contid = t.contID -- If the GUIDs are truly globally unique, is this join appropriate. Otherwise potential for updating multiple rows
    OR s.cloudid = t.contguid --
    WHEN MATCHED THEN
    UPDATE
    SET t.contName = s.contName
    WHEN NOT MATCHED THEN
    INSERT
    (contID, contguid, contName)
    VALUES
    (4, s.cloudid, s.contName);
    ----------------------------------------------------------------------------------------------------
    -- END -- test bed for portion of sProc
    ----------------------------------------------------------------------------------------------------

    SELECT * FROM #contact AS c -- Compare result 2

    DROP TABLE #contact

     

    "Knowledge is of two kinds. We know a subject ourselves, or we know where we can find information upon it. When we enquire into any subject, the first thing we have to do is to know what books have treated of it. This leads us to look at catalogues, and at the backs of books in libraries."
    — Samuel Johnson
    I wonder, would the great Samuel Johnson have replaced that with "GIYF" now?

  • This was removed by the editor as SPAM

  • It sounds like you only want to use the guid to join to new contacts in the on-prem table. Is there anything in the on-prem table that tells you it is an old or new contact?  If so could you add a filter to the guid join? If there isn't any way to tell the difference between on-prem and cloud-sourced rows, can you record the max contID just before the switch and use it as a variable? If contID is an identity can you reseed it to create a large gap and start at a round number? so that if you have to hard code it, it's obvious what you're doing.

    If this works and a duplicate guid is created, instead of an invalid update you will end up with a duplicate guid in the on-prem table. Does it allow duplicates?

    MERGE INTO #contact AS t
    USING @record AS s
    ON s.mig_contid = t.contID
    OR (s.cloudid = t.contguid and t.contID > @maxonpremcontid)

    MERGE INTO #contact AS t
    USING @record AS s
    ON s.mig_contid = t.contID
    OR (s.cloudid = t.contguid and t.datasource = 'cloud')

     

  • Ed B wrote:

    It sounds like you only want to use the guid to join to new contacts in the on-prem table. Is there anything in the on-prem table that tells you it is an old or new contact?  If so could you add a filter to the guid join? If there isn't any way to tell the difference between on-prem and cloud-sourced rows, can you record the max contID just before the switch and use it as a variable? If contID is an identity can you reseed it to create a large gap and start at a round number? so that if you have to hard code it, it's obvious what you're doing.

    If this works and a duplicate guid is created, instead of an invalid update you will end up with a duplicate guid in the on-prem table. Does it allow duplicates?

    MERGE INTO #contact AS t
    USING @record AS s
    ON s.mig_contid = t.contID
    OR (s.cloudid = t.contguid and t.contID > @maxonpremcontid)

    MERGE INTO #contact AS t
    USING @record AS s
    ON s.mig_contid = t.contID
    OR (s.cloudid = t.contguid and t.datasource = 'cloud')

    Thanks for taking the time to read and respond. Your first suggestion gave me a chink:

    ON s.mig_contid = t.contID  
    OR (s.cloudid = t.contguid AND s.mig_contid IS NULL)

    However, I have not tested and thought too much, because I have just had the excellent news that the Logic App is perfectly capable of populating the cloud mig_contid column with the onprem contid after an insert. So I will have an clear and unambiguous join across old and new data.

    Thanks again.

    "Knowledge is of two kinds. We know a subject ourselves, or we know where we can find information upon it. When we enquire into any subject, the first thing we have to do is to know what books have treated of it. This leads us to look at catalogues, and at the backs of books in libraries."
    — Samuel Johnson
    I wonder, would the great Samuel Johnson have replaced that with "GIYF" now?

  • This was removed by the editor as SPAM

  • This was removed by the editor as SPAM

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply