Removing Reverse Duplicates

  • Hi all,
    I've been racking my brain all afternoon with this one.  In their wisdom, the developers of this database many years ago decided that when linking two records a separate record in a Links table would be needed for the 2 way. 

    i.e.
    If ClientA was to be linked to ClientB, then 2 records would be created in the Links table:

    -----------------------------
    ClientRef | LinkedClientRef
    -----------------------------
    10000001 | 10000002  
    -----------------------------
    10000002 | 10000001  
    -----------------------------
    10000003 | 10000004  
    -----------------------------
    10000005 | 10000006  
    -----------------------------

    This is the table is its basic form
    CREATE TABLE Links (
        ClientRef INT,
        LinkedClientRef INT
        )

    Where this exists I'm trying to only select one of the records.  In other words, the result I want to return is :

    -----------------------------
    ClientRef | LinkedClientRef
    -----------------------------
    10000001 | 10000002  
    -----------------------------
    10000003 | 10000004  
    -----------------------------
    10000005 | 10000006  
    -----------------------------

    The trouble is this table doesn't have a primary key and no other unique column I can use a window function on to maybe assign a row number where the combination ClientRef|LinkedClientRef matches the reverse combination LinkedClientRef|ClientRef

    I've been looking at it so long now I think I've lost the plot :crazy:
    Is anybody able to help at all?

    Thanks in advance

    Regards

    Steve

  • for these records

    10000003 | 10000004 
    -----------------------------
    10000005 | 10000006 

    why arent there corresponding reverse entries?.......or am I misunderstanding your post?

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • This will remove the duplicates


    -- Create a table to hold the data
    CREATE TABLE #Links (
      ClientRef   INT
    , LinkedClientRef INT
    );

    -- Create the data
    INSERT INTO #Links ( ClientRef, LinkedClientRef )
    VALUES ( 10000001, 10000002 )
      , ( 10000002, 10000001 )
      , ( 10000003, 10000004 )
      , ( 10000005, 10000006 );

    -- Check the data before starting
    SELECT *
    FROM #Links AS cr;

    -- Delete the duplicates
    WITH cteDuplicates AS (
    SELECT
      ClientRef = cr.LinkedClientRef
      , LinkedClientRef = cr.ClientRef
    FROM #Links AS cr
    )
    DELETE dup
    FROM #Links AS cr
    INNER JOIN cteDuplicates AS dup
     ON cr.ClientRef = dup.ClientRef
    AND cr.LinkedClientRef = dup.LinkedClientRef

    -- Check the data after deleting the duplicates
    SELECT *
    FROM #Links AS cr;

  • To select the records without deleting the duplicates


    WITH cteBaseData AS (
    SELECT
      cr.ClientRef
      , cr.LinkedClientRef
      , rn = ROW_NUMBER() OVER (ORDER BY cr.ClientRef, cr.LinkedClientRef)
    FROM #Links AS cr
    )
    , cteDuplicates AS (
    SELECT
      ClientRef = cr.LinkedClientRef
      , LinkedClientRef = cr.ClientRef
      , cr.rn
    FROM cteBaseData AS cr
    )
    SELECT
      cr.ClientRef
    , cr.LinkedClientRef
    FROM cteBaseData AS cr
    LEFT JOIN cteDuplicates AS dup
     ON cr.ClientRef = dup.ClientRef
    AND cr.LinkedClientRef = dup.LinkedClientRef
    WHERE cr.rn <= ISNULL(dup.rn, cr.rn)

  • Thanks for that, but it's not quite there.  I still wish to retain one of the 2 duplicates.  This code removes both rows. I need one of the rows to remain. I'll try and adapt your code to see if I can make it work.

    Regards

    Steve

  • Presumably the lower number is always first for single entries (if not, fix that! 🙂 ).


    SELECT L.*
    FROM Links L
    WHERE L.ClientRef < L.LinkedClientRef

     

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

  • Would something like this help?

    SELECT DISTINCT
      CASE WHEN ClientRef <= LinkedClientRef THEN ClientRef ELSE LinkedClientRef END AS ClientRef,
      CASE WHEN ClientRef <= LinkedClientRef THEN LinkedClientRef ELSE ClientRef END AS LinkedClientRef
    FROM #Links AS cr;

    Of Course, you should fix the data instead of the query.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • smw147 - Wednesday, April 5, 2017 10:32 AM

    Thanks for that, but it's not quite there.  I still wish to retain one of the 2 duplicates.  This code removes both rows. I need one of the rows to remain. I'll try and adapt your code to see if I can make it work.

    The code that I posted selects the 1st duplicate.
    If there are multiple duplicatse, then you can add a GROUP BY 

    -- Create a table to hold the data
    CREATE TABLE #Links (
      rowNum    INT
    , ClientRef   INT
    , LinkedClientRef INT
    );

    -- Create the data
    INSERT INTO #Links ( rowNum, ClientRef, LinkedClientRef )
    VALUES ( 1, 10000001, 10000002 )
    , ( 2, 10000002, 10000001 )
    , ( 3, 10000003, 10000004 )
    , ( 4, 10000005, 10000006 );

    -- Check the data before starting
    SELECT *
    FROM #Links AS cr;

    -- Select all the unique records
    -- This will only exclude rowNum=2
    WITH cteBaseData AS (
    SELECT
      cr.ClientRef
      , cr.LinkedClientRef
      , rn = ROW_NUMBER() OVER (ORDER BY cr.ClientRef, cr.LinkedClientRef)
    FROM #Links AS cr
    )
    , cteDuplicates AS (
    SELECT
    ClientRef = cr.LinkedClientRef
    , LinkedClientRef = cr.ClientRef
    , cr.rn
    FROM cteBaseData AS cr
    )
    SELECT
      cr.ClientRef
    , cr.LinkedClientRef
    FROM cteBaseData AS cr
    LEFT JOIN cteDuplicates AS dup
    ON cr.ClientRef = dup.ClientRef
    AND cr.LinkedClientRef = dup.LinkedClientRef
    WHERE cr.rn <= ISNULL(dup.rn, cr.rn)
    --GROUP BY cr.ClientRef, cr.LinkedClientRef

    Results

    ClientRef   LinkedClientRef
    ----------- ---------------
    10000001    10000002
    10000003    10000004
    10000005    10000006

  • J Livingston SQL - Wednesday, April 5, 2017 9:57 AM

    for these records

    10000003 | 10000004 
    -----------------------------
    10000005 | 10000006 

    why arent there corresponding reverse entries?.......or am I misunderstanding your post?

    It's the way the software was designed many years ago before my time I'm afraid 🙁

    Luis Cazares - Wednesday, April 5, 2017 10:59 AM

    Would something like this help?

    SELECT DISTINCT
      CASE WHEN ClientRef <= LinkedClientRef THEN ClientRef ELSE LinkedClientRef END AS ClientRef,
      CASE WHEN ClientRef <= LinkedClientRef THEN LinkedClientRef ELSE ClientRef END AS LinkedClientRef
    FROM #Links AS cr;

    Of Course, you should fix the data instead of the query.

    Sorry for the delay in responding, but Ive only just logged on after a few days away at a conference.
    @luis Cazares - I actually did something similar to this in the end.

    Thanks for all the responses 🙂

    Regards

    Steve

  • DesNorton - Wednesday, April 5, 2017 10:08 AM

    This will remove the duplicates


    -- Create a table to hold the data
    CREATE TABLE #Links (
      ClientRef   INT
    , LinkedClientRef INT
    );

    -- Create the data
    INSERT INTO #Links ( ClientRef, LinkedClientRef )
    VALUES ( 10000001, 10000002 )
      , ( 10000002, 10000001 )
      , ( 10000003, 10000004 )
      , ( 10000005, 10000006 );

    -- Check the data before starting
    SELECT *
    FROM #Links AS cr;

    -- Delete the duplicates
    WITH cteDuplicates AS (
    SELECT
      ClientRef = cr.LinkedClientRef
      , LinkedClientRef = cr.ClientRef
    FROM #Links AS cr
    )
    DELETE dup
    FROM #Links AS cr
    INNER JOIN cteDuplicates AS dup
     ON cr.ClientRef = dup.ClientRef
    AND cr.LinkedClientRef = dup.LinkedClientRef

    -- Check the data after deleting the duplicates
    SELECT *
    FROM #Links AS cr;

    >> In their wisdom, the developers of this database many years ago decided that when linking [sic] two records [sic] a separate record [sic] in a Links table would be needed for the 2 way. <<

    Did you ever watch a television show called "Malcolm in the middle"? In one of the episodes one of the sons and the family has gone off to Alaska, and is surrounded by total absurd maniacs in a logging camp. One of the characters in the camp is a grisly old mountain man, who keeps talking about "grizzly bear eggs" when he's trying to explain life here being attacked by a grizzly bear – she's just protecting her nest!

    The concept of linking is not part of RDBMS. It's part of the old network databases which used pointer chains (links) and not the RDBMS concept up references and relationships. You've just described grizzly bear eggs in database terms.

    >>If ClientA was to be linked [sic] to ClientB, then 2 records [sic] would be created in the Links [sic] table:<<

    CREATE TABLE Recommendations
    (first_client_id CHAR(15) NOT NULL ,
    second_client_id CHAR(15) NOT NULL ,
    CHECK (first_client_id < second_client_id)
    PRIMARY KEY (first_client_id, second_client_id)
    );

    Now create a view with

    CREATE VIEW Symetric
    AS
    SELECT first_client_id, second_client_id
    FROM Recommendations
    UNION
    SELECT second_client_id, first_client_id
    FROM Recommendations

    Please post DDL and follow ANSI/ISO standards when asking for help. 

  • jcelko212 32090 - Tuesday, April 11, 2017 12:15 PM

    DesNorton - Wednesday, April 5, 2017 10:08 AM

    This will remove the duplicates


    -- Create a table to hold the data
    CREATE TABLE #Links (
      ClientRef   INT
    , LinkedClientRef INT
    );

    -- Create the data
    INSERT INTO #Links ( ClientRef, LinkedClientRef )
    VALUES ( 10000001, 10000002 )
      , ( 10000002, 10000001 )
      , ( 10000003, 10000004 )
      , ( 10000005, 10000006 );

    -- Check the data before starting
    SELECT *
    FROM #Links AS cr;

    -- Delete the duplicates
    WITH cteDuplicates AS (
    SELECT
      ClientRef = cr.LinkedClientRef
      , LinkedClientRef = cr.ClientRef
    FROM #Links AS cr
    )
    DELETE dup
    FROM #Links AS cr
    INNER JOIN cteDuplicates AS dup
     ON cr.ClientRef = dup.ClientRef
    AND cr.LinkedClientRef = dup.LinkedClientRef

    -- Check the data after deleting the duplicates
    SELECT *
    FROM #Links AS cr;

    >> In their wisdom, the developers of this database many years ago decided that when linking [sic] two records [sic] a separate record [sic] in a Links table would be needed for the 2 way. <<

    Did you ever watch a television show called "Malcolm in the middle"? In one of the episodes one of the sons and the family has gone off to Alaska, and is surrounded by total absurd maniacs in a logging camp. One of the characters in the camp is a grisly old mountain man, who keeps talking about "grizzly bear eggs" when he's trying to explain life here being attacked by a grizzly bear – she's just protecting her nest!

    The concept of linking is not part of RDBMS. It's part of the old network databases which used pointer chains (links) and not the RDBMS concept up references and relationships. You've just described grizzly bear eggs in database terms.

    >>If ClientA was to be linked [sic] to ClientB, then 2 records [sic] would be created in the Links [sic] table:<<

    CREATE TABLE Recommendations
    (first_client_id CHAR(15) NOT NULL ,
    second_client_id CHAR(15) NOT NULL ,
    CHECK (first_client_id < second_client_id)
    PRIMARY KEY (first_client_id, second_client_id)
    );

    Now create a view with

    CREATE VIEW Symetric
    AS
    SELECT first_client_id, second_client_id
    FROM Recommendations
    UNION
    SELECT second_client_id, first_client_id
    FROM Recommendations

    Sorry, I will post the table DDL in future.

    Regards

    Steve

  • Here's another set of options...

    IF OBJECT_ID('tempdb..#Links','U') IS NOT NULL
        DROP TABLE #Links;

    CREATE TABLE #Links (
            ClientRef INT,
            LinkedClientRef INT
        );
    -- Create the data
    INSERT    INTO #Links (ClientRef,LinkedClientRef)
    VALUES
        (10000001,10000002),
        (10000002,10000001),
        (10000003,10000004),
        (10000005,10000006);

    --==============================================
    -- select valid links...
    SELECT
        l1.ClientRef,
        l1.LinkedClientRef
    FROM
        #Links l1
    WHERE
        NOT EXISTS (
                    SELECT 1
                    FROM #Links l2
                    WHERE
                        l1.ClientRef = l2.LinkedClientRef
                        AND l1.LinkedClientRef = l2.ClientRef
                        AND l2.ClientRef < l2.LinkedClientRef
                    );

    --==============================================
    -- delete invalid links...
    DELETE l1
    FROM
        #Links l1
    WHERE
        EXISTS (
                SELECT 1
                FROM #Links l2
                WHERE
                    l1.ClientRef = l2.LinkedClientRef
                    AND l1.LinkedClientRef = l2.ClientRef
                    AND l2.ClientRef < l2.LinkedClientRef
                );

  • To add a little more detail to my first recommendation, I think you should be able to do this:

    1) Update any rows where link1 > link2 to swap the links.  Then you know that link1 always < link2.
    2) Add a check constraint to make sure that link1 is always less than link2.
    3) Remove any duplicates in the existing data.

    SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply