Removing Reverse Duplicates

Question

Post reply

Removing Reverse Duplicates

smw147

SSCommitted

Points: 1544
More actions
April 5, 2017 at 9:42 am

#329111

Hi all,
I've been racking my brain all afternoon with this one. In their wisdom, the developers of this database many years ago decided that when linking two records a separate record in a Links table would be needed for the 2 way.
i.e.
If ClientA was to be linked to ClientB, then 2 records would be created in the Links table:
-----------------------------
ClientRef | LinkedClientRef
-----------------------------
10000001 | 10000002
-----------------------------
10000002 | 10000001
-----------------------------
10000003 | 10000004
-----------------------------
10000005 | 10000006
-----------------------------
This is the table is its basic form
CREATE TABLE Links (     ClientRef INT,     LinkedClientRef INT     )
Where this exists I'm trying to only select one of the records. In other words, the result I want to return is :
-----------------------------
ClientRef | LinkedClientRef
-----------------------------
10000001 | 10000002
-----------------------------
10000003 | 10000004
-----------------------------
10000005 | 10000006
-----------------------------

The trouble is this table doesn't have a primary key and no other unique column I can use a window function on to maybe assign a row number where the combination ClientRef|LinkedClientRef matches the reverse combination LinkedClientRef|ClientRef
I've been looking at it so long now I think I've lost the plot :crazy:
Is anybody able to help at all?
Thanks in advance
Regards
Steve

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply

J Livingston SQL SSC Guru Points: 51550 More actions · Answer 1

for these records

10000003 | 10000004
-----------------------------
10000005 | 10000006

why arent there corresponding reverse entries?.......or am I misunderstanding your post?

________________________________________________________________
you can lead a user to data....but you cannot make them think
and remember....every day is a school day

DesNorton SSC-Insane Points: 24283 More actions · Answer 2

This will remove the duplicates

-- Create a table to hold the data CREATE TABLE #Links ( ClientRef INT , LinkedClientRef INT );

-- Create the data INSERT INTO #Links ( ClientRef, LinkedClientRef ) VALUES ( 10000001, 10000002 ) , ( 10000002, 10000001 ) , ( 10000003, 10000004 ) , ( 10000005, 10000006 );

-- Check the data before starting SELECT * FROM #Links AS cr;

-- Delete the duplicates WITH cteDuplicates AS ( SELECT ClientRef = cr.LinkedClientRef , LinkedClientRef = cr.ClientRef FROM #Links AS cr ) DELETE dup FROM #Links AS cr INNER JOIN cteDuplicates AS dup ON cr.ClientRef = dup.ClientRef AND cr.LinkedClientRef = dup.LinkedClientRef

-- Check the data after deleting the duplicates SELECT * FROM #Links AS cr;

How to post data/code on a forum to get the best help.

DesNorton SSC-Insane Points: 24283 More actions · Answer 3

To select the records without deleting the duplicates

WITH cteBaseData AS ( SELECT cr.ClientRef , cr.LinkedClientRef , rn = ROW_NUMBER() OVER (ORDER BY cr.ClientRef, cr.LinkedClientRef) FROM #Links AS cr ) , cteDuplicates AS ( SELECT ClientRef = cr.LinkedClientRef , LinkedClientRef = cr.ClientRef , cr.rn FROM cteBaseData AS cr ) SELECT cr.ClientRef , cr.LinkedClientRef FROM cteBaseData AS cr LEFT JOIN cteDuplicates AS dup ON cr.ClientRef = dup.ClientRef AND cr.LinkedClientRef = dup.LinkedClientRef WHERE cr.rn <= ISNULL(dup.rn, cr.rn)

How to post data/code on a forum to get the best help.

smw147 SSCommitted Points: 1544 More actions · Answer 4

Thanks for that, but it's not quite there. I still wish to retain one of the 2 duplicates. This code removes both rows. I need one of the rows to remain. I'll try and adapt your code to see if I can make it work.

Regards

Steve

ScottPletcher SSC Guru Points: 100949 More actions · Answer 5

Presumably the lower number is always first for single entries (if not, fix that! 🙂 ).

SELECT L.* FROM Links L WHERE L.ClientRef < L.LinkedClientRef

SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.

Luis Cazares SSC Guru Points: 183698 More actions · Answer 6

Would something like this help?
SELECT DISTINCT CASE WHEN ClientRef <= LinkedClientRef THEN ClientRef ELSE LinkedClientRef END AS ClientRef, CASE WHEN ClientRef <= LinkedClientRef THEN LinkedClientRef ELSE ClientRef END AS LinkedClientRef FROM #Links AS cr;
Of Course, you should fix the data instead of the query.

Luis C.
General Disclaimer:
Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

How to post data/code on a forum to get the best help: Option 1 / Option 2

DesNorton SSC-Insane Points: 24283 More actions · Answer 7

smw147 - Wednesday, April 5, 2017 10:32 AM
Thanks for that, but it's not quite there. I still wish to retain one of the 2 duplicates. This code removes both rows. I need one of the rows to remain. I'll try and adapt your code to see if I can make it work.

The code that I posted selects the 1st duplicate.
If there are multiple duplicatse, then you can add a GROUP BY
-- Create a table to hold the data CREATE TABLE #Links ( rowNum INT , ClientRef INT , LinkedClientRef INT );

-- Create the data INSERT INTO #Links ( rowNum, ClientRef, LinkedClientRef ) VALUES ( 1, 10000001, 10000002 ) , ( 2, 10000002, 10000001 ) , ( 3, 10000003, 10000004 ) , ( 4, 10000005, 10000006 );

-- Check the data before starting SELECT * FROM #Links AS cr;

-- Select all the unique records -- This will only exclude rowNum=2 WITH cteBaseData AS ( SELECT cr.ClientRef , cr.LinkedClientRef , rn = ROW_NUMBER() OVER (ORDER BY cr.ClientRef, cr.LinkedClientRef) FROM #Links AS cr ) , cteDuplicates AS ( SELECT ClientRef = cr.LinkedClientRef , LinkedClientRef = cr.ClientRef , cr.rn FROM cteBaseData AS cr ) SELECT cr.ClientRef , cr.LinkedClientRef FROM cteBaseData AS cr LEFT JOIN cteDuplicates AS dup ON cr.ClientRef = dup.ClientRef AND cr.LinkedClientRef = dup.LinkedClientRef WHERE cr.rn <= ISNULL(dup.rn, cr.rn) --GROUP BY cr.ClientRef, cr.LinkedClientRef

Results
ClientRef LinkedClientRef ----------- --------------- 10000001 10000002 10000003 10000004 10000005 10000006

How to post data/code on a forum to get the best help.

smw147 SSCommitted Points: 1544 More actions · Answer 8

J Livingston SQL - Wednesday, April 5, 2017 9:57 AM
for these records
10000003 | 10000004
-----------------------------
10000005 | 10000006
why arent there corresponding reverse entries?.......or am I misunderstanding your post?

It's the way the software was designed many years ago before my time I'm afraid 🙁

Luis Cazares - Wednesday, April 5, 2017 10:59 AM
Would something like this help?
SELECT DISTINCT CASE WHEN ClientRef <= LinkedClientRef THEN ClientRef ELSE LinkedClientRef END AS ClientRef, CASE WHEN ClientRef <= LinkedClientRef THEN LinkedClientRef ELSE ClientRef END AS LinkedClientRef FROM #Links AS cr;
Of Course, you should fix the data instead of the query.

Sorry for the delay in responding, but Ive only just logged on after a few days away at a conference.
@luis Cazares - I actually did something similar to this in the end.

Thanks for all the responses 🙂

Regards

Steve

jcelko212 32090 SSCrazy Eights Points: 9294 More actions · Answer 9

DesNorton - Wednesday, April 5, 2017 10:08 AM
This will remove the duplicates
-- Create a table to hold the data CREATE TABLE #Links ( ClientRef   INT , LinkedClientRef INT );
-- Create the data INSERT INTO #Links ( ClientRef, LinkedClientRef ) VALUES ( 10000001, 10000002 )   , ( 10000002, 10000001 )   , ( 10000003, 10000004 )   , ( 10000005, 10000006 );
-- Check the data before starting SELECT * FROM #Links AS cr;
-- Delete the duplicates WITH cteDuplicates AS ( SELECT   ClientRef = cr.LinkedClientRef , LinkedClientRef = cr.ClientRef FROM #Links AS cr ) DELETE dup FROM #Links AS cr INNER JOIN cteDuplicates AS dup ON cr.ClientRef = dup.ClientRef AND cr.LinkedClientRef = dup.LinkedClientRef
-- Check the data after deleting the duplicates SELECT * FROM #Links AS cr;

>> In their wisdom, the developers of this database many years ago decided that when linking [sic] two records [sic] a separate record [sic] in a Links table would be needed for the 2 way. <<

Did you ever watch a television show called "Malcolm in the middle"? In one of the episodes one of the sons and the family has gone off to Alaska, and is surrounded by total absurd maniacs in a logging camp. One of the characters in the camp is a grisly old mountain man, who keeps talking about "grizzly bear eggs" when he's trying to explain life here being attacked by a grizzly bear â€“ she's just protecting her nest!

The concept of linking is not part of RDBMS. It's part of the old network databases which used pointer chains (links) and not the RDBMS concept up references and relationships. You've just described grizzly bear eggs in database terms.

>>If ClientA was to be linked [sic] to ClientB, then 2 records [sic] would be created in the Links [sic] table:<<

CREATE TABLE Recommendations
(first_client_id CHAR(15) NOT NULL ,
second_client_id CHAR(15) NOT NULL ,
CHECK (first_client_id < second_client_id)
PRIMARY KEY (first_client_id, second_client_id)
);

Now create a view with

CREATE VIEW Symetric
AS
SELECT first_client_id, second_client_id
FROM Recommendations
UNION
SELECT second_client_id, first_client_id
FROM Recommendations

Please post DDL and follow ANSI/ISO standards when asking for help.

smw147 SSCommitted Points: 1544 More actions · Answer 10

jcelko212 32090 - Tuesday, April 11, 2017 12:15 PM
DesNorton - Wednesday, April 5, 2017 10:08 AM
This will remove the duplicates
-- Create a table to hold the data CREATE TABLE #Links ( ClientRef   INT , LinkedClientRef INT );
-- Create the data INSERT INTO #Links ( ClientRef, LinkedClientRef ) VALUES ( 10000001, 10000002 )   , ( 10000002, 10000001 )   , ( 10000003, 10000004 )   , ( 10000005, 10000006 );
-- Check the data before starting SELECT * FROM #Links AS cr;
-- Delete the duplicates WITH cteDuplicates AS ( SELECT   ClientRef = cr.LinkedClientRef , LinkedClientRef = cr.ClientRef FROM #Links AS cr ) DELETE dup FROM #Links AS cr INNER JOIN cteDuplicates AS dup ON cr.ClientRef = dup.ClientRef AND cr.LinkedClientRef = dup.LinkedClientRef
-- Check the data after deleting the duplicates SELECT * FROM #Links AS cr;
>> In their wisdom, the developers of this database many years ago decided that when linking [sic] two records [sic] a separate record [sic] in a Links table would be needed for the 2 way. <<
Did you ever watch a television show called "Malcolm in the middle"? In one of the episodes one of the sons and the family has gone off to Alaska, and is surrounded by total absurd maniacs in a logging camp. One of the characters in the camp is a grisly old mountain man, who keeps talking about "grizzly bear eggs" when he's trying to explain life here being attacked by a grizzly bear â€“ she's just protecting her nest!
The concept of linking is not part of RDBMS. It's part of the old network databases which used pointer chains (links) and not the RDBMS concept up references and relationships. You've just described grizzly bear eggs in database terms.
>>If ClientA was to be linked [sic] to ClientB, then 2 records [sic] would be created in the Links [sic] table:<<
CREATE TABLE Recommendations
(first_client_id CHAR(15) NOT NULL ,
second_client_id CHAR(15) NOT NULL ,
CHECK (first_client_id < second_client_id)
PRIMARY KEY (first_client_id, second_client_id)
);
Now create a view with
CREATE VIEW Symetric
AS
SELECT first_client_id, second_client_id
FROM Recommendations
UNION
SELECT second_client_id, first_client_id
FROM Recommendations

Sorry, I will post the table DDL in future.

Regards

Steve

Jason A. Long SSC-Insane Points: 23780 More actions · Answer 11

Here's another set of options...
IF OBJECT_ID('tempdb..#Links','U') IS NOT NULL DROP TABLE #Links;

CREATE TABLE #Links ( ClientRef INT, LinkedClientRef INT ); -- Create the data INSERT INTO #Links (ClientRef,LinkedClientRef) VALUES (10000001,10000002), (10000002,10000001), (10000003,10000004), (10000005,10000006);

--============================================== -- select valid links... SELECT l1.ClientRef, l1.LinkedClientRef FROM #Links l1 WHERE NOT EXISTS ( SELECT 1 FROM #Links l2 WHERE l1.ClientRef = l2.LinkedClientRef AND l1.LinkedClientRef = l2.ClientRef AND l2.ClientRef < l2.LinkedClientRef );

--============================================== -- delete invalid links... DELETE l1 FROM #Links l1 WHERE EXISTS ( SELECT 1 FROM #Links l2 WHERE l1.ClientRef = l2.LinkedClientRef AND l1.LinkedClientRef = l2.ClientRef AND l2.ClientRef < l2.LinkedClientRef );

ScottPletcher SSC Guru Points: 100949 More actions · Answer 12

To add a little more detail to my first recommendation, I think you should be able to do this:

1) Update any rows where link1 > link2 to swap the links. Then you know that link1 always < link2.
2) Add a check constraint to make sure that link1 is always less than link2.
3) Remove any duplicates in the existing data.

SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.