Remove duplicate record based on criteria

  • Hi All,

    EDIT: I was convinced I was connecting to a 2016 instance but just now found out it's in-fact a 2008 R2 instance.

    I have the following data set below - I have identified duplicates which is purely based on the column 'DuplicatingValue'.

    Information:

    There will only ever be 2 duplicated records per  'DuplicatingValue'.
    There is no unique identifier in the data set.

    Requirements\rules:

    Referring to the sample data below:
    Within a duplicate set - if the values are 'Red' for one record and 'Green' for the other record, then delete the 'Red' record and keep the 'Green' record ('DuplicatingValues' 1 and 4 in the sample below).
    For any other combination in the duplicate set, do nothing - therefore in this example for  'DuplicatingValues' 2 and 3, do nothing.


    CREATE TABLE [dbo].[CentralSample](
        [Value1] [varchar](50) NULL,
        [DuplicatingValue] [varchar](50) NULL
    ) ON [PRIMARY]
    GO
    INSERT [dbo].[CentralSample] ([Value1], [DuplicatingValue]) VALUES (N'Green', N'1')
    GO
    INSERT [dbo].[CentralSample] ([Value1], [DuplicatingValue]) VALUES (N'Red', N'1')
    GO
    INSERT [dbo].[CentralSample] ([Value1], [DuplicatingValue]) VALUES (N'Red', N'2')
    GO
    INSERT [dbo].[CentralSample] ([Value1], [DuplicatingValue]) VALUES (N'Red', N'2')
    GO
    INSERT [dbo].[CentralSample] ([Value1], [DuplicatingValue]) VALUES (N'Green', N'3')
    GO
    INSERT [dbo].[CentralSample] ([Value1], [DuplicatingValue]) VALUES (N'Green', N'3')
    GO
    INSERT [dbo].[CentralSample] ([Value1], [DuplicatingValue]) VALUES (N'Green', N'4')
    GO
    INSERT [dbo].[CentralSample] ([Value1], [DuplicatingValue]) VALUES (N'Red', N'4')
    GO

    The result expected from the above sample data once the duplicates are removed, would be:


    select 'Green' as Value1,1 as DuplicatingValue
    union all
    select 'Red',2
    union all
    select 'Red',2
    union all
    select 'Green',3
    union all
    select 'Green',3
    union all
    select 'Green',4

    Value1            DuplicatingValue
    Green1
    Red2
    Red2
    Green3
    Green3
    Green4

    I managed to identify the duplicates, but I have not figured out how to remove the unwanted records.
    Thanks in advance for any help in pointing me in the right direction.

    Regards,
    Michael

  • I think this is close... back up your table or wrap this in a transaction and roll it back when testing.. The INTERSECT returns the DuplicatingValue where there is at least one "Green" record and at least one "Red" record..

    DELETE FROM CentralSample
    WHERE DuplicatingValue IN (
                                    SELECT DuplicatingValue
                                    FROM CentralSample
                                    WHERE Value1 = 'Green'
                                    INTERSECT
                                    SELECT DuplicatingValue
                                    FROM CentralSample
                                    WHERE Value1 = 'Red');
    ;

  • pietlinden, thank you very much for the reply, this will lead me in the right direction, appreciate your time.

    Regards,
    Michael

  • Here's a different option that might be better, but you should still test for performance.

    WITH CTE AS(
      SELECT *,
       ROW_NUMBER() OVER(PARTITION BY cs.DuplicatingValue ORDER BY cs.Value1) rn,
       RANK() OVER(PARTITION BY cs.DuplicatingValue ORDER BY cs.Value1) rnk
      FROM dbo.CentralSample AS cs
    )
    DELETE
    FROM CTE
    WHERE CTE.rn = 2
    AND CTE.rnk = 2
    AND CTE.Value1 = N'Red'

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • And yet another option:

    WITH CTE AS(
      SELECT *,
       LAG(cs.Value1) OVER(PARTITION BY cs.DuplicatingValue ORDER BY cs.Value1 ) prev_Value1
      FROM dbo.CentralSample AS cs
    )
    DELETE
    FROM CTE
    WHERE CTE.prev_Value1 = N'Green'
    AND CTE.Value1 = N'Red'

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Luis,
    Thanks so much for the 2 options, I can't use the 2nd option because I was convinced I was connecting to a 2016 instance but just now found out it's in-fact a 2008 R2 instance.
    I really appreciate your time.

    Thanks
    Michael

  • Yes, or this, which is very similar.  It relies on Red and Green being the only two possible values of Value1.  I think Piet's solution deleted too many rows.

    WITH Duplicates AS (
        SELECT
             DuplicatingValue
        ,    Value1
        ,    MAX(Value1) OVER (PARTITION BY DuplicatingValue) AS MaxValue
        ,    MIN(Value1) OVER (PARTITION BY DuplicatingValue) AS MinValue
        FROM CentralSample
        )
    DELETE FROM Duplicates
    WHERE MaxValue > MinValue
    AND Value1 = 'Red';

  • John Mitchell-245523 - Wednesday, January 31, 2018 8:34 AM

    Yes, or this, which is very similar.  It relies on Red and Green being the only two possible values of Value1.  I think Piet's solution deleted too many rows.

    WITH Duplicates AS (
        SELECT
             DuplicatingValue
        ,    Value1
        ,    MAX(Value1) OVER (PARTITION BY DuplicatingValue) AS MaxValue
        ,    MIN(Value1) OVER (PARTITION BY DuplicatingValue) AS MinValue
        FROM CentralSample
        )
    DELETE FROM Duplicates
    WHERE MaxValue > MinValue
    AND Value1 = 'Red';

    Thanks John - yes, you are correct, Piet's was deleting all the rows that have a combination of red and green - but it helped me in identifying the records.
    Your code gives the correct results - thank you very much for your input too.

    Regards,
    Michael

  • micang - Tuesday, January 30, 2018 1:58 PM

    I have the following data set below - I have identified duplicates which is purely based on the column 'DuplicatingValue'.

    >> There will only ever be 2 duplicated records [sic] per 'DuplicatingValue'. <<

    Rows are not record [sic] s. There is a huge and significant difference between the two concepts.

    >> There is no unique identifier in the data set. <<

    without a key, you cannot have a table. This is by definition, so nothing you are doing applies to an SQL problem. You have a deck of punch cards 🙁

    I have an article coming out on Codd’s deGreen of duplication, which was his way of solving this sort of problem you might want to read it.

    >> Within a duplicate set - if the are 'Red' for one record [sic] and 'Green' for the other record [sic], then delete the 'Red' record [sic] and keep the 'Green' record [sic] ('DuplicatingValues' 1 and 4 in the sample below). For any other combination in the duplicate set, do nothing - therefore in this example for 'DuplicatingValues' 2 and 3, do nothing. <<

    Dr. Codd had what is called the information principle in his 12 rules of RDBMS. If I understand what you’re saying is that what you are calling “duplicating_values†is the identifier for one of these sets.

    Another problem with your posting is that you have no idea what the ISO 11179 naming rules are, and that a table must have a key. Then why is everything a variable string? You also don’t know syntax for the insert into statement. Let’s do some corrections on the DDL you did post (by the way, at least you tried to post DDL; so many people here don’t even bother with that courtesy)

    CREATE TABLE Central_Samples
    (something_color VARCHAR(5) NOT NULL
    CHECK (something_color IN (‘Red’, ‘Green’),
    set_nbr INTEGER NOT NULL
    CHECK (set_nbr > 0),
    PRIMARY KEY (something_color, set_nbr)
    );

    I want you to think about this. If you actually had a primary key, there would be no need to remove the duplicates. You could have prevented them. Would you buy a house from a carpenter who tells you that he didn’t finish the roof but don’t worry about it. You can fix it later!

    INSERT INTO Central_Samples
    ('Green', 1), ('Red', 1),
    ('Red', 2), ('Red', 2), --- dups
    ('Green', 3), ('Green', 3), --- dups
    ('Green', 4), ('Red', 4);

    >> The result expected from the above sample data once the duplicates are removed, would be:

    'Green',1
    'Red',2
    'Red',2
    'Green',3
    'Green',3
    'Green',4
    <<

    no, that’s not what the results would be. You still have duplicated rows! Try this:

    Green    1
    Red    1
    Red    2
    Green    3
    Green    4

    This will find the sets that have both red and green. You can use it in the delete statement, if you wish.

    SELECT set_nbr
    FROM Central_Samples
    GROUP BY set_nbr
    HAVING MIN(something_color) = ‘Green’
     AND MAX(something_color)= ‘Red’;

    Please post DDL and follow ANSI/ISO standards when asking for help. 

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply