Find and Remove Duplicate Records SQL Server

  • Luis Cazares (2/3/2016)


    Brandie Tarvin (2/3/2016)


    Luis Cazares (2/3/2016)


    Brandie Tarvin (2/3/2016)


    SQLian (2/3/2016)


    Not sure if I am missing something here, but why can't the following be used?

    delete [FROM] Customers

    where CustID NOT in (

    select min(CustID)

    from Customers

    group by CustName

    )

    --Brandie added FROM clause

    Because that will delete all the non duplicate customers as well.

    No it won't. But it won't work with rows that are completely duplicates or with tables with composite keys.

    Maybe I'm misreading, but SELECT MIN(CustID) would select the minimum customer ID (say 1) and then delete everything else, wouldn't it?

    If not, what am I missing?

    You're missing the GROUP BY CustName. It will leave one CustID for each CustName. For unique names it won't do anything, and for duplicate names it will delete the higher CustIDs.

    Ahh, yes. I was skimming past that. It's been that kind of a morning.

    Brandie Tarvin, MCITP Database AdministratorLiveJournal Blog: http://brandietarvin.livejournal.com/[/url]On LinkedIn!, Google+, and Twitter.Freelance Writer: ShadowrunLatchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

  • Brandie Tarvin (2/3/2016)


    Luis Cazares (2/3/2016)


    Brandie Tarvin (2/3/2016)


    Luis Cazares (2/3/2016)


    Brandie Tarvin (2/3/2016)


    SQLian (2/3/2016)


    Not sure if I am missing something here, but why can't the following be used?

    delete [FROM] Customers

    where CustID NOT in (

    select min(CustID)

    from Customers

    group by CustName

    )

    --Brandie added FROM clause

    Because that will delete all the non duplicate customers as well.

    No it won't. But it won't work with rows that are completely duplicates or with tables with composite keys.

    Maybe I'm misreading, but SELECT MIN(CustID) would select the minimum customer ID (say 1) and then delete everything else, wouldn't it?

    If not, what am I missing?

    You're missing the GROUP BY CustName. It will leave one CustID for each CustName. For unique names it won't do anything, and for duplicate names it will delete the higher CustIDs.

    Ahh, yes. I was skimming past that. It's been that kind of a morning.

    Been there 😀

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Am I being dim?

    delete [FROM] Customers

    where CustID NOT in (

    select min(CustID)

    from Customers

    group by CustName

    )

    wouldn't this just not do anything at all?

    if you have

    100,

    200,

    300,

    300,

    300,

    400,

    500

    doing the aggregate would just say delete everything where the customerid is not in 100,200,300,400,500. but the duplicate customer id is 300, so it wouldn't delete?

    Happy to be wrong (It happens often).

  • Jeff Moden (2/2/2016)


    andy_111 (1/31/2016)


    Not really useful article.

    Not really a useful comment, either. Please explain why you think it's not useful.

    I have to agree with andy_111's sentiment, he and others noted the row number / windowing function alternative, and additionally I think that with the warnings about SET ROWCOUNT's changing semantics, the original posted code is probably a trap for the unwary.

    Also deleting dupes is sort of a FAQ and I don't think the author really covered the material, especially given what I've seen with a basic web search on the topic. Clearly the windowing functionality should have been mentioned, heck even Microsoft offers the selecting distinct into a temp table, deleting and reinserting as an option in one of their older pages. Given the shakey semantics lifetime of SET ROWCOUNT on updates, I think a decent effort should have discussed this, so I have to in general agree with andy_111's sentiment.

    It looks like an old fashion article, doesn't it? For SQL 2000 or something.

    Even with that, I like the insert distinct copies of the dupes into a temp table, remove the dupes, then reinsert back into the source if we're talking about non windowing function methods. Still yeah, a bit on the old fashioned side 🙂

  • Luis Cazares (2/1/2016)


    venkataprasanth (2/1/2016)


    The author's posted solution works and it DOESN'T DELETE ALL Duplicates. (If there are 3 dup rows it deletes 2 and leaves 1). So it works. Also I guess people posting the row_number() solution works only when we have a unique identifier in the table. But having a unique identifier obviously doesn't make two rows appear duplicate (no matter even if all other columns are same). The post is related to deleting duplicates when there is no unique identifier (i.e all columns values are same for more than one row)

    What do you mean with having a unique identifier? The ROW_NUMBER option allows you to define what constitutes a duplicate (either one column, some columns or all the columns). The method shown in the article is bad in terms that it needs the table to have a single column that defines duplication and the worst part is that it deletes one row at a time using a deprecated option.

    I'm sorry that Ginger Keys got such a harsh reaction, but the alternatives are certainly much better.

    I did an article on the cte method years ago. Maybe the demo in the article will help clear things up a bit. There seems to be a lot of back and forth on using that method.

    Remove Duplicates Article

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • Luis Cazares (2/2/2016)


    marquito_61 (2/2/2016)


    But this method eliminates all duplicate records

    that is wrong

    As far as I know, none of the code posted in the article of the forum eliminates all rows with duplicates, they all remove duplicates and leave the desired rows.

    Maybe a bit of translation, but to me the desired row is not a duplicate and any additional row matching the desired row is the duplicate. Therefore all duplicates are removed. Does this mean that all rows with exactly matching criteria are removed? No. To me, it just means that the extra (or duplicates) are removed.

    I digress a bit there. I do agree with Luis - only the undesired rows are removed.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • Luis Cazares (2/2/2016)


    o.kaesmann (2/2/2016)


    My faforite method to clean-up duplicated rows is this (I used the example code from above):

    That's certainly interesting. I've never heard of %%physloc%% before and for a moment I thought it wasn't even t-sql code. I'm not sure I'd like to use it in production code, but it's something worth some research.

    While it works, it doesn't mean it won't change and cause potentially unwanted effects. It is undocumented and unsupported. Does it look cool? Certainly. I would prefer to not call the virtual column in a production environment - mostly to avoid the extra function call and avoid the chance of it changing and causing problems in a future edition (though that may be unlikely with this one).

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • o.kaesmann (2/2/2016)


    I do use it in production environmentbecause of getting date from other Systems that conteins duplicatetd data (don't ask ....) and it works fine.

    In the past i've worked on Oracle and learned to eliminate duplicates by using ROWID. I've searched something equivalent on MS-SQL and found this physloc-thing.

    Good info. Thanks

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • rob.carter 79958 (2/3/2016)


    Am I being dim?

    delete [FROM] Customers

    where CustID NOT in (

    select min(CustID)

    from Customers

    group by CustName

    )

    wouldn't this just not do anything at all?

    if you have

    100,

    200,

    300,

    300,

    300,

    400,

    500

    doing the aggregate would just say delete everything where the customerid is not in 100,200,300,400,500. but the duplicate customer id is 300, so it wouldn't delete?

    Happy to be wrong (It happens often).

    If you have duplicate CustID, you are correct. The query itself is grouping on CustName. This will delete the duplicates unless the design is that the customer id can be duplicated in the customers table. I really hope your Customers table does not allow for duplicate customer ids. If it does, then hopefully there is a composite key that will help identify different rows. If not, then the assumption that this will do nothing is correct.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • Luis Cazares (2/3/2016)


    Brandie Tarvin (2/3/2016)


    Luis Cazares (2/3/2016)


    Brandie Tarvin (2/3/2016)


    Luis Cazares (2/3/2016)


    Brandie Tarvin (2/3/2016)


    SQLian (2/3/2016)


    Not sure if I am missing something here, but why can't the following be used?

    delete [FROM] Customers

    where CustID NOT in (

    select min(CustID)

    from Customers

    group by CustName

    )

    --Brandie added FROM clause

    Because that will delete all the non duplicate customers as well.

    No it won't. But it won't work with rows that are completely duplicates or with tables with composite keys.

    Maybe I'm misreading, but SELECT MIN(CustID) would select the minimum customer ID (say 1) and then delete everything else, wouldn't it?

    If not, what am I missing?

    You're missing the GROUP BY CustName. It will leave one CustID for each CustName. For unique names it won't do anything, and for duplicate names it will delete the higher CustIDs.

    Ahh, yes. I was skimming past that. It's been that kind of a morning.

    Been there 😀

    Still there. :crazy:

  • I'd be very careful with the "duplicate" deletion. When our old and new systems merged "generated" such records. But the identification process was very difficult. The name is not a good choice, the address... better but sometimes tricky, the email address is ok if it is not an organization, composite (name+birth date, etc.) are can be dangerous too, and so on. So this is difficult. Usually no other choice than deciding and deleting one by one. Better if we don't allow to appear.

    (And yes the originally posted program is wrong as I see)

  • andy_111 (2/1/2016)


    Daniel Matthee (1/31/2016)


    I have not tested but just by looking at the code will it not delete both the records???

    There is nothing to indicate that you just want to get rid of all dups and only have a unique records left, or am I missing something here?

    Regards

    For example, you have table Customers (contactname VARCHAR(100) NOT NULL, Age INT)

    You fill table

    Insert into Customers (contactname, Age) VALUES ('Andrey',35),('Alexey', '40'),('Andrey',50);

    Query

    SELECT ROW_NUMBER() Over(PARTITION BY C.contactname ORDER BY C.contactname)

    As RowNumber,C.* FROM Customers C

    returns

    1 Alexey 40

    1 Andrey 35

    2 Andrey 50

    RowNumber partitioned inside contactname and it unique inside every contactname. If you delete RowNumber > 1 you delete duplicates.

    Yes, code in this post will delete all the records. Thats why using old method for deleting duplicates (using group by) is 3 steps. First copy all found duplicate records (one of them) to temp table. Then delete like in post. Then insert to original table from the temp table.

    Using CTE, ronumber() and delete where rowNum >1 from CTE is better by all means 🙂

  • Can you please provide an example, on how to use this script when the duplicates situation is based on more then one column, lets say a receiving database, where: if you receive same Quantity of a certain sku, in same day, it constitutes a duplicates record. How do you remove that by this script? Big thanks

  • post the CREATE TABLE and INSERT scripts so we have some data to work with. what version of SQL Server are you using?

  • thanks, i will do it shortly

Viewing 15 posts - 61 through 75 (of 76 total)

You must be logged in to reply to this topic. Login to reply