Find and Remove Duplicate Records SQL Server

  • This will delete all of the records that have more than 1 occurrence including the original.

  • John Donaldson (2/2/2016)


    This will delete all of the records that have more than 1 occurrence including the original.

    I guess you didn't try the code. Here, let me give you the opportunity to run it by giving you sample data and the code all in one.

    CREATE TABLE Duplicates(
        CustID INT,
        Col1 INT,
        Col2 VARCHAR(100));

    INSERT INTO Duplicates
    VALUES (1, 100, 'ABC'),
           (2, 200, 'ABC'),
           (3, 300, 'BBB'),
           (3, 300, 'BBB'),
           (3, 300, 'BBB'),
           (4, 200, 'CCC'),
           (5, 400, 'EEE'),
           (5, 400, 'EEE'),
           (5, 400, 'EEE'),
           (6, 500, 'DDA');

    SELECT * FROM Duplicates ORDER BY Col1, Col2;

    SET NOCOUNT ON;
    SET ROWCOUNT 1;

    WHILE 1 = 1
    BEGIN
        DELETE FROM Duplicates
        WHERE Col1 IN (SELECT Col1
                       FROM Duplicates
                       GROUP BY Col1
                       HAVING COUNT(*) > 1);

        IF @@ROWCOUNT = 0
            BREAK;
    END;

    SET ROWCOUNT 0;

    SELECT * FROM Duplicates ORDER BY Col1, Col2;
    GO

    --Clean my DB
    DROP TABLE Duplicates;

    This technique is bad because it relies on a deprecated feature (using SET ROWCOUNT to limit DML statements is deprecated) and deletes one row at a time, but it actually does what it promises to do.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
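For anyone following along outside SQL Server, the one-row-at-a-time behavior of the loop above can be sketched in Python with the standard sqlite3 module. This is an illustration only, not the posted T-SQL: it groups on all three columns (unlike the `GROUP BY Col1` above) so that only fully identical rows count as duplicates, and it emulates `SET ROWCOUNT 1` by deleting a single rowid per pass.

```python
import sqlite3

# Sketch only (SQLite, not T-SQL): emulate the SET ROWCOUNT 1 loop by
# deleting one row per pass from some group that still has more than one
# occurrence, until no duplicated group remains.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Duplicates(CustID INT, Col1 INT, Col2 TEXT)")
conn.executemany(
    "INSERT INTO Duplicates VALUES (?, ?, ?)",
    [(1, 100, 'ABC'), (2, 200, 'ABC'), (3, 300, 'BBB'), (3, 300, 'BBB'),
     (3, 300, 'BBB'), (4, 200, 'CCC'), (5, 400, 'EEE'), (5, 400, 'EEE'),
     (5, 400, 'EEE'), (6, 500, 'DDA')])

while True:
    # The scalar subquery yields one rowid from a still-duplicated group, or
    # NULL when every group is down to a single row (then nothing matches).
    cur = conn.execute("""
        DELETE FROM Duplicates
        WHERE rowid = (SELECT rowid
                       FROM Duplicates
                       GROUP BY CustID, Col1, Col2
                       HAVING COUNT(*) > 1
                       LIMIT 1)""")
    if cur.rowcount == 0:
        break

rows = conn.execute(
    "SELECT CustID, Col1, Col2 FROM Duplicates ORDER BY Col1, Col2").fetchall()
print(rows)  # one row left per distinct (CustID, Col1, Col2)
```

Like the T-SQL version, this is deliberately RBAR: each pass issues a fresh DELETE, which is exactly why the set-based rewrites later in the thread are preferable.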
  • Luis Cazares (2/1/2016)


    The problem is that you're missing an important option that SQL Server offers: The possibility to modify data through CTEs and views as long as they follow certain rules (Check Updatable views).

    With that option, the code results in something easier to read and generate. Here's an example of your first case.

    --CASE 1:
    CREATE TABLE Duplicates(
        RowID INT IDENTITY (1,1),
        Col1 INT,
        Col2 VARCHAR(100));

    INSERT INTO Duplicates
    VALUES (100, 'ABC'),
           (200, 'ABC'),
           (300, 'BBB'),
           (300, 'BBB'),
           (200, 'CCC'),
           (400, 'EEE'),
           (400, 'EEE'),
           (400, 'EEE'),
           (500, 'DDA');

    --Validate the data before the delete
    SELECT * FROM Duplicates ORDER BY Col1, Col2;

    /* --Original code (rearranged)
    DELETE FROM Duplicates
    FROM (SELECT *,
                 ROW_NUMBER() OVER (PARTITION BY Col1, Col2 ORDER BY Col1) AS RankCol1
          FROM Duplicates) AS T
    WHERE Duplicates.RowID IN (SELECT T.RowID WHERE T.RankCol1 > 1)
    */

    WITH CTE AS(
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY Col1, Col2 --Use the columns that define what is a duplicate row
                                  ORDER BY RowID          --Ability to define which row will be left
                                  ) AS RowNum
        FROM Duplicates
    )
    DELETE FROM CTE
    WHERE RowNum > 1;

    --Validate the data after the delete
    SELECT * FROM Duplicates ORDER BY Col1, Col2;
    GO

    --Clean my DB
    DROP TABLE Duplicates;

    And here's the second case you stated. I just needed to include the CustID column in the PARTITION BY to establish that it's also part of the duplicates definition.

    CREATE TABLE Duplicates(
        CustID INT,
        Col1 INT,
        Col2 VARCHAR(100));

    INSERT INTO Duplicates
    VALUES (1, 100, 'ABC'),
           (2, 200, 'ABC'),
           (3, 300, 'BBB'),
           (3, 300, 'BBB'),
           (3, 300, 'BBB'),
           (4, 200, 'CCC'),
           (5, 400, 'EEE'),
           (5, 400, 'EEE'),
           (5, 400, 'EEE'),
           (6, 500, 'DDA');

    SELECT * FROM Duplicates ORDER BY Col1, Col2;

    WITH CTE AS(
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY CustID, Col1, Col2 ORDER BY Col1) AS RowNum
        FROM Duplicates
    )
    DELETE FROM CTE
    WHERE RowNum > 1;

    SELECT * FROM Duplicates ORDER BY Col1, Col2;
    GO

    --Clean my DB
    DROP TABLE Duplicates;

    No Loops, no RBAR, no deprecated features, no multiple FROM in the delete statement, just simple SQL code that can even be ported to a different RDBMS.

    Is this the best solution? It depends.

    With large data sets and few duplicates, a solution as simple as this might not perform well enough. Depending on various factors, the code may need adjustments. For most scenarios, this method should work just fine.

    Thank you, Luis, for posting an approach that makes sense, is flexible to include multiple columns of multiple types and is performant.
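For readers without a SQL Server handy, the keep-one-row-per-group delete can be sketched with Python's sqlite3 module. SQLite cannot DELETE through a CTE the way SQL Server can (and needs SQLite 3.25+ for window functions), so this adaptation targets the surplus rowids in a subquery instead:

```python
import sqlite3

# Sketch of the ROW_NUMBER() dedup, adapted for SQLite (3.25+ required for
# window functions). SQLite cannot DELETE through a CTE, so surplus rows are
# identified by rowid in a subquery instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Duplicates(CustID INT, Col1 INT, Col2 TEXT)")
conn.executemany(
    "INSERT INTO Duplicates VALUES (?, ?, ?)",
    [(1, 100, 'ABC'), (2, 200, 'ABC'), (3, 300, 'BBB'), (3, 300, 'BBB'),
     (3, 300, 'BBB'), (4, 200, 'CCC'), (5, 400, 'EEE'), (5, 400, 'EEE'),
     (5, 400, 'EEE'), (6, 500, 'DDA')])

# Number the rows inside each duplicate group and delete everything past the
# first, all in one set-based statement -- no loop, no RBAR.
conn.execute("""
    DELETE FROM Duplicates
    WHERE rowid IN (SELECT rowid
                    FROM (SELECT rowid,
                                 ROW_NUMBER() OVER (PARTITION BY CustID, Col1, Col2
                                                    ORDER BY rowid) AS RowNum
                          FROM Duplicates)
                    WHERE RowNum > 1)""")

rows = conn.execute(
    "SELECT CustID, Col1, Col2 FROM Duplicates ORDER BY Col1, Col2").fetchall()
print(rows)
```

The ORDER BY inside the window function plays the same role as in Luis's CTE: it decides which copy of each group survives.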

  • Luis Cazares (2/2/2016)


    o.kaesmann (2/2/2016)


    My favorite method to clean up duplicated rows is this (I used the example code from above):

    That's certainly interesting. I've never heard of %%physloc%% before and for a moment I thought it wasn't even t-sql code. I'm not sure I'd like to use it in production code, but it's something worth some research.

    It's a physical row locator. It's been around a while. Paul did a post on it at http://www.sqlskills.com/blogs/paul/sql-server-2008-new-undocumented-physical-row-locator-function/. Yes, it's real.

  • Ed Wagner (2/2/2016)


    Luis Cazares (2/2/2016)


    o.kaesmann (2/2/2016)


    My favorite method to clean up duplicated rows is this (I used the example code from above):

    That's certainly interesting. I've never heard of %%physloc%% before and for a moment I thought it wasn't even t-sql code. I'm not sure I'd like to use it in production code, but it's something worth some research.

    It's a physical row locator. It's been around a while. Paul did a post on it at http://www.sqlskills.com/blogs/paul/sql-server-2008-new-undocumented-physical-row-locator-function/. Yes, it's real.

    Yes, I tested it after finding Paul's post. The question is: would you use it in production code?

  • Luis Cazares (2/2/2016)


    Ed Wagner (2/2/2016)


    Luis Cazares (2/2/2016)


    o.kaesmann (2/2/2016)


    My favorite method to clean up duplicated rows is this (I used the example code from above):

    That's certainly interesting. I've never heard of %%physloc%% before and for a moment I thought it wasn't even t-sql code. I'm not sure I'd like to use it in production code, but it's something worth some research.

    It's a physical row locator. It's been around a while. Paul did a post on it at http://www.sqlskills.com/blogs/paul/sql-server-2008-new-undocumented-physical-row-locator-function/. Yes, it's real.

    Yes, I tested it after finding Paul's post. The question is: would you use it in production code?

    I don't see a use for production code, but maybe for troubleshooting internals.

  • I do use it in a production environment because I get data from other systems that contains duplicated data (don't ask ....) and it works fine.

    In the past I've worked on Oracle and learned to eliminate duplicates by using ROWID. I searched for something equivalent in MS-SQL and found this physloc thing.
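As a point of comparison, SQLite's rowid behaves much like Oracle's ROWID for this purpose (%%physloc%% itself is SQL Server only). A sketch of the keep-the-minimum-rowid pattern in Python/sqlite3, reusing the thread's sample schema:

```python
import sqlite3

# Sketch of the Oracle ROWID-style dedup using SQLite's rowid: keep the
# smallest rowid in each duplicate group and delete every other copy.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Duplicates(CustID INT, Col1 INT, Col2 TEXT)")
conn.executemany(
    "INSERT INTO Duplicates VALUES (?, ?, ?)",
    [(3, 300, 'BBB'), (3, 300, 'BBB'), (3, 300, 'BBB'),
     (5, 400, 'EEE'), (5, 400, 'EEE'), (6, 500, 'DDA')])

conn.execute("""
    DELETE FROM Duplicates
    WHERE rowid NOT IN (SELECT MIN(rowid)
                        FROM Duplicates
                        GROUP BY CustID, Col1, Col2)""")

rows = conn.execute(
    "SELECT CustID, Col1, Col2 FROM Duplicates ORDER BY Col1").fetchall()
print(rows)  # [(3, 300, 'BBB'), (5, 400, 'EEE'), (6, 500, 'DDA')]
```

The same shape works with Oracle's ROWID; whether %%physloc%% is a safe stand-in in SQL Server production code is exactly the question debated above.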

  • Not sure if I am missing something here, but why can't the following be used?

    delete Customers
    where CustID NOT in (select min(CustID)
                         from Customers
                         group by CustName)

    EDIT: I have subsequently realised (following responses) that CustID is not a PK or covered by any other unique constraint, so I am aware this will not work in this instance.

  • SQLian (2/3/2016)


    Not sure if I am missing something here, but why can't the following be used?

    delete Customers
    where CustID NOT in (select min(CustID)
                         from Customers
                         group by CustName)

    Because that will delete all the non-duplicate customers as well.

    Brandie Tarvin, MCITP Database Administrator
    LiveJournal Blog: http://brandietarvin.livejournal.com/
    On LinkedIn, Google+, and Twitter.
    Freelance Writer: Shadowrun
    Latchkeys: Nevermore, Latchkeys: The Bootleg War, and Latchkeys: Roscoes in the Night are now available on Nook and Kindle.

  • o.kaesmann (2/2/2016)


    I do use it in a production environment because I get data from other systems that contains duplicated data (don't ask ....) and it works fine.

    In the past I've worked on Oracle and learned to eliminate duplicates by using ROWID. I searched for something equivalent in MS-SQL and found this physloc thing.

    I believe that Oracle's ROWID is the fastest mechanism to access a specific row in a table. Do you know if the same is true in MS-SQL?

  • Brandie Tarvin (2/3/2016)


    SQLian (2/3/2016)


    Not sure if I am missing something here, but why can't the following be used?

    delete Customers
    where CustID NOT in (select min(CustID)
                         from Customers
                         group by CustName)

    Because that will delete all the non-duplicate customers as well.

    No it won't. But it won't work with rows that are complete duplicates or with tables with composite keys.

  • Yes, my bad; I failed to read the original question properly and wrongly assumed that CustID was a PK.

    I'll return to my cave.

  • SQLian (2/3/2016)


    Yes, my bad; I failed to read the original question properly and wrongly assumed that CustID was a PK.

    I'll return to my cave.

    Your approach is certainly better than the one proposed in the article. It just won't work in certain cases, but it's certainly not bad per se.

  • Luis Cazares (2/3/2016)


    Brandie Tarvin (2/3/2016)


    SQLian (2/3/2016)


    Not sure if I am missing something here, but why can't the following be used?

    delete [FROM] Customers
    where CustID NOT in (select min(CustID)
                         from Customers
                         group by CustName)
    --Brandie added FROM clause

    Because that will delete all the non-duplicate customers as well.

    No it won't. But it won't work with rows that are complete duplicates or with tables with composite keys.

    Maybe I'm misreading, but SELECT MIN(CustID) would select the minimum customer ID (say 1) and then delete everything else, wouldn't it?

    If not, what am I missing?


  • Brandie Tarvin (2/3/2016)


    Luis Cazares (2/3/2016)


    Brandie Tarvin (2/3/2016)


    SQLian (2/3/2016)


    Not sure if I am missing something here, but why can't the following be used?

    delete [FROM] Customers
    where CustID NOT in (select min(CustID)
                         from Customers
                         group by CustName)
    --Brandie added FROM clause

    Because that will delete all the non-duplicate customers as well.

    No it won't. But it won't work with rows that are complete duplicates or with tables with composite keys.

    Maybe I'm misreading, but SELECT MIN(CustID) would select the minimum customer ID (say 1) and then delete everything else, wouldn't it?

    If not, what am I missing?

    You're missing the GROUP BY CustName. It will leave one CustID for each CustName. For unique names it won't do anything, and for duplicate names it will delete the higher CustIDs.

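Luis's explanation is easy to check. A Python/sqlite3 sketch (the Customers table with CustID and CustName is assumed from the thread's discussion; CustID is made unique here, which is exactly the assumption SQLian flagged as wrong for the original data):

```python
import sqlite3

# Sketch verifying the GROUP BY CustName behavior: unique names are left
# untouched, and for duplicated names only the lowest CustID survives.
# Assumes CustID is unique -- the thread's point is this breaks when it isn't.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Customers(CustID INT PRIMARY KEY, CustName TEXT)")
conn.executemany("INSERT INTO Customers VALUES (?, ?)",
                 [(1, 'Ann'), (2, 'Ann'), (3, 'Bob'),
                  (4, 'Cat'), (5, 'Cat'), (6, 'Dee')])

# Keep the minimum CustID per name; delete every other row.
conn.execute("""
    DELETE FROM Customers
    WHERE CustID NOT IN (SELECT MIN(CustID)
                         FROM Customers
                         GROUP BY CustName)""")

rows = conn.execute(
    "SELECT CustID, CustName FROM Customers ORDER BY CustID").fetchall()
print(rows)  # [(1, 'Ann'), (3, 'Bob'), (4, 'Cat'), (6, 'Dee')]
```

Bob and Dee (unique names) survive untouched, while the higher CustIDs for Ann and Cat are removed, matching Luis's description.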

Viewing 15 posts - 46 through 60 (of 76 total)
