Removing Duplicates

  • Comments posted to this topic are about the item Removing Duplicates

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    It’s unpleasantly like being drunk.
    What’s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams

  • Since I have worked with Oracle I have removed duplicates using the rowid column. SQL Server has something similar but its an undocumented feature (might be changed without notice).

    delete

    from table t1

    where exists

    (select *

    from table t2

    where t1.column1 = t2.column1

    and t1.column2 = t2.column2

    ...

    and t1.rowid > t2. rowid) -- in Oracle

    and t1.%%physloc%% > %%physloc%%) -- in SQL Server

    This self join can be useful, you dont need to compare all rows for example and it removed more than one row in case of tripple duplicates.

  • peterswe (10/8/2013)


    Since I have worked with Oracle I have removed duplicates using the rowid column. SQL Server has something similar but its an undocumented feature (might be changed without notice).

    delete

    from table t1

    where exists

    (select *

    from table t2

    where t1.column1 = t2.column1

    and t1.column2 = t2.column2

    ...

    and t1.rowid > t2. rowid) -- in Oracle

    and t1.%%physloc%% > %%physloc%%) -- in SQL Server

    This self join can be useful, you dont need to compare all rows for example and it removed more than one row in case of tripple duplicates.

    http://www.sqlskills.com/blogs/paul/sql-server-2008-new-undocumented-physical-row-locator-function/

    Not so sure how happy I would be using this - but then as Paul says all the cool stuff is undocumented.

    Great question, I really like how you have included the ability to retain the latest version.. I usually use a more simple way to remove duplicate rows..

    Using the same table...

    Declare @FName varchar(30), -- values to look for

    @LName varchar (30),-- values to look for

    @cnt int -- count

    Declare getallrecords cursor local static For

    Select count (1), Fname, Lname

    from DuplicateRow (nolock)

    group by FName, LName having count(1)>1

    Open getallrecords

    Fetch next from getallrecords into @cnt,@FName,@LName

    --Cursor to check with all other records

    While @@fetch_status=0

    Begin

    Set @cnt= @cnt-1

    Set rowcount @cnt

    -- Deleting the duplicate records. Observe that all fields are mentioned at the where condition

    Delete from DuplicateRow where Fname=@FName and LName=@LName

    Set rowcount 0

    Fetch next from getallrecords into @cnt,@FName,@LName

    End

    Close getallrecords

    Deallocate getallrecords

    Obviously depending on how many rows you need to compare to ensure your only removing duplicates would depend on what you declare and select.

    Hope this helps...

    Ford Fairlane
    Rock and Roll Detective

  • Thanks for a useful article, clearly explained. I shall have occasion to use your method when I import data from Excel, which despite precautions has duplicates.

  • I believe the following statement in the article is misleading:

    You don’t need to do the SELECT first, but it is nice to see what you’ll be deleting.

    It is required that the select section be in the delete statement because the windowing functions cannot be part of the where clause. The following DOES NOT work:

    SELECT

    ROW_NUMBER() OVER ( PARTITION BY FName, LName, JobTitle, Age ORDER BY LName ) R

    , FName

    , LName

    , JobTitle

    , Age

    FROM DuplicateRow

    WHERE ROW_NUMBER() OVER ( PARTITION BY FName, LName, JobTitle, Age ORDER BY LName ) > 1

    Gives the following error:

    Windowed functions can only appear in the SELECT or ORDER BY clauses.

  • Thanks Stefan. Great article.

  • tom.w.brannon (10/8/2013)


    I believe the following statement in the article is misleading:

    You don’t need to do the SELECT first, but it is nice to see what you’ll be deleting.

    It is required that the select section be in the delete statement because the windowing functions cannot be part of the where clause. The following DOES NOT work:

    SELECT

    ROW_NUMBER() OVER ( PARTITION BY FName, LName, JobTitle, Age ORDER BY LName ) R

    , FName

    , LName

    , JobTitle

    , Age

    FROM DuplicateRow

    WHERE ROW_NUMBER() OVER ( PARTITION BY FName, LName, JobTitle, Age ORDER BY LName ) > 1

    Gives the following error:

    Windowed functions can only appear in the SELECT or ORDER BY clauses.

    What I meant by "You don't need to do a SELECT first" is that you can just use the DELETE query. The SELECT subquery within the DELETE query is still needed. By "first" I meant running the SELECT to see the results before running the DELETE. Both have another SELECT subquery as part of the operation.

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    It’s unpleasantly like being drunk.
    What’s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams

  • -- I have added DateInserted and Address columns to show how to remove all old records for the same person

    --drop table #DuplicateRow

    CREATE TABLE #DuplicateRow(

    ID INT identity(1,1),

    DateInserted Datetime,

    FName varchar(30),

    LName varchar(30),

    JobTitle varchar(30),

    Age tinyint,

    [Address] Varchar(255)

    )

    GO

    INSERT INTO #DuplicateRow (DateInserted,FName,LName,JobTitle,Age,[Address])

    SELECT DateInserted,FName,LName,JobTitle,Age,[Address] FROM (

    SELECT '1/1/12' [DateInserted], 'Mangu' [FName],'Changu' [LName],'Bekar' [JobTitle], 21 [Age] , '1 broad street,ny ' [Address]UNION ALL

    SELECT '2/1/12' [DateInserted],'Mangu' [FName],'Changu' [LName],'Bekar' [JobTitle], 21 [Age] , '11 broad street,ny ' [Address]UNION ALL

    SELECT '6/1/12' [DateInserted],'Mehul' [FName],'Shah' [LName],'Manager' [JobTitle], 55 [Age] , '1 main street,ny' [Address]UNION ALL

    SELECT '3/1/12' [DateInserted],'Mehul' [FName],'Shah' [LName],'Manager' [JobTitle], 55 [Age] , '11 main street,ny' [Address]UNION ALL

    SELECT '7/1/12' [DateInserted],'Mangu' [FName],'Changu' [LName],'Bekar' [JobTitle], 21 [Age] , '11/a broad street,ny' [Address]UNION ALL

    SELECT '8/1/12' [DateInserted],'Mangu' [FName],'Changu' [LName],'Bekar' [JobTitle], 21 [Age] , '11/b broad street,ny' [Address]UNION ALL

    SELECT '9/1/12' [DateInserted],'Mehul' [FName],'Shah' [LName],'Manager' [JobTitle], 55 [Age] , '115 main street,ny' [Address]UNION ALL

    SELECT '10/1/12' [DateInserted],'Mehul' [FName],'Shah' [LName],'Manager' [JobTitle], 55 [Age], '118 main street,ny' [Address]

    ) A

    SELECT * FROM #DuplicateRow order by FName,LName,JobTitle,Age,DateInserted desc

    -- we need to remove all rows except id= 6 and 8

    SELECT

    ROW_NUMBER() OVER ( PARTITION BY FName, LName, JobTitle, Age ORDER BY DateInserted desc ) R

    , FName

    , LName

    , JobTitle

    , Age

    FROM #DuplicateRow

    SELECT *

    FROM ( SELECT ROW_NUMBER() OVER ( PARTITION BY FName, LName, JobTitle, Age ORDER BY DateInserted desc) R

    , FName

    , LName

    , JobTitle

    , Age

    FROM #DuplicateRow

    ) B

    WHERE R > 1

    DELETE B

    FROM ( SELECT ROW_NUMBER() OVER ( PARTITION BY FName, LName, JobTitle, Age ORDER BY DateInserted desc ) R

    FROM #DuplicateRow

    ) B

    WHERE R > 1

    SELECT * FROM #DuplicateRow

  • Hello There,

    You may check this article that provides detailed examples of how to remove duplicate records.

    http://www.dfarber.com/computer-consulting-blog/2011/12/26/remove-duplicate-records-in-sql.aspx

    It combines several values from several records into one good record.

    Regards,

    Doron

    The Farber Consulting Group, Inc.

    http://www.dFarber.com

  • Short, sweet, to the point with great examples and good simple explanations. Very nicely done, Mr. Krzywicki.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I can see the need to understand how to do this, although I tend to go back to the basics.

    If you find yourself needing to do this, question if it be designed into the the table in the first place to prevent this?

    After all, I think this leads into the initial issue - without deleting all, the engine needs a way to discern which one(s) to delete.

    I realize sometimes you have no input / control into this, but a few words about this concept might be a worthwhile addition.

    If you wanted to expand this, age (at least to me) should be calculated, not stored in most cases.

    And the changing job title also drives me towards separating out to different tables and having effectivity dates.

    But that is way beyond your intended scope.

    Just trying to spark a thought or two, not to make a big deal about any of this.

  • Jeff Moden (10/8/2013)


    Short, sweet, to the point with great examples and good simple explanations. Very nicely done, Mr. Krzywicki.

    Thank you!

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    It’s unpleasantly like being drunk.
    What’s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams

  • Hi Guys,

    I see there is a wrong DB structure (table structure). Ussualy each master table should have at least a primary key. Using proper primary key you avoid the problem...

  • Greg Edwards-268690 (10/8/2013)


    I can see the need to understand how to do this, although I tend to go back to the basics.

    If you find yourself needing to do this, question if it be designed into the the table in the first place to prevent this?

    After all, I think this leads into the initial issue - without deleting all, the engine needs a way to discern which one(s) to delete.

    I realize sometimes you have no input / control into this, but a few words about this concept might be a worthwhile addition.

    If you wanted to expand this, age (at least to me) should be calculated, not stored in most cases.

    And the changing job title also drives me towards separating out to different tables and having effectivity dates.

    But that is way beyond your intended scope.

    Just trying to spark a thought or two, not to make a big deal about any of this.

    It is funny, I agonized a bit over the example table. As I was putting in the columns I kept thinking "Well, if this were a real database I'd put this in another table so it could have multiple values or historical data or maybe this should be calculated" then I reminded myself that was outside the scope of this article and I just needed something to use as an example. : -)

    Same thing with table design, there are situations where you have no way to prevent this ahead of time, whether it is because the duplicates are in the data coming in or because the table is already in production and the powers that be won't approve a structural change. I wanted to keep the focus on this one task as dealing with every possibility would make the article far longer.

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    It’s unpleasantly like being drunk.
    What’s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams

  • Jozef Moravcik (10/8/2013)


    Hi Guys,

    I see there is a wrong DB structure (table structure). Ussualy each master table should have at least a primary key. Using proper primary key you avoid the problem...

    Sure, and if duplicate data gets into your table, you can use the primary key to remove duplicates, but this gives you a way to find what those duplicates are.

    But what do you do if you don't have the ability to add the primary key? Or if you need to remove duplicates on large batches of data that are coming into your system from outside sources? This should help with any of those situations.

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    It’s unpleasantly like being drunk.
    What’s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams

Viewing 15 posts - 1 through 15 (of 52 total)

You must be logged in to reply to this topic. Login to reply