Removing Duplicates

Question

Removing Duplicates

Sioban Krzywicki

One Orange Chip

Points: 27770
More actions
October 7, 2013 at 9:18 pm

#278715

Comments posted to this topic are about the item Removing Duplicates
--------------------------------------
When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
--------------------------------------
It’s unpleasantly like being drunk.
What’s so unpleasant about being drunk?
You ask a glass of water. -- Douglas Adams

Viewing 15 posts - 1 through 15 (of 53 total)

You must be logged in to reply to this topic. Login to reply

peterswe SSC-Addicted Points: 465 More actions · Answer 1

Since I have worked with Oracle I have removed duplicates using the rowid column. SQL Server has something similar but its an undocumented feature (might be changed without notice).

delete

from table t1

where exists

(select *

from table t2

where t1.column1 = t2.column1

and t1.column2 = t2.column2

...

and t1.rowid > t2. rowid) -- in Oracle

and t1.%%physloc%% > %%physloc%%) -- in SQL Server

This self join can be useful, you dont need to compare all rows for example and it removed more than one row in case of tripple duplicates.

Ford Fairlane SSCertifiable Points: 7664 More actions · Answer 2

peterswe (10/8/2013)
Since I have worked with Oracle I have removed duplicates using the rowid column. SQL Server has something similar but its an undocumented feature (might be changed without notice).
delete
from table t1
where exists
(select *
from table t2
where t1.column1 = t2.column1
and t1.column2 = t2.column2
...
and t1.rowid > t2. rowid) -- in Oracle
and t1.%%physloc%% > %%physloc%%) -- in SQL Server
This self join can be useful, you dont need to compare all rows for example and it removed more than one row in case of tripple duplicates.

http://www.sqlskills.com/blogs/paul/sql-server-2008-new-undocumented-physical-row-locator-function/

Not so sure how happy I would be using this - but then as Paul says all the cool stuff is undocumented.

Great question, I really like how you have included the ability to retain the latest version.. I usually use a more simple way to remove duplicate rows..

Using the same table...

Declare @FName varchar(30), -- values to look for

@LName varchar (30),-- values to look for

@cnt int -- count

Declare getallrecords cursor local static For

Select count (1), Fname, Lname

from DuplicateRow (nolock)

group by FName, LName having count(1)>1

Open getallrecords

Fetch next from getallrecords into @cnt,@FName,@LName

--Cursor to check with all other records

While @@fetch_status=0

Begin

Set @cnt= @cnt-1

Set rowcount @cnt

-- Deleting the duplicate records. Observe that all fields are mentioned at the where condition

Delete from DuplicateRow where Fname=@FName and LName=@LName

Set rowcount 0

Fetch next from getallrecords into @cnt,@FName,@LName

End

Close getallrecords

Deallocate getallrecords

Obviously depending on how many rows you need to compare to ensure your only removing duplicates would depend on what you declare and select.

Hope this helps...

Ford Fairlane
Rock and Roll Detective

gand 29575 SSChasing Mays Points: 651 More actions · Answer 3

Thanks for a useful article, clearly explained. I shall have occasion to use your method when I import data from Excel, which despite precautions has duplicates.

tom.w.brannon Hall of Fame Points: 3976 More actions · Answer 4

I believe the following statement in the article is misleading:

You don’t need to do the SELECT first, but it is nice to see what you’ll be deleting.

It is required that the select section be in the delete statement because the windowing functions cannot be part of the where clause. The following DOES NOT work:

SELECT

ROW_NUMBER() OVER ( PARTITION BY FName, LName, JobTitle, Age ORDER BY LName ) R

, FName

, LName

, JobTitle

, Age

FROM DuplicateRow

WHERE ROW_NUMBER() OVER ( PARTITION BY FName, LName, JobTitle, Age ORDER BY LName ) > 1

Gives the following error:

Windowed functions can only appear in the SELECT or ORDER BY clauses.

justin-397850 SSC Rookie Points: 35 More actions · Answer 5

justin-397850

SSC Rookie

Points: 35

October 8, 2013 at 4:58 am

#1656673

Thanks Stefan. Great article.

Sioban Krzywicki One Orange Chip Points: 27770 More actions · Answer 6

tom.w.brannon (10/8/2013)
I believe the following statement in the article is misleading:
You don’t need to do the SELECT first, but it is nice to see what you’ll be deleting.
It is required that the select section be in the delete statement because the windowing functions cannot be part of the where clause. The following DOES NOT work:
SELECT
ROW_NUMBER() OVER ( PARTITION BY FName, LName, JobTitle, Age ORDER BY LName ) R
, FName
, LName
, JobTitle
, Age
FROM DuplicateRow
WHERE ROW_NUMBER() OVER ( PARTITION BY FName, LName, JobTitle, Age ORDER BY LName ) > 1
Gives the following error:
Windowed functions can only appear in the SELECT or ORDER BY clauses.

What I meant by "You don't need to do a SELECT first" is that you can just use the DELETE query. The SELECT subquery within the DELETE query is still needed. By "first" I meant running the SELECT to see the results before running the DELETE. Both have another SELECT subquery as part of the operation.

--------------------------------------
When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
--------------------------------------
It’s unpleasantly like being drunk.
What’s so unpleasant about being drunk?
You ask a glass of water. -- Douglas Adams

apatel62 Valued Member Points: 53 More actions · Answer 7

-- I have added DateInserted and Address columns to show how to remove all old records for the same person

--drop table #DuplicateRow

CREATE TABLE #DuplicateRow(

ID INT identity(1,1),

DateInserted Datetime,

FName varchar(30),

LName varchar(30),

JobTitle varchar(30),

Age tinyint,

[Address] Varchar(255)

)

GO

INSERT INTO #DuplicateRow (DateInserted,FName,LName,JobTitle,Age,[Address])

SELECT DateInserted,FName,LName,JobTitle,Age,[Address] FROM (

SELECT '1/1/12' [DateInserted], 'Mangu' [FName],'Changu' [LName],'Bekar' [JobTitle], 21 [Age] , '1 broad street,ny ' [Address]UNION ALL

SELECT '2/1/12' [DateInserted],'Mangu' [FName],'Changu' [LName],'Bekar' [JobTitle], 21 [Age] , '11 broad street,ny ' [Address]UNION ALL

SELECT '6/1/12' [DateInserted],'Mehul' [FName],'Shah' [LName],'Manager' [JobTitle], 55 [Age] , '1 main street,ny' [Address]UNION ALL

SELECT '3/1/12' [DateInserted],'Mehul' [FName],'Shah' [LName],'Manager' [JobTitle], 55 [Age] , '11 main street,ny' [Address]UNION ALL

SELECT '7/1/12' [DateInserted],'Mangu' [FName],'Changu' [LName],'Bekar' [JobTitle], 21 [Age] , '11/a broad street,ny' [Address]UNION ALL

SELECT '8/1/12' [DateInserted],'Mangu' [FName],'Changu' [LName],'Bekar' [JobTitle], 21 [Age] , '11/b broad street,ny' [Address]UNION ALL

SELECT '9/1/12' [DateInserted],'Mehul' [FName],'Shah' [LName],'Manager' [JobTitle], 55 [Age] , '115 main street,ny' [Address]UNION ALL

SELECT '10/1/12' [DateInserted],'Mehul' [FName],'Shah' [LName],'Manager' [JobTitle], 55 [Age], '118 main street,ny' [Address]

) A

SELECT * FROM #DuplicateRow order by FName,LName,JobTitle,Age,DateInserted desc

-- we need to remove all rows except id= 6 and 8

SELECT

ROW_NUMBER() OVER ( PARTITION BY FName, LName, JobTitle, Age ORDER BY DateInserted desc ) R

, FName

, LName

, JobTitle

, Age

FROM #DuplicateRow

SELECT *

FROM ( SELECT ROW_NUMBER() OVER ( PARTITION BY FName, LName, JobTitle, Age ORDER BY DateInserted desc) R

, FName

, LName

, JobTitle

, Age

FROM #DuplicateRow

) B

WHERE R > 1

DELETE B

FROM ( SELECT ROW_NUMBER() OVER ( PARTITION BY FName, LName, JobTitle, Age ORDER BY DateInserted desc ) R

FROM #DuplicateRow

) B

WHERE R > 1

SELECT * FROM #DuplicateRow

consultant_fcg SSC Rookie Points: 25 More actions · Answer 8

Hello There,

You may check this article that provides detailed examples of how to remove duplicate records.

http://www.dfarber.com/computer-consulting-blog/2011/12/26/remove-duplicate-records-in-sql.aspx

It combines several values from several records into one good record.

Regards,

Doron

The Farber Consulting Group, Inc.

http://www.dFarber.com

Jeff Moden SSC Guru Points: 1004749 More actions · Answer 9

Short, sweet, to the point with great examples and good simple explanations. Very nicely done, Mr. Krzywicki.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Greg Edwards-268690 SSC-Insane Points: 20680 More actions · Answer 10

I can see the need to understand how to do this, although I tend to go back to the basics.

If you find yourself needing to do this, question if it be designed into the the table in the first place to prevent this?

After all, I think this leads into the initial issue - without deleting all, the engine needs a way to discern which one(s) to delete.

I realize sometimes you have no input / control into this, but a few words about this concept might be a worthwhile addition.

If you wanted to expand this, age (at least to me) should be calculated, not stored in most cases.

And the changing job title also drives me towards separating out to different tables and having effectivity dates.

But that is way beyond your intended scope.

Just trying to spark a thought or two, not to make a big deal about any of this.

Sioban Krzywicki One Orange Chip Points: 27770 More actions · Answer 11

Jeff Moden (10/8/2013)
Short, sweet, to the point with great examples and good simple explanations. Very nicely done, Mr. Krzywicki.

Thank you!

--------------------------------------
When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
--------------------------------------
It’s unpleasantly like being drunk.
What’s so unpleasant about being drunk?
You ask a glass of water. -- Douglas Adams

Jozef Moravcik Grasshopper Points: 23 More actions · Answer 12

Hi Guys,

I see there is a wrong DB structure (table structure). Ussualy each master table should have at least a primary key. Using proper primary key you avoid the problem...

Sioban Krzywicki One Orange Chip Points: 27770 More actions · Answer 13

Greg Edwards-268690 (10/8/2013)
I can see the need to understand how to do this, although I tend to go back to the basics.
If you find yourself needing to do this, question if it be designed into the the table in the first place to prevent this?
After all, I think this leads into the initial issue - without deleting all, the engine needs a way to discern which one(s) to delete.
I realize sometimes you have no input / control into this, but a few words about this concept might be a worthwhile addition.
If you wanted to expand this, age (at least to me) should be calculated, not stored in most cases.
And the changing job title also drives me towards separating out to different tables and having effectivity dates.
But that is way beyond your intended scope.
Just trying to spark a thought or two, not to make a big deal about any of this.

It is funny, I agonized a bit over the example table. As I was putting in the columns I kept thinking "Well, if this were a real database I'd put this in another table so it could have multiple values or historical data or maybe this should be calculated" then I reminded myself that was outside the scope of this article and I just needed something to use as an example. : -)

Same thing with table design, there are situations where you have no way to prevent this ahead of time, whether it is because the duplicates are in the data coming in or because the table is already in production and the powers that be won't approve a structural change. I wanted to keep the focus on this one task as dealing with every possibility would make the article far longer.

--------------------------------------
When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
--------------------------------------
It’s unpleasantly like being drunk.
What’s so unpleasant about being drunk?
You ask a glass of water. -- Douglas Adams

Sioban Krzywicki One Orange Chip Points: 27770 More actions · Answer 14

Jozef Moravcik (10/8/2013)
Hi Guys,
I see there is a wrong DB structure (table structure). Ussualy each master table should have at least a primary key. Using proper primary key you avoid the problem...

Sure, and if duplicate data gets into your table, you can use the primary key to remove duplicates, but this gives you a way to find what those duplicates are.

But what do you do if you don't have the ability to add the primary key? Or if you need to remove duplicates on large batches of data that are coming into your system from outside sources? This should help with any of those situations.

--------------------------------------
When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
--------------------------------------
It’s unpleasantly like being drunk.
What’s so unpleasant about being drunk?
You ask a glass of water. -- Douglas Adams