Removing Duplicates

  • curious_sqldba (10/10/2013)


    Very nice article and good explanation. Can you please add your thoughts on performance? Is this the recommended way to do it?

    I find the performance to be quite good; I have this code, or similar code, in production in a number of places and I've never had issues with speed. I haven't had the opportunity to compare its performance to a CTE (there shouldn't be much difference at all) or a cursor (where there could be a large difference with a large number of records). I don't know if it is THE recommended way to do it, but it is the way I would recommend doing it.
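    For anyone who wants to run that comparison themselves, here is a minimal sketch of a timing harness. The dedupe statements are placeholders; fill them in with whichever variants you're testing, ideally against a restored copy of the same data for each run:

    SET STATISTICS TIME ON;
    SET STATISTICS IO ON;

    -- run candidate 1 here (e.g. the article's DELETE-based approach)
    -- run candidate 2 here (e.g. a ROW_NUMBER() CTE, or a cursor)

    SET STATISTICS TIME OFF;
    SET STATISTICS IO OFF;
    -- compare the elapsed-time and logical-reads figures in the Messages tab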

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    It’s unpleasantly like being drunk.
    What’s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams

  • Nice introductory article, Stefan. Good job!

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • I tried this solution on a table with over 7 million rows (around 100,000 duplicates of each). The query ran for around 57 minutes, then my server crashed and the database I was running it against went into recovery mode... (I needed to reboot the server to make it available again :-)).

    So the function is great for smaller tables, but it wasn't a real solution in my case for a table with 7+ million rows. (In the end I had a total of 1,654 rows left, so yeah, a LOT of duplicates.) In my case I was lucky that I could run:

    SELECT DISTINCT
        CreateStatement,
        dbo.split(dbo.split(CreateStatement, '[', 2), ']') AS Tablename
    FROM CB_I_Drop_Create_Indexes;

    The above query took around 4 minutes to complete the same job. 🙂 (The table name for each index was embedded in the CreateStatement column.) Hope someone else finds it useful.

  • prennings (10/15/2013)


    I tried this solution on a table with over 7 million rows (around 100,000 duplicates of each). The query ran for around 57 minutes, then my server crashed and the database I was running it against went into recovery mode... (I needed to reboot the server to make it available again :-)).

    So the function is great for smaller tables, but it wasn't a real solution in my case for a table with 7+ million rows. (In the end I had a total of 1,654 rows left, so yeah, a LOT of duplicates.) In my case I was lucky that I could run:

    SELECT DISTINCT
        CreateStatement,
        dbo.split(dbo.split(CreateStatement, '[', 2), ']') AS Tablename
    FROM CB_I_Drop_Create_Indexes;

    The above query took around 4 minutes to complete the same job. 🙂 (The table name for each index was embedded in the CreateStatement column.) Hope someone else finds it useful.

    I think in that case you might even be better off just copying the good rows to a new table, dropping the old one and renaming it. How did you end up with a table that's that dirty?

    I wouldn't expect this query alone to crash your system or make the database unavailable. For that to happen, I'd think there would have to be some configuration problems with the server, and it would have to be a rather small server.
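    For anyone who wants to try that copy-and-rename approach on prennings's table, here's a minimal sketch (an outline rather than tested code; the _clean suffix is just a placeholder name):

    SELECT DISTINCT CreateStatement
    INTO dbo.CB_I_Drop_Create_Indexes_clean   -- new table receives only the deduped rows
    FROM dbo.CB_I_Drop_Create_Indexes;

    DROP TABLE dbo.CB_I_Drop_Create_Indexes;  -- drop the dirty original

    EXEC sp_rename 'dbo.CB_I_Drop_Create_Indexes_clean',
                   'CB_I_Drop_Create_Indexes'; -- swap the clean copy into place

    Remember that SELECT ... INTO doesn't carry over indexes, constraints, or permissions, so script those out beforehand.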

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    It’s unpleasantly like being drunk.
    What’s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams

  • prennings (10/15/2013)


    I tried this solution on a table with over 7 million rows (around 100,000 duplicates of each). The query ran for around 57 minutes, then my server crashed and the database I was running it against went into recovery mode... (I needed to reboot the server to make it available again :-)).

    So the function is great for smaller tables, but it wasn't a real solution in my case for a table with 7+ million rows. (In the end I had a total of 1,654 rows left, so yeah, a LOT of duplicates.) In my case I was lucky that I could run:

    SELECT DISTINCT
        CreateStatement,
        dbo.split(dbo.split(CreateStatement, '[', 2), ']') AS Tablename
    FROM CB_I_Drop_Create_Indexes;

    The above query took around 4 minutes to complete the same job. 🙂 (The table name for each index was embedded in the CreateStatement column.) Hope someone else finds it useful.

    I would love to see the split function. Any chance of you posting it, please? Thanks.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
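
    (The split function was never posted in the thread. As a purely hypothetical sketch — not prennings's actual code — a scalar function matching the calls in the query above might return the Nth delimiter-separated token, something like this:)

    CREATE FUNCTION dbo.split
    (
        @input      NVARCHAR(MAX),
        @delimiter  NVARCHAR(10),
        @occurrence INT = 1          -- which token to return, 1-based
    )
    RETURNS NVARCHAR(MAX)
    AS
    BEGIN
        DECLARE @start INT = 1, @pos INT, @i INT = 1;

        -- skip past the first (@occurrence - 1) delimiters
        WHILE @i < @occurrence
        BEGIN
            SET @pos = CHARINDEX(@delimiter, @input, @start);
            IF @pos = 0 RETURN NULL;               -- fewer tokens than requested
            SET @start = @pos + LEN(@delimiter);
            SET @i += 1;
        END;

        -- return everything up to the next delimiter (or the end of the string)
        SET @pos = CHARINDEX(@delimiter, @input, @start);
        IF @pos = 0 SET @pos = LEN(@input) + 1;
        RETURN SUBSTRING(@input, @start, @pos - @start);
    END;

    (Note that T-SQL scalar functions can't omit arguments, so with this version the two-argument call in the posted query would have to be written as dbo.split(..., ']', DEFAULT); the real function may well have been different, e.g. a CLR one.)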

  • Great article, thanks! Will definitely use it.

  • http://deleteduplicaterecords.blogspot.in/

    You get all answers here

  • Here is another way to delete dups.

    WITH dup AS
    (
        -- ORDER BY (SELECT 0): no preference for which duplicate survives
        SELECT Policy,
               ROW_NUMBER() OVER (PARTITION BY agentnum, policy ORDER BY (SELECT 0)) AS RN
        FROM dbo.YourTable
    )
    DELETE FROM dup
    WHERE RN > 1;

  • jeknight44 (1/23/2015)


    Here is another way to delete dups.

    WITH dup AS
    (
        -- ORDER BY (SELECT 0): no preference for which duplicate survives
        SELECT Policy,
               ROW_NUMBER() OVER (PARTITION BY agentnum, policy ORDER BY (SELECT 0)) AS RN
        FROM dbo.YourTable
    )
    DELETE FROM dup
    WHERE RN > 1;

    This thread is about a question of the day (some sort of quiz) from 1.5 years ago.

    I think most people here thus already know how to remove duplicates.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Sometimes the easiest way is not the most elegant. When I have just one set of duplicates, I select the data, stash it in Notepad or Excel, delete all the duplicate records, and re-insert it with a paste using the 'Edit Top 200 Rows' feature in SSMS.

    I find this thread very helpful for when you have multiple duplicate records in a table that was poorly designed, without an ID or key constraint preventing duplicate records.

  • nicolas_johnson 52979 (1/23/2015)


    Sometimes the easiest way is not the most elegant. When I have just one set of duplicates, I select the data, stash it in Notepad or Excel, delete all the duplicate records, and re-insert it with a paste using the 'Edit Top 200 Rows' feature in SSMS.

    I find this thread very helpful for when you have multiple duplicate records in a table that was poorly designed, without an ID or key constraint preventing duplicate records.

    Sometimes the problem is that the data you receive doesn't have a distinct ID or you get the data twice. Your table can be perfect, but if the data coming in isn't, this will keep your system from crashing. And it is something you can automate, so you don't have to manually process any duplicates you have. When I set up my imports to dedupe this way, I rarely have to pay attention to duplicates after that.
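    For anyone wanting to automate the same thing, here's a rough sketch of a dedupe-on-import pattern (all of the table and column names below are made up for illustration; adapt them to your own schema):

    -- load the raw feed into a staging table first, then move over only
    -- one copy of each logical row that isn't already in the target
    INSERT INTO dbo.PolicyTarget (AgentNum, Policy, LoadDate)
    SELECT s.AgentNum, s.Policy, s.LoadDate
    FROM
    (
        SELECT AgentNum, Policy, LoadDate,
               ROW_NUMBER() OVER (PARTITION BY AgentNum, Policy
                                  ORDER BY LoadDate DESC) AS RN
        FROM dbo.PolicyStaging
    ) AS s
    WHERE s.RN = 1                         -- keep one copy per duplicate set
      AND NOT EXISTS                       -- skip rows already imported
          (SELECT 1
           FROM dbo.PolicyTarget AS t
           WHERE t.AgentNum = s.AgentNum
             AND t.Policy   = s.Policy);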

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    It’s unpleasantly like being drunk.
    What’s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams

  • Shankar Walvekar (1/23/2015)


    http://deleteduplicaterecords.blogspot.in/

    You get all answers here

    Why? Sioban's article has all the answers along with reasons why it works.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (1/23/2015)


    Shankar Walvekar (1/23/2015)


    http://deleteduplicaterecords.blogspot.in/

    You get all answers here

    Why? Sioban's article has all the answers along with reasons why it works.

    Thanks Jeff.

    I feel like I always see someone self-promoting like this in the comments. That's why I didn't bother to respond. Or click through. :-)

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    It’s unpleasantly like being drunk.
    What’s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams

  • Hi,

    I sometimes use a CTE to delete duplicates. It works fine.

    Paul

  • pmusiej (1/23/2015)


    Hi,

    I sometimes use a CTE to delete duplicates. It works fine.

    Paul

    Sure! That's one way you can do it. Want to post some code to share with people? Knowing multiple ways of doing things is nice, and having alternatives can help if one way runs into trouble.

    For example, sometimes CTEs can run long on large data sets.
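    When a single big DELETE like that runs long, one common workaround is to chunk it. Here's a rough sketch reusing jeknight44's CTE from above (the table and column names are still his placeholders, and the batch size is arbitrary):

    WHILE 1 = 1
    BEGIN
        WITH dup AS
        (
            SELECT ROW_NUMBER() OVER (PARTITION BY agentnum, policy
                                      ORDER BY (SELECT 0)) AS RN
            FROM dbo.YourTable
        )
        DELETE TOP (50000) FROM dup   -- small batches keep the log and locks manageable
        WHERE RN > 1;

        IF @@ROWCOUNT = 0 BREAK;      -- no duplicates left
    END;

    The trade-off is that the window function is recomputed on every pass, so this is simpler rather than faster; what it buys you is a small transaction per batch instead of one huge one.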

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    It’s unpleasantly like being drunk.
    What’s so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams
