Another Delete Duplicates Thread


Author
Message
SQL-Squid
SSC-Enthusiastic (114 reputation)
Group: General Forum Members
Points: 114 Visits: 306
I know this topic has been covered many times, but I cannot find the specific solution I am looking for.

We load data from an outside source, and sometimes duplicates come in with it, sometimes not. (That's another story.)

I have a basic query that will handle my duplicate delete issue without any logic added:

SELECT DISTINCT * INTO TEMP_TABLE
FROM SOURCE_TABLE
GO
TRUNCATE TABLE SOURCE_TABLE
GO
INSERT INTO SOURCE_TABLE
SELECT * FROM TEMP_TABLE
GO
DROP TABLE TEMP_TABLE
GO
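For anyone who wants to poke at this round-trip pattern outside SQL Server, here is a minimal sketch of the same SELECT DISTINCT-into-temp / empty / reload sequence, run from Python against an in-memory SQLite database (table and column names here are made up for illustration; SQLite has no TRUNCATE, so a bare DELETE stands in for it):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE source_table (a TEXT, b INTEGER)")
cur.executemany("INSERT INTO source_table VALUES (?, ?)",
                [("x", 1), ("x", 1), ("y", 2), ("y", 2), ("z", 3)])

# Equivalent of SELECT DISTINCT * INTO TEMP_TABLE
cur.execute("CREATE TEMP TABLE temp_table AS SELECT DISTINCT * FROM source_table")
# SQLite has no TRUNCATE; DELETE without a WHERE clause empties the table
cur.execute("DELETE FROM source_table")
# Reload the de-duplicated rows and drop the scratch table
cur.execute("INSERT INTO source_table SELECT * FROM temp_table")
cur.execute("DROP TABLE temp_table")

rows = cur.execute("SELECT a, b FROM source_table ORDER BY a").fetchall()
print(rows)  # [('x', 1), ('y', 2), ('z', 3)]
```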

I am planning to set up a job to run during off-peak times that will remove the duplicates (until I can fix the source data feed).

I want to build some logic that will look at the total record count and compare it to the distinct count. If total is > distinct, remove duplicates.

I attempted to build an IF statement, but it doesn't like me using multiple select statements. Here is what I tried:

IF (Select * From SourceTable) > (Select Distinct * From SourceTable)

BEGIN

SELECT DISTINCT * INTO TEMP_TABLE
FROM SOURCE_TABLE

TRUNCATE TABLE SOURCE_TABLE

INSERT INTO SOURCE_TABLE
SELECT * FROM TEMP_TABLE

DROP TABLE TEMP_TABLE



END

ERROR: Msg 116, Level 16, State 1, Line 3
Only one expression can be specified in the select list when the subquery is not introduced with EXISTS.


I've also seen some examples that set the database recovery model to bulk-logged during the duplicate removal operations and set it back to the previous recovery model after the operation is complete. That makes sense to me, as the log file will be getting entries for all the inserting going on. Is this correct, and should I set it to bulk-logged? (A few of the tables have more than 2 million rows.)


If there is a better way to handle this I am open to other ideas. Thanks!
Sean Lange
One Orange Chip (26K reputation)
Group: General Forum Members
Points: 26478 Visits: 17557
Not 100% sure what you are trying to do here. It sounds like you want to delete duplicates from a table? This seems like an awfully complicated way to go about it.

The error you are getting is because you are trying to compare entire result sets, when what you want is the row count.


IF (Select * From SourceTable) > (Select Distinct * From SourceTable)



That doesn't work. It is trying to compare the table to itself, which just doesn't make any sense. You should instead use:


IF (Select count(*) From SourceTable) > (Select count(*) From (Select Distinct * From SourceTable) AS d)



Even though this will fix the error, it seems like you could remove the duplicates a lot more easily than by copying millions of rows to another table and then copying them back. It would be a lot more efficient to just delete the duplicates in place.
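The row-count guard above is easy to try out from Python against SQLite (the table and column names below are made up for illustration). Note that COUNT with DISTINCT over all columns has to go through a derived table, since COUNT(DISTINCT *) is not valid syntax in either dialect:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE source_table (a TEXT, b INTEGER)")
cur.executemany("INSERT INTO source_table VALUES (?, ?)",
                [("x", 1), ("x", 1), ("y", 2)])

total = cur.execute("SELECT COUNT(*) FROM source_table").fetchone()[0]
# Counting a derived DISTINCT table counts each whole-row duplicate once
distinct = cur.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT * FROM source_table)").fetchone()[0]

print(total, distinct)  # 3 2
needs_dedupe = total > distinct
print(needs_dedupe)     # True
```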

_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Moden's splitter.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs
Understanding and Using APPLY (Part 1)
Understanding and Using APPLY (Part 2)
SQL-Squid
SSC-Enthusiastic (114 reputation)
Group: General Forum Members
Points: 114 Visits: 306
Sean Lange (10/10/2012)
Not 100% sure what you are trying to do here. It sounds like you want to delete duplicates from a table? This seems like an awfully complicated way to go about it.

The error you are getting is because you are trying to compare entire result sets, when what you want is the row count.


IF (Select * From SourceTable) > (Select Distinct * From SourceTable)



That doesn't work. It is trying to compare the table to itself, which just doesn't make any sense. You should instead use:


IF (Select count(*) From SourceTable) > (Select count(*) From (Select Distinct * From SourceTable) AS d)



Even though this will fix the error it seems like you could just remove the duplicates a lot easier than copying millions of rows to another table and then copying them back. Seems a lot more efficient to just delete the duplicates.


Thanks Sean,

You are correct, all I want to do is just delete the duplicate rows. Looks like I need to revisit this another way, using row counts or some other option.
Sean Lange
One Orange Chip (26K reputation)
Group: General Forum Members
Points: 26478 Visits: 17557
Here is an example of one way of deleting duplicates. There are plenty of other ways to do this.


create table #Dupes
(
IdentityValue int identity not null,
UserValue varchar(20),
AnotherValue varchar(10)
)

insert #Dupes
select 'asdf', 'Another' union all
select 'asdf', 'Another' union all
select 'asdf', 'Another' union all
select 'qwer', 'Another' union all
select 'qwer', 'Another' union all
select 'unique', 'Another' union all
select 'another', 'Another' union all
select 'another', 'Another' union all
select 'asdf', 'Another' union all
select 'qwer', 'Another'

select * from #Dupes

;with cte as
(
select IdentityValue, UserValue, AnotherValue, ROW_NUMBER() over (partition by UserValue, AnotherValue order by IdentityValue) as RowNum
from #Dupes
)
delete cte where RowNum > 1

select * from #Dupes

drop table #Dupes
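The same ROW_NUMBER() pattern can be exercised from Python against SQLite (3.25 or later for window functions). SQLite cannot DELETE directly through a CTE the way T-SQL can, so this sketch deletes by rowid instead; the sample data is a shortened version of the rows above:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE dupes (user_value TEXT, another_value TEXT)")
cur.executemany("INSERT INTO dupes VALUES (?, ?)", [
    ("asdf", "Another"), ("asdf", "Another"), ("asdf", "Another"),
    ("qwer", "Another"), ("qwer", "Another"),
    ("unique", "Another"),
])

# Number the rows within each duplicate group, then delete everything past row 1
cur.execute("""
    DELETE FROM dupes WHERE rowid IN (
        SELECT rowid FROM (
            SELECT rowid,
                   ROW_NUMBER() OVER (
                       PARTITION BY user_value, another_value
                       ORDER BY rowid) AS row_num
            FROM dupes)
        WHERE row_num > 1)
""")

rows = cur.execute("SELECT user_value FROM dupes ORDER BY user_value").fetchall()
print(rows)  # [('asdf',), ('qwer',), ('unique',)]
```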



SQL-Squid
SSC-Enthusiastic (114 reputation)
Group: General Forum Members
Points: 114 Visits: 306
Sean Lange (10/10/2012)
Here is an example of one way of deleting duplicates. There are plenty of other ways to do this.


create table #Dupes
(
IdentityValue int identity not null,
UserValue varchar(20),
AnotherValue varchar(10)
)

insert #Dupes
select 'asdf', 'Another' union all
select 'asdf', 'Another' union all
select 'asdf', 'Another' union all
select 'qwer', 'Another' union all
select 'qwer', 'Another' union all
select 'unique', 'Another' union all
select 'another', 'Another' union all
select 'another', 'Another' union all
select 'asdf', 'Another' union all
select 'qwer', 'Another'

select * from #Dupes

;with cte as
(
select IdentityValue, UserValue, AnotherValue, ROW_NUMBER() over (partition by UserValue, AnotherValue order by IdentityValue) as RowNum
from #Dupes
)
delete cte where RowNum > 1

select * from #Dupes

drop table #Dupes





GOT IT! Thanks!

Here is what I ended up using:

TABLE:

CREATE TABLE [dbo].[ActivityGroup]
(
[ACTIVITYGROUP_DESC] [nvarchar](255) NULL,
[REQUEST_Id] [int] NULL,
[PullDate] [date] NULL
) ON [PRIMARY]

WITH DUPLICATES AS
(SELECT ROW_NUMBER() OVER (PARTITION BY ACTIVITYGROUP_DESC, REQUEST_Id, PullDate
ORDER BY ACTIVITYGROUP_DESC, REQUEST_Id, PullDate) AS Dups
FROM ActivityGroup)

DELETE DUPLICATES WHERE Dups > 1

This is a great way to get rid of exact duplicates! I'm sure it's less overhead than my previous approach :) Many thanks!
Sean Lange
One Orange Chip (26K reputation)
Group: General Forum Members
Points: 26478 Visits: 17557
You are quite welcome. Thanks for letting me know that worked for you.

SQL-Squid
SSC-Enthusiastic (114 reputation)
Group: General Forum Members
Points: 114 Visits: 306
Two more questions. I am new to using a CTE, so bear with me :)


There are tables with 20 or more columns. Can I use a wildcard like *? If not, I have no problem listing them out, just checking. (I tried replacing the column names with a *, but it didn't work.)


I wanted to check for duplicates across all columns, which I got sorted thanks to your help.


Now I have some duplicates where the activitygroup_desc and request_id are the same but the pulldate is different (due to my bad usage of the update and insert routine I wrote; it's now fixed). The table will have the request_id used multiple times, as it's tied to the activitygroup_desc. Here is an example of the data:

ACTIVITYGROUP_DESC   REQUEST_ID   PullDate
Customer Requests    39           2012-10-05
Customer Requests    39           2012-09-27
Lab Services         39           2012-09-27
Customer Requests    40           2012-10-04
Lab Services         40           2012-10-02
Customer Requests    40           2012-10-02
Customer Requests    41           2012-09-17
Customer Requests    42           2012-10-08



Any recommendations on modifying the CTE to look at column1 and column2, and when there is a duplicate, remove the one with the older pull date in column3? Does that make sense?
sam.dahl
SSC-Addicted (484 reputation)
Group: General Forum Members
Points: 484 Visits: 887
My understanding is that there's no way to use *; instead, generate a list of columns and use that.

Modifying the CTE requires recognising how ROW_NUMBER and PARTITION BY work.

Your code:
WITH DUPLICATES AS
(SELECT ROW_NUMBER() OVER (PARTITION BY ACTIVITYGROUP_DESC, REQUEST_Id, PullDate
ORDER BY ACTIVITYGROUP_DESC, REQUEST_Id, PullDate) AS Dups
FROM ActivityGroup)

The ROW_NUMBER() function provides an incrementing number starting at 1 for each row. PARTITION BY operates on the table as sort of an on-the-fly grouping operation such that the ROW_NUMBER() is applied to each group (i.e. restarts at 1 for each partitioned section). In your context the fields you partition on are the fields being compared to determine whether or not the record constitutes a duplicate. The ORDER BY then sorts the grouped records in some particular order. If the fields that you order on are the same as the ones you partition by then the resulting intra-group order of 'duplicates' will be effectively arbitrary.

Using your example of comparing ACTIVITYGROUP_DESC and REQUEST_ID and removing the older PullDate, something like this would work:

WITH DUPLICATES AS
(SELECT ROW_NUMBER() OVER (PARTITION BY ACTIVITYGROUP_DESC, REQUEST_ID --effectively group by these fields
ORDER BY PullDate DESC -- Sort by PullDate in descending order since you want the newest PullDate to be assigned value 1
) AS Dups
FROM ActivityGroup)



Any rows with a value greater than one are older duplicates and can be deleted.

Does that help? (sometimes the explanations are clearer in my head before the attempt to articulate them).
Minor disclaimer: I haven't tested the code yet!
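A runnable sketch of this keep-the-newest variant, again from Python against SQLite (3.25+; deleting by rowid, since SQLite has no DELETE-through-CTE), using the sample rows posted earlier in the thread:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE activity_group "
            "(activitygroup_desc TEXT, request_id INTEGER, pull_date TEXT)")
cur.executemany("INSERT INTO activity_group VALUES (?, ?, ?)", [
    ("Customer Requests", 39, "2012-10-05"),
    ("Customer Requests", 39, "2012-09-27"),
    ("Lab Services", 39, "2012-09-27"),
    ("Customer Requests", 40, "2012-10-04"),
    ("Lab Services", 40, "2012-10-02"),
    ("Customer Requests", 40, "2012-10-02"),
])

# Partition on the two key columns, order newest-first, keep only row 1 per group
cur.execute("""
    DELETE FROM activity_group WHERE rowid IN (
        SELECT rowid FROM (
            SELECT rowid,
                   ROW_NUMBER() OVER (
                       PARTITION BY activitygroup_desc, request_id
                       ORDER BY pull_date DESC) AS dups
            FROM activity_group)
        WHERE dups > 1)
""")

rows = cur.execute(
    "SELECT activitygroup_desc, request_id, pull_date FROM activity_group "
    "ORDER BY request_id, activitygroup_desc").fetchall()
print(rows)
# [('Customer Requests', 39, '2012-10-05'), ('Lab Services', 39, '2012-09-27'),
#  ('Customer Requests', 40, '2012-10-04'), ('Lab Services', 40, '2012-10-02')]
```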
SQL-Squid
SSC-Enthusiastic (114 reputation)
Group: General Forum Members
Points: 114 Visits: 306
sam.dahl (10/11/2012)
My understanding is that there's no way to use *; instead, generate a list of columns and use that.

Modifying the CTE requires recognising how ROW_NUMBER and PARTITION BY work.

Your code:
WITH DUPLICATES AS
(SELECT ROW_NUMBER() OVER (PARTITION BY ACTIVITYGROUP_DESC, REQUEST_Id, PullDate
ORDER BY ACTIVITYGROUP_DESC, REQUEST_Id, PullDate) AS Dups
FROM ActivityGroup)

The ROW_NUMBER() function provides an incrementing number starting at 1 for each row. PARTITION BY operates on the table as sort of an on-the-fly grouping operation such that the ROW_NUMBER() is applied to each group (i.e. restarts at 1 for each partitioned section). In your context the fields you partition on are the fields being compared to determine whether or not the record constitutes a duplicate. The ORDER BY then sorts the grouped records in some particular order. If the fields that you order on are the same as the ones you partition by then the resulting intra-group order of 'duplicates' will be effectively arbitrary.

Using your example of comparing ACTIVITYGROUP_DESC and REQUEST_ID and removing the older PullDate, something like this would work:

WITH DUPLICATES AS
(SELECT ROW_NUMBER() OVER (PARTITION BY ACTIVITYGROUP_DESC, REQUEST_ID --effectively group by these fields
ORDER BY PullDate DESC -- Sort by PullDate in descending order since you want the newest PullDate to be assigned value 1
) AS Dups
FROM ActivityGroup)



Any rows with a value greater than one are older duplicates and can be deleted.

Does that help? (sometimes the explanations are clearer in my head before the attempt to articulate them).
Minor disclaimer: I haven't tested the code yet!


Sam,

This works perfectly (had to fix a few syntax errors and typos, but that was no biggie).

The explanation and the comments in the code really brought it home for me. I was reading about all this yesterday after testing the example posted by Sean. Now it all makes sense (light bulb above my head now).

Gotta give thanks to Sean and Sam! This forum rocks! One day I will be able to help others and keep the cycle going :)
SQL-Squid
SSC-Enthusiastic (114 reputation)
Group: General Forum Members
Points: 114 Visits: 306
A side note for any newbies reading this:

I was manually running:

SELECT * FROM ActivityGroup --Gave me the total row count
SELECT DISTINCT * FROM ActivityGroup --Thought this was a true count of the distinct values; in many cases it would be a good count. I had previously made a mistake and the 3rd column (PullDate) was incorrectly assigned and duplicated on many rows.

Using the CTE examples provided above I was able to remove the true duplicates.

So to check that the CTE was doing what it was supposed to, I needed to be more specific in my test query: I needed the distinct count from columns 1 and 2, excluding column 3.

Here is the simple query I used:

SELECT DISTINCT ACTIVITYGROUP_DESC, REQUEST_ID
FROM ACTIVITYGROUP

Now I am only looking at the first two columns, which gives me a better count of the actual distinct values.
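The difference between the two counts is easy to see in a quick Python/SQLite session (made-up sample rows; DISTINCT over a subset of columns collapses rows that differ only in the excluded column):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE activity_group "
            "(activitygroup_desc TEXT, request_id INTEGER, pull_date TEXT)")
cur.executemany("INSERT INTO activity_group VALUES (?, ?, ?)", [
    ("Customer Requests", 39, "2012-10-05"),
    ("Customer Requests", 39, "2012-09-27"),  # same pair, different date
    ("Lab Services", 39, "2012-09-27"),
])

# DISTINCT over all three columns still sees three different rows
all_cols = cur.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT * FROM activity_group)").fetchone()[0]
# DISTINCT over the two key columns collapses the date-only difference
two_cols = cur.execute(
    "SELECT COUNT(*) FROM (SELECT DISTINCT activitygroup_desc, request_id "
    "FROM activity_group)").fetchone()[0]

print(all_cols, two_cols)  # 3 2
```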

Hope this makes sense to anyone that reads this.

Thanks to the following post for help with DISTINCT.
http://www.sqlservercentral.com/articles/DISTINCT/71551/