Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase

Antother Delete Duplicates Thread Expand / Collapse
Author
Message
Posted Wednesday, October 10, 2012 12:37 PM
SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Yesterday @ 11:33 AM
Points: 42, Visits: 189
I know this topic has been covered many times, but I cannot find the specific solution I am looking for.

We load data from an outside source and there are sometimes duplicates being loaded, sometimes there are not. (This is another story)

I have a basic query that will handle my duplicate delete issue without any logic added:

SELECT DISTINCT * INTO TEMP_TABLE
FROM SOURCE_TABLE
GO
TUNCATE TABLE SOURCE_TABLE
GO
INSERT INTO SOURCE_TABLE
SELECT * FROM TEMP_TABLE
GO
DROP TABLE TEMP_TABLE
GO

I am plannig to setup a job to run on off peak times that will remove the duplicates (until I can fix the source data feed)

I want to build some logic that will look at the total records and compare to the distinct. If total is > than distinct, remove duplicates.

I attempted to build a IF statement but it doesnt like me using multiple select statements. Here is what I tried:

IF (Select * From SoureTable) > (Select Distinct * From SourceTable)

BEGIN

SELECT DISTINCT * INTO TEMP_TABLE
FROM SOURCE_TABLE

TUNCATE TABLE SOURCE_TABLE

INSERT INTO SOURCE_TABLE
SELECT * FROM TEMP_TABLE

DROP TABLE TEMP_TABLE



END

ERROR: Msg 116, Level 16, State 1, Line 3
Only one expression can be specified in the select list when the subquery is not introduced with EXISTS.


I've also seem some examples that set the database recovery model to bulk logged during the duplicate removal operations and setting back to its previous recovery model after the operation is complete. It makes sense to me as the log file be getting entries for all the inserting going on. Is this correct and should I set to bulk logged? (A few of the tables have more than 2 million rows)


If there is a better way to handle this I am open to other ideas. Thanks!



Post #1371081
Posted Wednesday, October 10, 2012 1:02 PM


SSChampion

SSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampion

Group: General Forum Members
Last Login: Yesterday @ 3:12 PM
Points: 13,441, Visits: 12,303
Not 100% sure what you are trying to do here. It sounds like you want to delete duplicates from a table? This seems like an awfully complicated way to go about it.

The error you are getting is because you trying to compare entire result sets and what you want is the row count.

IF (Select * From SoureTable) > (Select Distinct * From SourceTable)

That doesn't work. That is trying to compare the table to itself which just doesn't make any sense. You should instead use:

IF (Select count(*) From SoureTable) > (Select count(distinct *) From SourceTable)

Even though this will fix the error it seems like you could just remove the duplicates a lot easier than copying millions of rows to another table and then copying them back. Seems a lot more efficient to just delete the duplicates.


_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Moden's splitter.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs
Understanding and Using APPLY (Part 1)
Understanding and Using APPLY (Part 2)
Post #1371086
Posted Wednesday, October 10, 2012 1:18 PM
SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Yesterday @ 11:33 AM
Points: 42, Visits: 189
Sean Lange (10/10/2012)
Not 100% sure what you are trying to do here. It sounds like you want to delete duplicates from a table? This seems like an awfully complicated way to go about it.

The error you are getting is because you trying to compare entire result sets and what you want is the row count.

IF (Select * From SoureTable) > (Select Distinct * From SourceTable)

That doesn't work. That is trying to compare the table to itself which just doesn't make any sense. You should instead use:

IF (Select count(*) From SoureTable) > (Select count(distinct *) From SourceTable)

Even though this will fix the error it seems like you could just remove the duplicates a lot easier than copying millions of rows to another table and then copying them back. Seems a lot more efficient to just delete the duplicates.


Thanks Sean,

You are correct all i want to do is just delete the duplicate rows. Looks like I need to revisit this another way and look for better fixes using row counts or some other option.
Post #1371089
Posted Wednesday, October 10, 2012 1:45 PM


SSChampion

SSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampion

Group: General Forum Members
Last Login: Yesterday @ 3:12 PM
Points: 13,441, Visits: 12,303
Here is an example of one way of deleting duplicates. There are plenty of other ways to do this.

create table #Dupes
(
IdentityValue int identity not null,
UserValue varchar(20),
AnotherValue varchar(10)
)

insert #Dupes
select 'asdf', 'Another' union all
select 'asdf', 'Another' union all
select 'asdf', 'Another' union all
select 'qwer', 'Another' union all
select 'qwer', 'Another' union all
select 'unique', 'Another' union all
select 'another', 'Another' union all
select 'another', 'Another' union all
select 'asdf', 'Another' union all
select 'qwer', 'Another'

select * from #Dupes

;with cte as
(
select IdentityValue, UserValue, AnotherValue, ROW_NUMBER() over (partition by UserValue, AnotherValue order by IdentityValue) as RowNum
from #Dupes
)
delete cte where RowNum > 1

select * from #Dupes

drop table #Dupes



_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Moden's splitter.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs
Understanding and Using APPLY (Part 1)
Understanding and Using APPLY (Part 2)
Post #1371100
Posted Wednesday, October 10, 2012 3:14 PM
SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Yesterday @ 11:33 AM
Points: 42, Visits: 189
Sean Lange (10/10/2012)
Here is an example of one way of deleting duplicates. There are plenty of other ways to do this.

create table #Dupes
(
IdentityValue int identity not null,
UserValue varchar(20),
AnotherValue varchar(10)
)

insert #Dupes
select 'asdf', 'Another' union all
select 'asdf', 'Another' union all
select 'asdf', 'Another' union all
select 'qwer', 'Another' union all
select 'qwer', 'Another' union all
select 'unique', 'Another' union all
select 'another', 'Another' union all
select 'another', 'Another' union all
select 'asdf', 'Another' union all
select 'qwer', 'Another'

select * from #Dupes

;with cte as
(
select IdentityValue, UserValue, AnotherValue, ROW_NUMBER() over (partition by UserValue, AnotherValue order by IdentityValue) as RowNum
from #Dupes
)
delete cte where RowNum > 1

select * from #Dupes

drop table #Dupes



GOT IT! Thanks!

Here is what I ended up using:

TABLE:

[dbo].[ActivityGroup]
(
[ACTIVITYGROUP_DESC] [nvarchar](255) NULL,
[REQUEST_Id] [int] NULL,
[PullDate] [date] NULL
) ON [PRIMARY]

WITH DUPLICATES AS
(SELECT ROW_RUMBER() OVER (PARTITION BY ACTIVITYGROUP_DESC, REQUEST_Id, PullDate
ORDER BY ACTIVITYGROUP_DESC, REQUEST_Id, PullDate) AS Dups
FROM ActivityGroup

Delete DUPLICATES where Dups > 1

This is a great way to get rid of exact duplicates! Im sure its less over head than my previous approach :) Many Thanks!

Post #1371136
Posted Wednesday, October 10, 2012 3:16 PM


SSChampion

SSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampionSSChampion

Group: General Forum Members
Last Login: Yesterday @ 3:12 PM
Points: 13,441, Visits: 12,303
You are quite welcome. Thanks for letting me know that worked for you.

_______________________________________________________________

Need help? Help us help you.

Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

Need to split a string? Try Jeff Moden's splitter.

Cross Tabs and Pivots, Part 1 – Converting Rows to Columns
Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs
Understanding and Using APPLY (Part 1)
Understanding and Using APPLY (Part 2)
Post #1371137
Posted Wednesday, October 10, 2012 4:30 PM
SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Yesterday @ 11:33 AM
Points: 42, Visits: 189
Two more questions. I am new to using a CTE, so bear with me :)


There are tables with 20 or more columns. Can I use a wild card like *, if not I have no problem listing them out, just checking. (I tried replacing the column names with a *, but it didnt work)


I wanted to check for duplicates across all columns, which I got sorted thanks to your help.


Now I have some duplicates where the activitygroup_desc and request_id are the same but the pulldate is different. (due to my bad usage of the update and insert routine i wrote - its now fixed) The table will have the request_id used multiple times as its tied to the activitygroup_desc. Here is an example of the data:

ACTIVITYGROUP_DESC	REQUEST_ID     PullDate
Customer Requests 39 2012-10-05
Customer Requests 39 2012-09-27
Lab Services 39 2012-09-27
Customer Requests 40 2012-10-04
Lab Services 40 2012-10-02
Customer Requests 40 2012-10-02
Customer Requests 41 2012-09-17
Customer Requests 42 2012-10-08

Any recomendations to modify the CTE to look at column1, column2, and when there is a duplicate remove the one with the older pull date in column3? Does that make sense?




Post #1371158
Posted Thursday, October 11, 2012 1:22 AM
Old Hand

Old HandOld HandOld HandOld HandOld HandOld HandOld HandOld Hand

Group: General Forum Members
Last Login: Tuesday, August 12, 2014 8:23 PM
Points: 367, Visits: 822
My understanding is that there's no way to use *, instead generate a list of columns and use that.

Modifying the CTE requires recognising how the ROW_NUMBER/PARTITION BY works.

Your code:
WITH DUPLICATES AS
(SELECT ROW_RUMBER() OVER (PARTITION BY ACTIVITYGROUP_DESC, REQUEST_Id, PullDate
ORDER BY ACTIVITYGROUP_DESC, REQUEST_Id, PullDate) AS Dups
FROM ActivityGroup

The ROW_NUMBER() function provides an incrementing number starting at 1 for each row. PARTITION BY operates on the table as sort of an on-the-fly grouping operation such that the ROW_NUMBER() is applied to each group (i.e. restarts at 1 for each partitioned section). In your context the fields you partition on are the fields being compared to determine whether or not the record constitutes a duplicate. The ORDER BY then sorts the grouped records in some particular order. If the fields that you order on are the same as the ones you partition by then the resulting intra-group order of 'duplicates' will be effectively arbitrary.

Using your example of comparing ACTIVITYGROUP_DESC and REQUEST_ID and removing the older PullDate something like this would work:

WITH DUPLICATES AS
(SELECT ROW_RUMBER() OVER (PARTITION BY ACTIVITYGROUP_DESC, REQUEST_ID --effectively group by these fields
ORDER BY PullDate DESC -- Sort by PullDate in descending order since you want the newest PullDate to be assigned value 1
) AS Dups
FROM ActivityGroup

Any rows with a value greater than one are older duplicates and can be deleted.

Does that help? (sometimes the explanations are clearer in my head before the attempt to articulate them).
Minor disclaimer: I haven't tested the code yet!
Post #1371242
Posted Thursday, October 11, 2012 10:15 AM
SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Yesterday @ 11:33 AM
Points: 42, Visits: 189
sam.dahl (10/11/2012)
My understanding is that there's no way to use *, instead generate a list of columns and use that.

Modifying the CTE requires recognising how the ROW_NUMBER/PARTITION BY works.

Your code:
WITH DUPLICATES AS
(SELECT ROW_RUMBER() OVER (PARTITION BY ACTIVITYGROUP_DESC, REQUEST_Id, PullDate
ORDER BY ACTIVITYGROUP_DESC, REQUEST_Id, PullDate) AS Dups
FROM ActivityGroup

The ROW_NUMBER() function provides an incrementing number starting at 1 for each row. PARTITION BY operates on the table as sort of an on-the-fly grouping operation such that the ROW_NUMBER() is applied to each group (i.e. restarts at 1 for each partitioned section). In your context the fields you partition on are the fields being compared to determine whether or not the record constitutes a duplicate. The ORDER BY then sorts the grouped records in some particular order. If the fields that you order on are the same as the ones you partition by then the resulting intra-group order of 'duplicates' will be effectively arbitrary.

Using your example of comparing ACTIVITYGROUP_DESC and REQUEST_ID and removing the older PullDate something like this would work:

WITH DUPLICATES AS
(SELECT ROW_RUMBER() OVER (PARTITION BY ACTIVITYGROUP_DESC, REQUEST_ID --effectively group by these fields
ORDER BY PullDate DESC -- Sort by PullDate in descending order since you want the newest PullDate to be assigned value 1
) AS Dups
FROM ActivityGroup

Any rows with a value greater than one are older duplicates and can be deleted.

Does that help? (sometimes the explanations are clearer in my head before the attempt to articulate them).
Minor disclaimer: I haven't tested the code yet!


Sam,

This works perfectly (had to fix a few syntax errors and typos, but that was no biggie)

The explanation and the comments in code really brought it home for me. I was reading about all this yesterday after testing the example posted by Sean. Now it all makes sense (light bulb above my head now)

Gotta give thanks to Sean and Sam! This forum rocks! One day I will be able to help others and keep the cycle going :)
Post #1371654
Posted Thursday, October 11, 2012 12:45 PM
SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Yesterday @ 11:33 AM
Points: 42, Visits: 189
A side note for any newbies reading this:

I was manually performing:

SELECT * FROM ACTIVITY GROUP --Gave me total count of rows
SELECT DISTINCT * FROM ACTIVITY GROUP --Thought this was a true Distinct count of the distinct values, in many cases this would be a good count. I had previously made a mistake and the 3rd column (Pulldate) was incorrectly assigned and duplicated to many rows.

Using the CTE examples provided above I was able to remove the true duplicates.

So to check the CTE was doing as it was suposed to, I needed to be more specific in my test query, I needed the distinct count from columns 1 and 2, excluding column 3.

Here is the simple query I used:

SELECT DISTINCT ACTIVITYGROUP_DESC, REQUEST_ID
FROM ACTIVITYGROUP

^Now I am only looking at the first two columns, this gave me a better count of actual distinct values.

Hope this makes sense to anyone that reads this.

Thanks to the following post for help with DISTINCT.
http://www.sqlservercentral.com/articles/DISTINCT/71551/
Post #1371735
« Prev Topic | Next Topic »

Add to briefcase

Permissions Expand / Collapse