Antother Delete Duplicates Thread

  • I know this topic has been covered many times, but I cannot find the specific solution I am looking for.

    We load data from an outside source and there are sometimes duplicates being loaded, sometimes there are not. (This is another story)

    I have a basic query that will handle my duplicate delete issue without any logic added:

    SELECT DISTINCT * INTO TEMP_TABLE

    FROM SOURCE_TABLE

    GO

    TUNCATE TABLE SOURCE_TABLE

    GO

    INSERT INTO SOURCE_TABLE

    SELECT * FROM TEMP_TABLE

    GO

    DROP TABLE TEMP_TABLE

    GO

    I am plannig to setup a job to run on off peak times that will remove the duplicates (until I can fix the source data feed)

    I want to build some logic that will look at the total records and compare to the distinct. If total is > than distinct, remove duplicates.

    I attempted to build a IF statement but it doesnt like me using multiple select statements. Here is what I tried:

    IF (Select * From SoureTable) > (Select Distinct * From SourceTable)

    BEGIN

    SELECT DISTINCT * INTO TEMP_TABLE

    FROM SOURCE_TABLE

    TUNCATE TABLE SOURCE_TABLE

    INSERT INTO SOURCE_TABLE

    SELECT * FROM TEMP_TABLE

    DROP TABLE TEMP_TABLE

    END

    ERROR: Msg 116, Level 16, State 1, Line 3

    Only one expression can be specified in the select list when the subquery is not introduced with EXISTS.

    I've also seem some examples that set the database recovery model to bulk logged during the duplicate removal operations and setting back to its previous recovery model after the operation is complete. It makes sense to me as the log file be getting entries for all the inserting going on. Is this correct and should I set to bulk logged? (A few of the tables have more than 2 million rows)

    If there is a better way to handle this I am open to other ideas. Thanks!

  • Not 100% sure what you are trying to do here. It sounds like you want to delete duplicates from a table? This seems like an awfully complicated way to go about it.

    The error you are getting is because you trying to compare entire result sets and what you want is the row count.

    IF (Select * From SoureTable) > (Select Distinct * From SourceTable)

    That doesn't work. That is trying to compare the table to itself which just doesn't make any sense. You should instead use:

    IF (Select count(*) From SoureTable) > (Select count(distinct *) From SourceTable)

    Even though this will fix the error it seems like you could just remove the duplicates a lot easier than copying millions of rows to another table and then copying them back. Seems a lot more efficient to just delete the duplicates.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Sean Lange (10/10/2012)


    Not 100% sure what you are trying to do here. It sounds like you want to delete duplicates from a table? This seems like an awfully complicated way to go about it.

    The error you are getting is because you trying to compare entire result sets and what you want is the row count.

    IF (Select * From SoureTable) > (Select Distinct * From SourceTable)

    That doesn't work. That is trying to compare the table to itself which just doesn't make any sense. You should instead use:

    IF (Select count(*) From SoureTable) > (Select count(distinct *) From SourceTable)

    Even though this will fix the error it seems like you could just remove the duplicates a lot easier than copying millions of rows to another table and then copying them back. Seems a lot more efficient to just delete the duplicates.

    Thanks Sean,

    You are correct all i want to do is just delete the duplicate rows. Looks like I need to revisit this another way and look for better fixes using row counts or some other option.

  • Here is an example of one way of deleting duplicates. There are plenty of other ways to do this.

    create table #Dupes

    (

    IdentityValue int identity not null,

    UserValue varchar(20),

    AnotherValue varchar(10)

    )

    insert #Dupes

    select 'asdf', 'Another' union all

    select 'asdf', 'Another' union all

    select 'asdf', 'Another' union all

    select 'qwer', 'Another' union all

    select 'qwer', 'Another' union all

    select 'unique', 'Another' union all

    select 'another', 'Another' union all

    select 'another', 'Another' union all

    select 'asdf', 'Another' union all

    select 'qwer', 'Another'

    select * from #Dupes

    ;with cte as

    (

    select IdentityValue, UserValue, AnotherValue, ROW_NUMBER() over (partition by UserValue, AnotherValue order by IdentityValue) as RowNum

    from #Dupes

    )

    delete cte where RowNum > 1

    select * from #Dupes

    drop table #Dupes

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Sean Lange (10/10/2012)


    Here is an example of one way of deleting duplicates. There are plenty of other ways to do this.

    create table #Dupes

    (

    IdentityValue int identity not null,

    UserValue varchar(20),

    AnotherValue varchar(10)

    )

    insert #Dupes

    select 'asdf', 'Another' union all

    select 'asdf', 'Another' union all

    select 'asdf', 'Another' union all

    select 'qwer', 'Another' union all

    select 'qwer', 'Another' union all

    select 'unique', 'Another' union all

    select 'another', 'Another' union all

    select 'another', 'Another' union all

    select 'asdf', 'Another' union all

    select 'qwer', 'Another'

    select * from #Dupes

    ;with cte as

    (

    select IdentityValue, UserValue, AnotherValue, ROW_NUMBER() over (partition by UserValue, AnotherValue order by IdentityValue) as RowNum

    from #Dupes

    )

    delete cte where RowNum > 1

    select * from #Dupes

    drop table #Dupes

    GOT IT! Thanks!

    Here is what I ended up using:

    TABLE:

    [dbo].[ActivityGroup]

    (

    [ACTIVITYGROUP_DESC] [nvarchar](255) NULL,

    [REQUEST_Id] [int] NULL,

    [PullDate] [date] NULL

    ) ON [PRIMARY]

    WITH DUPLICATES AS

    (SELECT ROW_RUMBER() OVER (PARTITION BY ACTIVITYGROUP_DESC, REQUEST_Id, PullDate

    ORDER BY ACTIVITYGROUP_DESC, REQUEST_Id, PullDate) AS Dups

    FROM ActivityGroup

    Delete DUPLICATES where Dups > 1

    This is a great way to get rid of exact duplicates! Im sure its less over head than my previous approach 🙂 Many Thanks!

  • You are quite welcome. Thanks for letting me know that worked for you.

    _______________________________________________________________

    Need help? Help us help you.

    Read the article at http://www.sqlservercentral.com/articles/Best+Practices/61537/ for best practices on asking questions.

    Need to split a string? Try Jeff Modens splitter http://www.sqlservercentral.com/articles/Tally+Table/72993/.

    Cross Tabs and Pivots, Part 1 – Converting Rows to Columns - http://www.sqlservercentral.com/articles/T-SQL/63681/
    Cross Tabs and Pivots, Part 2 - Dynamic Cross Tabs - http://www.sqlservercentral.com/articles/Crosstab/65048/
    Understanding and Using APPLY (Part 1) - http://www.sqlservercentral.com/articles/APPLY/69953/
    Understanding and Using APPLY (Part 2) - http://www.sqlservercentral.com/articles/APPLY/69954/

  • Two more questions. I am new to using a CTE, so bear with me 🙂

    There are tables with 20 or more columns. Can I use a wild card like *, if not I have no problem listing them out, just checking. (I tried replacing the column names with a *, but it didnt work)

    I wanted to check for duplicates across all columns, which I got sorted thanks to your help.

    Now I have some duplicates where the activitygroup_desc and request_id are the same but the pulldate is different. (due to my bad usage of the update and insert routine i wrote - its now fixed) The table will have the request_id used multiple times as its tied to the activitygroup_desc. Here is an example of the data:

    ACTIVITYGROUP_DESCREQUEST_ID PullDate

    Customer Requests39 2012-10-05

    Customer Requests39 2012-09-27

    Lab Services 39 2012-09-27

    Customer Requests40 2012-10-04

    Lab Services 40 2012-10-02

    Customer Requests40 2012-10-02

    Customer Requests41 2012-09-17

    Customer Requests42 2012-10-08

    Any recomendations to modify the CTE to look at column1, column2, and when there is a duplicate remove the one with the older pull date in column3? Does that make sense?

  • My understanding is that there's no way to use *, instead generate a list of columns and use that.

    Modifying the CTE requires recognising how the ROW_NUMBER/PARTITION BY works.

    Your code:

    WITH DUPLICATES AS

    (SELECT ROW_RUMBER() OVER (PARTITION BY ACTIVITYGROUP_DESC, REQUEST_Id, PullDate

    ORDER BY ACTIVITYGROUP_DESC, REQUEST_Id, PullDate) AS Dups

    FROM ActivityGroup

    The ROW_NUMBER() function provides an incrementing number starting at 1 for each row. PARTITION BY operates on the table as sort of an on-the-fly grouping operation such that the ROW_NUMBER() is applied to each group (i.e. restarts at 1 for each partitioned section). In your context the fields you partition on are the fields being compared to determine whether or not the record constitutes a duplicate. The ORDER BY then sorts the grouped records in some particular order. If the fields that you order on are the same as the ones you partition by then the resulting intra-group order of 'duplicates' will be effectively arbitrary.

    Using your example of comparing ACTIVITYGROUP_DESC and REQUEST_ID and removing the older PullDate something like this would work:

    WITH DUPLICATES AS

    (SELECT ROW_RUMBER() OVER (PARTITION BY ACTIVITYGROUP_DESC, REQUEST_ID --effectively group by these fields

    ORDER BY PullDate DESC -- Sort by PullDate in descending order since you want the newest PullDate to be assigned value 1

    ) AS Dups

    FROM ActivityGroup

    Any rows with a value greater than one are older duplicates and can be deleted.

    Does that help? (sometimes the explanations are clearer in my head before the attempt to articulate them).

    Minor disclaimer: I haven't tested the code yet!

  • sam.dahl (10/11/2012)


    My understanding is that there's no way to use *, instead generate a list of columns and use that.

    Modifying the CTE requires recognising how the ROW_NUMBER/PARTITION BY works.

    Your code:

    WITH DUPLICATES AS

    (SELECT ROW_RUMBER() OVER (PARTITION BY ACTIVITYGROUP_DESC, REQUEST_Id, PullDate

    ORDER BY ACTIVITYGROUP_DESC, REQUEST_Id, PullDate) AS Dups

    FROM ActivityGroup

    The ROW_NUMBER() function provides an incrementing number starting at 1 for each row. PARTITION BY operates on the table as sort of an on-the-fly grouping operation such that the ROW_NUMBER() is applied to each group (i.e. restarts at 1 for each partitioned section). In your context the fields you partition on are the fields being compared to determine whether or not the record constitutes a duplicate. The ORDER BY then sorts the grouped records in some particular order. If the fields that you order on are the same as the ones you partition by then the resulting intra-group order of 'duplicates' will be effectively arbitrary.

    Using your example of comparing ACTIVITYGROUP_DESC and REQUEST_ID and removing the older PullDate something like this would work:

    WITH DUPLICATES AS

    (SELECT ROW_RUMBER() OVER (PARTITION BY ACTIVITYGROUP_DESC, REQUEST_ID --effectively group by these fields

    ORDER BY PullDate DESC -- Sort by PullDate in descending order since you want the newest PullDate to be assigned value 1

    ) AS Dups

    FROM ActivityGroup

    Any rows with a value greater than one are older duplicates and can be deleted.

    Does that help? (sometimes the explanations are clearer in my head before the attempt to articulate them).

    Minor disclaimer: I haven't tested the code yet!

    Sam,

    This works perfectly (had to fix a few syntax errors and typos, but that was no biggie)

    The explanation and the comments in code really brought it home for me. I was reading about all this yesterday after testing the example posted by Sean. Now it all makes sense (light bulb above my head now)

    Gotta give thanks to Sean and Sam! This forum rocks! One day I will be able to help others and keep the cycle going 🙂

  • A side note for any newbies reading this:

    I was manually performing:

    SELECT * FROM ACTIVITY GROUP --Gave me total count of rows

    SELECT DISTINCT * FROM ACTIVITY GROUP --Thought this was a true Distinct count of the distinct values, in many cases this would be a good count. I had previously made a mistake and the 3rd column (Pulldate) was incorrectly assigned and duplicated to many rows.

    Using the CTE examples provided above I was able to remove the true duplicates.

    So to check the CTE was doing as it was suposed to, I needed to be more specific in my test query, I needed the distinct count from columns 1 and 2, excluding column 3.

    Here is the simple query I used:

    SELECT DISTINCT ACTIVITYGROUP_DESC, REQUEST_ID

    FROM ACTIVITYGROUP

    ^Now I am only looking at the first two columns, this gave me a better count of actual distinct values.

    Hope this makes sense to anyone that reads this.

    Thanks to the following post for help with DISTINCT.

    http://www.sqlservercentral.com/articles/DISTINCT/71551/

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply