query to delete 'almost' duplicate rows

  • I have a table that has 5 columns. Two of the columns, 'ObjectId' and 'DataKey' uniquely identify information, but are not used in a key or index of any kind. A third column is an identity column.

    I want to be able to delete rows that have duplicate 'ObjectId' and 'DataKey' values, being sure to leave the row with the most recent (greatest) identity value. I can use this:

    select distinct

    ObjectId,

    DataKey

    from MyTable

    to get only the non-duplicates, but how can I be sure to delete the duplicate rows leaving the one with the greatest identity value?

    Thanks

  • delete

    from MyTable

    where exists (select * from MyTable t

    where t.ObjectId=MyTable.ObjectId

    and t.DataKey=MyTable.DataKey

    and t.IdentCol>MyTable.IdentCol)

    ____________________________________________________

    Deja View - The strange feeling that somewhere, sometime you've optimised this query before

    How to get the best help on a forum

    http://www.sqlservercentral.com/articles/Best+Practices/61537
  • Thanks Mark,

    I'm a newbie with SQL, so forgive the ignorance. I don't really understand why that query works. I ran just the 'select' portion to see if it returned the records that would be deleted - but it returned nothing. Can you explain what's happening in that query?

    Thanks again.

  • If you run this

    select *

    from MyTable

    where exists (select * from MyTable t

    where t.ObjectId=MyTable.ObjectId

    and t.DataKey=MyTable.DataKey

    and t.IdentCol>MyTable.IdentCol)

    it should show you the rows that are to be deleted. The query simply returns rows where there is at least one other (different) row of the same ObjectId and DataKey, but with a higher identity column value. If you're not getting anything, try posting some sample data.

    ____________________________________________________

    Deja View - The strange feeling that somewhere, sometime you've optimised this query before

    How to get the best help on a forum

    http://www.sqlservercentral.com/articles/Best+Practices/61537
  • OK, I think I've got it.

    In my test to show the records that would be deleted, I ran only the select subquery:

    select * from MyTable t

    where t.ObjectId=MyTable.ObjectId

    and t.DataKey=MyTable.DataKey

    and t.IdentCol>MyTable.IdentCol

    I didn't use the query:

    select *

    from MyTable

    where exists (select * from MyTable t

    where t.ObjectId=MyTable.ObjectId

    and t.DataKey=MyTable.DataKey

    and t.IdentCol>MyTable.IdentCol)

    Thanks again.

  • Little hint for future posts. Use "[code]" instead of "<code>" to post source code. Hml-tags will be miss-interpreted. 😉

  • DELETE MyTable FROM (SELECT IdentityColumn,DENSE_RANK() OVER(PARTITION BY ObjectId,DataKey ORDER BY NEWID()) 'Rank' FROM MyTable) MyTable1

    WHERE MyTable.IdentityColumn =MyTable1.IdentityColumn

    AND MyTable1.Rank > 1

  • Thanks Pandian,

    I tried that query...it appears that it doesn't leave the record with the greatest identity value and remove the rest. I think the lowest identity value was the one kept and others were deleted.

  • I changed the query to

    DELETE MyTable FROM

    (SELECT IdentityColumn,DENSE_RANK() OVER(PARTITION BY ObjectId,DataKey ORDER BY IdentityColumn DESC) 'Rank'

    FROM MyTable) MyTable1

    WHERE MyTable.IdentityColumn = MyTable1.IdentityColumn

    AND MyTable1.Rank > 1

    and that appears to work.

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply