SQL Clone
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


Remove Duplicate Records


Remove Duplicate Records

Author
Message
Syed-201559
Syed-201559
SSC Veteran
SSC Veteran (207 reputation)SSC Veteran (207 reputation)SSC Veteran (207 reputation)SSC Veteran (207 reputation)SSC Veteran (207 reputation)SSC Veteran (207 reputation)SSC Veteran (207 reputation)SSC Veteran (207 reputation)

Group: General Forum Members
Points: 207 Visits: 94
Comments posted to this topic are about the item Remove Duplicate Records


Kindest Regards,

Syed
Sr. SQL Server DBA
colin Robinson-345240
colin Robinson-345240
Valued Member
Valued Member (63 reputation)Valued Member (63 reputation)Valued Member (63 reputation)Valued Member (63 reputation)Valued Member (63 reputation)Valued Member (63 reputation)Valued Member (63 reputation)Valued Member (63 reputation)

Group: General Forum Members
Points: 63 Visits: 148
clever but on large tables probably inparactical due to lock escalation and the time taken to do disk based delete and reindex operations.

alternatives could be a Unique indexed view based around select Distinct * from PhoneBook.
or a select distinct into Phonebook2 combined with rename Table operations.

You could also stop the duplicates getting in there in the first place with an insert trigger. You can use Binary_checksum(*) function on the inserted table to check aginst the Binary_Checksum(*) of existing Rows
Craig Sunderland
Craig Sunderland
Grasshopper
Grasshopper (23 reputation)Grasshopper (23 reputation)Grasshopper (23 reputation)Grasshopper (23 reputation)Grasshopper (23 reputation)Grasshopper (23 reputation)Grasshopper (23 reputation)Grasshopper (23 reputation)

Group: General Forum Members
Points: 23 Visits: 61
Surely adding a constraint when designing the table is the best option.

Stop duplicates at source?

Cheap toilet paper is a false economy, beware!
toniupstny
toniupstny
Mr or Mrs. 500
Mr or Mrs. 500 (544 reputation)Mr or Mrs. 500 (544 reputation)Mr or Mrs. 500 (544 reputation)Mr or Mrs. 500 (544 reputation)Mr or Mrs. 500 (544 reputation)Mr or Mrs. 500 (544 reputation)Mr or Mrs. 500 (544 reputation)Mr or Mrs. 500 (544 reputation)

Group: General Forum Members
Points: 544 Visits: 940
I would avoid using Checksum or binary or otherwise for uniqueness. It is not guaranteed to be unique as shown by this example I took from somewhere out on the internet (apologies to the author who I did not note).



select binary_checksum('where myval in (7004054,7004055)') a, binary_checksum('where myval in (7003888,7003889)') b 



this gives the same result - 432179860 432179860 for both values

Toni
colin Robinson-345240
colin Robinson-345240
Valued Member
Valued Member (63 reputation)Valued Member (63 reputation)Valued Member (63 reputation)Valued Member (63 reputation)Valued Member (63 reputation)Valued Member (63 reputation)Valued Member (63 reputation)Valued Member (63 reputation)

Group: General Forum Members
Points: 63 Visits: 148
no problem, youve illustrated a fair point. Id like to see the link that adds some context to your example
Syed-201559
Syed-201559
SSC Veteran
SSC Veteran (207 reputation)SSC Veteran (207 reputation)SSC Veteran (207 reputation)SSC Veteran (207 reputation)SSC Veteran (207 reputation)SSC Veteran (207 reputation)SSC Veteran (207 reputation)SSC Veteran (207 reputation)

Group: General Forum Members
Points: 207 Visits: 94
Delete should be based on business logics, as phone book example if you create unique index on phone number column then SQL Server will not allow duplicate phone number. This mean your phone book can not insert husband and wife who shares the same home phone number.

CheckSum and Binary_CheckSum both are not reliable, they were orignally designed to check message integrity when sending secure messages, both parties can detect if message was altered. If you are on SQL Server 2005 then use HashBytes function.

check sum and binary check sum can take the whole row and make things easy as example

Select Binary_CheckSum(*) from table_name

the hashbytes is limited to varchar,nvarchar and varbinary as input. You have to convert your columns to one of the supported data types.

select HashBytes('SHA1', CONVERT(varchar,column_name)) from table_name

use it with caution even a minor change like varchar to nvarchar will change the hashbytes value.

select HashBytes('SHA1', CONVERT(nvarchar,column_name)) from table_name

HashBytes
http://msdn.microsoft.com/en-us/library/ms174415.aspx

Binary_CheckSum
http://msdn.microsoft.com/en-us/library/ms173784.aspx

Duplicate rows are common in systems where you are importing data from other systems. There are few methods to delete duplicate rows in SQL server table.

http://www.sqldba.org/articles/34-find-duplicate-records-to-delete-or-update-in-sql-server.aspx


I hope it helps.


Kindest Regards,

Syed
Sr. SQL Server DBA
lisa.siverly
lisa.siverly
Valued Member
Valued Member (74 reputation)Valued Member (74 reputation)Valued Member (74 reputation)Valued Member (74 reputation)Valued Member (74 reputation)Valued Member (74 reputation)Valued Member (74 reputation)Valued Member (74 reputation)

Group: General Forum Members
Points: 74 Visits: 81
Hey guys,

I have a fairly large table in production with almost a million records, and I have duplicate records in that table with about 18,000 duplicate records. I've been trying some sql suggestions to delete duplicates, and I found your example here I was able to (I think) incorporate my field information into. Running this statement against my test database which is only half my production database, i executed the query which is running for some 4+ hours. When i stopped the query from executing, i found i had some 38,000+ records added to my database table... so I'm wondering if i made some error when configuring your sql into my database. Here is the code i created using your example.

SET ROWCOUNT 1
SELECT @@rowcount
WHILE @@rowcount > 0
DELETE pb FROM Expense as pb
INNER JOIN
(SELECT accountCode, expenseDescription, invoiceDate, invoiceNumber, ledgerCode, openItemNumber,
programCode, transactionAmount, vendorName, warrantNumber, lineNumber, collocationCodeID
FROM Expense
where expenseDescription like 'PP%'
GROUP BY accountCode, expenseDescription, invoiceDate, invoiceNumber, ledgerCode, openItemNumber,
programCode, transactionAmount, vendorName, warrantNumber, lineNumber, collocationCodeID HAVING count(*) > 1)
AS c ON c.accountCode = pb.accountCode
and c.expenseDescription = pb.expenseDescription
and c.invoiceDate = pb.invoiceDate
and c.invoiceNumber = pb.invoiceNumber
and c.ledgerCode = pb.ledgerCode
and c.openItemNumber = pb.openItemNumber
and c.programCode = pb.programCode
and c.transactionAmount = pb.transactionAmount
and c.vendorName = pb.vendorName
and c.warrantNumber = pb.warrantNumber
and c.lineNumber = pb.lineNumber
and c.collocationCodeID = pb.collocationCodeID
SET ROWCOUNT 0

SELECT * FROM Expense

DROP TABLE Expense

So, my questions are... Will this sql actually delete what seems like temp records? With this size table, should i expect this sql to execute for hours???

I would appreciate any insight, this is my first time performing a task like this.

Thank you,
Lisa
AmolNaik
AmolNaik
Ten Centuries
Ten Centuries (1.3K reputation)Ten Centuries (1.3K reputation)Ten Centuries (1.3K reputation)Ten Centuries (1.3K reputation)Ten Centuries (1.3K reputation)Ten Centuries (1.3K reputation)Ten Centuries (1.3K reputation)Ten Centuries (1.3K reputation)

Group: General Forum Members
Points: 1261 Visits: 1234
Smart technique. But beware SET ROWCOUNT is set to be deprecated in the future releases, instead you could use TOP 1.

An alternate method to remove duplicates is to use ROW_NUMBER() technique, basically ordering the resultset over the column phone number and then deleting row where row_number > 1, this will make sure that only 1 resultset is retained and any other repeating instances are deleted.

Thanks!

Amol Naik
Jeff Moden
Jeff Moden
SSC Guru
SSC Guru (119K reputation)SSC Guru (119K reputation)SSC Guru (119K reputation)SSC Guru (119K reputation)SSC Guru (119K reputation)SSC Guru (119K reputation)SSC Guru (119K reputation)SSC Guru (119K reputation)

Group: General Forum Members
Points: 119180 Visits: 41482
AmolNaik (6/13/2011)
Smart technique. But beware SET ROWCOUNT is set to be deprecated in the future releases, instead you could use TOP 1.

An alternate method to remove duplicates is to use ROW_NUMBER() technique, basically ordering the resultset over the column phone number and then deleting row where row_number > 1, this will make sure that only 1 resultset is retained and any other repeating instances are deleted.

Thanks!


It'll also be much faster than the looping method. :-)

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum

































































































































































SQLServerCentral


Search