Deleting Duplicate Records

This article explains a solution which can be used for deleting non-unique duplicate records from a table. In Oracle, every row in a table has a unique identifier so deleting duplicate records can be done as each record is unique based on its rowid (rowid is the mechanism which Oracle uses to physically locate the records in a table). But as SQL Server 7.0/2000 does not have the concept of row identifier it’s difficult to delete duplicate records from a table.

I have created a Script which I have found very useful when I am faced with such a problem. In the example below I have tried to explain how the script works. First let’s create a non indexed table using the script given below:

/*********************************************************//* Script for Creation of Employee Table*/CREATE TABLE [dbo].[Employee] (
            [id] [int] NULL ,
            [name] [Varchar] (50) COLLATE Latin1_General_CI_AS NULL ,
            [salary] [Numeric](18, 2) NULL 
 ) ON [PRIMARY]
GO
/*********************************************************/

Then add some records into the table using the script given below. Note that there are 5 similar records being inserted into the table

/**********************************************************//* Script for Insertion of duplicate records to the Table */Insert into employee values (1,'Ram', 1000.00)
Insert into employee values (1,'Ram', 1000.00)
Insert into employee values (2,'Joe', 2000.00)
Insert into employee values (2,'Joe', 1000.00)
Insert into employee values (3,'Mary', 1000.00)
Insert into employee values (4,'Julie', 5000.00)
Insert into employee values (2,'Joe', 1000.00)
Insert into employee values (1,'Ram', 1000.00)

/**********************************************************/

The table below shows the data present in the ‘Employee’ table.

id	Name	Salary
1	Ram	1000
1	Ram	1000
2	Joe	2000
2	Joe	1000
3	Mary	1000
4	Julie	5000
2	Joe	1000
1	Ram	1000

The following script deletes the duplicates from ‘Employee’ table. If there are duplicate records then the script will get a count of the duplicate records and will then delete the records till the count is 1.

/*******************************************************************//* Script for deletion of Duplicate record from the Employee Table */Declare @id int,
        @name varchar (50),
        @cnt int,
        @salary numeric

Declare getallrecords cursor local static For
 Select count (1), id, name, salary 
   from employee (nolock)
   group by id, name,salary having count(1)>1
 
Open getallrecords

Fetch next from getallrecords into @cnt,@id,@name,@salary
--Cursor to check with all other records
While @@fetch_status=0
 Begin
  Set @cnt= @cnt-1
  Set rowcount @cnt

  -- Deleting the duplicate records. Observe that all fields are mentioned at the where condition
  Delete from employee where id=@id and name=@name
  and salary=@salary

  Set rowcount 0

  Fetch next from getallrecords into @cnt,@id,@name,@salary
 End 

Close getallrecords
Deallocate getallrecords

*******************************************************************

The logic of the script is pretty simple; the select query retrieves all the records that are duplicates i.e. having Count greater than one. The result set is retrieved by opening a local cursor which fetches one row at a time.

Note here that the Count column is a part of the select query; this is used to identify the no of duplicate rows in the result set.

The row count has been set to (Value obtained from the Count Column – 1). SQL Server uses rowcount to stop processing the query after the specified numbers of rows are returned. The delete statement is executed only to delete the rows set by the Set rowcount command. Once the records have been deleted, the rowcount of SQL server is reset to the default value of 0.

For more details on row count visit (http://msdn.microsoft.com/library/default.asp?url=/library/en-us/tsqlref/ts_set-set_0bjo.asp)

After the above script is executed the data in the ‘Employee’ table is as shown below and note that there are no duplicates anymore:

id	Name	Salary
1	Ram	1000
2	Joe	1000
2	Joe	2000
3	Mary	1000
4	Julie	5000

In the above example duplicate records were deleted at a row level, as we had considered all the columns in the select query. By customizing the script you can also delete duplicate records at a column level. This Query can be extremely beneficial and time saving for data cleansing during Data Migration.

Rate

Share

Categories

Share

Rate

Deleting Duplicate Records

Rate

Share

Categories

Share

Rate

Related content

Tame Those Strings Part 1 - Using SUBSTRING

Tame Those Strings! Part 2 - Using CHARINDEX

Tame Those Strings! Part 5 - Using STUFF

SP_MSFOREACHDB - Getting a list of databases via TSQL.

SP_MSForeachtable - Life Without Cursors.