SQLServerCentral Article

Removing Duplicate Rows

,

Removing Duplicate Rows


Introduction

"How do I remove duplicate rows from a table" is a common question

in the newsgroups, and here we will look at a few different techniques for

achieving this.

Databases can easily be designed to disallow duplicates (which I will discuss

at the bottom of the article) but the issue can still arise. For example,

duplicate data might arrive at your database via an interface to another system.

The code examples in this article were all developed on SQL 2000, but should

work fine in versions 6.5 and 7.

The "Distinct" technique

This is the easiest way of de-duplicating a table, and is at it’s best with

fairly small rowsets, and where all the columns in a table are duplicates. We

simply run a SELECT DISTINCT on the offending table, storing the results in a

temporary table, then delete the contents of the permanent table and re-populate

it from the temporary one.

The downside of this is that you need enough free data space in your database

(or in tempdb) to store the entire table again, plus plenty of log space when

you are de-duplicating large tables. This simple example loads a table with

duplicate data then removes the duplicates:

create table marxBrothers (

    Number int,

    Name varchar(32)

)

go

insert marxBrothers

    select 1, 'Groucho Marx' UNION ALL

    select 2, 'Harpo Marx' UNION ALL

    select 3, 'Chico Marx' UNION ALL

    select 1, 'Groucho Marx' UNION ALL

    select 2, 'Harpo Marx' UNION ALL

    select 3, 'Chico Marx' UNION ALL

    select 6, 'Zeppo Marx' UNION ALL

    select 7, 'Gummo Marx' UNION ALL

    select 6, 'Zeppo Marx'

begin tran deduplicate

    select DISTINCT *

    into #temp

    from marxBrothers

    truncate table marxBrothers

    insert marxBrothers

    select *

    from #temp

    select * from marxBrothers

    drop table #temp

commit tran deduplicate

drop table marxBrothers

Using derived tables

For tables where duplication is defined by a subset of the columns in the

table, you can use one of the other columns in the table to identify which rows

to keep and which to delete. Here is a simple example – it is similar to the

previous one, but note that only the names are duplicated, because the "ident"

column has been defined as an Identity column. Here we only want to keep one

instance of each name, regardless of the value in the Ident column. In the

example I have arbitrarily decided to keep the name with the lowest Ident value

create table marxBrothers (

    ident int IDENTITY,

    Name varchar(32)

)

go

insert marxBrothers (Name)

    select 'Groucho Marx' UNION ALL

    select 'Harpo Marx' UNION ALL

    select 'Chico Marx' UNION ALL

    select 'Groucho Marx' UNION ALL

    select 'Harpo Marx' UNION ALL

    select 'Chico Marx' UNION ALL

    select 'Zeppo Marx' UNION ALL

    select 'Gummo Marx' UNION ALL

    select 'Zeppo Marx'

select * from marxBrothers

delete marxBrothers

from marxBrothers,

    (

        select min(ident) as minIdent, name

        from marxBrothers m

        group by name

        having count(1) > 1

    ) as derived

where marxBrothers.name = derived.name

and ident > minIdent

select * from marxBrothers

drop table marxBrothers

The inner query selects all the names in the table that are duplicated, and

the minimum value of the Ident column for each name. These results are compared

against the main table in order to select the records for deletion

With this technique you may be able to get away with having much less free

room than the earlier technique, but this depends on the ratio of duplicated to

unique records – the more records are duplicated, the more space SQL Server

needs to temporarily store, and work with, the results that makeup the derived

table. The derived table technique is the one I usually try first.

Using correlated subqueries

This example uses a correlated sub-query to achieve de-duplication

create table marxBrothers (

    ident int IDENTITY,

    Name varchar(32)

)

go

insert marxBrothers (Name)

    select 'Groucho Marx' UNION ALL

    select 'Harpo Marx' UNION ALL

    select 'Chico Marx' UNION ALL

    select 'Groucho Marx' UNION ALL

    select 'Harpo Marx' UNION ALL

    select 'Chico Marx' UNION ALL

    select 'Groucho Marx' UNION ALL

    select 'Harpo Marx' UNION ALL

    select 'Chico Marx' UNION ALL

    select 'Groucho Marx' UNION ALL

    select 'Harpo Marx' UNION ALL

    select 'Chico Marx' UNION ALL

    select 'Zeppo Marx' UNION ALL

    select 'Gummo Marx' UNION ALL

    select 'Zeppo Marx'

select * from marxBrothers

delete marxBrothers

where ident > (

    select min(ident)

    from marxBrothers m

    where m.name = marxBrothers.name

)

select * from marxBrothers

drop table marxBrothers

Correlated sub-queries can be pretty slow and inefficient to run against

large tables.

De-duplicating using dynamic SQL

This technique uses a cursor and the "SET ROWCOUNT" feature to work

through the table one name at a time, deleting all but one of the rows for each

name. Cursors and Dynamic SQL both have their place, but I generally try to

avoid using them for reasons of efficiency, As this example uses both a cursor

and dynamic SQL, it can bet pretty inefficient with large numbers of duplicated

rows, and I would prefer to use one of the previous techniques. However, the

technique does work, so I think it is still worth a mention. 

create table marxBrothers (

ident int IDENTITY,

Name varchar(32)

)

insert marxBrothers (Name)

    select 'Groucho Marx' UNION ALL

    select 'Groucho Marx' UNION ALL

    select 'Groucho Marx' UNION ALL

    select 'Groucho Marx' UNION ALL

    select 'Harpo Marx' UNION ALL

    select 'Chico Marx' UNION ALL

    select 'Chico Marx' UNION ALL

    select 'Chico Marx' UNION ALL

    select 'Groucho Marx' UNION ALL

    select 'Harpo Marx' UNION ALL

    select 'Chico Marx' UNION ALL

    select 'Zeppo Marx' UNION ALL

    select 'Gummo Marx' UNION ALL

    select 'Zeppo Marx' UNION ALL

    select 'Zeppo Marx'

go

declare @sql varchar(255),

        @name varchar(32),

        @count int

select * from marxBrothers

declare CI insensitive cursor for

    select name, count(1) – 1 from marxBrothers m

    group by name

    having count(1) > 1

open CI

fetch next from CI into @name, @count

while @@fetch_status = 0

begin

    select @sql =

    'set rowcount ' + convert(varchar(9), @count) + '

    delete marxBrothers where name = "' + @name + '"'

    exec (@sql)

    fetch next from CI into @name, @count

end-- while fetch

close CI

deallocate CI

select * from marxBrothers

go

drop table marxBrothers

Preventing duplication

Prevention is, after all, better than cure, so It would be pointless

discussing de-duplicating a table without mentioning how to prevent duplicate

rows appearing again afterwards. There are three main declarative methods for

doing this in SQL:

The Primary Key constraint

This is implemented as an index in SQL Server, and so it can also help speed

up select and join operations from the table. One or more columns may make up a

primary key, and SQL Server will automatically ensure that duplicates do not

exist in a table’s primary key. An error will be returned if you try to insert

duplicates.

create table marxBrothers (

    ident         intIDENTITY,

    name          varchar(32),

    constraint    PK_marxBrothers PRIMARY KEY

(name)

)

go

The Unique constraint

This is also Implemented as an index in SQL Server. The main difference

between a primary key and a unique constraint is that the Primary

Key can never contain a NULL value.

create table marxBrothers (

    ident         intIDENTITY,

    name          varchar(32),

    constraint    U_marxBrothers UNIQUE (name)

)

go

The IGNORE_DUP_KEY option

This is an option on the create index command, which will cause SQL

Server to discard rows where a duplicate of the columns in the specified index

already exists. Use this option with care – you have to be certain that you do

really need to do this. Also remember that the results of using this option can

easily be misinterpreted as an error by end users.

create table marxBrothers (

    ident         intIDENTITY,

    name          varchar(32),

)

go

create unique index UI_marxBrothers on marxBrothers (name) with ignore_dup_key

 

About the author

Neil Boyle is an independant SQL Server consultant working out of London,

England. Neil's free SQL Server guide is available on-line at

http://www.impetus-sql.co.uk

Rate

3 (2)

Share

Share

Rate

3 (2)