Click here to monitor SSC
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


Deleting Duplicate Records


Deleting Duplicate Records

Author
Message
Anup pandey
Anup pandey
Forum Newbie
Forum Newbie (5 reputation)Forum Newbie (5 reputation)Forum Newbie (5 reputation)Forum Newbie (5 reputation)Forum Newbie (5 reputation)Forum Newbie (5 reputation)Forum Newbie (5 reputation)Forum Newbie (5 reputation)

Group: General Forum Members
Points: 5 Visits: 1

How does such bad design and missing constraint get into Dbs that allow duplicates to get into databases in the first place? Heard of Primary Keys (oh, some people make IDs as PKs for all tables where naturalkeys make sense, tsk tsk), heard of constraints and triggers (in SQL at least). There is no reason to have to use these duplicate deletion scripts. Design tables with proper PKs and constraints and use triggers.

www.DynamicWebTX.Com


katesl
katesl
SSC Journeyman
SSC Journeyman (79 reputation)SSC Journeyman (79 reputation)SSC Journeyman (79 reputation)SSC Journeyman (79 reputation)SSC Journeyman (79 reputation)SSC Journeyman (79 reputation)SSC Journeyman (79 reputation)SSC Journeyman (79 reputation)

Group: General Forum Members
Points: 79 Visits: 473

Lots of data exists in the world outside real (that is, normalized and managed by a DBMS) databases. The big ETL (extract, transform, load) task is to load this messy data into your database. I love this work and never have to write a program or a cursor thanks to DTS and to my progammer's editor Epsilon, which can load and edit huge files.



_________________
"Look, those sheep have been shorn."
data analyst replies, "On the sides that we can see.."
Ashish Bajpai
Ashish Bajpai
Forum Newbie
Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)

Group: General Forum Members
Points: 1 Visits: 1

SQL was good but too long

Do One thing

  • Select ID, NAME, SALARY into #TableName from TableName Group by ID, NAME, SALARY
  • Delete From TABLENAME
  • INSERT INTO TABLENAME SELECT * FROM #TABLENAME

This is 3 Line SQL Make a Program & EXECUTE it.

Ashish Bajpai







Ashish Bajpai

Programer

NCR Corpration, Mumbai

[91] - 9324 959982

Dan Collier
Dan Collier
Grasshopper
Grasshopper (19 reputation)Grasshopper (19 reputation)Grasshopper (19 reputation)Grasshopper (19 reputation)Grasshopper (19 reputation)Grasshopper (19 reputation)Grasshopper (19 reputation)Grasshopper (19 reputation)

Group: General Forum Members
Points: 19 Visits: 16

This does the same thing, simplier and NO CURSORS. If it can be done without a cursor, than much better.

CREATE TABLE [dbo].[Employee] (
[id] [int] NULL ,
[name] [Varchar] (50) COLLATE Latin1_General_CI_AS NULL ,
[salary] [Numeric](18, 2) NULL
  ON [PRIMARY]
GO
/*********************************************************/

/**********************************************************/
/* Script for Insertion of duplicate records to the Table */
Insert into employee values (1,'Ram', 1000.00)
Insert into employee values (1,'Ram', 1000.00)
Insert into employee values (2,'Joe', 2000.00)
Insert into employee values (2,'Joe', 1000.00)
Insert into employee values (3,'Mary', 1000.00)
Insert into employee values (4,'Julie', 5000.00)
Insert into employee values (2,'Joe', 1000.00)
Insert into employee values (1,'Ram', 1000.00)

/**********************************************************/

BEGIN TRANSACTION

Select [id], [name], [salary]
INTO #MakeUnique
FROM employee
GROUP BY [id], [name], [salary]

DELETE FROM employee

INSERT INTO employee( [id], [name], [salary] )
SELECT [id], [name], [salary] FROM #MakeUnique

COMMIT TRANSACTION

SELECT * FROM employee

DROP TABLE #MakeUnique
DROP TABLE employee


Dan Collier
Dan Collier
Grasshopper
Grasshopper (19 reputation)Grasshopper (19 reputation)Grasshopper (19 reputation)Grasshopper (19 reputation)Grasshopper (19 reputation)Grasshopper (19 reputation)Grasshopper (19 reputation)Grasshopper (19 reputation)

Group: General Forum Members
Points: 19 Visits: 16
Yes very very bad, really a NO DESIGN table. Believe or not, I had to deal with NO UNIQUE CONSTRAINT Tables, these BAD Designs with Commercial Enterprise systems.
tbredeme
tbredeme
SSC Eights!
SSC Eights! (876 reputation)SSC Eights! (876 reputation)SSC Eights! (876 reputation)SSC Eights! (876 reputation)SSC Eights! (876 reputation)SSC Eights! (876 reputation)SSC Eights! (876 reputation)SSC Eights! (876 reputation)

Group: General Forum Members
Points: 876 Visits: 78
While I agree that proper design will eliminate the need for such scripts, inevitably we will all inherit poorly structured systems which necessitate this functionality. Personally I prefer to use a stored procedure that I found a few years ago. I didn't write it - I think I have only used it a few times - but it has saved my bacon each time.


/* Description: UTILITY - Locate in MASTER


Syntax: EXEC sp_RemoveDups TableName, DupQualifierFieldNameList, DeleteDups, UniqueColName, CreateIdentColIfNeeded, StoredProcedureResult
Only the first two arguments are required
For HELP, enter the stored procedures name without any arguments or see the PRINT statements below
NO DELETION WILL OCCUR by default - only duplicate recordset returned. To delete records, pass in a 0 for the DeleteDups argument.

Example: EXEC sp_RemoveDups 'MyTable','LastName,FirstName,HomePhone'

Purpose: Allow removal of duplicate rows where
1. We define what fields qualify the duplicate
2. We select the unique rowid or it is detected automatically else no action takes place

Method: Delete by RowID all duplicate rows except the highest RowID (in alpha-sort sequence)
of each group of duplicates.

DATE BY CHANGE
09-23-2002 Frank Original v1.0
09-23-2002 Frank Changed the name from sp_RemoveDupsByRowID to sp_RemoveDups
10-8-2002 Sean P. O. MacCath-Moran Made @UniqueColName optional
Added logic to auto-detect RowGUID and Identity columns
Added logic to check for unique value column if no RowGUID or Identity column exists
Added logic to create a temporary ID field as a last resort if no unique column could be located
Added HELP

*/

CREATE PROCEDURE sp_RemoveDups
@TableName as varchar(50) = NULL,
@DupQualifierFieldNameList as varchar(200) = NULL,
@DeleteDups as bit = NULL,
@UniqueColName as varchar(50) = NULL,
@CreateIdentColIfNeeded as bit = NULL,
@StoredProcedureResult int = NULL OUTPUT

AS
SET NOCOUNT ON

DECLARE @SQL nvarchar(2000)
DECLARE @SQL_DetermineUniqueTemplate nvarchar(2000)
DECLARE @TempIdentColName varchar(20)
DECLARE @HostName varchar(50)
DECLARE @ActionText varchar(10)

DECLARE @SUM int
DECLARE @COUNT int
DECLARE @NextColumn varchar(50)


/*==================================================================================*/
/*========================VARIABLE INITIALIZATION AND SETUP========================*/
/*=================================================================================*/
/*If no unique column is located then a temporary Identity column will be created useing the name specified in this TempIdentColName string*/
SET @TempIdentColName = 'TempIdentityColXY123'

SET @SQL_DetermineUniqueTemplate = 'SELECT @COUNT = COUNT(Count), @SUM = sum(Count) from '
SET @SQL_DetermineUniqueTemplate = @SQL_DetermineUniqueTemplate + CHAR(13) + '(SELECT TOP 100 PERCENT , COUNT(*) as Count FROM ' + @TableName
SET @SQL_DetermineUniqueTemplate = @SQL_DetermineUniqueTemplate + CHAR(13) + ' GROUP BY ORDER BY ) a'

/*Retrieve the Host Name. This will be used later in this SP as a test to determine if the user is making this call from within SQL Query Analyzer*/
SELECT @HostName = hostname FROM master..sysprocesses WHERE spid = @@SPID AND program_name = 'SQL Query Analyzer'


/*Set ActionText to be used in message output*/
IF (@DeleteDups IS NULL) OR (@DeleteDups = 1)
SET @ActionText = 'Selection'
ELSE
SET @ActionText = 'Deletion'

/*If a value is specified for use by UniqueColName it cannot exist in the columns from the DupQualifierFieldNameList*/
IF CHARINDEX(@UniqueColName, @DupQualifierFieldNameList) > 0
BEGIN
/*The value in UniqueColName was detected in DupQualifierFieldNameList.*/
IF NOT (@HostName IS NULL) PRINT 'The UniqueColName provided (' + @UniqueColName + ') must not exist in DupQualifierFieldNameList (' + @DupQualifierFieldNameList + '). Other solutions will be sought automatically.'
SET @UniqueColName = NULL
END


/*If UniqueColName is provided then perform check to ensure that all the values in that column are, in fact, unique.*/
IF NOT (@UniqueColName IS NULL)
BEGIN
SET @SQL = REPLACE(@SQL_DetermineUniqueTemplate,'', @UniqueColName)
/*Perform a check of this column to determine if all of it's values are unique*/
EXEC sp_executesql @SQL, N'@SUM as int OUTPUT,@COUNT as int OUTPUT',@SUM OUTPUT,@COUNT OUTPUT
/*Test to determine if this column contains unique values*/
If @SUM <> @COUNT
BEGIN
/*The column specified by UniqueColName does not contain unique values.*/
IF NOT (@HostName IS NULL) PRINT 'The UniqueColName provided (' + @UniqueColName + ') does not contain unique values. Other solutions will be sought automatically.'
SET @UniqueColName = NULL
END
END


/*==============================================================*/
/*======================HELP OUTPUT TEXT======================*/
/*==============================================================*/
IF (@TableName IS NULL) OR (@TableName = '/?') OR (@TableName = '?') OR (@DupQualifierFieldNameList IS NULL) OR (@DupQualifierFieldNameList = '/?') OR (@DupQualifierFieldNameList = '?')
BEGIN
IF NOT (@HostName IS NULL)
BEGIN
PRINT 'sp_RemoveDups ''TableName'', ''DupQualifierFieldNameList'', [''DeleteDups''], [''UniqueColName''], [''CreateIdentColIfNeeded''], <''StoredProcedureResult''>'
PRINT '====================================================================================================================================================================='
PRINT 'TableName: Required - String - Name of the table to detect duplicate records in.'
PRINT 'DupQualifierFieldNameList: Required - String - Comma seperated list of columns that make up the unique record within TableName.'
PRINT 'DeleteDups: Optional - Bit, Set to 0 to delete duplicate records. A value of NULL or 1 will return the duplicate records to be deleted.'
PRINT 'UniqueColName: Optional - Bit - A table must have a unique column value in it to perform the deletion logic. If no UniqueColName is provided then an attemp will be made to locate the RowGUID column. If that fails then an attempt will be made to locate the Identity column. If that fails then all of the columns of the table will be examined and the first one with all unique values will be selected.'
PRINT 'CreateIdentColIfNeeded: Optional - Bit - By default this SP will create an identity column if no unique column can be located. Pass in a 1 here to run this feature off.'
PRINT 'StoredProcedureResult: Optional - OUTPUT - Int - Returns a 3 if an error occured, otherwise returns a 0.'
END
SET @StoredProcedureResult = 3
RETURN
END


/*========================================================================*/
/*======================DETECT USABLE UniqueColName======================*/
/*========================================================================*/
IF @UniqueColName IS NULL
BEGIN
/*Check for a RowGUID or Identity column in this table. If one exists, then utilze it as the unique value for the purposes of this deletion*/
IF EXISTS(SELECT * FROM SysColumns WHERE ID = Object_ID(@TableName) and ColumnProperty(ID,Name,'IsRowGUIDCol') = 1) SET @UniqueColName = 'RowGUIDCol'
IF EXISTS(SELECT * FROM SysColumns WHERE ID = Object_ID(@TableName) and ColumnProperty(ID,Name,'IsIdentity') = 1) SET @UniqueColName = 'IdentityCol'

IF @UniqueColName IS NULL
/*If no RowGUID or Identity column was found then check all of the columns in this table to see if one of them can be utilized as a unique value column*/
BEGIN
/*Select all of the columns from the table in question...*/
DECLARE MyCursor CURSOR LOCAL SCROLL STATIC FOR SELECT name FROM syscolumns WHERE OBJECT_ID(@TableName)=ID

OPEN MyCursor
FETCH NEXT FROM MyCursor INTO @NextColumn
WHILE @@fetch_status = 0
BEGIN
/*Create SQL string with correct column name in place.*/
SET @SQL = REPLACE(@SQL_DetermineUniqueTemplate,'', @NextColumn)
/*Perform a check of this column to determine if all of it's values are unique*/
EXEC sp_executesql @SQL, N'@SUM as int OUTPUT,@COUNT as int OUTPUT',@SUM OUTPUT,@COUNT OUTPUT
/*Test to determine if this column contains unique values*/
If @SUM = @COUNT
BEGIN
/*A unique values column is detected. Use it and break out of the loop UNLESS column is specified in DupQualifierFieldNameList*/
IF CHARINDEX(@NextColumn, @DupQualifierFieldNameList) = 0
BEGIN
/*NextColumn was NOT detected in DupQualifierFieldNameList, so this is the column we will use.*/
SET @UniqueColName = @NextColumn
BREAK
END
END
ELSE
FETCH NEXT FROM MyCursor INTO @NextColumn
END
CLOSE MyCursor
DEALLOCATE MyCursor

END
END

/*If no UniqueColName has been found then create one UNLESS @CreateIdentColIfNeeded = 1*/
IF (@UniqueColName IS NULL) AND ( (@CreateIdentColIfNeeded IS NULL) OR (@CreateIdentColIfNeeded = 0) )
BEGIN
/*Add a sequence column to the table...*/
IF NOT (@HostName IS NULL) PRINT 'Creating temporary identity column in the ' + @TableName + ' table named ' + @TempIdentColName + ' for use in this ' + LOWER(@ActionText) + ' process...'
EXEC('ALTER TABLE ' + @TableName + ' ADD ' + @TempIdentColName + ' [int] IDENTITY (1, 1)')
SET @UniqueColName = @TempIdentColName
END


/*============================================================================*/
/*======================EXECUTE DELETION OR SELECTION======================*/
/*===========================================================================*/
IF @UniqueColName IS NULL
BEGIN
/*No UniqueColName was provided by the user and none were detected by the script. This deletion algorythm cannot run.*/
IF NOT (@HostName IS NULL) PRINT 'Could not perform ' + LOWER(@ActionText) + ' process. No unique columns were located and the UniqueColName flag is set to 1 (False).'
SET @StoredProcedureResult = 3
RETURN
END
ELSE
BEGIN
IF NOT (@HostName IS NULL) PRINT 'Performing ' + LOWER(@ActionText) + ' utilizing the unique values in the ' + @UniqueColName + ' column as a reference...'
/*
Create and execute an SQL statement in the form of:

SELECT * (or DELETE)
FROM TableName WHERE UniqueColName IN
(
SELECT UniqueColName FROM TableName WHERE UniqueColName NOT IN
(
SELECT MAX(Cast(UniqueColName AS varchar(36))) FROM TableName GROUP BY DupQualifierFieldNameList, DupQualifierFieldNameList, etc
)
)
*/
/*Delete all duplicate records useing @UniqueColName as a unique ID column */
IF (@DeleteDups IS NULL) OR (@DeleteDups = 1)
SET @SQL = 'SELECT * '
ELSE
SET @SQL = 'DELETE '

SET @SQL = @SQL + 'FROM ' + @TableName + ' WHERE ' + @UniqueColName + ' IN '
SET @SQL = @SQL + CHAR(13) + CHAR(9) + '(' + CHAR(13) + CHAR(9)
SET @SQL = @SQL + 'SELECT ' + @UniqueColName + ' FROM ' + @TableName + ' WHERE ' + @UniqueColName + ' NOT IN '
SET @SQL = @SQL + CHAR(13) + CHAR(9) + CHAR(9) + '(' + CHAR(13) + CHAR(9)+CHAR(9)
SET @SQL = @SQL + 'SELECT MAX(Cast(' + @UniqueColName + ' AS varchar(36))) FROM '
SET @SQL = @SQL + @TableName + ' GROUP BY ' + @DupQualifierFieldNameList
SET @SQL = @SQL + CHAR(13) + CHAR(9) + CHAR(9) + ')' + CHAR(13) + CHAR(9) + ')'

EXEC (@SQL)
IF @@ERROR <> 0
BEGIN
IF NOT (@HostName IS NULL) PRINT @ActionText + ' process failed.'
SET @StoredProcedureResult = 3
END
ELSE
BEGIN
IF NOT (@HostName IS NULL) PRINT @ActionText + ' completed successfully with this SQL: ' + CHAR(13) + @SQL
SET @StoredProcedureResult = 0
END
END


IF (@UniqueColName = @TempIdentColName) AND ( (@CreateIdentColIfNeeded IS NULL) OR (@CreateIdentColIfNeeded = 0) )
BEGIN
/*Remove the sequence column from the table...*/
IF NOT (@HostName IS NULL) PRINT 'Removing temporary identity column named ' + @TempIdentColName + ' from the ' + @TableName + ' table...'
EXEC('ALTER TABLE ' + @TableName + ' DROP COLUMN ' + @TempIdentColName)
END

GO


/*
USAGE

sp_RemoveDups 'TableName', 'DupQualifierFieldNameList', ['DeleteDups'], ['UniqueColName'], ['CreateIdentColIfNeeded'], <'StoredProcedureResult'>
================================================================================================================================================
TableName: Required - String - Name of the table to detect duplicate records in.
DupQualifierFieldNameList: Required - String - Comma seperated list of columns that make up the
unique record within TableName.
DeleteDups: Optional - Bit, Set to 0 to delete duplicate records. A value of NULL or 1 will
return the duplicate records to be deleted.
UniqueColName: Optional - Bit - A table must have a unique column value in it to perform the
deletion logic. If no UniqueColName is provided then an attemp will be made to locate
the RowGUID column. If that fails then an attempt will be made to locate the
Identity column. If that fails then all of the columns of the table will be examined
and the first one with all unique values will be selected.
CreateIdentColIfNeeded: Optional - Bit - By default this SP will create an identity column if
no unique column can be located. Pass in a 1 here to run this feature off.
StoredProcedureResult: Optional - OUTPUT - Int - Returns a 3 if an error occured, otherwise returns a 0.

Simple Syntax:
Use [databaseName]
sp_RemoveDups '[tableName]','[uniqueColName]','[Optional bit flag - set to zero to remove dups, omit to report only]'
*/


Artificial Intelligence stands no chance against Natural Stupidity.
punniyakodi
punniyakodi
SSC Journeyman
SSC Journeyman (96 reputation)SSC Journeyman (96 reputation)SSC Journeyman (96 reputation)SSC Journeyman (96 reputation)SSC Journeyman (96 reputation)SSC Journeyman (96 reputation)SSC Journeyman (96 reputation)SSC Journeyman (96 reputation)

Group: General Forum Members
Points: 96 Visits: 32

Appreciate your alternative method to delete duplicate records.


dbalos
dbalos
Forum Newbie
Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)Forum Newbie (1 reputation)

Group: General Forum Members
Points: 1 Visits: 14
My two cents about database design and avoiding duplicates - if you have unstructured, flat data, let's say monitoring system data about 20 years worth, in various excel, text and other forms of (electronic) data storage, with about 27 milion lines of data, where the only identifier might be four out of 5 columns (timestamp, status1, status2, measurandID), the only solution for importing you have is to use DTS... And because you have to import the data as fast as possible, you might want to avoid having any kind of indexing in place, let alone constraint checks. So you end up with about 10% duplicates or even more, but you would not care too much, except that your database blows up to 16 GB, after you finish indexing in order to be able to use the data at all.



Lee Davis-285380
Lee Davis-285380
Grasshopper
Grasshopper (10 reputation)Grasshopper (10 reputation)Grasshopper (10 reputation)Grasshopper (10 reputation)Grasshopper (10 reputation)Grasshopper (10 reputation)Grasshopper (10 reputation)Grasshopper (10 reputation)

Group: General Forum Members
Points: 10 Visits: 1

Excellent Script and a very explanation of it!

Thanks!


Jules Bui
Jules Bui
SSC Rookie
SSC Rookie (41 reputation)SSC Rookie (41 reputation)SSC Rookie (41 reputation)SSC Rookie (41 reputation)SSC Rookie (41 reputation)SSC Rookie (41 reputation)SSC Rookie (41 reputation)SSC Rookie (41 reputation)

Group: General Forum Members
Points: 41 Visits: 2

I totally agree with avoiding cursors, they are fun only if you have a table with limited rows of data. Try running a cursor with millions of records and you will be waiting forever for it to complete plus the resources dedicated to running the process could be a lot more useful elsewhere.



Jules Bui
IT Operations DBA
Backup and Restore Administrator
Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum

































































































































































SQLServerCentral


Search