Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase ««12345»»»

Deleting Duplicate Records Expand / Collapse
Author
Message
Posted Tuesday, December 21, 2004 11:54 AM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Thursday, February 8, 2007 10:11 AM
Points: 5, Visits: 1

How does such bad design and missing constraint get into Dbs that allow duplicates to get into databases in the first place?  Heard of Primary Keys (oh, some people make IDs as PKs for all tables where naturalkeys make sense, tsk tsk), heard of constraints and triggers (in SQL at least).  There is no reason to have to use these duplicate deletion scripts.  Design tables with proper PKs and constraints and use triggers.

www.DynamicWebTX.Com

Post #152041
Posted Tuesday, December 21, 2004 12:57 PM


Valued Member

Valued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued MemberValued Member

Group: General Forum Members
Last Login: Tuesday, November 4, 2014 12:07 PM
Points: 63, Visits: 472

Lots of data exists in the world outside real (that is, normalized and managed by a DBMS) databases.  The big ETL (extract, transform, load) task is to load this messy data into your database.  I love this work and never have to write a program or a cursor thanks to DTS and to my progammer's editor Epsilon, which can load and edit huge files.




_________________
"Look, those sheep have been shorn."
data analyst replies, "On the sides that we can see.."
Post #152048
Posted Monday, December 19, 2005 11:21 PM


Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Wednesday, December 21, 2005 3:47 AM
Points: 1, Visits: 1

SQL was good but too long

Do One thing

  • Select ID, NAME, SALARY into #TableName from TableName Group by ID, NAME, SALARY
  • Delete From TABLENAME
  • INSERT INTO TABLENAME SELECT * FROM #TABLENAME

This is 3 Line SQL Make a Program & EXECUTE it.

Ashish Bajpai







Ashish Bajpai

Programer

NCR Corpration, Mumbai

[91] - 9324 959982
Post #245449
Posted Tuesday, December 20, 2005 12:03 AM
Grasshopper

GrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopper

Group: General Forum Members
Last Login: Thursday, June 20, 2013 4:47 PM
Points: 17, Visits: 16

This does the same thing, simplier and NO CURSORS.  If it can be done without a cursor, than much better. 

CREATE TABLE [dbo].[Employee] (
            [id] [int] NULL ,
            [name] [Varchar] (50) COLLATE Latin1_General_CI_AS NULL ,
            [salary] [Numeric](18, 2) NULL
  ON [PRIMARY]
GO
/*********************************************************/

/**********************************************************/
/* Script for Insertion of duplicate records to the Table */
Insert into employee values (1,'Ram', 1000.00)
Insert into employee values (1,'Ram', 1000.00)
Insert into employee values (2,'Joe', 2000.00)
Insert into employee values (2,'Joe', 1000.00)
Insert into employee values (3,'Mary', 1000.00)
Insert into employee values (4,'Julie', 5000.00)
Insert into employee values (2,'Joe', 1000.00)
Insert into employee values (1,'Ram', 1000.00)

/**********************************************************/

BEGIN TRANSACTION

Select [id], [name], [salary]
INTO #MakeUnique
FROM employee
GROUP BY [id], [name], [salary]

DELETE FROM employee  

INSERT INTO employee( [id], [name], [salary] )
SELECT [id], [name], [salary] FROM #MakeUnique

COMMIT TRANSACTION
 
SELECT * FROM employee

DROP TABLE #MakeUnique
DROP TABLE employee

 

Post #245453
Posted Tuesday, December 20, 2005 12:10 AM
Grasshopper

GrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopper

Group: General Forum Members
Last Login: Thursday, June 20, 2013 4:47 PM
Points: 17, Visits: 16
Yes very very bad, really a NO DESIGN table.  Believe or not, I had to deal with NO UNIQUE CONSTRAINT Tables, these BAD Designs with Commercial Enterprise systems.
Post #245454
Posted Tuesday, December 20, 2005 7:37 AM
SSC Eights!

SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!

Group: General Forum Members
Last Login: Thursday, June 25, 2009 6:14 AM
Points: 876, Visits: 78
While I agree that proper design will eliminate the need for such scripts, inevitably we will all inherit poorly structured systems which necessitate this functionality. Personally I prefer to use a stored procedure that I found a few years ago. I didn't write it - I think I have only used it a few times - but it has saved my bacon each time.


/* Description: UTILITY - Locate in MASTER


Syntax: EXEC sp_RemoveDups TableName, DupQualifierFieldNameList, DeleteDups, UniqueColName, CreateIdentColIfNeeded, StoredProcedureResult
Only the first two arguments are required
For HELP, enter the stored procedures name without any arguments or see the PRINT statements below
NO DELETION WILL OCCUR by default - only duplicate recordset returned. To delete records, pass in a 0 for the DeleteDups argument.

Example: EXEC sp_RemoveDups 'MyTable','LastName,FirstName,HomePhone'

Purpose: Allow removal of duplicate rows where
1. We define what fields qualify the duplicate
2. We select the unique rowid or it is detected automatically else no action takes place

Method: Delete by RowID all duplicate rows except the highest RowID (in alpha-sort sequence)
of each group of duplicates.

DATE BY CHANGE
09-23-2002 Frank Original v1.0
09-23-2002 Frank Changed the name from sp_RemoveDupsByRowID to sp_RemoveDups
10-8-2002 Sean P. O. MacCath-Moran Made @UniqueColName optional
Added logic to auto-detect RowGUID and Identity columns
Added logic to check for unique value column if no RowGUID or Identity column exists
Added logic to create a temporary ID field as a last resort if no unique column could be located
Added HELP

*/

CREATE PROCEDURE sp_RemoveDups
@TableName as varchar(50) = NULL,
@DupQualifierFieldNameList as varchar(200) = NULL,
@DeleteDups as bit = NULL,
@UniqueColName as varchar(50) = NULL,
@CreateIdentColIfNeeded as bit = NULL,
@StoredProcedureResult int = NULL OUTPUT

AS
SET NOCOUNT ON

DECLARE @SQL nvarchar(2000)
DECLARE @SQL_DetermineUniqueTemplate nvarchar(2000)
DECLARE @TempIdentColName varchar(20)
DECLARE @HostName varchar(50)
DECLARE @ActionText varchar(10)

DECLARE @SUM int
DECLARE @COUNT int
DECLARE @NextColumn varchar(50)


/*==================================================================================*/
/*========================VARIABLE INITIALIZATION AND SETUP========================*/
/*=================================================================================*/
/*If no unique column is located then a temporary Identity column will be created useing the name specified in this TempIdentColName string*/
SET @TempIdentColName = 'TempIdentityColXY123'

SET @SQL_DetermineUniqueTemplate = 'SELECT @COUNT = COUNT(Count), @SUM = sum(Count) from '
SET @SQL_DetermineUniqueTemplate = @SQL_DetermineUniqueTemplate + CHAR(13) + '(SELECT TOP 100 PERCENT , COUNT(*) as Count FROM ' + @TableName
SET @SQL_DetermineUniqueTemplate = @SQL_DetermineUniqueTemplate + CHAR(13) + ' GROUP BY ORDER BY ) a'

/*Retrieve the Host Name. This will be used later in this SP as a test to determine if the user is making this call from within SQL Query Analyzer*/
SELECT @HostName = hostname FROM master..sysprocesses WHERE spid = @@SPID AND program_name = 'SQL Query Analyzer'


/*Set ActionText to be used in message output*/
IF (@DeleteDups IS NULL) OR (@DeleteDups = 1)
SET @ActionText = 'Selection'
ELSE
SET @ActionText = 'Deletion'

/*If a value is specified for use by UniqueColName it cannot exist in the columns from the DupQualifierFieldNameList*/
IF CHARINDEX(@UniqueColName, @DupQualifierFieldNameList) > 0
BEGIN
/*The value in UniqueColName was detected in DupQualifierFieldNameList.*/
IF NOT (@HostName IS NULL) PRINT 'The UniqueColName provided (' + @UniqueColName + ') must not exist in DupQualifierFieldNameList (' + @DupQualifierFieldNameList + '). Other solutions will be sought automatically.'
SET @UniqueColName = NULL
END


/*If UniqueColName is provided then perform check to ensure that all the values in that column are, in fact, unique.*/
IF NOT (@UniqueColName IS NULL)
BEGIN
SET @SQL = REPLACE(@SQL_DetermineUniqueTemplate,'', @UniqueColName)
/*Perform a check of this column to determine if all of it's values are unique*/
EXEC sp_executesql @SQL, N'@SUM as int OUTPUT,@COUNT as int OUTPUT',@SUM OUTPUT,@COUNT OUTPUT
/*Test to determine if this column contains unique values*/
If @SUM <> @COUNT
BEGIN
/*The column specified by UniqueColName does not contain unique values.*/
IF NOT (@HostName IS NULL) PRINT 'The UniqueColName provided (' + @UniqueColName + ') does not contain unique values. Other solutions will be sought automatically.'
SET @UniqueColName = NULL
END
END


/*==============================================================*/
/*======================HELP OUTPUT TEXT======================*/
/*==============================================================*/
IF (@TableName IS NULL) OR (@TableName = '/?') OR (@TableName = '?') OR (@DupQualifierFieldNameList IS NULL) OR (@DupQualifierFieldNameList = '/?') OR (@DupQualifierFieldNameList = '?')
BEGIN
IF NOT (@HostName IS NULL)
BEGIN
PRINT 'sp_RemoveDups ''TableName'', ''DupQualifierFieldNameList'', [''DeleteDups''], [''UniqueColName''], [''CreateIdentColIfNeeded''], <''StoredProcedureResult''>'
PRINT '====================================================================================================================================================================='
PRINT 'TableName: Required - String - Name of the table to detect duplicate records in.'
PRINT 'DupQualifierFieldNameList: Required - String - Comma seperated list of columns that make up the unique record within TableName.'
PRINT 'DeleteDups: Optional - Bit, Set to 0 to delete duplicate records. A value of NULL or 1 will return the duplicate records to be deleted.'
PRINT 'UniqueColName: Optional - Bit - A table must have a unique column value in it to perform the deletion logic. If no UniqueColName is provided then an attemp will be made to locate the RowGUID column. If that fails then an attempt will be made to locate the Identity column. If that fails then all of the columns of the table will be examined and the first one with all unique values will be selected.'
PRINT 'CreateIdentColIfNeeded: Optional - Bit - By default this SP will create an identity column if no unique column can be located. Pass in a 1 here to run this feature off.'
PRINT 'StoredProcedureResult: Optional - OUTPUT - Int - Returns a 3 if an error occured, otherwise returns a 0.'
END
SET @StoredProcedureResult = 3
RETURN
END


/*========================================================================*/
/*======================DETECT USABLE UniqueColName======================*/
/*========================================================================*/
IF @UniqueColName IS NULL
BEGIN
/*Check for a RowGUID or Identity column in this table. If one exists, then utilze it as the unique value for the purposes of this deletion*/
IF EXISTS(SELECT * FROM SysColumns WHERE ID = Object_ID(@TableName) and ColumnProperty(ID,Name,'IsRowGUIDCol') = 1) SET @UniqueColName = 'RowGUIDCol'
IF EXISTS(SELECT * FROM SysColumns WHERE ID = Object_ID(@TableName) and ColumnProperty(ID,Name,'IsIdentity') = 1) SET @UniqueColName = 'IdentityCol'

IF @UniqueColName IS NULL
/*If no RowGUID or Identity column was found then check all of the columns in this table to see if one of them can be utilized as a unique value column*/
BEGIN
/*Select all of the columns from the table in question...*/
DECLARE MyCursor CURSOR LOCAL SCROLL STATIC FOR SELECT name FROM syscolumns WHERE OBJECT_ID(@TableName)=ID

OPEN MyCursor
FETCH NEXT FROM MyCursor INTO @NextColumn
WHILE @@fetch_status = 0
BEGIN
/*Create SQL string with correct column name in place.*/
SET @SQL = REPLACE(@SQL_DetermineUniqueTemplate,'', @NextColumn)
/*Perform a check of this column to determine if all of it's values are unique*/
EXEC sp_executesql @SQL, N'@SUM as int OUTPUT,@COUNT as int OUTPUT',@SUM OUTPUT,@COUNT OUTPUT
/*Test to determine if this column contains unique values*/
If @SUM = @COUNT
BEGIN
/*A unique values column is detected. Use it and break out of the loop UNLESS column is specified in DupQualifierFieldNameList*/
IF CHARINDEX(@NextColumn, @DupQualifierFieldNameList) = 0
BEGIN
/*NextColumn was NOT detected in DupQualifierFieldNameList, so this is the column we will use.*/
SET @UniqueColName = @NextColumn
BREAK
END
END
ELSE
FETCH NEXT FROM MyCursor INTO @NextColumn
END
CLOSE MyCursor
DEALLOCATE MyCursor

END
END

/*If no UniqueColName has been found then create one UNLESS @CreateIdentColIfNeeded = 1*/
IF (@UniqueColName IS NULL) AND ( (@CreateIdentColIfNeeded IS NULL) OR (@CreateIdentColIfNeeded = 0) )
BEGIN
/*Add a sequence column to the table...*/
IF NOT (@HostName IS NULL) PRINT 'Creating temporary identity column in the ' + @TableName + ' table named ' + @TempIdentColName + ' for use in this ' + LOWER(@ActionText) + ' process...'
EXEC('ALTER TABLE ' + @TableName + ' ADD ' + @TempIdentColName + ' [int] IDENTITY (1, 1)')
SET @UniqueColName = @TempIdentColName
END


/*============================================================================*/
/*======================EXECUTE DELETION OR SELECTION======================*/
/*===========================================================================*/
IF @UniqueColName IS NULL
BEGIN
/*No UniqueColName was provided by the user and none were detected by the script. This deletion algorythm cannot run.*/
IF NOT (@HostName IS NULL) PRINT 'Could not perform ' + LOWER(@ActionText) + ' process. No unique columns were located and the UniqueColName flag is set to 1 (False).'
SET @StoredProcedureResult = 3
RETURN
END
ELSE
BEGIN
IF NOT (@HostName IS NULL) PRINT 'Performing ' + LOWER(@ActionText) + ' utilizing the unique values in the ' + @UniqueColName + ' column as a reference...'
/*
Create and execute an SQL statement in the form of:

SELECT * (or DELETE)
FROM TableName WHERE UniqueColName IN
(
SELECT UniqueColName FROM TableName WHERE UniqueColName NOT IN
(
SELECT MAX(Cast(UniqueColName AS varchar(36))) FROM TableName GROUP BY DupQualifierFieldNameList, DupQualifierFieldNameList, etc
)
)
*/
/*Delete all duplicate records useing @UniqueColName as a unique ID column */
IF (@DeleteDups IS NULL) OR (@DeleteDups = 1)
SET @SQL = 'SELECT * '
ELSE
SET @SQL = 'DELETE '

SET @SQL = @SQL + 'FROM ' + @TableName + ' WHERE ' + @UniqueColName + ' IN '
SET @SQL = @SQL + CHAR(13) + CHAR(9) + '(' + CHAR(13) + CHAR(9)
SET @SQL = @SQL + 'SELECT ' + @UniqueColName + ' FROM ' + @TableName + ' WHERE ' + @UniqueColName + ' NOT IN '
SET @SQL = @SQL + CHAR(13) + CHAR(9) + CHAR(9) + '(' + CHAR(13) + CHAR(9)+CHAR(9)
SET @SQL = @SQL + 'SELECT MAX(Cast(' + @UniqueColName + ' AS varchar(36))) FROM '
SET @SQL = @SQL + @TableName + ' GROUP BY ' + @DupQualifierFieldNameList
SET @SQL = @SQL + CHAR(13) + CHAR(9) + CHAR(9) + ')' + CHAR(13) + CHAR(9) + ')'

EXEC (@SQL)
IF @@ERROR <> 0
BEGIN
IF NOT (@HostName IS NULL) PRINT @ActionText + ' process failed.'
SET @StoredProcedureResult = 3
END
ELSE
BEGIN
IF NOT (@HostName IS NULL) PRINT @ActionText + ' completed successfully with this SQL: ' + CHAR(13) + @SQL
SET @StoredProcedureResult = 0
END
END


IF (@UniqueColName = @TempIdentColName) AND ( (@CreateIdentColIfNeeded IS NULL) OR (@CreateIdentColIfNeeded = 0) )
BEGIN
/*Remove the sequence column from the table...*/
IF NOT (@HostName IS NULL) PRINT 'Removing temporary identity column named ' + @TempIdentColName + ' from the ' + @TableName + ' table...'
EXEC('ALTER TABLE ' + @TableName + ' DROP COLUMN ' + @TempIdentColName)
END

GO


/*
USAGE

sp_RemoveDups 'TableName', 'DupQualifierFieldNameList', ['DeleteDups'], ['UniqueColName'], ['CreateIdentColIfNeeded'], <'StoredProcedureResult'>
================================================================================================================================================
TableName: Required - String - Name of the table to detect duplicate records in.
DupQualifierFieldNameList: Required - String - Comma seperated list of columns that make up the
unique record within TableName.
DeleteDups: Optional - Bit, Set to 0 to delete duplicate records. A value of NULL or 1 will
return the duplicate records to be deleted.
UniqueColName: Optional - Bit - A table must have a unique column value in it to perform the
deletion logic. If no UniqueColName is provided then an attemp will be made to locate
the RowGUID column. If that fails then an attempt will be made to locate the
Identity column. If that fails then all of the columns of the table will be examined
and the first one with all unique values will be selected.
CreateIdentColIfNeeded: Optional - Bit - By default this SP will create an identity column if
no unique column can be located. Pass in a 1 here to run this feature off.
StoredProcedureResult: Optional - OUTPUT - Int - Returns a 3 if an error occured, otherwise returns a 0.

Simple Syntax:
Use [databaseName]
sp_RemoveDups '[tableName]','[uniqueColName]','[Optional bit flag - set to zero to remove dups, omit to report only]'
*/


Artificial Intelligence stands no chance against Natural Stupidity.
Post #245523
Posted Tuesday, December 20, 2005 11:36 PM
SSC Journeyman

SSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC Journeyman

Group: General Forum Members
Last Login: Tuesday, July 12, 2011 10:00 PM
Points: 96, Visits: 32

Appreciate your alternative method to delete duplicate records.

Post #245756
Posted Friday, March 10, 2006 3:18 AM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Monday, June 24, 2013 7:26 AM
Points: 1, Visits: 14
My two cents about database design and avoiding duplicates - if you have unstructured, flat data, let's say monitoring system data about 20 years worth, in various excel, text and other forms of (electronic) data storage, with about 27 milion lines of data, where the only identifier might be four out of 5 columns (timestamp, status1, status2, measurandID), the only solution for importing you have is to use DTS... And because you have to import the data as fast as possible, you might want to avoid having any kind of indexing in place, let alone constraint checks. So you end up with about 10% duplicates or even more, but you would not care too much, except that your database blows up to 16 GB, after you finish indexing in order to be able to use the data at all.


Post #264753
Posted Friday, March 10, 2006 4:57 AM
Grasshopper

GrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopper

Group: General Forum Members
Last Login: Thursday, July 5, 2007 6:14 AM
Points: 10, Visits: 1

Excellent Script and a very explanation of it!

Thanks!

 

Post #264776
Posted Friday, March 10, 2006 7:10 AM
SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Wednesday, October 31, 2007 8:15 AM
Points: 41, Visits: 2

I totally agree with avoiding cursors, they are fun only if you have a table with limited rows of data.  Try running a cursor with millions of records and you will be waiting forever for it to complete plus the resources dedicated to running the process could be a lot more useful elsewhere.

 

 



Jules Bui
IT Operations DBA
Backup and Restore Administrator
Post #264829
« Prev Topic | Next Topic »

Add to briefcase ««12345»»»

Permissions Expand / Collapse