RE: Delete duplicate rows from ANY table.

SSC Guru

Points: 100950

July 31, 2015 at 4:52 pm

Lynn Pettis (7/31/2015)
ScottPletcher (7/31/2015)
ZZartin (7/31/2015)
ben.brugman (7/31/2015)
Sean Lange (7/31/2015)
You should avoid the sp_ prefix on your procedures. It is reserved for MS and denotes it as a system procedure. The performance issues are mostly corrected but ambiguity and potential issues are still there. My preference would be to drop the prefix entirely but if you insist on them something like usp_ would be better.
That being said I would be interested in seeing the script you came up with.
Normally when I build a fairly Generic Stored Procedure, I store it in the master and use:
sp_ms_marksystemobject '<Procedure_Name, sysname, ProcedureName>' -- System sp
To make it a 'system procedure', so it is available from all databases.
As I understand this is correct usage. (???)
Sean Lange (7/31/2015)
ben.brugman (7/31/2015)
Why so many tables without primary keys.
Not many and most of them have primairy keys. (But not implemented primairy keys).
Not sure what that means. Your table either does or does not a primary key. Are you saying they are only logical primary keys and not defined as primary keys? Does those tables at least have a clustered index?
Primary Key is a conceptual idea, so it is there in the logical design.
In the implementation of a table, a primairy key can be implemented, or another surrogate key can be choosen for an implementation. There are even plenty of databases where a primary key can not be implemented.
I like to differentiate between a logical design and an implementation. For me the Primairy key belongs to the logical design, not implementing is an option. (Clustering/indexes are all parts of the implementation, not of the logical design).
The delete option of duplicates is mostly for short lived tables. (import/test etc.) Limited number of rows, no identities, no constraints.
Sean Lange (7/31/2015)
That being said I would be interested in seeing the script you came up with.
This will take some time, but I'll try to come up with a stored procedure which deletes duplicate rows of ANY table.
All thanks for the support an stimulating me in finding a solution.
Ben
Just going to say you're now doing a bunch of work because you chose not to enforce the primary key that could have been entirely avoided. And while you're right a primary key is a logical constraint it should be enforced somewhere, whether that's in the DB or in the application, but since you have users editing data directly in the database and doing so in a sloppy manner.....
Not necessarily. A pro forma primary key, such as identity, offers nothing in the way of actually preventing true data duplication. In fact, it's far easier to get duplicate data that way. That's why identity should be avoided unless it's absolutely necessary, and instead an actual data-based clustering key should be specified whenever possible.
Works great until that primary data based key is no longer usable as a primary key due to changing system requirements. Been there, done that. Like the idea, but still like to create a SID key for a PK and code the data based key as an alternate (unique, not null) key so that if requirements change and a new data based "PK" needs to be identified we at least still have a means of identifying unique data.
And in actuality, the SID should not be exposed to the users. What they don't see won't hurt them.

The minor point is that it's additional overhead to gen and maintain an identity and its corresponding index that really isn't needed. The major point is that the vast majority of time, for many if not most people, the identity is instead made the PK and the clustering key, because it's "narrow and ever increasing". Identity has become a de facto "default" clustering key, when in fact no such thing should exist.

SQL DBA,SQL Server MVP(07, 08, 09) A socialist is someone who will give you the shirt off *someone else's* back.