Duplicates

Question

Duplicates

Viewing 15 posts - 16 through 30 (of 30 total)

You must be logged in to reply to this topic. Login to reply

LoudClear SSC Eights! Points: 999 More actions · Answer 1

On a side note, the question was not assigned a Category. I now have 2 points in category NULL, and 2 in category "No Category". I am not sure which are from today.

Trevor Ball SSCommitted Points: 1854 More actions · Answer 2

Trevor Ball

SSCommitted

Points: 1854

February 4, 2015 at 8:20 am

#1775335

Thanks...I learned something today. 🙂

Ken SSCoach Points: 17191 More actions · Answer 3

Thanks for the question Steve. I use queries like this a lot for this same sort of cleanup.

Mick Fitz Hall of Fame Points: 3149 More actions · Answer 4

h.tobisch (2/4/2015)
You want to select all but the newest record, so,
why can't you say that ?

+2 😉

jan.heimdal Valued Member Points: 61 More actions · Answer 5

This seems simpler to me:

DELETE --or SELECT *

FROM #DuplicateRow

WHERE JobTitle = 'Editor' and datemodified < (SELECT MAX(datemodified) FROM #DuplicateRow WHERE JobTitle = 'Editor')

antony-688446 Ten Centuries Points: 1221 More actions · Answer 6

jan.heimdal (2/4/2015)
This seems simpler to me:
DELETE --or SELECT *
FROM #DuplicateRow
WHERE JobTitle = 'Editor' and datemodified < (SELECT MAX(datemodified) FROM #DuplicateRow WHERE JobTitle = 'Editor')

This returns all but the last editor. That isn't what the QOTD is trying to do. This is trying to remove all duplicate editor records.

Also, for this query to work (i.e. get the last date modified for each distinct editor), you will end up with a poor execution plan - the sub-select will run for every row in the table. That means you get a query time proportional to the number of rows squared. Not efficient.

Andre Ranieri SSCrazy Points: 2819 More actions · Answer 7

Great questions. I love working with windowing functions.

SQLRNNR SSC Guru Points: 281334 More actions · Answer 8

Koen Verbeeck (2/4/2015)
Nice question, thanks Steve.
Easy one as I use this almost daily 🙂

Maybe not daily for me, but frequently enough.

Jason...AKA CirqueDeSQLeil
_______________________________________________
I have given a name to my pain...MCM SQL Server, MVP
SQL RNNR
Posting Performance Based Questions - Gail Shaw[/url]
Learn Extended Events

Chris Stewart-397033 SSCrazy Points: 2040 More actions · Answer 9

jan.heimdal (2/4/2015)
This seems simpler to me:
DELETE --or SELECT *
FROM #DuplicateRow
WHERE JobTitle = 'Editor' and datemodified < (SELECT MAX(datemodified) FROM #DuplicateRow WHERE JobTitle = 'Editor')

Hmm, I'm agreeing more with this response. Since there is no uniquely identifying value in the main table, there is no way for the "correct" response to tell you which rows it can delete, as all it returns is the (non-unique) LName and the arbitrarily-assigned "cnt" (via Row_Number()). If you tried to delete the "dupes" and leave only the "( 'Jones', 'Editor', '4/1/2008' )" row, there is no way as there are multiple "Jones" records in the result:

LNamecnt

Knight2

Smith3

Jones4

Smith5

Jones6

Using any of these to DELETE the dupes would remove all rows in the table. Using the query above from jan.heimdal you won't see which record is going to be kept, but all it returns will be deleted.

Side note: I'm wondering if the answer was listed as intended? The column alias for the subquery is "cnt" which suggests a COUNTed value, but "ROW_NUMBER()" was used instead. Was a COUNT intended? Or should the column alias have been "rownum"?

h.tobisch SSCommitted Points: 1671 More actions · Answer 10

One word about the term DUPLICATE.

This generally means two entities having identical attributes, as in DUPLICATE KEY.

This also means that in removing duplicates you may choose any of the duplicate entities as the survivor.

So in this exampple it is not "duplicates" but instead one might say :

remove any but the most recent editor (what if there are two of them ?) .

psingla Hall of Fame Points: 3840 More actions · Answer 11

Thanks Steve...

Just to add one point here:

Windows functions can be used directly in SELECT and ORDER BY Clauses only..

Pramod
SQL Server DBA | MCSE SQL Server 2012/2014
in.linkedin.com/in/pramodsingla/
http://pramodsingla.wordpress.com/

pmadhavapeddi22 SSCertifiable Points: 5306 More actions · Answer 12

Though I use the row_number() to remove the duplicates, I got it wrong.. 🙁

Thanks for the question

akljfhnlaflkj SSC Guru Points: 76202 More actions · Answer 13

akljfhnlaflkj

SSC Guru

Points: 76202

February 5, 2015 at 7:57 am

#1775703

Thanks. I need to use Partition By more often.

TomThomson SSC Guru Points: 104773 More actions · Answer 14

I think the only sensible answer to this question is "None of the above", which isn't an option. I think that because none of the queries return any information that is useful for the process of deleting the unwanted rows - but the last query could be trivially modified to return useful information:-

SELECT lname, jobtitle, datemodified FROM (

SELECT lname, jobtitle, datemodified, ROW_NUMBER() OVER (

PARTITION BY jobtitle ORDER BY datemodified DESC

) AS 'cnt'

FROM #DuplicateRow

) a WHERE cnt > 1 ;

In an earlier comment jan heimdahl provided an alternative way of doing this. And here's a query that does it (using jan's method) for what I think is the least convoluted interpretation of the very badly worded question (which is not Steve's intended interpretation - my mind wasn't quite convoluted enough to arrive at that):

with MaxDates (LName, JobTitle, MaxDate) as (

select LName, JobTitle, MAX(DateModified) from #DuplicateRow group by JobTitle, LName )

select D.LName, D.JobTitle, D.DateModified from #DuplicateRow D

inner join MaxDates M

ON D.LName = M.LName and D.JobTitle = M.JobTitle and D.DateModified < M.MaxDate ;

Alternatively, using ROW_NUMBER()

-- to get the same result as above

SELECT lname, jobtitle, datemodified FROM (

SELECT lname, jobtitle, datemodified, ROW_NUMBER() OVER (

PARTITION BY jobtitle, lname ORDER BY datemodified DESC

) AS 'cnt'

FROM #DuplicateRow

) a WHERE cnt > 1 ;

The trouble with the question was that I couldn't work out what it meant. It asked about removing duplicates from a table without any duplicates in it.

The mention of latest editor suggested that the date modified column should be disregarded in considering what was a duplicate, which would mean that two of the Jones rows were deleted but still leave total ambiguity as to what is intended for non-Jones rows:

maybe one Smith row is to be deleted because when the date column is disregared the two Smith rows are the same; maybe all three non-Jones records should be deleted because thename colums is also to be disregarded when considering duplicates - but that would mean that the question phrasing was very convoluted and confusing, as if this were a test on understanding English written by people incapable of writing clearly instead of a test of SQL knowledge (which appears to have been the case); or

maybe no non-Jones records should be deleted because thay are not duplicates of the latest editor record even disregarding the datemodified column - but that too means the phrasing is pretty convoluted.

Then only 2 of the queries were actually T-SQL queries - the others were not valid T-SQL; and neither of the valid queries returned anything remotely useful - trying to use the resulting information to choose which rows to delete would probably, as others have pointed out, end in disaster since the information was insufficient to identify any rows.

Tom

TomThomson SSC Guru Points: 104773 More actions · Answer 15

Of course it should be pointed out that the "incorrect" first option is just as easily modified to provide useful information as is the "correct" last option, so should be considered just as correct as it. The only difference is that the subsequent activity has to deal with cnt>1 in one case, while in the other case that's already done.

--option 1 fixed to provide useful information

SELECT lname, jobtitle, datemodified, ROW_NUMBER() OVER (PARTITION BY jobtitle ORDER BY datemodified DESC) AS 'cnt' FROM #DuplicateRow ;

Tom