Duplicates

  • On a side note, the question was not assigned a Category. I now have 2 points in category NULL, and 2 in category "No Category". I am not sure which are from today.

  • Thanks...I learned something today. 🙂

  • Thanks for the question Steve. I use queries like this a lot for this same sort of cleanup.

  • h.tobisch (2/4/2015)


    You want to select all but the newest record, so,

    why can't you say that ?

    +2 😉

  • This seems simpler to me:

    DELETE --or SELECT *

    FROM #DuplicateRow

    WHERE JobTitle = 'Editor' and datemodified < (SELECT MAX(datemodified) FROM #DuplicateRow WHERE JobTitle = 'Editor')

  • jan.heimdal (2/4/2015)


    This seems simpler to me:

    DELETE --or SELECT *

    FROM #DuplicateRow

    WHERE JobTitle = 'Editor' and datemodified < (SELECT MAX(datemodified) FROM #DuplicateRow WHERE JobTitle = 'Editor')

    This returns all but the last editor. That isn't what the QOTD is trying to do. This is trying to remove all duplicate editor records.

    Also, for this query to work (i.e. get the last date modified for each distinct editor), you will end up with a poor execution plan - the sub-select will run for every row in the table. That means you get a query time proportional to the number of rows squared. Not efficient.

  • Great questions. I love working with windowing functions.

  • Koen Verbeeck (2/4/2015)


    Nice question, thanks Steve.

    Easy one as I use this almost daily 🙂

    Maybe not daily for me, but frequently enough.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • jan.heimdal (2/4/2015)


    This seems simpler to me:

    DELETE --or SELECT *

    FROM #DuplicateRow

    WHERE JobTitle = 'Editor' and datemodified < (SELECT MAX(datemodified) FROM #DuplicateRow WHERE JobTitle = 'Editor')

    Hmm, I'm agreeing more with this response. Since there is no uniquely identifying value in the main table, there is no way for the "correct" response to tell you which rows it can delete, as all it returns is the (non-unique) LName and the arbitrarily-assigned "cnt" (via Row_Number()). If you tried to delete the "dupes" and leave only the "( 'Jones', 'Editor', '4/1/2008' )" row, there is no way as there are multiple "Jones" records in the result:

    LNamecnt

    Knight2

    Smith3

    Jones4

    Smith5

    Jones6

    Using any of these to DELETE the dupes would remove all rows in the table. Using the query above from jan.heimdal you won't see which record is going to be kept, but all it returns will be deleted.

    Side note: I'm wondering if the answer was listed as intended? The column alias for the subquery is "cnt" which suggests a COUNTed value, but "ROW_NUMBER()" was used instead. Was a COUNT intended? Or should the column alias have been "rownum"?

  • One word about the term DUPLICATE.

    This generally means two entities having identical attributes, as in DUPLICATE KEY.

    This also means that in removing duplicates you may choose any of the duplicate entities as the survivor.

    So in this exampple it is not "duplicates" but instead one might say :

    remove any but the most recent editor (what if there are two of them ?) .

  • Thanks Steve...

    Just to add one point here:

    Windows functions can be used directly in SELECT and ORDER BY Clauses only..

    Pramod
    SQL Server DBA | MCSE SQL Server 2012/2014

    in.linkedin.com/in/pramodsingla/
    http://pramodsingla.wordpress.com/

  • Though I use the row_number() to remove the duplicates, I got it wrong.. 🙁

    Thanks for the question

  • Thanks. I need to use Partition By more often.

  • I think the only sensible answer to this question is "None of the above", which isn't an option. I think that because none of the queries return any information that is useful for the process of deleting the unwanted rows - but the last query could be trivially modified to return useful information:-

    SELECT lname, jobtitle, datemodified FROM (

    SELECT lname, jobtitle, datemodified, ROW_NUMBER() OVER (

    PARTITION BY jobtitle ORDER BY datemodified DESC

    ) AS 'cnt'

    FROM #DuplicateRow

    ) a WHERE cnt > 1 ;

    In an earlier comment jan heimdahl provided an alternative way of doing this. And here's a query that does it (using jan's method) for what I think is the least convoluted interpretation of the very badly worded question (which is not Steve's intended interpretation - my mind wasn't quite convoluted enough to arrive at that):

    with MaxDates (LName, JobTitle, MaxDate) as (

    select LName, JobTitle, MAX(DateModified) from #DuplicateRow group by JobTitle, LName )

    select D.LName, D.JobTitle, D.DateModified from #DuplicateRow D

    inner join MaxDates M

    ON D.LName = M.LName and D.JobTitle = M.JobTitle and D.DateModified < M.MaxDate ;

    Alternatively, using ROW_NUMBER()

    -- to get the same result as above

    SELECT lname, jobtitle, datemodified FROM (

    SELECT lname, jobtitle, datemodified, ROW_NUMBER() OVER (

    PARTITION BY jobtitle, lname ORDER BY datemodified DESC

    ) AS 'cnt'

    FROM #DuplicateRow

    ) a WHERE cnt > 1 ;

    The trouble with the question was that I couldn't work out what it meant. It asked about removing duplicates from a table without any duplicates in it.

    The mention of latest editor suggested that the date modified column should be disregarded in considering what was a duplicate, which would mean that two of the Jones rows were deleted but still leave total ambiguity as to what is intended for non-Jones rows:

    maybe one Smith row is to be deleted because when the date column is disregared the two Smith rows are the same; maybe all three non-Jones records should be deleted because thename colums is also to be disregarded when considering duplicates - but that would mean that the question phrasing was very convoluted and confusing, as if this were a test on understanding English written by people incapable of writing clearly instead of a test of SQL knowledge (which appears to have been the case); or

    maybe no non-Jones records should be deleted because thay are not duplicates of the latest editor record even disregarding the datemodified column - but that too means the phrasing is pretty convoluted.

    Then only 2 of the queries were actually T-SQL queries - the others were not valid T-SQL; and neither of the valid queries returned anything remotely useful - trying to use the resulting information to choose which rows to delete would probably, as others have pointed out, end in disaster since the information was insufficient to identify any rows.

    Tom

  • Of course it should be pointed out that the "incorrect" first option is just as easily modified to provide useful information as is the "correct" last option, so should be considered just as correct as it. The only difference is that the subsequent activity has to deal with cnt>1 in one case, while in the other case that's already done.

    --option 1 fixed to provide useful information

    SELECT lname, jobtitle, datemodified, ROW_NUMBER() OVER (PARTITION BY jobtitle ORDER BY datemodified DESC) AS 'cnt' FROM #DuplicateRow ;

    Tom

Viewing 15 posts - 16 through 30 (of 30 total)

You must be logged in to reply to this topic. Login to reply