Weeding through duplicates

  • Our database has a lot of duplicate records (one person with more than one record).  I'd like to write a script so that it pulls non-dupe records and only one of the dupe records for people that have more than one record based on the last time it was modified.  I have a table (data from an external source) that is only one person and one record and I'd like to compare the email addresses from that table to the email address field in a table that our in house app contains (the table with people that have more than one record).

    Here's some arbitrary data of what I'm referring to....


    ID    EMAIL              LAST_UPDATED_DT
    10    jane@abc.com       2011-01-05
    13    jane@abc.com       2012-06-15
    17    john@bcd.com       2011-01-10
    22    jill@cde.com       2015-06-01
    23    jill@cde.com       2011-12-01
    30    joe@def.com        2012-02-01
    32    pat@efg.com        2016-03-02
    39    pat@efg.com        2017-09-12

    For the sake of example, let's assume every single unique email address above is in the other table (data from an external source) that's providing the list of unique email addresses.  That actually won't be the case but I'd rather keep this message on the shorter side.

    This is what I'd like the result of the query to look like...


    ID    EMAIL
    13    jane@abc.com
    17    john@bcd.com
    22    jill@cde.com
    30    joe@def.com
    39    pat@efg.com

    While I've spent time Googling, I can't find a script that does this or examples of scripts to help me get the desired result.  Basically, when the script encounters a group of 2 or more records based on email address being the same, I want it to sort and use the record that was last updated with the most recent date.

    Even if you don't have time to give a sample script, I'd love to learn how to put this together on my own so if I could get some key words on what to search for, I'd appreciate it.  I do have a script that finds dupes and gives a count, I just don't know how to pick a single record out of a grouping.  Also, the script I have only finds dupes and that's not what I want to do in this case.

    TIA,
    John

  • If you know ID is a sequential int that should be pretty straight forward.

    SELECT MAX(ID), EMAIL, MAX(LAST_UPDATED_DT) FROM wherever GROUP BY EMAIL

  • Your going to want to use GROUP BY in a CTE to group by the e-mail address, and select that along with the MAX(LAST_UPDATED_DT).   Then you can join the CTE name to the original table name by e-mail and by LAST_UPDATED_DT = whatever column name you gave the MAX value.   Does that help you get it going?

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • Sorry...I should've posted this in the T-SQL section.

    Anyway, while the ID is a number (stored as a varchar(15) data type), it's not always going to be the largest ID number.  Sometimes it's going to be a smaller ID number. Sorry, I should've reflected that in my example.  I'll go ahead and give this a try then...


    SELECT ID, EMAIL, MAX(LAST_UPDATED_DT) FROM wherever GROUP BY EMAIL

    Thanks,
    John

  • sgmunson - Thursday, March 29, 2018 10:10 AM

    Your going to want to use GROUP BY in a CTE to group by the e-mail address, and select that along with the MAX(LAST_UPDATED_DT).   Then you can join the CTE name to the original table name by e-mail and by LAST_UPDATED_DT = whatever column name you gave the MAX value.   Does that help you get it going?

    Yes, this does help.  I'll give it a try and I'll report back.

  • J M-314995 - Thursday, March 29, 2018 10:15 AM

    Sorry...I should've posted this in the T-SQL section.

    Anyway, while the ID is a number (stored as a varchar(15) data type), it's not always going to be the largest ID number.  Sometimes it's going to be a smaller ID number. Sorry, I should've reflected that in my example.  I'll go ahead and give this a try then...


    SELECT ID, EMAIL, MAX(LAST_UPDATED_DT) FROM wherever GROUP BY EMAIL

    Thanks,
    John

    Don't include ID in the GROUP BY query.   That's going to return all the different ID values and you need to get down to just one e-mail address and it's maximum date.   You can do this:WITH GROUPED_DATA AS (

       SELECT EMAIL, MAX(LAST_UPDATED_DT) AS MAX_UPDATE
       FROM YOUR_TABLE_NAME_HERE
       GROUP BY EMAIL
    )
    SELECT T.ID, G.EMAIL, G.MAX_UPDATE
    FROM GROUPED_DATA AS G
       INNER JOIN YOUR_TABLE_NAME_HERE AS T
          ON G.EMAIL = T.EMAIL;

    Steve (aka sgmunson) 🙂 🙂 🙂
    Rent Servers for Income (picks and shovels strategy)

  • SELECT  ID, EMAIL
    FROM (
         SELECT  ID, EMAIL, MostRecent = ROW_NUMBER() OVER (PARTITION BY EMAIL ORDER BY LAST_UPDATED_DT DESC)
         FROM whatever
    ) m
    WHERE MostRecent = 1

  • Scott Coleman - Thursday, March 29, 2018 10:26 AM

    SELECT  ID, EMAIL
    FROM (
         SELECT  ID, EMAIL, MostRecent = ROW_NUMBER() OVER (PARTITION BY EMAIL ORDER BY LAST_UPDATED_DT DESC)
         FROM whatever
    ) m
    WHERE MostRecent = 1

    Awesome.  Thank you.  I love learning how to do things in multiple ways.

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply