Deleting Duplicated data

  • soulchyld21

    SSC-Addicted

    Points: 494

    Hi I have a table with millions of records, some duplicates are needed in that they have the same customer details with different dates, how can I structure my query to select and remove only the exact duplicates? As you can probably tell I am a total noob with SQL!

    Thanks

  • Ron McCullough

    SSC Guru

    Points: 63877

    You can use a CTE (Common Table Expression) for example:

    CREATE TABLE #T(Date_Stamp DATETIME,KW1 DECIMAL(5,1), KW2 DECIMAL(5,1))

    INSERT INTO #T

    SELECT '12/10/2010', 5.3, 3.1 UNION ALL

    SELECT '12/10/2010', 5.3, 3.1 UNION ALL

    SELECT '12/9/2010', 4, 2 UNION ALL

    SELECT '12/8/2010', 3, 1 UNION ALL

    SELECT '12/7/2010', 7.4, 5 UNION ALL

    SELECT '12/7/2010', 7.4, 5 UNION ALL

    SELECT '12/7/2010', 7.4, 5

    ;with cte

    as (select row_number() over(partition by Date_Stamp,KW1,KW2 order by Date_Stamp) as rn,

    KW1,KW2

    from #T)

    SELECT * FROM cte

    Once you have tested, and tested this, change the SELECT to

    SELECT * FROM cte WHERE rn > 1. This should return only the duplicate values, and once you are sure of that replace the SELECT statement with a

    DELETE FROM cte WHERE rn > 1.

    AGAIN let me emphasis the need to test and test again to be sure that it is returning those rows that you want to delete.

    Please read:

    http://www.simple-talk.com/sql/t-sql-programming/sql-server-cte-basics/

    if you need to learn more about a CTE. The article contains sample code with which you can practice and understand how the CTE works, before using it in your situation

    Edited after SQLRNNR posting immediately below this one.

    By all means read SQLRNNR suggestion - it is excellent !!!

    If everything seems to be going well, you have obviously overlooked something.

    Ron

    Please help us, help you -before posting a question please read[/url]
    Before posting a performance problem please read[/url]

  • SQLRNNR

    SSC Guru

    Points: 281243

    Here's another article to help understand the process.

    http://jasonbrimhall.info/2011/03/21/dedupe-data-cte/

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw[/url]
    Learn Extended Events

  • SQLFanatic

    SSCommitted

    Points: 1640

    Here's another way you may consider. This approach involves the following steps:

    1) Fetch the duplicates into a temp table. (The customers with same date in this case)

    2) Delete the duplicates from the main table.

    3) Insert back single occurance of records that appear multiple times.

    --The main table (#T)

    CREATE TABLE #T(Date DATETIME,FirstName VARCHAR(15), LastName VARCHAR(15))

    INSERT INTO #T

    SELECT '12/10/2010', 'Thaddeus', 'Jones' UNION ALL

    SELECT '12/10/2010', 'Thaddeus', 'Jones' UNION ALL

    SELECT '12/9/2010', 'Thaddeus', 'Jones' UNION ALL

    SELECT '12/8/2010', 'Hannibal', 'Hayes' UNION ALL

    SELECT '12/7/2010', 'Rachel', 'Mccormick' UNION ALL

    SELECT '12/7/2010', 'Dawn', 'Pinto' UNION ALL

    SELECT '12/7/2010', 'Dawn', 'Pinto'

    --This table (#DUPS) will hold the records that appear multiple times in the main table

    CREATE TABLE #DUPS (Date DATETIME, FirstName VARCHAR(15), LastName VARCHAR(15))

    INSERT INTO #DUPS

    SELECT Date, FirstName, LastName

    FROM #T

    GROUP BY Date,FirstName, LastName

    HAVING COUNT(*) > 1

    /*Select * from #DUPS --will return 12/7/2010,Dawn Pinto and 12/10/2010,Thaddeus Jones as they appear more than once in the main table*/

    Now, let us delete all the records from the main table that appear multiple times. If you have millions of records in your table you might want to do a batch delete to avoid log issues.

    --Delete all the records from the main table that appear multiple times

    DELETE a

    FROM #T a

    INNER JOIN #DUPS b

    ON a.DATE = b.DATE

    AND a.FirstName = b.FirstName

    AND a.LastName = b.LastName

    Once, we have deleted all the duplicates with this method, we have also deleted the records that we want to occur only a single time. So, we'll insert those records back into the main table from our #DUPS table.

    INSERT INTO #T SELECT * FROM #DUPS

    SELECT * FROM #T

    DROP TABLE #T

    DROP TABLE #DUPS

    Finally, we have all the unique records in our main table including the same customers having different dates. 🙂

    Date FirstName LastName

    2010-12-07 00:00:00.000 Dawn Pinto

    2010-12-10 00:00:00.000 Thaddeus Jones

    2010-12-09 00:00:00.000 Thaddeus Jones

    2010-12-08 00:00:00.000 Hannibal Hayes

    2010-12-07 00:00:00.000 Rachel Mccormick

    Again, testing is the key for achieving smart solutions. 😉

    --

    "The greatest ignorance is being proud of your learning"

    ________________________________________________________________
    "The greatest ignorance is being proud of your learning"

Viewing 4 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply