Selecting And Merging Duplicate Records In SQL server 2005

  • Hi All,

    I have a requirement to clean up a DB with more than half a million records with duplicates. The challange is i need to return the duplicate records and then merge them, keeping on recod for each record set. The reason for that is that the users have been updating any of the duplicate record, so no data shouldbe lost after the clean up. Is this possible may using using other tools like SSIS or any script?

    Any suggestions welcome.

    Thanks.

  • You can do this using a script, you will probably need to use SELECT DISTINCT and GROUP BY to remove duplicates. Without knowing exactly what you are doing it is hard to give specific advices

    Post your table defs, and some sample data and I am sure someone can help out..

  • You can duplicte records by Count(*)/ Group By or Row_Number.

    But how will you find out the latest update in those duplicate records. Is there any UpdatedOnDateTime column in your tables?

    ----------------------------------------------------------------------------------------------------------------------------------------------------------------------
    Sometimes, winning is not an issue but trying.
    You can check my BLOG
    [font="Arial Black"]here[/font][/url][/right]

  • @phumlo1

    what exactly you want to know?

    1) how to select duplicate records?

    2) How to cleanup your database without keeping duplicate records?

    Please give some more specification.

    "Don't limit your challenges, challenge your limits"

  • There are lots of methods of handling duplicates in SQL. To help you design the best one for your situation, we would need table definitions ("create table..." scripts) and some sample data to put in them (insert... scripts).

    When you say you want to return the duplicates and then merge them, do you mean you need to view them first before you merge them? Or do you mean you need to keep a record of some sort of the duplicates that are being removed? Or something else?

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • I confirm all other posts. More information imply better answers. Maybe have a look to the link in my signature.

    For now a very basic way to handle this:

    DECLARE @t TABLE (DuplicateId INT)

    INSERT INTO @t

    SELECT 1

    UNION ALL SELECT 1

    UNION ALL SELECT 1

    UNION ALL SELECT 2

    UNION ALL SELECT 3

    UNION ALL SELECT 2

    ; WITH

    todo (RowNum) AS

    (

    SELECT

    ROW_NUMBER() OVER (PARTITION BY DuplicateId ORDER BY DuplicateId)

    FROM @t

    )

    DELETE FROM todo WHERE RowNum != 1

    SELECT * FROM @t

    Greets

    Flo

  • Hi All,

    Thanks for all the responses. Here is a sample of my DB:

    Fullname, ID_no, Modifiedon, Membershipno,Adress1, Address2, Postal Code

    Joe Soap, 123, 2008-09-04, 001, 1 sloane str, Hampton, ,

    Joe Soap, 123, 2008-09-06, 001, 1 sloane str, Hampton, 3009

    Marie Smith,456, 2008-10-10, 002, 2 Brite str,,,

    Marie Smith,456, 2008-10-11, 002, 2 Brite str,Hampton,,

    Marie Smith, 465, 2008-10-12,,Brite str,Hampton,3009

    In this sample Joe soap has 2 records, in his case i want to keep the second one because its got more data. In Marie's case she's got

    3 records, all with different data, first one has no address2 and postal code, second has no postal code, third one has got no membershipno.The three records have pieces of data that i need for Maries complete record. I need to get all the pieces into one record(merge) and delete all others.

    Many Thanks.

  • Hi All,

    Apologies for sending the sample data in the format as in my previous post. will send a proper one.

  • This is complex case:

    To resolve this issue you can try the following way--

    1. First identify the duplicate based on your requirement (may not duplicate)

    2. Find the number of records associated with the duplicates.

    3. As per your requirement data may come from duplicate records, so you need to write a temp table and insert all the duplicate records into the temp table and check each column value with respective column of other records and insert the data into final temp table which you require.

    4. Finally delete the duplicate records from main table and insert the records of final table.

    You can do this by using dynamic query.

  • The only way I've found to handle this sort of thing requires several passes at the data. You want the most complete data possible for each duplicate:

    -- Assuming that your table name is MemberShips and that ID_no is really unique

    -- for each name.

    -- This first get the dups

    SELECT

    Fullname, ID_no, MAX(Modifiedon) AS Modifieldon

    , CAST(NULL AS VARCHAR(10)) AS Membershipno

    , CAST(NULL AS VARCHAR(50)) AS Address1

    , CAST(NULL AS VARCHAR(50)) AS Address2

    , CAST(NULL(AS VARCHAR(10)) AS PostalCode

    INTO #DUP

    FROM MemberShips

    GROUP BY

    Fullname, ID_no

    HAVING COUNT(*) > 1

    -- Get tne non-null MembershipNo

    SELECT M.ID_no, M.Membershipno

    INTO #M

    FROM #Dup

    INNER JOIN Memberships M ON

    #Dup.ID_no = M.ID_no

    WHERE

    M.Membershipno IS NOT NULL

    -- Get the non-null Address1

    SELECT

    M.ID_no, M.Address1

    INTO #A1

    FROM #Dup

    INNER JOIN MemberShips M ON

    #Dup.ID_no = M.ID_no

    WHERE

    M.Address1 IS NOT NULL

    -- Do the same for Address2, Postal Code, etc.

    -- Use these tables to update the #Dup table.

    UPDATE #Dup

    SET #Dup.Membershipno = #M.Membershipno

    FROM #Dup

    INNER JOIN #M ON

    #Dup.ID_no = #M.ID_no

    -- Update addresses and so forth.

    -- #Dup should have something for each column if there was one in

    -- the original table.

    -- Delete the dups in the original table.

    DELETE M

    FROM #Dup

    INNER JOIN Memberships M ON

    #Dup.ID_no = M.ID_no

    -- Re-insert the completed records from #Dup

    INSERT INTO Memeberships

    ( Fullname, ID_no, etc.)

    SELECT

    Fullname, IDNo, etc

    FROM #DUP

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply