Copying data and deleting duplicates

  • Hi all

    I have a table structure with a 16 digit numeric ID field, two 2000 character varchar fields, and a timestamp. Because of an error in a webpage somewhere (as yet undiscovered), instead of having one row with a distinct ID field and comments in two varchar fields, the comments have been split into two rows, both with the same ID, but with different timestamps.

    so instead of seeing:

    ld_id, ld_comments1, ld_comments2, timestamp

    123456789, les was here, and here too, 0x0000000001ECAC49

    I see:

    ld_id, ld_comments1, ld_comments2, timestamp

    123456789, NULL, and here too, 0x0000000001ECAC49

    123456789, les was here, NULL, 0x0000000001ECAC4A

    I can see what needs to be done to fix the data (*I think*) - find the rows where the ids match, then identify the row with the larger timestamp and insert the comment from comments1 into the row with the same ID but lesser timestamp. This leaves us with two identical rows (excepting the timestamp). Finally deleting one of the (now) duplicated rows.

    Any ideas how to crack this one peeps?

    One note, not all IDs are duplicates, some have saved fine and would not need adjusting.

    Any help appreciated as always

  • Use cte finding the max time stamp into one table and the min timestamp into another. Use COALESCE to insert into a temp table before updating the original

    Unfortunately I do not have time to come up with an example.

  • I'll have a shot at the UPDATE part of this:

    CREATE TABLE #Test(ID int,Comments1 varchar(2000),Comments2 varchar(2000),Stamp TimeStamp)

    INSERT INTO #Test(ID,Comments1,Comments2)

    SELECT 123456789, NULL, 'and here too'

    UNION ALL

    SELECT 123456789,'les was here',NULL;

    WITH CTE(ID,Comment) AS(

    SELECT ID,Comments1

    FROM #Test

    WHERE Comments2 IS NULL)

    UPDATE #Test

    SET Comments1 = Comment

    FROM #Test

    INNER JOIN CTE

    ON #Test.ID = CTE.ID

    WHERE Comments1 IS NULL

    SELECT * FROM #Test

    This will generate a test version of your table, for a proof-of-concept test, and make the UPDATE by generating a CTE that pulls all of the non-null Comments1 entries, joins the CTE to the original table, and sets the NULL Comments1 values to the non-null value for each ID.

    I'd recommend adding in more data to this test case (preferably, taking a good chunk of your actual table via SELECT statements, and dropping that into #Test), to make sure this will do what you want; test extensively first!

    After this is done, assuming all goes well, you'll have easily identifiable duplicates, as they'll still have NULL values in Comments2; those are the ones to be deleted, so DELETE FROM

    WHERE Comments2 IS NULL should work. Again, be sure to test before committing to this.

    EDIT: Whoops, identified the wrong column in the DELETE portion. Fixed.

    - 😀

  • this sql is really ugly but if you execute it on the table described above it will give the dupes combined into one row for each.

    using it you insert them to temp table. use the dupe ids table to join on main table and delete them all. then insert the scrubbed from the temp table. hope that makes sense. Proceed at your own risk, test, and back up that table! 🙂

    ;WITH bad_recs --(application_key_value, rating_date, new_rating, Ranking)

    AS

    (

    SELECT

    ID,

    Comments1,

    Comments2

    , Stamp,

    Ranking = DENSE_RANK() OVER(PARTITION BY ID ORDER BY NEWID() ASC)

    FROM #Test

    )

    SELECT

    t1.id,

    isnull(t1.Comments1,t2.Comments1) Comments1,

    isnull(t1.Comments2,t2.Comments2) Comments2,

    t1.stamp

    from #Test t1

    join #Test t2 on t1.id = t2.id and t1.Stamp<>t2.Stamp

    --where t1.id in

    join (

    select id,MAX(stamp) stamp FROM bad_recs

    WHERE Ranking > 1 group by id

    ) dupes on dupes.id = t1.id and t1.stamp = dupes.stamp

  • Thanks guys - I'm going to experiment with both of these queries today, I'll obviously report back on their success (feel the confidence 😀 )

    And thanks for taking the time to think it through, I'm usually very poor at articulating what I'm trying to achieve.

  • hisakimatama (3/13/2014)


    I'll have a shot at the UPDATE part of this:

    This will generate a test version of your table, for a proof-of-concept test, and make the UPDATE by generating a CTE that pulls all of the non-null Comments1 entries, joins the CTE to the original table, and sets the NULL Comments1 values to the non-null value for each ID.

    I'd recommend adding in more data to this test case (preferably, taking a good chunk of your actual table via SELECT statements, and dropping that into #Test), to make sure this will do what you want; test extensively first!

    After this is done, assuming all goes well, you'll have easily identifiable duplicates, as they'll still have NULL values in Comments2; those are the ones to be deleted, so DELETE FROM

    WHERE Comments2 IS NULL should work. Again, be sure to test before committing to this.

    EDIT: Whoops, identified the wrong column in the DELETE portion. Fixed.

    Thanks hisakimatama, this has worked a treat. I now have my own copy of the real table with the problem rows updated.

    I don't think I can just use the null value in comments2 to delete however since there are some cases of a truly distinct row where there are no comments been entered by choice.

  • digitalox (3/13/2014)


    this sql is really ugly but if you execute it on the table described above it will give the dupes combined into one row for each.

    using it you insert them to temp table. use the dupe ids table to join on main table and delete them all. then insert the scrubbed from the temp table. hope that makes sense. Proceed at your own risk, test, and back up that table! 🙂

    ;WITH bad_recs --(application_key_value, rating_date, new_rating, Ranking)

    AS

    (

    SELECT

    ID,

    Comments1,

    Comments2

    , Stamp,

    Ranking = DENSE_RANK() OVER(PARTITION BY ID ORDER BY NEWID() ASC)

    FROM #Test

    )

    SELECT

    t1.id,

    isnull(t1.Comments1,t2.Comments1) Comments1,

    isnull(t1.Comments2,t2.Comments2) Comments2,

    t1.stamp

    from #Test t1

    join #Test t2 on t1.id = t2.id and t1.Stamp<>t2.Stamp

    --where t1.id in

    join (

    select id,MAX(stamp) stamp FROM bad_recs

    WHERE Ranking > 1 group by id

    ) dupes on dupes.id = t1.id and t1.stamp = dupes.stamp

    Took a little time to try and understand this one, but cracked it, created new table for the join and deleted from the main table based on that...

    ..and bingo. Copy Data Fixed.

    Now I just need to wait until the system is not being used and fix the real data.

    Thank you both very much

  • l.danks (3/14/2014)


    hisakimatama (3/13/2014)


    I'll have a shot at the UPDATE part of this:

    This will generate a test version of your table, for a proof-of-concept test, and make the UPDATE by generating a CTE that pulls all of the non-null Comments1 entries, joins the CTE to the original table, and sets the NULL Comments1 values to the non-null value for each ID.

    I'd recommend adding in more data to this test case (preferably, taking a good chunk of your actual table via SELECT statements, and dropping that into #Test), to make sure this will do what you want; test extensively first!

    After this is done, assuming all goes well, you'll have easily identifiable duplicates, as they'll still have NULL values in Comments2; those are the ones to be deleted, so DELETE FROM

    WHERE Comments2 IS NULL should work. Again, be sure to test before committing to this.

    EDIT: Whoops, identified the wrong column in the DELETE portion. Fixed.

    Thanks hisakimatama, this has worked a treat. I now have my own copy of the real table with the problem rows updated.

    I don't think I can just use the null value in comments2 to delete however since there are some cases of a truly distinct row where there are no comments been entered by choice.

    Ah, gotcha. Glad we could help you work the problem out, though! 🙂

    - 😀

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply