Need help with deleting/updating duplicates in two tables

  • Hello All,

    I am looking for some urgent help. Thanks in advance!

    Below is what I have:

    Tables Info:

    1. Table ABC has below fields
      a. ABCUniqueId (only 1 record/entry for each unique id)
      b. SysGenHash (Unique and should not be duplicate in the table)
      c. CreateDt

    2. Table XYZ has below fields
      a. XYZUniqueId
      b. ParentABCUniqueId (Corresponds to ABCUniqueId)
      c. SysGenCode (Max of 18 distinct values)

    3. ABCUniqueId in Table ABC ties to ParentABCUniqueId in Table XYZ.
    4. For every ABCUniqueId in Table ABC, there could be multiple records in Table XYZ (different SysGenCode for each ABCUniqueId/ParentABCUniqueId).
    5. See sample data attached.

    Issue:

    1. There are duplicate SysGenHash in Table ABC.

    Requirement:

    1. Identify the duplicate SysGenHash values in Table ABC and delete them, retaining only the one record per SysGenHash that has the latest CreateDt.
    In the attached example - SysGenHash 'DUPISSUE1 (ABCUniqueId 2 and 6)', 'DUPISSUE2 (ABCUniqueId 3)' and 'DUPISSUE3 (ABCUniqueId 7)' will be deleted.

    2. For each of the ABCUniqueId in Table ABC that are deleted in step 1 above, update Table XYZ as below:

    a. If SysGenCode for the 'duplicate' ABCUniqueId/ParentABCUniqueId is already existing for the 'retained' ABCUniqueId/ParentABCUniqueId, then delete the record.
    In the attached example - ParentABCUniqueId 2 and SysGenCode 1 is deleted since it already exists under the retained ParentABCUniqueId '8'. Similarly ParentABCUniqueId 6 and SysGenCode 1 is deleted since it already exists under the retained ParentABCUniqueId '8'.

    b. Otherwise, update the 'duplicate' ParentABCUniqueId with 'retained' ParentABCUniqueId.
    In the attached example - ParentABCUniqueId 2 and SysGenCode 21 is updated with retained ParentABCUniqueId '8' since SysGenCode '21' does not exist for retained ParentABCUniqueId '8'. Similarly ParentABCUniqueId 3 - SysGenCode 1, ParentABCUniqueId 6 - SysGenCode 32 and ParentABCUniqueId 7 - SysGenCode 67 are updated with ParentABCUniqueId '10', '8' and '9' respectively.

    What I have tried:

    1. I have tried to join the tables by ABCUniqueId/ParentABCUniqueId and use Cursor for each record.
    2. However, it runs forever and also gives a transport/connection error after a while.

    Is there a way to do this in a simple way?

    Thanks a lot for your inputs!!

    Cursors are inherently slow, so I'm not surprised it's taking a while.

    One way to identify duplicates is by using GROUP BY (Transact-SQL) and HAVING (Transact-SQL). You could use these to create a result set that contains your duplicates, and then apply your logic above.
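    For example, a minimal sketch of that approach (table and column names taken from the original post):

    ```sql
    -- Each SysGenHash that appears more than once in TableABC
    SELECT SysGenHash, COUNT(*) AS DupeCount
    FROM TableABC
    GROUP BY SysGenHash
    HAVING COUNT(*) > 1;
    ```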

    Have a go at using these. If you get stuck, reply and explain where you're having difficulty and show your workings so far 🙂

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

    Using ROW_NUMBER is how I would find the dupes.
    The query I'm using here can be switched to a delete by simply uncommenting the --DELETE and commenting out the SELECT *.

    WITH cteDupes
    AS
    (
        SELECT ROW_NUMBER() OVER (PARTITION BY SysGenHash
                                  ORDER BY CreateDt DESC) AS RW, *  -- keep the row with the latest CreateDt
        FROM [TableABC]
    )
    --DELETE
    SELECT *
    FROM cteDupes
    WHERE RW > 1

    Lowell


    --Help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statements for that table to give the volunteers here representative data. With your description of the problem, we can provide a tested, verifiable solution to your question. Asking the question the right way gets you a tested answer the fastest way possible!

  • Lowell - Wednesday, January 18, 2017 11:56 AM

    Using ROW_NUMBER is how I would find the dupes.
    The query I'm using here can be switched to a delete by simply uncommenting the --DELETE and commenting out the SELECT *.

    WITH cteDupes
    AS
    (
        SELECT ROW_NUMBER() OVER (PARTITION BY SysGenHash
                                  ORDER BY CreateDt DESC) AS RW, *  -- keep the row with the latest CreateDt
        FROM [TableABC]
    )
    --DELETE
    SELECT *
    FROM cteDupes
    WHERE RW > 1

    Thanks!

    I should have mentioned that I have already used PARTITION BY to delete the duplicates in Table ABC. I was using a cursor for Requirement 2, but it is taking too much time.

    Is there any quick/smart way to do Requirement 2?

    Thanks again!

  • Thom A - Wednesday, January 18, 2017 11:54 AM

    Cursors are inherently slow, so I'm not surprised it's taking a while.

    One way to identify duplicates is by using GROUP BY (Transact-SQL) and HAVING (Transact-SQL). You could use these to create a result set that contains your duplicates, and then apply your logic above.

    Have a go at using these. If you get stuck, reply and explain where you're having difficulty and show your workings so far 🙂

    Thank you!

    I have used PARTITION BY to identify the unwanted duplicate ABCUniqueId values. For each of the unwanted duplicates, I have to go and update the second table (Table XYZ). At that point, my issue is that I do not know which 'ABCUniqueId' I retained for the particular duplicate 'ABCUniqueId' I am processing, so I cannot do the checks in Requirements 2a and 2b.

    I have done the below to get all the records from both tables that have a duplicate SysGenHash:

    SELECT a.ABCUniqueId, a.SysGenHash, x.SysGenCode
    FROM TableABC a
    INNER JOIN TableXYZ x ON a.ABCUniqueId = x.ParentABCUniqueId
    WHERE a.SysGenHash IN
        (SELECT SysGenHash
         FROM TableABC
         GROUP BY SysGenHash
         HAVING COUNT(SysGenHash) > 1)

    I have used a cursor to process each record, saving the ABCUniqueId that I am retaining and then using it for the duplicate records to carry out Requirements 2a and 2b.

    However, since the cursor is very slow, any other suggestions you can provide would be of great help.
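    One set-based way to get that duplicate-to-retained mapping, as an untested sketch (it assumes the retained row is the one with the latest CreateDt, per the requirement):

    ```sql
    -- Pair every duplicate ABCUniqueId with the ABCUniqueId that will be retained
    -- (the row with the latest CreateDt for the same SysGenHash).
    WITH Ranked AS
    (
        SELECT ABCUniqueId,
               SysGenHash,
               ROW_NUMBER() OVER (PARTITION BY SysGenHash
                                  ORDER BY CreateDt DESC) AS RW
        FROM TableABC
    )
    SELECT d.ABCUniqueId AS DuplicateId,
           k.ABCUniqueId AS RetainedId
    FROM Ranked d
    INNER JOIN Ranked k
            ON k.SysGenHash = d.SysGenHash
           AND k.RW = 1      -- the keeper
    WHERE d.RW > 1;          -- the duplicates
    ```

    Materialised into a temp table, that mapping is enough to drive Requirements 2a and 2b without a cursor.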

    Thanks again!

    There are a number of ways you could do this. You could get the list of your duplicates first and insert them into a dataset, for example a temporary table. You can then use that to define the records that are deleted in your first table, and then use the same data to apply your logic to your second.

    You could also use Lowell's solution and output that data (using OUTPUT and INTO) into a dataset, then use that dataset for your second piece of logic.
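    A minimal sketch of that OUTPUT ... INTO idea (untested; table and column names are assumed from the thread, and the SysGenHash type is a guess):

    ```sql
    -- Capture the rows being deleted so Table XYZ can be fixed up afterwards.
    CREATE TABLE #Deleted (ABCUniqueId INT, SysGenHash VARCHAR(64));

    WITH cteDupes AS
    (
        SELECT ROW_NUMBER() OVER (PARTITION BY SysGenHash
                                  ORDER BY CreateDt DESC) AS RW, *
        FROM TableABC
    )
    DELETE FROM cteDupes
    OUTPUT deleted.ABCUniqueId, deleted.SysGenHash
    INTO #Deleted (ABCUniqueId, SysGenHash)
    WHERE RW > 1;
    ```

    #Deleted then holds every removed ABCUniqueId, ready for the Table XYZ logic.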

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk

  • Thom A - Thursday, January 19, 2017 1:22 AM

    There are a number of ways you could do this. You could get the list of your duplicates first and insert them into a dataset, for example a temporary table. You can then use that to define the records that are deleted in your first table, and then use the same data to apply your logic to your second.

    You could also use Lowell's solution and output that data (using OUTPUT and INTO) into a dataset, then use that dataset for your second piece of logic.

    Thank you!

    I have the duplicates. But while I am processing each of them to execute the logic for the second table (Table XYZ), how do I know the ABCUniqueId of the retained record for the duplicate that I am processing?

    So here is the pseudo code:

    1. I find the duplicates in TableABC and create a temporary table Table123.
    2. I process the records in Table123 to update Table XYZ for those ABCUniqueId values.
    3. For Requirement 2a - I need to know the ABCUniqueId of the record in TableABC that I retained for each of the duplicates in temporary table Table123. Only then can I determine whether the SysGenCode is already present.
    4. For Requirement 2b - I need to know the ABCUniqueId of the record in TableABC that I retained for each of the duplicates in temporary table Table123. Only then can I determine what to update ParentABCUniqueId to in Table XYZ.

    Please provide any inputs on how I can know, for each record in temporary table Table123, the ABCUniqueId that I retained in Table ABC.

    Thanks in advance!

    What have you got so far (supply your SQL)? Also, now might be a good time to provide your DDL and DML statements. Have a look at the link in my signature, which shows how you should supply these. That way, users on SSC have real data to work with and can give you tested solutions.
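    For what it's worth, once a duplicate-to-retained mapping exists, the whole of Requirement 2 can be done set-based. The following is an untested sketch that assumes the table and column names from the thread; #Map is a hypothetical temp table pairing each duplicate ABCUniqueId with its retained counterpart:

    ```sql
    -- Build the duplicate -> retained mapping (latest CreateDt wins).
    WITH Ranked AS
    (
        SELECT ABCUniqueId, SysGenHash,
               ROW_NUMBER() OVER (PARTITION BY SysGenHash
                                  ORDER BY CreateDt DESC) AS RW
        FROM TableABC
    )
    SELECT d.ABCUniqueId AS DuplicateId,
           k.ABCUniqueId AS RetainedId
    INTO #Map
    FROM Ranked d
    INNER JOIN Ranked k
            ON k.SysGenHash = d.SysGenHash AND k.RW = 1
    WHERE d.RW > 1;

    -- Requirement 2a: delete XYZ rows whose SysGenCode already exists
    -- under the retained parent.
    DELETE x
    FROM TableXYZ x
    INNER JOIN #Map m ON m.DuplicateId = x.ParentABCUniqueId
    WHERE EXISTS (SELECT 1
                  FROM TableXYZ r
                  WHERE r.ParentABCUniqueId = m.RetainedId
                    AND r.SysGenCode = x.SysGenCode);

    -- Requirement 2b: repoint the remaining rows at the retained parent.
    UPDATE x
    SET x.ParentABCUniqueId = m.RetainedId
    FROM TableXYZ x
    INNER JOIN #Map m ON m.DuplicateId = x.ParentABCUniqueId;
    ```

    Run 2a before 2b, and capture the mapping before deleting the duplicates from Table ABC (otherwise the retained/duplicate pairing is lost).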

    Thom~

    Excuse my typos and sometimes awful grammar. My fingers work faster than my brain does.
    Larnu.uk
