Validating for potential duplicate records

  • I am writing a script that will import CSV data into a staging table.  Before pushing the data to a Production table I am doing a series for validation of the file.  In part, U want to make sure we are not processing duplicate records.  

    Therefore, I am checking for duplicates:
    1. From within the new file just imported (Is there a duplicate within the file)
    2. Is there already the same record in the staging database?

    If a duplicate record already exists in the database and previously processed successfully, I want to UPDATE the newest records to "D"
    If the new file contains multiple duplicate records within the file, I want to only process the row with the most current Insert Date (Processed Status to "N") while setting the older records to "D"

    Here is my sample data:

      -- DROP TABLE #t
        CREATE TABLE #t (TID int IDENTITY(1,1), EID int, Processed char(1), InsertDt datetime)
        INSERT INTO  #t (EID, Processed, InsertDt) VALUES (10, 'E', '10/15/2018')
        INSERT INTO  #t (EID, Processed, InsertDt) VALUES (10, 'Y', '10/16/2018')
        INSERT INTO  #t (EID, Processed, InsertDt) VALUES (15, 'Y', '10/17/2018 13:03:15')
        INSERT INTO  #t (EID, Processed, InsertDt) VALUES (15, 'N', '11/17/2018 14:01:15')  
        INSERT INTO  #t (EID, Processed, InsertDt) VALUES (15, 'N', '11/01/2018 12:01:15') 
        INSERT INTO  #t (EID, Processed, InsertDt) VALUES (15, 'N', '11/01/2018 12:02:15') 
        INSERT INTO  #t (EID, Processed, InsertDt) VALUES (15, 'N', '11/01/2018 12:03:15') 
       
        SELECT * FROM #t

    My desired results would be:

    TID EID Processed InsertDt
    1 10 E 2018-10-15 00:00:00.000
    2 10 Y 2018-10-16 00:00:00.000
    3 15 Y 2018-10-17 13:03:15.000
    4 15 D 2018-11-17 14:01:15.000
    5 15 D 2018-11-01 12:01:15.000
    6 15 D 2018-11-01 12:02:15.000
    7 15 Y 2018-11-01 12:03:15.000

    MANY THANKS in advance!

  • I think you are over complicating the issue. 
    I would write a procedure to truncate my staging table, insert into the staging table using select distinct from an OpenRowset from the CSV using the BULK driver with a format file. then insert the target table with a left join where target is null.

  • Got it!  Thank you Joe! 🙂

Viewing 3 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply