Check for duplicates before inserting

  • I have an SSIS job that is failing because it inserts duplicate values into a primary key.

    The issue is NOT with values that were already in the database before the job started; I am already doing a merge join and conditional split to remove records that exist in the database. The issue is with values that are duplicated within the CSV file itself.

    I want records that share the same value in a column to be split off into a different output and written to a new CSV file, while records with unique values go into the database.

    Is there a transformation that will do that? Otherwise, what is the easiest way?

  • One option, which I've used in the past, is to not use fast load. Then you can use the error output on your destination to process those rows however you want.

  • Interesting answer. Just to be clear, that means that the first record with the non-unique value goes into the database, and the second goes to the error output, correct?

    Yes, that is how it should work. The first row is inserted successfully because it won't violate the PK, and the next row with the same PK value will fail and be sent to the error output.

    A concern I'd have is: if the PK values are duplicated but other columns are different, how do you know which row is the correct one? Is there some other logic that should be applied to determine which row should be inserted?
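
    For instance, if the "correct" row were defined as the most recently updated one, that rule could be expressed against a staging table roughly like this (a sketch only; stg.Customer, dbo.Customer, CustomerID, CustomerName and LastUpdated are all hypothetical names):

        -- Rank each version of a PK value by a business rule (here: newest LastUpdated wins)
        WITH ranked AS
        (
            SELECT  CustomerID,
                    CustomerName,
                    LastUpdated,
                    ROW_NUMBER() OVER (PARTITION BY CustomerID
                                       ORDER BY LastUpdated DESC) AS rn
            FROM    stg.Customer
        )
        -- rn = 1 is the row that gets inserted; rn > 1 are the losing duplicates
        INSERT INTO dbo.Customer (CustomerID, CustomerName, LastUpdated)
        SELECT  CustomerID, CustomerName, LastUpdated
        FROM    ranked
        WHERE   rn = 1;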

    I didn't put this option in my first response because I thought you might be looking for the fastest way to move forward. The option I'd really want to use is to pass the data through a gated process where you apply business rules in stages and only allow rows that pass the rules through. I currently do this with three gates that I call:

    1. Raw - just getting the data into a table. No data typing or constraints

    2. Transform - apply basic rules like data types, required fields, etc...

    3. Conform - apply more advanced rules like uniqueness, lookup validations, lookup transformations, etc...

    At each stage/gate, any rows that fail a rule are recorded in a violations table along with the rule that was violated; a row can violate multiple rules. This type of process lets you have a rule that determines what really is a unique row, and if you have rows with duplicate business keys but different values in other columns, it lets you apply logic to determine which row is the correct one.
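
    As a rough sketch of how the Conform gate's uniqueness rule might record violations and let clean rows through (every object name here, etl.RuleViolation, etl.TransformCustomer and etl.ConformCustomer, is hypothetical, and a real rule set would be wider than this):

        -- Hypothetical violations table
        CREATE TABLE etl.RuleViolation
        (
            ViolationID int          IDENTITY(1, 1) PRIMARY KEY,
            SourceRowID int          NOT NULL,      -- key of the offending row in the Transform table
            RuleName    varchar(100) NOT NULL,
            LoggedAt    datetime2(0) NOT NULL DEFAULT SYSDATETIME()
        );

        -- Uniqueness rule: log every row whose business key appears more than once
        INSERT INTO etl.RuleViolation (SourceRowID, RuleName)
        SELECT  RowID, 'DuplicateBusinessKey'
        FROM    etl.TransformCustomer
        WHERE   CustomerID IN (SELECT CustomerID
                               FROM   etl.TransformCustomer
                               GROUP BY CustomerID
                               HAVING COUNT(*) > 1);

        -- Only rows with no recorded violations pass through the gate
        INSERT INTO etl.ConformCustomer (RowID, CustomerID, CustomerName)
        SELECT  t.RowID, t.CustomerID, t.CustomerName
        FROM    etl.TransformCustomer AS t
        WHERE   NOT EXISTS (SELECT 1
                            FROM   etl.RuleViolation AS v
                            WHERE  v.SourceRowID = t.RowID);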

  • I'm with Jack on this one. A duplicate, to me, is one where all column values are equal; if you mean duplicate key fields, then the issue is which row is the correct one. In the latter case I would run logic to identify these rows and move them all to another stream, away from the main data flow, until you can determine which row version is the correct one, either manually or through a waterfall process.
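
    As a rough illustration of that split (in SSIS this could be something like an Aggregate joined back to the detail rows feeding a Conditional Split; the staging table and column names below, stg.Customer, CustomerID and CustomerName, are only placeholders):

        -- Count how many times each key appears in the staged file
        SELECT  CustomerID,
                CustomerName,
                COUNT(*) OVER (PARTITION BY CustomerID) AS KeyCount
        INTO    #flagged
        FROM    stg.Customer;

        -- Main stream: keys that appear exactly once go on to the destination
        SELECT CustomerID, CustomerName
        FROM   #flagged
        WHERE  KeyCount = 1;

        -- Side stream: every version of a duplicated key is held back until the correct one is chosen
        SELECT CustomerID, CustomerName
        FROM   #flagged
        WHERE  KeyCount > 1;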


  • Jack Corbett (11/24/2015)


    Yes, that is how it should work. The first row is inserted successfully because it won't violate the PK, and the next row with the same PK value will fail and be sent to the error output. [...]

    So, for using the error output and avoiding fast load... assuming this happens on every import, you will get records in the error output. However, you won't necessarily know that they failed because they were duplicates, so you can't just dismiss them that easily, is my thinking. Would this be the case?


  • MMartin1 (12/3/2015)


    So, for using the error output and avoiding fast load... assuming this happens on every import, you will get records in the error output. However, you won't necessarily know that they failed because they were duplicates, so you can't just dismiss them that easily, is my thinking. Would this be the case?

    I guess I'd say yes, that is true. It wouldn't only be duplicate key errors that are routed to the error output, but every error inserting into the destination. I wrote an article about how I did it here.

    Part of getting the data from the error output would include retrieving the error message so you'd know what the error was.
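
    For example, if the error output is landed in a table (the table below is just a placeholder; ErrorCode and ErrorColumn come from the SSIS error path, and the ErrorDescription column assumes the package looks the message text up, e.g. in a Script Component), the duplicate-key failures can be separated out afterwards:

        -- Hypothetical table that the error output is written to
        CREATE TABLE etl.LoadError
        (
            CustomerID       int            NULL,
            CustomerName     varchar(100)   NULL,
            ErrorCode        int            NULL,
            ErrorColumn      int            NULL,
            ErrorDescription nvarchar(4000) NULL,
            LoggedAt         datetime2(0)   NOT NULL DEFAULT SYSDATETIME()
        );

        -- After the load, pull out just the duplicate-key failures for review
        SELECT  CustomerID, CustomerName, ErrorDescription
        FROM    etl.LoadError
        WHERE   ErrorDescription LIKE '%Violation of PRIMARY KEY constraint%';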

  • OK, thank you Jack. I shall read up on that article.
