DTS to import from excel ignoring duplicate records

  • Greetings, everyone.

    I find myself in the need to use a DTS package that will have to import data from a user-provided excel file which may or may not contain records that are already present in the database. The table this stuff will end up into has a primary key which is also present in the excel file, so at the very least a brutal column copy wouldn't insert duplicates. However, I need the import not to be aborted by the presence of said duplicates, I need them to be ignored and the non-duplicates to be inserted in the target table.

    Now, this will probably look very naive of me, so I apologize in advance. Creating an excel connection to the file, one to the db and connecting them with a data transformation alone doesn't seem to cut it. I haven't been able to figure out any way to let the transformation check for duplicates by itself, the source just wants to know what to get and the destination where to put it, and there's no place inbetween to tell it to check for the existance of possible duplicate records.

    Going by intuition, I assume I have to get the stuff from the excel file as it is in the database, possibly in a temporary table, and only then let the database pick the non-duplicate records from said temporary table and then drop it once it's done with it.

    So, assuming I'm going in the right direction with this - and feel free to correct me if I'm not - the question becomes... er, how do I do it? Specifically, how do I do the temp table bit in the DTS package? Should I use an intermediate, duplicate connection to the database to receive the temp table, and then use a data transformation from there to the original db connection, and have it grab the goods and drop the table? Or am I complicating matters?

    Thanks in advance for your time, regards

    Immano

  • I would create a staging table and use DTS to copy the data from excel to this staging table. Then I would compare the data in the staging table with the production table and only insert the new data into the production table.

    INSERT INTO prod (...)

    SELECT ....

    FROM Staging s

    LEFT OUTER JOIN prod d on s.id = d.id

    WHERE d.id IS NULL

    Assume id is the primary key of the table

  • Hello and thanks for replying. What do you mean by "staging table", exactly? Would that be a permanent table to be used only for the intermediate step between the excel and the destination? I guess if I delete every record in there everytime I'm done with this import it should be fine, I'm just trying to make sure I go by the best possible practice, given he premises.

  • A staging table is a permanent table in the database but acts like a temporary holder of the data.

    So every time you delete all the record of the staging table before you DTS the data from the excel into the staging table and use it to compare the data with your 'production' table.

  • Most illuminating, thanks.

    So, to sum it up, to make a DTS package:

    1) create tables staging_table and final_table (such inventive, I know ^^)

    2) create dts as 3 connections (1 excel + 2 db) and 2 data transformations (from excel to staging_table and from there to final_table).

    3) data transformation #1: wipe staging_table (truncate better than delete?) column-copy everything from excel to staging_table.

    4) data transformation #2: use a query to select as source only the rows with keys not already in the final_table, and final_table as destination.

    Is that about it?

    Edit to add: Hmm, on point (3) I don't think I can wipe the table before I fill it up, so where should I place that bit of action?

    Further edit: Ok, by tinkering a bit with the system I've come up with the following:

    1) SQL task to clear the staging table (truncate).

    2) Success constraint from this one to the excel connection.

    3) Data Transformation from Excel connection to db connection #1 copies columns into staging table

    4) Data Transformation from db connection #1 to db connection #2 selects columns not in final table with your "id is null" method and insert them in the final table.

    5) For paranoia reasons, a second copy of SQL task to clear the staging table (truncate).

    Seems to work, many thanks for the advice. 🙂

  • You should truncate the staging table first before the transfer data from excel to staging table.

  • Yes, that's what I do. I also truncate it again at the end just in case I need to use it for any other task that may leave it filled.

    Thanks again. 🙂

Viewing 7 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply