Check for New Or Updated Records using SSIS

  • I apologize if this has been asked before (I'm sure it has), but I'm stuck on an issue with SSIS and not sure the best way to word it.

    Currently, our interface pulls data from an Oracle data source and run the records through a stored procedure (whether the record was new or existing) and then updates a table in a SQL Server table. This is a pretty inefficient process and takes upward of 45 minutes per run.

    What I'd like to accomplish is to take the original data from Oracle and run it normally. But on subsequent runs of the SSIS package, I would like to only process records that are new to the Oracle table or have data that has been modified inserted into the new SQL table for the stored procedure to execute.

    Hopefully this makes sense and I was able to explain it.

    Any suggestions?

  • I'm pretty sure that you'd have to do that on the Oracle side of the house to be effective. Otherwise, you end up with a join across virtually the same thing as linked server which can be even slower than what you're doing now,

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • treyagrimes (4/28/2015)


    What I'd like to accomplish is to take the original data from Oracle and run it normally. But on subsequent runs of the SSIS package, I would like to only process records that are new to the Oracle table or have data that has been modified inserted into the new SQL table for the stored procedure to execute.

    Your options:

    * set up CDC at the Oracle side

    * have some audit columns at the Oracle source table. Something like ModifiedDate. When you keep track of this date, you can find the new and changed rows pretty easily.

    * read everything in into the data flow and use a lookup component to find out if a row is new or not.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Thanks to both replies.

    I'm not sure what a CDC is?

    The Oracle table does have a mod_date so that might be the column to key off of. Could you explain how that would work?

    Thanks,

  • treyagrimes (4/29/2015)


    Thanks to both replies.

    I'm not sure what a CDC is?

    The Oracle table does have a mod_date so that might be the column to key off of. Could you explain how that would work?

    Thanks,

    CDC = change data capture.

    If you have a mod_date and this column is consistenly updated, you can find out which rows have changed or are new.

    For example, on your first run you take all rows. The maximum mod_date is 20150429T144800. This means that next time your ETL runs, you only need to take rows where mod_date > 20150429T144800. If someone else makes a manual change in the table without updating mod_date, the system falls apart.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Besides the options Koen mentions above, you can use ORA_ROWSCN, an Oracle system column.

    Whenever data changes in a block in Oracle, the ORA_ROWSCN gets updated for that table. Use it the same way as Last Update Date. After each load, you save the highest value of ORA_ROWSCN for each table, and only select rows with a value above that the next time you load data from that table.

    A block is similar in concept to a page in SQL Server. You will likely have more than one row in each block, and not all of those rows have changed. However you have limited the number of rows you have to compare with your existing rows. If you want that part to be faster, you could use ORA_HASH on all columns and save that with the row and only update if the hash value has changed.

  • Thanks again for the replies. I'm back from PTO and trying to tackle this problem once again.

    In regards ROWSCANS or HASH, is that created on a view as well? I am actually reading for a view within Oracle that is populated by a package that runs on a schedule.

    I'm thinking the MOD_DATE in the view might be my best best as a user should not be changing this field (or any field) in the view, only the package that runs.

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply