Ways to address incremental updates in SSIS

  • Hi fellows,

    I'm trying to deploy a datamart, I have the star schema designed and I am working on the ETLs, but the issue is that there is one table that holds around 4.5 million rows and increasing, this table takes about 5 minutes to move from the OLTP system to staging... since the time isn't bad at all and we are using a test server for this, I would like to improve the times, but I haven't got to a good way to make this happen.

    I tried using the Lookup data flow transformation, but it start taking 8 minutes and needs to load not just the OLTP's 4.5 million rows, but the ones actually on staging... which increased the time... but gave me just the new rows.

    I am using an ADO NET Data source and Destination, I know that the Data source has a way that I can put a SQL Command but lack the knowledge to make that query receive a parameter.

    Any help is welcome!!!

    Thanks in advance.

    Frank.

  • What are you doing exactly in the Lookup component? Why does it have to load staging as well?

    A lookup component should be able to give you the inserts and the updates.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Thanks for your reply.

    What I am trying to do is to load into staging only the new data from a table, I would load it all, but I am not sure about the 5 minutes it takes to do it.

    How can I use the lookup component to get the inserts without loading staging?

    Regards,

    Franklin

    Frank.

  • Well, you can't find the inserts without loading the destination table, as the lookup needs something to compare the data to.

    If your source and your staging are SQL Server tables (2008 or up), you can try the MERGE statement. They would have the be on the same server to be practical though.

    How did you configure the lookup component? Did you use a query?

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • I don´t use the query, just set it to "Use a table or a view" and pick the one from staging that is a copy of it.

    Is there way I can use a SQL Query and send over a parameter to it? because this table works like a log, there will be no updates, just inserts, and I have a date column I can use, so if I can send the last date to the query as a parameter, I would be able to get only the new records 🙂

    Thanks.

    Frank.

  • f_ernestog (8/28/2012)


    I don´t use the query, just set it to "Use a table or a view" and pick the one from staging that is a copy of it.

    Is there way I can use a SQL Query and send over a parameter to it? because this table works like a log, there will be no updates, just inserts, and I have a date column I can use, so if I can send the last date to the query as a parameter, I would be able to get only the new records 🙂

    Thanks.

    First of all, use a query where you select only the column you need to perform the lookup. Otherwise you will be selecting every single bit of data of the lookup table into memory, which can be costly.

    Do you want to use the parameterized query in the OLE DB Source, or in the lookup component?

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • In the OLE Component.

    Frank.

  • You can use parameters there - example[/url] - but this can be a bit buggy. The easiest way, in my opinion, is to use a string variable and construct your query dynamically. The downside is that the query can only be 4000 characters long.

    A good article on the subject: SSIS Design Pattern - Dynamic SQL

    Or you could use a inline table valued function as your OLE DB Source query.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • You need a way to identify the new records on the OLTP system so you can limit your 'pull' from that system.

    Typical solutions:

    1) A LateUpdateDatetime or a Timestamp which you reference from the warehouse for the last 'pull' you made, and use that as a lowerbound for the query that gets records from the OLTP system.

    2) CHECKSUMs, which are basically contatonations of the rows coming from the source system simplified (and very slightly error prone). These are checked against the target system (which hard-stores the checksums) and anyone who's different is 'pulled' from the source system. This keeps data-load lighter, since you're not pulling entire rows into memory just to be checked.

    3) Staging tables. Mass dump all the data into a staging table at the warehouse and MERGE it into the target table, taking care of all updates and inserts in a single pass. This is the worst case scenario, and requires massive amounts of data streaming. If you're on 2k5 it's worse, you need to double-pass the data with an update pass then an insert pass.

    None of these directly address deletions, either.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • Thank you Koen and Craig for your reply.

    I will keep in mind Craig's advice, but based on the fact I have a deadline, I will go with passing the variable to the query on the OLE source.

    I have already tried that out, the first time is a pain in the behind since it has to load the whole table, but then only the records which date is newer to the max date.

    Thanks both of you for your help.

    Regards.

    Frank.

  • Using date might be risky depending on how active your OLTP system is. What's going to happen to records that are entered into the system the same day after your process is run? If you go by the logic of newer than max date, you'll end up excluding records that were inserted on the same day after your SSIS process has run.

    Without a proper way to track INSERTS on the OLTP system, I'd say you need to pull all records greater than or equal to max date, then do a CHECKSUM comparison for records that have the same max date. This will ensure no records are missed.

  • I agree, what I am doing is that today I run it but for yesterday and tomorrow it will run for today, that way I would pick up the entire set of data for an specific day, also I am using checksums to determine if there are duplicates and I am doing a SSIS package to handle that (a data cleansing and data quality package).

    Thanks for your heads up, seeing that someone else has my own fears makes me think I should be on the right path 🙂

    Frank.

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply