Anatomy of an Incremental Load

  • Hi Hzhu,

    I describe how to make a Lookup do a Left Outer Join around Figure 6 in the article. For a Right Outer Join I repeat the same process and simply swap the inputs. There are lots of approaches to this. I would suggest a follow-up data flow, after the incremental load data flow: add an OLE DB Source adapter and aim it at the destination table, then add a Lookup aimed at the source table. Click Configure Error Output and set the Error response for the lookup operation to Ignore Failure. This makes the Lookup into a Left Outer Join between the Destination table and the Source, with the Destination on the left, so any rows in the Destination that are not in the Source will return Nulls for the Source fields in the Select clause. If, after the first Data Flow of New and Updated rows, rows are present in the Destination that are missing from the Source, those rows have been deleted from the Source. You can detect them using a Conditional Split (see the explanation of New Rows detection surrounding Figures 7 and 8) and use an OLE DB Command transformation to delete them from the Destination - but note that this approach is row-based.

    I cannot view the article properly on my computer - it may very well just be my machine. To make sure we're talking about the same things, please click here to download the document.

    You can also stage the business keys for the rows to be deleted in a staging table, then apply the same principles discussed for set-based Updates (in the section entitled Write: Changed Rows) to do a set-based Delete.
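
    For example, the set-based Delete itself can be as simple as something like this (the table and column names below are placeholders, not names from the article):

      -- Stage the business keys of the rows the Conditional Split flags as deleted,
      -- then remove them from the destination in a single set-based statement.
      DELETE dst
      FROM dbo.DestinationTable AS dst
      INNER JOIN dbo.stgDeletedRowKeys AS stg
          ON stg.BusinessKey = dst.BusinessKey;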

    :{> Andy

    Andy Leonard, Chief Data Engineer, Enterprise Data & Analytics

    Thanks, Andy. What you mean is to build another data flow task, drag an OLE DB Source into it that reads the destination table (in your example, the New_Rows OLE DB Destination) as the source data, then add a Lookup transformation to do the left outer join between the destination and the source, with the destination on the left this time, and add a Conditional Split to detect the rows that need to be deleted, which finally go to a stage table that will later be used to delete from the destination table.

    One more question: in this case, what if the source is a table but the destination is a flat file (because I swapped the source and destination for deleted rows)? How do I edit the Reference Table tab in the Lookup transformation? I know I need to choose a Flat File connection manager, but can I still use a SQL query to pull all the data out of the flat file, since it's a file and not a table? What should I do?

  • I would stage the destination in a table through the data flows and then write it to a flat file in another data flow - maybe. It would depend on the size and shape of the data.

    :{> Andy

    Andy Leonard, Chief Data Engineer, Enterprise Data & Analytics

  • Andy,

    You might have misunderstood my question. What I meant in the post above is: what if the Lookup transformation has to look up a flat file as the destination instead of a normal table (the source is a table)? How can I do the correlation here?

    -hzhu

    I've programmed lookups against text data many times in SSIS. It works great for avoiding the overhead of the database engine when it isn't needed.

    You need to use the Script Component to do this. In the initialization (PreExecute) method of the script, use .NET code to load the text data into a collection.

    Then add code in the row-processing method of the component to compare some value in the current row to a value in the collection, and do what you need to do depending on the result - usually update a value in the row with the "looked up" value from the collection.

    Be careful about performance if you are updating strings with .NET code: use the StringBuilder class to avoid recreating string objects.

    P.S. As of this writing, only VB.NET is supported as the source language in the SSIS Script Component. Of course, you can write the functionality in any language that can produce a .NET assembly and then reference that assembly from the component if you want to.
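
    If it helps, here is a very rough sketch of the shape of that code (the file path, the delimiter, and the column names BusinessKey and ExistsInFile are placeholders I made up - substitute your own metadata):

      ' Rough sketch only: file path, delimiter, and column names are placeholders.
      ' Script Component (Transformation) with an input column BusinessKey (string)
      ' and a boolean output column ExistsInFile added on the Inputs and Outputs page.
      Imports System
      Imports System.IO
      Imports System.Collections.Generic

      Public Class ScriptMain
          Inherits UserComponent

          ' Keys loaded from the flat file, used as the in-memory "lookup" set
          Private keys As New Dictionary(Of String, Boolean)

          Public Overrides Sub PreExecute()
              MyBase.PreExecute()
              ' Load the text data once, before any rows flow through the component
              For Each line As String In File.ReadAllLines("C:\Data\Source.txt")
                  Dim fields() As String = line.Split(","c)
                  If fields.Length > 0 AndAlso Not keys.ContainsKey(fields(0)) Then
                      keys.Add(fields(0), True)   ' first field assumed to be the key
                  End If
              Next
          End Sub

          Public Overrides Sub Input0_ProcessInputRow(ByVal Row As Input0Buffer)
              ' Compare the current row's key to the collection; a downstream
              ' Conditional Split can then route the rows where ExistsInFile is FALSE
              ' (rows in the destination that are no longer in the source file).
              Row.ExistsInFile = keys.ContainsKey(Row.BusinessKey)
          End Sub
      End Class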

    Ken

  • Ken,

    Did you do the programmed lookup in SSIS 2005? Let me briefly describe my case. Following Andy's article, I use a Flat File Source to read a collection of data instead of the OLE DB Source reading a table as in his example, then correlate in a Lookup transformation to do the left join (flat-file source data on the left, destination table on the right) to find new rows. A Conditional Split then applies two criteria to separate new rows and changed rows. There are two OLE DB Destinations: one connected to the new rows (going into the destination table), the other connected to the changed-rows output (going into a stage table), on which I do the set-based updates later. This is where Andy's article ends for new rows and changed rows.

    Now I need to do something for deleted rows. Andy suggested I add another follow-up data flow task, swapping the inputs: take the destination table (the new-rows table from the data flow above) as the input source, then use a Lookup transformation to do the left join with the destination table on the left this time and the source data (the flat file) on the right, so I can find the rows that are in the destination table but no longer in the input source file and need to be deleted from the destination table later. Then use a Conditional Split in almost the same way as when defining the new rows. Finally, the rows to be deleted go to a stage table.

    My trouble here is how to do the lookup, since I cannot do the correlation with a left join between a table and a file. You said to program the lookup, but how? I'm not good at coding in .NET or VB. Is there any other easy way to do this in SSIS 2005? Or maybe you would like to share some sample code to show me how to do this. I would very much appreciate it. Thanks.

    P.S. Andy, do you have an example for deleted rows (right outer join) in SSIS 2005?

    ==I describe how to make a Lookup do a Left Outer Join around Figure 6 in the article. For a Right Outer Join I repeat the same process and simply swap the inputs. There are lots of approaches to this. I would suggest a follow-up data flow, after the incremental load data flow: add an OLE DB Source adapter and aim it at the destination table, then add a Lookup aimed at the source table.==

    Andy, for my project, to handle the deleted rows I swap the inputs. Now my source is the OLE DB destination table from the previous data flow task (the new-rows table) and my Lookup transformation's reference table is a flat file, so how can I do the correlation between a table and a file? Ken suggested using a Script Component instead of the Lookup transformation and coding this part. Do you have any sample code, or is there any other easy way to do this, since I'm not familiar with .NET coding? Thanks.

  • ==Did you do the programmed lookup in SSIS 2005?==

    Yes. There is no other "production version".

    ==Is there any other easy way to do this in SSIS 2005?==

    Not to my knowledge, no.

    ==Or maybe you would like to share some sample code to show me how to do this.==

    I gave you good generalized instructions. That's about the best I can do without a full-blown article or tutorial, which I just don't have the time to write for this, sorry.

    Since you are not - as you say - good in VB, I am sure you would not understand the specific code even if I posted it.

  • A very good explanation of how to do incremental loads. I implemented it and had no problems in the beginning, but recently I've been having some issues and am wondering if anybody could help me out.

    Previously my incremental data was smaller (around 3,000 rows) and it worked well. Now I have about 12,000 records, with about 9,000 for insert and about 3,500 for update. The problem is that when I run the package it does not come out of debug mode and hangs. If I look into the data flow task, it shows the total number of rows and the number of insert and update rows, but the package just freezes there.

    I looked online and read somewhere that it was due to the cache size in the Lookup transformation. I changed its cache mode from full cache to none. Now sometimes the data gets loaded and sometimes it doesn't. If I break my source file into smaller sets and then do the load, it works. Is there any solution for this, and what is causing it?

  • Dear Mr. Leonard,

    I followed your example "Anatomy of an Incremental Load" and came across a few questions.

    I followed all of your recommendations and did everything you said: two OLE DB Destinations, with one output of the Conditional Split linked to New and the other to Changed, and the columns mapped. However, when I run the SSIS package, the data only goes through the Changed path, and it loads the records every time. I tried debugging, but I couldn't figure out what was wrong with the package.

    The explanation of the Changed Rows was a little vague. Can you please explain it further?

    Thanks,

    dbdeli

  • Hi Andy,

    I implemented the incremental load the same way as you illustrated. It is working, but I was told to add an exception table for duplicate records.

    How would you add a feature to direct the duplicate records to the exception table?

    Thanks,

    dbdeli

  • Hi Andy,

    It's pretty fast, but it only loads 3 records. I wonder how the performance will be for source tables with 500+ million records. Currently I'm designing an incremental load for 20 tables from Oracle 10g to SQL Server 2005. If you have any suggestions or ideas about how to design my project, PM me. Thanks.

  • Expect to need some non-trivial redesign for performance when scaling a solution that works well with 3 records up to millions.

  • Hi Andy,

    Following your approach for the incremental load: in the Conditional Split for new inserts, if I have a composite key, which of the following is the correct way:

    "&&" or "||" between the composite key columns.

    In other words:

    ISNULL(contractid)&&ISNULL(start_date)&&ISNULL(EndDate)&&ISNULL(effective_date)

    or

    ISNULL(contractid) || ISNULL(start_date) || ISNULL(EndDate) || ISNULL(effective_date)

    Please advise.

    Thanks,

    dbdeli

  • Hi dbdeli,

    Using your example, you should define the Lookup to join on all four columns of your composite key in the correlation step. This is the equivalent of using all four columns in the ON clause of a Left Outer Join.

    Technically, you can test any column returned from the correlated join for Null - they should all be Null if there is no lookup match. It's typical to test a business key to detect new rows.
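
    If it helps to see it in T-SQL terms, the correlation is roughly equivalent to this (the table names here are invented for the example):

      -- Roughly equivalent T-SQL (table names assumed):
      SELECT src.contractid, src.start_date, src.EndDate, src.effective_date,
             dst.contractid AS Dest_contractid
      FROM dbo.SourceTable AS src
      LEFT OUTER JOIN dbo.DestinationTable AS dst
          ON  dst.contractid     = src.contractid
          AND dst.start_date     = src.start_date
          AND dst.EndDate        = src.EndDate
          AND dst.effective_date = src.effective_date;
      -- Any row where Dest_contractid IS NULL has no match: a New row.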

    My advice: be consistent (at least within the same project), and document (use annotation) to describe why you did it the way you chose.

    :{> Andy

    Andy Leonard, Chief Data Engineer, Enterprise Data & Analytics
