A SSIS Design Pattern for Change Data Capture

  • David Data

    SSCrazy

    Points: 2965

    I too wondered why you didn't use MERGE; I see you answered that by saying this was a method you had developed under SS 2005, which didn't support SQL MERGE.

    I had a fairly similar requirement within a SSIS 2008 ETL application. I loaded data into a staging table as you do, but then did the rest in a single MERGE statement. In my case, I didn't need to record change history, however that could be added using an OUTPUT clause. For earlier SQL Server versions, one could probably do the same thing with UPDATEs and INSERTs and triggers.

    Having built a fairly big SSIS application, I find that as I add functionality, I no longer add further SSIS task boxes. Instead, I add to the SQL procedures that process the data after loading, and also write .NET code to handle more complex input files, including loading TSV and Excel files more flexibly than the SSIS Flat File and Excel sources do. I wonder if this is a common experience; SSIS lets people new to ETL do clever things, but as they gain experience they find programming their own code works better.

    Having said that, where you HAVE to do it RBAR, SSIS often provides efficient ways to do so!

  • Koen Verbeeck

    SSC Guru

    Points: 258985

    KWymore (6/13/2012)


    I prefer Merge statements versus performing everything via SSIS tasks. I use lookups as well but find that they bog down with larger data sets. Most of my packages still have lookup tasks in them but only for updating smaller tables.

    I haven't had a chance to play with any of the CDC tasks yet. Does anyone have links to good tutorials or examples of using these tasks in SSIS 2012?

    Here you go:

    http://www.mattmasson.com/index.php/2011/12/cdc-in-ssis-for-sql-server-2012-2/

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Koen Verbeeck

    SSC Guru

    Points: 258985

    Praveen Kumar-379209 (6/13/2012)


    >>> A had a quick glance through the article. The title says it's about CDC, but I don't see anything about the change data capture functionality introduced in SQL Server 2008. Instead, you roll your own alternative.

    CDC is a term used to refer any method which captures changed data. It does not refer to any one particular method even though MS SQL server term one of their product feature as CDC.

    Usually the process of capturing changed data is referred as incremental load. When CDC is mentioned, well, usually the CDC feature is meant. So it's very confusing to put it in a title and then don't speak about the feature with the same name for an entire article.

    I put my remark there to point out that there are lots of alternatives which usually require less coding (MERGE statement) and that CDC is in fact supported in SSIS.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Ken Wymore

    SSCoach

    Points: 16642

    Koen Verbeeck (6/13/2012)


    Thanks Koen, this article looks pretty thorough!

  • Koen Verbeeck

    SSC Guru

    Points: 258985

    KWymore (6/13/2012)


    Koen Verbeeck (6/13/2012)


    Thanks Koen, this article looks pretty thorough!

    It is. It's written by Matt Masson, developer on the SSIS team.

    Check out his site, there are also follow-up articles on the CDC subject, for example on the different processing modes of the CDC source.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • rarpal

    SSC Rookie

    Points: 31

    David Data (6/13/2012)


    Having built a fairly big SSIS application, I find that as I add functionality, I no longer add further SSIS task boxes. Instead, I add to the SQL procedures that process the data after loading, and also write .NET code to handle more complex input files, including loading TSV and Excel files more flexibly than the SSIS Flat File and Excel sources do. I wonder if this is a common experience; SSIS lets people new to ETL do clever things, but as they gain experience they find programming their own code works better.

    I think it tends to be the experience among those who come from backgrounds like SQL and .Net coding who find SSIS the long route to achieving the same goal with standard coding. I think SSIS can give you that additional layer of scalability especially when you are dealing with data integration that involve multiple data environments as it runs an a separate service. Another thing that SSIS can claim to its favor is its ability to do things in parallel, like running multiple unrelated data flows inside a single sequence container which would be a bit more difficult to do purely in SQL.

    Although I have not done any serious performance tests to see if these parallel tasks actually do give better performance.

    Has anyone done or know of any performance tests on parallelism in SSIS?

  • Ken Wymore

    SSCoach

    Points: 16642

    David Data (6/13/2012)


    Having built a fairly big SSIS application, I find that as I add functionality, I no longer add further SSIS task boxes. Instead, I add to the SQL procedures that process the data after loading, and also write .NET code to handle more complex input files, including loading TSV and Excel files more flexibly than the SSIS Flat File and Excel sources do. I wonder if this is a common experience; SSIS lets people new to ETL do clever things, but as they gain experience they find programming their own code works better.

    Having said that, where you HAVE to do it RBAR, SSIS often provides efficient ways to do so!

    I generally start with SSIS to get the data into the database and then only use SSIS for tasks that are easier to set up in SSIS than in TSQL code. Things like setting up type 2 dimension loads where it is easier to use the slowly changing dimension task or the Kimball method SCD task (add-on created by Todd McDermid) than it is to create more complex merge statements or multiple insert/update/delete statements.

    Sometimes I like writing out code more than using SSIS because you have more control over how the data is handled and you can see everything that the code is doing. From a troubleshooting perspective, standard TSQL code is easier to go through as well. The errors thrown by SSIS aren't the most helpful. 🙂

    I also use SSIS to call applications such as 7-Zip so I can pull data into the database, work with it and push zip files back out as needed or loop over sets of files, unzipping archives, pulling the data in and then pushing the archives to a different folder. This eliminates the need for me to be a database developer and an application developer. When we were still on SQL Server 2000, we had a number of applications that required vb apps to acquire and/or parse data and then call DTS packages to push the data into the database. I eliminated most of these with SSIS packages where everything can be accomplished in a single package.

  • shiva1727

    SSC Enthusiast

    Points: 199

    Can you please send me the package that you have created for CDC

  • rarpal

    SSC Rookie

    Points: 31

    shiva1727 (6/19/2012)


    Can you please send me the package that you have created for CDC

    The sample package is attached in this zip. you will have to modify the connection managers for your environment.

  • tan110

    Hall of Fame

    Points: 3664

    Does anyone have a copy of the ssis package from the article?

  • David Data

    SSCrazy

    Points: 2965

    As this thread has woken up again, and as SSCrazy mentioned running 7-Zip in SSIS, here's one way to do it in .NET code, in this case to extract the content of a ZIP or 7z archive File into directory DataDir:

    Const UnZip As String = "C:\Program Files\7-Zip\7z.exe"

    Dim Command As String = String.Format("""{0}"" e -y -o""{1}"" ""{2}""", UnZip, DataDir, File)

    Dim Stat As Integer = ExecCommand(Command)

    If Stat = 0 Then '' 0 = processed file, 2 = no file to process, -1 = error

    (it worked)

    Else

    (report error)

    End

  • Federico Iori

    SSCommitted

    Points: 1625

    From MSDN : MOST , but not ALL, modification

    BINARY_CHECKSUM(*), computed on any row of a table, returns the same value as long the row is not subsequently modified. BINARY_CHECKSUM(*) will return a different value for most, but not all, changes to the row, and can be used to detect most row modifications.

Viewing 12 posts - 16 through 27 (of 27 total)

You must be logged in to reply to this topic. Login to reply