A SSIS Design Pattern for Change Data Capture

  • Comments posted to this topic are about the item A SSIS Design Pattern for Change Data Capture

  • A had a quick glance through the article. The title says it's about CDC, but I don't see anything about the change data capture functionality introduced in SQL Server 2008. Instead, you roll your own alternative.

    In the first dataflow, you do a full outer join between the staging table and the destination table. Can't that introduce locking? Suppose you have a very large dimension and the source is still reading while you are already inserting new rows. Since the OLE DB Destination with Fast Load takes by default a table lock, this might introduce issues.

    Your article starts with the following sentence:

    Within Integration Services there is no single out of the box components to handle CDC

    That is simply not true. There are build-in components available for CDC in SSIS 2012 (CDC Control Task, CDC Source and CDC Splitter). This article is published in june 2012, and SQL Server 2012 has been publicly available for some time now, so you can't just ignore that edition.

    In my opinion, the solution this article proposes is very good for SQL Server 2005, but not needed anymore starting from SQL Server 2008 (unless you don't have Enterprise Edition of course).

    edit: if Enterprise Edition is not availabe, I would rather consider the MERGE statement for SQL Server 2008 and up.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Another question:

    you claim the following:

    Efficient in handling large volumes of data

    About how much volume are we speaking here?

    It seems to me that for very large volumes the FULL OUTER JOIN can get problematic.

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • I had to design a package very similar to the one demonstrated but I had the problem that the COU had to be done in sequence so unfortunately Rbar was implemented in a script task.

  • Really, staging table and a merge statement is all you need (2008+). For 2005 I'd just use a proc to replicate the functionality of the merge statement. SSIS is massive overkill for this kind of function (and many other examples I see on a daily basis).

  • chris.j.clague (6/13/2012)


    Really, staging table and a merge statement is all you need (2008+). For 2005 I'd just use a proc to replicate the functionality of the merge statement. SSIS is massive overkill for this kind of function (and many other examples I see on a daily basis).

    What is the overkill exactly between the sp and the SSIS package?

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • I don't understand how it could work with huge amount of data because the solution uses CHECKSUM function which is not reliable.

    Example:

    select binary_checksum('51;52;56;2204;') produces 1726190947

    select binary_checksum('51;53;56;2205;') produces 1726190947

    select binary_checksum('51;52;56;2205;') produces 1726190963

    select binary_checksum('51;53;56;2204;') produces 1726190963

    As you can see different input strings produces the same output.

    Those cases are not so rare if if have to work with millions of records.

    In general, we are using similar approach to load data to our DW but we had to write custom script component for comparing.

  • i agree with the comment above. merge statement is fabulous. you got your data instaging with a bulk load insert. then you can handle updates, deletes and inserts the way you want between staging and destination tables.

    before use audit columns and tables to handle what to extract (i use cozyroc to handle dynamic meta data) and in 2k8, cdc was introduced.

    Clement

  • Using the BINARY_CHECKSUM introduces a limitation to your solution. The BINARY_CHECKSUM function will only detect changes to columns of comparable types. This means that if you have columns declared as type text, ntext, image, cursor, xml, etc. then changes to those won't be detected.

  • The solution described in this article was based on an application initially implemented on 2005 where there were no built in CDC components. The underlying database server was subsequently upgraded to 2008. Since this solution was build using only the standard components, it continued to work with the 2005 backward compatibility pack without any downtime.

    You are absolutely right about the CDC features in 2008, however this solution ensured that the application continued to worked pre and post 2008 upgrade without interruption.

  • Koen Verbeeck (6/13/2012)


    chris.j.clague (6/13/2012)


    Really, staging table and a merge statement is all you need (2008+). For 2005 I'd just use a proc to replicate the functionality of the merge statement. SSIS is massive overkill for this kind of function (and many other examples I see on a daily basis).

    What is the overkill exactly between the sp and the SSIS package?

    I was wondering this too... sometimes I've found that when dealing with large volumes of data, that even MERGE is surpassed by a combination of lookups to split new records from updates via SSIS.

    Not always.... sometimes MERGE wins, but it seems about 60/40 in favor of managing upserts to my DW with SSIS these days.



    --Mark Tassin
    MCITP - SQL Server DBA
    Proud member of the Anti-RBAR alliance.
    For help with Performance click this link[/url]
    For tips on how to post your problems[/url]

  • I prefer Merge statements versus performing everything via SSIS tasks. I use lookups as well but find that they bog down with larger data sets. Most of my packages still have lookup tasks in them but only for updating smaller tables.

    I haven't had a chance to play with any of the CDC tasks yet. Does anyone have links to good tutorials or examples of using these tasks in SSIS 2012?

  • >>> A had a quick glance through the article. The title says it's about CDC, but I don't see anything about the change data capture functionality introduced in SQL Server 2008. Instead, you roll your own alternative.

    CDC is a term used to refer any method which captures changed data. It does not refer to any one particular method even though MS SQL server term one of their product feature as CDC.

    There are 4 to 5 ways of doing CDC in MS Sql server . So we could have this one as sixth way of doing it.

    My first post in SC. 😉

  • Praveen Kumar-379209 (6/13/2012)


    My first post in SC. 😉

    Welcome to SSC!

  • Spot on. This a very common misuse of checksums. In many cases, like this one, there is really no performance advantage because it doesn't reduce I/O. Unless you get paid to write software that works most of the time, you need to match the columns.

Viewing 15 posts - 1 through 15 (of 26 total)

You must be logged in to reply to this topic. Login to reply