Using Change Data Capture (CDC) for Extracting Changed Data on Multiple Tables in a Single Query

  • Question:

    What is the best way to query for changes when using CDC when your destination dimension table consists of multiple source OLTP tables?

    Explanation:

    I’m looking to get opinions and experiences from others that have used CDC in a DW environment. When you are pulling data from an OLTP system you typically have several tables that will be combined into a single denormalized table in the DW. The HOW TO PULL this data is usually straight forward by joining the tables and the WHAT TO PULL depends on the source system and if it has a created/modified date or something similar. When using CDC to capture changes to the OLTP tables the WHAT TO PULL becomes a bit easier especially if you don’t have a source that reliably uses the created/modified dates. However, the HOW TO PULL potentially becomes a bit more difficult.

    Pulling directly from an OLTP system this is usually straight forward by joining the tables in a single query. However, when you are using CDC to capture the changes

    Example:

    Using the AdentureWorks2008 OLTP sample database and specifically the Production.Product, Production.ProductSubCategory and Production.ProductCategory tables. The data will be extracted from these tables and loaded into a single denormalized table called dimProduct. The source OLTP tables have CDC enabled to capture changes and thereby creating three additional tables of cdc.Product_CT, cdc.ProductSubCategory_CT and cdc.ProductCategory_CT.

    Using a simple query as below we can see the joins on the OLTP tables

    selectprod.ProductID

    , prod.Name as ProdName

    , subcat.ProductSubcategoryID

    , subcat.Name as SubCatName

    , cat.ProductCategoryID

    , cat.Name as CatName

    fromProduction.Productprod

    joinProduction.ProductSubcategorysubcat

    onprod.ProductSubcategoryID = subcat.ProductSubcategoryID

    joinProduction.ProductCategorycat

    onsubcat.ProductCategoryID = cat.ProductCategoryID

    -- just reducing the result set

    whereprod.ProductID in (680, 706, 707, 708)

    Result set from the above query

    ProductIDProdNameProductSubcategoryIDSubCatNameProductCategoryIDCatName

    680HL Road Frame - Black, 5814Road Frames2Components

    706HL Road Frame - Red, 5814Road Frames2Components

    707Sport-100 Helmet, Red31Helmets4Accessories

    708Sport-100 Helmet, Black31Helmets4Accessories

    Example of Change:

    1.Yesterday a user changed the Category Name for Components to Component (singular). In the past month no other source changes have occurred to any of the data in these three tables.

    2.The CDC functionality identifies this change in the Production.ProductCategory table and logs it to the cdc.ProductCategory_CT table.

    3.Data in Production.Product or Production.ProductSubCategory did not change and therefore no data will be in the cdc.Product_CT or cdc.ProductSubCategory_CT tables.

    One possible way to retrieve data (simply net changes) in this situation might be as described in the pseudo code below.

    Select from OLTP tables (as in the previous query)

    where ProductID in (select ProductID from net change function for cdc.Product_CT)

    or ProductSubCategoryID in (select ProductSubCategoryID from net change function for cdc.ProductSubCategory_CT)

    or ProductCategoryID in (select ProductCategoryID from net change function for cdc.ProductCategory_CT)

    Have you encountered this before? If so, I’d appreciate any feedback on how you tackled this and any best practices you yielded from your experiences.

    Marc BeacomManaging Partner DatalereTAP the Power of Data(tm)

  • I've run into the same thing, although in a slightly different application. I built an ODS using the merge statement for each of the tables that were needed for the data warehousing solution. In the merge destination table I had a TransactionType column with "I", "U", or "D" and a TransactionDate with the date that the change occured. So in a lot of ways, it was the same type of system that the cdc tables use. I honestly can't remember why we used the merge statement ODS versus the CDC tables.

    That said, when querying from more than one table I had to check all three tables in my selection criteria for both dates against my start and stop parameters and transaction types. If even one of the three tables changed, I would bring back the entire record from all three tables and let my ETL handle the processing correctly.

    I would doubt that my way was a "best practice" as you requested, but it definitely worked and performance seemed okay. I'd be curious if your solution is better or worse performance wise than just joining your tables to the cdc tables and then checking your transaction types and dates against your parameters like I did. The only difference in using the merge statement versus the cdc tables is that the transaction type and date are already built into the source table...no joining to the cdc table needed for grabbing changes. I wonder if that helps performance? I didn't have a lot of data and definitely didn't have the time to investigate further. I'd be curious to hear from others.

  • I want to capture history record from the CDC table but as you said it is in-consistent, only one table get updated out of three table, its very difficult to get the history data since we are joining three CDC table to get value.

    What my idea is join the three cdc table using outer join and if the value is NULL in the PK of any cdc table take the previous value i.e current record from OLTP or DW dimension table.

    Please give your ideas and suggestions.

    Thanks,

    Syed

  • Thanks,

    Syed

  • You should design your process in such a way that it processes the effects of only one tables' changes at a time. So if your DWH table is the result of joining 3 tables, you should have 3 separate 'feeds', each processing only the changes from one of these source tables.

    For example, your Production.ProductSubcategory and Production.ProductCategory are likely to have very little changes. Chances are that they will not have changed within the last 3 days (the default retention period for cdc) at the time a new product is created. i.e. If you do an inner join between the 3 function's results, you'll likely end up with an empty result set: the cdc.fn_cdc_get_net_changes...() functions only return latest changes, not the entire source table's contents. If you split up each table's changes into 3 separate 'feeds', each of these feeds must reflect the effect on the DWH table of the changes made in the one source table.



    Posting Data Etiquette - Jeff Moden[/url]
    Posting Performance Based Questions - Gail Shaw[/url]
    Hidden RBAR - Jeff Moden[/url]
    Cross Tabs and Pivots - Jeff Moden[/url]
    Catch-all queries - Gail Shaw[/url]


    If you don't have time to do it right, when will you have time to do it over?

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply