Replication, CDC, RBAR and ODS. Oh my!

  • I work in an enterprise environment. From what I've gathered over 20 years, what that really means is that the data which defines any single "entity" from the business point of view (e.g. a customer) is actually managed by multiple different systems (e.g. the ERP system and the CRM system), which means getting a whole entity involves queries across multiple instances, which (for a naive implementation) means linked server queries or some other large data movement, which means pain at query time, which is bad.

    A similar story applies to enterprise reporting.

    The solution in a lot of cases is a data warehouse. Great for analytical reporting. But what happens when you want your enterprise-wide entities available to other systems (internal or external) in an operational ("near-real-time") manner? An operational data store seems like a good idea here. Pull all the data from all the various systems, potentially transform the base tables into some kind of common model which can be defined by the business (instead of being locked into schemas defined by software vendors), and expose that common model of the enterprise-wide data.
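
    To make that concrete, a common "customer" model in the ODS might look something like this (all schema, table and column names here are invented for illustration):

      CREATE VIEW common.Customer
      AS
      SELECT
          e.CustomerNumber          AS CustomerId,   -- business key owned by the ERP
          e.LegalName               AS CustomerName,
          e.CreditLimit,                             -- ERP-owned attribute
          c.PreferredContactMethod,                  -- CRM-owned attributes
          c.AccountManagerEmail
      FROM erp_stage.Customer AS e
      LEFT JOIN crm_stage.Account AS c
          ON c.ErpCustomerNumber = e.CustomerNumber;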

    Great!

    But wait... how do we populate the ODS in a near-real-time way? Ah, there's the rub.

    It is in relation to this concept that I am looking seriously into transactional replication for the first time. I should note that we already have CDC enabled on some databases, with the change data being read by the data warehouse and a couple of other downstream consumers in a point-to-point manner. But using CDC has drawbacks - it's pull, not push. It's Enterprise edition only. It means modifying the source database (adding the change tables). And it grows those source databases pretty significantly.
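
    For anyone who hasn't set it up, enabling CDC looks something like this (table and option values are placeholders), and it shows where the source-side footprint comes from:

      USE SourceDb;
      EXEC sys.sp_cdc_enable_db;

      EXEC sys.sp_cdc_enable_table
          @source_schema        = N'dbo',
          @source_name          = N'Customer',
          @role_name            = NULL,   -- NULL = no gating role
          @supports_net_changes = 1;      -- requires a PK or unique index

      -- This creates the cdc schema, the change table cdc.dbo_Customer_CT,
      -- and the capture/cleanup jobs. Change rows accumulate in the change
      -- table until the cleanup job ages them out - hence the growth.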

    As I see it, transactional replication is a bit "lighter" in terms of the impact on the source. Sure, it only pushes current state, but an ODS only has current state, so that seems fine. But transactional replication also implies row-by-agonising-row mechanics when changes are written to the target. This is a major concern for me. I actually wrote a blog post on this site describing my hatred of "message oriented" architectures because of the way they take very fast - and potentially very large - set-based operations and turn them into order(s)-of-magnitude-slower, packet-based, single-entity messages. I hate that whole idea.

    With CDC, I can read a whole set of data from the change function and write that whole set down to the target. That eliminates the RBAR model of transactional replication, right? But wait, here's the question: When CDC scrapes the log and writes to the change tables, are these RBAR writes?
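
    In other words, the CDC consumer can be fully set-based - something like this minimal sketch (capture instance, key and column names are assumed, and a real loader would persist the last LSN it processed rather than always starting from the minimum):

      DECLARE @from_lsn binary(10) = sys.fn_cdc_get_min_lsn('dbo_Customer'),
              @to_lsn   binary(10) = sys.fn_cdc_get_max_lsn();

      -- One statement applies the whole window: net changes return at most
      -- one row per key, so the MERGE stays clean.
      MERGE ods.Customer AS tgt
      USING cdc.fn_cdc_get_net_changes_dbo_Customer(@from_lsn, @to_lsn, 'all') AS src
          ON tgt.CustomerId = src.CustomerId
      WHEN MATCHED AND src.__$operation = 1 THEN   -- delete
          DELETE
      WHEN MATCHED THEN                            -- update (op 4)
          UPDATE SET tgt.CustomerName = src.CustomerName
      WHEN NOT MATCHED AND src.__$operation IN (2, 4) THEN
          INSERT (CustomerId, CustomerName)
          VALUES (src.CustomerId, src.CustomerName);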

    In other words, is it actually the case that the RBAR mechanics of transactional replication are no worse than CDC, because CDC is RBAR into the change tables?

  • I'm surprised I haven't had a single nibble on this one.

    Today I had some free time to play around and see what I could discover. 

    In one of our dev environments where CDC is enabled, I added an insert trigger to one of the change tables. Not generally a good idea, but useful for this particular test. The trigger is simple - insert into cdcActivity select getdate(), count(*) from inserted.
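
    Here's a runnable version of that test setup (the change table name is whatever your capture instance is called; I'm showing cdc.dbo_Customer_CT for consistency with the sketches above):

      CREATE TABLE dbo.cdcActivity
      (
          LoggedAt     datetime NOT NULL,
          RowsInserted int      NOT NULL
      );
      GO
      CREATE TRIGGER trg_cdc_batch_size
      ON cdc.dbo_Customer_CT        -- the CDC change table under test
      AFTER INSERT
      AS
      BEGIN
          -- one row per INSERT statement, recording how many rows it carried
          INSERT INTO dbo.cdcActivity (LoggedAt, RowsInserted)
          SELECT GETDATE(), COUNT(*) FROM inserted;
      END;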

    I then ran an update which affected 549 rows in one of the CDC-enabled tables.

    When I checked my cdcActivity table I didn't have 1098 rows (2 rows for each updated row, __$operation = 3 and __$operation = 4), each with a count of 1 (full RBAR), nor 549 rows each with a count of 2 (which would be the case if each source row's op 3 and op 4 rows were inserted into the change table at the same time). But I also didn't have one row with an inserted count of ~1098. I had 50 rows, with counts all in multiples of 6 (12, 18, 24, and 36... no rows with 30 as the count).
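
    (The distribution came straight out of the logging table:)

      -- Summarise the batch sizes the trigger observed
      SELECT RowsInserted, COUNT(*) AS Batches
      FROM dbo.cdcActivity
      GROUP BY RowsInserted
      ORDER BY RowsInserted;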

    How is SQL Server deciding to cut my transaction into batches when inserting into the change table? I don't know. Is there any way of changing the size of these batches? I don't know. The capture job can be configured a little via maxtrans and maxscans, but my update was a single transaction, so these don't seem relevant.
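
    For reference, these are the capture job knobs I mean (the values below are placeholders, not recommendations); they control how much log each scan cycle processes, which doesn't obviously map to the insert batch sizes I observed:

      -- Current capture job settings live in msdb
      SELECT job_type, maxtrans, maxscans, pollinginterval
      FROM msdb.dbo.cdc_jobs;

      -- Adjusting them (takes effect after the capture job restarts)
      EXEC sys.sp_cdc_change_job
          @job_type = N'capture',
          @maxtrans = 500,    -- transactions processed per scan cycle
          @maxscans = 10;     -- scan cycles per capture pass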

    Perhaps this is something that can be configured by playing with replication, given that CDC is using sp_replcmds. But if sp_replcmds is running batches of rows, then does that mean all of the advice about replication being RBAR is wrong? I doubt it; I expect that advice refers to the eventual inserts at the subscribers, and is not related to how the information is read from the log in the first place. But I can't find any in-depth information on all this.
