Get away from incrementally loading data

Question

Get away from incrementally loading data

eagle_usa

SSC Enthusiast

Points: 118
More actions
March 18, 2015 at 1:16 pm

#398207

Hello All,
I'm writing to see if you all can help me develop a strategy that can get me away from incrementally loading data.
I currently support 200+ ETLs that read from 100 different data warehoused databases. I inherited both the warehouse and the ETLs. The biggest issue with the existing ETLs is that they are all built around incrementally loaded data. What I've learned is incremental loading is great as long as the source data isn't ever revised, which for many of the systems is. To combat this, audit packages have been developed that compare production source to warehoused data. When unmatched, I manually run an ETL that'll correct the issue.
Our organization has decided it's time to move away from this strategy. We have procured the necessary resources to build a new data warehouse system. I’ve got a distributor server, ODS, and EDW. My original plan was to replicate databases to the distributor and then use CDC to get the data moved into the ODS & EDW. Someone told me this would work but I've learned it will not. So now I'm looking at other options, such as Change Tracking and Transactional Replication.
Based on what I've read, Change Tracking is a good option because it works with all versions. Its’ biggest issue is performance, where if changes are not harvested, database performance can seriously degrade. *I would be okay with the performance risk if it only affected the distributor/replication server. Then Transactional Replication is basically just a different type of replication, where you have a “warm standby.” I think it would work for our distributor server data, but wouldn’t help us with the main issue of getting around incrementally loading data.
Ideally, the solution involves us replicating all the databases to the distributor and then using some method (non-incremental) to get it from there to our ODS.
Can someone please point me in the right direction? Your time, thoughts and feedback are all very much appreciated.
Thank you
Jim
:smooooth:

Viewing 4 posts - 1 through 3 (of 3 total)

You must be logged in to reply to this topic. Login to reply

Lempster SSCoach Points: 15651 More actions · Answer 1

If you are considering using CDC or Transactional Replication then you are not really getting away from Incremental Loading are you?

Do you effectively want to completely re-populate your ODS or Data Warehouse every day/night/applicable period? That might be ok for smaller data volumes, but for much larger volumes it could take longer than the window in which the operation has to take place or it might lead to OLAs/SLAs being breached if data is not available at a certain time.

How are you doing your incremental loading currently? I infer from what you've posted that your source tables don't have a LastModifed column (or similar) which makes capturing inserts and updates easy, is that correct? In which case you can use checksums to detect when data has changed.

More difficult is handling data that exists in the target, but has been deleted from the source, but you can achieve this using t-sql MERGE or a Conditional Split transformation if you are using SSIS.

There is a school of thought that says that data should never be deleted from a Data Warehouse, but an ODS is somewhat different.

What is your underlying reason for wanting to get away from incremental loading because in my experience people usually want to do the opposite, i.e. move from 'trash and rebuild' to incremental loading.

Regards

Lempster

Alvin Ramard SSC-Forever Points: 41190 More actions · Answer 2

Jim, when you said the processes were doing incremental loads, did you mean that they were only loading new records and may have ignored old modified records?

Alvin Ramard
Memphis PASS Chapter[/url]

All my SSC forum answers come with a money back guarantee. If you didn't like the answer then I'll gladly refund what you paid for it.

For best practices on asking questions, please read the following article: Forum Etiquette: How to post data/code on a forum to get the best help[/url]

eagle_usa SSC Enthusiast Points: 118 More actions · Answer 3

First of all, thank you for the responses. I'll try to summarize with some Q&A.

1. What is my goal?

Develop a near real-time ODS.

2. What type of incremental loading are we currently using?

When we have a transaction date to work with, we use a date range. We keep it broad, where the date will be between -15 days to current date. If the table doesn't have a date, we use an auto-increment field, where we use the target's max value to then query the source. If neither of these are option, we truncate ODS and reload from source.

3. Why do we want to get away from date-based incremental loads?

Couple reasons. With us only going back 15 days on average, we will miss any data changes that were made beyond then. And with this (data updates) being somewhat common, it can be a real pain for me to manually run ETLs for specific (100 days ago) ranges. Second, we are pulling unchanged data which puts a load on both our source and destination systems.

4. Why not use a last modified date time?

We would love to use this; it'd solve this whole problem. We cannot though because in doing so (adding the column to source) we'd violate our database terms of contract.

5. Isn't Transactional Replication and CDC incremental loading?

Yes, now that I think of it, technically they are. I should have been more clear on this. The idea behind CDC is just so much better than date ranges because it only disperses changed/added data.

6. Do you effectively want to completely re-populate your ODS or Data Warehouse every day/night/applicable period?

Our preference is to add to or modify the ODS. If our only option is truncate an ODS table, we will do it that way. This methodology isn't horrible - the EDW still picks up the changes.

----

Our original plan was to use a distribution/replication server that had CDC enabled, where we would run CDC as frequently as possible. We knew that SQL Standard didn't offer this feature but assumed since the data had been replicated to an Enterprise version, CDC would work. Once we actually did this and attempted the CDC, it failed.

The idea of using a distribution server, where you centralize databases, is appealing for many reasons. Though I'm now wondering what, if any options we have with this configuration. Seems like once it's replicated, you're stuck with few options.