RE: Slowly changing dimensions using T-SQL MERGE

Say Hey Kid

Points: 707

June 30, 2011 at 7:32 am

Thanks for that suggestion, amenjonathan. What you mention is precisely what we are doing. We bring a copy of the production source tables over to a staging server and then use those copies to build our warehouse tables. For a number of reasons, mostly our agreement with the software vendor, we can't use replication, log shipping, etc. Many of these tables are very large, so bringing over entire copies of them each day is not desirable and would take too long. It is better to update/sync the existing tables with production. I should also mention that we use some of these copied tables for live reporting and to audit/track changes to some of the more important info, so they need to update throughout the day to stay in sync. Our production system stays up 24/7, so we have to use means that won't cause any performance issues with it. Since most of the data does not change from day to day, we thought the MERGE command would be very useful for updating those table copies on the staging server. It seemed to be a perfect fit, but it just wouldn't perform well enough.

As for the use of checksum, fortunately we have a last modified datetime field on every record, so we really don't need it. Performance-wise, the inserts and updates run very quickly because the changed rows are easily identified. The slow part of syncing is making sure you delete the records in the copy that have been deleted on production. To do that, I have to pull down a list of all the primary keys from the production table and compare those to the copy to make sure they still exist. I have sync procedures set up for each table, with parameters to control the types of updates allowed. We have a job that runs every 10 minutes to keep them all in sync for inserts and updates. Once every 3 hours it also performs any deletes, since that takes about 4 times as long to run as a sync without them. So far I've been amazed at how well it all works without impacting production at all. We ran traces for quite a while just to make sure. What I found out in doing this is just how differently a query can perform when run locally versus over a linked server. There can be huge differences even if the queries are simple.
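
For anyone who wants to try something similar, here is a rough sketch of the general shape of one of these per-table sync procedures. It is not our actual code, and every name in it is made up for illustration: the linked server PRODSRV, the database ProdDb, the table dbo.Customer, and the columns CustomerID (primary key), Name and LastModified. It assumes the last modified value is reliably updated on every change in production.

```sql
-- Illustrative sketch only. PRODSRV, ProdDb, dbo.Customer and all column
-- names are hypothetical placeholders, not real objects.
CREATE PROCEDURE dbo.SyncCustomer
    @AllowDeletes bit = 0   -- 0 = inserts/updates only, 1 = also remove deleted rows
AS
BEGIN
    SET NOCOUNT ON;

    -- High-water mark: the newest change already present in the local copy.
    DECLARE @Since datetime =
        (SELECT ISNULL(MAX(LastModified), '19000101') FROM dbo.Customer);

    -- Pull only the changed rows across the linked server into a local temp
    -- table so that every later join runs locally. >= is used so rows that
    -- share the high-water timestamp are not missed; re-reading them is harmless.
    SELECT CustomerID, Name, LastModified
    INTO #Changed
    FROM PRODSRV.ProdDb.dbo.Customer
    WHERE LastModified >= @Since;

    -- Update rows that already exist locally but are older than production.
    UPDATE tgt
    SET Name = src.Name,
        LastModified = src.LastModified
    FROM dbo.Customer AS tgt
    JOIN #Changed AS src
        ON src.CustomerID = tgt.CustomerID
    WHERE src.LastModified > tgt.LastModified;

    -- Insert rows that do not exist locally yet.
    INSERT INTO dbo.Customer (CustomerID, Name, LastModified)
    SELECT src.CustomerID, src.Name, src.LastModified
    FROM #Changed AS src
    WHERE NOT EXISTS (SELECT 1
                      FROM dbo.Customer AS tgt
                      WHERE tgt.CustomerID = src.CustomerID);

    IF @AllowDeletes = 1
    BEGIN
        -- The expensive part: fetch every primary key from production,
        -- then delete local rows whose key is no longer there.
        SELECT CustomerID
        INTO #ProdKeys
        FROM PRODSRV.ProdDb.dbo.Customer;

        DELETE tgt
        FROM dbo.Customer AS tgt
        WHERE NOT EXISTS (SELECT 1
                          FROM #ProdKeys AS k
                          WHERE k.CustomerID = tgt.CustomerID);
    END
END;
```

With something like this, the 10-minute job would call EXEC dbo.SyncCustomer @AllowDeletes = 0 and the 3-hour job would call it with @AllowDeletes = 1. Landing the remote rows in a temp table first is one way to keep the joins from running across the linked server. One thing a real version should guard against: if the key pull fails or comes back empty, the delete step would wipe the local copy, so a row-count sanity check before the DELETE would be a sensible addition.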