Slowly changing dimensions using T-SQL MERGE

Herman van Midden
Old Hand

Group: General Forum Members
Points: 363 Visits: 314
Great article, just what I needed. Thanks

I made a slight modification to the SCD2 variant, purely as a safeguard, to avoid erroneous data when the procedure is executed on a day earlier than the last update.

I admit this does not happen when using GETDATE() to set today, but my procedure takes a date as input, so I decided to make sure no updates are made to records which have ValidFrom > @Today.
So I added a predicate to the WHEN MATCHED condition:

WHEN MATCHED
AND IsCurrent = 1
AND ValidFrom <= @Today --Never allow updates to overwrite future records
AND (ISNULL(DST.ClientName,'') <> ISNULL(SRC.ClientName,'')
...
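
For anyone who wants to see it in context, this is roughly how the complete branch looks. A sketch only: I have abbreviated the column list to ClientName, and the table names follow the article's naming, so treat the rest as assumptions.

DECLARE @Today DATE = GETDATE();  -- in my procedure this is an input parameter

MERGE dbo.Client_SCD2 AS DST
USING dbo.Client_Source AS SRC
    ON SRC.BusinessKey = DST.BusinessKey
WHEN MATCHED
    AND DST.IsCurrent = 1
    AND DST.ValidFrom <= @Today  -- never allow updates to overwrite future records
    AND ISNULL(DST.ClientName,'') <> ISNULL(SRC.ClientName,'')
THEN UPDATE SET
    DST.IsCurrent = 0,       -- expire the current version...
    DST.ValidTo   = @Today;  -- ...the full version then re-inserts the new row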



Thanks once more
Herman
dawalker-1068762
SSC Rookie

Group: General Forum Members
Points: 39 Visits: 106
Thanks for that suggestion amenjonathan. What you mention is precisely what we are doing. We are bringing a copy of the production source tables onto a staging server and then using those copies to build our warehouse tables. For a number of reasons we can't use replication, log shipping, etc., mostly per our agreement with the software vendor. Many of these tables are very large, so bringing over entire copies of them each day is not desirable and would take too long. It is better to update/sync the existing tables with production.

I should also mention that we use some of these copied tables for live reporting, and we also audit/track changes to some of the more important information, so they need to update throughout the day to stay in sync. Our production system stays up 24/7, so we have to use means that won't cause any performance issues with it. Since most of the data does not change from day to day, that's where we thought the MERGE command would be very useful for updating those table copies on the staging server. It seemed to be a perfect fit, but it just wouldn't perform well enough.

As for the use of checksum, fortunately we have a last-modified datetime field on every record, so we really don't need it. Performance-wise, the inserts and updates run very quickly because they are easily identified. The slow part of syncing is making sure you delete the records in the copy that have also been deleted in production. To do that, I have to pull down a list of all the primary keys from the production table and compare them to the copy to make sure they still exist.

I have sync procedures set up for each table, with parameters to control the types of updates allowed. We have a job that runs every 10 minutes to keep them all in sync for inserts and updates. Once every 3 hours it also performs any deletes, since that takes about 4 times as long to run as a sync without them. So far I've been amazed at how well it all works without impacting production at all; we ran traces for quite a while just to make sure. What I found out in doing this is just how differently a query can perform when run locally versus over a linked server. There can be huge differences even if the queries are simple.
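
For anyone curious, the delete step boils down to something like this (all names are made up; Prod stands for the linked server). Pulling the keys into a local temp table first matters a great deal, because comparing directly against the remote table over the linked server is where the performance falls apart:

-- stage the production keys locally, then delete local rows that no longer exist
SELECT ClientID
INTO #ProdKeys
FROM Prod.SourceDb.dbo.Client;

DELETE C
FROM dbo.Client AS C
WHERE NOT EXISTS (SELECT 1 FROM #ProdKeys AS K WHERE K.ClientID = C.ClientID);

DROP TABLE #ProdKeys;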
Adam Aspin
SSC-Addicted

Group: General Forum Members
Points: 445 Visits: 1032
Hi dawalker,

This is extremely interesting, and thanks for sharing.
I will try and find the time to look closer at this particular problem, as I had no idea that the speed difference was so great.

Adam
gxkd64
Forum Newbie

Group: General Forum Members
Points: 1 Visits: 11
Adam,

Fantastic article. Your writing style is very concise and informative. I am forwarding this to all my teams, great job!

Scott
Larry Kruse
SSC Veteran

Group: General Forum Members
Points: 227 Visits: 756
Adam,

Thank you very much for this article. I have been searching for a more intelligent way to handle massive, bulk-loaded datasets of user information. We have customers that basically send us entire refreshes of their data at daily and weekly frequencies, and others that only send us updates. I need to speed up this process tremendously and stop performing delete-then-insert operations on entire groups of users. Instead, I need to post updates only when data really changes from load to load.

I was particularly thrilled with one of the speed enhancements suggested by others: using CHECKSUM() over the range of data as opposed to string compares everywhere. I will be testing it out both ways to see if I can quantify the differences in performance. Some of our data loads are small, 2,000 records or less, but we have some problem children in the 200,000+ range.
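
This is the shape of the comparison I plan to test (a sketch only, with made-up table and column names; note that CHECKSUM can collide, so HASHBYTES would be the safer, if slower, choice):

MERGE dbo.Users AS DST
USING staging.Users AS SRC
    ON SRC.UserID = DST.UserID
WHEN MATCHED
    -- one checksum comparison instead of a string compare per column
    AND CHECKSUM(DST.FirstName, DST.LastName, DST.Email)
     <> CHECKSUM(SRC.FirstName, SRC.LastName, SRC.Email)
    THEN UPDATE SET DST.FirstName = SRC.FirstName,
                    DST.LastName  = SRC.LastName,
                    DST.Email     = SRC.Email
WHEN NOT MATCHED THEN
    INSERT (UserID, FirstName, LastName, Email)
    VALUES (SRC.UserID, SRC.FirstName, SRC.LastName, SRC.Email);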

Very good article, and I will make a point to search for more of these from you...your coding style makes it very easy to read and understand the T-SQL code.

Larry
Larry Kruse
SSC Veteran

Group: General Forum Members
Points: 227 Visits: 756
In trying to use this article to tailor a solution for my organization, I have encountered a bit of a snag.

Our "user" information is kept in two tables which we call UserAuthentication and Users. The UserAuthentication table contains the IDENTITY column and it gets created first, its IDENTITY value is captured after the INSERT with a SCOPE_IDENTITY() and then this value is used when creating a new Users record using an INSERT statement.

Assuming that I have all of the records to be merged in a staging table which contains fields destined for both the UserAuthentication and Users tables, how might I most efficiently utilize the MERGE statements to create records in both tables? Should I structure a stored procedure to handle this as MERGE against the UserAuthentication table first, then follow it with a MERGE against the Users table?

Would another solution be to perform the MERGE against the UserAuthentication table, then capture the resulting INSERT records into a temp table that I can use as a basis for INSERTS on the Users table?
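
To make the second option concrete, here is the sort of thing I have in mind (a sketch only; all table and column names beyond UserAuthentication and Users are made up). MERGE's OUTPUT clause, unlike INSERT's, can reference source columns, so the staging key can ride along with the new IDENTITY value:

CREATE TABLE #NewAuth (StagingID INT, UserAuthenticationID INT);

MERGE dbo.UserAuthentication AS DST
USING dbo.UserStaging AS SRC
    ON SRC.LoginName = DST.LoginName
WHEN NOT MATCHED THEN
    INSERT (LoginName, PasswordHash)
    VALUES (SRC.LoginName, SRC.PasswordHash)
OUTPUT SRC.StagingID, INSERTED.UserAuthenticationID
    INTO #NewAuth (StagingID, UserAuthenticationID);

-- the captured IDENTITY values then drive the Users inserts
INSERT INTO dbo.Users (UserAuthenticationID, FirstName, LastName)
SELECT NA.UserAuthenticationID, S.FirstName, S.LastName
FROM #NewAuth AS NA
JOIN dbo.UserStaging AS S ON S.StagingID = NA.StagingID;

DROP TABLE #NewAuth;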
David Walker-278941
SSC-Enthusiastic

Group: General Forum Members
Points: 109 Visits: 231
In the Type 4 example, I see that you have a "ValidFrom" column in Client_SCD4_History. But it never gets populated, since Client_SCD1 doesn't have a ValidFrom date.

I see you are populating the ValidTo date in the type 4 history table, which is good.

Presumably, if you're using Type 4, all records in the main table (Client_SCD1) are "currently" valid. It would be nice to see the code that populates the ValidFrom date in the Type 4 history table. But since the Type 4 logic is driven by what the Type 1 table holds, plus the history table, the ValidFrom date from the last time the record was updated hasn't been kept anywhere; the Type 1 table doesn't have it.

Am I right that there's something missing here?

David Walker
Adam Aspin
SSC-Addicted

Group: General Forum Members
Points: 445 Visits: 1032
You are quite right, David, this should have been added to ensure completeness. Thanks for noticing this.
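
For anyone reading along, one way to plug the gap is to derive ValidFrom for each history row from the previous history row's ValidTo. This is a sketch only: the BusinessKey column and the initial-load date are assumptions, and the cleaner long-term fix is simply to carry a ValidFrom column on the main table as well.

-- backfill ValidFrom from the previous history row's ValidTo for the same key
UPDATE H
SET ValidFrom = ISNULL(
    (
        SELECT MAX(Prev.ValidTo)
        FROM Client_SCD4_History AS Prev
        WHERE Prev.BusinessKey = H.BusinessKey
          AND Prev.ValidTo < H.ValidTo
    ),
    '2005-01-01')  -- assumed initial-load date when no earlier history exists
FROM Client_SCD4_History AS H
WHERE H.ValidFrom IS NULL;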
Paul Hernández
Old Hand

Group: General Forum Members
Points: 376 Visits: 661
Hi everyone, and thank you Adam for this great article.

I'm wondering if it is possible to have a mixed approach in one table, something like a Type 1 and Type 2 SCD in the same table. Suppose that in the customer dimension you have a couple of columns that change very often and some others that rarely do. This is not a typical "slowly changing" scenario, but if for any reason you have to face this situation, what would you do?

I would like to update only the columns that frequently change (Type 1) and record historical values for the other columns (Type 2). I think this is possible with the SSIS Slowly Changing Dimension transformation, but due to its performance issues it is not possible for me to use it. In that transformation you can mark each column as a "Changing Attribute", "Historical Attribute" or "Fixed Attribute".

Any ideas or clues about how to accomplish this with the MERGE statement? At the moment I've created two tables: one to store the columns that change often, and one to be treated as a classic SCD.
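
To make the question concrete, here is roughly the shape I am after, with made-up columns (Phone as the Type 1 attribute, Address as the Type 2 attribute). A MERGE statement can have only one WHEN MATCHED ... THEN UPDATE clause, so a CASE has to pick the behaviour per row:

DECLARE @Today DATE = GETDATE();

INSERT INTO dbo.Customer (BusinessKey, Phone, Address, ValidFrom, IsCurrent)
SELECT MRG.BusinessKey, MRG.Phone, MRG.Address, @Today, 1
FROM
(
    MERGE dbo.Customer AS DST
    USING staging.Customer AS SRC
        ON SRC.BusinessKey = DST.BusinessKey AND DST.IsCurrent = 1
    WHEN MATCHED AND (   ISNULL(DST.Phone,'')   <> ISNULL(SRC.Phone,'')
                      OR ISNULL(DST.Address,'') <> ISNULL(SRC.Address,''))
    THEN UPDATE SET
        -- Type 1: overwrite in place (left untouched on a row being expired)
        DST.Phone = CASE WHEN ISNULL(DST.Address,'') <> ISNULL(SRC.Address,'')
                         THEN DST.Phone ELSE SRC.Phone END,
        -- Type 2: expire the current row; the outer INSERT recreates it
        DST.IsCurrent = CASE WHEN ISNULL(DST.Address,'') <> ISNULL(SRC.Address,'')
                             THEN 0 ELSE DST.IsCurrent END,
        DST.ValidTo = CASE WHEN ISNULL(DST.Address,'') <> ISNULL(SRC.Address,'')
                           THEN @Today ELSE DST.ValidTo END
    OUTPUT SRC.BusinessKey, SRC.Phone, SRC.Address,
           $action AS MergeAction, INSERTED.IsCurrent AS StillCurrent
) AS MRG
WHERE MRG.MergeAction = 'UPDATE'
  AND MRG.StillCurrent = 0;  -- reinsert only the rows expired by a Type 2 change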

Any comment would be appreciated.

Paul Hernández
barkeylives
Forum Newbie

Group: General Forum Members
Points: 1 Visits: 9
Found this when looking for good examples. Great article! I noticed that one of the fields is left out of the WHEN MATCHED clause in the SCD2 version. "Country" is there, but not "County". If you modify any of the counties in the source table, it won't create a new record.