Slowly changing dimensions using T-SQL MERGE

Question

Post reply

Slowly changing dimensions using T-SQL MERGE

Adam Aspin

SSCarpal Tunnel

Points: 4890
More actions
June 20, 2011 at 12:00 am

#239450

Comments posted to this topic are about the item Slowly changing dimensions using T-SQL MERGE

Viewing 15 posts - 1 through 15 (of 33 total)

You must be logged in to reply to this topic. Login to reply

Franky Leeuwerck SSCrazy Points: 2036 More actions · Answer 1

Great article Adam !

Thanx for sharing.

Franky

Franky L.

dawalker-1068762 Say Hey Kid Points: 707 More actions · Answer 2

We have been working on a new data warehouse and I thought the MERGE command was going to be a real game changer for us. I did some initial testing using small tables and it worked great. I was ready to convert all of my code over to this method when we hit a snag. Since we are merging from one server over to another we are accessing the sources tables over linked servers. As the tables got larger the merge command was no longer a viable option for us. The performance was so bad it was just unusable. In many cases I was never even able to get the command to complete and had to kill the process. I was able to write my own SQL procedures that accomplished the same thing in a very reasonable amount of time albeit a little more lengthy and less elegant. Since we are merging about 170 tables hourly throughout the day speed is very important. I never did a comparison to see if it worked with the large tables on the same server so I can't speak to that, but my experience with the linked servers was not good. I was wondering if anyone else had tried this and had similar frustrations or even better yet had gotten it to work.

David Greenberg-358875 SSC Enthusiast Points: 158 More actions · Answer 3

Very Good Article. I have a question. Suppose you have a dimension that contains certain columns that are type 2 where you want to keep track of historical changes. However, the dimension also contains type 1 columns where you just want to overwite existing values without creating a new record with a different surrogate key.

For example, I have a customer record that contains an address field. If the address changes, I want to handle it as a type 2 change. However, the record also contains a phone number which I just want updated in place when it changes but does not require the creation of a new record with a different surrogate key.

Can you handle this scenario within the same merge statement?

al kv SSC Enthusiast Points: 123 More actions · Answer 4

al kv

SSC Enthusiast

Points: 123

June 20, 2011 at 7:37 am

#1341191

where can I get the CarSales db?

Adam Aspin SSCarpal Tunnel Points: 4890 More actions · Answer 5

Hi Al Kv,

Sorr if I was not very clear - any database can be used, as long as it contains the tables given in the script. I created an empty database called CarSales for this purpose.

Regards,

Adam

Adam Aspin SSCarpal Tunnel Points: 4890 More actions · Answer 6

Hi dwalker,

I have used MERGE succesfully on multi-million record tables - but, indeed, only when all the databases were on the same server. I use delta data detection techniques when using linked servers or, as you say, it is infernal.

Thanks for the feedback,

Adam

DEK46656 SSCrazy Points: 2111 More actions · Answer 7

David Greenberg-358875 (6/20/2011)
Suppose you have a dimension that contains certain columns that are type 2 where you want to keep track of historical changes. However, the dimension also contains type 1 columns where you just want to overwite existing values without creating a new record with a different surrogate key.

I use a snowflake approach, as least in the data warehouse area: once you get to SSAS you can break it back out via a view if you wish. Anything for a dimension that is type 1 goes into its own table and is joined to the type 2 table. If the type 1 attributes change, you MERGE the new values into the T1 table, then update the key field in the type 2 table as needed.

Beer's Law: Absolutum obsoletum
"if it works it's out-of-date"

al kv SSC Enthusiast Points: 123 More actions · Answer 8

al kv

SSC Enthusiast

Points: 123

June 20, 2011 at 8:08 am

#1341210

Adam,

Thank you.

DEK46656 SSCrazy Points: 2111 More actions · Answer 9

I would suggest an alternative piece of code in one area you are using. When checking your type 2 attributes for change, you are using a lot of “not equal” checks, and transforming each attribute with a ISNULL.

This type of work falls into an area called “data fingerprinting”: checking to see if the finger print is different or not. Try wrapping the attributes you want checked in a CHECKSUM function, either individually listing each column or the entire record:

CHECKSUM(DST.ClientName, DST.Country, DST.Town,

DST.Address1, DST.Address2, DST.ClientType, DST.ClientSize) <>

CHECKSUM(SRC.ClientName, SRC.Country, SRC.Town,

SRC.Address1, SRC.Address2, SRC.ClientType, SRC.ClientSize)

Or

CHECKSUM(DST.*) <> CHECKSUM(SRC.*)

I typically use the individual fields approach myself.

Beer's Law: Absolutum obsoletum
"if it works it's out-of-date"

Adam Aspin SSCarpal Tunnel Points: 4890 More actions · Answer 10

Hi DEK46656,

You are absolutely right that there are many other ways of comparing data - I prefer SHA1 hashes, if possible in computed columns, myself. Here, of course, the accent was on the more basic MERGE concept, but you are spot on to point out that this is only a starting point, and that there is such a lot that you can do with the basic concept to make it really efficient and useful.

Regards,

Adam

yosiasz SSCommitted Points: 1699 More actions · Answer 11

I concur what dawalker-1068762 has said. MERGE is a wonderful command but use carefully. If the source (SRC ) table is small and light it is feasable and nice to use. But if the table has millions of rows then you will hit a performance snag.

To get around this we are using MERGE along with a batch process. User very carefully! and design things in batches.

Wish MERGE had a parameter to do batch processing!

amenjonathan SSCrazy Points: 2482 More actions · Answer 12

In my experience, you should always stage your data on the local server before doing any manipulation and / or merging. So for dawalker, extract new information in native form from the source server to your warehouse server before you transform and load.

I also use checksum to detect changes. Much faster!

-------------------------------------------------------------------------------------------------
My SQL Server Blog

gary.strange-sqlconsumer SSCommitted Points: 1816 More actions · Answer 13

Great article Adam.

Regarding cross server merging. Isn't it best to use a ETL tool for ETL tasks, making use of in-memory caching, instead of a linked server?

Regarding checksum comparisons how do you handle collisions. I.E. the "Birthday Paradox"? I've used 'md5' or 'sha*' in the past but never checksum because of the increased collision probability. Do you have code that resolves collisions?

Chris Harshman SSC-Forever Points: 42192 More actions · Answer 14

David Greenberg-358875 (6/20/2011)
Very Good Article. I have a question. Suppose you have a dimension that contains certain columns that are type 2 where you want to keep track of historical changes. However, the dimension also contains type 1 columns where you just want to overwite existing values without creating a new record with a different surrogate key...Can you handle this scenario within the same merge statement?

The short answer is no. The problem is, in the MERGE construct, you can only have one WHEN MATCHED section. This method uses the WHEN MATCHED section to mark the "current" record in your slowly changing dimension table as no longer current, then in the OUTPUT clause the new values for this record are sent to the INSERT statement "outside" of the MERGE. We use this method for our data warehouse, but our ODS is on the same server.

Snowflaking your dimensional model as was suggested may be tempting, but in essence what you're doing then is optimizing your model for the ETL instead of the reporting, and it is likely that the extra join will hurt performance when it is queried.