Click here to monitor SSC
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


Slowly changing dimensions using T-SQL MERGE


Slowly changing dimensions using T-SQL MERGE

Author
Message
Adam Aspin
Adam Aspin
SSC-Enthusiastic
SSC-Enthusiastic (183 reputation)SSC-Enthusiastic (183 reputation)SSC-Enthusiastic (183 reputation)SSC-Enthusiastic (183 reputation)SSC-Enthusiastic (183 reputation)SSC-Enthusiastic (183 reputation)SSC-Enthusiastic (183 reputation)SSC-Enthusiastic (183 reputation)

Group: General Forum Members
Points: 183 Visits: 1028
Hi DEK46656,

You are absolutely right that there are many other ways of comparing data - I prefer SHA1 hashes, if possible in computed columns, myself. Here, of course, the accent was on the more basic MERGE concept, but you are spot on to point out that this is only a starting point, and that there is such a lot that you can do with the basic concept to make it really efficient and useful.

Regards,

Adam
yosiasz
yosiasz
Valued Member
Valued Member (65 reputation)Valued Member (65 reputation)Valued Member (65 reputation)Valued Member (65 reputation)Valued Member (65 reputation)Valued Member (65 reputation)Valued Member (65 reputation)Valued Member (65 reputation)

Group: General Forum Members
Points: 65 Visits: 821
I concur what dawalker-1068762 has said. MERGE is a wonderful command but use carefully. If the source (SRC ) table is small and light it is feasable and nice to use. But if the table has millions of rows then you will hit a performance snag.

To get around this we are using MERGE along with a batch process. User very carefully! and design things in batches.

Wish MERGE had a parameter to do batch processing!
amenjonathan
amenjonathan
SSC-Enthusiastic
SSC-Enthusiastic (136 reputation)SSC-Enthusiastic (136 reputation)SSC-Enthusiastic (136 reputation)SSC-Enthusiastic (136 reputation)SSC-Enthusiastic (136 reputation)SSC-Enthusiastic (136 reputation)SSC-Enthusiastic (136 reputation)SSC-Enthusiastic (136 reputation)

Group: General Forum Members
Points: 136 Visits: 434
In my experience, you should always stage your data on the local server before doing any manipulation and / or merging. So for dawalker, extract new information in native form from the source server to your warehouse server before you transform and load.

I also use checksum to detect changes. Much faster!

-------------------------------------------------------------------------------------------------
My SQL Server Blog
gary.strange-sqlconsumer
gary.strange-sqlconsumer
SSC Journeyman
SSC Journeyman (75 reputation)SSC Journeyman (75 reputation)SSC Journeyman (75 reputation)SSC Journeyman (75 reputation)SSC Journeyman (75 reputation)SSC Journeyman (75 reputation)SSC Journeyman (75 reputation)SSC Journeyman (75 reputation)

Group: General Forum Members
Points: 75 Visits: 610
Great article Adam.

Regarding cross server merging. Isn't it best to use a ETL tool for ETL tasks, making use of in-memory caching, instead of a linked server?

Regarding checksum comparisons how do you handle collisions. I.E. the "Birthday Paradox"? I've used 'md5' or 'sha*' in the past but never checksum because of the increased collision probability. Do you have code that resolves collisions?
Chris Harshman
Chris Harshman
SSCrazy
SSCrazy (2.7K reputation)SSCrazy (2.7K reputation)SSCrazy (2.7K reputation)SSCrazy (2.7K reputation)SSCrazy (2.7K reputation)SSCrazy (2.7K reputation)SSCrazy (2.7K reputation)SSCrazy (2.7K reputation)

Group: General Forum Members
Points: 2664 Visits: 3247
David Greenberg-358875 (6/20/2011)
Very Good Article. I have a question. Suppose you have a dimension that contains certain columns that are type 2 where you want to keep track of historical changes. However, the dimension also contains type 1 columns where you just want to overwite existing values without creating a new record with a different surrogate key...Can you handle this scenario within the same merge statement?


The short answer is no. The problem is, in the MERGE construct, you can only have one WHEN MATCHED section. This method uses the WHEN MATCHED section to mark the "current" record in your slowly changing dimension table as no longer current, then in the OUTPUT clause the new values for this record are sent to the INSERT statement "outside" of the MERGE. We use this method for our data warehouse, but our ODS is on the same server.

Snowflaking your dimensional model as was suggested may be tempting, but in essence what you're doing then is optimizing your model for the ETL instead of the reporting, and it is likely that the extra join will hurt performance when it is queried.
amenjonathan
amenjonathan
SSC-Enthusiastic
SSC-Enthusiastic (136 reputation)SSC-Enthusiastic (136 reputation)SSC-Enthusiastic (136 reputation)SSC-Enthusiastic (136 reputation)SSC-Enthusiastic (136 reputation)SSC-Enthusiastic (136 reputation)SSC-Enthusiastic (136 reputation)SSC-Enthusiastic (136 reputation)

Group: General Forum Members
Points: 136 Visits: 434
Actually you can do both. Put all the fields you wish to carry history on in the ON clause for matching. This will tell the MERGE you want a new record any time there is something that doesn't match, which is when you have new information in a field you wish to carry history on. For update only fields put an update clause on the fields you wish to update in the when matched clause. For static fields, just leave them in the insert. Don't make them part of the update.

Easier if I just show you the template I use on my MERGE statements:


--JA: MERGE source with destination
MERGE dbo.DimCustomer Destination
USING --JA: source
   (
   SELECT
      DC.[HistoryCol1]
      ,DC.[UpdateOnlyCol1]
      ,DC.[HistoryCol2]
      ,DC.[HistoryCol3]
      ,DC.[UpdateOnlyCol2]
      ,DC.[StaticCol1]
   FROM dbo.DimCustomerStage DC
   Wink
AS [Source]
   (
   [HistoryCol1]
   ,[UpdateOnlyCol1]
   ,[HistoryCol2]
   ,[HistoryCol3]
   ,[UpdateOnlyCol2]
   ,[StaticCol1]
   Wink
ON (
   [Source].[HistoryCol1]      = Destination.[HistoryCol1]
   AND [Source].[HistoryCol2]   = Destination.[HistoryCol2]
   AND [Source].[HistoryCol3]   = Destination.[HistoryCol3]
   Wink
WHEN MATCHED --JA: here i specify that in order to qualify for update, the binary sum of the two rows must not be equal (a change has occured in source) - checksum is faster than checking every column individually
   AND BINARY_CHECKSUM
         (
         Destination.[UpdateOnlyCol1]
         ,Destination.[UpdateOnlyCol2]
         Wink
      != BINARY_CHECKSUM
         (
         [Source].[UpdateOnlyCol1]
         ,[Source].[UpdateOnlyCol2]
         Wink
   THEN UPDATE SET --JA: put only the columns you wish to update here - less = better performance
      Destination.[UpdateOnlyCol1]   = [Source].[UpdateOnlyCol1]
      ,Destination.[UpdateOnlyCol2]   = [Source].[UpdateOnlyCol2]
WHEN NOT MATCHED BY TARGET THEN --JA: when no matching id can be found, insert
   INSERT
      (
      [HistoryCol1]
      ,[UpdateOnlyCol1]
      ,[HistoryCol2]
      ,[HistoryCol3]
      ,[UpdateOnlyCol2]
      ,[StaticCol1]
      Wink
   VALUES
      (
      [Source].[HistoryCol1]
      ,[Source].[UpdateOnlyCol1]
      ,[Source].[HistoryCol2]
      ,[Source].[HistoryCol3]
      ,[Source].[UpdateOnlyCol2]
      ,[Source].[StaticCol1]
      Wink
; --JA: MERGE must be terminated by a semicolon


Also if you plan on cubing your warehouse I've been told by Will Brown, a Microsoft expert in SSAS, you get some performance benefit from snowflaking your warehouse because then the cubes you build from it don't have to process as many distincts during cube processing. I guess it depends on how much of your querying will be via your cube versus directly from your warehouse.

-------------------------------------------------------------------------------------------------
My SQL Server Blog
orilefka
orilefka
Forum Newbie
Forum Newbie (6 reputation)Forum Newbie (6 reputation)Forum Newbie (6 reputation)Forum Newbie (6 reputation)Forum Newbie (6 reputation)Forum Newbie (6 reputation)Forum Newbie (6 reputation)Forum Newbie (6 reputation)

Group: General Forum Members
Points: 6 Visits: 29
Thank you! Great article!

I have question about SCD type 2.

Can we compare all columns in ON clause like this:

MERGE INTO dbo.Client_SCD2 AS DST
USING dbo.Client AS SRC
ON SRC.ID = DST.BusinessKey
AND DST.ClientName=SRC.ClientName
AND DST.Country = SRC.Country
AND DST.Town=SRC.Town
AND DST.Address1=SRC.Address1
AND DST.Address2=SRC.Address2
AND DST.ClientType=SRC.ClientType
AND DST.ClientSize=SRC.ClientSize



and then use
WHEN NOT MATCHED BY SOURCE THEN
UPDATE SET DST.IsCurrent = 0, DST.ValidTo = @Yesterday



rather than having this part of code:

WHEN MATCHED
AND IsCurrent = 1
AND (
ISNULL(DST.ClientName,'') <> ISNULL(SRC.ClientName,'')
OR ISNULL(DST.Country,'') <> ISNULL(SRC.Country,'')
OR ISNULL(DST.Town,'') <> ISNULL(SRC.Town,'')
OR ISNULL(DST.Address1,'') <> ISNULL(SRC.Address1,'')
OR ISNULL(DST.Address2,'') <> ISNULL(SRC.Address2,'')
OR ISNULL(DST.ClientType,'') <> ISNULL(SRC.ClientType,'')
OR ISNULL(DST.ClientSize,'') <> ISNULL(SRC.ClientSize,'')
)


amenjonathan
amenjonathan
SSC-Enthusiastic
SSC-Enthusiastic (136 reputation)SSC-Enthusiastic (136 reputation)SSC-Enthusiastic (136 reputation)SSC-Enthusiastic (136 reputation)SSC-Enthusiastic (136 reputation)SSC-Enthusiastic (136 reputation)SSC-Enthusiastic (136 reputation)SSC-Enthusiastic (136 reputation)

Group: General Forum Members
Points: 136 Visits: 434
Probably won't work because what if your key's don't match? I would imagine they won't all be the same key, so you'll then end the time period for every row in your dimension and make all records not active?

First you have to have a match on a key. At that point you can check to see if the other fields do not match.

-------------------------------------------------------------------------------------------------
My SQL Server Blog
orilefka
orilefka
Forum Newbie
Forum Newbie (6 reputation)Forum Newbie (6 reputation)Forum Newbie (6 reputation)Forum Newbie (6 reputation)Forum Newbie (6 reputation)Forum Newbie (6 reputation)Forum Newbie (6 reputation)Forum Newbie (6 reputation)

Group: General Forum Members
Points: 6 Visits: 29
OK.Thank you. This is the whole code i wanted to use:
MERGE INTO dbo.Client_SCD2 AS DST
USING dbo.Client AS SRC
ON SRC.ID = DST.BusinessKey
AND DST.ClientName=SRC.ClientName
AND DST.Country = SRC.Country
AND DST.Town=SRC.Town
AND DST.Address1=SRC.Address1
AND DST.Address2=SRC.Address2
AND DST.ClientType=SRC.ClientType
AND DST.ClientSize=SRC.ClientSize
WHEN NOT MATCHED BY TARGET THEN
INSERT (BusinessKey, ClientName, Country, Town, County, Address1, Address2, ClientType, ClientSize, ValidFrom, IsCurrent)
VALUES (SRC.ID, SRC.ClientName, SRC.Country, SRC.Town, SRC.County, Address1, Address2, ClientType, ClientSize, @Today, 1)
WHEN NOT MATCHED BY SOURCE THEN
UPDATE SET DST.IsCurrent = 0, DST.ValidTo = @Yesterday;



Also i suppose it will have problems with NULL values. And this UPDATE will always go through all records in target table that were deactivated before.
amenjonathan
amenjonathan
SSC-Enthusiastic
SSC-Enthusiastic (136 reputation)SSC-Enthusiastic (136 reputation)SSC-Enthusiastic (136 reputation)SSC-Enthusiastic (136 reputation)SSC-Enthusiastic (136 reputation)SSC-Enthusiastic (136 reputation)SSC-Enthusiastic (136 reputation)SSC-Enthusiastic (136 reputation)

Group: General Forum Members
Points: 136 Visits: 434
I don't see an update statement in your MERGE.

Not sure about how to handle NULLS. My whole warehouse is NOT NULL.

-------------------------------------------------------------------------------------------------
My SQL Server Blog
Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum

































































































































































SQLServerCentral


Search