Slowly changing dimensions using T-SQL MERGE

Posted Monday, June 20, 2011 8:16 AM
Hi DEK46656,

You are absolutely right that there are many other ways of comparing data - I prefer SHA1 hashes, if possible in computed columns, myself. Here, of course, the accent was on the more basic MERGE concept, but you are spot on to point out that this is only a starting point, and that there is such a lot that you can do with the basic concept to make it really efficient and useful.
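
Just to illustrate, here is a minimal sketch of the computed-column idea - the table and column names are hypothetical, and it assumes the tracked columns are all character types:

-- Persisted computed column holding a SHA1 hash of the tracked attributes, so the
-- hash is calculated once when the row is written rather than on every comparison.
-- (Note that HASHBYTES input is limited to 8000 bytes on SQL Server 2008/2008 R2.)
ALTER TABLE dbo.Client_SCD2 ADD RowHash AS
    HASHBYTES('SHA1',
        ISNULL(ClientName, '') + '|' +
        ISNULL(Country, '')    + '|' +
        ISNULL(Town, '')       + '|' +
        ISNULL(Address1, '')   + '|' +
        ISNULL(Address2, '')   + '|' +
        ISNULL(ClientType, '') + '|' +
        ISNULL(ClientSize, ''))
    PERSISTED;

Change detection then compares the two RowHash values instead of comparing every column individually.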

Regards,

Adam
Post #1128273
Posted Monday, June 20, 2011 8:34 AM
I concur with what dawalker-1068762 has said. MERGE is a wonderful command, but use it carefully. If the source (SRC) table is small and light it is feasible and pleasant to use, but if the table has millions of rows you will hit a performance snag.

To get around this we use MERGE along with a batch process. Use it very carefully, and design things in batches.

I wish MERGE had a parameter to do batch processing!
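
For what it's worth, this is roughly the kind of batching we mean - driving the MERGE from a bounded slice of the source inside a loop. The table and column names are made up for the example, it assumes an integer business key, and the SCD logic is trimmed to a plain upsert so the batching pattern stays visible:

DECLARE @BatchSize int = 50000,
        @MinID     int,
        @MaxID     int;

SELECT @MinID = MIN(ID), @MaxID = MAX(ID)
FROM   dbo.ClientStage;                    -- hypothetical staged source table

WHILE @MinID <= @MaxID
BEGIN
    MERGE dbo.Client_SCD2 AS DST
    USING (SELECT ID, ClientName
           FROM dbo.ClientStage
           WHERE ID BETWEEN @MinID AND @MinID + @BatchSize - 1) AS SRC   -- only this slice of the source
       ON SRC.ID = DST.BusinessKey
    WHEN MATCHED AND DST.ClientName <> SRC.ClientName THEN
        UPDATE SET DST.ClientName = SRC.ClientName
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (BusinessKey, ClientName)
        VALUES (SRC.ID, SRC.ClientName);
    -- deliberately no WHEN NOT MATCHED BY SOURCE clause: it would touch target rows outside the slice

    SET @MinID = @MinID + @BatchSize;
END;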
Post #1128290
Posted Monday, June 20, 2011 9:01 AM
In my experience, you should always stage your data on the local server before doing any manipulation and/or merging. So for dawalker: extract the new information in its native form from the source server to your warehouse server before you transform and load.

I also use CHECKSUM to detect changes. Much faster!
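
As a rough illustration of what that looks like (object names are hypothetical), the staging step is a straight copy, and the change detection then runs entirely on the local box:

-- 1) Land the remote rows as-is in a local staging table - no transformation yet.
INSERT INTO staging.Client (ID, ClientName, Country, Town, Address1, Address2, ClientType, ClientSize)
SELECT ID, ClientName, Country, Town, Address1, Address2, ClientType, ClientSize
FROM   SOURCESRV.SourceDb.dbo.Client;      -- hypothetical linked-server four-part name

-- 2) Find the rows that actually changed, using CHECKSUM as a cheap filter.
SELECT s.ID
FROM   staging.Client AS s
JOIN   dbo.Client_SCD2 AS d
  ON   d.BusinessKey = s.ID
 AND   d.IsCurrent = 1
WHERE  CHECKSUM(s.ClientName, s.Country, s.Town, s.Address1, s.Address2, s.ClientType, s.ClientSize)
    <> CHECKSUM(d.ClientName, d.Country, d.Town, d.Address1, d.Address2, d.ClientType, d.ClientSize);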


-------------------------------------------------------------------------------------------------
My SQL Server Blog
Post #1128313
Posted Monday, June 20, 2011 2:22 PM
Great article, Adam.

Regarding cross-server merging: isn't it best to use an ETL tool for ETL tasks, making use of in-memory caching, instead of a linked server?

Regarding checksum comparisons, how do you handle collisions, i.e. the "birthday paradox"? I've used MD5 or SHA* in the past but never CHECKSUM, because of the increased collision probability. Do you have code that resolves collisions?
Post #1128547
Posted Tuesday, June 21, 2011 9:37 AM


David Greenberg-358875 (6/20/2011)
Very good article. I have a question. Suppose you have a dimension that contains certain columns that are type 2, where you want to keep track of historical changes. However, the dimension also contains type 1 columns, where you just want to overwrite existing values without creating a new record with a different surrogate key... Can you handle this scenario within the same MERGE statement?


The short answer is no. The problem is that, in the MERGE construct, you can only have one WHEN MATCHED clause that performs an UPDATE. This method uses the WHEN MATCHED clause to mark the "current" record in your slowly changing dimension table as no longer current; then, in the OUTPUT clause, the new values for that record are sent to an INSERT statement outside the MERGE. We use this method for our data warehouse, but our ODS is on the same server.
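
For anyone who has not seen it, the shape of that pattern is roughly the following - the column list is trimmed to a couple of columns for brevity, and @Today / @Yesterday are the date variables used in the article:

DECLARE @Today     date = CAST(GETDATE() AS date),
        @Yesterday date = DATEADD(day, -1, CAST(GETDATE() AS date));

-- The outer INSERT adds the new "current" row for every client whose previous row
-- the MERGE has just expired; the MERGE's OUTPUT clause supplies the values.
INSERT INTO dbo.Client_SCD2 (BusinessKey, ClientName, ValidFrom, IsCurrent)
SELECT ID, ClientName, @Today, 1
FROM
(
    MERGE dbo.Client_SCD2 AS DST
    USING dbo.Client AS SRC
       ON SRC.ID = DST.BusinessKey
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (BusinessKey, ClientName, ValidFrom, IsCurrent)
        VALUES (SRC.ID, SRC.ClientName, @Today, 1)
    WHEN MATCHED AND DST.IsCurrent = 1 AND DST.ClientName <> SRC.ClientName THEN
        UPDATE SET DST.IsCurrent = 0, DST.ValidTo = @Yesterday
    OUTPUT $action AS MergeAction, SRC.ID, SRC.ClientName
) AS Changes
WHERE MergeAction = 'UPDATE';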

Snowflaking your dimensional model, as was suggested, may be tempting, but in essence what you are then doing is optimizing your model for the ETL instead of for the reporting, and it is likely that the extra join will hurt performance when the dimension is queried.
Post #1129088
Posted Tuesday, June 21, 2011 12:17 PM
Actually, you can do both. Put all the fields you wish to carry history on in the ON clause for matching. This tells the MERGE that you want a new record any time something does not match, which is exactly when there is new information in a field you want to keep history for. For update-only fields, put an UPDATE clause on the fields you wish to update in the WHEN MATCHED clause. For static fields, just leave them in the INSERT and don't make them part of the UPDATE.

It's easier if I just show you the template I use for my MERGE statements:

--JA: MERGE source with destination
MERGE dbo.DimCustomer Destination
USING --JA: source
(
    SELECT
        DC.[HistoryCol1]
        ,DC.[UpdateOnlyCol1]
        ,DC.[HistoryCol2]
        ,DC.[HistoryCol3]
        ,DC.[UpdateOnlyCol2]
        ,DC.[StaticCol1]
    FROM dbo.DimCustomerStage DC
)
AS [Source]
(
    [HistoryCol1]
    ,[UpdateOnlyCol1]
    ,[HistoryCol2]
    ,[HistoryCol3]
    ,[UpdateOnlyCol2]
    ,[StaticCol1]
)
ON (
    [Source].[HistoryCol1] = Destination.[HistoryCol1]
    AND [Source].[HistoryCol2] = Destination.[HistoryCol2]
    AND [Source].[HistoryCol3] = Destination.[HistoryCol3]
)
WHEN MATCHED --JA: to qualify for an update, the checksums of the two rows must differ (a change has occurred in the source) - a checksum is faster than checking every column individually
    AND BINARY_CHECKSUM
    (
        Destination.[UpdateOnlyCol1]
        ,Destination.[UpdateOnlyCol2]
    )
    != BINARY_CHECKSUM
    (
        [Source].[UpdateOnlyCol1]
        ,[Source].[UpdateOnlyCol2]
    )
THEN UPDATE SET --JA: put only the columns you wish to update here - fewer columns = better performance
    Destination.[UpdateOnlyCol1] = [Source].[UpdateOnlyCol1]
    ,Destination.[UpdateOnlyCol2] = [Source].[UpdateOnlyCol2]
WHEN NOT MATCHED BY TARGET THEN --JA: when no matching key can be found, insert
    INSERT
    (
        [HistoryCol1]
        ,[UpdateOnlyCol1]
        ,[HistoryCol2]
        ,[HistoryCol3]
        ,[UpdateOnlyCol2]
        ,[StaticCol1]
    )
    VALUES
    (
        [Source].[HistoryCol1]
        ,[Source].[UpdateOnlyCol1]
        ,[Source].[HistoryCol2]
        ,[Source].[HistoryCol3]
        ,[Source].[UpdateOnlyCol2]
        ,[Source].[StaticCol1]
    )
; --JA: MERGE must be terminated by a semicolon


Also, if you plan on cubing your warehouse: I've been told by Will Brown, a Microsoft expert in SSAS, that you get some performance benefit from snowflaking your warehouse, because the cubes you build from it don't have to process as many distinct values during cube processing. I guess it depends on how much of your querying will go through the cube versus directly against the warehouse.


-------------------------------------------------------------------------------------------------
My SQL Server Blog
Post #1129212
Posted Wednesday, June 22, 2011 2:16 PM
Thank you! Great article!

I have a question about SCD Type 2.

Can we compare all the columns in the ON clause, like this:

MERGE INTO dbo.Client_SCD2 AS DST
USING dbo.Client AS SRC
ON SRC.ID = DST.BusinessKey
AND DST.ClientName=SRC.ClientName
AND DST.Country = SRC.Country
AND DST.Town=SRC.Town
AND DST.Address1=SRC.Address1
AND DST.Address2=SRC.Address2
AND DST.ClientType=SRC.ClientType
AND DST.ClientSize=SRC.ClientSize

and then use
WHEN NOT MATCHED BY SOURCE THEN
UPDATE SET DST.IsCurrent = 0, DST.ValidTo = @Yesterday

rather than having this part of the code:
WHEN MATCHED 
AND IsCurrent = 1
AND (
ISNULL(DST.ClientName,'') <> ISNULL(SRC.ClientName,'')
OR ISNULL(DST.Country,'') <> ISNULL(SRC.Country,'')
OR ISNULL(DST.Town,'') <> ISNULL(SRC.Town,'')
OR ISNULL(DST.Address1,'') <> ISNULL(SRC.Address1,'')
OR ISNULL(DST.Address2,'') <> ISNULL(SRC.Address2,'')
OR ISNULL(DST.ClientType,'') <> ISNULL(SRC.ClientType,'')
OR ISNULL(DST.ClientSize,'') <> ISNULL(SRC.ClientSize,'')
)

Post #1130001
Posted Wednesday, June 22, 2011 2:29 PM
It probably won't work, because what if your keys don't match? They won't all be the same key, so you would end the time period for every row in your dimension and make every record inactive.

First you have to match on a key. At that point you can check whether the other fields differ.
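
In outline, that means matching on the key alone and doing the change test in the WHEN MATCHED clause, as in the article - shown here with only one tracked column, and using the article's @Today / @Yesterday date variables:

MERGE INTO dbo.Client_SCD2 AS DST
USING dbo.Client AS SRC
   ON SRC.ID = DST.BusinessKey                                    -- match on the business key only
WHEN MATCHED
   AND DST.IsCurrent = 1
   AND ISNULL(DST.ClientName,'') <> ISNULL(SRC.ClientName,'')     -- then test the tracked columns for changes
   THEN UPDATE SET DST.IsCurrent = 0, DST.ValidTo = @Yesterday
WHEN NOT MATCHED BY TARGET THEN
   INSERT (BusinessKey, ClientName, ValidFrom, IsCurrent)
   VALUES (SRC.ID, SRC.ClientName, @Today, 1);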


-------------------------------------------------------------------------------------------------
My SQL Server Blog
Post #1130005
Posted Wednesday, June 22, 2011 5:25 PM
OK, thank you. This is the whole piece of code I wanted to use:
MERGE INTO dbo.Client_SCD2 AS DST
USING dbo.Client AS SRC
ON SRC.ID = DST.BusinessKey
AND DST.ClientName=SRC.ClientName
AND DST.Country = SRC.Country
AND DST.Town=SRC.Town
AND DST.Address1=SRC.Address1
AND DST.Address2=SRC.Address2
AND DST.ClientType=SRC.ClientType
AND DST.ClientSize=SRC.ClientSize
WHEN NOT MATCHED BY TARGET THEN
INSERT (BusinessKey, ClientName, Country, Town, County, Address1, Address2, ClientType, ClientSize, ValidFrom, IsCurrent)
VALUES (SRC.ID, SRC.ClientName, SRC.Country, SRC.Town, SRC.County, SRC.Address1, SRC.Address2, SRC.ClientType, SRC.ClientSize, @Today, 1)
WHEN NOT MATCHED BY SOURCE THEN
UPDATE SET DST.IsCurrent = 0, DST.ValidTo = @Yesterday;

Also, I suppose it will have problems with NULL values. And this UPDATE will always go back over every record in the target table that was deactivated before.
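
One NULL-safe way to do the change comparison without wrapping every column in ISNULL is the INTERSECT trick - not from the article, just a common alternative. INTERSECT treats two NULLs as equal, so a row is flagged only when at least one tracked column genuinely differs:

-- Clients whose current dimension row differs from the source in at least one
-- tracked column, with NULL-to-NULL treated as "no change".
SELECT SRC.ID
FROM   dbo.Client AS SRC
JOIN   dbo.Client_SCD2 AS DST
  ON   DST.BusinessKey = SRC.ID
 AND   DST.IsCurrent = 1
WHERE  NOT EXISTS
(
    SELECT SRC.ClientName, SRC.Country, SRC.Town, SRC.Address1, SRC.Address2, SRC.ClientType, SRC.ClientSize
    INTERSECT
    SELECT DST.ClientName, DST.Country, DST.Town, DST.Address1, DST.Address2, DST.ClientType, DST.ClientSize
);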
Post #1130064
Posted Thursday, June 23, 2011 9:15 AM
I don't see a WHEN MATCHED update in your MERGE.

I'm not sure how to handle NULLs. My whole warehouse is NOT NULL.


-------------------------------------------------------------------------------------------------
My SQL Server Blog
Post #1130543