|
|
|
SSC Journeyman
      
Group: General Forum Members
Last Login: Today @ 2:10 AM
Points: 81,
Visits: 848
|
|
Hi DEK46656,
You are absolutely right that there are many other ways of comparing data - I prefer SHA1 hashes, if possible in computed columns, myself. Here, of course, the accent was on the more basic MERGE concept, but you are spot on to point out that this is only a starting point, and that there is such a lot that you can do with the basic concept to make it really efficient and useful.
Regards,
Adam
|
|
|
|
|
Valued Member
      
Group: General Forum Members
Last Login: Monday, November 12, 2012 4:54 PM
Points: 54,
Visits: 816
|
|
I concur what dawalker-1068762 has said. MERGE is a wonderful command but use carefully. If the source (SRC ) table is small and light it is feasable and nice to use. But if the table has millions of rows then you will hit a performance snag.
To get around this we are using MERGE along with a batch process. User very carefully! and design things in batches.
Wish MERGE had a parameter to do batch processing!
|
|
|
|
|
SSC Journeyman
      
Group: General Forum Members
Last Login: Thursday, September 20, 2012 5:21 PM
Points: 84,
Visits: 391
|
|
In my experience, you should always stage your data on the local server before doing any manipulation and / or merging. So for dawalker, extract new information in native form from the source server to your warehouse server before you transform and load.
I also use checksum to detect changes. Much faster!
My SQL Server Blog
|
|
|
|
|
SSC Rookie
      
Group: General Forum Members
Last Login: Yesterday @ 9:41 AM
Points: 34,
Visits: 339
|
|
Great article Adam.
Regarding cross server merging. Isn't it best to use a ETL tool for ETL tasks, making use of in-memory caching, instead of a linked server?
Regarding checksum comparisons how do you handle collisions. I.E. the "Birthday Paradox"? I've used 'md5' or 'sha*' in the past but never checksum because of the increased collision probability. Do you have code that resolves collisions?
|
|
|
|
|
SSCommitted
      
Group: General Forum Members
Last Login: Friday, May 17, 2013 7:19 AM
Points: 1,562,
Visits: 1,716
|
|
David Greenberg-358875 (6/20/2011) Very Good Article. I have a question. Suppose you have a dimension that contains certain columns that are type 2 where you want to keep track of historical changes. However, the dimension also contains type 1 columns where you just want to overwite existing values without creating a new record with a different surrogate key...Can you handle this scenario within the same merge statement?
The short answer is no. The problem is, in the MERGE construct, you can only have one WHEN MATCHED section. This method uses the WHEN MATCHED section to mark the "current" record in your slowly changing dimension table as no longer current, then in the OUTPUT clause the new values for this record are sent to the INSERT statement "outside" of the MERGE. We use this method for our data warehouse, but our ODS is on the same server.
Snowflaking your dimensional model as was suggested may be tempting, but in essence what you're doing then is optimizing your model for the ETL instead of the reporting, and it is likely that the extra join will hurt performance when it is queried.
|
|
|
|
|
SSC Journeyman
      
Group: General Forum Members
Last Login: Thursday, September 20, 2012 5:21 PM
Points: 84,
Visits: 391
|
|
Actually you can do both. Put all the fields you wish to carry history on in the ON clause for matching. This will tell the MERGE you want a new record any time there is something that doesn't match, which is when you have new information in a field you wish to carry history on. For update only fields put an update clause on the fields you wish to update in the when matched clause. For static fields, just leave them in the insert. Don't make them part of the update.
Easier if I just show you the template I use on my MERGE statements:
--JA: MERGE source with destination MERGE dbo.DimCustomer Destination USING --JA: source ( SELECT DC.[HistoryCol1] ,DC.[UpdateOnlyCol1] ,DC.[HistoryCol2] ,DC.[HistoryCol3] ,DC.[UpdateOnlyCol2] ,DC.[StaticCol1] FROM dbo.DimCustomerStage DC ) AS [Source] ( [HistoryCol1] ,[UpdateOnlyCol1] ,[HistoryCol2] ,[HistoryCol3] ,[UpdateOnlyCol2] ,[StaticCol1] ) ON ( [Source].[HistoryCol1] = Destination.[HistoryCol1] AND [Source].[HistoryCol2] = Destination.[HistoryCol2] AND [Source].[HistoryCol3] = Destination.[HistoryCol3] ) WHEN MATCHED --JA: here i specify that in order to qualify for update, the binary sum of the two rows must not be equal (a change has occured in source) - checksum is faster than checking every column individually AND BINARY_CHECKSUM ( Destination.[UpdateOnlyCol1] ,Destination.[UpdateOnlyCol2] ) != BINARY_CHECKSUM ( [Source].[UpdateOnlyCol1] ,[Source].[UpdateOnlyCol2] ) THEN UPDATE SET --JA: put only the columns you wish to update here - less = better performance Destination.[UpdateOnlyCol1] = [Source].[UpdateOnlyCol1] ,Destination.[UpdateOnlyCol2] = [Source].[UpdateOnlyCol2] WHEN NOT MATCHED BY TARGET THEN --JA: when no matching id can be found, insert INSERT ( [HistoryCol1] ,[UpdateOnlyCol1] ,[HistoryCol2] ,[HistoryCol3] ,[UpdateOnlyCol2] ,[StaticCol1] ) VALUES ( [Source].[HistoryCol1] ,[Source].[UpdateOnlyCol1] ,[Source].[HistoryCol2] ,[Source].[HistoryCol3] ,[Source].[UpdateOnlyCol2] ,[Source].[StaticCol1] ) ; --JA: MERGE must be terminated by a semicolon
Also if you plan on cubing your warehouse I've been told by Will Brown, a Microsoft expert in SSAS, you get some performance benefit from snowflaking your warehouse because then the cubes you build from it don't have to process as many distincts during cube processing. I guess it depends on how much of your querying will be via your cube versus directly from your warehouse.
My SQL Server Blog
|
|
|
|
|
Forum Newbie
      
Group: General Forum Members
Last Login: Friday, July 22, 2011 8:30 AM
Points: 2,
Visits: 29
|
|
Thank you! Great article!
I have question about SCD type 2.
Can we compare all columns in ON clause like this:
MERGE INTO dbo.Client_SCD2 AS DST USING dbo.Client AS SRC ON SRC.ID = DST.BusinessKey AND DST.ClientName=SRC.ClientName AND DST.Country = SRC.Country AND DST.Town=SRC.Town AND DST.Address1=SRC.Address1 AND DST.Address2=SRC.Address2 AND DST.ClientType=SRC.ClientType AND DST.ClientSize=SRC.ClientSize
and then use
WHEN NOT MATCHED BY SOURCE THEN UPDATE SET DST.IsCurrent = 0, DST.ValidTo = @Yesterday rather than having this part of code:
WHEN MATCHED AND IsCurrent = 1 AND ( ISNULL(DST.ClientName,'') <> ISNULL(SRC.ClientName,'') OR ISNULL(DST.Country,'') <> ISNULL(SRC.Country,'') OR ISNULL(DST.Town,'') <> ISNULL(SRC.Town,'') OR ISNULL(DST.Address1,'') <> ISNULL(SRC.Address1,'') OR ISNULL(DST.Address2,'') <> ISNULL(SRC.Address2,'') OR ISNULL(DST.ClientType,'') <> ISNULL(SRC.ClientType,'') OR ISNULL(DST.ClientSize,'') <> ISNULL(SRC.ClientSize,'') )
|
|
|
|
|
SSC Journeyman
      
Group: General Forum Members
Last Login: Thursday, September 20, 2012 5:21 PM
Points: 84,
Visits: 391
|
|
Probably won't work because what if your key's don't match? I would imagine they won't all be the same key, so you'll then end the time period for every row in your dimension and make all records not active?
First you have to have a match on a key. At that point you can check to see if the other fields do not match.
My SQL Server Blog
|
|
|
|
|
Forum Newbie
      
Group: General Forum Members
Last Login: Friday, July 22, 2011 8:30 AM
Points: 2,
Visits: 29
|
|
OK.Thank you. This is the whole code i wanted to use:
MERGE INTO dbo.Client_SCD2 AS DST USING dbo.Client AS SRC ON SRC.ID = DST.BusinessKey AND DST.ClientName=SRC.ClientName AND DST.Country = SRC.Country AND DST.Town=SRC.Town AND DST.Address1=SRC.Address1 AND DST.Address2=SRC.Address2 AND DST.ClientType=SRC.ClientType AND DST.ClientSize=SRC.ClientSize WHEN NOT MATCHED BY TARGET THEN INSERT (BusinessKey, ClientName, Country, Town, County, Address1, Address2, ClientType, ClientSize, ValidFrom, IsCurrent) VALUES (SRC.ID, SRC.ClientName, SRC.Country, SRC.Town, SRC.County, Address1, Address2, ClientType, ClientSize, @Today, 1) WHEN NOT MATCHED BY SOURCE THEN UPDATE SET DST.IsCurrent = 0, DST.ValidTo = @Yesterday; Also i suppose it will have problems with NULL values. And this UPDATE will always go through all records in target table that were deactivated before.
|
|
|
|
|
SSC Journeyman
      
Group: General Forum Members
Last Login: Thursday, September 20, 2012 5:21 PM
Points: 84,
Visits: 391
|
|
I don't see an update statement in your MERGE.
Not sure about how to handle NULLS. My whole warehouse is NOT NULL.
My SQL Server Blog
|
|
|
|