Stairway to T-SQL DML Level 12: Using the MERGE Statement

  • Comments posted to this topic are about the item Stairway to T-SQL DML Level 12: Using the MERGE Statement

    Gregory A. Larsen, MVP

  • An excellent primer on the MERGE statement, thank you.

    Please can I point out what I believe is simply an inadvertent typo on your part? Regarding Listing 4, you state:

    "This clause tells SQL Server whenever it finds a record in the “Source” table that is not contained in the target table that it needs to perform a DELETE operation."

    This should be the other way around, i.e., when a record is found in the Target table that is not found in the Source.

    Great article, thanks again

    You are correct. A correction is on its way.

    Greg

    Gregory A. Larsen, MVP

  • Hi Greg,

    I would like to extend my sincere thanks for such beautiful and elaborate articles on SQL Server.

    I have had a question hovering in my mind for the past few months, ever since I read about MERGE: can we use multiple conditions in the ON clause? For example:

    MERGE dbo.Sales AS T                  -- Target
    USING dbo.NewSalesNAdjustments AS S   -- Source
       ON T.Id = S.Id
      AND T.SalesAmount = S.SalesAmount   -- Can we use this condition too in the ON clause???
                                          -- I did try to test this but I wasn't getting correct results.
    WHEN MATCHED THEN                     -- Update
        UPDATE SET T.SalesAmount = S.SalesAmount
    WHEN NOT MATCHED THEN                 -- Insert
        INSERT (Id, SalesAmount) VALUES (S.Id, S.SalesAmount);

    Please advise!

    You most definitely can use AND/OR conditions in the ON clause 🙂

    I have been using them forever...

    You can also use an AND in the WHEN MATCHED clause, like so:

    MERGE T
    USING S
       ON t.col = s.col
    WHEN MATCHED AND <otherTcol = otherScol condition> THEN
        UPDATE blah blah blah

    But use it very cautiously: that part of MERGE does not always work as desired/expected.
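    To make that concrete, here is a minimal sketch using the tables from the question above (reusing those names purely for illustration): keep the ON clause to the key columns and move the extra comparison into WHEN MATCHED AND. If the SalesAmount comparison goes into the ON clause instead, a row with a matching Id but a different amount counts as "not matched", so the INSERT branch fires for rows that already exist, producing duplicates (or a key violation if Id is unique), which would explain the incorrect results mentioned in the question.

    -- Sketch: join on the key only; apply the non-key comparison in the
    -- WHEN MATCHED clause so unchanged rows are simply skipped.
    MERGE dbo.Sales AS T
    USING dbo.NewSalesNAdjustments AS S
       ON T.Id = S.Id                                    -- key columns only
    WHEN MATCHED AND T.SalesAmount <> S.SalesAmount THEN -- extra condition here
        UPDATE SET T.SalesAmount = S.SalesAmount
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (Id, SalesAmount) VALUES (S.Id, S.SalesAmount);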

    Curious to hear from others on whether they use the MERGE command in production and what its performance is like.

    We began using it in our unit test scripts to build data for developers immediately following the database build (we're using SSDT 2012). We wanted to use MERGE so the scripts could run even if developers already had data in their database.

    So, rather than having an insert script of 10,000 records, we use the MERGE command with the 10,000 rows supplied through a VALUES clause. This is EXTREMELY slow!

    It takes 30 minutes to populate a small database (5GB) vs 2-3 minutes using TRUNCATE/INSERT.
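    For context, the pattern in our scripts looks roughly like this (table and column names hypothetical, and trimmed to two rows; the real scripts carry 10,000):

    merge dbo.Sales as T
    using (values (1, 100.00),
                  (2, 250.00)          -- ...and so on, up to 10,000 rows
          ) as S (Id, SalesAmount)
       on T.Id = S.Id
    when matched then
        update set T.SalesAmount = S.SalesAmount
    when not matched by target then
        insert (Id, SalesAmount) values (S.Id, S.SalesAmount);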

    Paul

    Excellent article indeed. However, like most articles on MERGE, you do not mention the fact that MERGE doesn't have a WHERE clause. That is, if you want to use only part of a table as the merge target, because your source data set represents only a subset of the data in the entire table, then without precautions the MERGE statement will delete all rows that are not in your subset.

    As an example scenario: you've just received one supplier's list of products and want to merge this into your dbo.products table, which contains the list of products for all of your suppliers. The fully implemented MERGE statement will correctly insert all missing products for your supplier, it will correctly update all products for which the supplier has changed information, and it will even remove all products from your table that are no longer in the supplier's product list. BUT it will also delete all products from your table that come from other suppliers. Most likely not what you intended! A WHERE clause would have been very welcome here.

    There are two ways around this:

    1 - extend the source table for the merge so that it includes all existing product entries for all but the selected supplier, plus the new list of products for the selected supplier, or

    2 - apply some filtering on the target table.

    Obviously method 1 will result in an unnecessarily large number of rows being processed by the merge statement. It can be implemented using, for example, a UNION statement, but I won't demonstrate this here. Method 2 does not at first sight seem possible, because MERGE doesn't have a WHERE clause, but it can still be done by using a common table expression as the target. Here's an example:

    use tempdb
    go

    create table dbo.suppliers (
        supplier_id int not null,
        supplier_name varchar(200) not null,
        constraint pk_suppliers primary key clustered (supplier_id)
    );

    create table dbo.supplierproducts (
        supplier_id int not null,
        product_code varchar(20) not null,
        product_name varchar(200) not null,
        constraint pk_products primary key clustered (supplier_id, product_code),
        constraint fk_products_supplier foreign key (supplier_id)
            references dbo.suppliers (supplier_id)
            on delete cascade on update cascade not for replication
    );

    insert dbo.suppliers (supplier_id, supplier_name)
    values (1, 'supplier one'),
           (2, 'supplier two'),
           (3, 'supplier three');
    go

    insert dbo.supplierproducts (supplier_id, product_code, product_name)
    values (1, 'apple', 'red apples'),
           (1, 'pear', 'red pears'),
           (1, 'rose', 'red roses'),
           (2, 'apple', 'white apples'),
           (2, 'pear', 'white pears'),
           (3, 'apple', 'blue apples'),
           (3, 'pear', 'blue pears');
    go

    -- Show the content of the supplierproducts table before the merge.
    select * from dbo.supplierproducts;
    go

    -- Here's the new product list supplier 1 has just sent us.
    -- Notice that this list only includes supplier 1's products,
    -- not the products of any other suppliers.
    declare @updatedproductlist table (
        product_code varchar(20) not null,
        product_name varchar(200) not null,
        primary key (product_code)
    );

    insert @updatedproductlist (product_code, product_name)
    values ('apple', 'red apples'),
           ('pear', 'green pears'),
           ('kiwi', 'green kiwis');

    declare @supplier_id int;
    select @supplier_id = 1;

    with cteTarget as (
        select sp.supplier_id,
               sp.product_code,
               sp.product_name
        from dbo.supplierproducts sp
        where sp.supplier_id = @supplier_id
    )
    merge into cteTarget trg
    using @updatedproductlist src
       on (src.product_code = trg.product_code)
    when not matched by target then
        insert (supplier_id, product_code, product_name)
        values (@supplier_id, src.product_code, src.product_name)
    when matched and src.product_name <> trg.product_name then
        update set product_name = src.product_name
    when not matched by source then
        delete
    output $action, inserted.*, deleted.*;

    -- Show the content of the supplierproducts table after the merge.
    -- Note that the products of suppliers 2 and 3 are unchanged.
    select * from dbo.supplierproducts;



    Posting Data Etiquette - Jeff Moden
    Posting Performance Based Questions - Gail Shaw
    Hidden RBAR - Jeff Moden
    Cross Tabs and Pivots - Jeff Moden
    Catch-all queries - Gail Shaw


    If you don't have time to do it right, when will you have time to do it over?

  • paul.barbin (1/16/2013)


    Curious to hear from others on whether they use the MERGE command in production and what its performance is like. [...] It takes 30 minutes to populate a small database (5GB) vs 2-3 minutes using TRUNCATE/INSERT.

    Have you checked out MS's article (link) for MERGE optimization?

    The biggest takeaway:

    To improve the performance of the MERGE statement, we recommend the following index guidelines:

    Create an index on the join columns in the source table that is unique and covering.

    Create a unique clustered index on the join columns in the target table.

    I've noticed slow performance in MERGE, but never when following these guidelines. Following them sometimes forces me into staging tables, but the MERGE benefits have outweighed the downsides there.
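    In T-SQL, those two guidelines look something like the following sketch, reusing the hypothetical Sales tables from earlier in this thread:

    -- Target: a unique clustered index on the join column(s).
    create unique clustered index ucx_Sales_Id
        on dbo.Sales (Id);

    -- Source: a unique index on the join column(s) that also covers the
    -- columns the MERGE reads, so the join never touches the base table.
    create unique nonclustered index ux_NewSalesNAdjustments_Id
        on dbo.NewSalesNAdjustments (Id)
        include (SalesAmount);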

  • R.P.Rozema (1/16/2013)


    Excellent article indeed. However, like most articles on MERGE, you do not mention the fact that MERGE doesn't have a WHERE clause. That is, if you want to use only part of a table as the merge target, because your source data set represents only a subset of the data in the entire table, then without precautions the MERGE statement will delete all rows that are not in your subset.

    This is easily preventable by extending the "WHEN NOT MATCHED BY SOURCE" clause, i.e.:

    WHEN NOT MATCHED BY SOURCE AND T.TypeID = 1

  • In reply to paul.barbin

    I have had exactly the opposite experience, and I like the syntax of it.

    We used MERGE, EXCEPT and INTERSECT in a few ad-hoc queries and sproc amendments with excellent results. The last time we used MERGE, the operation performed nearly 50 million inserts in a matter of minutes.

  • Great article - we've used MERGE for some time and it reduces the CRUD burden considerably.

    However, it should not be used if you are using transactional replication. We've found (and there are others if you Google) that replication does not recognise the MERGE statement correctly, and will issue a DELETE/INSERT pair instead of an UPDATE to the subscribers. This can then collide with any foreign key constraints you may have set up, breaking the replication. Unfortunately this has meant we've had to remove the use of MERGE from a lot of our routines.

  • paul.barbin (1/16/2013)


    Curious to hear from others on whether they use the MERGE command in production and what its performance is like. [...] It takes 30 minutes to populate a small database (5GB) vs 2-3 minutes using TRUNCATE/INSERT.

    I use Merge in production all over the place.

    For example, I have a series of databases that power websites. Thousands of hits per day. The data for them gets created/updated/deleted in a CRM, then has to be loaded into the databases for the sites. Pretty complex business rules on the load processes, and the servers are thousands of miles apart from each other.

    The prior DBA had it all set up as Truncate then Insert. Back then, we'd have "blank" periods a couple of times per day, and employees got used to (a) fielding questions about that from customers, and (b) making changes and not having them show up until (sometimes) late the next day.

    I modified it all to use Merge, and set it to run hourly. No "blank periods", and minimal waits for data to go live. Everyone loves it.

    However, since Merge has to lock the tables that are being reloaded, and usually needs a full table-lock because of the scope of the action, it would have failed if I didn't combine it with Read Committed Snapshot Isolation. That allows queries from the sites to keep running and not be blocked by the large Merge actions.

    Tested without that, and it was bad. (Was in a test environment, so no impact on anyone. But if Merge had gone live without that testing, it would have been worse than the prior "the tables are temporarily empty" issues.)

    So, it can be pretty useful, but test against expected concurrent load.
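    For anyone wanting to try the same combination: enabling RCSI is a one-time database setting. A sketch, with a hypothetical database name:

    -- Hypothetical database name. ROLLBACK IMMEDIATE kicks out in-flight
    -- transactions so the setting can be applied without waiting.
    alter database SalesWebDB
        set read_committed_snapshot on
        with rollback immediate;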

    I also tried more traditional two-step update/insert approaches, but Merge performed better than those in my case. I've seen it take longer, however, so, again, test against the more traditional methods.

    Edit: Truncate+Insert will be faster. It doesn't require any join math between the source and destination. That alone will make it faster. If speed is the only concern, then truncate+insert will almost certainly be faster. However, that option also means you can't have FKs referencing the table, and all the other issues that go along with truncate. But if speed is really all that matters, use it.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

    @Brian, that delete/insert action in transactional replication is a known bug (or at least it used to be). We had that problem many moons ago, and we weren't even using the MERGE command. MS said that if the key field of an index is updated, transactional replication views that as a Delete/Insert pair. We had to have special stored procs on the subscriber to handle the error for us. When the delete proc was called, we checked whether the delete would fail due to FK constraints; if so, we determined that it was a "false" delete and ignored the delete command. The following command would be the insert, and we'd check whether the record already existed; if so, we'd just update it with the new values. That was way back in 2000 and 2005. No idea if that's corrected in the newer versions.

    @GSquared, thanks. I was just curious; I haven't used MERGE outside of this one case. In this particular case, speed isn't crucial, but the time difference is SO great that it starts to become an issue. What we wanted was the ability to run the script regardless of the state of the db, and MERGE gives that ability where "kill and fill" doesn't. Good to know that, when used in other applications, it is as fast or faster.

    Thanks.

  • There are some very solid, and pretty intuitive, ways to speed up Merge.

    For my cross-server ETL, I set up a bulk-copy database that just copies the source tables from the other server. Very fast, because it's minimally logged and doesn't incur DTC locks. Then I Merge from that. But that kind of "stage first, load second" isn't specific to Merge; I'd have done that no matter what my final choice was for loading the target tables, since it applies equally to kill'n'fill, update/insert (2-step), or anything else.

    Another way to really speed up Merge is to make sure the ON clause columns from source to target are properly indexed, and (preferably) have index-enforced uniqueness on the source. For complex source queries, it can be faster to load a temp table, index that, and then use Merge with that as the source, as sketched below.
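    Here's a sketch of that temp-table pattern, with hypothetical names throughout:

    -- Materialize a complex source query into a temp table...
    select s.Id, s.SalesAmount
    into #src
    from dbo.NewSalesNAdjustments s;   -- stand-in for the complex source query

    -- ...index it on the ON column(s), enforcing uniqueness...
    create unique clustered index ucx_src on #src (Id);

    -- ...then Merge from the indexed temp table.
    merge dbo.Sales as T
    using #src as S
       on T.Id = S.Id
    when matched and T.SalesAmount <> S.SalesAmount then
        update set T.SalesAmount = S.SalesAmount
    when not matched by target then
        insert (Id, SalesAmount) values (S.Id, S.SalesAmount);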

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

    @Ryan.Polk: Yes, you can do that, but it is not the same. First, as you just demonstrated, it is easy to forget that the same condition needs to be added to both the "when not matched by source" and the "when matched" clauses. On top of that, if you compare the query plans and the I/O statistics for both options, you will see that my approach with the CTE performs better than your alternative; i.e., the CTE approach is easier on the eye and it performs better.

    statistics io for the CTE approach:

    Table 'suppliers'. Scan count 0, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'supplierproducts'. Scan count 1, logical reads 6, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table '#4C8B54C9'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    statistics io for Ryan.Polk's method:

    Table 'suppliers'. Scan count 0, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'supplierproducts'. Scan count 2, logical reads 13, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table '#4C8B54C9'. Scan count 2, logical reads 17, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
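    For reference, the fully filtered version of that alternative (the "Ryan.Polk method" measured above) would look something like this sketch against my sample tables, with the supplier condition repeated in both the "when matched" and "when not matched by source" clauses:

    merge dbo.supplierproducts as trg
    using @updatedproductlist as src
       on src.product_code = trg.product_code
    when not matched by target then
        insert (supplier_id, product_code, product_name)
        values (@supplier_id, src.product_code, src.product_name)
    when matched and trg.supplier_id = @supplier_id
                 and src.product_name <> trg.product_name then
        update set product_name = src.product_name
    when not matched by source and trg.supplier_id = @supplier_id then
        delete;

    Note too that matching here is evaluated against the whole table: a source product_code that exists for any supplier counts as matched, so it would not be inserted for the selected supplier. That is another subtle way in which this variant is not the same as the CTE approach.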



    Posting Data Etiquette - Jeff Moden
    Posting Performance Based Questions - Gail Shaw
    Hidden RBAR - Jeff Moden
    Cross Tabs and Pivots - Jeff Moden
    Catch-all queries - Gail Shaw


    If you don't have time to do it right, when will you have time to do it over?
