Stairway to T-SQL DML Level 12: Using the MERGE Statement

  • Comments posted to this topic are about the item Stairway to T-SQL DML Level 12: Using the MERGE Statement

    Gregory A. Larsen, MVP

  • An excellent primer on the MERGE statement, thank you.

    Please can I point out what I believe is simply an inadvertent typo on your part? Regarding Listing 4, you state:

    "This clause tells SQL Server whenever it finds a record in the “Source” table that is not contained in the target table that it needs to perform a DELETE operation."

    This should be the other way around, i.e., when a record is found in the Target table that is not found in the Source.

    Great article, thanks again

    You are correct. A correction is on its way.

    Greg

    Gregory A. Larsen, MVP

  • Hi Greg,

    I would like to extend my sincere thanks for such beautiful and elaborate articles on SQL Server.

    I have had a question hovering in my mind for the past few months, ever since I read about MERGE: can we use multiple conditions in the ON clause? For example:

    MERGE dbo.Sales AS T                  -- Target
    USING dbo.NewSalesNAdjustments AS S   -- Source
       ON T.Id = S.Id
      AND T.SalesAmount = S.SalesAmount   -- Can we use this condition too in the ON clause???
                                          -- I did try to test this but I wasn't getting correct results.
    WHEN MATCHED THEN                     -- Update
        UPDATE SET T.SalesAmount = S.SalesAmount
    WHEN NOT MATCHED THEN                 -- Insert
        INSERT (Id, SalesAmount) VALUES (S.Id, S.SalesAmount);

    Please advise!

    You most definitely can use AND/OR conditions in the ON clause 🙂

    I have been using them forever...

    You can also use an AND in the WHEN MATCHED clause, like so:

    MERGE T
    USING S
       ON t.col = s.col
    WHEN MATCHED AND <otherTcol = otherScol condition> THEN
        UPDATE blah blah blah

    But use it very cautiously: that part of MERGE does not always work as desired/expected.
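    To make that concrete, here is a minimal sketch using the tables from the question above (reusing those names purely for illustration): keep the ON clause to the key columns and move the extra comparison into WHEN MATCHED AND. If the SalesAmount comparison goes into the ON clause instead, a row with a matching Id but a different amount counts as "not matched", so the INSERT branch fires for rows that already exist, producing duplicates (or a key violation if Id is unique), which would explain the incorrect results mentioned in the question.

    -- Sketch: join on the key only; apply the non-key comparison in the
    -- WHEN MATCHED clause so unchanged rows are simply skipped.
    MERGE dbo.Sales AS T
    USING dbo.NewSalesNAdjustments AS S
       ON T.Id = S.Id                                    -- key columns only
    WHEN MATCHED AND T.SalesAmount <> S.SalesAmount THEN -- extra condition here
        UPDATE SET T.SalesAmount = S.SalesAmount
    WHEN NOT MATCHED BY TARGET THEN
        INSERT (Id, SalesAmount) VALUES (S.Id, S.SalesAmount);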

    Curious to hear from others on whether they use the MERGE command in production and what its performance is like.

    We began using it in our unit test scripts to build data for developers immediately following the database build (we're using SSDT 2012). We wanted to use MERGE so the scripts could run even if developers already had data in their database.

    So, rather than having an insert script of 10,000 records, we use the MERGE command with the 10,000 rows supplied through a VALUES clause. This is EXTREMELY slow!

    It takes 30 minutes to populate a small database (5GB) vs 2-3 minutes using TRUNCATE/INSERT.
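    For context, the pattern in our scripts looks roughly like this (table and column names hypothetical, and trimmed to two rows; the real scripts carry 10,000):

    merge dbo.Sales as T
    using (values (1, 100.00),
                  (2, 250.00)          -- ...and so on, up to 10,000 rows
          ) as S (Id, SalesAmount)
       on T.Id = S.Id
    when matched then
        update set T.SalesAmount = S.SalesAmount
    when not matched by target then
        insert (Id, SalesAmount) values (S.Id, S.SalesAmount);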

    Paul

    Excellent article indeed. However, like most articles on MERGE, you do not mention the fact that MERGE doesn't have a WHERE clause. That is, if you want to use only part of a table as the merge target, because your source data set represents only a subset of the data in the entire table, then without precautions the MERGE statement will delete all rows that are not in your subset.

    As an example scenario: you've just received one supplier's list of products and want to merge this into your dbo.products table, which contains the list of products for all of your suppliers. The fully implemented MERGE statement will correctly insert all missing products for your supplier, it will correctly update all products for which the supplier has changed information, and it will even remove all products from your table that are no longer in the supplier's product list. BUT it will also delete all products from your table that come from other suppliers. Most likely not what you intended! A WHERE clause would have been very welcome here.

    There are two ways around this:

    1 - extend the source table for the merge so that it includes all existing product entries for all but the selected supplier, plus the new list of products for the selected supplier, or

    2 - apply some filtering on the target table.

    Obviously method 1 will result in an unnecessarily large number of rows being processed by the merge statement. It can be implemented using, for example, a UNION statement, but I won't demonstrate this here. Method 2 does not at first sight seem possible, because MERGE doesn't have a WHERE clause, but it can still be done by using a common table expression as the target. Here's an example:

    use tempdb
    go

    create table dbo.suppliers (
        supplier_id int not null,
        supplier_name varchar(200) not null,
        constraint pk_suppliers primary key clustered (supplier_id)
    );

    create table dbo.supplierproducts (
        supplier_id int not null,
        product_code varchar(20) not null,
        product_name varchar(200) not null,
        constraint pk_products primary key clustered (supplier_id, product_code),
        constraint fk_products_supplier foreign key (supplier_id)
            references dbo.suppliers (supplier_id)
            on delete cascade on update cascade not for replication
    );

    insert dbo.suppliers (supplier_id, supplier_name)
    values (1, 'supplier one'),
           (2, 'supplier two'),
           (3, 'supplier three');
    go

    insert dbo.supplierproducts (supplier_id, product_code, product_name)
    values (1, 'apple', 'red apples'),
           (1, 'pear', 'red pears'),
           (1, 'rose', 'red roses'),
           (2, 'apple', 'white apples'),
           (2, 'pear', 'white pears'),
           (3, 'apple', 'blue apples'),
           (3, 'pear', 'blue pears');
    go

    -- Show the content of the supplierproducts table before the merge.
    select * from dbo.supplierproducts;
    go

    -- Here's the new product list supplier 1 has just sent us.
    -- Notice that this list only includes supplier 1's products,
    -- not the products of any other suppliers.
    declare @updatedproductlist table (
        product_code varchar(20) not null,
        product_name varchar(200) not null,
        primary key (product_code)
    );

    insert @updatedproductlist (product_code, product_name)
    values ('apple', 'red apples'),
           ('pear', 'green pears'),
           ('kiwi', 'green kiwis');

    declare @supplier_id int;
    select @supplier_id = 1;

    with cteTarget as (
        select sp.supplier_id,
               sp.product_code,
               sp.product_name
        from dbo.supplierproducts sp
        where sp.supplier_id = @supplier_id
    )
    merge into cteTarget trg
    using @updatedproductlist src
       on (src.product_code = trg.product_code)
    when not matched by target then
        insert (supplier_id, product_code, product_name)
        values (@supplier_id, src.product_code, src.product_name)
    when matched and src.product_name <> trg.product_name then
        update set product_name = src.product_name
    when not matched by source then
        delete
    output $action, inserted.*, deleted.*;

    -- Show the content of the supplierproducts table after the merge.
    -- Note that the products of suppliers 2 and 3 are unchanged.
    select * from dbo.supplierproducts;



    Posting Data Etiquette - Jeff Moden
    Posting Performance Based Questions - Gail Shaw
    Hidden RBAR - Jeff Moden
    Cross Tabs and Pivots - Jeff Moden
    Catch-all queries - Gail Shaw


    If you don't have time to do it right, when will you have time to do it over?

  • paul.barbin (1/16/2013)


    Curious to hear from others on whether they use the MERGE command in production and what its performance is like. [...] It takes 30 minutes to populate a small database (5GB) vs 2-3 minutes using TRUNCATE/INSERT.

    Have you checked out MS's article (link) for MERGE optimization?

    The biggest takeaway:

    To improve the performance of the MERGE statement, we recommend the following index guidelines:

    Create an index on the join columns in the source table that is unique and covering.

    Create a unique clustered index on the join columns in the target table.

    I've noticed slow performance in MERGE, but never when following these guidelines. Following them sometimes forces me into staging tables, but the MERGE benefits have outweighed the downsides there.
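    In T-SQL, those two guidelines look something like the following sketch, reusing the hypothetical Sales tables from earlier in this thread:

    -- Target: a unique clustered index on the join column(s).
    create unique clustered index ucx_Sales_Id
        on dbo.Sales (Id);

    -- Source: a unique index on the join column(s) that also covers the
    -- columns the MERGE reads, so the join never touches the base table.
    create unique nonclustered index ux_NewSalesNAdjustments_Id
        on dbo.NewSalesNAdjustments (Id)
        include (SalesAmount);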

  • R.P.Rozema (1/16/2013)


    Excellent article indeed. However, like most articles on MERGE, you do not mention the fact that MERGE doesn't have a WHERE clause. That is, if you want to use only part of a table as the merge target, because your source data set represents only a subset of the data in the entire table, then without precautions the MERGE statement will delete all rows that are not in your subset.

    This is easily preventable by extending the "WHEN NOT MATCHED BY SOURCE" clause, i.e.:

    WHEN NOT MATCHED BY SOURCE AND T.TypeID = 1

  • In reply to paul.barbin

    I have had exactly the opposite experience, and I like the syntax of it.

    We used MERGE, EXCEPT and INTERSECT in a few ad-hoc queries and sproc amendments with excellent results. The last time we used MERGE, the operation performed nearly 50 million inserts in a matter of minutes.

  • Great article - we've used MERGE for some time and it reduces the CRUD burden considerably.

    However, it should not be used if you are using transactional replication. We've found (and there are others if you Google) that replication does not recognise the MERGE statement correctly, and will issue a DELETE/INSERT pair instead of an UPDATE to the subscribers. This can then collide with any foreign key constraints you may have set up, breaking the replication. Unfortunately this has meant we've had to remove the use of MERGE from a lot of our routines.

  • paul.barbin (1/16/2013)


    Curious to hear from others on whether they use the MERGE command in production and what its performance is like. [...] It takes 30 minutes to populate a small database (5GB) vs 2-3 minutes using TRUNCATE/INSERT.

    I use Merge in production all over the place.

    For example, I have a series of databases that power websites. Thousands of hits per day. The data for them gets created/updated/deleted in a CRM, then has to be loaded into the databases for the sites. Pretty complex business rules on the load processes, and the servers are thousands of miles apart from each other.

    The prior DBA had it all set up as Truncate then Insert. Back then, we'd have "blank" periods a couple of times per day, and employees got used to (a) fielding questions about that from customers, and (b) making changes and not having them show up until (sometimes) late the next day.

    I modified it all to use Merge, and set it to run hourly. No "blank periods", and minimal waits for data to go live. Everyone loves it.

    However, since Merge has to lock the tables that are being reloaded, and usually needs a full table-lock because of the scope of the action, it would have failed if I didn't combine it with Read Committed Snapshot Isolation. That allows queries from the sites to keep running and not be blocked by the large Merge actions.

    Tested without that, and it was bad. (Was in a test environment, so no impact on anyone. But if Merge had gone live without that testing, it would have been worse than the prior "the tables are temporarily empty" issues.)

    So, it can be pretty useful, but test against expected concurrent load.
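    For anyone wanting to try the same combination: enabling RCSI is a one-time database setting. A sketch, with a hypothetical database name:

    -- Hypothetical database name. ROLLBACK IMMEDIATE kicks out in-flight
    -- transactions so the setting can be applied without waiting.
    alter database SalesWebDB
        set read_committed_snapshot on
        with rollback immediate;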

    I also tried more traditional two-step update/insert approaches, but Merge performed better than those in my case. I've seen it take longer, however, so, again, test against the more traditional methods.

    Edit: Truncate+Insert will be faster. It doesn't require any join math between the source and destination. That alone will make it faster. If speed is the only concern, then truncate+insert will almost certainly be faster. However, that option also means you can't have FKs referencing the table, and all the other issues that go along with truncate. But if speed is really all that matters, use it.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

    @Brian, that delete/insert action in transactional replication is a known bug (or at least it used to be). We had that problem many moons ago, and we weren't even using the MERGE command. MS said that if the key field of an index is updated, transactional replication views that as a Delete/Insert pair. We had to have special stored procs on the subscriber to handle the error for us. When the delete proc was called, we checked whether the delete would fail due to FK constraints; if so, we determined that it was a "false" delete and ignored the delete command. The following command would be the insert, and we'd check whether the record already existed; if so, we'd just update it with the new values. That was way back in 2000 and 2005. No idea if that's corrected in the newer versions.

    @GSquared, thanks. I was just curious; I haven't used MERGE outside of this one case. In this particular case, speed isn't crucial, but the time difference is SO great that it starts to become an issue. What we wanted was the ability to run the script regardless of the state of the db, and MERGE gives that ability where "kill and fill" doesn't. Good to know that, when used in other applications, it is as fast or faster.

    Thanks.

  • There are some very solid, and pretty intuitive, ways to speed up Merge.

    For my cross-server ETL, I set up a bulk-copy database that just copies the source tables from the other server. Very fast, because it's minimally logged and doesn't incur DTC locks. Then I Merge from that. But that kind of "stage first, load second" isn't specific to Merge; I'd have done that no matter what my final choice was for loading the target tables, since it applies equally to kill'n'fill, update/insert (2-step), or anything else.

    Another way to really speed up Merge is to make sure the ON clause columns from source to target are properly indexed, and (preferably) have index-enforced uniqueness on the source. For complex source queries, it can be faster to load a temp table, index that, and then use Merge with that as the source, as sketched below.
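    Here's a sketch of that temp-table pattern, with hypothetical names throughout:

    -- Materialize a complex source query into a temp table...
    select s.Id, s.SalesAmount
    into #src
    from dbo.NewSalesNAdjustments s;   -- stand-in for the complex source query

    -- ...index it on the ON column(s), enforcing uniqueness...
    create unique clustered index ucx_src on #src (Id);

    -- ...then Merge from the indexed temp table.
    merge dbo.Sales as T
    using #src as S
       on T.Id = S.Id
    when matched and T.SalesAmount <> S.SalesAmount then
        update set T.SalesAmount = S.SalesAmount
    when not matched by target then
        insert (Id, SalesAmount) values (S.Id, S.SalesAmount);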

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

    @Ryan.Polk: Yes, you can do that, but it is not the same. First, as you just demonstrated, it is easy to forget that the same condition needs to be added to both the "when not matched by source" and the "when matched" clauses. On top of that, if you compare the query plans and the I/O statistics for both options, you will see that my approach with the CTE performs better than your alternative; i.e., the CTE approach is easier on the eye and it performs better.

    statistics io for the CTE approach:

    Table 'suppliers'. Scan count 0, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'supplierproducts'. Scan count 1, logical reads 6, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table '#4C8B54C9'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    statistics io for Ryan.Polk's method:

    Table 'suppliers'. Scan count 0, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'supplierproducts'. Scan count 2, logical reads 13, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table '#4C8B54C9'. Scan count 2, logical reads 17, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
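    For reference, the fully filtered version of that alternative (the "Ryan.Polk method" measured above) would look something like this sketch against my sample tables, with the supplier condition repeated in both the "when matched" and "when not matched by source" clauses:

    merge dbo.supplierproducts as trg
    using @updatedproductlist as src
       on src.product_code = trg.product_code
    when not matched by target then
        insert (supplier_id, product_code, product_name)
        values (@supplier_id, src.product_code, src.product_name)
    when matched and trg.supplier_id = @supplier_id
                 and src.product_name <> trg.product_name then
        update set product_name = src.product_name
    when not matched by source and trg.supplier_id = @supplier_id then
        delete;

    Note too that matching here is evaluated against the whole table: a source product_code that exists for any supplier counts as matched, so it would not be inserted for the selected supplier. That is another subtle way in which this variant is not the same as the CTE approach.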



    Posting Data Etiquette - Jeff Moden
    Posting Performance Based Questions - Gail Shaw
    Hidden RBAR - Jeff Moden
    Cross Tabs and Pivots - Jeff Moden
    Catch-all queries - Gail Shaw


    If you don't have time to do it right, when will you have time to do it over?
