Hi all,

    I'm sorry for having to be the party pooper, but there is just too much wrong with this article to let it slip.

    many people don't know how to do joins in the UPDATE statement and they end up using correlated sub-queries, instead. Of course, that's a major form of RBAR that should be avoided

    A sub-query is not an automatic RBAR. That's the conceptual explanation of subqueries, but in reality, the optimizer will often internally convert the query to an equivalent join. It is clear that this did not happen in the example in your article, but please don't start thinking that all subqueries are bad.
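
A quick illustration (table and column names are hypothetical): depending on statistics, the optimizer can compile a correlated subquery like this into the very same join-based plan an explicit join would get.

UPDATE dbo.Orders
SET Region = (SELECT c.Region
              FROM dbo.Customers AS c
              WHERE c.CustomerID = dbo.Orders.CustomerID)
WHERE EXISTS (SELECT *
              FROM dbo.Customers AS c
              WHERE c.CustomerID = dbo.Orders.CustomerID);
-- The plan for this can be a single join between Orders and Customers;
-- no row-by-row processing in sight.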

    A lot of folks don't know that the UPDATE statement in SQL Server has a FROM clause, so they'll do it like folks do it in, say, Oracle. They'll use a correlated sub-query which is kind of a derivative of the "Direct Data" UPDATE

And there are other folks who don't know that what you call Oracle syntax is actually ANSI-standard syntax, and that the optional FROM clause in SQL Server is non-standard. In fact, even Access (also Microsoft) has a different syntax for the same thing! So beware - if you choose to use this, you can't port your code.
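
To make the contrast explicit (hypothetical names again): the first form below is ANSI-standard and ports to Oracle, DB2, and others; the second is SQL Server's proprietary extension.

-- ANSI-standard: works in SQL Server, Oracle, DB2, ...
UPDATE dbo.TargetTable
SET somecolumn = (SELECT s.somecolumn
                  FROM dbo.SourceTable AS s
                  WHERE s.keycol = dbo.TargetTable.keycol);

-- Proprietary FROM clause: SQL Server only.
UPDATE dbo.TargetTable
SET somecolumn = s.somecolumn
FROM dbo.TargetTable
INNER JOIN dbo.SourceTable AS s
ON s.keycol = dbo.TargetTable.keycol;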

What's worse, it is documented to produce nondeterministic results in some scenarios. So beware - if you choose to use this, test very carefully (and due to the nature of nondeterministic results combined with Murphy's Law, you'll probably find that it always goes right in test, but fails in production).
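
Here is a minimal repro of the nondeterminism, using hypothetical temp tables:

CREATE TABLE #Target (keycol int NOT NULL, somecolumn int NOT NULL);
CREATE TABLE #Source (keycol int NOT NULL, somecolumn int NOT NULL);
INSERT INTO #Target (keycol, somecolumn) VALUES (1, 0);
INSERT INTO #Source (keycol, somecolumn) VALUES (1, 10);
INSERT INTO #Source (keycol, somecolumn) VALUES (1, 20);

-- Two source rows match the single target row; which value "wins"
-- is not defined and may change when the execution plan changes.
UPDATE #Target
SET somecolumn = s.somecolumn
FROM #Target
INNER JOIN #Source AS s
ON s.keycol = #Target.keycol;

SELECT somecolumn FROM #Target;  -- 10 or 20 - take your pick...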

    I have blogged about how UPDATE FROM is no longer needed once we have MERGE (in SQL Server 2008, that is), and even opened a suggestion on Connect to deprecate UPDATE FROM. To my surprise, the Connect suggestion was not turned down, but instead, Microsoft confirmed that they are looking at the future of UPDATE FROM. So beware - if you choose to use this, you might find yourself forced to rewrite a few years from now.
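
For reference, the MERGE equivalent of a join-style UPDATE looks like this (hypothetical names once more). Note that MERGE, unlike UPDATE FROM, raises an error if more than one source row matches the same target row, so the nondeterminism problem disappears as well.

MERGE INTO dbo.TargetTable AS t
USING dbo.SourceTable AS s
ON s.keycol = t.keycol
WHEN MATCHED THEN
    UPDATE SET somecolumn = s.somecolumn;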

    Notice that "sometable" in the "UPDATE sometable" and the "FROM sometable st1" are the SAME table. Some other folks noticed that, too. So, they came up with this method... it's actually undocumented in BOL, but it works very well and seems to make it less confusing for some developers. Notice the use of the table alias instead of the table name in the UPDATE?

I was surprised to see that this second version is indeed not documented. I have used this and I know it works.

However, the first version you posted should be forbidden. Once you give "sometable" the alias "st1", you should no longer be able to refer to "sometable" without getting an error (and if you try this in a SELECT query, you'll see that it does indeed work this way). So it's actually quite weird that you can still refer to the table name in the UPDATE. BOL forbids the use of an alias in case of a self-join, but I think that aliases should always be forbidden for the target of the UPDATE - or the alias should be specified as the target (as in your undocumented second example). Well, I guess this kind of confusion on weird rules is just one more reason to deprecate this proprietary feature.
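
To see the inconsistency for yourself (the UPDATE below mimics the first version from the article; sometable2 is hypothetical):

-- In a SELECT, the alias hides the table name, as it should:
SELECT sometable.somecolumn
FROM sometable AS st1;
-- Fails: the multi-part identifier can no longer be bound.

-- Yet the equivalent reference in an UPDATE is happily accepted:
UPDATE sometable
SET somecolumn = st2.somecolumn
FROM sometable AS st1
INNER JOIN sometable2 AS st2
ON st2.keycol = st1.keycol;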

    The "Death by SQL" Update

    This one is just as undocumented as the one before - BOL says that you can use a FROM clause in an UPDATE statement, but there is no mention at all whether the target table should or should not be included in it. Yet another reason to deprecate... 😛

    However, I can assure you that it is completely equivalent to an UPDATE FROM with the target table in the FROM clause:

UPDATE sometable
SET somecolumn = st2.somecolumn
FROM sometable2 st2 --LOOK!!! FROM IS MISSING OBJECT OF THE UPDATE!!!
WHERE sometable.somecolumn = st2.somecolumn
AND some condition exists;

-- This is completely equivalent (check execution plans!) to
UPDATE sometable
SET somecolumn = st2.somecolumn
FROM sometable2 st2
INNER JOIN sometable
ON sometable.somecolumn = st2.somecolumn
WHERE sometable.somecolumn = st2.somecolumn
AND some condition exists;

    It looks so harmless that, like my DBA and I, you will overlook it as the problem for hours.

You obviously have been bitten by this at some point, but I suspect the real cause was something else. Maybe it was harder to find because of this (and I personally do indeed always include the target of the update in the FROM clause if I use UPDATE FROM at all), but it was not caused specifically by this syntax.

Okay, with the basics out of the way, let's go down to the actual code.

    we're first going to make a copy of the target table of the UPDATE. Here's the code to do that.

Since you used SELECT INTO, you are making a copy with no indexes at all - not even a primary key or a clustered index. I have added the statement below to my test runs (note that it didn't change the results, but it's still bad practice to use tables with no keys).

ALTER TABLE Sales.SPTest
ADD CONSTRAINT PK_SPTest PRIMARY KEY(SalesPersonID);

    The Microsoft Example Code

    I know it's not your fault (you just copied from BOL), but I found this example to be extremely weird.

    * First, SalesYTD should not be stored in the table. It's a derived attribute, so it should be in a view.

* Second, if there are indeed good reasons (performance?) to persist SalesYTD, a trigger should be used to keep it current - not a query that has to be run daily in order to keep the information correct. A minimal trigger sketch follows this list.

* Third, if both of the previous arguments are disregarded, the query presented here is not doing what it should do. Imagine having a great day with lots of sales, and then calling in sick the next day. Since you won't make any sales, the last OrderDate recorded for you will be yesterday, so those huge sale numbers get added into SalesYTD once more. Great way to get closer to your yearly goals!
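
To illustrate the second bullet: a minimal sketch of such a trigger, assuming the standard AdventureWorks tables. It covers inserts only; a production version would also need to handle UPDATE and DELETE.

CREATE TRIGGER trg_SalesYTD
ON Sales.SalesOrderHeader
AFTER INSERT
AS
BEGIN
    SET NOCOUNT ON;
    -- Add the SubTotal of every new order to its salesperson's SalesYTD.
    UPDATE Sales.SalesPerson
    SET SalesYTD = SalesYTD + (SELECT SUM(i.SubTotal)
                               FROM inserted AS i
                               WHERE i.SalesPersonID = Sales.SalesPerson.SalesPersonID)
    WHERE SalesPersonID IN (SELECT SalesPersonID FROM inserted);
END;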

    Okay, I'll skip over the part where you explain the reasoning, right to the finished code:

    ... and, we're done... notice that the code is a wee bit (a lot, actually) longer than the original code

    An important observation that you did not make is that the queries are not 100% equivalent. If a salesperson never made any sales, the BOL example will find no rows, calculate NULL as the result of the SUM() function in the subquery, add that to SalesYTD (resulting in NULL), and attempt to store that. This will fail, since SalesYTD is defined as NOT NULL. (The BOL example should have used COALESCE to prevent this).
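
For completeness, this is how the subquery version could be fixed. I'm assuming the query shape quoted from BOL, applied to the Sales.SPTest copy:

UPDATE Sales.SPTest
SET SalesYTD = SalesYTD +
    COALESCE((SELECT SUM(so.SubTotal)
              FROM Sales.SalesOrderHeader AS so
              WHERE so.SalesPersonID = Sales.SPTest.SalesPersonID
              AND so.OrderDate = (SELECT MAX(OrderDate)
                                  FROM Sales.SalesOrderHeader AS so2
                                  WHERE so2.SalesPersonID = so.SalesPersonID)), 0);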

Your query, on the other hand, has no rows for this salesperson in the derived table, so the join will remove that salesperson from the set of rows to be updated, and his or her SalesYTD will not change.

    You might think that this makes your query superior, and in this case it indeed is - but what if we attempted to update a column SalesLastDay? The subquery method (with COALESCE added) will nicely set SalesLastDay to 0 if a salesperson has no sales. Your method leaves it unchanged. So if I entered a sale incorrectly yesterday (giving a sale to an inactive salesperson) and correct it today, the SalesLastDay for the inactive salesperson will not be cleared.

There are situations where you want a row to be excluded from the UPDATE if there is no match. There are other situations where you don't want that. This behaviour difference is a common pitfall when people try to move between the two UPDATE syntax variations, and it's a shame that your article fails to point that out.
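
To make the pitfall concrete, assume a hypothetical SalesLastDay column on Sales.SPTest and a variable that marks the day to total:

DECLARE @Yesterday datetime;
SET @Yesterday = DATEADD(day, DATEDIFF(day, 0, GETDATE()) - 1, 0);

-- Subquery method: a salesperson with no sales gets 0, as intended.
UPDATE Sales.SPTest
SET SalesLastDay = COALESCE((SELECT SUM(so.SubTotal)
                             FROM Sales.SalesOrderHeader AS so
                             WHERE so.SalesPersonID = Sales.SPTest.SalesPersonID
                             AND so.OrderDate = @Yesterday), 0);

-- Join method: a salesperson with no sales is excluded from the UPDATE,
-- so a stale SalesLastDay value silently survives.
UPDATE Sales.SPTest
SET SalesLastDay = d.Total
FROM Sales.SPTest
INNER JOIN (SELECT SalesPersonID, SUM(SubTotal) AS Total
            FROM Sales.SalesOrderHeader
            WHERE OrderDate = @Yesterday
            AND SalesPersonID IS NOT NULL
            GROUP BY SalesPersonID) AS d
ON d.SalesPersonID = Sales.SPTest.SalesPersonID;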

    When I run that code, it only takes 15 milliseconds to update what used to take 313 milliseconds...

Well done. I got the same performance when I ran the test (we obviously have similar computing power under our desks :)). But to put things in perspective - this is NOT a result of replacing the subquery UPDATE with UPDATE FROM, as I can achieve the exact same performance with an ANSI-standard UPDATE with subquery by using a CTE:

PRINT REPLICATE('=',70);
PRINT 'Correlated sub query using CTE method...';
SET STATISTICS TIME ON;
WITH LastDay AS
   (SELECT SalesPersonID, MAX(OrderDate) AS MaxDate
    FROM Sales.SalesOrderHeader
    WHERE SalesPersonID IS NOT NULL
    GROUP BY SalesPersonID),
TotalLastDay AS
   (SELECT so.SalesPersonID, SUM(SubTotal) AS Total
    FROM Sales.SalesOrderHeader AS so
    INNER JOIN LastDay AS l
    ON l.SalesPersonID = so.SalesPersonID
    AND l.MaxDate = so.OrderDate
    GROUP BY so.SalesPersonID)
UPDATE Sales.SPTest
SET SalesYTD = SalesYTD +
   (SELECT Total
    FROM TotalLastDay AS t
    WHERE Sales.SPTest.SalesPersonID = t.SalesPersonID);
SET STATISTICS TIME OFF;
GO

    I think this code is similar enough to your code that I don't need to explain it. But do let me know if you don't understand it (I'll activate email notification for this thread).

    For the Microsoft example, it would take 0.313/17*1000000 milliseconds or 18.4 seconds to update a million rows. Not bad... until you realize that the performance enabled version does it in 0.015/17*1000000 milliseconds or only 0.882 seconds.

I currently can't justify spending the time to really test this, but I hope someone will. I think that, once you hit this number of rows, the optimizer will consider more possibilities, so there is a better chance that it creates a faster plan for the currently slower version. Remember that the optimizer is cost-based - so, for example, if a query will probably run in 300 milliseconds and the next step in the optimization process would increase the compilation time by 500 milliseconds, it won't bother.

We also saw how easy it was to rewrite one piece of code to get it to run 19 times faster than it originally did and we didn't touch the server or indexes.

Which is a shame, since I think that an index on (SalesPersonID, OrderDate) with SubTotal as included column would have sped things up a lot more. In fact, I just tested this and it brings execution time for "your" joined method down to 6 ms, "my" subquery with CTE to 7 ms, and BOL's original method down to a mere 2 ms. How's that, eh?
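
For reference, this is the index I created (the name is mine):

CREATE NONCLUSTERED INDEX IX_SalesOrderHeader_SalesPerson_OrderDate
ON Sales.SalesOrderHeader (SalesPersonID, OrderDate)
INCLUDE (SubTotal);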

And that's another lesson learned: you can spend a few hours rewriting badly performing code, or you can spend five minutes adding an index, and often the latter solution works best (but do take care not to add too many indexes, since they will slow down every data modification!)

    Please note that this does not imply that you should not rewrite bad code. I agree with the comment earlier in this thread that one should always try to write at least two completely different queries for the same task and compare performance. But I do want to point out that "Tune the Code... that's where the performance is!" tells only half of the story.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/