More RBAR and "Tuning" UPDATEs

  • Comments posted to this topic are about the item More RBAR and "Tuning" UPDATEs

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • what about this:

    update table1
    set table1.column1 = #table.column1, table1.column2 = #table.column2
    from table1, #table
    where table1.pk = #table.pk

  • The links to UPDATE are to the UPDATE() function, not the UPDATE statement syntax, which is 1 line above in the index. 🙂

  • Very good article.

    I just think RBAR would catch on more, if it were pronounced AR-BAR. (Sorry 😉 )

  • Indeed, 70% of our tuning time is actually SQL education.

    If our devs would only apply the basic rules and also try to write their code in at least two different ways, they would soon discover where the performance is to be found.

    Once again, thank you for sharing the knowledge 😎

    Johan

    Learn to play, play to learn !

    Don't drive faster than your guardian angel can fly ...
    but keeping both feet on the ground won't get you anywhere :w00t:

    - How to post Performance Problems
    - How to post data/code to get the best help

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • That is simply one of the best articles on RBAR I have ever seen!

    Excellent!! 🙂

  • Hi all,

    I'm sorry for having to be the party pooper, but there is just too much wrong with this article to let it slip.

    many people don't know how to do joins in the UPDATE statement and they end up using correlated sub-queries, instead. Of course, that's a major form of RBAR that should be avoided

    A sub-query is not automatically RBAR. That's the conceptual explanation of subqueries, but in reality the optimizer will often internally convert the query to an equivalent join. It is clear that this did not happen in the example in your article, but please don't start thinking that all subqueries are bad.
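
    To make that concrete, here is a minimal sketch (the dbo.Customers / dbo.Orders tables and the LastOrderDate column are hypothetical, not from the article). The first statement is written as a correlated sub-query, the second as a join against a pre-aggregated derived table; on many data sets the optimizer produces essentially the same plan for both. Note that they are not identical when a customer has no orders (the sub-query form sets the column to NULL, the join form skips the row), a difference that comes up again later in this post.

    -- Correlated sub-query form: conceptually "per row", but not necessarily executed that way
    UPDATE dbo.Customers
    SET    LastOrderDate = (SELECT MAX(o.OrderDate)
                            FROM   dbo.Orders AS o
                            WHERE  o.CustomerID = dbo.Customers.CustomerID);

    -- Join form: pre-aggregate once, then join to the target
    UPDATE c
    SET    LastOrderDate = agg.MaxDate
    FROM   dbo.Customers AS c
           INNER JOIN (SELECT CustomerID, MAX(OrderDate) AS MaxDate
                       FROM   dbo.Orders
                       GROUP BY CustomerID) AS agg
             ON agg.CustomerID = c.CustomerID;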

    A lot of folks don't know that the UPDATE statement in SQL Server has a FROM clause, so they'll do it like folks do it in, say, Oracle. They'll use a correlated sub-query which is kind of a derivative of the "Direct Data" UPDATE

    And there are other folks who don't know that what you call Oracle syntax is actually ANSI-standard syntax, and that the optional FROM clause in SQL Server is non-standard. In fact, even Access (also Microsoft) has a different syntax for the same thing! So beware - if you choose to use this, you can't port your code.
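
    For readers who haven't seen the two shapes side by side, here is a rough sketch (the table and column names are made up for illustration). The first form uses only standard SQL and should port to most DBMSs; the second relies on SQL Server's proprietary FROM clause:

    -- ANSI-standard form: correlated sub-query, restricted to rows that have a match
    UPDATE TargetTable
    SET    SomeColumn = (SELECT s.SomeColumn
                         FROM   SourceTable AS s
                         WHERE  s.KeyColumn = TargetTable.KeyColumn)
    WHERE  EXISTS (SELECT 1
                   FROM   SourceTable AS s
                   WHERE  s.KeyColumn = TargetTable.KeyColumn);

    -- SQL Server-only form: the FROM clause with a join is a proprietary extension
    UPDATE t
    SET    SomeColumn = s.SomeColumn
    FROM   TargetTable AS t
           INNER JOIN SourceTable AS s
             ON s.KeyColumn = t.KeyColumn;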

    What's worse, it is documented to produce nondeterministic results in some scenarios. So beware - if you choose to use this, test very carefully (and due to the nature of nondeterministic results combined with Murphy's Law, you'll probably find that it always goes right in test, but fails in production).
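
    The classic way to hit this is a one-to-many join (again, the names here are just an illustration): when more than one source row matches a target row, the target row is updated with an arbitrary one of the matching values, and no error or warning is raised.

    -- If dbo.PriceChanges holds several rows per ProductID, which Price "wins" is undefined
    UPDATE p
    SET    p.CurrentPrice = pc.Price
    FROM   dbo.Products AS p
           INNER JOIN dbo.PriceChanges AS pc
             ON pc.ProductID = p.ProductID;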

    I have blogged about how UPDATE FROM is no longer needed once we have MERGE (in SQL Server 2008, that is), and even opened a suggestion on Connect to deprecate UPDATE FROM. To my surprise, the Connect suggestion was not turned down, but instead, Microsoft confirmed that they are looking at the future of UPDATE FROM. So beware - if you choose to use this, you might find yourself forced to rewrite a few years from now.
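
    For reference, a MERGE-based rewrite of the same kind of join update (SQL Server 2008 syntax; the names continue the hypothetical example above) looks roughly like this. One nice property is that MERGE raises an error if more than one source row matches a target row, instead of silently picking one:

    MERGE INTO dbo.Products AS p
    USING dbo.PriceChanges AS pc
       ON pc.ProductID = p.ProductID
    WHEN MATCHED THEN
        UPDATE SET p.CurrentPrice = pc.Price;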

    Notice that "sometable" in the "UPDATE sometable" and the "FROM sometable st1" are the SAME table. Some other folks noticed that, too. So, they came up with this method... it's actually undocumented in BOL, but it works very well and seems to make it less confusing for some developers. Notice the use of the table alias instead of the table name in the UPDATE?

    I was surprised to see that this second version is indeed not documented. I have used it and I know it works.

    However, the first version you posted should be forbidden. Once you give "sometable" the alias "st1", you should no longer be able to refer to "sometable" without getting an error (and if you try this in a SELECT query, you'll see that it does indeed work this way). So it's actually quite weird that you can still refer to the table name in the UPDATE. BOL forbids the use of an alias in case of a self-join, but I think that aliases should always be forbidden for the target of the UPDATE - or the alias should be specified as the target (as in your undocumented second example). Well, I guess this kind of confusion on weird rules is just one more reason to deprecate this proprietary feature.
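
    For anyone reading along without the article open, the two variants being discussed look roughly like this (table and column names are placeholders, not the article's exact code):

    -- Variant 1 (the one criticized above): the target is named by table name,
    -- even though the FROM clause gives that same table the alias st1
    UPDATE sometable
    SET    somecolumn = st2.somecolumn
    FROM   sometable AS st1
           INNER JOIN sometable2 AS st2
             ON st2.keycolumn = st1.keycolumn;

    -- Variant 2 (the "undocumented" second version): the alias itself is the target
    UPDATE st1
    SET    somecolumn = st2.somecolumn
    FROM   sometable AS st1
           INNER JOIN sometable2 AS st2
             ON st2.keycolumn = st1.keycolumn;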

    The "Death by SQL" Update

    This one is just as undocumented as the one before - BOL says that you can use a FROM clause in an UPDATE statement, but there is no mention at all whether the target table should or should not be included in it. Yet another reason to deprecate... 😛

    However, I can assure you that it is completely equivalent to an UPDATE FROM with the target table in the FROM clause:

    UPDATE sometable
    SET somecolumn = st2.somecolumn
    FROM sometable2 st2 --LOOK!!! FROM IS MISSING THE OBJECT OF THE UPDATE!!!
    WHERE sometable.somecolumn = st2.somecolumn
    AND some condition exists;

    -- This is completely equivalent (check execution plans!) to

    UPDATE sometable
    SET somecolumn = st2.somecolumn
    FROM sometable2 st2
    INNER JOIN sometable
    ON sometable.somecolumn = st2.somecolumn
    WHERE sometable.somecolumn = st2.somecolumn
    AND some condition exists;

    It looks so harmless that, like my DBA and I, you will overlook it as the problem for hours.

    You obviously have been bitten by this one day, but I suspect the real cause was something else. Maybe it was harder to find because of this (and I personally do indeed always include the target of the update in the FROM clause if I use UPDATE FROM at all), but it was not caused specifically by this syntax.

    Okay, with the basics out of the way, let's get down to the actual code.

    we're first going to make a copy of the target table of the UPDATE. Here's the code to do that.

    Since you used SELECT INTO, you are making a copy with no indexes at all. Not even a primary key or a clustered index. I have added the below to my test runs (note that it didn't change the results, but it's still bad practice to use tables with no keys).

    ALTER TABLE Sales.SPTest
    ADD CONSTRAINT PK_SPTest PRIMARY KEY (SalesPersonID);

    The Microsoft Example Code

    I know it's not your fault (you just copied from BOL), but I found this example to be extremely weird.

    * First, SalesYTD should not be stored in the table. It's a derived attribute, so it should be in a view (see the sketch after this list).

    * Second, if there are indeed good reasons (performance?) to persist SalesYTD, a trigger should be used to keep it current. Not a query that has to be run daily in order to keep the information correct.

    * Third, if both of the previous arguments are disregarded, the query presented here is not doing what it should do. Imagine having a great day with lots of sales, and then calling in sick the next day. Since you won't make any sales, the last OrderDate recorded for you will be yesterday, so those huge sale numbers get added into SalesYTD once more. Great way to get closer to your yearly goals!
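
    As a minimal sketch of the first point (assuming the AdventureWorks Sales.SalesPerson and Sales.SalesOrderHeader tables; the view name and definition are mine, not BOL's), the derived value could simply be computed on demand:

    CREATE VIEW Sales.vSalesPersonYTD
    AS
    SELECT sp.SalesPersonID,
           COALESCE(SUM(soh.SubTotal), 0) AS SalesYTD
    FROM   Sales.SalesPerson AS sp
           LEFT JOIN Sales.SalesOrderHeader AS soh
             ON  soh.SalesPersonID = sp.SalesPersonID
             AND soh.OrderDate >= DATEADD(yy, DATEDIFF(yy, 0, GETDATE()), 0)  -- start of the current year
    GROUP BY sp.SalesPersonID;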

    Okay, I'll skip over the part where you explain the reasoning, right to the finished code:

    ... and, we're done... notice that the code is a wee bit (a lot, actually) longer than the original code

    An important observation that you did not make is that the queries are not 100% equivalent. If a salesperson never made any sales, the BOL example will find no rows, calculate NULL as the result of the SUM() function in the subquery, add that to SalesYTD (resulting in NULL), and attempt to store that. This will fail, since SalesYTD is defined as NOT NULL. (The BOL example should have used COALESCE to prevent this).

    Your query, on the other hand, has no rows for this salesperson in the derived table, so the join will remove that salesperson from the set of rows to be updated, and his or her SalesYTD will not change.

    You might think that this makes your query superior, and in this case it indeed is - but what if we attempted to update a column SalesLastDay? The subquery method (with COALESCE added) will nicely set SalesLastDay to 0 if a salesperson has no sales. Your method leaves it unchanged. So if I entered a sale incorrectly yesterday (giving a sale to an inactive salesperson) and correct it today, the SalesLastDay for the inactive salesperson will not be cleared.

    There are situations where you want a row to be excluded from the UPDATE if there is no match. There are other situations where you don't want that. This behaviour difference is a common pitfall when people try to move between the two UPDATE syntax variations, and it's a shame that your article fails to point that out.
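
    To spell out the SalesLastDay scenario (a hypothetical NOT NULL column, not part of the article's code), the two shapes would behave like this:

    DECLARE @DayStart datetime, @DayEnd datetime;
    SELECT  @DayEnd   = DATEADD(dd, DATEDIFF(dd, 0, GETDATE()), 0),  -- today, 00:00
            @DayStart = DATEADD(dd, -1, @DayEnd);                    -- yesterday, 00:00

    -- Sub-query form: a salesperson with no sales yesterday gets SalesLastDay reset to 0
    UPDATE Sales.SPTest
    SET    SalesLastDay = COALESCE(
              (SELECT SUM(so.SubTotal)
               FROM   Sales.SalesOrderHeader AS so
               WHERE  so.SalesPersonID = Sales.SPTest.SalesPersonID
                 AND  so.OrderDate >= @DayStart
                 AND  so.OrderDate <  @DayEnd), 0);

    -- Join form: that salesperson is simply absent from the derived table, so the row is
    -- excluded from the UPDATE and SalesLastDay keeps whatever value it had before
    UPDATE spt
    SET    SalesLastDay = d.Total
    FROM   Sales.SPTest AS spt
           INNER JOIN (SELECT SalesPersonID, SUM(SubTotal) AS Total
                       FROM   Sales.SalesOrderHeader
                       WHERE  OrderDate >= @DayStart
                         AND  OrderDate <  @DayEnd
                       GROUP BY SalesPersonID) AS d
             ON d.SalesPersonID = spt.SalesPersonID;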

    When I run that code, it only takes 15 milliseconds to update what used to take 313 milliseconds...

    Well done. I got the same performance when I ran the test (we obviously have similar computing power under our desks :)). But to put things in perspective - this is NOT a result of replacing the subquery-based UPDATE with UPDATE FROM, as I can achieve the exact same performance with an ANSI-standard UPDATE with subquery by using a CTE:

    PRINT REPLICATE('=',70);
    PRINT 'Correlated sub query using CTE method...';
    SET STATISTICS TIME ON;

    WITH LastDay AS
        (SELECT SalesPersonID, MAX(OrderDate) AS MaxDate
         FROM Sales.SalesOrderHeader
         WHERE SalesPersonID IS NOT NULL
         GROUP BY SalesPersonID),
    TotalLastDay AS
        (SELECT so.SalesPersonID, SUM(SubTotal) AS Total
         FROM Sales.SalesOrderHeader AS so
         INNER JOIN LastDay AS l
            ON l.SalesPersonID = so.SalesPersonID
           AND l.MaxDate = so.OrderDate
         GROUP BY so.SalesPersonID)
    UPDATE Sales.SPTest
    SET SalesYTD = SalesYTD +
        (SELECT Total
         FROM TotalLastDay AS t
         WHERE Sales.SPTest.SalesPersonID = t.SalesPersonID);

    SET STATISTICS TIME OFF;
    GO

    I think this code is similar enough to your code that I don't need to explain it. But do let me know if you don't understand it (I'll activate email notification for this thread).

    For the Microsoft example, it would take 0.313/17*1000000 milliseconds or 18.4 seconds to update a million rows. Not bad... until you realize that the performance enabled version does it in 0.015/17*1000000 milliseconds or only 0.882 seconds.

    I currently can't justify spending the time to really test this, but I hope someone will. I think that, once you hit this amount of rows, the optimizer will consider more possibilities so there is a better chance that it creates a faster plan for the currently slower version. Remember that the optimizer is cost-based - so for example, if a query will probably run in 300 milliseconds and the next step in the optimization process will increase the compilation time by 500 milliseconds, it won't bother.

    We also saw how easy it was to rewrite one piece of code to get it to run 19 times faster than it originally did and we didn't touch the server or indexes.

    Which is a shame, since I think that an index on (SalesPersonID, OrderDate) with SubTotal as an included column would have sped things up a lot more. In fact, I just tested this and it brings execution time for "your" joined method down to 6 ms, "my" subquery with CTE to 7 ms, and BOL's original method down to a mere 2 ms. How's that, eh?
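
    The index being described would look something like this (the index name is just an illustration; INCLUDE requires SQL Server 2005 or later):

    CREATE NONCLUSTERED INDEX IX_SalesOrderHeader_SalesPersonID_OrderDate
    ON Sales.SalesOrderHeader (SalesPersonID, OrderDate)
    INCLUDE (SubTotal);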

    And that's another lesson learned: you can spend a few hours rewriting bad performing code, or you can spend five minutes adding an index, and often, the latter solution works best (but do take care not to add too many indexes, since they will slow any data modification!)

    Please note that this does not imply that you should not rewrite bad code. I agree with the comment earlier in this thread that one should always try to write at least two completely different queries for the same task and compare performance. But I do want to point out that "Tune the Code... that's where the performance is!" tells only half of the story.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Excellent! Both the article (thanks Jeff) and the comments (thanks Hugo.)

    This is the beauty of public forums; none of us know more than all of us!

  • The timing of this article was perfect for me.

    I'm in the process of evaluating some stored procedures that are poor performers (which is causing a host of other problems) and I had narrowed it down to some update queries that were architecturally identical to the 'death by SQL' example (intuitively, they just didn't look right). This article reinforces what I believed and shows me why.

    Like you say, it is data dependent, because in my test environment (small data set), I couldn't see any performance difference when comparing to the same queries rewritten with the 'target' table in the 'FROM' (which is how I would have done it if I had written it). The execution plans didn't reveal much, if any, difference. In Production, the data set is huge and the queries involve transactions and linked servers, which are probably compounding the problem (from the standpoint of execution time).

    Like I said, this is not my code, so I have to 'prove' to the owner that there is a better way to write the queries (a sensitive issue) so as to improve the performance, and now I have some reinforcement to do that.

    Once I get this code changed, I'll report back with before and after results (but don't hold your breath... it will take a while where I work).

    Thanks Jeff!

    If it was easy, everybody would be doing it!;)

  • I always know that I'm going to learn something when Jeff posts an article. It looks like I'm going to have to look closer when I am coding in a WHERE IN.

  • :P:P Excellent article Jeff, I have a question for you though. Is there a FROM clause for the DELETE statement as well and how can I code that?:w00t:

    :-PManie Verster
    Developer
    Johannesburg
    South Africa

    I can do all things through Christ who strengthens me. - Holy Bible
    I am a man of fixed and unbending principles, the first of which is to be flexible at all times. - Everett Mckinley Dirkson (Well, I am trying. - Manie Verster)

  • Hugo Kornelis (3/14/2008)


    And there are other folks who don't know that what you call Oracle syntax is actually ANSI-standard syntax, and that the optional FROM clause in SQL Server is non-standard. In fact, even Access (also Microsoft) has a different syntax for the same thing! So beware - if you choose to use this, you can't port your code.

    What's worse, it is documented to produce nondeterministic results in some scenarios. So beware - if you choose to use this, test very carefully (and due to the nature of nondeterministic results combined with Murphy's Law, you'll probably find that it always goes right in test, but fails in production).

    Well done. I got the same performance when I ran the test (we obviously have similar computing power under our desks :)). But to put things in perspective - this is NOT a result of replacing the subquery-based UPDATE with UPDATE FROM, as I can achieve the exact same performance with an ANSI-standard UPDATE with subquery by using a CTE:

    PRINT REPLICATE('=',70);
    PRINT 'Correlated sub query using CTE method...';
    SET STATISTICS TIME ON;

    WITH LastDay AS
        (SELECT SalesPersonID, MAX(OrderDate) AS MaxDate
         FROM Sales.SalesOrderHeader
         WHERE SalesPersonID IS NOT NULL
         GROUP BY SalesPersonID),
    TotalLastDay AS
        (SELECT so.SalesPersonID, SUM(SubTotal) AS Total
         FROM Sales.SalesOrderHeader AS so
         INNER JOIN LastDay AS l
            ON l.SalesPersonID = so.SalesPersonID
           AND l.MaxDate = so.OrderDate
         GROUP BY so.SalesPersonID)
    UPDATE Sales.SPTest
    SET SalesYTD = SalesYTD +
        (SELECT Total
         FROM TotalLastDay AS t
         WHERE Sales.SPTest.SalesPersonID = t.SalesPersonID);

    SET STATISTICS TIME OFF;
    GO

    I think this code is similar enough to your code that I don't need to explain it. But do let me know if you don't understand it (I'll activate email notification for this thread).

    We also saw how easy it was to rewrite one piece of code to get it to run 19 times faster than it originally did and we didn't touch the server or indexes.

    Which is a shame, since I think that an index on (SalesPersonID, OrderDate) with SubTotal as an included column would have sped things up a lot more. In fact, I just tested this and it brings execution time for "your" joined method down to 6 ms, "my" subquery with CTE to 7 ms, and BOL's original method down to a mere 2 ms. How's that, eh?

    And that's another lesson learned: you can spend a few hours rewriting bad performing code, or you can spend five minutes adding an index, and often, the latter solution works best (but do take care not to add too many indexes, since they will slow any data modification!)

    Hugo,

    I'm not going to disagree with you on any of the points you make, especially since you obviously know more than I do about the subject. I also appreciate the fact that you not only raised the points, but provided alternate solutions instead of just ripping the article like I have seen many times.

    I am going to make a couple of points though. The UPDATE FROM may be SQL Server specific, but from my perspective that's fine since I am on a SQL Server site and am looking for help with SQL Server. The fact that it could be deprecated is not, in my opinion, a reason to refrain from using it; any SQL Server specific feature could be deprecated at some point. Could you post where it is documented to produce indeterminate results? I'd be interested in reading that.

    Your CTE example is excellent and may be the best way to solve the problem in SQL 2005, but there are an awful lot of SQL 2000 installs out there still, and the CTE won't work there. One thing Jeff usually does is post a solution that will work in 2000, although he is getting more and more into 2005 solutions now. Of course, your earlier points about indeterminate results in the UPDATE FROM certainly need to be considered.

    You mention adding an index and the performance gained from it. You do mention, in a parenthetical, being careful not to add too many indexes, but I think you could have more clearly stated that you need to really study the effects of adding an index, and that will take a lot more than 5 minutes. There are many beginners who are on this site regularly who may read that and start adding indexes whenever they encounter performance issues with a single query, and then start having problems elsewhere.

    I also wanted to comment on this statement:

    An important observation that you did not make is that the queries are not 100% equivalent. If a salesperson never made any sales, the BOL example will find no rows, calculate NULL as the result of the SUM() function in the subquery, add that to SalesYTD (resulting in NULL), and attempt to store that. This will fail, since SalesYTD is defined as NOT NULL. (The BOL example should have used COALESCE to prevent this).

    Your query, on the other hand, has no rows for this salesperson in the derived table, so the join will remove that salesperson from the set of rows to be updated, and his or her SalesYTD will not change.

    You might think that this makes your query superior, and in this case it indeed is - but what if we attempted to update a column SalesLastDay? The subquery method (with COALESCE added) will nicely set SalesLastDay to 0 if a salesperson has no sales. Your method leaves it unchanged. So if I entered a sale incorrectly yesterday (giving a sale to an inactive salesperson) and correct it today, the SalesLastDay for the inactive salesperson will not be cleared.

    There are situations where you want a row to be excluded from the UPDATE if there is no match. There are other situations where you don't want that. This behaviour difference is a common pitfall when people try to move between the two UPDATE syntax variations, and it's a shame that your article fails to point that out.

    I think you made a great point here and I missed it as well. It is very important to take this into account when doing any kind of aggregation based on dates, products, etc., where there is a chance of missing data.

    I have to admit that when I first started working with SQL Server I was more likely to use correlated subqueries, but in the last couple of years I have moved to derived tables (not necessarily because of performance), and now to CTE's which I like best, mainly for readability.

    Thanks to Jeff and Hugo for sharing their expertise and making me think before I code!

  • Manie Verster (3/14/2008)


    :P:P Excellent article Jeff, I have a question for you though. Is there a FROM clause for the DELETE statement as well and how can I code that?:w00t:

    Manie - there IS a FROM in the Delete statement. Just like update, you can only delete from one table at a time, but that doesn't mean you can't use a join to arrive at the data set to delete.

    Syntax looks like this:

    DELETE MyTable --this is the table where records are being deleted
    FROM MyTable
    INNER JOIN MyTable2 ON MyTable.id = MyTable2.ID
    WHERE
    .....

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • RE CTEs. I love them and use them, but I had gotten the impression that they, too, were an MS nonstandard extension. Are they part of the SQL standard, and are they available in Oracle (same syntax)?

    And there are other folks who don't know that what you call Oracle syntax is actually ANSI-standard syntax, and that the optional FROM clause in SQL Server is non-standard. In fact, even Access (also Microsoft) has a different syntax for the same thing! So beware - if you choose to use this, you can't port your code.

    Interesting read. I have to ask, though - I have yet to see any combination of DBMSs where a straight port of code is simply a matter of cut and paste. Is that really a valid concern?

    Yes, it does work some of the time, but once you get into "real-life code", every one of the majors seems to have its own intricacies, so you're going to HAVE to re-write some things. Even using "standard code", just because it runs doesn't mean it runs well.

    I keep hearing that "what if your RDBMS changed tomorrow" line being thrown around, and I just have to take a skeptical approach to that. I've seen ports from one to the other, but they always involve a LOT of re-writing and re-optimizing. If all of them did the SAME thing, using the same code, same optimization tricks, etc., then there would be no reason for them to be separate products. The cheapest one to license would be the one that takes over, and the others would just fall by the wayside.

    Until they act the same, having the same lines of code is, well, worthless IMO. My first priority is efficient and correct, not portable.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

