• Hugo Kornelis (3/14/2008)


    And there are other folks who don't know that what you call Oracle syntax is actually ANSI-standard syntax, and that the optional FROM clause in SQL Server is non-standard. In fact, even Access (also microsoft) has a different syntax for the same thing! So beware - if you choose to use this, you can't port your code.

    What's worse, it is document to produce nondeterministic results in some scenarios. So beware - if you choose to use this, test very carefully (and due to the nature of nondeterministic results combined with Murphy's Law, you'll probably find that it always goes right in test, but fails in production)

    Well done. I got the same performance when I ran the test (we obviously have similar computing power under our desks :)). But to put things in perspective - this is NOT a result of replacing UPDATE with a subquery with UDPATE FROM - as I can achieve the eexact same performance with an ANSI-standard UPDATE with subquery by using a CTE

    PRINT REPLICATE('=',70);

    PRINT 'Correlated sub query using CTE method...';

    SET STATISTICS TIME ON;

    WITH LastDay AS

    (SELECT SalesPersonID, MAX(OrderDate) AS MaxDate

    FROM Sales.SalesOrderHeader

    WHERE SalesPersonID IS NOT NULL

    GROUP BY SalesPersonID),

    TotalLastDay AS

    (SELECT so.SalesPersonID, SUM(SubTotal) AS Total

    FROM Sales.SalesOrderHeader AS so

    INNER JOIN LastDay AS l

    ON l.SalesPersonID = so.SalesPersonID

    AND l.MaxDate = so.OrderDate

    GROUP BY so.SalesPersonID)

    UPDATE Sales.SPTest

    SET SalesYTD = SalesYTD +

    (SELECT Total

    FROM TotalLastDay AS t

    WHERE Sales.SPTest.SalesPersonID = t.SalesPersonID);

    SET STATISTICS TIME OFF;

    GO

    I think this code is similar enough to your code that I don't need to explain it. But do let me know if you don't understand it (I'll activate email notification for this thread).

    We also saw how easy it was to rewrite one piece of code to get it to run 19 times faster that it originally did and we didn't touch the server or indexes.

    Which is a shame, since I think that an index on (SalesPersonID, OrderDate) with SubTotal as included column would have speeded up things a lot more. In fact, I just tested this and this brings execution time for "your" joined method down to 6 ms, "my" subquery with CTE to 7 ms, and BOL's original method down to a mere 2 ms. How's that, eh?

    And that's another lesson learned: you can spend a few hours rewriting bad performing code, or you can spend five minutes adding an index, and often, the latter solution works best (but do take care not to add too many indexes, since they will slow any data modification!)

    Hugo,

    I'm not going to disagree with you on any of the points you make, especially since you obviously know more than I do about the subject. I also appreciate the fact that you not only raised the points, but provided alternate solutions instead of just ripping the article like I have seen many times.

    I am going to make a couple of points though. The UPDATE FROM may be SQL Server specific, but from my perspective that's fine since I am on a SQL Server site and am looking for help with SQL Server. To not use it becasue it could be deprecated, is in my opinion, not a reason to refrain from using it. Any SQL Server specific feature could be deprecated at some point. Could you post the where the documentation of it producing indeterminate results is? I'd be interested in reading that.

    Your CTE example is excellent and may be the best way to solve the problem in SQL 2005, but there are an awful lot of SQL 200 installs out there still, and the CTE won't work there. One thing Jeff usually does is post a solution that will work in 2000, although he is getting more and more into 2005 solutions now. OF course your earlier points about indeterminate results in the UPDATE FROM certainly need to be considered.

    You mention adding and index and the performance gained from it. You do mention, in a parenthetical, being careful not to add too many indexes, but I think you could have more clearly stated that you need to really study out the afffects of adding an index and that will take a lot more than 5 minutes. There are many beginners that are on this site regularly who may read that and start adding indexes whenever they encounter performance issues with a single query and then start having problems elsewhere.

    I also wanted to comment on this statement:

    An important observation that you did not make is that the queries are not 100% equivalent. If a salesperson never made any sales, the BOL example will find no rows, calculate NULL as the result of the SUM() function in the subquery, add that to SalesYTD (resulting in NULL), and attempt to store that. This will fail, since SalesYTD is defined as NOT NULL. (The BOL example should have used COALESCE to prevent this).

    Your query on the other hand has no rows for thie salesperson in the derived table, so the join will remove that salesperson from the set of rows to be updated, and his or her SalesYTD will not change.

    You might think that this makes your query superior, and in this case it indeed is - but what if we attempted to update a column SalesLastDay? The subquery method (with COALESCE added) will nicely set SalesLastDay to 0 if a salesperson has no sales. Your method leaves it unchanged. So if I entered a sale incorrectly yesterday (giving a sale to an inactive salesperson) and correct it today, the SalesLastDay for the inactive salesperson will not be cleared.

    There are situation where you want a row to be excluded from the UPDATE if there is no match. There are other situations where you don't want that. This behaviour difference is a common pitfall when people try to move between the two UPDATE syntax variations, and it's a shame that your article fails to point that out

    I think you made a great point here and I missed it as well. It is very important that you take this into account when doing any kind of aggregation based on dates, products, etc... when there are chances of there being missing data.

    I have to admit that when I first started working with SQL Server I was more likely to use correlated subqueries, but in the last couple of years I have moved to derived tables (not necessarily because of performance), and now to CTE's which I like best, mainly for readability.

    Thanks to Jeff and Hugo for sharing their expertise and making me think before I code!