Solving the Running Total and Ordinal Rank Problems (Rewritten)

  • Paul White NZ (9/25/2010)


    Paul White NZ (9/7/2010)


    There is an optimizer quirk that means that some plans using INDEX(0) may include an unnecessary sort (which does not apply to INDEX(1)). I will be blogging the details for next week's T-SQL Tuesday (the topic is indexes).

    Yeah well I missed T-SQL Tuesday by some margin, but it's up now:

    http://sqlblog.com/blogs/paul_white/archive/2010/09/23/a-tale-of-two-index-hints.aspx

    Nice. I learned new things from this. Write more like this one, please.

    But I noticed that query 1 and query 2 swapped names at the beginning of the index hints section. Was that deliberate to keep the readers alert?

    Tom

  • Tom.Thomson (9/25/2010)


    Nice. I learned new things from this. Write more like this one, please.

    Thank you, I appreciate it. More in the same style or area of the product?

    But I noticed that query 1 and query 2 swapped names at the beginning of the index hints section. Was that deliberate to keep the readers alert?

    I make that error so frequently I don't even bother to check for it now :rolleyes:

    You're the first one of 1,745 reads to spot it though - so well done! I've fixed it now, by the way.

  • mjswart (9/22/2010)


    Hi Jeff,

    Records aren't always stored physically ordered on a data page. Most of the time they are physically ordered correctly, but the actual order comes from the slot array. I was curious to see whether a different physical order breaks the quirky update.

    Turns out it does not break the quirky update:

    -- create the table

    USE tempdb

    GO

    CREATE TABLE test

    (

    id int not null,

    value uniqueidentifier,

    deltaFromPrevious int,

    constraint pk_test primary key clustered (id)

    )

    INSERT test (id,value)

    VALUES

    (2,NEWID()),

    (4,NEWID()),

    (6,NEWID()),

    (8,NEWID()),

    (10,NEWID())

    GO

    -- run quirky update

    DECLARE @dfp int, @prev int;

    UPDATE test

    SET @dfp = deltaFromPrevious = id-@prev,

    @prev = id;

    -- everything looks fine here.

    select * from test

    -- mess up the physical order:

    UPDATE test SET id = 5 WHERE id = 10;

    GO

    -- dbcc page now tells us that physically the

    -- rows are out of order.

    -- run the quirky update again:

    DECLARE @dfp int, @prev int;

    UPDATE test

    SET @dfp = deltaFromPrevious = id-@prev,

    @prev = id;

    -- everything still looks fine here:

    select * from test

    DBCC may indeed show the data out of order but not in your code. You forgot to include the DBCC command examples in your code. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Paul, Jeff, et all...

    I want to ensure that I understand the safety check as posted by Paul at http://www.sqlservercentral.com/Forums/FindPost980118.aspx, and modified by Tom at http://www.sqlservercentral.com/Forums/FindPost981258.aspx.

    The resultant (combined) code would be:WITH SafeTable

    AS (

    SELECT Sequence = ROW_NUMBER() OVER (ORDER BY TD.AccountID ASC, TD.Date ASC, TD.TransactionDetailID ASC),

    TD.AccountID,

    TD.AccountRunningCount,

    TD.AccountRunningTotal,

    TD.Amount

    -- removed 3 columns from CTE not used in update statement

    FROM dbo.TransactionDetail TD

    )

    UPDATE SafeTable

    SET @AccountRunningTotal = AccountRunningTotal =

    CASE

    WHEN Sequence = @Sequence+1

    THEN CASE

    WHEN AccountID = @PrevAccountID

    THEN @AccountRunningTotal+Amount

    ELSE Amount

    END

    ELSE 1/0

    END,

    @AccountRunningCount = AccountRunningCount =

    CASE

    WHEN AccountID = @PrevAccountID

    THEN @AccountRunningCount + 1

    ELSE 1

    END,

    @Sequence =

    CASE

    WHEN Sequence = @Sequence + 1 THEN Sequence -- updating in sequence

    ELSE 1/0 -- quirky update is broken!

    END,

    @PrevAccountID = AccountID;

    And, even though the ROW_NUMBER() function does include a sort, thus presenting the data to the update statement in the proper order, we STILL need the clustered index, on the same columns and in the same order as utilized in the ROW_NUMBER() function.

    Has any testing been done to show that the OPTION (MAXDOP 1), and the WITH (TABLOCKX) hints are now not necessary? Obviously, the safety check will catch that if anything is off, but are they still needed to ensure that the update succeeds?

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • I believe that the MAXDOP 1 option and TABLOCKX hint are still required.

    Unless parallel processing jumped in at just the right spot, the serial processing of a single CPU is still necessary to do the calculation correctly.

    For the TABLOCKX hint, why wouldn't you use it? The UPDATE is going to eventually lock the whole table... why not just take a single lock to begin with and enjoy the guarantee that no one is going to add data in the middle of your calculation?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • WayneS (10/10/2010)


    And, even though the ROW_NUMBER() function does include a sort, thus presenting the data to the update statement in the proper order...

    It's the order of the rows arriving at the Compute Scalar that is important, not at the Update iterator itself.

    WayneS (10/10/2010)


    ...we STILL need the clustered index, on the same columns and in the same order as utilized in the ROW_NUMBER() function.

    Almost always yes. On anything more than a very few rows, the optimizer will choose to introduce a sort on the clustered index keys. This sort is almost guaranteed to occur before the critical Compute Scalar, thus buggering up the essential ordering at that point.

    WayneS (10/10/2010)


    Has any testing been done to show that the OPTION (MAXDOP 1), and the WITH (TABLOCKX) hints are now not necessary? Obviously, the safety check will catch that if anything is off, but are they still needed to ensure that the update succeeds?

    It's all but impossible (right now) to generate a parallel plan for a quirky update, but it seems logical that the work performed by the Compute Scalar has to occur on a single thread for correct results. So, I vote for MAXDOP 1 being essential.

    TABLOCKX isn't essential in quite the same way, but it seems like a sensible precaution to me, for the reasons Jeff gave.

  • Thanks for your responses guys. It's right in line with what I was thinking, but it's always good to get the double-check.

    Now, I just gotta get used to using the QU with the PW/TT safety check modification.

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • Paul White NZ (10/10/2010)


    TABLOCKX isn't essential in quite the same way, but it seems like a sensible precaution to me, for the reasons Jeff gave.

    I think there's another reason as well: if this query is the only one in a transaction, using TABLOCKX will reduce the risk of this query deadlocking with something else that's going on.

    Tom

  • Hello All,

    I have been playing with the QU code in Jeff's article and discovered a couple of things....

    It seems you CAN update in any order you like regardless of the clustered index ! :pinch:

    You may not therefore NEED a clustered index.. My testing so far shows that I can perform a correct update regardless of having a CI or not...but it is very limited testing so far....

    The key to this is the use of ordering in a CTE - which I know Jeff covered but he missed the clincher I believe. The TOP clause has the answer.

    It makes sense, does it not, that if I select 10 rows in my CTE ordered by anything I like, and then update that CTE I will update those 10 rows? Yes?

    Yes! So extending that to the whole table is as simple as counting the rows in the table first, then specifying that count in the TOP clause..

    So, you will be dying to try it... Here is the code - taken from Jeff's article and amended:

    EXEC dbo.ResetTestTable

    DECLARE @PrevAccountID INT

    DECLARE @AccountRunningTotal MONEY

    DECLARE @AccountRunningCount INT

    --===== Update the running total and running count for this row using the "Quirky

    -- Update" and a "Pseudo-cursor". The order of the UPDATE is controlled by the

    -- order by clause in the CTE!

    --===== Get the rowcount - this is vital - it must not exceed the actual rowcount.

    DECLARE @count bigint = (select COUNT(*) from TransactionDetail)

    --===== Update the running total using multipart updates

    -- applied to an "ordered" CTE.

    ;WITH cteOrdered AS

    (

    SELECT TOP (@count) --===== This forces the new ordering.

    AccountID ,

    NCID,

    Amount,

    AccountRunningTotal,

    AccountRunningCount

    FROM dbo.TransactionDetail

    ORDER BY NCID DESC --Don't forget... reverse order here

    )

    UPDATE cteOrdered

    SET @AccountRunningTotal = AccountRunningTotal = CASE

    WHEN AccountID = @PrevAccountID

    THEN @AccountRunningTotal + Amount

    ELSE Amount

    END,

    @AccountRunningCount = AccountRunningCount = CASE

    WHEN AccountID = @PrevAccountID

    THEN @AccountRunningCount + 1

    ELSE 1

    END,

    @PrevAccountID = AccountID

    FROM cteOrdered WITH (TABLOCKX)

    GO

    You will want this new Verify SP as well - VerifyNCID - to check the running totals. Again this is taken from Jeff's code and amended to check the results ordered by NCID.

    CREATE PROCEDURE dbo.VerifyNCID AS

    /*************************************************************************************

    Code to verify that the account running total calculation worked correctly.

    Please read the comments to see how it works.

    *************************************************************************************/

    --===== Conditionally drop the verification table to make

    -- it easy to rerun the verification code

    IF OBJECT_ID('TempDB..#Verification') IS NOT NULL

    DROP TABLE dbo.#Verification

    --===== Define a variable to remember the number of rows

    -- copied to the verification table

    DECLARE @MyCount INT

    --===== Copy the data from the test table into the

    -- verification table in the correct order.

    -- Remember the correct order with a ROW_NUMBER - otherwise we don't use the correct ordering

    SELECT ROW_NUMBER() OVER(ORDER BY NCID DESC) AS RowNum,

    AccountID,

    Amount,

    AccountRunningTotal

    INTO #Verification

    FROM dbo.TransactionDetail

    ORDER BY NCID DESC

    --===== Remember the number of rows we just copied

    SELECT @MyCount = @@ROWCOUNT

    --===== Check the running total calculations

    SELECT CASE

    WHEN COUNT(hi.RowNum) + 1 = @MyCount

    THEN 'Account Running Total Calculations are correct'

    ELSE 'There are some errors in the Account Running Totals'

    END

    FROM #Verification lo

    INNER JOIN

    #Verification hi

    ON lo.RowNum + 1 = hi.RowNum

    WHERE (-- Compare lines with the same AccountID

    hi.AccountID = lo.AccountID

    AND hi.AccountRunningTotal = lo.AccountRunningTotal + hi.Amount)

    OR

    (-- First line of account has running total same as amount

    hi.AccountID <> lo.AccountID

    AND hi.AccountRunningTotal = hi.Amount)

    GO

    Now, the question is - does this make anyone happier about using it as you can define the sort order specifically?

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • [/url]
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • [/url]
  • How to Post Performance Problems - by Gail Shaw
  • [/url]

  • MM -

    Looks interesting. It will take more investigation.

    I have some initial questions:

    --===== Get the rowcount - this is vital - it must not exceed the actual rowcount.

    1. So, what happens if a record (or more) gets deleted between the start of this statement and the next where the variable is used?

    2. Wouldn't it be faster to get the record count from sys.dm_db_partition_stats?

    3. Is there any special reason that you didn't include the safety check?

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • mister.magoo (11/17/2010)


    Now, the question is - does this make anyone happier about using it as you can define the sort order specifically?

    No! Let me explain why:

    The TOP is the same idea, in principle, as that used by the safety check. I mentioned in previous posts that the safety check allows a quirky update to be performed without a clustered index - specifically on a heap table. If there is a clustered index, in a different order from that specified either by your TOP clause, or by my ROW_NUMBER safety check, the quirky update will usually fail.

    The reason is that the optimizer will introduce a sort to optimize the clustered index update for sequential IO, and that sort appears before the vital Compute Scalar, messing up the quirky update.

    I prefer the ordering provided by the safety check over your TOP method because:

    (a) The safety check ensures the QU completes successfully or throws an error;

    (b) No extra COUNT_BIG(*) step is required

    Both the safety check ROW_NUMBER() clause and the TOP...ORDER BY method introduce a sorting requirement. The important thing (for the quirky update to work) is that this order is preserved later in the plan when the Compute Scalar processes the running total.

    In general, there is no guarantee at all that this sort order will be preserved for later iterators. That's why we need the safety check. To be clear, the engine does not guarantee that sort order will be preserved any longer than is required for correctness. For example, a sort performed to provide order to a TOP iterator, is guaranteed to be retained as far as the TOP, but no further. The next iterator after the TOP might have rows presented to it in any order, in principle.

    This may seem counter-intuitive, since you can look at a plan and 'see' that rows are sorted at one point, and never re-sorted. You might even reason that there aren't any non-order-preserving iterators (like a hash) and so order must be preserved. This is the faulty assumption I want people to stop making.

    The advantage of the safety check is that it will throw an error if the assumption about preserved sort order does not hold.

    Paul

  • WayneS (11/17/2010)


    I have some initial questions:

    --===== Get the rowcount - this is vital - it must not exceed the actual rowcount.

    This seems wrong to me - surely the row count can safely exceed the actual, but must not be less. I usually use TOP (9223372036854775807) for this sort of 'intermediate materialization' query. In TOP (expression), expression is BIGINT (when PERCENT is not specified), and 9223372036854775807 is the maximum value of a BIGINT

    So, what happens if a record (or more) gets deleted between the start of this statement and the next where the variable is used?

    It would be problematic if new records were added. Another reason to use TOP (9223372036854775807).

    Wouldn't it be faster to get the record count from sys.dm_db_partition_stats?

    If the exact number were required for the TOP (which it isn't) you couldn't rely on that view to provide it - it's not guaranteed to be transactionally correct, or even close.

    Is there any special reason that you didn't include the safety check?

    The safety check replaces the TOP idea completely.

  • All interesting points, which I also will have to investigate.

    The only reason for not including Paul's safety check was that it was not in the Article which is where I took the code from.

    As far as guaranteeing sort order is concerned, obviously the safety check is an excellent idea.

    I have to admit I had not thought about it as it was not in the article - the fact it allows updates by any order is excellent and I bow to this as a much better solution.

    However, I still am interested in the fact (so far as I can see) that making sure the TOP number does not exceed the rowcount means that the update does not use the clustered index for ordering and does not even seem to require one on the table....

    Thanks for the feedback.

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • [/url]
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • [/url]
  • How to Post Performance Problems - by Gail Shaw
  • [/url]

  • Hi Mister Magoo,

    In addition to the points raised by Paul, here is one more.

    Older versions of SQL Server (prior to SQL Server 2005) allowed you to create "ordered views". This is in fact nonsense, as views and tables are sets which are unsorted by definition - this is why the standard SQL syntax does not even allow an ORDER BY in a view definition. And neither does SQL Server - unless the view definition contains a TOP clause, in which case an accompanying ORDER BY clause is allowed; this ORDER BY clause is then used to define which rows qualify the TOP clause.

    People quickly observed that a SELECT from that VIEW always returned rows in the specified order. This was never documented or guaranteed; it was a coincidental side effect. And yet, people came in the habit of creating views with a TOP 100 PERCENT clause and an ORDER BY. expecting a SELECT from that view to guarantee results in the "correct" order.

    In SQL Server 2005, the optimizer got smarter. It realized that TOP (100 PERCENT) is a no-op, and simply disregarded both the TOP operator and the accompanying ORDER BY. And people started complaining (and even reporting bugs) because their "ordered views" suddenly stopped working. Most people then changed their code, moving the ORDER BY to the SELECT that returns dat to the client. But there were also the stubborn people, who found out thet the undocumented behaviour persists if you use TOP (99.999999 PERCENT), or TOP (2147483647) [that is the maximum integer - I have not seen Paul's suggestion of TOP (9223372036854775807) before, but that is of course the same idea].

    These suggestion do indeed seem to work - for now. And only if the amount of rows that qualify the WHERE clause does not exceed the specified number, or the number where 0.0000001% is 1 or more.

    But is this safe? Should we not expect that the SQL Server developers, who constantly try to improve the optimizer, will one day replace the current "simple" logic to remove a TOP (100 PERCENT) with a more advanced logic that, at run time, removes any TOP clause that does not remove any rows? It may never happen. Or it may. It's definitely not a dependency I am willing to take for my code.

    I have already written before in thsis discussion that Paul's safety check is a good improvement. It does not really make the method any more reliable (as it still relies on undocumented and unguaranteed functionality), but it does at least promise to throw an error when incorrect results would be generated without having to do a seperate check (which, I am sure, many people simply won't do after some testing time in a dev environment).

    Whether or not the combination of your TOP trick with Paul's safety check makes this method more reliable is something I'll leave for others to judge. My point merely is that without Paul's safety check, this method (with or without TOP) is unsafe and therefor not suited for production code.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • All,

    Thanks for taking the time to explain things to me - this is certainly a tricky subject and I appreciate your help.

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • [/url]
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • [/url]
  • How to Post Performance Problems - by Gail Shaw
  • [/url]

    Viewing 15 posts - 211 through 225 (of 308 total)

    You must be logged in to reply to this topic. Login to reply