Linking to next row using a join on row_number()+1=row_number()

  • Derek Dongray (4/16/2008)


    Matt Miller (4/16/2008)


    For what it's worth - the update method for forcing order will fall apart with a parallel plan, so don't change the MAXDOP on the test I gave you. Which might explain the difference in time.

    A parallel plan will wreck any guarantees as to order during the updates, so you might care to check the accuracy of your results if any of your methods rely on that. You may find that you're getting very fast but, unfortunately, very inaccurate results...

    I'm not sure I understand this.

    Our server has the 'Max degree of parallelism' parameter set to 0 (default), which means it will actually use a value of 4 (the number of processors) when looking for a parallel plan. However, from what you say, the update methods expecting a particular order (using a clustered index) may produce inaccurate results if MAXDOP is not 1. So the update methods, while fast and producing the correct number of records, may need careful checking.

    Have I got that right?

    Essentially correct - a parallel plan will mean that the data gets handed in chunks to multiple threads, each doing its own thing. Problem is - there's no guarantee that the "chunks" being handed out fall along significant breaks (say - by grouping), so the breaks would cause unpredictable and inaccurate results. The order WITHIN the "chunks" may be used, but not from one to the other. So you could end up with groups "broken up" over multiple threads, and thus - bad results.

    If you follow those discussions in the article I mentioned earlier - you'll see some of the difficulties in using algorithms that rely on the physical order to do something, and things that would cause the physical order not to be used. The part where Jeff goes into the Merry-go-round discussion is one of the pitfalls/extreme versions of those inaccuracies.

    Which is why you see folks recommending using the MAXDOP 1 option, as well as the INDEX(clusteredIndex) query hint to "force" the optimizer to use that order.

    Edit: all of this work stems from having to talk SQL Server down from its default position of "I can process this in whatever best way I think, to include order, join methods, etc...". Otherwise worded as "SQL Server doesn't guarantee the order in which things are processed" in a lot of the disclaimers I've seen.
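
    For anyone following along, here is a minimal sketch of the kind of "forced order" update being described. The table, column, and index names (dbo.TransDetail, RunningBalance, Amount, PK_TransDetail) are made up for illustration and are not from this thread:

    -- Hypothetical ordered-update sketch: a serial plan plus a clustered index hint
    -- so the UPDATE walks the rows in clustered index order.
    DECLARE @PrevBalance MONEY;
    SET @PrevBalance = 0;

    UPDATE t
       SET @PrevBalance = t.RunningBalance = @PrevBalance + t.Amount
      FROM dbo.TransDetail t WITH (TABLOCKX, INDEX(PK_TransDetail)) -- clustered index hint
    OPTION (MAXDOP 1);                                              -- suppress a parallel plan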

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Building on what Matt said:

    I actually think that the whole Parallelism thing is one of the keys to what's going on here with these tests, and why your Triangular query seems to scale up so much better than the others (I'll mention the other reason at the bottom):

    That is, most of the souped-up tricks that we use for this problem are actually just very efficient serialization techniques, which is how we usually address the whole "find my previous row" comparison. The catch is that "serialization" is almost the opposite of "parallelism", and typically the two don't mix. So the ones that are really fast on our laptops seem not so fast on the big servers, because they cannot leverage all of the processors & cores. Your Triangle query can, precisely because it doesn't use those tricks.

    Which brings me to your Triangle query itself. First, it isn't really a Triangle query: the GROUP BYs effectively fix that problem. Secondly, why is it somewhat faster than other similar queries? Because it cleverly uses the nested GROUP BYs to find the next state change using effectively only three scans instead of the four scans that most other versions need. It's really quite ingenious, and it does not seem to depend on (or benefit from) clustered indexes.
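
    To make the shape of that concrete, here is a rough sketch of the general "next state change" pattern; it is not the OP's exact query, and the #t(id, date, st) table is borrowed from the join snippet quoted below:

    -- Find, for each row, the date of the next status change by taking the MIN
    -- over a constrained self-join, then grouping back to one row per input row.
    SELECT a.id,
           a.date,
           a.st,
           NextChange = MIN(b.date)   -- earliest later row with a different status
      FROM #t a
      JOIN #t b
        ON a.id = b.id
       AND a.date < b.date
       AND a.st <> b.st
     GROUP BY a.id, a.date, a.st;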

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • No, no... the following makes it a triangular join even in the presence of a GROUP BY...

    join #t b
      on a.id = b.id
     and a.date < b.date
     and a.st <> b.st

    It's just that the triangles are very small... as Matt pointed out, you only need one ID that has a larger number of dates and BOOOM! Performance starts making gurgling noises compared to both set based and procedural (serial) ops.
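
    As a rough way to gauge that risk, you can measure the "triangle size" per ID directly. This is just a sanity-check sketch against the same #t table, not code from the original posts:

    -- For an ID with n rows, the constrained self-join can touch up to n*(n-1)/2 row pairs,
    -- so a single ID with many dates dominates the work.
    SELECT id,
           RowsPerId   = COUNT(*),
           MaxRowPairs = COUNT(*) * (COUNT(*) - 1) / 2
      FROM #t
     GROUP BY id
     ORDER BY MaxRowPairs DESC;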

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

    For info, I analysed some of our data by the number of statuses a document went through and then added a '% time' column based on the assumption that the time is proportional to the square of the number of steps. Using steps +/- 1 gave similar results, so it's as good as anything. Here's the result (a sketch of the calculation follows the table).

    steps      c   % time
        2  24148   7.0826
        3   4035   2.6628
        4   4379   5.1375
        5   6164  11.2994
        6   4388  11.5830
        7   5841  20.9864
        8   2040   9.5733
        9   1614   9.5861
       10    634   4.6488
       11    387   3.4336
       12    254   2.6819
       13    173   2.1438
       14    150   2.1558
       15     81   1.3364
       16     53   0.9949
       17     40   0.8476
       18     30   0.7127
       19     23   0.6088
       20     12   0.3520
       21      6   0.1940
       22      7   0.2484
       23      6   0.2327
       24      5   0.2112
       25      3   0.1375
       26      3   0.1487
       27      2   0.1069
       28      2   0.1150
       29      4   0.2467
       30      1   0.0660
       32      1   0.0751
       33      1   0.0799
       45      1   0.1485
       47      1   0.1620

    While there are some quite long sequences, the majority are less than 10.
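
    For reference, here's a sketch of how that '% time' column can be computed; #StepCounts is a made-up name for a table holding the (steps, c) pairs above:

    -- Weight each step count by steps^2 (time assumed proportional to the square
    -- of the number of steps) and express each weight as a percentage of the total.
    SELECT steps,
           c,
           [% time] = 100.0 * c * steps * steps
                    / SUM(1.0 * c * steps * steps) OVER ()
      FROM #StepCounts
     ORDER BY steps;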

    Derek

  • Jeff Moden (4/16/2008)


    No, no... the following makes it a triangular join even in the presence of a GROUP BY...

    join #t b
      on a.id = b.id
     and a.date < b.date
     and a.st <> b.st

    It's just that the triangles are very small... as Matt pointed out, you only need one ID that has a larger number of dates and BOOOM! Performance starts making gurgling noises compared to both set based and procedural (serial) ops.

    But Jeff, it is possible for other parts of the query to undo the bad "physical implication" of one clause, if the compiler can determine a better "logical implication" from their consolidation or cancellation and the optimiser can take advantage of that.

    And based on my logical analysis of the query and from looking at the query plans, it is obvious that that is exactly what is going on here. Instead of the stacked Nested Loops over scans that should typify any triangular join (being O(n^2) or worse), it is doing parallel hash joins over parallel scans resulting in O(n * k/p) and it only needs 3 scans instead of 4 on top of that, which is pretty darn good. Whoever wrote this is one smart cookie and my hat is off to them.

    Mind you, I am not yet convinced that this is the fastest possible query, but it might be.

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

    Yeaup... I absolutely agree that things like GROUP BY, and making sure that the triangular joins make very small triangles, can be very effective. Still, it's something to be very aware of and I guess that's my real point... even code that can present a potential problem can be used very well and with good performance if it's used the right way. For example, I pretty much frown on correlated subqueries, but there's a method for finding missing numbers in sequential IDs that uses them and it absolutely flies (see below). Like everything else, even with triangular joins, "It Depends". Doing what you did (analysis), followed by a performance test on a million rows, is a good way to ensure future scalability.

    SELECT GapStart = (SELECT ISNULL(MAX(b.SerialNumber),0)+1
                         FROM dbo.yourtable b
                        WHERE b.SerialNumber < a.SerialNumber),
           GapEnd   = SerialNumber - 1
      FROM dbo.yourtable a
     WHERE a.SerialNumber - 1 NOT IN (SELECT SerialNumber FROM dbo.yourtable)
       AND a.SerialNumber - 1 > 0

    The code above even beats a Tally table solution, especially in the presence of large gaps, because it returns only one row per gap: the triangular join has been so well constrained and further limited by MAX (like using a GROUP BY). The Tally table solution would return all the rows for a large gap.
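
    As a quick illustration of that "one row per gap" behaviour, here's a hypothetical demo; the sample values are made up and are not from the original post:

    -- Build a tiny dbo.yourtable with two gaps and run the gap query above against it.
    CREATE TABLE dbo.yourtable (SerialNumber INT PRIMARY KEY CLUSTERED);

    INSERT INTO dbo.yourtable (SerialNumber)
    SELECT 1 UNION ALL SELECT 2 UNION ALL SELECT 5 UNION ALL SELECT 6 UNION ALL SELECT 10;

    -- The gap query returns exactly two rows:
    --   GapStart = 3, GapEnd = 4
    --   GapStart = 7, GapEnd = 9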

    The OP did a fine job of limiting the triangular join in all aspects... that's why the OP's triangular code works so fast compared to the others. Still, like the code above, a triangular join is present and if you change the code (or, sometimes, the data), you need to be aware of it so that it doesn't actually get out of hand. Most developers that I've met aren't even aware of what a triangular join is or the devastating effect it has on performance for things like running totals.

    Another thing I use highly constrained triangular joins for is duplicate deletions in the presence of an IDENTITY column... like the missing sequence code above, it runs like lightning, but only when there aren't many identical dupes for a given row. The code is simple for other people to understand and the risk of having something like 20,000 dupes for a single row doesn't exist for the data we have. Still, like Matt said, it is a risk in this case. If I wanted to guarantee scalability, I'd copy the keys into a temp table and do an update similar to the update I used in the running balance problem to mark the dupes (7 seconds per million rows regardless of how many dupes there may be).
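
    For the record, here's a rough sketch of that kind of dupe-delete; the table and column names are hypothetical and this is only the general shape of the pattern, not the actual code from that job:

    -- Delete every row that has an identical "twin" with a lower IDENTITY value,
    -- keeping the earliest copy of each duplicate set. The b.RowID < a.RowID predicate
    -- is the (tightly constrained) triangular part.
    DELETE a
      FROM dbo.SomeTable a
      JOIN dbo.SomeTable b
        ON b.KeyCol1 = a.KeyCol1
       AND b.KeyCol2 = a.KeyCol2
       AND b.RowID   < a.RowID;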

    Like I said in the "Hidden RBAR" article about triangular joins...

    Not all Triangular Joins are bad. With some restraint and the right criteria, Triangular Joins can be used for some pretty remarkable things... make finite schedules... do high speed dupe checks... etc. But, you've really got to be careful. Improperly written Triangular Joins are worse than even Cursors or While Loops and can bring a CPU and Disk System right to its knees.

    Watch out for "Hidden RBAR".

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

    Well - whichever way you decide to go - it sounds like you should hang on to this thread in case something changes. Given that each method seems to have one or more caveats, you may find yourself changing methods "in a hurry" some day when your data distribution changes..... Considering that there are 4 viable methods at last count - holding on to this would give you a quick way to switch in case the one you picked goes "boom".....

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • This thread is gold 🙂

    Regards,

    Jacob
