Tally Tables

  • Comments posted to this topic are about the item Tally Tables


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • Thanks for a nice question - good way to start a work week

    If everything seems to be going well, you have obviously overlooked something.

    Ron

    Please help us, help you -before posting a question please read[/url]
    Before posting a performance problem please read[/url]

  • bitbucket-25253 (9/2/2013)


    Thanks for a nice question - good way to start a work week

    +1 🙂

    _______________________________________________________________
    To get quick answer follow this link:
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

  • Nice question....

  • Just look at the execution plan and tell me if there is no difference.

    Or even try this, where NEWID() is completely eliminated from the second query.

    -- 1

    SELECT ROW_NUMBER() OVER (ORDER BY NEWID())

    FROM sys.all_columns

    -- 2

    SELECT ROW_NUMBER() OVER (ORDER BY (select 1))

    FROM sys.all_columns

    I'd argue that they are different but yield the same result, as is the example given in the question. The first query will indeed evaluate NEWID() once for every row whereas the second query does a single call to NEWID(), which lets SQL Server eliminate a sort and thus generates a more efficient query plan.

    Also, strictly speaking the will both return the result sets in random order since there is no ORDER BY specified for the result set. The fact that due to implementation they will render sorted output is not to be trusted if it's important.


    Just because you're right doesn't mean everybody else is wrong.

  • I chose the third answer: same results but randomized, because if you'd include a column from sys.all_columns, you'd see the first result set is random.

    Ah well, nice question anyway. 🙂

    Need an answer? No, you need a question
    My blog at https://sqlkover.com.
    MCSE Business Intelligence - Microsoft Data Platform MVP

  • Rune Bivrin (9/3/2013)


    Also, strictly speaking the will both return the result sets in random order since there is no ORDER BY specified for the result set.

    That's the reason I got it right - ie on the basis that they both return the information in a random order (seemed to be the best fit).

  • This was removed by the editor as SPAM

  • Rune Bivrin (9/3/2013)


    Just look at the execution plan and tell me if there is no difference.

    There's very little difference. On SQL Server 2008 R2 SP2, there was an extra Compute Scalar operator for the second query, with a cost of 0%.

    The first query will indeed evaluate NEWID() once for every row whereas the second query does a single call to NEWID(), which lets SQL Server eliminate a sort and thus generates a more efficient query plan.

    Are you sure about that? Try this and see what you get:

    SELECT (SELECT NEWID()) FROM sys.all_columns

    Also, strictly speaking the will both return the result sets in random order since there is no ORDER BY specified for the result set. The fact that due to implementation they will render sorted output is not to be trusted if it's important.

    Yes, that's true.

    John

  • John Mitchell-245523 (9/3/2013)


    Rune Bivrin (9/3/2013)


    Just look at the execution plan and tell me if there is no difference.

    There's very little difference. On SQL Server 2008 R2 SP2, there was an extra Compute Scalar operator for the second query, with a cost of 0%.

    Interesting. On SQL Server 2012 SP1 there's an extra Compute Scalar and an extra Sort for the first query.

    John Mitchell-245523 (9/3/2013)


    The first query will indeed evaluate NEWID() once for every row whereas the second query does a single call to NEWID(), which lets SQL Server eliminate a sort and thus generates a more efficient query plan.

    Are you sure about that? Try this and see what you get:

    SELECT (SELECT NEWID()) FROM sys.all_columns

    True dat. Only goes to show that trying to understand/predict what SQL Server does can be tricky.

    John Mitchell-245523 (9/3/2013)


    Also, strictly speaking the will both return the result sets in random order since there is no ORDER BY specified for the result set. The fact that due to implementation they will render sorted output is not to be trusted if it's important.

    Yes, that's true.

    John


    Just because you're right doesn't mean everybody else is wrong.

  • Nice clear and easy question (assuming the last option "no difference" means "no difference in what they the language specification says they return" - if it means what it actually says then of course none of the options is correct). Absolutely terrible explanation, though!

    The reason the two queries return the same is that when the only column returned is the generated row number, the over(order) clause is irrelevant (provided it's a valid over(order) clause) because it doesn't affect any row, and no return ordering is specified in either of the two statements. Also the BOL page referenced in the explanation tells us exactly nothing about any of the statements in the explanation and provides exactly no clue as to what is the correct answer (all such clues are in the row_number() page, which is not referenced).

    Tom

  • Rune Bivrin (9/3/2013)


    Just look at the execution plan and tell me if there is no difference.

    Or even try this, where NEWID() is completely eliminated from the second query.

    -- 1

    SELECT ROW_NUMBER() OVER (ORDER BY NEWID())

    FROM sys.all_columns

    -- 2

    SELECT ROW_NUMBER() OVER (ORDER BY (select 1))

    FROM sys.all_columns

    I'd argue that they are different but yield the same result, as is the example given in the question.

    I'd be somewhat surprised if the first query generated the error message

    Windowed functions and NEXT VALUE FOR functions do not support integer indices as ORDER BY clause expressions.

    So I suspect these two don't both generate the same.

    Tom

  • L' Eomot Inversé (9/3/2013)


    Rune Bivrin (9/3/2013)


    Just look at the execution plan and tell me if there is no difference.

    Or even try this, where NEWID() is completely eliminated from the second query.

    -- 1

    SELECT ROW_NUMBER() OVER (ORDER BY NEWID())

    FROM sys.all_columns

    -- 2

    SELECT ROW_NUMBER() OVER (ORDER BY (select 1))

    FROM sys.all_columns

    I'd argue that they are different but yield the same result, as is the example given in the question.

    I'd be somewhat surprised if the first query generated the error message

    Windowed functions and NEXT VALUE FOR functions do not support integer indices as ORDER BY clause expressions.

    So I suspect these two don't both generate the same.

    Well, on SQL Server 2012 they do. ORDER BY (select 1) is no different from ORDER BY (SELECT NEWID())

    SELECT ROW_NUMBER() OVER (ORDER BY 1) FROM sys.all_columns

    yields an error, however, as would be expected.


    Just because you're right doesn't mean everybody else is wrong.

  • Interesting question, but incorrect and misleading answer. I did score a point, but only because I eliminated the other incorrect options first and didn't see the error in the option I was left with.

    Neither query will return an error, so I could quickly eliminate option 1.

    I didn't really understand options 2 and 3. As others have said, without ORDER BY, there is no guarantee at all on the order of either result set. And both result sets will, in the current implementation, normally be returned in ascending order, giving the illusion that this might be guaranteed. I eliminated these options because I could see no way how the author could come up with this, either from theory (both have undefined order) or from trial and error (both appear to be sorted).

    That leaves option 4. Based on the results, one could believe this. But even though the results are the same, the queries are not identical - as John Mitchell already called out.

    John Mitchell-245523 (9/3/2013)


    Rune Bivrin (9/3/2013)


    Just look at the execution plan and tell me if there is no difference.

    There's very little difference. On SQL Server 2008 R2 SP2, there was an extra Compute Scalar operator for the second query, with a cost of 0%.

    First - please stop looking at those costs as if they mean something. They do mean something, but not what most people think. These are nothing but the estimations that the optimizer uses to choose between plans. They do not reflect the real cost of an operator at all. (And neither do the percentages of the plan overall if you run multiple queries).

    Second - on my system (2012 SP1), the difference is an extra compute scalar and an extra sort for the second query.

    Third - this extra compute scalar makes a whole lot of difference.

    - In the first plan, the NEWID() function is evaluated for each row. The results are then sorted, so that row numbers can be returned.

    - In the second plan, the (SELECT NEWID()) is a non-correlated subquery. A non-correlated subquery does not have to be evaluated for each row. The optimizer does not guarantee only a single execution for non-correlated subqueries, but it definitely can do so. And it is definitely what happens in this case. The optimizer first creates a plan with a single execution of the subquery; then realizes that the result is only used for sorting and not returned - and because sorting in a constant is a no-op regardless of the exact value of the constant, the optimizer then decides to simply not evaluate the subquery at all. (*)

    (*) The last part is an assumption. I was unable to find anywhere in the plan where NEWID() is evaluated. But I might have overlooked it; maybe it is actually evaluated. But definitely not more than once, as the samples below show.

    (All the above applies to SQL Server 2012 SP1, where I ran my tests; other versions may have differences)

    The first query will indeed evaluate NEWID() once for every row whereas the second query does a single call to NEWID(), which lets SQL Server eliminate a sort and thus generates a more efficient query plan.

    Are you sure about that? Try this and see what you get:

    SELECT (SELECT NEWID()) FROM sys.all_columns

    Yeah, I'll admit that I have no clue why the optimizer decides to evaluate the non-correlated subquery oince for each row in this case. It does not HAVE to.

    But I do know that this is a different query from the one in the question. And by making other variations on the original query, I think I can prove that John is in fact correct.

    First - let's add some extra rows from the same table source:

    -- 1

    SELECT ROW_NUMBER() OVER (ORDER BY NEWID()), *

    FROM sys.all_columns;

    -- 2

    SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NEWID())), *

    FROM sys.all_columns;

    -- 3

    SELECT *

    FROM sys.all_columns;

    The order is still by ascending ROW_NUMBER() for me (This is still not guaranteed - I considered adding an ORDER BY at the outer query level, but wanted to change the original query as little as possible. I did yield to my OCD to add the semicolons, though).

    The second and third query show the rows from sys.all_columns in the same order, and do so every time they are executed. Apparently, the ROW_NUMBERs are dished out to rows in whatever order they are read from the DMV, wth no attempt to reorder. The first query, on the other hand, clearly reorders the sys.all_columns data. To a different order every time it is executed (due to the pseudo-random nature of NEWID()).

    Another way to show the difference is to change the ranking function:

    -- 1

    SELECT RANK() OVER (ORDER BY NEWID())

    FROM sys.all_columns;

    -- 2

    SELECT RANK() OVER (ORDER BY (SELECT NEWID()))

    FROM sys.all_columns;

    By using RANK instead of ROW_NUMBER, it is immediately obvious that the (SELECT NEWID()) returns the same value for every row - that's why all rows get the same rank number 1, because they all tie for the first place. With ROW_NUMBER, this was hidden, because it resolves ties by giving different numbers to rows even if the ORDER BY value is the same. So when ORDER BY produces 1, 2, 3, there is no way to know if it is based on three different values, or on three equal values. With RANK, equal values generate 1, 1, 1 instead.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Awesome! Thanks for the in-depth analysis, Hugo, and for being more generous to me than I think I deserved. The first time I saw the queries, I assumed (rightly, as it turned out) that the NEWID would only be evaluated once in the query with the subSELECT. But when I ran that simple SELECT query, I thought I'd proved myself wrong. I hadn't thought of using RANK to check.

    The stuff about the execution plans was interesting, too. One thing I don't understand - if the second query has less work to do because of the single NEWID evaluation, why is it that query that has the extra Compute Scalar operator?

    John

Viewing 15 posts - 1 through 15 (of 48 total)

You must be logged in to reply to this topic. Login to reply