Interesting question, but incorrect and misleading answer. I did score a point, but only because I eliminated the other incorrect options first and didn't see the error in the option I was left with.
Neither query will return an error, so I could quickly eliminate option 1.
I didn't really understand options 2 and 3. As others have said, without ORDER BY, there is no guarantee at all on the order of either result set. And both result sets will, in the current implementation, normally be returned in ascending order, giving the illusion that this might be guaranteed. I eliminated these options because I could see no way how the author could come up with this, either from theory (both have undefined order) or from trial and error (both appear to be sorted).
That leaves option 4. Based on the results, one could believe this. But even though the results are the same, the queries are not identical - as John Mitchell already called out.
John Mitchell-245523 (9/3/2013)
Rune Bivrin (9/3/2013)
Just look at the execution plan and tell me if there is no difference.
There's very little difference. On SQL Server 2008 R2 SP2, there was an extra Compute Scalar operator for the second query, with a cost of 0%.
First - please stop looking at those costs as if they mean something. They do mean something, but not what most people think. These are nothing but the estimations that the optimizer uses to choose between plans. They do not reflect the real cost of an operator at all. (And neither do the percentages of the plan overall if you run multiple queries).
Second - on my system (2012 SP1), the difference is an extra compute scalar and an extra sort for the second query.
Third - this extra compute scalar makes a whole lot of difference.
- In the first plan, the NEWID() function is evaluated for each row. The results are then sorted, so that row numbers can be returned.
- In the second plan, the (SELECT NEWID()) is a non-correlated subquery. A non-correlated subquery does not have to be evaluated for each row. The optimizer does not guarantee only a single execution for non-correlated subqueries, but it definitely can do so. And it is definitely what happens in this case. The optimizer first creates a plan with a single execution of the subquery; then realizes that the result is only used for sorting and not returned - and because sorting in a constant is a no-op regardless of the exact value of the constant, the optimizer then decides to simply not evaluate the subquery at all. (*)
(*) The last part is an assumption. I was unable to find anywhere in the plan where NEWID() is evaluated. But I might have overlooked it; maybe it is actually evaluated. But definitely not more than once, as the samples below show.
(All the above applies to SQL Server 2012 SP1, where I ran my tests; other versions may have differences)
The first query will indeed evaluate NEWID() once for every row whereas the second query does a single call to NEWID(), which lets SQL Server eliminate a sort and thus generates a more efficient query plan.
Are you sure about that? Try this and see what you get:
SELECT (SELECT NEWID()) FROM sys.all_columns
Yeah, I'll admit that I have no clue why the optimizer decides to evaluate the non-correlated subquery oince for each row in this case. It does not HAVE to.
But I do know that this is a different query from the one in the question. And by making other variations on the original query, I think I can prove that John is in fact correct.
First - let's add some extra rows from the same table source:
SELECT ROW_NUMBER() OVER (ORDER BY NEWID()), *
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NEWID())), *
The order is still by ascending ROW_NUMBER() for me (This is still not guaranteed - I considered adding an ORDER BY at the outer query level, but wanted to change the original query as little as possible. I did yield to my OCD to add the semicolons, though).
The second and third query show the rows from sys.all_columns in the same order, and do so every time they are executed. Apparently, the ROW_NUMBERs are dished out to rows in whatever order they are read from the DMV, wth no attempt to reorder. The first query, on the other hand, clearly reorders the sys.all_columns data. To a different order every time it is executed (due to the pseudo-random nature of NEWID()).
Another way to show the difference is to change the ranking function:
SELECT RANK() OVER (ORDER BY NEWID())
SELECT RANK() OVER (ORDER BY (SELECT NEWID()))
By using RANK instead of ROW_NUMBER, it is immediately obvious that the (SELECT NEWID()) returns the same value for every row - that's why all rows get the same rank number 1, because they all tie for the first place. With ROW_NUMBER, this was hidden, because it resolves ties by giving different numbers to rows even if the ORDER BY value is the same. So when ORDER BY produces 1, 2, 3, there is no way to know if it is based on three different values, or on three equal values. With RANK, equal values generate 1, 1, 1 instead.