Join yeilds different results if used with CTE or a real table

  • Hi,

    Here is the SQL I used as an example:

    --CREATE AND POPULATE TABLE AAA ------------------------------------------

    CREATE TABLE [dbo].[AAA](

    [Col1] [nchar](10) NOT NULL,

    [RowNumber] [int] NOT NULL

    ) ON [PRIMARY]

    GO

    INSERT INTO [dbo].[AAA]

    ([Col1]

    ,[RowNumber])

    VALUES

    ('a', 1),

    ('s', 2),

    ('d', 3),

    ('e', 4),

    ('f', 5),

    ('g', 6)

    GO

    --QUERY No 1 ----------------------------------------------------------------

    ;with BBB as

    (

    select top 3 ABS(checksum(newid())) % 6 + 1 as RandomRow, RowNumber, Col1

    from AAA

    order by RowNumber

    )

    select A.Col1, A.RowNumber

    from AAA as A

    join BBB

    On A.RowNumber = BBB.RandomRow;

    --QUERY No 2 ----------------------------------------------------------------

    select top 3 ABS(checksum(newid())) % 6 + 1 as RandomRow, RowNumber, AAA.Col1

    into BBB

    from AAA

    order by RowNumber;

    select A.Col1, A.RowNumber

    from AAA as A

    join BBB

    On A.RowNumber = BBB.RandomRow;

    --Cleaning -------------------------------------------------------------------

    --drop table BBB;

    --drop table AAA;

    I have a table AAA with just two columns, one of which is a row number. I use CTE to create another table “BBB”, which based on the AAA. The BBB table contains the same columns as the AAA table, plus a new column RandomRow.

    The RandomRow is populated by random numbers, generated from the range 1 to @N, where @N is the number of rows in the table AAA – in this case 6. The table BBB contains 3 rows only: the row numbers will be 1,2,3.

    The Select statement should return rows from AAA, which are identified by the number in the [BBB].[RandomRows].

    The Select statement contains the join, which is expected to return exactly 3 rows, because the BBB is subset of AAA. However, each time I run the query the Select returns different numbers of rows.

    To see the effect, please try to re-run the query No 1 several times, and you should get different numbers of returned rows each time.

    If I replace the CTE with real table (see the query 2/) I get expected results, exactly 3 rows each time I run the query.

    My questions are:

    A/ Why the join yields different results for CTE and for the real table?

    B/ If I replace the join in the query No 1 with RIGHT OUTER JOIN, the query works fine and returns just 3 rows. However, I think it should work with “join” also, because the BBB is subset of AAA, and number in RandomRows column will always identify the valid row in AAA.

    Thank you for help.

  • itlk (3/21/2013)


    A/ Why the join yields different results for CTE and for the real table?

    B/ If I replace the join in the query No 1 with RIGHT OUTER JOIN, the query works fine and returns just 3 rows. However, I think it should work with “join” also, because the BBB is subset of AAA, and number in RandomRows column will always identify the valid row in AAA.

    Thank you for help.

    A) You're connecting Rownumber to randomRow, which changes EACH time you run the cte, as you generate a newID each run. You'll be very variable there. You need to connect rownumber to rownumber for consistency.

    B) Right outer join is going to return you the three rows from BBB, sure, with varying connectivity.

    It all revolves around you using a randomized value in the cte for your join.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • Thanks for answer.

    I am not concerning about values per se. Sure, the numbers are random, so values will be different each time I run the query.

    My concern is, that I should get each time exactly 3 rows (I am not voried about the content of rows). With CTE I get less or more rows, each time I run the query.

    With the real table, the same query returns exactly 3 rows.

    Why is the difference between CTE and real table?

    Thanks.

  • Interesting. Looks like you hit a similar problem to this:

    http://www.glorf.it/blog/2008/05/16/sql-talk/sql-server-is-not-aware-of-nondeterministic-functions

    In query #1, the non-deterministic function in the subquery is applied after the join (also note the "No Join Predicate" warning in the join operator.

    The non-deterministic subquery has already been materialised in query #2, therefore is guaranteed to be correct.

  • You're joining to an indeterminent value and at execution plan time, it can't determine what to join on and so it ignores the predicate. Look at the execution plan for the "Nested Loops (Inner Join)" symbol. Notice the little yellow alert symbol on it. If you right click on the symbol and bring up "Properties", you'll see a warning message that says "No Join Predicate".

    {Edit} Heh... beat out by 2 minutes. 🙂

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • EDIT: Please ignore this entire post. One dog, wrong tree, right forest.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • Jeff Moden (3/21/2013)


    You're joining to an indeterminent value and at execution plan time, it can't determine what to join on and so it ignores the predicate. Look at the execution plan for the "Nested Loops (Inner Join)" symbol. Notice the little yellow alert symbol on it. If you right click on the symbol and bring up "Properties", you'll see a warning message that says "No Join Predicate".

    {Edit} Heh... beat out by 2 minutes. 🙂

    Hey Jeff, I know that no join predicate basically turns it into a FULL OUTER JOIN, but why is the filter failing on its predicate upstream?

    EDIT:

    I want to clarify this question. To my understanding there's a few possibilities of how this plan actually results, but I'm not sure which is right and none of them show the logic that my conclusions reach.

    First, we go out to a table of 3 and 6 rows, and turn it into a FULL OUTER JOIN query, returning that dataset back to the Compute Scalar request. The 18 rows in the actual confirm this as a reasonable pathing.

    Now, the Compute Scalar occurs for each row (non-deterministic). So, one of two things happen. Either each row gets the exact same NEWID() (which isn't true) or they all get different NEWIDs. In either case, the result is the same... a value between 1 and 6.

    The predicate filter is... Oh god bless rubber ducky coding... of course.

    Because each of these rows are getting unique ABS(Checksum results, and they are then compared INTERNALLY to the row, instead of between the rowsets... *headsmack*... of course.

    Answer: The Join is evaluated long after any join is involved, and the newID result is checked on each row uniquely. Whatever comes out of there is just luck that they matched up.


    - Craig Farrell

    Never stop learning, even if it hurts. Ego bruises are practically mandatory as you learn unless you've never risked enough to make a mistake.

    For better assistance in answering your questions[/url] | Forum Netiquette
    For index/tuning help, follow these directions.[/url] |Tally Tables[/url]

    Twitter: @AnyWayDBA

  • Thanks for help.

  • It's down to an optimisation introduced in SQL Server2005 often called "Deferred expression evaluation". Here's a great article by Paul White to add to others already posted.

    The original "failing" query can be modified so that the calculation is written as a row-level CROSS APPLY:

    WITH bbb AS (

    SELECT TOP 3 x.RandomRow

    FROM AAA

    CROSS APPLY (SELECT RandomRow = ABS(checksum(newid())) % 6 + 1) x

    ORDER BY RowNumber)

    SELECT a.Col1, bbb.RandomRow

    FROM AAA AS a

    JOIN bbb

    ON a.RowNumber = bbb.RandomRow;

    The execution plan is identical to Query No 1:

    Changing the CROSS APPLY to an OUTER APPLY

    WITH bbb AS (

    SELECT TOP 3 x.RandomRow

    FROM AAA

    OUTER APPLY (SELECT RandomRow = ABS(checksum(newid())) % 6 + 1) x

    ORDER BY RowNumber)

    SELECT a.Col1, bbb.RandomRow

    FROM AAA AS a

    JOIN bbb

    ON a.RowNumber = bbb.RandomRow;

    is sufficient to prevent the optimiser from implementing deferred expression evaluation and it generates a completely different plan where RandomRow is evaluated prior to the join between the CTE and the table:

    With RandomRow evaluated prior to the join, it's now available to use as a join predicate and the query returns the expected results.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Evil Kraig F (3/21/2013)


    Jeff Moden (3/21/2013)


    You're joining to an indeterminent value and at execution plan time, it can't determine what to join on and so it ignores the predicate. Look at the execution plan for the "Nested Loops (Inner Join)" symbol. Notice the little yellow alert symbol on it. If you right click on the symbol and bring up "Properties", you'll see a warning message that says "No Join Predicate".

    {Edit} Heh... beat out by 2 minutes. 🙂

    Hey Jeff, I know that no join predicate basically turns it into a FULL OUTER JOIN, but why is the filter failing on its predicate upstream?

    EDIT:

    I want to clarify this question. To my understanding there's a few possibilities of how this plan actually results, but I'm not sure which is right and none of them show the logic that my conclusions reach.

    First, we go out to a table of 3 and 6 rows, and turn it into a FULL OUTER JOIN query, returning that dataset back to the Compute Scalar request. The 18 rows in the actual confirm this as a reasonable pathing.

    Now, the Compute Scalar occurs for each row (non-deterministic). So, one of two things happen. Either each row gets the exact same NEWID() (which isn't true) or they all get different NEWIDs. In either case, the result is the same... a value between 1 and 6.

    The predicate filter is... Oh god bless rubber ducky coding... of course.

    Because each of these rows are getting unique ABS(Checksum results, and they are then compared INTERNALLY to the row, instead of between the rowsets... *headsmack*... of course.

    Answer: The Join is evaluated long after any join is involved, and the newID result is checked on each row uniquely. Whatever comes out of there is just luck that they matched up.

    Correct. The only difference is that a lack of a predicate doesn't make a FULL OUTER JOIN. It makes a CROSS JOIN.

    Heh... that's one of my favoritie interview questions, BTW. 😛

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply