UNION

  • steve.jacobs (11/4/2013)


    twin.devil (11/4/2013)


    That produces the following plan featuring a Merge Join (my answer!) running in Union mode:

    Its a boobby trap ... 🙂

    twin.devil, you stated it perfectly, it is a trap question.

    No, it's not. It is simply incorrect. As can be seen from some of the repro scripts in this topic, multiple answers are correct. The "correct" answer is definitely incorrect - at least two of the above can be correct, so it can never be "none of the above".

    Sure, I could have put "It Depends" as one of the answers, but looking at the query, setting up two simple tables and then running an execution plan against it, proves there is no join involved.

    The only thing that proves is that there was no join involved in that particular test scenario. Declare the tables differently, populate them differently, execute them on other hardware ... all that can change the answer. There is no way others could have arrived at the same answer, other than by sheer luck. I am not good at reading minds.

    I guess that, based on the feedback you have gotten, you now understand that the optimzer has an enormous amount of freedom to choose between possible execution plans, and that (other than with hints), it is almost impossible to predict a query plan with certainty. Others reading the discussion will probably understand that as well. Unfortunately, a lot of people only check the question and the answer and never go to the discussion topic - so I am a bit disappointed that Steve has not yet edited the question and explanation to ensure that those people don't get misinformed.

    Ford Fairlane (11/4/2013)


    Why should one have to over complicate everything - the question was given was enough information to come up with the correct answer.

    I'm really surprised by that comment. The execution plan depends on so many assumptions. The assumptions needed to get to a different answer are not far-fetched.

    * If the city column in the first contact table is indexed, the plan will almost certainly use merge join.

    * If the city column in the first contact table is not indexed and the tables contain a sizeable amount of data, a hash join is much more likely than a concatenation + distinct sort combination.

    * The plan described in the answer is most likely to be generated when the tables contain a fairly low amount of rows.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • L' Eomot Inversé (11/4/2013)


    the explanation is wrong even when there is no join. It is unlikely that there will be a "unique sort", because there is no need for one: the union can (and nearly always will) be achieved by a hash merge; there is of course no concatenation - the two bunches of values might be concatenated if UNION ALL were used but it would be strange (to say the lest) to do that for UNION without ALL.

    You are over-assuming, Tom. It is definitely possible to get concatenation and distinct sort. If the optimizer estiimates relatively low row-counts, then simply stuffing all cities after each other and then distinct sorting them could be the cheapest option.

    Example (works on my SQL 2012 system):

    CREATE TABLE Contacts

    (

    City nvarchar(25) NOT NULL

    )

    CREATE TABLE Contacts2

    (

    City nvarchar(25) NOT NULL

    )

    INSERT INTO Contacts (City)

    VALUES ('New York'), ('London');

    INSERT INTO Contacts (City)

    VALUES ('New York'), ('Paris'), ('New York');

    SELECT City

    FROM Contacts

    UNION

    SELECT City

    FROM Contacts2

    GROUP BY City

    ORDER BY 1;

    DROP TABLE dbo.Contacts, dbo.Contacts2;


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • Hugo Kornelis (11/9/2013)


    L' Eomot Inversé (11/4/2013)


    the explanation is wrong even when there is no join. It is unlikely that there will be a "unique sort", because there is no need for one: the union can (and nearly always will) be achieved by a hash merge; there is of course no concatenation - the two bunches of values might be concatenated if UNION ALL were used but it would be strange (to say the lest) to do that for UNION without ALL.

    You are over-assuming, Tom. It is definitely possible to get concatenation and distinct sort. If the optimizer estiimates relatively low row-counts, then simply stuffing all cities after each other and then distinct sorting them could be the cheapest option.

    Yes, you are right as usual, Hugo.

    I had a look at where the cross-over ought in theory to be (there certainly ought to be one, since the computational complexity of a merge union is less than that of any sort algorithm); it depends on the computational complexity of the sort algorith used, which is likely to be diferent for in-RAM sorts and sorts which don't fit into available RAM. Also, there are two complications with our particular query. The first one is that a final sort is needed anyway, since the query has an order by clause, and while a unique sort can double as final sort the merge needs that separate sort to follow it; whether that final sort is enough cheaper than the unique sort to outweigh the merge cost depends on how many rows are eliminated as duplicates by the merge. If that number is small compared to the total number of rows, it is unlikely that the merge approach will be as good as the unique sort approach. The second is that one of the two rowsets to be merged has values made unique by a group by clause, and that uniquifying may be done either by a merge or by a sort; if it's done by a sort, it's very probable that the fact that half the input is presorted will change the effective complexity of a following uniquifying sort or but maybe the optimiser will not take account of that because it's too complicated. I wish I knew a bit more bout the innards of the optimiser.

    Of course this may be all screwed up by the existence of indexes that the optimiser may decide are better to use than the tables themselves, which could change the picture sharply.

    All that is good in theory, and suggests that generally the cut-over should be at quite a small number of rows even in the absence of relevant indexes, assuming a reasonable proportion of rows in the ungrouped rowset occur also in the grouped rowset. But theory won't tell us what that small number is unless we have a lot of information about the cost estimates used by the optimiser; and in practise things may be different anyway.

    Tom

  • L' Eomot Inversé (11/9/2013)


    Of course this may be all screwed up by the existence of indexes that the optimiser may decide are better to use than the tables themselves, which could change the picture sharply.

    Indeed. Add an index with City as the leading column on the first table, and you will (almost) always get a merge sort.


    Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
    Visit my SQL Server blog: https://sqlserverfast.com/blog/
    SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

  • SADSAC (11/4/2013)


    Agree with Paul

    without table definitions, it was difficult to make an informed decision.

    +1

    If one of the columns you select is a clustered index key, then it could create a merge join (perhaps because the data is already sorted?). That's my experience, anyway. Good question to ponder either way.

  • Moral of the story: TEST! TEST! TEST!

  • Paul White (11/3/2013)


    The above query does not produce a join. It produces a distinct sort then concatenation of values from one table and the other.

    Well it might do, or it might do something else. Without table definitions and data, how could we know?

    CREATE TABLE Contacts

    (

    City nvarchar(25) NOT NULL

    )

    CREATE TABLE Contacts2

    (

    City nvarchar(25) NOT NULL

    )

    CREATE NONCLUSTERED INDEX i ON dbo.Contacts (City);

    CREATE NONCLUSTERED INDEX i2 ON dbo.Contacts (City);

    UPDATE STATISTICS dbo.Contacts WITH ROWCOUNT = 10000, PAGECOUNT = 1000;

    UPDATE STATISTICS dbo.Contacts2 WITH ROWCOUNT = 10000, PAGECOUNT = 1000;

    SELECT city

    FROM contacts

    UNION

    SELECT city

    FROM contacts2

    GROUP BY city

    ORDER BY 1;

    DROP TABLE dbo.Contacts, dbo.Contacts2;

    That produces the following plan featuring a Merge Join (my answer!) running in Union mode:

    I agree with you

    /HÃ¥kan Winther
    MCITP:Database Developer 2008
    MCTS: SQL Server 2008, Implementation and Maintenance
    MCSE: Data Platform

Viewing 7 posts - 31 through 36 (of 36 total)

You must be logged in to reply to this topic. Login to reply