Unions and Distincts - theoretical question

  • Hi.

    Consider the 4 scenarios below:

    1:

    SELECT a

    FROM t1

    UNION ALL

    SELECT b

    FROM t2

    2:

    SELECT distinct a

    FROM t1

    UNION ALL

    SELECT distinct b

    FROM t2

    3:

    SELECT a

    FROM t1

    UNION

    SELECT b

    FROM t2

    4:

    SELECT distinct a

    FROM t1

    UNION

    SELECT distinct b

    FROM t2

    I have variously used, or can at least think of, a use-case for each of the scenarios 1-3, but to the best of my knowledge, scenario 4 will never be functionally different from (but will be more expensive than) scenario 3, regardless of the actual complexity of the queries.

    Or, to put it another way, there is never a situation when you are using UNION rather than UNION ALL, in which there is ever any point in specifying DISTINCT in any of the component query parts of the full union statement? it is always just extra overhead.

    There are 3 reasons I'm curious and asking this question: (1) I think I recalled a circumstance in which I did use scenario 4 but I either can't remember or have mis-remembered, (2) the optimiser does NOT optimise the distincts out of the execution plan (ie scenario 4 is typically more expensive than scenario 3 which gives me pause to wonder if there could be a situation in which it matters, and (3) Conversely, I'm actually arguing with a colleague that he should remove the extraneous DISTINCT statements from his query!

    Any thoughts very much appreciated.

  • 3 and 4 should be identical, both in cost and results. While the optimiser doesn't optimise the distincts out, it appears to use them to make the final union more efficient.

    I did some really, really rough tests and the two had the same costs and the same CPU time in the limited examples that I checked.

    As for your colleague, take his code and look at the execution stats (duration and CPU time) with and without the distincts. If there's a significant difference then you have grounds to request him remove them. If there aren't, well...

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (3/25/2015)


    3 and 4 should be identical, both in cost and results. While the optimiser doesn't optimise the distincts out, it appears to use them to make the final union more efficient.

    I did some really, really rough tests and the two had the same costs and the same CPU time in the limited examples that I checked.

    As for your colleague, take his code and look at the execution stats (duration and CPU time) with and without the distincts. If there's a significant difference then you have grounds to request him remove them. If there aren't, well...

    Cheers Gila. Curiously, I did some really rough and ready tests prior to posting and got the opposite (SQL 2008R2 - maybe this makes a difference?):

    CREATE TABLE #1 (a int)

    CREATE TABLE #2 (b int)

    INSERT #1 (a) VALUES (1),(2),(2),(3),(3),(3)

    INSERT #2 (b) VALUES (1),(4),(4),(4),(4)

    SELECT a FROM #1

    UNION

    SELECT b FROM #2

    SELECT distincta FROM #1

    UNION

    SELECT distinctb FROM #2

    https://www.dropbox.com/s/sofjsytdabtdzce/Exec1.png (I can't get dropboxs pics to render here so here's a link instead :() What this shows is a cost split of 66:33 in favour of removing the distincts

    Similarly, In my colleague's query, the same is true - in terms of stats (IO, CPU), simpler execution plan, and simple eyeballing, the version with the distinct statements removed is faster and more efficient, so I have already convinced myself that there is a performance gain to be had

    The question here is, whether anyone can think of any edge cases in which the final result set might be different with the distinct keyword included vs excluded, considering far more complex queries but basically remaining in the form <query> UNION <query> (... UNION <query> etc.)?

  • Zagyg (3/25/2015)


    Cheers Gila. Curiously, I did some really rough and ready tests prior to posting and got the opposite (SQL 2008R2 - maybe this makes a difference?):

    Or indexes. Or data volume. Or time of day. Or direction the pigeons are flying. Or ....

    The question here is, whether anyone can think of any edge cases in which the final result set might be different with the distinct keyword included vs excluded

    Can't think of any offhand

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Yeah, the pigeons in my particular group of servers are unpredictable little buggers.

Viewing 5 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply