Unions and Distincts - theoretical question

Question

Unions and Distincts - theoretical question

Zagyg

SSCrazy

Points: 2091
More actions
March 25, 2015 at 5:58 am

#99080

Hi.
Consider the 4 scenarios below:
1:
SELECT a
FROM t1
UNION ALL
SELECT b
FROM t2
2:
SELECT distinct a
FROM t1
UNION ALL
SELECT distinct b
FROM t2
3:
SELECT a
FROM t1
UNION
SELECT b
FROM t2
4:
SELECT distinct a
FROM t1
UNION
SELECT distinct b
FROM t2
I have variously used, or can at least think of, a use-case for each of the scenarios 1-3, but to the best of my knowledge, scenario 4 will never be functionally different from (but will be more expensive than) scenario 3, regardless of the actual complexity of the queries.
Or, to put it another way, there is never a situation when you are using UNION rather than UNION ALL, in which there is ever any point in specifying DISTINCT in any of the component query parts of the full union statement? it is always just extra overhead.
There are 3 reasons I'm curious and asking this question: (1) I think I recalled a circumstance in which I did use scenario 4 but I either can't remember or have mis-remembered, (2) the optimiser does NOT optimise the distincts out of the execution plan (ie scenario 4 is typically more expensive than scenario 3 which gives me pause to wonder if there could be a situation in which it matters, and (3) Conversely, I'm actually arguing with a colleague that he should remove the extraneous DISTINCT statements from his query!
Any thoughts very much appreciated.

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 1

3 and 4 should be identical, both in cost and results. While the optimiser doesn't optimise the distincts out, it appears to use them to make the final union more efficient.

I did some really, really rough tests and the two had the same costs and the same CPU time in the limited examples that I checked.

As for your colleague, take his code and look at the execution stats (duration and CPU time) with and without the distincts. If there's a significant difference then you have grounds to request him remove them. If there aren't, well...

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Zagyg SSCrazy Points: 2091 More actions · Answer 2

GilaMonster (3/25/2015)
3 and 4 should be identical, both in cost and results. While the optimiser doesn't optimise the distincts out, it appears to use them to make the final union more efficient.
I did some really, really rough tests and the two had the same costs and the same CPU time in the limited examples that I checked.
As for your colleague, take his code and look at the execution stats (duration and CPU time) with and without the distincts. If there's a significant difference then you have grounds to request him remove them. If there aren't, well...

Cheers Gila. Curiously, I did some really rough and ready tests prior to posting and got the opposite (SQL 2008R2 - maybe this makes a difference?):

CREATE TABLE #1 (a int)

CREATE TABLE #2 (b int)

INSERT #1 (a) VALUES (1),(2),(2),(3),(3),(3)

INSERT #2 (b) VALUES (1),(4),(4),(4),(4)

SELECT a FROM #1

UNION

SELECT b FROM #2

SELECT distincta FROM #1

UNION

SELECT distinctb FROM #2

https://www.dropbox.com/s/sofjsytdabtdzce/Exec1.png (I can't get dropboxs pics to render here so here's a link instead :() What this shows is a cost split of 66:33 in favour of removing the distincts

Similarly, In my colleague's query, the same is true - in terms of stats (IO, CPU), simpler execution plan, and simple eyeballing, the version with the distinct statements removed is faster and more efficient, so I have already convinced myself that there is a performance gain to be had

The question here is, whether anyone can think of any edge cases in which the final result set might be different with the distinct keyword included vs excluded, considering far more complex queries but basically remaining in the form <query> UNION <query> (... UNION <query> etc.)?

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 3

Zagyg (3/25/2015)
Cheers Gila. Curiously, I did some really rough and ready tests prior to posting and got the opposite (SQL 2008R2 - maybe this makes a difference?):

Or indexes. Or data volume. Or time of day. Or direction the pigeons are flying. Or ....

The question here is, whether anyone can think of any edge cases in which the final result set might be different with the distinct keyword included vs excluded

Can't think of any offhand

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Zagyg SSCrazy Points: 2091 More actions · Answer 4

Yeah, the pigeons in my particular group of servers are unpredictable little buggers.