Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 

Distincting an IN subquery

This is going to be a quick one…

I keep seeing forum code (and production code) that includes the DISTINCT in IN or EXISTS subqueries. The rationale is given either as a performance enhancement or as necessary for correct results.

Is it necessary or useful? Only one way to find out.

Let’s check for correct results first, because that can be done with nice small tables.

CREATE TABLE DistinctOuter (
ID INT
);

CREATE TABLE DistinctInner (
ID INT
);

INSERT INTO DistinctOuter
VALUES (1), (2), (3), (4), (5), (6), (7), (8)

INSERT INTO DistinctInner
VALUES (1), (2), (2), (2), (2), (4), (6), (7)

DistinctIN

No difference there, results are the same. I’m not going to run the test for EXISTS because, if anyone remembers how EXISTS works (or remembers a blog post I wrote a while back), EXISTS doesn’t depend on what’s in the SELECT clause at all, it just looks for existence of rows, and DISTINCT cannot remove unique rows, just duplicates.

A look at the execution plan shows why there are no duplicate values returned in the first query (the one without DISTINCT).

DistinctIN2

That’s a semi-join there, not a complete join. A semi-join is a join that just checks for matches but doesn’t return rows from the second table. Since it’s just a check for existence, duplicate rows in the inner table are not going to make any difference to the results.

So that answers the correctness aspect, distinct is not necessary to get correct results. But does it improve performance by having it there? Or does it perhaps reduce the performance? Time for larger tables.

Stolen from my last look at EXISTS and IN:

CREATE TABLE PrimaryTable_Large (
id INT IDENTITY PRIMARY KEY,
SomeColumn char(4) NOT NULL,
Filler CHAR(100)
);

CREATE TABLE SecondaryTable_Large (
id INT IDENTITY PRIMARY KEY,
LookupColumn char(4) NOT NULL,
SomeArbDate Datetime default getdate()
);
GO

INSERT INTO PrimaryTable_Large (SomeColumn)
SELECT top 1000000
char(65+FLOOR(RAND(a.column_id *5645 + b.object_id)*10)) + char(65+FLOOR(RAND(b.column_id *3784 + b.object_id)*12)) +
char(65+FLOOR(RAND(b.column_id *6841 + a.object_id)*12)) + char(65+FLOOR(RAND(a.column_id *7544 + b.object_id)*8))
from msdb.sys.columns a cross join msdb.sys.columns b;

INSERT INTO SecondaryTable_Large (LookupColumn)
SELECT SomeColumn
FROM PrimaryTable_Large TABLESAMPLE (25 PERCENT);

Some row counts first.

  • Total rows in PrimaryTable_Large: 1000000
  • Total rows in SecondaryTable_Large: 256335
  • Total distinct values in LookupColumn in SecondaryTable_Large: 10827

First test is without indexes on the lookup columns:

SELECT ID, SomeColumn FROM PrimaryTable_Large
WHERE SomeColumn IN (SELECT LookupColumn FROM SecondaryTable_Large)

SELECT ID, SomeColumn FROM PrimaryTable_Large
WHERE SomeColumn IN (SELECT DISTINCT LookupColumn FROM SecondaryTable_Large)

The reads are identical, which shouldn’t be a surprise as there’s no way with the current tables to run those queries without doing a full table scan.

Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0.
Table ‘PrimaryTable_Large’. Scan count 1, logical reads 14548, physical reads 0.
Table ‘SecondaryTable_Large’. Scan count 1, logical reads 798, physical reads 0.

For durations and CPU, I’m going to run each 10 times and aggregate the results from the profiler T-SQL:BatchCompleted event. The results are just about identical.

  • IN without DISTINCT: CPU 1.21 seconds, duration 12.2 seconds
  • IN with DISTINCT: CPU 1.25 seconds, duration 11.9 seconds

Furthermore, the execution plans are identical. Something interesting to notice in this case is that the join is not a semi-join, it’s a complete join and to ensure that the complete join doesn’t return duplicate rows (which would be incorrect), there’s a hash match (aggregate) right before the join that’s removing duplicate rows from the inner resultset, and that’s present in both execution plans, when the distinct is specified and when it’s not.

LargeINNoIndexes

One last question to answer – does the presence of indexes change anything?

CREATE INDEX idx_Primary
ON dbo.PrimaryTable_Large (SomeColumn)

CREATE INDEX idx_Secondary
ON dbo.SecondaryTable_Large (LookupColumn)

The execution plan has changed, in operators if not in general form. The hash join is replaced by a merge join (still a complete join, not a semi-join), the hash match (aggregate) has been replaced by a stream aggregate and the clustered index scans are now (nonclustered) index scans

LargeINIndexes

The reads are still identical between the two, which should be no surprise at all. As for the durations:

  • IN without DISTINCT: CPU 0.82 seconds, duration 10.1 seconds
  • IN with DISTINCT: CPU 0.79 seconds, duration 10.5 seconds

Again so close that the small difference should be ignored.

So in conclusion, is there any need or use for DISTINCT in the subquery for an IN predicate? By all appearances, none whatsoever. The SQL query optimiser is smart enough to ignore the specified DISTINCT if it’s not necessary (as we saw in the first example) and to add an operator to remove duplicates if it is necessary (as we saw in the 2nd and 3rd examples), regardless of whether or not there’s a DISTINCT specified in the query.

Comments

Posted by ThomasLL on 19 January 2011

Another great post, thanks Gail.

Thomas

Posted by Jason Strate on 19 January 2011

Good information.  Thanks, Gail.

Posted by GAckerman on 26 January 2011

Our company is blocking the view of your images. Can you host them on SQLServerCentral?

Posted by David Bird on 26 January 2011

Thanks for the info and the scripts to test your conclusions on different releases of SQL Server.

Posted by ddriver on 26 January 2011

Shouldn't these types of queries be done with the EXISTS statement rather than a correleted subquery?

Posted by JRP on 26 January 2011

I did not see the same results on Indexed data when running on my 8xCPU 8GB test server using RAID+0.

I wonder if the system hardware has an affect on this  difference?

Could it be we are timing the results differantly? How are you getting your times?

Posted by GilaMonster on 27 January 2011

ddriver: Exists and IN perform the same, so either works. EXISTS subqueries are correlated, none of the IN subqueries I used here were.

jparker: STATISTICS TIME and SQL Profiler. CPU or memory might affect, check the execution plans. Disk won't. Make sure you run more than once and ignore first run due to time needed to compile and cache data.

Leave a Comment

Please register or log in to leave a comment.