IN vs INNER JOIN

Often in forum threads discussing query performance I’ll see people recommending replacing an INNER JOIN with an IN (or recommending replacing an IN with an INNER JOIN) for performance reasons. Hence it seems to be a good idea to investigate and see what the performance differences (if any) really are.

One very important thing to note right off is that they are not equivalent in all cases.

An inner join between two tables does a complete join, it checks for matches and returns rows. This means, if there are multiple matching rows in the second table, multiple rows will be returned. Also, when two tables are joined, columns can be returned from either. As a quick example (definition of BigTable towards the end of the post)

DECLARE @SomeTable (IntCol int)
Insert into @SomeTable (IntCol) Values (1)
Insert into @SomeTable (IntCol) Values (2)
Insert into @SomeTable (IntCol) Values (2)
Insert into @SomeTable (IntCol) Values (3)
Insert into @SomeTable (IntCol) Values (4)
Insert into @SomeTable (IntCol) Values (5)
Insert into @SomeTable (IntCol) Values (5)

SELECT *
 FROM BigTable b INNER JOIN @SomeTable  s ON b.SomeColumn IN s.IntCol

This returns 7 rows and returns columns from both tables. Because the values in @SomeTable are duplicated, the matching rows from BigTable are returned twice.

With an IN, what is done is a semi-join, a join that checks for matches but does not return rows. This means if there are multiple matching tables in the resultset used for the IN, it doesn’t matter. Only one row from the first table will be returned. Also, because the rows are not returned, columns from the table referenced in the IN cannot be returned. As a quick example

DECLARE @SomeTable (IntCol int)
 Insert into @SomeTable (IntCol) Values (1)
 Insert into @SomeTable (IntCol) Values (2)
 Insert into @SomeTable (IntCol) Values (2)
 Insert into @SomeTable (IntCol) Values (3)
Insert into @SomeTable (IntCol) Values (4)
Insert into @SomeTable (IntCol) Values (5)
 Insert into @SomeTable (IntCol) Values (5)

SELECT *
 FROM BigTable
 WHERE SomeColumn IN (Select IntCol FROM @SomeTable)

This returns 5 rows and only columns from BigTable.

So, that said, how does the performance of the two differ for the cases where the results are identical (no duplicates in the second table, no columns needed from the second table)? For that, I’m going to need larger tables to play with.

DROP TABLE dbo.BigTable
DROP TABLE dbo.SmallerTable

CREATE TABLE BigTable (
id INT IDENTITY PRIMARY KEY,
SomeColumn CHAR(4),
Filler CHAR(100)
)

CREATE TABLE SmallerTable (
id INT IDENTITY PRIMARY KEY,
LookupColumn CHAR(4),
SomeArbDate DATETIME DEFAULT GETDATE()
)

INSERT INTO BigTable (SomeColumn)
SELECT top 250000 char(65+FLOOR(RAND(a.column_id *5645 + b.object_id)*10)) 
 + char(65+FLOOR(RAND(b.column_id *3784 + b.object_id)*12)) 
 + char(65+FLOOR(RAND(b.column_id *6841 + a.object_id)*12)) 
 + char(65+FLOOR(RAND(a.column_id *7544 + b.object_id)*8))
FROM master.sys.columns a CROSS JOIN master.sys.columns b

INSERT INTO SmallerTable (LookupColumn)
SELECT DISTINCT SomeColumn
FROM BigTable TABLESAMPLE (25 PERCENT)
-- (3819 row(s) affected)

That’s the setup done, now for the two test cases. Let’s first try without indexes and see how the INNER JOIN and IN compare. I’m selecting from just the first table to ensure that the two queries are logically identical. The DISTINCT used to populate the smaller table ensures that there are no duplicate rows in the smaller table.

SELECT BigTable.ID, SomeColumn
FROM BigTable
WHERE SomeColumn IN (SELECT LookupColumn FROM dbo.SmallerTable)

SELECT BigTable.ID, SomeColumn
FROM BigTable
INNER JOIN SmallerTable ON dbo.BigTable.SomeColumn = dbo.SmallerTable.LookupColumn

Something of interest straight away, the execution plans are almost identical. Not completely identical, but the only difference is that the hash join for the IN shows a Hash Match (Right Semi Join) and the hash join for the INNER JOIN shows a Hash Match (Inner Join)

The IOs are the same and the durations are extremely similar. Here’s the IO results and durations for five tests.

Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0.
Table ‘BigTable’. Scan count 1, logical reads 3639, physical reads 0.
Table ‘SmallerTable’. Scan count 1, logical reads 14, physical reads 0.
SQL Server Execution Times:
CPU time = 156 ms, elapsed time = 2502 ms.
CPU time = 157 ms, elapsed time = 2323 ms.
CPU time = 156 ms, elapsed time = 2555 ms.
CPU time = 188 ms, elapsed time = 2381 ms.
CPU time = 203 ms, elapsed time = 2312 ms.

INNER JOIN

Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0.
Table ‘BigTable’. Scan count 1, logical reads 3639, physical reads 0.
Table ‘SmallerTable’. Scan count 1, logical reads 14, physical reads 0.
SQL Server Execution Times:
CPU time = 125 ms, elapsed time = 2922 ms.
CPU time = 140 ms, elapsed time = 2372 ms.
CPU time = 188 ms, elapsed time = 2530 ms.
CPU time = 203 ms, elapsed time = 2323 ms.
CPU time = 187 ms, elapsed time = 2512 ms.

Now let’s try with some indexes on the join columns.

CREATE INDEX idx_BigTable_SomeColumn ON BigTable (SomeColumn)
CREATE INDEX idx_SmallerTable_LookupColumn ON SmallerTable (LookupColumn)

Now when I run the two queries, the execution plans are different, and the costs of the two are no longer 50%. Both do a single index scan on each table, but the IN has a Merge Join (Inner Join) and the INNER JOIN has a Hash Match (Inner Join)

The IOs are still identical, other than the WorkTable that only appears for the Hash Join.

Table ‘BigTable’. Scan count 1, logical reads 3639, physical reads 0.
Table ‘SmallerTable’. Scan count 1, logical reads 14, physical reads 0.

INNER JOIN

Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0.
Table ‘BigTable’. Scan count 1, logical reads 3639, physical reads 0.
Table ‘SmallerTable’. Scan count 1, logical reads 14, physical reads 0.

So what about the durations? Honestly it’s hard to say anything completely conclusive, the durations of both queries are quite small and they are very close. To see if there is any measurable different, I’m going to run each one 100 times, use Profiler to log the duration and CPU and then average the results over the 100 executions. While running this, I’m also going to close/disable everything else I can on the computer, to try and get reasonably accurate times.

Average CPU: 130.

Avg duration: 2.78 seconds

INNER JOIN

Average CPU: 161.

Avg duration: 2.93 seconds

Now is that enough to be significant? I’m not sure. However, looking at those results along with the IO and execution plans, I do have a recommendation for In vs Inner Join

If all you need is to check for matching rows in the other table but don’t need any columns from that table, use IN. If you do need columns from the second table, use Inner Join.

I still intend to go over NOT IN and NOT EXISTS and, after this one, I also want to take a look at the LEFT JOIN with IS NULL check vs NOT IN for when you want rows from Table1 that don’t have a match in Table 2.

Rate

Share

Share

Rate

IN vs INNER JOIN

Rate

Share

Share

Rate

Related content

Book Review: Big Red - Voyage of a Trident Submarine

Database Mirroring FAQ: Can a 2008 SQL instance be used as the witness for a 2005 database mirroring setup?

Inserting Markup into a String with SQL

Networking - Part 4

Speaking at Community Events - More Thoughts