Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 

Left outer join vs NOT EXISTS

And to wrap up the miniseries on IN, EXISTS and JOIN, a look at NOT EXISTS and LEFT OUTER JOIN for finding non-matching rows.

For previous parts, see

I’m looking at NOT EXISTS and LEFT OUTER JOIN, as opposed to NOT IN and LEFT OUTER JOIN, because, as shown in the previous part of this series, NOT IN behaves badly in the presence of NULLs. Specifically, if there are any NULLs in the result set, NOT IN returns 0 matches.

The LEFT OUTER JOIN, like the NOT EXISTS can handle NULLs in the second result set without automatically returning no matches. It behaves the same regardless of whether the join columns are nullable or not. Seeing as NULL does not equal anything, any rows in the second result set that have NULL for the join column are eliminated by the join and have no further effect on the query.

It is important, when using the LEFT OUTER JOIN … IS NULL, it is important to carefully pick the column used for the IS NULL check. It should either be a non-nullable column (the primary key is a somewhat classical choice) or the join column (as nulls in that will be eliminated by the join)

Onto the tests

The usual test tables…

CREATE TABLE BigTable (
id INT IDENTITY PRIMARY KEY,
SomeColumn char(4) NOT NULL,
Filler CHAR(100)
)

CREATE TABLE SmallerTable (
id INT IDENTITY PRIMARY KEY,
LookupColumn char(4) NOT NULL,
SomeArbDate Datetime default getdate()
)

INSERT INTO BigTable (SomeColumn)
SELECT top 250000
char(65+FLOOR(RAND(a.column_id *5645 + b.object_id)*10)) +
char(65+FLOOR(RAND(b.column_id *3784 + b.object_id)*12)) +
char(65+FLOOR(RAND(b.column_id *6841 + a.object_id)*12)) +
char(65+FLOOR(RAND(a.column_id *7544 + b.object_id)*8))
from master.sys.columns a cross join master.sys.columns b

INSERT INTO SmallerTable (LookupColumn)
SELECT DISTINCT SomeColumn
FROM BigTable TABLESAMPLE (25 PERCENT)
-- (3918 row(s) affected)

First without indexes

-- Query 1
SELECT BigTable.ID, SomeColumn
FROM BigTable LEFT OUTER JOIN SmallerTable ON BigTable.SomeColumn = SmallerTable.LookupColumn
WHERE LookupColumn IS NULL

-- Query 2
SELECT ID, SomeColumn FROM BigTable
WHERE NOT EXISTS
(SELECT LookupColumn
FROM SmallerTable
WHERE SmallerTable.LookupColumn = BigTable.SomeColumn)

Let’s take a look at the execution plans

LeftOuterJoinNotIN_NotIndexed

The plans are almost the same. There’s an extra filter in the JOIN and the logical join types are different. Why the different joins?

If we look at the execution plan for the NOT EXISTS, the join type is Right Anti-Semi join (a bit of a mouthful). This is a special join type used by the NOT EXISTS and NOT IN and it’s the opposite of the semi-join that I discussed back when I looked at IN and INNER JOIN

An anti-semi join is a partial join. It does not actually join rows in from the second table, it simply checks for, in this case, the absence of matches. That’s why it’s an anti-semi join. A semi-join checks for matches, an anti-semi join does the opposite and checks for the absence of matches.

The extra filter in the LEFT OUTER JOIN query is because the join in that execution plan is a complete right join, i.e. it’s returned matching rows (and possibly duplicates) from the second table. The filter operator is doing the IS NULL filter.

That’s the major difference between these two. When using the LEFT OUTER JOIN … IS NULL technique, SQL can’t tell that you’re only doing a check for nonexistance. Optimiser’s not smart enough (yet). Hence it does the complete join and then filters. The NOT EXISTS filters as part of the join.

Technical discussion done, now how did they actually perform?

– Query 1: LEFT OUTER JOIN
Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0.
Table ‘BigTable’. Scan count 1, logical reads 3639, physical reads 0.
Table ‘SmallerTable’. Scan count 1, logical reads 15, physical reads 0.

SQL Server Execution Times:
CPU time = 157 ms,  elapsed time = 486 ms.

– Query 2: NOT EXISTS
Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0.
Table ‘BigTable’. Scan count 1, logical reads 3639, physical reads 0.
Table ‘SmallerTable’. Scan count 1, logical reads 15, physical reads 0.

SQL Server Execution Times:
CPU time = 156 ms,  elapsed time = 358 ms.

Can’t make a big deal out of that.

Now, index on the join columns

CREATE INDEX idx_BigTable_SomeColumn
ON BigTable (SomeColumn)

CREATE INDEX idx_SmallerTable_LookupColumn
ON SmallerTable (LookupColumn)

and the same queries

LeftOuterJoinNotIN_Indexed

With indexes added, the execution plans are even more different. The LEFT OUTER JOIN is still doing the complete outer join with a filter afterwards. It’s interesting to note that it’s still a hash join, even though both inputs are sorted in the order of the join keys.

The Not Exists now has a stream aggregate (because duplicate values are irrelevant for an EXISTS/NOT EXISTS) and an anti-semi join. The join here is no longer hash, it’s now a merge join.

This echoes what I found when looking at IN vs Inner join. When the columns were indexed, the inner join still went for a hash join but the IN changed to a merge join. At the time, I thought it to be a fluke, I’m not so sure any longer. More tests on this are required…

The costing of the plans indicates that the optimiser believes that the LEFT OUTER JOIN form is more expensive. Do the execution stats carry the same conclusion?

– Query 1: LEFT OUTER JOIN
Table ‘Worktable’. Scan count 0, logical reads 0, physical reads 0.
Table ‘BigTable’. Scan count 1, logical reads 342, physical reads 0.
Table ‘SmallerTable’. Scan count 1, logical reads 8, physical reads 0.

SQL Server Execution Times:
CPU time = 172 ms,  elapsed time = 686 ms.

– Query 2: NOT EXISTS
Table ‘BigTable’. Scan count 1, logical reads 342, physical reads 0.
Table ‘SmallerTable’. Scan count 1, logical reads 8, physical reads 0.

SQL Server Execution Times:
CPU time = 78 ms,  elapsed time = 388 ms.

Well, yes, they do.

The reads (ignoring the existence of the worktable for the hash join) are the same. That’s to be expected, both queries executed with a single scan of each index.

The CPU time figures are not. The CPU time of the LEFT OUTER JOIN form is almost twice that of the NOT EXISTS.

In conclusion…

If you need to find rows that don’t have a match in a second table, and the columns are nullable, use NOT EXISTS. If you need to find rows that don’t have a match in a second table, and the columns are not nullable, use NOT EXISTS or NOT IN.

The LEFT OUTER JOIN … IS NULL method is slower when the columns are indexed and it’s perhaps not as clear what’s happening. It’s reasonably clear what a NOT EXISTS predicate does, with LEFT OUTER JOIN it’s not immediately clear that it’s a check for non-matching rows, especially if there are several where clause predicates.

I think that’s about that for this series. I’m going to do one more post summarising all the findings, probably in a week or two.

Comments

Posted by p456 on 25 March 2010

I prefer to go for semantics rather than performance, which might change anyway:

If I need a column from the table in the result set, I use LEFT JOIN; if I don't, I use EXISTS. It makes it much more clear what the query is doing.

Posted by tuckerc1 on 25 March 2010

If the table has multiple primary keys, for which I want to see if they already exists I have found no other way than to do the left join. I'm open for suggestions.

Posted by Tom Garth on 25 March 2010

Nicely written, and yeah, what pbrazier said.

Posted by stephen.booth on 25 March 2010

@pbrazier - Performance be damned!

BTW, semantics can change as well and often does.

Posted by Andy Lennon on 25 March 2010

i've never really liked the exists/not exists syntax; it doesn't seem right to be able to reference a table that's not in the from statement.

I guess i'll have to reconsinder my approach if there's such a consistent and obvious performance difference.

Posted by Rob on 25 March 2010

Interesting conclusion -- before reading this, I would've thought LEFT OUTER JOIN would be outperform NOT EXISTS. Guess I should go read the previous articles. :-)

Posted by Dan on 25 March 2010

I always design for performance.  Great post.  

Posted by stevoid1970 on 26 March 2010

In your final conclusion you write:  "If A (are nullable) then use NOT EXISTS, If B (not nullable) then use NOT EXISTS. " So in other words, always use NOT EXISTS.  Is that what you meant?

Posted by dbajonm on 26 March 2010

Great investigation!  I was looking at an issue two years ago where huge volumes of data were being loaded to a primary "log" table every hour.  The LEFT JOIN was starting to take up the full hour and we had to change the query.  I'd always considered the NOT EXISTS as slower; but it cut the time almost in half.  I can't remember the exact percentage, but the time outs were no longer occurring.  I've enjoyed the entire series of articles.  Wonderful research!

Posted by nil-364356 on 26 March 2010

Awesome article. Thanks for posting.

Posted by GilaMonster on 28 March 2010

pbrazier: If you're doing a check for non existence, you're probably not referencing a column in the second table anyway, as it will always be null when there's no matching record.

Tuckerc1: Use Exists. You can join on multiple columns in the where clause of that (which you cannot do with IN)

Andy: But you're not referencing the table anywhere other than within the subquery, same as if you were using subqueries elsewhere in the where clause or in the select clause

stevoid1970: No, I said If A (nullable), use Not Exists. If B (not nullable) use Not Exists or Not In (they perform identically when the columns are not nullable)

Posted by Feng Zheng on 7 April 2010

You said that Not Exists is always faster than the version of Left Join with the specified conditions shown in your example. However, this is inaccurate since I believe that it is the data volume and distribution, and other stuffs such as different indexes that eventually decide which is faster.

Example:

The author’s example is copied and rewritten as following

SmallerTable has 5997 rows in my test and indexes are created as same as author did

-- Query 1

SELECT count(*)

FROM BigTable LEFT OUTER JOIN SmallerTable ON BigTable.SomeColumn = SmallerTable.LookupColumn      

WHERE LookupColumn IS NULL

GO

-- Query 2

SELECT count(*)

FROM BigTable WHERE

NOT EXISTS   (SELECT LookupColumn     FROM SmallerTable     WHERE SmallerTable.LookupColumn = BigTable.SomeColumn)

As author said and I confirmed, the query 2 is much faster than query 1. This is true.

However, in the following query, I switch the table position, #1 is hundreds of times faster than # 2:

-- Query 1

SELECT count(*)

FROM SmallerTable  LEFT OUTER JOIN BigTable ON BigTable.SomeColumn = SmallerTable.LookupColumn    

WHERE LookupColumn IS NULL

GO

-- Query 2

SELECT count(*)

FROM SmallerTable WHERE

NOT EXISTS   (SELECT LookupColumn     FROM    BigTable  WHERE SmallerTable.LookupColumn = BigTable.SomeColumn)

GO

Any question, email me at zhengfengchina@yahoo.com

Posted by GilaMonster on 15 April 2010

I have one comment.

You've made a fundamental mistake in that code. LookupColumn is in SmallerTable, not Bigtable. In your revised Query 1, it can never be null, because it's in the first table (not the one that's getting left joined in) and it's non-nullable.

The left outer join is fast because, if you look at the execution plan, it's a simple constant scan. You're asking for rows where a non-nullable column is null. The optimiser knows that can never happen. Check your execution plans.

btw, you've done the same in the second, though it's not significant in EXISTS. The column LookupColumn does not exist in BigTable. It's a common mistake people make with IN/EXISTS. Columns from the outer query can be referred to in a subquery without error.

Your revised two queries are not equivalent. It won't show in the results, because SmallerTable is a subset of BigTable, hence both return 0, but add a row to SmallerTable that's doesn't have a match in BigTable (like ZZZZ) and your revised Query 1 returns 0, while Query 2 returns 1.

If you correct the column references so that the two queries are equivalent, the left outer join runs (on my machine) in around 120ms and the not exists in around 80ms

Corrected queries follow:

SELECT count(*)

FROM SmallerTable  

LEFT OUTER JOIN BigTable ON BigTable.SomeColumn = SmallerTable.LookupColumn    

WHERE SomeColumn IS NULL

GO

-- Query 2

SELECT count(*)

FROM SmallerTable

WHERE NOT EXISTS

 (SELECT SomeColumn FROM    

BigTable  WHERE SmallerTable.LookupColumn = BigTable.SomeColumn)

Posted by dineshrajan on 23 May 2010

Good Article.

Leave a Comment

Please register or log in to leave a comment.