• Ron Cicotte (2/27/2015)


    Another (more efficient) way to solve this problem without using a LEFT OUTER JOIN or even WHERE NOT EXISTS (which btw is even less efficient) is to use the power of indexes. This method, while similar to the NOT EXIST clause, explicitly collects the cartesian set of indexes of employees and courses and deletes those where courses have been completed, joining on the #CourseCompleted table. Since we are only collecting the indexes and not the related data, we avoid returning unnecessary data and the looping inherent in the NOT EXIST clause.

    Could you please elaborate on how the WHERE NOT EXISTS approach is less efficient and on what you mean by "looping inherent in the NOT EXIST clause"? I think this is a good example of "it depends" on your data. If there are potentially many matches on the right-side table (CourseCompleted), the WHERE NOT EXISTS approach can eliminate the rows from the left-side table as soon as it finds the first match. I haven't studied the execution plan, but I believe the LEFT OUTER JOIN approach retrieves all matches and then filters rows. If that's the case, WHERE NOT EXISTS is clearly more efficient when there could be many matches per row.
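
    To make sure we're talking about the same thing, here are the two patterns as I understand them, sketched against the article's Employee, Course, and #CourseCompleted tables (the key column names are my assumptions):

        -- LEFT OUTER JOIN: build every employee/course pair, fetch any matching
        -- completion rows, then keep only the pairs with no match
        SELECT e.EmployeeId, c.CourseId
        FROM Employee e
        CROSS JOIN Course c
        LEFT OUTER JOIN #CourseCompleted cc
            ON cc.EmployeeId = e.EmployeeId
            AND cc.CourseId = c.CourseId
        WHERE cc.EmployeeId IS NULL;

        -- WHERE NOT EXISTS: an anti-semi-join; a pair can be rejected as soon
        -- as the first matching completion row is found
        SELECT e.EmployeeId, c.CourseId
        FROM Employee e
        CROSS JOIN Course c
        WHERE NOT EXISTS (
            SELECT 1
            FROM #CourseCompleted cc
            WHERE cc.EmployeeId = e.EmployeeId
                AND cc.CourseId = c.CourseId
        );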

    Also not sure what you mean by "use the power of indexes". Doesn't the query optimizer already do that whenever it can?
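
    For reference, here is roughly what I understand the script approach to be doing, based on your description (again, just a sketch with assumed table and column names):

        -- "Collect the keys, then delete the completed ones", as described
        CREATE TABLE #Missing (EmployeeId int, CourseId int);

        -- Collect the cartesian set of employee/course keys
        INSERT INTO #Missing (EmployeeId, CourseId)
        SELECT e.EmployeeId, c.CourseId
        FROM Employee e
        CROSS JOIN Course c;

        -- Delete the pairs that have been completed
        DELETE m
        FROM #Missing m
        JOIN #CourseCompleted cc
            ON cc.EmployeeId = m.EmployeeId
            AND cc.CourseId = m.CourseId;

        -- What's left are the outstanding employee/course combinations
        SELECT EmployeeId, CourseId FROM #Missing;

    If that's close, the DELETE still has to do the same matching work as the joins above, just with the extra cost of materializing every key pair in tempdb first.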

    It's a great idea to performance test queries whenever you're unsure, but I don't think the test you suggested is comparing apples to apples. The first two queries are sorting the result set by Employee.EmployeeId and the script is not.

    Also, as you pointed out, most of the elapsed time is spent returning the records, so it's harder to see the performance differences between the actual queries. For that reason, many benchmark tests will either just select a count or will select data into a table variable instead of returning it to the screen. Because other processes can compete for resources even on a "quiet" server, many benchmark tests will report CPU time as well as elapsed time to show how much CPU is actually used. (Don't forget, with parallel processing CPU time can exceed elapsed time.) There are also times when disk IO is the bottleneck, so it is also often reported.
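
    For example, a typical harness looks something like this (just a sketch against an assumed Employee table):

        SET STATISTICS TIME ON;  -- reports CPU time and elapsed time per statement

        -- Capture the rows in a table variable so the cost of returning
        -- them to the client doesn't swamp the measurement
        DECLARE @Results TABLE (EmployeeId int);

        INSERT INTO @Results (EmployeeId)
        SELECT EmployeeId
        FROM Employee;

        SET STATISTICS TIME OFF;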

    Out of curiosity I modified the test so it's comparing apples to apples and so the query differences have a greater overall impact on the results. I replaced the column selects with COUNT(*) and removed the ORDER BY clauses.

    I then ran SET STATISTICS TIME ON (ignoring IO for now) and repeated the test.
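
    So each test ended up looking roughly like this (shown here for the NOT EXISTS variant, with the same assumed names as above):

        SET STATISTICS TIME ON;

        -- Count only, no ORDER BY, so the query itself dominates the timing
        SELECT COUNT(*)
        FROM Employee e
        CROSS JOIN Course c
        WHERE NOT EXISTS (
            SELECT 1
            FROM #CourseCompleted cc
            WHERE cc.EmployeeId = e.EmployeeId
                AND cc.CourseId = c.CourseId
        );

        SET STATISTICS TIME OFF;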

    The LEFT OUTER JOIN took about 5.5 seconds of CPU and about the same elapsed time,

    the WHERE NOT EXISTS took about 5 seconds of CPU and about the same elapsed time (hmmm... maybe it's not "even less efficient"),

    and the script took about 34 seconds of CPU and 17 seconds elapsed.

    Finally, I tested the EXCEPT query suggested by Bill Talada by making it a subquery and selecting a count from it. (I also removed the name columns so the EXCEPT was only comparing IDs. In "real life", you'd probably do that anyway and then join the EXCEPT subquery to the base table to pull only needed data columns).
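
    The reworked query looked something like this (column names assumed, as before):

        -- EXCEPT as a subquery: pairs that exist in the full cartesian set
        -- but not in the completed set, reduced to a count
        SELECT COUNT(*)
        FROM (
            SELECT e.EmployeeId, c.CourseId
            FROM Employee e
            CROSS JOIN Course c

            EXCEPT

            SELECT cc.EmployeeId, cc.CourseId
            FROM #CourseCompleted cc
        ) AS missing;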

    It took about 16 seconds of CPU and only 4 seconds elapsed! A clear winner in terms of elapsed time, but at a higher CPU cost.

    All of my tests were run using Microsoft SQL Server Developer (64-bit) Version 11.0.5058.0 on a 3.2GHz Windows Enterprise workstation with 12 GB of RAM. The record count was 5,517,383.

    As always, YMMV, but to summarize my results:

    The EXCEPT technique ran the fastest,

    NOT EXISTS was 20% slower but used less CPU, and

    the script took the longest and used the most CPU.

    Good, objective testing is important, because

    Ron Cicotte (2/27/2015)


    In today's world, data sets are becoming very large very fast, and it is important to be conscious of inefficient queries from the very beginning.