• Very nicely presented article, but the content is not quite right in places.

    The set intersection is useful so long as you are clear that you are talking about sets of keys (i.e. unique join column values). If you don't make that distinction, or omit to go on to describe how one-to-many and many-to-many joins look, it seems misleading. An INNER JOIN is not the intersection of two sets (that would be INTERSECT). The example data only uses unique join key values in both tables, which could reinforce any misconception.
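
    A tiny sketch (with made-up tables and data) shows the difference as soon as duplicate join keys are involved:

        -- T1 holds {1, 1, 2}; T2 holds {1, 3}
        DECLARE @T1 TABLE (k int NOT NULL);
        DECLARE @T2 TABLE (k int NOT NULL);
        INSERT @T1 VALUES (1), (1), (2);
        INSERT @T2 VALUES (1), (3);

        -- INNER JOIN returns one row per matching PAIR: two rows of (1)
        SELECT T1.k
        FROM @T1 AS T1
        INNER JOIN @T2 AS T2 ON T2.k = T1.k;

        -- INTERSECT returns the distinct common values: one row of (1)
        SELECT k FROM @T1
        INTERSECT
        SELECT k FROM @T2;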

    I appreciate the value of simplification to introduce a concept, but the set-intersect visual might promote unclear understanding. The same issue applies to the coverage of LEFT, RIGHT, and FULL OUTER JOIN. There's no mention of how duplicates are treated, or how rows are NULL-extended when a match does not occur. Again, I feel the Venn diagram approach leads us astray here, unless it is very clearly pointed out that you are visualizing sets (unique by definition) of join keys.
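
    Reusing the same made-up tables, a single query makes both behaviours visible:

        DECLARE @T1 TABLE (k int NOT NULL);
        DECLARE @T2 TABLE (k int NOT NULL);
        INSERT @T1 VALUES (1), (1), (2);
        INSERT @T2 VALUES (1), (3);

        -- Every T1 row is preserved: the duplicated key 1 produces two
        -- output rows, and the unmatched key 2 is NULL-extended
        SELECT T1.k AS t1_k, T2.k AS t2_k
        FROM @T1 AS T1
        LEFT JOIN @T2 AS T2 ON T2.k = T1.k;
        -- Returns (1, 1), (1, 1), (2, NULL)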

    The article also describes the cross product of Table1 and Table2 as "...each record of Table1 will be duplicated for each record in Table2". This suggests that only Table1's rows will appear duplicated in the output, which is clearly not the case. It's not wrong exactly, but it is incomplete and imprecise.
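
    For the avoidance of doubt, a cross join of m rows against n rows returns m * n rows, with repetition on both sides (placeholder names):

        -- If Table1 has 3 rows and Table2 has 2, the result has
        -- 3 x 2 = 6 rows; rows from BOTH tables appear repeatedly
        SELECT T1.*, T2.*
        FROM Table1 AS T1
        CROSS JOIN Table2 AS T2;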

    The section 'Excluding the Intersection of Sets' recommends the least efficient method (LEFT JOIN). NOT EXISTS is generally preferred because it can stop searching as soon as it finds a match. The semantics are different, of course, because LEFT JOIN ... WHERE ... IS NULL can produce duplicates whereas NOT EXISTS will not. 'Excluding the intersection' has an air of set theory about it, but it isn't a very good description for returning rows from Table1 where a matching row does not exist in Table2.
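
    For reference, the two patterns look like this (Table1, Table2, and key1 are placeholders):

        -- Anti-join via outer join: the whole join is evaluated, then
        -- the NULL-extended rows are kept by testing a Table2 column
        SELECT T1.*
        FROM Table1 AS T1
        LEFT JOIN Table2 AS T2 ON T2.key1 = T1.key1
        WHERE T2.key1 IS NULL;

        -- Anti-semi-join: probing Table2 can stop at the first match
        SELECT T1.*
        FROM Table1 AS T1
        WHERE NOT EXISTS (SELECT * FROM Table2 AS T2 WHERE T2.key1 = T1.key1);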

    In the same section, you say: "When we do this kind of query, we have to pay attention to which field we pick for the WHERE clause. We must use a field that does not allow NULL values. Otherwise the result set may include unwanted records. That's why I suggested to use the second table's key. More specifically, its primary key. Since primary keys don't accept NULL values, they will assure our result set will be just what we needed." This is a common misconception, which you should have tested before publishing. There is no problem with performing an outer join on NULLable columns: NULLs never join (regardless of the ANSI_NULLS setting), so there is no possible confusion between a 'join matched' NULL and a NULL resulting from the NULL extension of a row by the outer join.
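
    A quick test (a sketch with throwaway data) confirms it:

        DECLARE @T1 TABLE (k int NULL);
        DECLARE @T2 TABLE (k int NULL);
        INSERT @T1 VALUES (1), (NULL);
        INSERT @T2 VALUES (1), (NULL);

        -- The NULL in @T2 matches nothing, so any NULL we see in t2_k
        -- can only come from NULL-extension
        SELECT T1.k AS t1_k, T2.k AS t2_k
        FROM @T1 AS T1
        LEFT JOIN @T2 AS T2 ON T2.k = T1.k;
        -- Returns (1, 1) and (NULL, NULL)

        -- So the anti-join is safe even on the NULLable join column
        SELECT T1.k
        FROM @T1 AS T1
        LEFT JOIN @T2 AS T2 ON T2.k = T1.k
        WHERE T2.k IS NULL;
        -- Returns only the unmatched row (k = NULL)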

    The section 'One Word About Execution Plans' starts with "These comments lead us to an important insight. We usually don't stop to think about this, but observe that the execution plan of SQL queries will first calculate the result set for the FROM clause and the JOIN operator (if any), and then the WHERE clause will be executed". This is logically true, but not physically. If it were, we would never see an index seek in a plan with joins! The SQL Server optimizer is free to reorder expressions, transform the plan, and push filters (predicates) down past joins if doing so results in a lower-cost plan. The only reason the filter in your LEFT OUTER JOIN example can't be pushed past the join is that it tests NULL-extended rows produced by the join.
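
    For example (placeholder names), the optimizer will normally transform the first query into a seek on Table1 before any join takes place, even though the WHERE clause logically applies last:

        SELECT T1.key1, T2.key1
        FROM Table1 AS T1
        INNER JOIN Table2 AS T2 ON T2.key1 = T1.key1
        WHERE T1.key1 = 42;  -- pushed below the join, given a suitable index

        -- By contrast, this predicate tests NULL-extended rows, so it
        -- can only be evaluated after the outer join has produced them
        SELECT T1.key1
        FROM Table1 AS T1
        LEFT JOIN Table2 AS T2 ON T2.key1 = T1.key1
        WHERE T2.key1 IS NULL;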

    In 'Joins and Indexes', you say: "On the other hand, Table1 had no index on field key2. Because of that, the query optimizer tried to be smart enough and improve the performance of querying key2 using the only available index." - This is nonsense. It isn't the optimizer trying to be smart - there is no other access method available aside from scanning the clustered index. There's no such thing as a table scan on a clustered table.

    The ability to perform anything other than an equi-join is not a 'cosmetic feature' - such joins are common where we need to produce a running total or a match on a range of dates, for example. The problem with non-equal (inequality) joins is NOT that they usually duplicate records - it is that performance tends to be worse because more rows will match.
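
    A typical date-range join (invented table and column names) looks like this:

        -- Match each order to the price in effect on its order date
        SELECT O.order_id, O.order_date, P.price
        FROM Orders AS O
        INNER JOIN PriceHistory AS P
            ON P.product_id = O.product_id
            AND O.order_date >= P.valid_from
            AND O.order_date <  P.valid_to;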

    Under 'Multiple Joins', you write: "Just remember that joins are written in pairs. So first we will join Table1 to Table2. And then we will join Table1 and Table3". The query optimizer chooses an initial join order based on table cardinality and other heuristics - it does not normally follow the written query order (though you can force that by including the FORCE ORDER hint, using SET FORCEPLAN ON, or by hinting a specific physical join operator e.g. INNER HASH JOIN). In the main, the optimizer picks an initial order, and then applies transformations which may result in further changes to the physical join order.
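
    If you really do want the written order, the hint looks like this (column names invented):

        SELECT T1.key1
        FROM Table1 AS T1
        INNER JOIN Table2 AS T2 ON T2.key1 = T1.key1
        INNER JOIN Table3 AS T3 ON T3.key2 = T1.key2
        OPTION (FORCE ORDER);  -- or SET FORCEPLAN ON at session level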

    Consider reading my optimizer series starting at http://sqlblog.com/blogs/paul_white/archive/2010/07/29/inside-the-optimiser-constructing-a-plan.aspx

    Paul