Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase «««678910»»

SQL & the JOIN Operator Expand / Collapse
Author
Message
Posted Friday, February 25, 2011 11:49 AM
SSC Veteran

SSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC Veteran

Group: General Forum Members
Last Login: Yesterday @ 7:29 AM
Points: 217, Visits: 476
Enjoyed the article and found the diagrams easy to follow.
Post #1069789
Posted Friday, February 25, 2011 12:25 PM
SSC Veteran

SSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC Veteran

Group: General Forum Members
Last Login: Monday, November 25, 2013 9:09 AM
Points: 216, Visits: 118
arun55 (10/8/2009)
vliet (10/8/2009)
Since the 'or' operator has lower precedence as the 'and' operator, when you ommit the parentheses the join condition means something completely different. Without parentheses the predicate behind the 'or' is evaluated separately, which leaves out the 'a.id = b.id' part completely, thus returning all records from 'a' for each and every record where 'b.date2' is empty. This might blow up the size of the result set a lot when a has many rows and b has many rows without a date in this column. Never ommit any parenthenses whenever you use both 'and' and 'or' in a predicate. You can never use too many parenthenses in these cases, computers don't know what you mean, they just do what you tell them to, but that's generaly not what you want.

I've seen predicates with both the 'and' and 'or' operator being the cause of 'human' errors in SQL so often, it's almost worth an article.



Thanks for the information. I understood what you have explained.
But in above case, b.date field has no nulls and I am getting same results.
I think it was a precautionary statement. I will check the requirement and try to add parentheses.


If Table B can contain a start but no end date (assumption given inferred use) this is important.

Your 40 minute query is returning information based on two specific possibilities:
1) a.id = b.id and a.date1 between b.date1 and b.date2
2) b.date2 is null

This would return the same data as the following more complex and less efficient query:
select a.*,b.*
from table a
left join table b
on a.id = b.id
and a.date1 between b.date1 and b.date2
union
select a.*,b.*
from table a, table b
where b.date2 is null

Looking at the second select this is a cross join. After creating this cross join it then goes through each record Count(A)*Count(B) and select those where date2 is null (0 per your statements). So this entire cross join would be generated then ignored. You are paying the cost to create this cross join and then ignore it.

Your results are the same given your data set, but this doesn't mean your queries are equal. They could change after your next transaction.

You might consider a third query depending on your expectation.
This requires that a.date1 occur after b.date1 but b.date2 may be null
select a.*,b.*
from table a
left join table b
on a.id = b.id and a.date1 >= b.date1 and a.date1 <= isnull(b.date2,getdate())

This query allows a.date1 to predate b.date1. This may not be desired.
Arun55 modified.
select a.*,b.*
from table a
left join table b
on a.id = b.id
and (a.date1 between b.date1 and b.date2 or b.date2 is null)
Post #1069811
Posted Friday, February 25, 2011 2:22 PM


SSCoach

SSCoachSSCoachSSCoachSSCoachSSCoachSSCoachSSCoachSSCoachSSCoachSSCoachSSCoach

Group: General Forum Members
Last Login: Today @ 5:20 PM
Points: 17,600, Visits: 15,462
Well done with this article.



Jason AKA CirqueDeSQLeil
I have given a name to my pain...
MCM SQL Server


SQL RNNR

Posting Performance Based Questions - Gail Shaw
Posting Data Etiquette - Jeff Moden
Hidden RBAR - Jeff Moden
VLFs and the Tran Log - Kimberly Tripp
Post #1069872
Posted Friday, February 25, 2011 6:19 PM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Saturday, March 26, 2011 5:33 PM
Points: 2, Visits: 16
Good article.

Here's a possible correction. For the Self Join section, there is the output to the query. See the fourth record which is "7, Peter, 8, Harry". Isn't it supposed to be "7, Peter, 6, Mark"?

Thanks.
Post #1069953
Posted Saturday, February 26, 2011 8:46 AM
Grasshopper

GrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopper

Group: General Forum Members
Last Login: Monday, September 1, 2014 10:50 AM
Points: 11, Visits: 911
Excellent article using Venn diagrams
Post #1070034
Posted Saturday, February 26, 2011 8:55 AM
Grasshopper

GrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopper

Group: General Forum Members
Last Login: Yesterday @ 10:35 AM
Points: 15, Visits: 27
One question: I thought that when you used multiple JOINs that SQL does the first join between Table1 and Table2 and then joins the result set of that join to Table3. Is this correct? Your article seems to say that SQL joins Table1 to Table2 and then joins Table1 to Table3 and then I assume it joins the result set of those two joins. Could you clarify? Is there some "order of operations" when it comes to JOINs?

Thank you for the article! Really like the Venn diagram visual aid.
Post #1070037
Posted Saturday, February 26, 2011 4:14 PM
SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Thursday, September 4, 2014 2:41 PM
Points: 29, Visits: 227
Permap

Good catch! I'm correcting it now.

Thx
Post #1070085
Posted Saturday, February 26, 2011 8:36 PM


SSCrazy Eights

SSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy EightsSSCrazy Eights

Group: General Forum Members
Last Login: Yesterday @ 12:48 PM
Points: 9,923, Visits: 11,169
Very nicely presented article, but the content is not quite right in places.

The set intersection is useful so long as your are clear that you are talking about sets of keys (i.e. unique join column values). If you don't make that distinction, or omit to go on to describe how one-many and many-many joins looks, it seems misleading. An INNER JOIN is not the intersection of two sets (that would be INTERSECT). The example data only uses unique join key values in both tables, which could reinforce any misconception.

I appreciate the value of simplification to introduce a concept, but the set-intersect visual might promote unclear understanding. The same issue applies to the coverage of LEFT, RIGHT, and FULL OUTER JOIN. There's no mention of how duplicates are treated, or how rows are NULL-extended when a match does not occur. Again, I feel the Venn diagram approach leads us astray here, unless it is very clearly pointed out that you are visualizing sets (unique by definition) of join keys.

The article also describes the cross product of Table1 and Table2 as "...each record of Table1 will be duplicated for each record in Table2". This suggests that only Table1's rows will appear duplicated in the output, which is clearly not the case. It's not wrong exactly, but it is incomplete and imprecise.

The section 'Excluding the Intersection of Sets' recommends the least efficient method (LEFT JOIN). NOT EXISTS is generally preferred because it can stop searching as soon as it finds a match. The semantic is different, of course, because LEFT JOIN ... WHERE NOT NULL can produce duplicates whereas NOT EXISTS will not. 'Excluding the intersection' has an air on set theory about it, but it isn't a very good description for returning rows from Table1 where a matching row does not exist in Table2.

In the same section, you say: "When we do this kind of query, we have to pay attention to which field we pick for the WHERE clause. We must use a field that does not allow NULL values. Otherwise the result set may include unwanted records. That's why I suggested to use the second table's key. More specifically, its primary key. Since primary keys don't accept NULL values, they will assure our result set will be just what we needed." This is a common misconception, which you should have tested before publishing. There is no problem with performing an outer join on NULLable columns: NULLs never join (regardless of the ANSI_NULLS setting), so there is no possible confusion between a 'join matched' NULL and a NULL resulting from the NULL extension of a row by the outer join.

The section 'One Word About Execution Plans' starts with "These comments lead us to an important insight. We usually don't stop to think about this, but observe that the execution plan of SQL queries will first calculate the result set for the FROM clause and the JOIN operator (if any), and then the WHERE clause will be executed". This is logically true, but not physically. If it were, we would never see an index seek in a plan with joins! The SQL Server optimizer is free to reorder expressions, transform the plan, and push filters (predicates) down past joins if it results in a lower cost query. The only reason the filter in your LEFT OUTER JOIN example can't be pushed past the join is because it is testing NULL-extended rows resulting from the join.

In 'Joins and Indexes', you say: "On the other hand, Table1 had no index on field key2. Because of that, the query optimizer tried to be smart enough and improve the performance of querying key2 using the only available index." - This is nonsense. It isn't the optimizer trying to be smart - there is no other access method available aside from scanning the clustered index. There's no such thing as a table scan on a clustered table.

The ability to perform anything other than an equi-join is not a 'cosmetic feature' - such joins are common where we need to produce a running total or a match on a range of dates, for example. The problem with non-equal (inequality) joins is NOT that they usually duplicate records - it is that performance tends to be worse because more rows will match.

Under 'Multiple Joins', you write: "Just remember that joins are written in pairs. So first we will join Table1 to Table2. And then we will join Table1 and Table3". The query optimizer chooses an initial join order based on table cardinality and other heuristics - it does not normally follow the written query order (though you can force that by including the FORCE ORDER hint, setting FORCE_PLAN ON, or by hinting a specific physical join operator e.g. INNER HASH JOIN). In the main, the optimizer picks an initial order, and then applies transformations which may result in further changes to the physical join order.

Consider reading my optimizer series starting at http://sqlblog.com/blogs/paul_white/archive/2010/07/29/inside-the-optimiser-constructing-a-plan.aspx

Paul




Paul White
SQL Server MVP
SQLblog.com
@SQL_Kiwi
Post #1070099
Posted Sunday, February 27, 2011 7:37 AM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: Yesterday @ 8:14 PM
Points: 35,215, Visits: 31,665
Paul,

Isn't this a bit like the article you wrote on designing databases? While I agree with all the things you've stated, entire books have been written on how to join and all the caveates that occur when dupes are present, etc, ad infinitum. This article is a good introduction to joins... it's not the final word just like your introduction to designing databases isn't and wasn't intended to be the final word. Think about it...


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1070176
Posted Sunday, February 27, 2011 2:08 PM
SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Thursday, September 4, 2014 2:41 PM
Points: 29, Visits: 227
James

The SQL statement was supposed to do it from the first to the last join. Actually I understand that's the way things worked out in old RDMBSs.

But SQL SERVER has a powerful Query Optimizer that evaluates the possibilities to choose the best combination of tables so to optimize query performance.

You can do yourself a simple test to check that. Try running the following queries. This is a simple query (running on ADVENTUREWORKSLT2008) written in two different forms, changing the sequence of the JOINs.

USE AdventureWorksLT2008
GO

SELECT C.CompanyName, P.Name AS Product, SUM(SD.OrderQty) AS OrderQtyTotal
FROM SALESLT.CUSTOMER C
INNER JOIN SalesLT.SalesOrderHeader SH ON SH.CustomerID = C.CUSTOMERID
INNER JOIN SalesLT.SalesOrderDetail SD ON SD.SalesOrderID = SH.SalesOrderID
INNER JOIN SalesLT.Product P ON P.ProductID = SD.ProductID
GROUP BY C.CompanyName, P.Name
GO


SELECT C.CompanyName, P.Name AS Product, SUM(SD.OrderQty) AS OrderQtyTotal
FROM SALESLT.Product P
INNER JOIN SalesLT.SalesOrderDetail SD ON P.ProductID = SD.ProductID
INNER JOIN SalesLT.SalesOrderHeader SH ON SD.SalesOrderID = SH.SalesOrderID
INNER JOIN SalesLT.Customer C ON SH.CustomerID = C.CUSTOMERID
GROUP BY C.CompanyName, P.Name
GO


If you compare the execution plan of each query, you'll see they're exactly the same (due to the work of the QUERY OPTIMIZER).



(I looked for some reference to show you this idea and only found this pretty short explanation http://bytes.com/topic/sql-server/answers/142445-order-joins-performance)
Post #1070213
« Prev Topic | Next Topic »

Add to briefcase «««678910»»

Permissions Expand / Collapse