SQL & the JOIN Operator

  • Hi All,

    I have a question on joining more than 2 tables.

    When we have more than two tables to join, what is the order of execution?

    For eg :

    select ...

    from table t1 inner join table t2

    on t1.a = t2.a

    inner join table t3

    on t2.b = t3.b

    I think it should join t1 and t2 first, and then join the result to t3 to further filter the records from the join of t1 and t2.

    And if the query is like

    select ...

    from table t1 inner join table t2

    on t1.a = t2.a

    inner join table t3

    on t1.b = t3.b

    How does it go here? First join first, I think, but I just want to confirm.

    Thanks.

    Sunny

  • Good article 🙂

    The only thing I'm missing is a description of the other "join like" operators, like APPLY, INTERSECT and EXCEPT, but I know it's hard to cover everything in one article.

    Guess there will be a follow up 😀

  • Thanks for the informative article. As a newbie this helped me tremendously! 🙂

  • Nils

    Thanks for your comments. You know this 2nd article is in my list, hope I have time to work on it soon 🙂

  • Sunny

    Usually the query optimizer will evaluate the different sequences when defining the execution plan. So, considering standard situations, your query's performance will be the same in both cases.

  • Enjoyed the article and found the diagrams easy to follow.

  • arun55 (10/8/2009)


    vliet (10/8/2009)


    Since the 'or' operator has lower precedence than the 'and' operator, when you omit the parentheses the join condition means something completely different. Without parentheses the predicate after the 'or' is evaluated separately, which leaves out the 'a.id = b.id' part completely, thus returning all records from 'a' for each and every record where 'b.date2' is empty. This can blow up the size of the result set considerably when 'a' has many rows and 'b' has many rows without a date in this column. Never omit parentheses whenever you use both 'and' and 'or' in a predicate. You can never use too many parentheses in these cases: computers don't know what you mean, they just do what you tell them to, and that's generally not what you want.

    I've seen predicates mixing the 'and' and 'or' operators cause 'human' errors in SQL so often that it's almost worth an article of its own.
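vliet's precedence point can be demonstrated with a quick sketch (SQLite rather than SQL Server; the tables and data here are hypothetical, chosen only to show the two shapes of the predicate):

```python
import sqlite3

# AND binds tighter than OR, so without parentheses the "b.date2 IS NULL"
# branch ignores a.id = b.id and matches EVERY row of a against every
# NULL-dated row of b.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE a (id INTEGER, date1 TEXT);
CREATE TABLE b (id INTEGER, date1 TEXT, date2 TEXT);
INSERT INTO a VALUES (1, '2009-05-01'), (2, '2009-06-01');
INSERT INTO b VALUES (1, '2009-04-01', '2009-07-01'),
                     (2, '2009-01-01', NULL);
""")

without_parens = con.execute("""
    SELECT a.id, b.id FROM a JOIN b
    ON a.id = b.id AND a.date1 BETWEEN b.date1 AND b.date2
       OR b.date2 IS NULL
    ORDER BY a.id, b.id
""").fetchall()

with_parens = con.execute("""
    SELECT a.id, b.id FROM a JOIN b
    ON a.id = b.id AND (a.date1 BETWEEN b.date1 AND b.date2
                        OR b.date2 IS NULL)
    ORDER BY a.id, b.id
""").fetchall()

print(without_parens)  # the a.id = 1 row also pairs with b's NULL-dated row
print(with_parens)     # only matching ids survive
```

With more rows in 'a' and more NULL-dated rows in 'b', the unparenthesized form grows multiplicatively, which is exactly the blow-up described above.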

    Thanks for the information. I understood what you have explained.

    But in the above case, the b.date2 field has no NULLs and I am getting the same results.

    I think it was a precautionary statement. I will check the requirement and try to add parentheses.

    If Table B can contain a start date but no end date (an assumption inferred from its apparent use), this is important.

    Your 40-minute query returns information based on two specific possibilities:

    1) a.id = b.id and a.date1 between b.date1 and b.date2

    2) b.date2 is null

    This would return the same data as the following more complex and less efficient query:

    select a.*,b.*

    from table a

    left join table b

    on a.id = b.id

    and a.date1 between b.date1 and b.date2

    union

    select a.*,b.*

    from table a, table b

    where b.date2 is null

    Looking at the second SELECT, this is a cross join. After creating this cross join, the engine goes through all Count(A) * Count(B) records and selects those where date2 is null (zero, per your statements). So the entire cross join would be generated and then ignored. You are paying the cost of creating this cross join only to discard it.

    Your results are the same given your data set, but this doesn't mean your queries are equal. They could change after your next transaction.

    You might consider a third query depending on your expectation.

    This requires that a.date1 occur after b.date1 but b.date2 may be null

    select a.*,b.*

    from table a

    left join table b

    on a.id = b.id and a.date1 >= b.date1 and a.date1 <= isnull(b.date2,getdate())

    This query allows a.date1 to predate b.date1. This may not be desired.

    Arun55's query, modified:

    select a.*,b.*

    from table a

    left join table b

    on a.id = b.id

    and (a.date1 between b.date1 and b.date2 or b.date2 is null)
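As an illustrative sketch of the open-ended-range pattern discussed above (SQLite, so IFNULL and DATE('now') stand in for T-SQL's ISNULL and GETDATE(); the tables and data are hypothetical):

```python
import sqlite3

# Treat a NULL b.date2 as a still-open range by substituting today's date,
# so a.date1 only has to fall after b.date1.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE a (id INTEGER, date1 TEXT);
CREATE TABLE b (id INTEGER, date1 TEXT, date2 TEXT);
INSERT INTO a VALUES (1, '2009-05-01'), (2, '2009-06-01'), (3, '2009-06-01');
INSERT INTO b VALUES (1, '2009-04-01', '2009-07-01'),  -- closed range
                     (2, '2009-01-01', NULL);          -- still open
""")

rows = con.execute("""
    SELECT a.id, b.id
    FROM a
    LEFT JOIN b
      ON a.id = b.id
     AND a.date1 >= b.date1
     AND a.date1 <= IFNULL(b.date2, DATE('now'))
    ORDER BY a.id
""").fetchall()

print(rows)  # id 2 matches the open-ended row; id 3 has no match at all
```

Note how id 3 survives as a NULL-extended row because of the LEFT JOIN, while id 2 matches only thanks to the IFNULL substitution.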

  • Well done with this article.

    Jason...AKA CirqueDeSQLeil
    _______________________________________________
    I have given a name to my pain...MCM SQL Server, MVP
    SQL RNNR
    Posting Performance Based Questions - Gail Shaw
    Learn Extended Events

  • Good article.

    Here's a possible correction. In the Self Join section's query output, see the fourth record, which is "7, Peter, 8, Harry". Isn't it supposed to be "7, Peter, 6, Mark"?

    Thanks.

  • Excellent article using Venn diagrams

  • One question: I thought that when you used multiple JOINs that SQL does the first join between Table1 and Table2 and then joins the result set of that join to Table3. Is this correct? Your article seems to say that SQL joins Table1 to Table2 and then joins Table1 to Table3 and then I assume it joins the result set of those two joins. Could you clarify? Is there some "order of operations" when it comes to JOINs?

    Thank you for the article! Really like the Venn diagram visual aid.

  • Permap

    Good catch! I'm correcting it now.

    Thx

  • Very nicely presented article, but the content is not quite right in places.

    The set intersection is useful so long as you are clear that you are talking about sets of keys (i.e. unique join column values). If you don't make that distinction, or omit to go on to describe how one-to-many and many-to-many joins look, it seems misleading. An INNER JOIN is not the intersection of two sets (that would be INTERSECT). The example data only uses unique join key values in both tables, which could reinforce any misconception.

    I appreciate the value of simplification to introduce a concept, but the set-intersect visual might promote unclear understanding. The same issue applies to the coverage of LEFT, RIGHT, and FULL OUTER JOIN. There's no mention of how duplicates are treated, or how rows are NULL-extended when a match does not occur. Again, I feel the Venn diagram approach leads us astray here, unless it is very clearly pointed out that you are visualizing sets (unique by definition) of join keys.
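Paul's distinction between a join and a set intersection can be sketched in SQLite (tables and data hypothetical): once the join key contains duplicates, an INNER JOIN multiplies rows, while INTERSECT behaves as a true set operation.

```python
import sqlite3

# With duplicate join-key values, an INNER JOIN is not the set
# intersection of the two tables -- INTERSECT is.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t1 (k INTEGER);
CREATE TABLE t2 (k INTEGER);
INSERT INTO t1 VALUES (1), (1), (2);
INSERT INTO t2 VALUES (1), (1), (3);
""")

joined = con.execute(
    "SELECT t1.k FROM t1 JOIN t2 ON t1.k = t2.k").fetchall()
intersected = con.execute(
    "SELECT k FROM t1 INTERSECT SELECT k FROM t2").fetchall()

print(len(joined))  # 4 rows: each duplicate of k=1 pairs with each duplicate
print(intersected)  # [(1,)]: a true set operation, duplicates removed
```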

    The article also describes the cross product of Table1 and Table2 as "...each record of Table1 will be duplicated for each record in Table2". This suggests that only Table1's rows will appear duplicated in the output, which is clearly not the case. It's not wrong exactly, but it is incomplete and imprecise.

    The section 'Excluding the Intersection of Sets' recommends the least efficient method (LEFT JOIN). NOT EXISTS is generally preferred because it can stop searching as soon as it finds a match. The semantic is different, of course, because LEFT JOIN ... WHERE ... IS NULL can produce duplicates whereas NOT EXISTS will not. 'Excluding the intersection' has an air of set theory about it, but it isn't a very good description for returning rows from Table1 where a matching row does not exist in Table2.
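One way the two forms can diverge, sketched with SQLite (tables, columns, and data hypothetical): if the column tested for NULL after the LEFT JOIN is itself nullable, matched rows can both slip through the filter and be duplicated per matching row, while NOT EXISTS returns each unmatched row exactly once.

```python
import sqlite3

# Testing a nullable column of b for NULL after a LEFT JOIN can both
# duplicate rows (one per matching b row) and admit matched rows whose
# tested column happens to be NULL; NOT EXISTS does neither.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE a (id INTEGER);
CREATE TABLE b (id INTEGER, val TEXT);   -- val is nullable
INSERT INTO a VALUES (1), (2), (3);
INSERT INTO b VALUES (1, NULL), (1, NULL);
""")

left_join = con.execute("""
    SELECT a.id FROM a
    LEFT JOIN b ON a.id = b.id
    WHERE b.val IS NULL
    ORDER BY a.id
""").fetchall()

not_exists = con.execute("""
    SELECT a.id FROM a
    WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.id = a.id)
    ORDER BY a.id
""").fetchall()

print(left_join)   # id 1 appears twice even though it matched
print(not_exists)  # exactly the unmatched rows, once each
```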

    In the same section, you say: "When we do this kind of query, we have to pay attention to which field we pick for the WHERE clause. We must use a field that does not allow NULL values. Otherwise the result set may include unwanted records. That's why I suggested to use the second table's key. More specifically, its primary key. Since primary keys don't accept NULL values, they will assure our result set will be just what we needed." This is a common misconception, which you should have tested before publishing. There is no problem with performing an outer join on NULLable columns: NULLs never join (regardless of the ANSI_NULLS setting), so there is no possible confusion between a 'join matched' NULL and a NULL resulting from the NULL extension of a row by the outer join.
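The claim that NULLs never join can be checked quickly (SQLite here rather than SQL Server; data hypothetical):

```python
import sqlite3

# NULL join-key values never satisfy an equality join, so a NULL on the
# right side of an outer join's output always means "no match" for that row,
# never a "join matched" NULL.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t1 (k INTEGER);
CREATE TABLE t2 (k INTEGER);
INSERT INTO t1 VALUES (1), (NULL);
INSERT INTO t2 VALUES (1), (NULL);
""")

rows = con.execute("""
    SELECT t1.k, t2.k FROM t1
    LEFT JOIN t2 ON t1.k = t2.k
    ORDER BY t1.k
""").fetchall()

print(rows)  # the two NULL keys did not join each other
```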

    The section 'One Word About Execution Plans' starts with "These comments lead us to an important insight. We usually don't stop to think about this, but observe that the execution plan of SQL queries will first calculate the result set for the FROM clause and the JOIN operator (if any), and then the WHERE clause will be executed". This is logically true, but not physically. If it were, we would never see an index seek in a plan with joins! The SQL Server optimizer is free to reorder expressions, transform the plan, and push filters (predicates) down past joins if it results in a lower cost query. The only reason the filter in your LEFT OUTER JOIN example can't be pushed past the join is because it is testing NULL-extended rows resulting from the join.

    In 'Joins and Indexes', you say: "On the other hand, Table1 had no index on field key2. Because of that, the query optimizer tried to be smart enough and improve the performance of querying key2 using the only available index." - This is nonsense. It isn't the optimizer trying to be smart - there is no other access method available aside from scanning the clustered index. There's no such thing as a table scan on a clustered table.

    The ability to perform anything other than an equi-join is not a 'cosmetic feature' - such joins are common where we need to produce a running total or a match on a range of dates, for example. The problem with non-equal (inequality) joins is NOT that they usually duplicate records - it is that performance tends to be worse because more rows will match.
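As a sketch of a legitimate inequality join (SQLite, hypothetical data): a triangular self-join computing a running total. Note how row counts grow as the inequality admits more matches, which is the real performance concern Paul describes.

```python
import sqlite3

# Triangular (non-equi) self-join: each day joins to itself and all
# earlier days, so SUM over the group yields a running total.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE sales (day INTEGER, amount INTEGER);
INSERT INTO sales VALUES (1, 10), (2, 20), (3, 30);
""")

rows = con.execute("""
    SELECT s1.day, SUM(s2.amount) AS running_total
    FROM sales s1
    JOIN sales s2 ON s2.day <= s1.day
    GROUP BY s1.day
    ORDER BY s1.day
""").fetchall()

print(rows)  # day n accumulates all amounts up to and including day n
```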

    Under 'Multiple Joins', you write: "Just remember that joins are written in pairs. So first we will join Table1 to Table2. And then we will join Table1 and Table3". The query optimizer chooses an initial join order based on table cardinality and other heuristics - it does not normally follow the written query order (though you can force that by including the FORCE ORDER hint, setting FORCEPLAN ON, or by hinting a specific physical join operator e.g. INNER HASH JOIN). In the main, the optimizer picks an initial order, and then applies transformations which may result in further changes to the physical join order.

    Consider reading my optimizer series starting at http://sqlblog.com/blogs/paul_white/archive/2010/07/29/inside-the-optimiser-constructing-a-plan.aspx

    Paul

  • Paul,

    Isn't this a bit like the article you wrote on designing databases? While I agree with all the things you've stated, entire books have been written on how to join and all the caveats that occur when dupes are present, etc., ad infinitum. This article is a good introduction to joins... it's not the final word, just like your introduction to designing databases isn't and wasn't intended to be the final word. Think about it... 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • James

    The SQL statement was supposed to process the joins from the first to the last. Actually, I understand that's the way things worked in old RDBMSs.

    But SQL Server has a powerful query optimizer that evaluates the possible combinations of tables in order to optimize query performance.

    You can run a simple test yourself to check that. Try running the following queries: the same simple query (against AdventureWorksLT2008) written in two different forms, changing the sequence of the JOINs.

    USE AdventureWorksLT2008
    GO

    SELECT C.CompanyName, P.Name AS Product, SUM(SD.OrderQty) AS OrderQtyTotal
    FROM SalesLT.Customer C
    INNER JOIN SalesLT.SalesOrderHeader SH ON SH.CustomerID = C.CustomerID
    INNER JOIN SalesLT.SalesOrderDetail SD ON SD.SalesOrderID = SH.SalesOrderID
    INNER JOIN SalesLT.Product P ON P.ProductID = SD.ProductID
    GROUP BY C.CompanyName, P.Name
    GO

    SELECT C.CompanyName, P.Name AS Product, SUM(SD.OrderQty) AS OrderQtyTotal
    FROM SalesLT.Product P
    INNER JOIN SalesLT.SalesOrderDetail SD ON P.ProductID = SD.ProductID
    INNER JOIN SalesLT.SalesOrderHeader SH ON SD.SalesOrderID = SH.SalesOrderID
    INNER JOIN SalesLT.Customer C ON SH.CustomerID = C.CustomerID
    GROUP BY C.CompanyName, P.Name
    GO

    If you compare the execution plan of each query, you'll see they're exactly the same (due to the work of the QUERY OPTIMIZER).

    (I looked for some reference to show you this idea and only found this pretty short explanation http://bytes.com/topic/sql-server/answers/142445-order-joins-performance)
