Do extra join conditions help, hurt, not matter?

  • Hi - I searched for a couple hours, but can't find this topic and am wondering if any of you has insight on it.

    I have an inner join of several tables.  Hypothetically, if each of those tables has a unique primary key.  The first table can be joined to the second on a primary key and the second table to the third with a primary key.  Also assume that other columns could be used in a join on those tables.  Is it better to only use the primary key, or is it better to add in some of the other columns, or is SQL Server smart enough to ignore the superfluous columns?

    To further illustrate in case I'm not properly explaining my question.

    Table1 has columns col1, col2, col3, col4 (primary key is on col3 and col4)

    Table2 has columns  col2, col3, col4, col5 (no primary key)

    If I want to do an inner join between tables 1 and 2 is it better to do this:

    select * from table1 t1 inner join table2 t2 on  t1.col3=t2.col3 and t1.col4=t2.col4

    or

    select * from table1 t1 inner join table2 t2 on  t1.col3=t2.col3 and t1.col4=t2.col4 and t1.col2=t2.col2

    or doesn't it matter?

    My expectation is that I'll get a scan on table2 regardless since there's no PK (it's just a heap I think).

    I don't typically post questions to groups, so I apologize if I've made this unclear or left out important details or made a faux pas.

    Thanks for any insight you may have.

     

  • Unfortunately - the answer to your question is 'It Depends' 😉

    Since table 2 does not have a defined constraint then you can have duplicate values at any level.  How you join is going to depend on what results you want to get based on the data.

    If you know that the HEAP was created as a one to one relationship between all 3 columns then you probably want to join using all 3 columns.  If col2 can be different for each distinct value of col3/col4 - then you probably want to join using just col3/col4...again, it really depends on the data in that table and the expected results from your query.

    Side note: I have seen this situation many times - and the developer decided to join on just the 2 columns.  Duplicate rows were returned and instead of adding the other column to the query the developer figure it was just easier to put in a DISTINCT (or GROUP BY or even both).  Not only is that incorrect...but it forces an unnecessary sort operation and affects the performance of the query.

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • jschwartz_0412 wrote:

    Table1 has columns col1, col2, col3, col4 (primary key is on col3 and col4)

    Table2 has columns  col2, col3, col4, col5 (no primary key)

    If I want to do an inner join between tables 1 and 2 is it better to do this:

    select * from table1 t1 inner join table2 t2 on  t1.col3=t2.col3 and t1.col4=t2.col4

    or

    select * from table1 t1 inner join table2 t2 on  t1.col3=t2.col3 and t1.col4=t2.col4 and t1.col2=t2.col2

    or doesn't it matter?

    I'd say you should only put the actual join condition in the INNER JOIN clause of your query.  Since (col3,col4) is the primary key of Table1, do your join on those 2 columns, you can't get any more specific.  If there are extra criteria for filtering either Table1 or Table2, put those in your WHERE clause.  It makes it easier for the next person who has to read the code to understand what is happening.

  • Thank you Chris.  While I hadn't thought of the next person, I agree completely and appreciate the response.

    I'm still interested if there is a positive or negative impact and why.

    I did run a test on my actual query with and without the extraneous columns in the join condition and I see 1 more scan in the statistics, telling me that, in my current situation, I should definitely exclude the extra columns.  My gut tells me this would always be a bad thing, but my head says that the optimizer will take care of it, but my test agreed with my gut.

    Thanks again for the response - I really do appreciate it.

  • Thanks Jeffrey.  I appreciate the time you took to reply to me.  That makes total sense.  I'll keep that in mind when I'm working with our developers to fix this problem.

    Thanks again!

    I should mention that the results of the queries are identical.  The extraneous columns in the join condition do not impact the results.  I'm only curious about performance.

  • the SQL Server engine will evaluate any condition you write in your query.  Adding extra conditions will force it to do more work.  I'm surprised to hear it causes an extra scan, but it definitely would consume extra CPU cycles comparing those 2 columns.

  • Thank you again Chris.  That makes a lot of sense to me.  Have a great weekend.  I believe you have answered my question.

  • I suppose one way to think about it, is to understand what types of things that SQL Server can determine it doesn't need to do.  For example, if you have a view or even a table valued function that includes a join to a table, but this table isn't used in any where clause, group by clause, or columns used in the final select results of the query, and it doesn't change the number of rows being returned, then the SQL Server engine can optimize that join out and it is noticeable in your execution plan.

    When any of those conditions isn't met, then SQL Server will evaluate it and consider it pertinent to the results.

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply