Fun with Outer Joins

  • Comments posted to this topic are about the item Fun with Outer Joins

    Kenneth FisherI was once offered a wizards hat but it got in the way of my dunce cap.--------------------------------------------------------------------------------For better, quicker answers on T-SQL questions, click on the following... http://www.sqlservercentral.com/articles/Best+Practices/61537/[/url]For better answers on performance questions, click on the following... http://www.sqlservercentral.com/articles/SQLServerCentral/66909/[/url]Link to my Blog Post --> www.SQLStudies.com[/url]

  • Thanks for the article. In my experience (15 years SQL Server, 25 Oracle) there have been so many instances where despite hours of research and hacking the T-SQL, an outer join fails to deliver the keys for non matched rows. This appears to be a not uncommon finding. Maybe there's scope for a follow on article explaining with why this happens and how to fix it without resorting to #tables.

    Men who wish to know about the world must learn about it in its particular details. - Heraclitus
  • hi there,

    maybe this little trick helps:

    I personally understood this behaviour of outer joins quite a time ago when it was explained to me as follows (yes I know, very simplified...):

    there is a logical order of processing the a query. At some time the outer join with its ON-clause is processed, first as an inner join, then adding the outer rows and also the NULL-values on the missing inner rows. Some time after that the WHERE-clause is processed maybe "removing" null-values.

    Since I'm saying to myself "first the ON is processed, after that the WHERE is processed" it's no problem for me any more writing my outer joins in a way they are doing what I expect.

  • I suppose I don't understand the point. We know that when records do not exist in the RIGHT table that NULLs are returned. So why would we expect a filter condition on the RIGHT table to treat NULLs differently than we would otherwise?

    i.e. Simply change the filter condition to account for NULLs

    WHERE Professor.HasTenure = 'True'

    AND (Class.ClassYear >= 2011 OR Class.ClassYear IS NULL)

    Jared
    CE - Microsoft

  • SQLKnowItAll (9/10/2012)


    I suppose I don't understand the point. We know that when records do not exist in the RIGHT table that NULLs are returned. So why would we expect a filter condition on the RIGHT table to treat NULLs differently than we would otherwise?

    i.e. Simply change the filter condition to account for NULLs

    WHERE Professor.HasTenure = 'True'

    AND (Class.ClassYear >= 2011 OR Class.ClassYear IS NULL)

    id say because the above query is the classic catch all. if we can avoid the catch all performance should increase. (as always the it depends please test applies).

    if we have an index on Class with ProfessorID and ClassYear we can use that index in the join. we also eliminating records we dont need early in execution and have a lower overall row count.

    To Kenneth, Great article and a gave me a great way to explain join conditions vs where conditions.


    For faster help in answering any problems Please read How to post data/code on a forum to get the best help - Jeff Moden[/url] for the best way to ask your question.

    For performance Issues see how we like them posted here: How to Post Performance Problems - Gail Shaw[/url]

    Need to Split some strings? Jeff Moden's DelimitedSplit8K[/url]
    Jeff Moden's Cross tab and Pivots Part 1[/url]
    Jeff Moden's Cross tab and Pivots Part 2[/url]

  • capn.hector (9/10/2012)


    SQLKnowItAll (9/10/2012)


    I suppose I don't understand the point. We know that when records do not exist in the RIGHT table that NULLs are returned. So why would we expect a filter condition on the RIGHT table to treat NULLs differently than we would otherwise?

    i.e. Simply change the filter condition to account for NULLs

    WHERE Professor.HasTenure = 'True'

    AND (Class.ClassYear >= 2011 OR Class.ClassYear IS NULL)

    id say because the above query is the classic catch all. if we can avoid the catch all performance should increase. (as always the it depends please test applies).

    if we have an index on Class with ProfessorID and ClassYear we can use that index in the join. we also eliminating records we dont need early in execution and have a lower overall row count.

    To Kenneth, Great article and a gave me a great way to explain join conditions vs where conditions.

    I don't see it as a catch all. I see it as defining the requirements of the query. The join condition is the id, the filter is the tenure and the year. Assessing the question posed, I wouldn't expect to see the results that the author posed; i.e. if class year is NULL, then they didn't teach that year and I don't need them in the results. I just don't understand why the expected results would be different. If someone has been working with LEFT OUTER JOINs for more than 20 minutes... why would they ever expect the query to return rows from the LEFT table when the filter is on the RIGHT? I don't put the filter in the JOIN because it is a filter, not a join condition.

    Jared
    CE - Microsoft

  • SQLKnowItAll (9/10/2012)


    capn.hector (9/10/2012)


    SQLKnowItAll (9/10/2012)


    I suppose I don't understand the point. We know that when records do not exist in the RIGHT table that NULLs are returned. So why would we expect a filter condition on the RIGHT table to treat NULLs differently than we would otherwise?

    i.e. Simply change the filter condition to account for NULLs

    WHERE Professor.HasTenure = 'True'

    AND (Class.ClassYear >= 2011 OR Class.ClassYear IS NULL)

    id say because the above query is the classic catch all. if we can avoid the catch all performance should increase. (as always the it depends please test applies).

    if we have an index on Class with ProfessorID and ClassYear we can use that index in the join. we also eliminating records we dont need early in execution and have a lower overall row count.

    To Kenneth, Great article and a gave me a great way to explain join conditions vs where conditions.

    I don't see it as a catch all. I see it as defining the requirements of the query. The join condition is the id, the filter is the tenure and the year. Assessing the question posed, I wouldn't expect to see the results that the author posed; i.e. if class year is NULL, then they didn't teach that year and I don't need them in the results. I just don't understand why the expected results would be different. If someone has been working with LEFT OUTER JOINs for more than 20 minutes... why would they ever expect the query to return rows from the LEFT table when the filter is on the RIGHT? I don't put the filter in the JOIN because it is a filter, not a join condition.

    while i agree with you that the join condition is the id and the filter is tenure and year, i disagree that the join on id should be the only thing in the join conditions. as far as the expected results, well users want what they want and we are there to provide.

    (Class.ClassYear >= 2011 OR Class.ClassYear IS NULL) while not exactly a catch all with a passed variable you are still using the form (something = constant or something IS NULL). if we can eliminate the second check for null (even by not putting a "filter" in the where clause but in the join) we should speed up the query.

    this of course is a great situation for the it depends, test both ways, or what does your code reviewer say answer to which is correct.


    For faster help in answering any problems Please read How to post data/code on a forum to get the best help - Jeff Moden[/url] for the best way to ask your question.

    For performance Issues see how we like them posted here: How to Post Performance Problems - Gail Shaw[/url]

    Need to Split some strings? Jeff Moden's DelimitedSplit8K[/url]
    Jeff Moden's Cross tab and Pivots Part 1[/url]
    Jeff Moden's Cross tab and Pivots Part 2[/url]

  • First I would like to appologize if my choice of test data is throwing people off. The data and table layout was meant to be simple and easy to follow so it didn't get in the way of the description of the difference between the ON clause and the WHERE clause.

    Second I'll address a few points.

    The correct terms are "preserved table " and "unpreserved table"

    Never heard of either. What are they the correct terms for?

    and your design is wrong.

    My design is correct for the purpose. It's simple. I didn't need or want a complicated design to demonstrate a subject that can be complicated enough for a lot of people.

    we do not use BIT flags in SQL

    "We" do use BITs quite often. They are a nice small data type where I can fit 8 of them into a byte. Unless you have a good reason not to use them that I've never heard of I think I'll continue to not waste disk space.

    Your life will be easier if you learn to use row constructors:

    INSET INTO Professors

    VALUES (1, 'Dr Coke', '2011-01-01'),

    (2, 'Dr Sprite', '2011-01-01'),

    (3, 'Dr 7-up', '2011-01-01'),

    (4, 'Mr Pepper', NULL),

    (5, 'Mr Fanta', NULL);

    I do in fact know how to use this method of inserting. However it only works on 2008 and above and not everyone (believe it or not) is done using 2005 (and even lower).

    Now try your queries with a proper schema.

    My queries work exactly the same. Which is why I chose to use a very simple table layout. If I had chosen to make a realistic schema I could have added quite a few more columns, tables, defaults, constraints, foreign keys etc. That not being the point of the article I chose not to.

    SQLKnowItAll (9/10/2012)


    I suppose I don't understand the point. We know that when records do not exist in the RIGHT table that NULLs are returned. So why would we expect a filter condition on the RIGHT table to treat NULLs differently than we would otherwise?

    i.e. Simply change the filter condition to account for NULLs

    WHERE Professor.HasTenure = 'True'

    AND (Class.ClassYear >= 2011 OR Class.ClassYear IS NULL)

    You can absolutly do it this way. Personally I prefer the other, and I do consider the ClassYear part of the join in this case. But that comes down to personal preference I think. I will say that you should consider yourself lucky to understand it so well. I still get around 1 or 2 questions a week on this subject. Which is why I wrote the article in the first place. It tends to be a difficult subject for some people. Not everyone, but in my experience the vast majority.

    Kenneth FisherI was once offered a wizards hat but it got in the way of my dunce cap.--------------------------------------------------------------------------------For better, quicker answers on T-SQL questions, click on the following... http://www.sqlservercentral.com/articles/Best+Practices/61537/[/url]For better answers on performance questions, click on the following... http://www.sqlservercentral.com/articles/SQLServerCentral/66909/[/url]Link to my Blog Post --> www.SQLStudies.com[/url]

  • I really learned from this article! I've been working with SQL Server for 7 years now and I had no idea the results were different if I took a filter from the WHERE clause and put it into the ON clause...I always assumed the results were going to be the same. I personally knew the right table results were being filtered with the WHERE clause (Dr Sprite and Dr 7-Up were being excluded) because I've been down that bumpy road many times. I've always dealt with this situation by adding "ISNULL" into the where clause.

    AND ISNULL(class.classyear,2011)>=2011

    Not a pretty solution but NOW I know there is a better way!

    Thanks for the article AND it's simplicity. I get what Joe Celko was driving at, but I find simple examples to be refreshing...though it could probably be argued that is why I'm a 7 year veteran and still learning what many DBA's consider elementary.

  • The way I always remember this is thus: If you refer to the right hand table anywhere is the where clause, you immediately turn that left join into a inner join. That's the effect you see anyway.

    Putting the predicate in the ON clause keeps it a left join.

    Unless of course you are doing an anti-join, where you want all records from Set A not in Set B, so you use

    select cols from Set_A a

    left join Set_B b

    On a.col1 = b.col1

    where b.col1 is NULL

    That is the only time I refer to a column from the right hand table involved in a left outer join in the where clause.

    HTH

    Dave J


    http://glossopian.co.uk/
    "I don't know what I don't know."

  • thisisfutile (9/10/2012)


    I really learned from this article! I've been working with SQL Server for 7 years now and I had no idea the results were different if I took a filter from the WHERE clause and put it into the ON clause...I always assumed the results were going to be the same. I personally knew the right table results were being filtered with the WHERE clause (Dr Sprite and Dr 7-Up were being excluded) because I've been down that bumpy road many times. I've always dealt with this situation by adding "ISNULL" into the where clause.

    AND ISNULL(class.classyear,2011)>=2011

    Not a pretty solution but NOW I know there is a better way!

    Thanks for the article AND it's simplicity. I get what Joe Celko was driving at, but I find simple examples to be refreshing...though it could probably be argued that is why I'm a 7 year veteran and still learning what many DBA's consider elementary.

    Using a function on a column in a where clause will not allow your query to use an index. That is why instead of isnull you want to use "Where col1 = 'a' OR Col1 IS NULL..."

  • Ken,

    It's a great article.

    Yep as a simple teaching example I get your design. I also get what Joe is saying, although he has often said it better, in that it winds up validating bad design by being used as an example. The "relationship" table (Catalog) is used to connect [Professor] to [Class]. For your next eaxample you might choose something like [SalesOrder] and [SalesOrderActivity].

    Joe,

    Keep up the good fight! I had one VP take me to task about my penchant for normalization as being "abnormal". I pointed out that we have a mediumn sized database in production that is small, robust, comprehensive, and quick. I learend a lot from your books.

    To everybody else,

    Take this to heart in that the real answer here is not so much how the joins work but the fact that the way SQL Server processes your execution can surprise you if you are not careful. The key here is that the WHERE clause seems to be processed after all the joins have taken place.

    ATBCharles Kincaid

  • CELKO (9/9/2012)


    The correct terms are "preserved table " and "unpreserved table" and your design is wrong. You have no keys, we do not use BIT flags in SQL, you believe in a magical "id" that changes from table to table, becoming a professor, a squid, a class, an automobile, etc. That is Kabahlah magice and nto RDBMS.

    You had only one professor, no key, a name that violates USPS standards and an assembly language bit flag. IDENTITY properties? And no DRI because of the bad non-schema. Yes, without keys and DRI, this is not RDBMS. You also do not understand temporal data.

    CREATE TABLE Professors

    (emp_id INTEGER NOT NULL PRIMARY KEY,

    professor_name VARCHAR(35) NOT NULL,

    tenure_date DATE);

    Your life will be easier if you learn to use row constructors:

    INSET INTO Professors

    VALUES (1, 'Dr Coke', '2011-01-01'),

    (2, 'Dr Sprite', '2011-01-01'),

    (3, 'Dr 7-up', '2011-01-01'),

    (4, 'Mr Pepper', NULL),

    (5, 'Mr Fanta', NULL);

    Notice the use of a NULL for non-tenured professors. This still stinks and we ought to have a status and date range. But let's skip that design flaw for now.

    CREATE TABLE Classes

    (course_nbr INTEGER NOT NULL

    REFERENCES Catalog,

    semester_name CHAR(7) NOT NULL

    REFERENCES Semesters,

    PRIMARY KEY (course_nbr, semester_name)

    emp_id INTEGER

    REFERENCES Professors);

    The common temporal idiom in SQL is a look up table for reporting periods. I prefer a name that can be sorted with ISO-8601 data:

    CREATE TABLE Semesters

    (semester_name CHAR(7) NOT NULL PRIMARY KEY

    CHECK (semester_name LIKEB '[12][0-9][0-9][0-9]-S[1-3]'),

    semester_start_date DATE NOT NULL,

    semester_end_date DATE NOT NULL,

    CHECK(semester_start_date < semester_end_date);

    You can get the class name from the catalog and not waste space and time on it. What you had was not normalized. Courses have numbers; look at a college catalog. The NULL is my favorite instructor “TBD” (I do adjunct teaching and that means work).

    INSERT INTO Class_Schedule (class_nbr, semester_name, emp_id)

    VALUES (101 '2010-S1', 4),

    (101, '2010-S2', 4),

    (101, '2011-S1', 4),

    (101, '2012-S1', NULL),

    (101, '2012-S2', NULL),

    Etc.

    Now try your queries with a proper schema.

    wow... beating up the author because of the design correctness of the example in an article about joins? it was just an example by the way. He was giving some example code so people could try it themselves (and so the article made sense). Get off your high horse and understand the article was not about database design. It was about joins....

  • adamg 27214 (9/10/2012)


    Using a function on a column in a where clause will not allow your query to use an index. That is why instead of isnull you want to use "Where col1 = 'a' OR Col1 IS NULL..."

    Roger that. Most of the querying I do is ad-hoc on small datasets and the isnull function is a personal "at-a-glance" way for me to see that I'm including the null results on the RIGHT table (Bad habit? Yes. In production? No). For me it's easier when I'm troubleshooting because the less "OR" and "AND" structures I have to logic my way through during a troubleshooting session the better off I am. Thanks for pointing that out though. I constantly feel like I learn something on this site. I'm still amazed that I never realized two result sets are produced by just moving something from the WHERE clause up to the ON clause. I felt a "click" in my brain today and at 40 years of age I thought I was getting past most of those. It's good to know I can still learn. 😛

  • CELKO (9/9/2012)


    The correct terms are "preserved table " and "unpreserved table" ...

    Can you explain us these terms? I can't find any reference on the web.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2

Viewing 15 posts - 1 through 15 (of 63 total)

You must be logged in to reply to this topic. Login to reply