Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase ««12

Outer Join – What am I missing (SQL Spackle) Expand / Collapse
Author
Message
Posted Tuesday, November 12, 2013 3:44 PM
SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: Tuesday, August 19, 2014 7:49 AM
Points: 147, Visits: 535
I'm starting to feel bad here ...


Hey deroby, don't feel bad. I'm betting many readers would've made the same comment if you hadn't beat them to it! You're also right about the tendency for people to look no further once they've found something that works.

Yes, this is a beginner-level article describing Outer Join as promised by its title. The introduction however, sets it up as a report request that needs a solution rather than as a tutorial on joins. Given the introduction, I think it's fair to expect that other solutions might be provided and possibly contrasted and I wouldn't classify the inclusion of NOT EXISTS as "optimization" that would have made it more confusing. I think NOT EXISTS is less confusing (regardless of performance) and I hope beginners are learning it early on.

Also, as a beginner-level tutorial on Left Outer joins, I think the article would be improved if it pointed out that you can't have any other criteria referencing the table you're left joining to in the WHERE clause. For example, say you wanted a list of employees who have not completed required company courses in 2013. At first glance you might be tempted to add criteria to the WHERE clause (instead of the join where it belongs):
WHERE	CourseCompleted.CourseId IS NULL
AND CourseCompleted.DateComplete >= '1/1/2013' --Failed attempt to list all courses missed this year



and then be surprised when an empty set is returned.

This common mistake got me when I was learning and I think it is yet another good reason for teaching NOT EXISTS. With NOT EXISTS, it's more intuitive to put the additional criteria in the correct place, the sub-query's WHERE clause:
WHERE NOT EXISTS
(
SELECT 1
FROM CourseCompleted
WHERE CourseCompleted.EmployeeId = Employee.EmployeeId
AND CourseCompleted.CourseId = Course.CourseId
AND CourseCompleted.DateComplete >= '1/1/2013' --List all courses missed this year
)



I know I'm being an armchair quarterback, but I don't mean to pick on the author. I appreciate how hard it must be to decide how much detail to put in an article like this. Hopefully my comments will help someone avoid pitfalls that got me.



Post #1513651
Posted Tuesday, November 12, 2013 4:00 PM
SSC Veteran

SSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC Veteran

Group: General Forum Members
Last Login: Yesterday @ 12:40 PM
Points: 273, Visits: 849
When I run the two queries (outer join vs NOT EXISTS),
the outer join query runs in 4-9ms (69 logical IO n the temp tables and 0 scans and 0 logical IO on the work table)
and the not exists query runs in 28-32ms (109 logical IO on the temp tables and 50 scans and 5079 logical IO on the work table).

It looks like the #CourseCompleted table is scanned 50 times in the NOT EXISTS query, whereas only 30 times in the outer join query.

I wonder if these results hold over larger data sets.
Post #1513652
Posted Tuesday, November 12, 2013 4:17 PM
SSC Journeyman

SSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC Journeyman

Group: General Forum Members
Last Login: Tuesday, August 19, 2014 2:19 PM
Points: 80, Visits: 722
Andy DBA (11/12/2013)
WHERE	CourseCompleted.CourseId IS NULL
AND CourseCompleted.DateComplete >= '1/1/2013' --Failed attempt to list all courses missed this year





I remember learning that lesson too. Before I knew I could add it to the ON clause, I always solved this type of query with

WHERE	CourseCompleted.CourseId IS NULL
AND ISNULL(CourseCompleted.DateComplete,'12/31/2012') >= '1/1/2013'

I was so happy the day I discovered that I could utilize the ON clause.
Post #1513656
Posted Tuesday, November 12, 2013 5:37 PM
SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: Tuesday, August 19, 2014 7:49 AM
Points: 147, Visits: 535
thisisfutile (11/12/2013)
I remember learning that lesson too. Before I knew I could add it to the ON clause, I always solved this type of query with

WHERE	CourseCompleted.CourseId IS NULL
AND ISNULL(CourseCompleted.DateComplete,'12/31/2012') >= '1/1/2013'

I was so happy the day I discovered that I could utilize the ON clause.


I hope you're joking! When using the Left Outer Join technique to find "missing" records, YOU CAN'T USE THE WHERE CLAUSE TO REFERENCE ANY COLUMNS IN THE "MISSING RECORD" TABLE except for the column that you specify IS NULL. The second criteria in the code snippet above won't filter anything, because ALL COLUMNS returned from CourseCompleted will be NULL so the ISNULL function will always return '12/31/2012' which is always greater than '1/1/2013'.

Think of it this way:
The Left Join tells SQL Server to return all rows from the Left regardless of matches found on the Right and to only include data from the Right when the join matches. The "<joined right table column> IS NULL" criteria in the WHERE clause tells SQL Server to not return any result records that DO have a match on the Right. All that is left for it to return is rows from the Left that are missing matches on the Right with the Right "columns" containing nothing but NULL placeholders.

If this seems confusing it's because it is confusing and that's why many people feel NOT EXISTS is more intuitive.

Anyone who is still unsure might benefit from doing a "Select *" on the original article's solution and see what happens with and without the where clause.



Post #1513668
Posted Tuesday, November 12, 2013 6:57 PM


Hall of Fame

Hall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of Fame

Group: General Forum Members
Last Login: 2 days ago @ 4:14 AM
Points: 3,618, Visits: 5,254
arnipetursson (11/12/2013)
When I run the two queries (outer join vs NOT EXISTS),
the outer join query runs in 4-9ms (69 logical IO n the temp tables and 0 scans and 0 logical IO on the work table)
and the not exists query runs in 28-32ms (109 logical IO on the temp tables and 50 scans and 5079 logical IO on the work table).

It looks like the #CourseCompleted table is scanned 50 times in the NOT EXISTS query, whereas only 30 times in the outer join query.

I wonder if these results hold over larger data sets.


I usually use a LEFT JOIN for this reason.

deroby (11/12/2013)

2. I've seen it 'fail' when people use a <somefield> that turns out to be NULL-able


That is easily avoidable, particularly when you're joining on a foreign key, by choosing <somefield> as one of the columns in your JOIN criteria. I agree though that it is a possible newbie mistake.

Nice article Patrick!



My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.


Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
Since random numbers are too important to be left to chance, let's generate some!
Learn to understand recursive CTEs by example.
Splitting strings based on patterns can be fast!
Post #1513683
Posted Wednesday, November 13, 2013 6:45 AM
SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Monday, August 4, 2014 4:34 AM
Points: 26, Visits: 187
arnipetursson (11/12/2013)
When I run the two queries (outer join vs NOT EXISTS),
the outer join query runs in 4-9ms (69 logical IO n the temp tables and 0 scans and 0 logical IO on the work table)
and the not exists query runs in 28-32ms (109 logical IO on the temp tables and 50 scans and 5079 logical IO on the work table).

It looks like the #CourseCompleted table is scanned 50 times in the NOT EXISTS query, whereas only 30 times in the outer join query.

I wonder if these results hold over larger data sets.


After adding UPDATE STATISTICS on the #tables + dropping the result in a temp-table and putting FREEPROCCACHE & DROPCLEANBUFFERS in front of each run, the results are returned in something like 14ms vs 13ms here respectively.
Even if it would be the other way around, in such 'close-cases' readability easily trumps any (small) performance hit IMHO.


When upping the rowcount dramatically the difference grows a little it seems; but relatively spoken they stick together quite well.
DECLARE @NumberOfCourse INT = 3000;
DECLARE @NumberOfEmployees INT = 50000;
=> (55181680 row(s) affected)

 [run 1] SQL Server Execution Times:   CPU time = 106922 ms,  elapsed time = 109345 ms.
[run 2] SQL Server Execution Times: CPU time = 110672 ms, elapsed time = 113376 ms.

vs
 [run 1] SQL Server Execution Times:   CPU time = 101938 ms,  elapsed time = 104593 ms.
[run 2] SQL Server Execution Times: CPU time = 99031 ms, elapsed time = 101675 ms.


Mind you that this is on a laptop that has plenty of stuff running in the background so don't take these numbers as 'hard evidence', they're a guideline at best.
In my experience I've always found both approaches to be equivalent as far as execution times go.



Post #1513853
Posted Wednesday, November 13, 2013 8:08 AM
SSC Journeyman

SSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC Journeyman

Group: General Forum Members
Last Login: Tuesday, August 19, 2014 2:19 PM
Points: 80, Visits: 722
Andy DBA (11/12/2013)The second criteria in the code snippet above won't filter anything, because ALL COLUMNS returned from CourseCompleted will be NULL


Right you are. In my haste to explain how I learned the lesson to not add a RIGHT table to the WHERE clause of a LEFT join query (it effectively makes it an INNER join), I added to your example and caused your confusion. What I was going for was to expand on your example by adding ISNULL so that it wouldn't produce an empty result set. I just didn't take into account that it wasn't actually doing anything.

Sorry for my public DUH moment.
Post #1513913
« Prev Topic | Next Topic »

Add to briefcase ««12

Permissions Expand / Collapse