Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase ««123»»

Outer Join – What am I missing (SQL Spackle) Expand / Collapse
Author
Message
Posted Tuesday, November 12, 2013 3:44 PM
SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: Tuesday, May 5, 2015 3:08 PM
Points: 155, Visits: 623
I'm starting to feel bad here ...


Hey deroby, don't feel bad. I'm betting many readers would've made the same comment if you hadn't beat them to it! You're also right about the tendency for people to look no further once they've found something that works.

Yes, this is a beginner-level article describing Outer Join as promised by its title. The introduction however, sets it up as a report request that needs a solution rather than as a tutorial on joins. Given the introduction, I think it's fair to expect that other solutions might be provided and possibly contrasted and I wouldn't classify the inclusion of NOT EXISTS as "optimization" that would have made it more confusing. I think NOT EXISTS is less confusing (regardless of performance) and I hope beginners are learning it early on.

Also, as a beginner-level tutorial on Left Outer joins, I think the article would be improved if it pointed out that you can't have any other criteria referencing the table you're left joining to in the WHERE clause. For example, say you wanted a list of employees who have not completed required company courses in 2013. At first glance you might be tempted to add criteria to the WHERE clause (instead of the join where it belongs):
WHERE	CourseCompleted.CourseId IS NULL
AND CourseCompleted.DateComplete >= '1/1/2013' --Failed attempt to list all courses missed this year



and then be surprised when an empty set is returned.

This common mistake got me when I was learning and I think it is yet another good reason for teaching NOT EXISTS. With NOT EXISTS, it's more intuitive to put the additional criteria in the correct place, the sub-query's WHERE clause:
WHERE NOT EXISTS
(
SELECT 1
FROM CourseCompleted
WHERE CourseCompleted.EmployeeId = Employee.EmployeeId
AND CourseCompleted.CourseId = Course.CourseId
AND CourseCompleted.DateComplete >= '1/1/2013' --List all courses missed this year
)



I know I'm being an armchair quarterback, but I don't mean to pick on the author. I appreciate how hard it must be to decide how much detail to put in an article like this. Hopefully my comments will help someone avoid pitfalls that got me.



Post #1513651
Posted Tuesday, November 12, 2013 4:00 PM
Old Hand

Old HandOld HandOld HandOld HandOld HandOld HandOld HandOld Hand

Group: General Forum Members
Last Login: Saturday, February 28, 2015 10:12 AM
Points: 331, Visits: 1,019
When I run the two queries (outer join vs NOT EXISTS),
the outer join query runs in 4-9ms (69 logical IO n the temp tables and 0 scans and 0 logical IO on the work table)
and the not exists query runs in 28-32ms (109 logical IO on the temp tables and 50 scans and 5079 logical IO on the work table).

It looks like the #CourseCompleted table is scanned 50 times in the NOT EXISTS query, whereas only 30 times in the outer join query.

I wonder if these results hold over larger data sets.
Post #1513652
Posted Tuesday, November 12, 2013 4:17 PM
SSC Journeyman

SSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC Journeyman

Group: General Forum Members
Last Login: 2 days ago @ 8:13 AM
Points: 90, Visits: 819
Andy DBA (11/12/2013)
WHERE	CourseCompleted.CourseId IS NULL
AND CourseCompleted.DateComplete >= '1/1/2013' --Failed attempt to list all courses missed this year





I remember learning that lesson too. Before I knew I could add it to the ON clause, I always solved this type of query with

WHERE	CourseCompleted.CourseId IS NULL
AND ISNULL(CourseCompleted.DateComplete,'12/31/2012') >= '1/1/2013'

I was so happy the day I discovered that I could utilize the ON clause.
Post #1513656
Posted Tuesday, November 12, 2013 5:37 PM
SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: Tuesday, May 5, 2015 3:08 PM
Points: 155, Visits: 623
thisisfutile (11/12/2013)
I remember learning that lesson too. Before I knew I could add it to the ON clause, I always solved this type of query with

WHERE	CourseCompleted.CourseId IS NULL
AND ISNULL(CourseCompleted.DateComplete,'12/31/2012') >= '1/1/2013'

I was so happy the day I discovered that I could utilize the ON clause.


I hope you're joking! When using the Left Outer Join technique to find "missing" records, YOU CAN'T USE THE WHERE CLAUSE TO REFERENCE ANY COLUMNS IN THE "MISSING RECORD" TABLE except for the column that you specify IS NULL. The second criteria in the code snippet above won't filter anything, because ALL COLUMNS returned from CourseCompleted will be NULL so the ISNULL function will always return '12/31/2012' which is always greater than '1/1/2013'.

Think of it this way:
The Left Join tells SQL Server to return all rows from the Left regardless of matches found on the Right and to only include data from the Right when the join matches. The "<joined right table column> IS NULL" criteria in the WHERE clause tells SQL Server to not return any result records that DO have a match on the Right. All that is left for it to return is rows from the Left that are missing matches on the Right with the Right "columns" containing nothing but NULL placeholders.

If this seems confusing it's because it is confusing and that's why many people feel NOT EXISTS is more intuitive.

Anyone who is still unsure might benefit from doing a "Select *" on the original article's solution and see what happens with and without the where clause.



Post #1513668
Posted Tuesday, November 12, 2013 6:57 PM


Hall of Fame

Hall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of Fame

Group: General Forum Members
Last Login: 2 days ago @ 1:04 AM
Points: 3,933, Visits: 6,162
arnipetursson (11/12/2013)
When I run the two queries (outer join vs NOT EXISTS),
the outer join query runs in 4-9ms (69 logical IO n the temp tables and 0 scans and 0 logical IO on the work table)
and the not exists query runs in 28-32ms (109 logical IO on the temp tables and 50 scans and 5079 logical IO on the work table).

It looks like the #CourseCompleted table is scanned 50 times in the NOT EXISTS query, whereas only 30 times in the outer join query.

I wonder if these results hold over larger data sets.


I usually use a LEFT JOIN for this reason.

deroby (11/12/2013)

2. I've seen it 'fail' when people use a <somefield> that turns out to be NULL-able


That is easily avoidable, particularly when you're joining on a foreign key, by choosing <somefield> as one of the columns in your JOIN criteria. I agree though that it is a possible newbie mistake.

Nice article Patrick!



My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.


Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
Since random numbers are too important to be left to chance, let's generate some!
Learn to understand recursive CTEs by example.
Splitting strings based on patterns can be fast!
My temporal SQL musings: Calendar Tables, an Easter SQL, Time Slots and Self-maintaining, Contiguous Effective Dates in Temporal Tables
Post #1513683
Posted Wednesday, November 13, 2013 6:45 AM
SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Monday, May 18, 2015 5:29 AM
Points: 28, Visits: 223
arnipetursson (11/12/2013)
When I run the two queries (outer join vs NOT EXISTS),
the outer join query runs in 4-9ms (69 logical IO n the temp tables and 0 scans and 0 logical IO on the work table)
and the not exists query runs in 28-32ms (109 logical IO on the temp tables and 50 scans and 5079 logical IO on the work table).

It looks like the #CourseCompleted table is scanned 50 times in the NOT EXISTS query, whereas only 30 times in the outer join query.

I wonder if these results hold over larger data sets.


After adding UPDATE STATISTICS on the #tables + dropping the result in a temp-table and putting FREEPROCCACHE & DROPCLEANBUFFERS in front of each run, the results are returned in something like 14ms vs 13ms here respectively.
Even if it would be the other way around, in such 'close-cases' readability easily trumps any (small) performance hit IMHO.


When upping the rowcount dramatically the difference grows a little it seems; but relatively spoken they stick together quite well.
DECLARE @NumberOfCourse INT = 3000;
DECLARE @NumberOfEmployees INT = 50000;
=> (55181680 row(s) affected)

 [run 1] SQL Server Execution Times:   CPU time = 106922 ms,  elapsed time = 109345 ms.
[run 2] SQL Server Execution Times: CPU time = 110672 ms, elapsed time = 113376 ms.

vs
 [run 1] SQL Server Execution Times:   CPU time = 101938 ms,  elapsed time = 104593 ms.
[run 2] SQL Server Execution Times: CPU time = 99031 ms, elapsed time = 101675 ms.


Mind you that this is on a laptop that has plenty of stuff running in the background so don't take these numbers as 'hard evidence', they're a guideline at best.
In my experience I've always found both approaches to be equivalent as far as execution times go.



Post #1513853
Posted Wednesday, November 13, 2013 8:08 AM
SSC Journeyman

SSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC Journeyman

Group: General Forum Members
Last Login: 2 days ago @ 8:13 AM
Points: 90, Visits: 819
Andy DBA (11/12/2013)The second criteria in the code snippet above won't filter anything, because ALL COLUMNS returned from CourseCompleted will be NULL


Right you are. In my haste to explain how I learned the lesson to not add a RIGHT table to the WHERE clause of a LEFT join query (it effectively makes it an INNER join), I added to your example and caused your confusion. What I was going for was to expand on your example by adding ISNULL so that it wouldn't produce an empty result set. I just didn't take into account that it wasn't actually doing anything.

Sorry for my public DUH moment.
Post #1513913
Posted Friday, February 27, 2015 7:35 AM


SSC Veteran

SSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC VeteranSSC Veteran

Group: General Forum Members
Last Login: Thursday, May 21, 2015 9:32 PM
Points: 203, Visits: 1,169
There is always more than one way to skin a cat. My preference is to optimize simplicity and maintainability. Each part can be tested individually and there are no WHERE clauses.

SELECT -- get all employees and required courses
e.EmployeeId,
e.FirstName,
e.LastName,
c.CourseId,
c.CourseName
FROM
#Employee e
CROSS JOIN
#Course c

EXCEPT

SELECT -- remove completed employee courses
e.EmployeeId,
e.FirstName,
e.LastName,
c.CourseId,
c.CourseName
FROM
#Employee e
JOIN
#CourseCompleted cc ON cc.EmployeeId = e.EmployeeId
JOIN
#Course c on cc.CourseId = c.CourseId

ORDER BY
e.EmployeeId
;


Post #1664235
Posted Friday, February 27, 2015 7:38 AM
SSC Eights!

SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!

Group: General Forum Members
Last Login: Tuesday, May 19, 2015 2:12 PM
Points: 924, Visits: 2,469
Personally I find the exists/not exists statements to be more difficult to write, while the LEFT JOIN WHERE IS NULL is extremely intuitive to me. It's interesting to discover that others have the opposite view. I think there was an article just the other day on joins outperforming the use of subqueries.


Post #1664244
Posted Friday, February 27, 2015 4:48 PM
Grasshopper

GrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopperGrasshopper

Group: General Forum Members
Last Login: Thursday, March 12, 2015 6:41 AM
Points: 16, Visits: 153
@deroby,

Don't feel bad. In fact I think it's very important to explain to newbies that while this is an example of where LEFT OUTER JOIN can be used to accomplish the task at hand and it may be easy to understand it is generally not a good practice to rely on it since it is notoriously inefficient and should be avoided especially on large data sets. In todays world data sets are becoming very large very fast and it is important to be conscious of inefficient queries from the very beginning. Since data generally grows over time a query that works just fine when written often becomes a performance killer months or even weeks later.

Another (more efficient) way to solve this problem without using a LEFT OUTER JOIN or even WHERE NOT EXISTS (which btw is even less efficient) is to use the power of indexes. This method while similar to the NOT EXIST clause explicitly collects the cartesian set of indexes of employees and courses EXCEPT those where courses have been completed joining on the #CourseCompleted table. Since we are only collecting the indexes and not the related data we avoid returning unnecessary data and the looping inherent in the NOT EXIST clause.

To illustrate my point I simply used the original data creation script to increase the record count to 300 courses and 50000 employees.

DECLARE @NumberOfCourse INT = 300;
DECLARE @NumberOfEmployees INT = 50000;

Using the Left Outer Join query:

SELECT  Employee.EmployeeId
,Employee.FirstName
,Employee.LastName
,Course.CourseId
,Course.CourseName
FROM #Employee AS Employee
CROSS JOIN #Course AS Course
LEFT OUTER JOIN #CourseCompleted AS CourseCompleted
ON CourseCompleted.EmployeeId = Employee.EmployeeId
AND CourseCompleted.CourseId = Course.CourseId
WHERE CourseCompleted.CourseId IS NULL
ORDER BY Employee.EmployeeId

-- Returns 5516909 Records in 01:07
-- --------------------------------------------------------------

-- Using the NOT EXIST QUERY

SELECT Employee.EmployeeId
,Employee.FirstName
,Employee.LastName
,Course.CourseId
,Course.CourseName
FROM #Employee AS Employee
CROSS JOIN #Course AS Course
WHERE NOT EXISTS ( SELECT *
FROM #CourseCompleted AS CourseCompleted
WHERE CourseCompleted.EmployeeId = Employee.EmployeeId
AND CourseCompleted.CourseId = Course.CourseId )
ORDER BY Employee.EmployeeId

-- Returns 5516909 Records in 03:33

-- --------------------------------------------------------------
While the EXCEPT clause is a great way to simplify things it can be a real killer when returning a lot of data since the EXCEPT data set has to match the top SELECT clause and the query engine is matching everything as if it were a comjpound key to see what matches exactly. When I ran the EXCEPT query posted by Bill Tadada above I had to kill it after 7 minutes. But using it on just the indexes improves the query substantially.
-- --------------------------------------------------------------
Using the following script:
IF OBJECT_ID('tempdb..#NotComplete','U') IS NOT NULL
DROP TABLE #NotComplete;

-- get all of the keys of a cartesian set into a temp table {15000000)
SELECT Employee.employeeid, Course.courseId into #NotComplete
from #Employee AS Employee
CROSS JOIN #Course AS Course

EXCEPT

SELECT CourseCompleted.EmployeeId, CourseCompleted.CourseId
FROM #CourseCompleted AS CourseCompleted


-- the remaining keys can be used to get the data for the report (5516909 )

SELECT Employee.EmployeeId
,Employee.FirstName
,Employee.LastName
,Course.CourseId
,Course.CourseName
FROM #NotComplete AS NotComplete
INNER JOIN #Employee AS Employee ON Employee.EmployeeId = NotComplete.EmployeeId
INNER JOIN #Course AS Course ON Course.CourseId = NotComplete.CourseId

-- Returns 5516909 Records 00:36 (-00:31, a 50% improvement)

-- --------------------------------------------------------------

...and to your point deroby I did use the FREEPROCCACHE and DROPCLEANBUFFERS statements prior to execution of each. I'm using SS2014 on a laptop w/16GB ram and an SSD drive. As you say these are only indicators at best but it is consistent with my experience. I'm not sure why your results should be so different .


Ron Cicotte
ron.cicotte@gmail.com
Data Transformations, LLC
Post #1664497
« Prev Topic | Next Topic »

Add to briefcase ««123»»

Permissions Expand / Collapse