Hidden Pitfalls with INNER JOIN and NOT IN Operators

,

I want to show you how two common operators can deceive you into believing you have constructed a proper SQL statement for your solution. We may want to filter results in a table by using using a IN/NOT IN or a INNER JOIN/LEFT JOIN to another table. There are a couple of hidden dangers here which I will describe.

Imagine we want to evaluate a measurable event and compare it toΒ  another event. This is typical in business. How many people that bought motor oil from Walmart also bought auto parts ? How many of my customers that bought a widget also purchased a vacation package? Whether the transactions are all in one table or come from different vendors still would not make a difference. We still have to create two records sets to compare the purchases of widgets to vacations. Here IΒ  create two very different products hoping for easier comprehension, and focus on the concept rather than the business case.

So to compare the same customers in two views/tables, we could associate the two tables using a IN/NOT IN or a INNER JOIN/LEFT JOIN.

Let's set up our data

USE TempDB;
GO
Create table #widgetSales
(-- customers who bought widgets
customerID int,
itemBought varchar(20),
itemAmount money
)
Create table #vacationPurchases
(-- customer who bought vacations
customerID int, 
destination varchar(20)
)
Insert into #widgetSales
Values 
(1,'widgetA', 20 ),
(2,'widgetA', 20 ),
(2,'widgetB', 25 ),
(3,'widgetA', 20 ),
(3,'widgetB', 25 ),
(3,'widgetC', 30 ),
(4,'widgetA', 20 )

Insert into #vacationPurchases
Values 
(1,'Honolulu' ),
(2,'Seattle'),
(3,'New York' ),
(3,'Sydney'),
(NULL,'Paris')

I want to know the total amount each of my customers spent on widgets, but only those customers that DID NOT PURCHASE a vacation. This will work fine if we do a LEFT JOIN but what if we decide to use something else? Something that looks easier perhaps? Let us see what happens with a NOT IN operator.

SelectcustomerID, sum(itemAmount) as totalSpent
from#widgetSales
wherecustomerID NOT IN 
   (select customerID 
            from   #vacationPurchases
           )
groupby customerID

The whole operation fails to pull any data because of that one NULL in the look up table!

 

Now I want to know the totals of those customers that also DID PURCHASE a vacation package. I could use a IN operator here on the look up table. What happens if instead I decide to use a JOIN? Maybe I do this because someone once told me it is faster. It should work as well right? My vacation purchasing customers will be there after all in the look up table.

Selectw.customerID, sum(w.itemAmount) as totalSpent
from#widgetSales as w 
inner join #vacationPurchases as v ON v.customerID = w.customerID
groupby w.customerID

Here are the results:

 

Notice the duplication for customer 3 ($150 when it should be $75) because we joined to a table that did not have a unique constraint on its join column. How was I supposed to know customerID was not really an ID in my lookup table? Beware it could be a foreign key or not have uniqueness enforced even if it is meant to be unique. I would not have a clue of erroneous results if most of my customers purchased zero or only 1 vacation. HereΒ  a cursory inspection of results can utterly deceive the user.

So far we have shown two scenarios where lack of knowledge of the table you are using as your lookup can backfire on you depending on the type of operator you use (NOT IN vs JOIN). Luckily you don't have to memorize these situations on when to use a LEFT JOIN or an IN. There is a T-SQL construct so that you can use that is safe in both of these cases. The WHERE EXISTS clause is applicable to any situation where you want to filter events in one transaction set by using another transaction set.

Let's revisit both scenarios we previously described.

The customers that did not buy a vacation.

Let's evaluate just those without worrying about NULLs in the look up table.

Selectw.customerID, sum(w.itemAmount) as totalSpent
from#widgetSales as w
whereNOT EXISTS 
(
 select 'anything here' /* does not even evaluate, try 1/0 */ from#vacationPurchases as v
 wherev.customerID = w.customerID
)
 groupby w.customerID

This works as expected, (customer 4 results) and is more straight forward and easier than a LEFT JOIN.

The customers that also purchased a vacation package.

Let's evaluate just those without worrying about duplication in the lookup table. Rather than worrying about do I use a IN or JOIN here to avoid some pitfall I read about on SQL Server Central, we just keep the similar WHERE EXISTS construct.

Selectw.customerID, sum(w.itemAmount) as totalSpent
from#widgetSales as w
whereEXISTS /* <-- the only thing that changes, Easy! */(
 select 'anything here' /* does not even evaluate, try 1/0 */ from#vacationPurchases as v
 wherev.customerID = w.customerID
)
groupby w.customerID

This works! Customer 3 has a total of $75 as expected.

How easy was that? Your time as a developer is important and if you by chance don't know the underlying profile of the table we've shown how bad data can result. In the real world looking at a table may not be enough. Imagine if the vacation purchases table does not have a NULL in it (yet). What if customer 3 had only bought one vacation? Without understanding the table and briefly looking at it you probably would be tempted to think the customerID is unique. In this case a JOIN would produce accurate results... until the day the customer buys another vacation and a report user comes back to you saying (rightfully so) that something looks off with the latest numbers πŸ™‚

Rate

β˜… β˜… β˜… β˜… β˜… β˜… β˜… β˜… β˜… β˜…

3.77 (13)

Share

Share

Rate

β˜… β˜… β˜… β˜… β˜… β˜… β˜… β˜… β˜… β˜…

3.77 (13)