Developing predefined reports in addition to all the ad-hoc queries have been a part of my daily activities. Often I find my stakeholders asking me to prepare reports in which they want to know something that has never happened. It might sound strange, but yeah, that's what their point of concern is all about. In simple words, the questions which arise mostly are such as
- Which are my customers who have never bought a product?
- Which products are never sold?
- Which events are never booked?
Knowing such answers is helpful in analyzing what went wrong with these and probably they might also come up with some new ideas which might benefit the overall sales.
When I start to prepare answers for such a question, I usually audit all the tables that might be of interest to me. For example, in the first scenario, the Customers and the Sales table would be of utmost importance. Similarly, in the second scenario, Products and Sales would be of interest and so on.
In any of these cases, it is important to understand that we need to extract information from one of the tables which doesn't exist in the other. I will try to explain what are the basic ways to achieve such an answer and also emphasize on the best option in my opinion.
Technically, it becomes quite exciting for me when things come down to writing SQL queries. There are multiple ways in SQL to achieve the above scenarios, some of which are mentioned below:
- Using a NOT IN statement
- Using an OUTER APPLY
- Using a LEFT OUTER JOIN
- Using an EXCEPT statement
- Using NOT EXISTS
All these approaches might yield the same result, however, it is possible that one might perform better than the other. I had to literally try out all the approaches individually, before finalizing one.
For the sake of this article, I'd like to take an example from the WorldWideImporters database as provided by Microsoft. This database can be downloaded from the Microsoft SQL Server's official Github page.
Using a NOT IN statement
While this is the most common approach that every SQL developer tries to apply, this might not be the best one. A typical query would look like something below.
SELECT CustomerName FROM Sales.Customers WHERE CustomerID NOT IN (SELECT CustomerID FROM Sales.Orders);
It seems like a pretty decent query that will return the correct results as long as the CustomerID in the Sales.Orders is NOT NULL. In case there are NULL values, this method might not work because the database engine treats this as a LEFT ANTI SEMI JOIN, however, it doesn't reflect if the NULL on the right-hand side matches or doesn't match the NULL on the left.
Using an OUTER APPLY
Another way of expressing the same query would be to make use of an OUTER APPLY clause and the query is as follows.
SELECT CustomerName FROM Sales.Customers cus OUTER APPLY ( SELECT CustomerID FROM Sales.Orders ord WHERE ord.CustomerID = cus.CustomerID ) as sales WHERE sales.CustomerID IS NULL;
This approach also uses a LEFT ANTI SEMI JOIN, however, the resulting plan seems to be missing the join operator. I think this approach is a little more expensive than the previous one because it is actually processed in a different way. An OUTER JOIN operator first brings in all the data that either matches or doesn't match the condition and then a filter is applied which returns only the matched records.
Using a LEFT OUTER JOIN
One usual alternative to the OUTER APPLY is the LEFT OUTER JOIN. Here, it is assumed that since the join condition is based on CustomerID, all the CustomerIDs from the Sales.Orders table will be NULL for those who have not made a purchase yet. The query plan is almost similar to the one used in the previous approach. The most important thing to note here would be to identify on which column the NULL filter is to be applied. Also, it is advisable to use an indexed column to use in the join condition as it will help yield better performance.
SELECT CustomerName FROM Sales.Customers cus LEFT OUTER JOIN Sales.Orders ord ON ord.CustomerID = cus.CustomerID WHERE ord.CustomerID IS NULL;
Using the EXCEPT statement
Another approach, although I do not see its use frequently, is to use the EXCEPT clause. The only problem, in my opinion, is that it can only be used if both the columns are the same which in this case is the CustomerID. It is not possible to use any other column in the SELECT statement for the first table because there is no way for the database engine to identify on which column is the join condition is to be applied.
SELECT CustomerID FROM Sales.Customers cus EXCEPT SELECT CustomerID FROM Sales.Orders ord;
Using NOT EXISTS
My most preferred choice to handle such a case would be to make use of the NOT EXISTS statement. Please note that I have used "SELECT 1" in the inner query and there are two reasons for that. Firstly, SQL Server doesn't really care about what is inside the EXISTS subquery and secondly, from a documentation perspective, it helps me realize that the inner query doesn't return any records just by looking at the statement. In terms of performance, I feel it is almost equivalent to the NOT IN and EXCEPT statements, however, a potential advantage is that it eliminates the risks of NULLs or the duplicates.
SELECT CustomerName FROM Sales.Customers cus WHERE NOT EXISTS ( SELECT 1 FROM Sales.Orders ord WHERE ord.CustomerID = cus.CustomerID );
Although, there are multiple ways to address the filtering of non-existing records in SQL Server, my most preferred choice would be to use the NOT EXISTS statement. In the future, I might make a detailed comparison of all the execution plans that are being used by all these statements and also compare the performance for each one of these.