SQLServerCentral Article

Find Customers Who Bought "A" and "B" But Not "C" (SQL Spackle)

,

"SQL Spackle" is a collection of short articles written based on multiple requests for similar code. These short articles are NOT meant to be complete solutions. Rather, they are meant to "fill in the cracks".

--Phil McCracken

Introduction

This is a classic problem that has appeared as a request for help on the forums here at SQLServerCentral many times. Generally speaking, the request is to find a group of things or people that have an association with at least one each of another group of people or things but no association with a third group.

Such queries can be used for many reasons. If, for example, products "A" and "B" are similar to "C", you might want to find those customers who bought both "A" and "B" and ask them why they didn't also buy "C". Maybe you have an excess of "Product C" and you want to identify customer types where stepping up advertising on "C" might be the most successful. Finding customers that bought both similar "A" and "B" items might help. Of course, it could be for something that has nothing to do with customers or products. It could be for something like finding patients on both medicines "A" and "B" who show no signs of symptom "C".

We're going to explore a "classic" and very high performance solution to this "classic" problem.

Restating the Problem

Rather than speaking in general terms, we're going to solve a specific problem in this article just to make it easier to talk about. You can easily apply it to your particular requirements by changing the column names, table names, and the critieria.

To be specific, we're going to look for customers who bought any number of products but specifically bought at least one each of "A" and "B" and not "C" regardless of what else they bought.

Simple Test Data

The comments in the test data generation code below describe what each customer bought and whether or not they meet the criteria of buying both "A" and "B" but not "C". Notice that we're testing not only for conditions that are true, but also those that come close so that we can be sure that we don't select rows that we don't mean to.

--===== Conditionally drop the test table to make
     -- reruns in SSMS easier.
     IF OBJECT_ID('tempdb..#Purchase','U') IS NOT NULL
        DROP TABLE #Purchase
;
--===== Create the test table
 CREATE TABLE #Purchase
        (
         PurchaseID     INT IDENTITY(1,1),
         CustomerID     INT,
         ProductCode    CHAR(1)
         PRIMARY KEY CLUSTERED (PurchaseID)
        )
;
--===== Populate the test table with known data.
 INSERT INTO #Purchase
        (CustomerID, ProductCode)
------- Customer #1 precisely meets the criteria.
     -- Bought 'A' and 'B' but not 'C'.
 SELECT 1, 'A' UNION ALL
 SELECT 1, 'B' UNION ALL
------- Customer #2 also meets the criteria.
     -- Bought 'A' and 'B' and somthing else,
     -- but not 'C'.
 SELECT 2, 'A' UNION ALL
 SELECT 2, 'B' UNION ALL
 SELECT 2, 'D' UNION ALL
------- Customer #3 also meets the criteria.
     -- Bought 'A' and 'B' and something else,
     -- but not 'C'.
 SELECT 3, 'A' UNION ALL
 SELECT 3, 'B' UNION ALL
 SELECT 3, 'D' UNION ALL
 SELECT 3, 'A' UNION ALL
 SELECT 3, 'D' UNION ALL
------- Customer #4 doesn't meet the criteria.
     -- Bought 'A' and 'B' but also bought 'C'.
 SELECT 4, 'A' UNION ALL
 SELECT 4, 'B' UNION ALL
 SELECT 4, 'C' UNION ALL
------- Customer #5 doesn't meet the criteria.
     -- Bought 'A' and 'B' and something else,
     -- but also bought 'C'.
 SELECT 5, 'A' UNION ALL
 SELECT 5, 'B' UNION ALL
 SELECT 5, 'A' UNION ALL
 SELECT 5, 'B' UNION ALL
 SELECT 5, 'C' UNION ALL
 SELECT 5, 'D' UNION ALL
------- Customer #6 doesn't meet the criteria.
     -- Bought more than 1 of 'A' and something else
     -- but not 'B'.
 SELECT 6, 'A' UNION ALL
 SELECT 6, 'A' UNION ALL
 SELECT 6, 'D' UNION ALL
 SELECT 6, 'E' UNION ALL
------- Customer #7 doesn't meet the criteria.
     -- Bought more than 1 of 'B' and something else
     -- but not 'A'.
 SELECT 7, 'B' UNION ALL
 SELECT 7, 'B' UNION ALL
 SELECT 7, 'D' UNION ALL
 SELECT 7, 'E'
;

The code above isn't adequate for performance testing. For those that want to explore and compare solutions of their own, I've attached code to build a million row test table at the bottom of this article in the "Resources" link.

Finding Customers that Bought BOTH 'A' and 'B'

To solve problems like this, I usually just start out with something like the following snippet of code even though I know it's not everything I'll need to get the job done. Why? Because "you can't correct a blank piece of paper." You've got to start somewhere. 😉

--===== Find Customers that bought either "A" OR "B"
 SELECT CustomerID
   FROM #Purchase
  WHERE ProductCode IN ('A','B')
;

If you run that code, you'll immediately see a problem. Looking back at the data, you'll find that customers 6 and 7 bought more than 1 of either 'A' or 'B' but not at least 1 of each. They need to be excluded from the return.

CustomerID

-----------

1

1

2

2

3

3

3

4

4

5

5

5

5

6

6

7

7

Counting the Products

A first step in eliminating the CustomerIDs that haven't bought both "A" and "B" would be to do a count.

--===== Find Customers that bought either "A" OR "B"
 SELECT CustomerID,
        ProductCount = COUNT(*)
   FROM #Purchase
  WHERE ProductCode IN ('A','B')
  GROUP BY CustomerID
;

That returns the following.

CustomerID ProductCount

----------- ------------

1 2

2 2

3 3

4 2

5 4

6 2

7 2

It counts ALL of the "A" and "B" products that a customer bought. At this point, we might be tempted to filter on anything that has more than 2 for a product count but then we see the counts for CustomerIDs 6 and 7. They each bought two of one of the qualifying products, but they didn't buy at least one of each. The have a count of "2" either because they bought 2 of "A" or 2 of "B" but not at least 1 of each.

What we really need to do is count how many DISTINCT products the customer bought and then keep only the "2"s. As a sidebar, this is where a lot of people start adding a whole lot of code in the form of subqueries and the like. It's simply not necessary, though.

Counting the Distinct Products

In order to determine if a customer bought both product "A" and "B", we have a little counting to do. We don't need to count how many of "A" and "B" they bought as we did in the previous snippet of code. Instead, we need to limit our lookup to just the two products for each customer. In other words, we need to determine the number of DISTINCT products they bought.

Since this is an "SQL Spackle" article, I won't go into all of the different methods to do such a distinct count. I will tell you that, even though it uses "old" technology, the code that follows is one of the faster methods and consumes 7 times less CPU and 3 times less duration than some of the newer methods that use "Windowing Functions".

All we need to do that is to make a minor tweak to the code we previously used to count the products for each customer. Instead of counting everything, lets only count the DISTINCT occurrences of the ProductCode.

--===== Find Customers that bought either "A" OR "B"
     -- and count the DISTINCT number of products each bought.
 SELECT CustomerID,
        DistinctProductCount = COUNT(DISTINCT ProductCode)
   FROM #Purchase
  WHERE ProductCode IN ('A','B')
  GROUP BY CustomerID
;

Thay returns the following and I believe you'll immediately see the difference.

CustomerID ProductCount

----------- ------------

1 2

2 2

3 2

4 2

5 2

6 1

7 1

NOW we can safely filter on the "2"s and be sure that we're getting only those customers who bought at least one each of both "A" and "B".

Filtering for "2" Products

The trouble is, how to apply the filter?

Let's simply move the COUNT(DISTINCT) to a filtering position in the code.  Let's try the WHERE clause first and see what happens.

--===== Find Customers that bought either "A" OR "B"
     -- and count the DISTINCT number of products each bought.
     -- The filter in the WHERE clause causes an error here.
 SELECT CustomerID,
   FROM #Purchase
  WHERE ProductCode IN ('A','B')
    AND COUNT(DISTINCT ProductCode) = 2
  GROUP BY CustomerID
;

The problem is that we can't put the filter in the WHERE clause because we'll get an error.

Msg 147, Level 15, State 1, Line 1

An aggregate may not appear in the WHERE clause unless it is in a subquery contained in a HAVING clause or a select list, and the column being aggregated is an outer reference.

So where do we put the filter if not in the WHERE clause?

It's a very old trick but GROUP BY has a very special method of filtering. Instead of putting the aggregated filter in the WHERE clause, we can put the filter into a HAVING clause, instead. Like this.

--===== Find Customers that bought "A" AND "B"
     -- HAVING is like a WHERE for GROUP BYs.
 SELECT CustomerID
   FROM #Purchase
  WHERE ProductCode IN ('A','B')
  GROUP BY CustomerID
 HAVING COUNT(DISTINCT ProductCode) = 2
;

That bit of code very nicely produces the list of all customers that have purchased at least 1 "A" AND 1 "B".

CustomerID

-----------

1

2

3

4

5

The only thing left to do is to exclude those customers that also bought "C". In other words, we want to exclude customers 4 and 5, in this case.

Excluding Customers that also bought "C"

Simplifying the problem using "Divide'n'Conquer" methods, let's first find the customers who bought "C" and then simply exclude them from the list of customers that bought both "A" and "B".

Find the Customers that bought "C"

At this point, we don't even care about customers that bought both "A" and "B". All we need to do is find customers who bought "C". You have to trust me for a minute in that we don't need unique or distinct CustomerID's for this. As you can well imagine, that code is incredibly simple.

--===== Find Customers that bought "C".
 SELECT CustomerID
   FROM #Purchase
  WHERE ProductCode IN ('C')
;

We could have used "=" instead of "IN" here but I was going for consistency compared to the other block of code.  It'll also make code changes easier if we have to exclude more than one product sometime in the future.

Doing the Exclusion.

We have code that finds all the CustomerIDs that bought at least one each of products "A" and "B". We also have code that finds all the CustomerIDs that bought "C" that need to be excluded from that list. All we need to do now, is do the actual exclusion.

In SQL Server 2000 and before, you needed to use a WHERE NOT EXISTS, WHERE NOT IN, or an OUTER JOIN with a NULL filter to do the exclusion. In SQL Server 2005, it's a whole lot easier. If we want to exclude one set of data from another, all we have to do is use the EXCEPT operator. Like this.

--===== Find Customers that bought both "A" AND "B"
 SELECT CustomerID
   FROM #Purchase
  WHERE ProductCode IN ('A','B')
  GROUP BY CustomerID
 HAVING COUNT(DISTINCT ProductCode) = 2
 EXCEPT
--===== Find Customers that bought "C".
 SELECT CustomerID
   FROM #Purchase
  WHERE ProductCode IN ('C')
;

To summarize the code above according to the comments, we "Find all the customers that bought both "A" and "B" EXCEPT for those that also bought "C". And it runs nasty fast. Here's the output… just like we wanted.

CustomerID

-----------

1

2

3

"Crack Filled!"

Thanks for listening, folks.

--Jeff Moden

© Copyright by Jeff Moden, 2012, All Rights Reserved.

Resources

Rate

4.51 (142)

You rated this post out of 5. Change rating

Share

Share

Rate

4.51 (142)

You rated this post out of 5. Change rating