Find Customers Who Bought "A" and "B" But Not "C" (SQL Spackle)

  • Hi Jeff,

    Great Spackle, thanks!

    In my tests, using your code to build a million row test, I found this method to be twice as fast for the same logical reads.

    SELECT DISTINCT

    CustomerID

    FROM #Purchase p1

    WHERE ProductCode = 'A'

    AND EXISTS (SELECT

    1

    FROM #Purchase p2

    WHERE p2.CustomerID = p1.CustomerID

    AND p2.ProductCode = 'B')

    AND NOT EXISTS (SELECT

    1

    FROM #Purchase p3

    WHERE p3.CustomerID = p1.CustomerID

    AND p3.ProductCode = 'C')

    I tested this against 10,000,000 rows as well as 1,000,000 and found that it also scales better (IMHO)...

    For 10,000,000 rows, the EXCEPT query had these stats:

    Table 'Worktable'. Scan count 49955, logical reads 151393, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 655 ms, elapsed time = 739 ms.

    The EXISTS query had these:

    Table 'Worktable'. Scan count 3, logical reads 2156, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:

    CPU time = 406 ms, elapsed time = 478 ms.

    I wondered if you considered this method and if so, what it was that steered you away from it?

    Thanks

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • [/url]
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • [/url]
  • How to Post Performance Problems - by Gail Shaw
  • [/url]

  • Thanks for article and comments - I learnt a lot.

    In MySQL 😀 this is the fastest by far:

    SELECT

    CustomerId

    FROM Purchase

    WHERE ProductCode IN ('A','B', 'C')

    GROUP BY CustomerID

    having sum(case when ProductCode = 'A' then 1 else 0 end) > 0

    and sum(case when ProductCode = 'B' then 1 else 0 end) > 0

    and sum(case when ProductCode = 'C' then 1 else 0 end) = 0

    Unfortunately MySQL does not have the EXCEPT operator.

    -jj

  • Hi folks,

    Thanks for the great discussion going on and the thoughtful feedback. I'd love to jump in this instant but I'm on my way to work. I'll be back tonight to try to answer some of the questions.

    Thanks, again.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Here's another alternative. I haven't tested the performance as I don't have a setup to do that, and I don't know if it will be similar to Jeff's in query plan. It does avoid the COUNT(DISTINCT...) that I read a while back (can't find the linky) was sometimes bad (something about rewinds).

    EDIT: Eh, nevermind... you can ignore this one... Did a small test and it's between 5 and 10x slower than other methods with about 2x the reads.

    Second Edit: BUT, it's easier to adjust if you know that the situation is a random number of "Purchased these" with a single "but didn't purchase this". Most of the other methods rely on replicating a block of SQL for each item in the list of "Purchased". I think both Jeff's (and mine) work without that restriction.

    SELECT

    b.CustomerID

    FROM

    (

    SELECT

    a.CustomerID

    FROM

    (

    SELECT

    DISTINCT

    CustomerID,

    Product

    FROM

    Purchase

    WHERE

    Product IN ('A','B')) AS a

    GROUP BY

    a.CustomerID

    HAVING

    COUNT(Product) = 2

    ) AS b LEFT OUTER JOIN

    Purchase p ON b.CustomerID = p.CustomerID AND p.Product = 'C'

    WHERE

    p.Product IS NULL

  • As always, Jeff turns complex problems into simple solutions. I'm still looking for "SQL Spackles" by Jeff Moden at Barnes & Noble.

  • mister.magoo (3/29/2012)


    Hi Jeff,

    Great Spackle, thanks!

    In my tests, using your code to build a million row test, I found this method to be twice as fast for the same logical reads.

    SELECT DISTINCT

    CustomerID

    FROM #Purchase p1

    WHERE ProductCode = 'A'

    AND EXISTS (SELECT

    1

    FROM #Purchase p2

    WHERE p2.CustomerID = p1.CustomerID

    AND p2.ProductCode = 'B')

    AND NOT EXISTS (SELECT

    1

    FROM #Purchase p3

    WHERE p3.CustomerID = p1.CustomerID

    AND p3.ProductCode = 'C')

    Nice way to do it. I find a lot of developers almost totally overlook use of EXISTS but it can be really efficient.

    This is slightly quicker using the results that have both products 'A' and 'B' to check that there is not a 'C' :

    SELECT DISTINCT CustomerID

    FROM #Purchase p1

    WHERE ProductCode = 'A'

    AND EXISTS(SELECT 1

    FROM #Purchase p2

    WHERE p2.CustomerID = p1.CustomerID

    AND p2.ProductCode = 'B'

    AND NOT EXISTS (SELECT 1

    FROM #Purchase p3

    WHERE p3.CustomerID = p2.CustomerID

    AND p3.ProductCode = 'C'))

  • Great spackle, Jeff. Very clearly written. Shows the old dogs a new trick and the new dogs some old tricks. Discussion shows there are many ways to skin a cat.

  • SELECT

    CustomerId

    FROM Purchase

    WHERE ProductCode IN ('A','B', 'C')

    GROUP BY CustomerID

    having sum(case when ProductCode = 'A' then 1 else 0 end) > 0

    and sum(case when ProductCode = 'B' then 1 else 0 end) > 0

    and sum(case when ProductCode = 'C' then 1 else 0 end) = 0

    That's a very interesting method! I'll take note of it just in case I need something like this.

    One question though:

    Wouldn't the query below present the same results?

    SELECT

    CustomerId

    FROM Purchase

    WHERE ProductCode IN ('A','B', 'C')

    GROUP BY CustomerID

    having sum(case when ProductCode = 'C' then 1 else 0 end) = 0

    Best regards,

    Andre Guerreiro Neto

    Database Analyst
    http://www.softplan.com.br
    MCITPx1/MCTSx2/MCSE/MCSA

  • Thanks Jeff, I never have seen or used the COUNT(DISTINCT <column> ) in the SELECT or HAVING. That is kinda of hard to visualize.

    Great Article!!!

    Thomas

    Thomas LeBlanc, MVP Data Platform Consultant

  • codebyo (3/29/2012)


    One question though:

    Wouldn't the query below present the same results?

    SELECT

    CustomerId

    FROM Purchase

    WHERE ProductCode IN ('A','B', 'C')

    GROUP BY CustomerID

    having sum(case when ProductCode = 'C' then 1 else 0 end) = 0

    No, it would return someone who as ordered 'A' only or 'B' only

  • Yet another solution:

    select distinct a.customerid

    from #purchase a

    inner join #purchase b on (a.customerid = b.customerid)

    left join #purchase c on (a.customerid = c.customerid and c.productcode = 'C')

    where a.productcode = 'A'

    and b.productcode = 'B'

    and c.purchaseid is null

  • Nice job, Jeff!

    I thought for sure an "old school" approach uing an inner join and a left join exclusion would be faster, but it got blown away in 10M row tests by the article's solution that uses "Except". (1M row tests were still fairly even on my test machine!).

    For what it's worth, I also did 10M row tests on some of the other suggestions in this discussion and found the solution first posted by Arjun S (using INTERSECT and EXCEPT) to have the best CPU and elapsed times.

    Thanks for a great article (that even included benchmarking code!) and thanks everyone for posting additional solutions.

  • This is a terific article Jeff. I like they way you break down the problem and clearly spell out the thought process. It's most helpful.

    The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge. - Stephen Hawking

  • The EXISTS / NOT EXISTS approach is the one I would have taken. I believe EXISTS is more efficent than IN, or so I was taught for Oracle. I think you can get away without the first EXISTS clause:

    Select Distinct CustomerID

    FROM #Purchase PU

    WHERE ProductCode ='A'

    AND EXISTS (Select CustomerID

    FROM #Purchase

    WHERE ProductCode ='B' and CustomerID = PU.CustomerID)

    AND NOT EXISTS (Select PU.CustomerID

    FROM #Purchase

    WHERE ProductCode ='C' and CustomerID = PU.CustomerID)

  • Jonathan AC Roberts (3/29/2012)


    No, it would return someone who as ordered 'A' only or 'B' only

    I guess I may have drank too much coffee today. :doze:

    Thank you.

    Best regards,

    Andre Guerreiro Neto

    Database Analyst
    http://www.softplan.com.br
    MCITPx1/MCTSx2/MCSE/MCSA

  • Viewing 15 posts - 16 through 30 (of 166 total)

    You must be logged in to reply to this topic. Login to reply