Find Customers Who Bought "A" and "B" But Not "C" (SQL Spackle)

  • robinwilson (6/7/2014)


    Hello All

    I'm just wondering, is there anything wrong with doing it this way:

    There is no guarantee that customer with both product A and B are returned.


    N 56°04'39.16"
    E 12°55'05.25"

  • Thanks, I missed that part!

  • My variant on this uses count instead of sum. Since count() ignores null values, you don't need the else part of the case:

    select CustomerId

    from #Purchase

    where ProductCode in ('A', 'B', 'C')

    group by CustomerId

    having count(case ProductCode when 'A' then 1 end) > 0

    and count(case ProductCode when 'B' then 1 end) > 0

    and count(case ProductCode when 'C' then 1 end) = 0

    for me it's a bit more intuitive but that's purely subjective.

  • Hi,

    I am new to SQL and tried to solve this in my own way. In one of my approaches I first wanted to get a list of the customers who had purchased both A&B using this query.

    Select DISTINCT p1.CustomerID From #Purchase p1

    join #Purchase p2 On p1.ProductCOde = 'A' and p2.ProductCOde = 'B';

    Using the data provided, I am getting an "incorrect" result from what I was expecting. I am getting 6 rows returned with the unexpected row customerID = 6. Can someone tell me why SQL Server returned that row?

    My novice approach for the answer to the original question turned out to be:

    Select p1.CustomerID from

    ( Select DISTINCT p1.CustomerID From #Purchase p1

    join #Purchase p2 On (p1.ProductCOde = 'A' and p2.ProductCOde = 'B' and p1.CustomerID = p2.CustomerID)

    ) as p1

    where p1.CustomerID NOT IN (Select CustomerID from #Purchase p3 where p3.ProductCOde = 'C')

    TIA,

    Rob

  • Hi Rob,

    The problem with your first query is that you missed to add the join condition for the CustomerID. In result, you're getting a cross join between all the rows with ProductCode A and all the rows with ProductCode B, and in the end you get all customers who bought product 'A' (as long as someone bought product B).

    The weird thing is that you corrected that problem in your second query to get the correct result. Congratulations on getting it right!

    I'd encourage you to read the discussion on this topic to review and understand the different options given. It will help you in your learning.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Luis Cazares (6/11/2014)


    Hi Rob,

    The problem with your first query is that you missed to add the join condition for the CustomerID. In result, you're getting a cross join between all the rows with ProductCode A and all the rows with ProductCode B, and in the end you get all customers who bought product 'A' (as long as someone bought product B).

    The weird thing is that you corrected that problem in your second query to get the correct result. Congratulations on getting it right!

    I'd encourage you to read the discussion on this topic to review and understand the different options given. It will help you in your learning.

    Thks Luis!

    I also saw adding a WHERE clause:

    Select DISTINCT p1.CustomerID From #Purchase p1

    join #Purchase p2 On p1.ProductCOde = 'A' and p2.ProductCOde = 'B'

    where p1.CustomerID = p2.CustomerID;

    would give the expected result. Yes, I am going thru and looking at the different solutions. Learning how to do the benchmark as well.

    Rob

  • mushin2003 (6/11/2014)


    Luis Cazares (6/11/2014)


    Hi Rob,

    The problem with your first query is that you missed to add the join condition for the CustomerID. In result, you're getting a cross join between all the rows with ProductCode A and all the rows with ProductCode B, and in the end you get all customers who bought product 'A' (as long as someone bought product B).

    The weird thing is that you corrected that problem in your second query to get the correct result. Congratulations on getting it right!

    I'd encourage you to read the discussion on this topic to review and understand the different options given. It will help you in your learning.

    Thks Luis!

    I also saw adding a WHERE clause:

    Select DISTINCT p1.CustomerID From #Purchase p1

    join #Purchase p2 On p1.ProductCOde = 'A' and p2.ProductCOde = 'B'

    where p1.CustomerID = p2.CustomerID;

    would give the expected result. Yes, I am going thru and looking at the different solutions. Learning how to do the benchmark as well.

    Rob

    That's an interesting choice. Even if it will work exactly the same for inner joins, my advice is to go the other way to keep things where they belong within the logic.

    Select DISTINCT p1.CustomerID

    From #Purchase p1

    join #Purchase p2 On p1.CustomerID = p2.CustomerID --Keep join conditions within the join

    where p1.ProductCOde = 'A' --Keep query filters on WHERE

    and p2.ProductCOde = 'B';

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • simonc 39536 (6/6/2014)


    I think a lot of people seem to focus purely on performance, a good solution to a given issue considers many factors, NOT just performance, people who focus on just the performance aspect tend to write vastly more code and it eventually gets messy, ...

    Just as a contrasting note to that...

    It's an unwritten rule that one should always focus on accuracy first, and then performance so I don't believe for a seond that a lot of people focus purely on performance. For me, performance (and the related scalability) is second only to accuracy and not by much (although I'm not one to try to shave a millisecond off of a 5ms proc unless it's really needed for very high volume). I've also found that high performance code is usually much shorter than performance challenged code and is actually quite a bit more readable. I do agree that there are exceptions to that rule (greatest of 3 columns is a good example of that) but I won't sacrifice performance just because someone might not be able to understand the code (that's why I use embedded comments) or to make the code shorter (although it usually is).

    Last but not least, I've also found that high performance code almost always uses fewer resources (sometimes, drastically so) and that makes it worth it, as well. I spend about 25% of my time fixing other peoples' code to make it faster and less resource intensive. Oddly enough, I also end up making it quite a bit more readable and supportable in the process.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Are there any recommended books or resource which explain the internal workings of SQL Server engine(2008, 2012)? Which for example explains why Jeff's answer scales better than the solution provided using INTERCEPT/EXCEPT. I actually found it to be one of the easier solutions to follow.

    Thanks,

    Rob

  • It boils down to the number of IO used for the solution.

    Also the CPU usage matters.

    Obviously scanning the table twice (using INTERCEPT/EXCEPT for example) uses twice as much IO than scanning the table just once. There are factors that changes this such as proper indexes.


    N 56°04'39.16"
    E 12°55'05.25"

  • For a miniscule (and mostly academic) improvement on PeSo's script,

    I've got it so the index/table only gets scanned once (instead of 3 times)

    by just using a BETWEEN instead of the IN (which the optimiser auto-expands to 3 OR conditions):

    SELECT CustomerID

    FROM #Purchase

    WHERE ProductCode BETWEEN 'A' AND 'C'

    GROUP BY CustomerID

    HAVING MIN(ProductCode) = 'A'

    AND MAX(ProductCode) = 'B'

    STATISTICS IO Before: ...Scan count 3, logical reads 225...

    STATISTICS IO After: ...Scan count 1, logical reads 217...

    Tested on the million row random table from Jeff's article.

  • The Wizard Of Oz (6/12/2014)


    For a miniscule (and mostly academic) improvement on PeSo's script,

    I've got it so the index/table only gets scanned once (instead of 3 times)

    by just using a BETWEEN instead of the IN (which the optimiser auto-expands to 3 OR conditions):

    SELECT CustomerID

    FROM #Purchase

    WHERE ProductCode BETWEEN 'A' AND 'C'

    GROUP BY CustomerID

    HAVING MIN(ProductCode) = 'A'

    AND MAX(ProductCode) = 'B'

    STATISTICS IO Before: ...Scan count 3, logical reads 225...

    STATISTICS IO After: ...Scan count 1, logical reads 217...

    Tested on the million row random table from Jeff's article.

    But that won't work to find customers that bought A & C but not B. 😀

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Luis Cazares (6/12/2014)


    But that won't work to find customers that bought A & C but not B. 😀

    Haha! Yes Luis, good thing that A & C not B wasn't in the scope of the original article 😛

  • The Wizard Of Oz (6/12/2014)


    Luis Cazares (6/12/2014)


    But that won't work to find customers that bought A & C but not B. 😀

    Haha! Yes Luis, good thing that A & C not B wasn't in the scope of the original article 😛

    "I have the answer, but it only works for spherical chickens in a vaccuum.":-D

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Normally when an example is given using values A, B and C they should be taken as algebraic variables who's contents can change. More like this example:

    DECLARE @a char(1), @b-2 char(1), @C char(1)

    SELECT @a='A', @b-2='B', @C='C'

    SELECT DISTINCT CustomerID

    FROM #Purchase P1

    WHERE P1.ProductCode = @a

    AND EXISTS(SELECT *

    FROM #Purchase P2

    WHERE P2.ProductCode = @b-2

    AND P2.CustomerID = P1.CustomerId)

    AND NOT EXISTS(SELECT *

    FROM #Purchase P3

    WHERE P3.ProductCode = @C

    AND P3.CustomerID = P1.CustomerId)

    As soon as you start substituting values in a lot of the solutions given in this thread they would stop working.

    e.g:

    SELECT @a='C', @b-2='B', @C='A'

Viewing 15 posts - 136 through 150 (of 166 total)

You must be logged in to reply to this topic. Login to reply