

SSCAddicted
Group: General Forum Members
Last Login: Yesterday @ 4:11 PM
Points: 479,
Visits: 314


Jonathan AC Roberts (6/12/2014) Normally when an example is given using values A, B and C they should be taken as algebraic variables who's contents can change.
Agreed, it gives a solution that can be used in many situations.
I tested Jeff's original solution and 2 other solutions on a 10millionrow random table (using the code attached in Jeff's article):
DECLARE @A CHAR(1), @B CHAR(1), @C CHAR(1)
SELECT @A = 'X', @B = 'F', @C = 'M'
SET NOCOUNT ON SET STATISTICS IO ON SET STATISTICS TIME ON
PRINT '  Jeff''s Original '
SELECT CustomerID FROM #Purchase WHERE ProductCode IN (@A, @B) GROUP BY CustomerID HAVING COUNT(DISTINCT ProductCode) = 2
EXCEPT
SELECT CustomerID FROM #Purchase WHERE ProductCode = @C

PRINT '  NOT EXISTS '
SELECT CustomerID FROM #Purchase P1 WHERE ProductCode IN (@A, @B) AND NOT EXISTS ( SELECT 1 FROM #Purchase P2 WHERE ProductCode = @C AND P1.CustomerID = P2.CustomerID ) GROUP BY CustomerID HAVING COUNT(DISTINCT ProductCode) = 2

PRINT '  SUM(CASE...) '
SELECT CustomerID FROM #Purchase P1 WHERE ProductCode IN (@A, @B, @C) GROUP BY CustomerID HAVING SUM(CASE ProductCode WHEN @A THEN 1 ELSE 0 END) > 0 AND SUM(CASE ProductCode WHEN @B THEN 1 ELSE 0 END) > 0 AND SUM(CASE ProductCode WHEN @C THEN 1 ELSE 0 END) = 0
SET STATISTICS TIME OFF SET STATISTICS IO OFF SET NOCOUNT OFF
Results (cleaned for readability):
 Jeff's Original 
Table '#Purchase'. Scan count 49960, logical reads 151394.
SQL Server Execution Times: CPU time = 406 ms, elapsed time = 412 ms.
 NOT EXISTS 
Table '#Purchase'. Scan count 3, logical reads 2160.
SQL Server Execution Times: CPU time = 296 ms, elapsed time = 288 ms.
 SUM(CASE...) 
Table '#Purchase'. Scan count 3, logical reads 2160.
SQL Server Execution Times: CPU time = 655 ms, elapsed time = 661 ms.




SSCrazy
Group: General Forum Members
Last Login: Tuesday, April 14, 2015 6:45 AM
Points: 2,403,
Visits: 3,431


The Wizard Of Oz (6/13/2014) I tested Jeff's original solution and 2 other solutions on a 10millionrow random table (using the code attached in Jeff's article):They are not equivalent and return different result.
COUNT(DISTINCT ...) = 2
and
SUM(CASE WHEN ... THEN ... ELSE ... END) > 0
Will return different results.
N 56°04'39.16" E 12°55'05.25"




SSCAddicted
Group: General Forum Members
Last Login: Yesterday @ 4:11 PM
Points: 479,
Visits: 314


SwePeso (6/13/2014) They are not equivalent and return different result.
Hmm, I'm going to be sceptical for now because I haven't found a case where any of the 3 solutions brought back different results What settings did you use for @A, @B and @C so I can replicate?




SSCrazy
Group: General Forum Members
Last Login: Tuesday, April 14, 2015 6:45 AM
Points: 2,403,
Visits: 3,431


You don't need to.
The COUNT(DISTINCT ...) approach will only return the groups that has exactly one A and one B. The SUM(CASE ...) approach will return all groups having at least one A and at least one B. It will also return the groups having 20 A's and 14 B's.
N 56°04'39.16" E 12°55'05.25"




SSCAddicted
Group: General Forum Members
Last Login: Yesterday @ 4:11 PM
Points: 479,
Visits: 314


SwePeso (6/13/2014) The COUNT(DISTINCT ...) approach will only return the groups that has exactly one A and one B. Not sure about this, I thought the whole point of the DISTINCT is that it collapses any and all duplicates in the group, so it returns all groups having at least one A and at least one B, just like the SUM(CASE ...) approach.
SwePeso (6/13/2014) The SUM(CASE ...) approach will return all groups having at least one A and at least one B. It will also return the groups having 20 A's and 14 B's. Yes, I thought this was exactly what we needed?




SSCrazy
Group: General Forum Members
Last Login: Tuesday, April 14, 2015 6:45 AM
Points: 2,403,
Visits: 3,431


So, depending on the distribution of the sample data, the two different queries will return different results. The COUNT(DISTINCT ... ) will return a fewer number of rows, than the SUM(CASE ...) will.
N 56°04'39.16" E 12°55'05.25"




SSCAddicted
Group: General Forum Members
Last Login: Yesterday @ 4:11 PM
Points: 479,
Visits: 314


I don't think so, because before that point we've already done our GROUP BY CustomerID, so really the 2 methods we're comparing are:
... WHERE ProductCode IN (@A, @B) people that bought A or B or AB AND NOT EXISTS ( SELECT 1 FROM #Purchase P2 WHERE ProductCode = @C AND P1.CustomerID = P2.CustomerID ) people that bought (A or B or AB) and (not C) GROUP BY CustomerID HAVING COUNT(DISTINCT ProductCode) = 2 people that bought (AB) and (not C)
... WHERE ProductCode IN (@A, @B, @C) people that bought (A or B or C or AB or AC or BC or ABC) GROUP BY CustomerID HAVING SUM(CASE ProductCode WHEN @A THEN 1 ELSE 0 END) > 0 AND SUM(CASE ProductCode WHEN @B THEN 1 ELSE 0 END) > 0 AND SUM(CASE ProductCode WHEN @C THEN 1 ELSE 0 END) = 0 people that bought (AB) and (not C)
Either I need more coffee today (99% likely), or you do Peso (1%). I'm eager to learn more about SQL either way




Forum Newbie
Group: General Forum Members
Last Login: Sunday, June 29, 2014 6:46 AM
Points: 8,
Visits: 53


I'm finding that the results are the same for all 3 queries regardless of the input parameters and, as far as I can see, logically they should be.
SSCrazy, Can you supply some example data to highlight what you are saying here?
I'm pretty sceptical as well and want some evidence that what you are saying is correct.




SSCrazy Eights
Group: General Forum Members
Last Login: Today @ 5:38 PM
Points: 8,670,
Visits: 7,926





SSCrazy
Group: General Forum Members
Last Login: Tuesday, April 14, 2015 6:45 AM
Points: 2,403,
Visits: 3,431


No. You are right. I am the one needing more coffee. Just a mind lapse and disregarding the DISTINCT for some reason.
N 56°04'39.16" E 12°55'05.25"



