Can a strawberry query be done better?

  • Coming from a MySQL background, one query design I learned early on was the 'strawberry query', which got its odd name from the MySQL newsgroup. It is a very useful pattern for answering questions like "who's the best-performing salesperson of the month?"

    The solution basically involves doing a "triangular join" and filtering for NULLs. Using the salesperson example:

    SELECT
        s.SalesPerson,
        s.SalesMonth,
        s.SalesAmount,
        s.Customer
    FROM Sales AS s
    LEFT JOIN Sales AS m
        ON  s.SalesMonth  = m.SalesMonth
        AND s.SalesAmount < m.SalesAmount
    WHERE m.SalesID IS NULL;

    Note that the query is free to include other fields from the same row because there is no GROUP BY; the grouping is done implicitly via the self-join. We are guaranteed to get only one row per month: the sale with the greatest amount. No other row is greater than the greatest amount of a given month, so for that row m.SalesID must be NULL.

    This also works for getting the minimum; just reverse the inequality operator in the join criteria, as in the sketch below. Also, there is no TOP 1 ... ORDER BY, which can be problematic when you need to get multiple results (e.g. when you want to see all 12 months at once).
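    A minimal sketch of that minimum-per-month variant, using the same assumed Sales columns (illustrative only):

    -- Minimum per month: the strawberry pattern with the inequality reversed,
    -- keeping each row for which no smaller amount exists in its month.
    SELECT
        s.SalesPerson,
        s.SalesMonth,
        s.SalesAmount,
        s.Customer
    FROM Sales AS s
    LEFT JOIN Sales AS m
        ON  s.SalesMonth  = m.SalesMonth
        AND s.SalesAmount > m.SalesAmount
    WHERE m.SalesID IS NULL;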

    Now, that worked well in MySQL. However, I'm aware that T-SQL has features that don't exist in the MySQL dialect, so I wonder whether there might be a better way of doing this in T-SQL. I don't exactly trust myself to interpret execution plans, so I'd be very interested in hearing from others whether this can be outperformed by alternatives such as ROW_NUMBER() or other approaches.

    Thanks!

  • Banana-823045 (7/4/2013)



    Do you have an example of the data that is contained in the SalesMonth column?

    Also, could you provide the CREATE TABLE for the Sales table so we can see the datatypes, etc. It could very well make a difference.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hmm. I had intended it as a general question. However, I acknowledge your point about the schema having an influence. So, needing handy test data, I turned to the AdventureWorks2008R2 database.

    Here are the queries to try out:

    1) Get each salesperson's best sale ever, plus the details of that sale:

    SELECT
        o.SalesOrderID,
        o.OrderDate,
        o.AccountNumber,
        o.TotalDue,
        o.SalesPersonID,
        p.FirstName,
        p.LastName
    FROM Sales.SalesOrderHeader AS o
    INNER JOIN Person.Person AS p
        ON o.SalesPersonID = p.BusinessEntityID
    LEFT JOIN Sales.SalesOrderHeader AS m
        ON  o.SalesPersonID = m.SalesPersonID
        AND o.TotalDue < m.TotalDue
    WHERE m.TotalDue IS NULL;

    2) Get the best sale for each month across years. Note that because of the functions applied in the join criteria, the query is non-sargable, so I expect it to perform badly. Normally, if this were crucial for the business, I would probably have added indexed computed columns to keep the year and month separate (a sketch of that idea follows the query below).

    SELECT
        o.OrderDate,
        o.AccountNumber,
        o.TotalDue,
        o.SalesPersonID,
        p.FirstName,
        p.LastName
    FROM Sales.SalesOrderHeader AS o
    INNER JOIN Person.Person AS p
        ON o.SalesPersonID = p.BusinessEntityID
    LEFT JOIN Sales.SalesOrderHeader AS m
        ON  YEAR(o.OrderDate)  = YEAR(m.OrderDate)
        AND MONTH(o.OrderDate) = MONTH(m.OrderDate)
        AND o.TotalDue < m.TotalDue
    WHERE m.TotalDue IS NULL;
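    For reference, a hypothetical sketch of that computed-column idea (not applied here; the database was left unmodified, as noted below):

    -- Hypothetical: persisted computed columns make the year/month lookup
    -- sargable and indexable (YEAR/MONTH are deterministic, so PERSISTED is allowed).
    ALTER TABLE Sales.SalesOrderHeader
        ADD OrderYear  AS YEAR(OrderDate)  PERSISTED,
            OrderMonth AS MONTH(OrderDate) PERSISTED;

    CREATE INDEX IX_SalesOrderHeader_YearMonth
        ON Sales.SalesOrderHeader (OrderYear, OrderMonth, TotalDue);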

    There were no modifications to the AdventureWorks2008R2 database. Hopefully, that'll provide you with enough sample data.

    Thanks again!

  • Query 1.

    SELECT * FROM (
        SELECT
            o.SalesOrderID,
            o.OrderDate,
            o.AccountNumber,
            o.TotalDue,
            o.SalesPersonID,
            p.FirstName,
            p.LastName,
            ROW_NUMBER() OVER (PARTITION BY o.SalesPersonID ORDER BY o.TotalDue DESC) AS SalesPersonPosition
        FROM Sales.SalesOrderHeader AS o
        INNER JOIN Person.Person AS p
            ON o.SalesPersonID = p.BusinessEntityID
    ) sub
    WHERE SalesPersonPosition = 1
    ORDER BY sub.TotalDue DESC;

    Performance characteristics:

    Yours

    Table 'Person'. Scan count 9, logical reads 310, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'SalesOrderHeader'. Scan count 18, logical reads 1504, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:
    CPU time = 251 ms, elapsed time = 120 ms.

    Mine

    Table 'Person'. Scan count 9, logical reads 310, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'SalesOrderHeader'. Scan count 9, logical reads 752, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:
    CPU time = 16 ms, elapsed time = 15 ms.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Query 2.

    SELECT * FROM (
        SELECT
            o.OrderDate,
            o.AccountNumber,
            o.TotalDue,
            o.SalesPersonID,
            p.FirstName,
            p.LastName,
            ROW_NUMBER() OVER (PARTITION BY YEAR(o.OrderDate), MONTH(o.OrderDate) ORDER BY o.TotalDue DESC) AS MonthPosition
        FROM Sales.SalesOrderHeader AS o
        INNER JOIN Person.Person AS p
            ON o.SalesPersonID = p.BusinessEntityID
    ) sub
    WHERE MonthPosition = 1
    ORDER BY sub.TotalDue DESC;

    Performance characteristics:

    Yours

    Table 'Person'. Scan count 9, logical reads 310, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'SalesOrderHeader'. Scan count 18, logical reads 1504, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 36, logical reads 113790, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:
    CPU time = 2590 ms, elapsed time = 781 ms.

    Mine

    Table 'Person'. Scan count 9, logical reads 310, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'SalesOrderHeader'. Scan count 9, logical reads 752, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:
    CPU time = 47 ms, elapsed time = 17 ms.

    P.S. I added the ORDER BY to both queries to make comparison easier, as the rows come out in different orders without it due to the different ways the query processor executes the two queries.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Depending on the requirements, you might want to use RANK instead of ROW_NUMBER if you need to bring back ties instead of a single best row.
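    For example, a sketch of Gail's Query 1 with RANK swapped in (untested; any orders tied for a salesperson's top TotalDue would all come back with position 1):

    SELECT * FROM (
        SELECT
            o.SalesOrderID,
            o.OrderDate,
            o.AccountNumber,
            o.TotalDue,
            o.SalesPersonID,
            p.FirstName,
            p.LastName,
            RANK() OVER (PARTITION BY o.SalesPersonID ORDER BY o.TotalDue DESC) AS SalesPersonPosition
        FROM Sales.SalesOrderHeader AS o
        INNER JOIN Person.Person AS p
            ON o.SalesPersonID = p.BusinessEntityID
    ) sub
    WHERE SalesPersonPosition = 1   -- RANK returns every row tied for first
    ORDER BY sub.TotalDue DESC;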

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • And now we know why test data is essential for such "general" questions. 😛

    Just to add my 2 cents, you can combine the Year and Month comparison by doing the following...

    ROW_NUMBER() OVER (PARTITION BY DATEDIFF(mm,0,o.OrderDate) ORDER BY o.TotalDue desc) MonthPosition

    I couldn't test it for speed because, right now, I'm on a box that doesn't have the 2k8 version of AdventureWorks. I've usually not found it to make a huge difference in performance but I really like it because it cuts that part of the typing in half. 😀
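    (For anyone new to the trick: DATEDIFF(mm, 0, somedate) counts whole months since the zero date, 1900-01-01, so every row in the same calendar month lands in the same integer bucket. A quick demonstration you can verify yourself:)

    SELECT DATEDIFF(mm, 0, '20130704');  -- 1362
    SELECT DATEDIFF(mm, 0, '20130731');  -- 1362 (same calendar month, same bucket)
    SELECT DATEDIFF(mm, 0, '20130801');  -- 1363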

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Cool, thanks, Gail.

    From your response, I also now wonder if I'm doing it wrong by trying to compare the execution plans, since you seem to be simply looking at the elapsed time and some statistics. Am I to take it that those are more reliable benchmarks than an execution plan? I thought the point of looking at an execution plan was to see what would run generally well in most scenarios?

    Luis - Yes, I'll keep that in mind.

    Jeff - I submit to your superior logic RE: test data. 🙂

  • Since we're speaking hypothetically here and I don't have AdventureWorks to play with, I'll propose another hypothetical approach that I've seen to work well on occasion.

    Using Gail's Query 1, instead of the INNER JOIN try using a CROSS APPLY on a SELECT TOP 1 salesamount with ORDER BY salesamount DESC.

    I would throw together an example but I don't like posting code I haven't tested.

    For an example where I did something similar: http://www.sqlservercentral.com/Forums/FindPost1470314.aspx

    There are even performance results later in the thread.
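    For illustration only, an untested sketch of the shape of that query against the same AdventureWorks tables:

    -- Untested sketch: best order per salesperson via CROSS APPLY ... TOP 1
    SELECT best.SalesOrderID,
           best.OrderDate,
           best.AccountNumber,
           best.TotalDue,
           sp.SalesPersonID,
           p.FirstName,
           p.LastName
    FROM (SELECT DISTINCT SalesPersonID
          FROM Sales.SalesOrderHeader
          WHERE SalesPersonID IS NOT NULL) AS sp
    INNER JOIN Person.Person AS p
        ON sp.SalesPersonID = p.BusinessEntityID
    CROSS APPLY (SELECT TOP 1 o.SalesOrderID, o.OrderDate, o.AccountNumber, o.TotalDue
                 FROM Sales.SalesOrderHeader AS o
                 WHERE o.SalesPersonID = sp.SalesPersonID
                 ORDER BY o.TotalDue DESC) AS best;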

    Edit: Note that this method will not return both rows if there's a tie for winner.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
    Since random numbers are too important to be left to chance, let's generate some!
    Learn to understand recursive CTEs by example.

  • There are multiple ways to do this:

    Rank or row number

    Max in a subquery and join to it (sketched after this list)

    Top 1 in a cross apply

    All will likely be way faster than a triangular join. Which is best among them often depends on the volume and distribution of the data.
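    Of those, the only form not yet shown in the thread is the max-in-a-subquery; a rough sketch against the same tables (illustrative, not tuned):

    SELECT o.SalesOrderID, o.OrderDate, o.AccountNumber, o.TotalDue,
           o.SalesPersonID, p.FirstName, p.LastName
    FROM Sales.SalesOrderHeader AS o
    INNER JOIN Person.Person AS p
        ON o.SalesPersonID = p.BusinessEntityID
    INNER JOIN (SELECT SalesPersonID, MAX(TotalDue) AS MaxTotalDue
                FROM Sales.SalesOrderHeader
                GROUP BY SalesPersonID) AS mx
        ON  o.SalesPersonID = mx.SalesPersonID
        AND o.TotalDue = mx.MaxTotalDue;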

    As for comparing performance characteristics: if you want to see which runs faster and which uses less CPU, then see which runs faster and uses less CPU. The execution plan shows estimated costs of operators and procedures, not durations and CPU.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Thanks. I had the impression that one should use the execution plan to get an idea of how well a query will perform across different applications, whereas timing the results would be too specific to one particular data set/schema.

    Also, now that it's obvious that the strawberry query isn't the best choice, I wonder what else a triangular join would be good for. Jeff Moden had cited them as one of the "hidden RBARs" but also mentioned in passing that they can be useful in rare instances. Does anyone know of such instances?

    Thanks again!

  • Banana-823045 (7/6/2013)



    Like Gail said, the values in execution plans are frequently estimates (unless they say "actual"), which means that "% of Batch" for a multi-query comparison run is almost always an estimate. "% of Batch" should never be used as the definitive measure of which query is "best" for either performance or resource usage. In fact, it's frequently very "wrong" (estimates aren't wrong, just sometimes skewed) and can show exactly the opposite of what is true during a run.

    For example, here's a classic case of why many people come to the erroneous conclusion that a recursive CTE that counts is lightning fast compared to many of the other methods, which are actually much faster than the recursive method. Run the following with the actual execution plan on and note the "% of Batch" for both queries. It shows the rCTE taking 0% of the time and the "Tally" method taking 100% of the time. Yet, if you look at the printouts in the Messages tab, you'll find that the opposite is true.

    /****************************************************************************************
     Purpose:
     This code demonstrates that the estimated and actual execution plans in SQL Server can
     be 100% INCORRECT and that the execution plan should only be relied on to provide hints
     as to what may be wrong with a query rather than an absolute indication. This code runs
     in SQL Server 2005 only.

     The code creates 30 years worth of dates starting with 2000-01-01 using two different
     methods. The first method uses a recursive CTE and the second method uses a "Tally"
     structure. The output of each method is directed to a "throw-away" variable to take
     display delays out of the picture.

     Please check both the actual and estimated execution plans and compare the % of batch.
    ****************************************************************************************/
    SET NOCOUNT ON
    --=======================================================================================
    --      Recursive method shown by (Name with-held)
    --=======================================================================================
    PRINT '========== Recursive method =========='
    --===== Turn on some performance counters ===============================================
    SET STATISTICS IO,TIME ON;
    DECLARE @Bitbucket DATETIME; --Holds display output so display times aren't measured.
    --===== Execute the code being tested ===================================================
    DECLARE @DateVal DATETIME;
    SET @DateVal = '2000-01-01';

    WITH rCTE AS
    (
        SELECT @DateVal AS DateVal
        UNION ALL
        SELECT DateVal = DATEADD(dd,1,DateVal)
        FROM rCTE
        WHERE DATEADD(dd,1,DateVal) < DATEADD(yy, 30, @DateVal)
    )
    SELECT @Bitbucket = d.DateVal
    FROM rCTE d
    OPTION (MAXRECURSION 0);
    --===== Turn off the performance counters and print a separator =========================
    SET STATISTICS TIME,IO OFF;
    PRINT REPLICATE('=',90);
    GO
    --=======================================================================================
    --      "Tally" structure method
    --=======================================================================================
    PRINT '========== Tally table method =========='
    --===== Turn on some performance counters ===============================================
    SET STATISTICS IO,TIME ON;
    DECLARE @Bitbucket DATETIME; --Holds display output so display times aren't measured.
    --===== Execute the code being tested ===================================================
    DECLARE @StartDate AS DATETIME;
    SET @StartDate = '2000-01-01';

    SELECT TOP (DATEDIFF(dd,@StartDate,DATEADD(yy,30,@StartDate)))
           @Bitbucket = DATEADD(dd,ROW_NUMBER() OVER (ORDER BY (SELECT NULL))-1,@StartDate)
    FROM sys.all_columns ac1
    CROSS JOIN sys.all_columns ac2;
    --===== Turn off the performance counters and print a separator =========================
    SET STATISTICS TIME,IO OFF;
    PRINT REPLICATE('=',90);
    GO

    Here's what I get on my older desktop box...

    ========== Recursive method ==========

    SQL Server Execution Times:
    CPU time = 0 ms, elapsed time = 0 ms.
    Table 'Worktable'. Scan count 2, logical reads 65749, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:
    CPU time = 485 ms, elapsed time = 569 ms.
    ==========================================================================================

    ========== Tally table method ==========

    SQL Server Execution Times:
    CPU time = 0 ms, elapsed time = 0 ms.
    Table 'syscolrdb'. Scan count 1, logical reads 98, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'syscolpars'. Scan count 2, logical reads 7, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:
    CPU time = 15 ms, elapsed time = 26 ms.
    ==========================================================================================

    I don't have an example for it, but there are also times when some of the nodes in the execution plan will show impossibilities like 114,387% for a value.

    I use the execution plan all the time to help me troubleshoot poorly performing queries. I never use the costs or "% of Batch" to determine which will actually perform better. I also don't always trust SET STATISTICS; it can outright lie depending on what is being done. Please see the following article, where SET STATISTICS actually makes scalar UDFs look much worse than they actually are (they're still much worse than other methods, but not as bad as SET STATISTICS makes them look).

    http://www.sqlservercentral.com/articles/T-SQL/91724/

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (7/6/2013)


    For example, here's a classic case of why many people come to the erroneous conclusion that a recursive CTE that counts is lightning fast compared to many of the other methods which are actually much faster than the recursive method. Run the following with the actual execution plan on and note the "% of Batch" for both queries. It shows the rCTE take 0% of the time and the "Tally" method taking 100% of the time. Yet, if you look at the print outs in the Messages tab, you'll find that the opposite is true.

    In this particular case it's because the optimiser costs the recursive CTE based on either just the anchor member or on the anchor member and one recursive step (can't remember which). Hence the estimates are wildly inaccurate.

    Exec plans tell you how the query runs. Execution statistics tell you how the query performs.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (7/6/2013)



    I couldn't remember which one either and so I didn't bring it up, but you're absolutely correct.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • If you look in the MySQL manual, you'll see that your 'strawberry' query is offered as one method for solving this kind of problem. With the advent of subqueries, however, the other methods provided on the same page are inherently faster (on appropriately indexed tables). In tests, the 'uncorrelated subquery' method usually wins out.
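    A sketch of that uncorrelated-subquery form, using the Sales example from the first post (assumed columns, untested):

    SELECT s.SalesPerson, s.SalesMonth, s.SalesAmount, s.Customer
    FROM Sales AS s
    INNER JOIN (SELECT SalesMonth, MAX(SalesAmount) AS MaxAmount
                FROM Sales
                GROUP BY SalesMonth) AS m
        ON  s.SalesMonth  = m.SalesMonth
        AND s.SalesAmount = m.MaxAmount;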

    Strawberry
