SQL Clone
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


Can a strawberry query be done better?


Can a strawberry query be done better?

Author
Message
Banana-823045
Banana-823045
SSC Journeyman
SSC Journeyman (92 reputation)SSC Journeyman (92 reputation)SSC Journeyman (92 reputation)SSC Journeyman (92 reputation)SSC Journeyman (92 reputation)SSC Journeyman (92 reputation)SSC Journeyman (92 reputation)SSC Journeyman (92 reputation)

Group: General Forum Members
Points: 92 Visits: 556
Coming from MySQL background, one query design I learned early on was 'strawberry query', which got its odd name from the MySQL newsgroup. It is a very useful pattern for solving the problem of answering questions like "who's the best performing salesperson of month?" or similar questions.

The solution basically involves doing a "triangular join" and filtering for NULLs. Using salesperson example:

SELECT
s.SalesPerson,
s.SalesMonth,
s.SalesAmount,
s.Customer
FROM Sales AS s
LEFT JOIN Sales AS m
ON s.SalesMonth = m.SalesMonth
AND s.SalesAmount < m.SalesAmount
WHERE m.SalesID IS NULL;



Note that the query is free to include other fields from the same row because there is no GROUP BY; the grouping is implicitly done via the self-join. We are guaranteed to get only one possible row each month for a given sales which also is the greatest amount. There is no any other row that's greater than the greatest amount of given month so m.SalesID must be NULL.

This also works for getting the minimum; just reverse the inequality operator on the join criteria. Also, there is no TOP 1 ... ORDER BY which can be problematic when you need to get multiple results (e.g. you want to see all 12 months at once.)

Now, that worked well with MySQL. However, I'm also aware that T-SQL language has some features that doesn't exist in the MySQL dialect and also whether there might be a better way of doing it in T-SQL. I don't exactly trust myself to interpret the best execution plans so I'd be very interested in hearing from others whether this can be outperformed by alternatives such as using ROW_NUMBER() or whatever other approaches.

Thanks!
Jeff Moden
Jeff Moden
SSC Guru
SSC Guru (91K reputation)SSC Guru (91K reputation)SSC Guru (91K reputation)SSC Guru (91K reputation)SSC Guru (91K reputation)SSC Guru (91K reputation)SSC Guru (91K reputation)SSC Guru (91K reputation)

Group: General Forum Members
Points: 91048 Visits: 41151
Banana-823045 (7/4/2013)
Coming from MySQL background, one query design I learned early on was 'strawberry query', which got its odd name from the MySQL newsgroup. It is a very useful pattern for solving the problem of answering questions like "who's the best performing salesperson of month?" or similar questions.

The solution basically involves doing a "triangular join" and filtering for NULLs. Using salesperson example:

SELECT
s.SalesPerson,
s.SalesMonth,
s.SalesAmount,
s.Customer
FROM Sales AS s
LEFT JOIN Sales AS m
ON s.SalesMonth = m.SalesMonth
AND s.SalesAmount < m.SalesAmount
WHERE m.SalesID IS NULL;



Note that the query is free to include other fields from the same row because there is no GROUP BY; the grouping is implicitly done via the self-join. We are guaranteed to get only one possible row each month for a given sales which also is the greatest amount. There is no any other row that's greater than the greatest amount of given month so m.SalesID must be NULL.

This also works for getting the minimum; just reverse the inequality operator on the join criteria. Also, there is no TOP 1 ... ORDER BY which can be problematic when you need to get multiple results (e.g. you want to see all 12 months at once.)

Now, that worked well with MySQL. However, I'm also aware that T-SQL language has some features that doesn't exist in the MySQL dialect and also whether there might be a better way of doing it in T-SQL. I don't exactly trust myself to interpret the best execution plans so I'd be very interested in hearing from others whether this can be outperformed by alternatives such as using ROW_NUMBER() or whatever other approaches.

Thanks!


Do you have an example of the data that is contained in the SalesMonth column?

Also, could you provide the CREATE TABLE for the Sales table so we can see the datatypes, etc. It could very well make a difference.

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
Banana-823045
Banana-823045
SSC Journeyman
SSC Journeyman (92 reputation)SSC Journeyman (92 reputation)SSC Journeyman (92 reputation)SSC Journeyman (92 reputation)SSC Journeyman (92 reputation)SSC Journeyman (92 reputation)SSC Journeyman (92 reputation)SSC Journeyman (92 reputation)

Group: General Forum Members
Points: 92 Visits: 556
Hmm. I had intended it as a general question. However, I acknowledge your point about schema having influences. So, for a need of a handy test data, I turned to AdventureWorks2008R2 database.

Here's the queries to try out:

1) Get sales person's best sales ever and detail about this sales:
SELECT 
o.SalesOrderID,
o.OrderDate,
o.AccountNumber,
o.TotalDue,
o.SalesPersonID,
p.FirstName,
p.LastName
FROM Sales.SalesOrderHeader AS o
INNER JOIN Person.Person AS p
ON o.SalesPersonID = p.BusinessEntityID
LEFT JOIN Sales.SalesOrderHeader AS m
ON o.SalesPersonID = m.SalesPersonID
AND o.TotalDue < m.TotalDue
WHERE m.TotalDue IS NULL;



2) Get the best sales for each month for different years. Note that because of functions applied in the join criteria, the query is non-sargable so I expect it to perform badly. Normally, if that was crucial for business, I probably would have added indexed computed field to keep the year and month separate.
SELECT 
o.OrderDate,
o.AccountNumber,
o.TotalDue,
o.SalesPersonID,
p.FirstName,
p.LastName
FROM Sales.SalesOrderHeader AS o
INNER JOIN Person.Person AS p
ON o.SalesPersonID = p.BusinessEntityID
LEFT JOIN Sales.SalesOrderHeader AS m
ON YEAR(o.OrderDate) = YEAR(m.OrderDate)
AND MONTH(o.OrderDate) = MONTH(m.OrderDate)
AND o.TotalDue < m.TotalDue
WHERE m.TotalDue IS NULL;



There were no modifications to the AdventureWorks2008R2 database. Hopefully, that'll provide you with enough sample data.

Thanks again!
GilaMonster
GilaMonster
SSC Guru
SSC Guru (91K reputation)SSC Guru (91K reputation)SSC Guru (91K reputation)SSC Guru (91K reputation)SSC Guru (91K reputation)SSC Guru (91K reputation)SSC Guru (91K reputation)SSC Guru (91K reputation)

Group: General Forum Members
Points: 91879 Visits: 45285
Query 1.

SELECT * FROM (
SELECT o.SalesOrderID,
o.OrderDate,
o.AccountNumber,
o.TotalDue,
o.SalesPersonID,
p.FirstName,
p.LastName,
ROW_NUMBER() OVER (PARTITION BY o.SalesPersonID ORDER BY o.TotalDue desc) SalesPersonPosition
FROM Sales.SalesOrderHeader AS o
INNER JOIN Person.Person AS p
ON o.SalesPersonID = p.BusinessEntityID
) sub
WHERE SalesPersonPosition = 1
ORDER BY sub.TotalDue desc



Performance characteristics:

Your
Table 'Person'. Scan count 9, logical reads 310, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'SalesOrderHeader'. Scan count 18, logical reads 1504, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 251 ms, elapsed time = 120 ms.


Mine
Table 'Person'. Scan count 9, logical reads 310, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'SalesOrderHeader'. Scan count 9, logical reads 752, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 16 ms, elapsed time = 15 ms.


Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass


GilaMonster
GilaMonster
SSC Guru
SSC Guru (91K reputation)SSC Guru (91K reputation)SSC Guru (91K reputation)SSC Guru (91K reputation)SSC Guru (91K reputation)SSC Guru (91K reputation)SSC Guru (91K reputation)SSC Guru (91K reputation)

Group: General Forum Members
Points: 91879 Visits: 45285
Query 2.

SELECT * FROM (
SELECT
o.OrderDate,
o.AccountNumber,
o.TotalDue,
o.SalesPersonID,
p.FirstName,
p.LastName,
ROW_NUMBER() OVER (PARTITION BY YEAR(o.OrderDate), MONTH(o.OrderDate) ORDER BY o.TotalDue desc) MonthPosition
FROM Sales.SalesOrderHeader AS o
INNER JOIN Person.Person AS p
ON o.SalesPersonID = p.BusinessEntityID
) sub
WHERE MonthPosition = 1
ORDER BY sub.TotalDue desc



Performance characteristics:

Yours
Table 'Person'. Scan count 9, logical reads 310, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'SalesOrderHeader'. Scan count 18, logical reads 1504, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 36, logical reads 113790, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 2590 ms, elapsed time = 781 ms.


Mine
Table 'Person'. Scan count 9, logical reads 310, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'SalesOrderHeader'. Scan count 9, logical reads 752, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 47 ms, elapsed time = 17 ms.


p.s. I added the order by (to both queries) to make comparison easier as the order of rows comes out differently without the order by due to the different ways the query processor executes them

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass


Luis Cazares
Luis Cazares
SSCoach
SSCoach (17K reputation)SSCoach (17K reputation)SSCoach (17K reputation)SSCoach (17K reputation)SSCoach (17K reputation)SSCoach (17K reputation)SSCoach (17K reputation)SSCoach (17K reputation)

Group: General Forum Members
Points: 17440 Visits: 19122
Depending on the requirements, you might want to use RANK instead of ROW_NUMBER if you need to bring up ties instead of a single best.


Luis C.
General Disclaimer:
Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?


How to post data/code on a forum to get the best help: Option 1 / Option 2
Jeff Moden
Jeff Moden
SSC Guru
SSC Guru (91K reputation)SSC Guru (91K reputation)SSC Guru (91K reputation)SSC Guru (91K reputation)SSC Guru (91K reputation)SSC Guru (91K reputation)SSC Guru (91K reputation)SSC Guru (91K reputation)

Group: General Forum Members
Points: 91048 Visits: 41151
And now we know why test data is essential for such "general" questions. :-P

Just to add my 2 cents, you can combine the Year and Month comparison by doing the following...

ROW_NUMBER() OVER (PARTITION BY DATEDIFF(mm,0,o.OrderDate) ORDER BY o.TotalDue desc) MonthPosition




I couldn't test it for speed because, right now, I'm on a box that doesn't have the 2k8 version of AdventureWorks. I've usually not found it to make a huge difference in performance but I really like it because it cuts that part of the typing in half. :-D

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
Banana-823045
Banana-823045
SSC Journeyman
SSC Journeyman (92 reputation)SSC Journeyman (92 reputation)SSC Journeyman (92 reputation)SSC Journeyman (92 reputation)SSC Journeyman (92 reputation)SSC Journeyman (92 reputation)SSC Journeyman (92 reputation)SSC Journeyman (92 reputation)

Group: General Forum Members
Points: 92 Visits: 556
Cool, thanks, Gail.

From your response, I also now wonder if I'm doing it wrong by trying to compare the execution plans since you seems to be simply looking at the time elapsed and some statistics. Am I to take that those are more reliable benchmarks than an execution plan? I thought the point of looking at an execution plan was to see what would run generally well in most scenarios?

Luis - Yes, I'll keep that in mind.

Jeff - I submit to your superior logic RE: test data. Smile
dwain.c
dwain.c
SSCertifiable
SSCertifiable (7.7K reputation)SSCertifiable (7.7K reputation)SSCertifiable (7.7K reputation)SSCertifiable (7.7K reputation)SSCertifiable (7.7K reputation)SSCertifiable (7.7K reputation)SSCertifiable (7.7K reputation)SSCertifiable (7.7K reputation)

Group: General Forum Members
Points: 7735 Visits: 6431
Since we're speaking hypothetically here and I don't have AdventureWorks to play with, I'll propose another hypothetical approach that I've seen to work well on occasion.

Using Gail's Query 1, instead of the INNER JOIN try using a CROSS APPLY on a SELECT TOP 1 salesamount with ORDER BY salesamount DESC.

I would throw together an example but I don't like posting code I haven't tested.

For an example where I did something similar: http://www.sqlservercentral.com/Forums/FindPost1470314.aspx

There are even performance results later in the thread.

Edit: Note that this method will not return both rows if there's a tie for winner.


My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.


Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
Since random numbers are too important to be left to chance, let's generate some!
Learn to understand recursive CTEs by example.
Splitting strings based on patterns can be fast!
My temporal SQL musings: Calendar Tables, an Easter SQL, Time Slots and Self-maintaining, Contiguous Effective Dates in Temporal Tables
GilaMonster
GilaMonster
SSC Guru
SSC Guru (91K reputation)SSC Guru (91K reputation)SSC Guru (91K reputation)SSC Guru (91K reputation)SSC Guru (91K reputation)SSC Guru (91K reputation)SSC Guru (91K reputation)SSC Guru (91K reputation)

Group: General Forum Members
Points: 91879 Visits: 45285
There are multiple different ways to do this.

Rank or row number
Max in a subquery and join to it
Top 1 in a cross apply

All will likely be way faster than a triangular join. Which is best among them often depends on the volume and distribution of the data

As for comparing performance characteristics, if you want to see which run faster, which uses less CPU, then see which runs faster and uses less CPU. The execution plan shows estimated costs of operators and procedures, not durations and CPU.

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass


Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum

































































































































































SQLServerCentral


Search