Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase 12»»

Can a strawberry query be done better? Expand / Collapse
Author
Message
Posted Thursday, July 4, 2013 2:49 PM
SSC Journeyman

SSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC Journeyman

Group: General Forum Members
Last Login: Monday, September 8, 2014 7:19 AM
Points: 76, Visits: 439
Coming from MySQL background, one query design I learned early on was 'strawberry query', which got its odd name from the MySQL newsgroup. It is a very useful pattern for solving the problem of answering questions like "who's the best performing salesperson of month?" or similar questions.

The solution basically involves doing a "triangular join" and filtering for NULLs. Using salesperson example:

SELECT
s.SalesPerson,
s.SalesMonth,
s.SalesAmount,
s.Customer
FROM Sales AS s
LEFT JOIN Sales AS m
ON s.SalesMonth = m.SalesMonth
AND s.SalesAmount < m.SalesAmount
WHERE m.SalesID IS NULL;

Note that the query is free to include other fields from the same row because there is no GROUP BY; the grouping is implicitly done via the self-join. We are guaranteed to get only one possible row each month for a given sales which also is the greatest amount. There is no any other row that's greater than the greatest amount of given month so m.SalesID must be NULL.

This also works for getting the minimum; just reverse the inequality operator on the join criteria. Also, there is no TOP 1 ... ORDER BY which can be problematic when you need to get multiple results (e.g. you want to see all 12 months at once.)

Now, that worked well with MySQL. However, I'm also aware that T-SQL language has some features that doesn't exist in the MySQL dialect and also whether there might be a better way of doing it in T-SQL. I don't exactly trust myself to interpret the best execution plans so I'd be very interested in hearing from others whether this can be outperformed by alternatives such as using ROW_NUMBER() or whatever other approaches.

Thanks!
Post #1470510
Posted Thursday, July 4, 2013 3:30 PM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: Today @ 8:51 PM
Points: 35,606, Visits: 32,190
Banana-823045 (7/4/2013)
Coming from MySQL background, one query design I learned early on was 'strawberry query', which got its odd name from the MySQL newsgroup. It is a very useful pattern for solving the problem of answering questions like "who's the best performing salesperson of month?" or similar questions.

The solution basically involves doing a "triangular join" and filtering for NULLs. Using salesperson example:

SELECT
s.SalesPerson,
s.SalesMonth,
s.SalesAmount,
s.Customer
FROM Sales AS s
LEFT JOIN Sales AS m
ON s.SalesMonth = m.SalesMonth
AND s.SalesAmount < m.SalesAmount
WHERE m.SalesID IS NULL;

Note that the query is free to include other fields from the same row because there is no GROUP BY; the grouping is implicitly done via the self-join. We are guaranteed to get only one possible row each month for a given sales which also is the greatest amount. There is no any other row that's greater than the greatest amount of given month so m.SalesID must be NULL.

This also works for getting the minimum; just reverse the inequality operator on the join criteria. Also, there is no TOP 1 ... ORDER BY which can be problematic when you need to get multiple results (e.g. you want to see all 12 months at once.)

Now, that worked well with MySQL. However, I'm also aware that T-SQL language has some features that doesn't exist in the MySQL dialect and also whether there might be a better way of doing it in T-SQL. I don't exactly trust myself to interpret the best execution plans so I'd be very interested in hearing from others whether this can be outperformed by alternatives such as using ROW_NUMBER() or whatever other approaches.

Thanks!


Do you have an example of the data that is contained in the SalesMonth column?

Also, could you provide the CREATE TABLE for the Sales table so we can see the datatypes, etc. It could very well make a difference.


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1470518
Posted Thursday, July 4, 2013 4:03 PM
SSC Journeyman

SSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC Journeyman

Group: General Forum Members
Last Login: Monday, September 8, 2014 7:19 AM
Points: 76, Visits: 439
Hmm. I had intended it as a general question. However, I acknowledge your point about schema having influences. So, for a need of a handy test data, I turned to AdventureWorks2008R2 database.

Here's the queries to try out:

1) Get sales person's best sales ever and detail about this sales:
SELECT 
o.SalesOrderID,
o.OrderDate,
o.AccountNumber,
o.TotalDue,
o.SalesPersonID,
p.FirstName,
p.LastName
FROM Sales.SalesOrderHeader AS o
INNER JOIN Person.Person AS p
ON o.SalesPersonID = p.BusinessEntityID
LEFT JOIN Sales.SalesOrderHeader AS m
ON o.SalesPersonID = m.SalesPersonID
AND o.TotalDue < m.TotalDue
WHERE m.TotalDue IS NULL;

2) Get the best sales for each month for different years. Note that because of functions applied in the join criteria, the query is non-sargable so I expect it to perform badly. Normally, if that was crucial for business, I probably would have added indexed computed field to keep the year and month separate.
SELECT 
o.OrderDate,
o.AccountNumber,
o.TotalDue,
o.SalesPersonID,
p.FirstName,
p.LastName
FROM Sales.SalesOrderHeader AS o
INNER JOIN Person.Person AS p
ON o.SalesPersonID = p.BusinessEntityID
LEFT JOIN Sales.SalesOrderHeader AS m
ON YEAR(o.OrderDate) = YEAR(m.OrderDate)
AND MONTH(o.OrderDate) = MONTH(m.OrderDate)
AND o.TotalDue < m.TotalDue
WHERE m.TotalDue IS NULL;

There were no modifications to the AdventureWorks2008R2 database. Hopefully, that'll provide you with enough sample data.

Thanks again!
Post #1470523
Posted Thursday, July 4, 2013 4:15 PM


SSC-Forever

SSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-Forever

Group: General Forum Members
Last Login: Today @ 1:13 PM
Points: 40,428, Visits: 36,879
Query 1.

SELECT * FROM (
SELECT o.SalesOrderID,
o.OrderDate,
o.AccountNumber,
o.TotalDue,
o.SalesPersonID,
p.FirstName,
p.LastName,
ROW_NUMBER() OVER (PARTITION BY o.SalesPersonID ORDER BY o.TotalDue desc) SalesPersonPosition
FROM Sales.SalesOrderHeader AS o
INNER JOIN Person.Person AS p
ON o.SalesPersonID = p.BusinessEntityID
) sub
WHERE SalesPersonPosition = 1
ORDER BY sub.TotalDue desc

Performance characteristics:

Your
Table 'Person'. Scan count 9, logical reads 310, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'SalesOrderHeader'. Scan count 18, logical reads 1504, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 251 ms, elapsed time = 120 ms.


Mine
Table 'Person'. Scan count 9, logical reads 310, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'SalesOrderHeader'. Scan count 9, logical reads 752, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 16 ms, elapsed time = 15 ms.



Gail Shaw
Microsoft Certified Master: SQL Server 2008, MVP
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Post #1470526
Posted Thursday, July 4, 2013 4:21 PM


SSC-Forever

SSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-Forever

Group: General Forum Members
Last Login: Today @ 1:13 PM
Points: 40,428, Visits: 36,879
Query 2.

SELECT * FROM (
SELECT
o.OrderDate,
o.AccountNumber,
o.TotalDue,
o.SalesPersonID,
p.FirstName,
p.LastName,
ROW_NUMBER() OVER (PARTITION BY YEAR(o.OrderDate), MONTH(o.OrderDate) ORDER BY o.TotalDue desc) MonthPosition
FROM Sales.SalesOrderHeader AS o
INNER JOIN Person.Person AS p
ON o.SalesPersonID = p.BusinessEntityID
) sub
WHERE MonthPosition = 1
ORDER BY sub.TotalDue desc

Performance characteristics:

Yours
Table 'Person'. Scan count 9, logical reads 310, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'SalesOrderHeader'. Scan count 18, logical reads 1504, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 36, logical reads 113790, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 2590 ms, elapsed time = 781 ms.


Mine
Table 'Person'. Scan count 9, logical reads 310, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'SalesOrderHeader'. Scan count 9, logical reads 752, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 47 ms, elapsed time = 17 ms.


p.s. I added the order by (to both queries) to make comparison easier as the order of rows comes out differently without the order by due to the different ways the query processor executes them



Gail Shaw
Microsoft Certified Master: SQL Server 2008, MVP
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Post #1470527
Posted Thursday, July 4, 2013 4:50 PM


Hall of Fame

Hall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of Fame

Group: General Forum Members
Last Login: Today @ 9:04 PM
Points: 3,926, Visits: 8,921
Depending on the requirements, you might want to use RANK instead of ROW_NUMBER if you need to bring up ties instead of a single best.


Luis C.
Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

Forum Etiquette: How to post data/code on a forum to get the best help
Post #1470530
Posted Thursday, July 4, 2013 5:21 PM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: Today @ 8:51 PM
Points: 35,606, Visits: 32,190
And now we know why test data is essential for such "general" questions.

Just to add my 2 cents, you can combine the Year and Month comparison by doing the following...

ROW_NUMBER() OVER (PARTITION BY DATEDIFF(mm,0,o.OrderDate) ORDER BY o.TotalDue desc) MonthPosition


I couldn't test it for speed because, right now, I'm on a box that doesn't have the 2k8 version of AdventureWorks. I've usually not found it to make a huge difference in performance but I really like it because it cuts that part of the typing in half.


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1470535
Posted Thursday, July 4, 2013 5:32 PM
SSC Journeyman

SSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC Journeyman

Group: General Forum Members
Last Login: Monday, September 8, 2014 7:19 AM
Points: 76, Visits: 439
Cool, thanks, Gail.

From your response, I also now wonder if I'm doing it wrong by trying to compare the execution plans since you seems to be simply looking at the time elapsed and some statistics. Am I to take that those are more reliable benchmarks than an execution plan? I thought the point of looking at an execution plan was to see what would run generally well in most scenarios?

Luis - Yes, I'll keep that in mind.

Jeff - I submit to your superior logic RE: test data. :)
Post #1470538
Posted Thursday, July 4, 2013 6:37 PM


Hall of Fame

Hall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of Fame

Group: General Forum Members
Last Login: 2 days ago @ 4:53 AM
Points: 3,422, Visits: 5,368
Since we're speaking hypothetically here and I don't have AdventureWorks to play with, I'll propose another hypothetical approach that I've seen to work well on occasion.

Using Gail's Query 1, instead of the INNER JOIN try using a CROSS APPLY on a SELECT TOP 1 salesamount with ORDER BY salesamount DESC.

I would throw together an example but I don't like posting code I haven't tested.

For an example where I did something similar: http://www.sqlservercentral.com/Forums/FindPost1470314.aspx

There are even performance results later in the thread.

Edit: Note that this method will not return both rows if there's a tie for winner.



My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.


Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
Since random numbers are too important to be left to chance, let's generate some!
Learn to understand recursive CTEs by example.
Splitting strings based on patterns can be fast!
Post #1470540
Posted Friday, July 5, 2013 1:12 AM


SSC-Forever

SSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-ForeverSSC-Forever

Group: General Forum Members
Last Login: Today @ 1:13 PM
Points: 40,428, Visits: 36,879
There are multiple different ways to do this.

Rank or row number
Max in a subquery and join to it
Top 1 in a cross apply

All will likely be way faster than a triangular join. Which is best among them often depends on the volume and distribution of the data

As for comparing performance characteristics, if you want to see which run faster, which uses less CPU, then see which runs faster and uses less CPU. The execution plan shows estimated costs of operators and procedures, not durations and CPU.



Gail Shaw
Microsoft Certified Master: SQL Server 2008, MVP
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Post #1470598
« Prev Topic | Next Topic »

Add to briefcase 12»»

Permissions Expand / Collapse