Can a strawberry query be done better?

  • Coming from a MySQL background, one query design I learned early on was the 'strawberry query', which got its odd name from the MySQL newsgroup. It is a very useful pattern for answering questions like "who's the best-performing salesperson of the month?"

    The solution basically involves doing a "triangular join" and filtering for NULLs. Using the salesperson example:

    SELECT
        s.SalesPerson,
        s.SalesMonth,
        s.SalesAmount,
        s.Customer
    FROM Sales AS s
    LEFT JOIN Sales AS m
        ON  s.SalesMonth  = m.SalesMonth
        AND s.SalesAmount < m.SalesAmount
    WHERE m.SalesID IS NULL;

    Note that the query is free to include other fields from the same row because there is no GROUP BY; the grouping is done implicitly via the self-join. We are guaranteed to get only one row per month: the sale with the greatest amount. No other row is greater than the greatest amount of a given month, so for that row m.SalesID must be NULL.

    This also works for getting the minimum; just reverse the inequality operator in the join criteria, as in the sketch below. Also, there is no TOP 1 ... ORDER BY, which can be problematic when you need to get multiple results (e.g. when you want to see all 12 months at once).
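    A minimal sketch of that minimum-per-month variant, using the same assumed Sales columns (illustrative only):

    -- Minimum per month: the strawberry pattern with the inequality reversed,
    -- keeping each row for which no smaller amount exists in its month.
    SELECT
        s.SalesPerson,
        s.SalesMonth,
        s.SalesAmount,
        s.Customer
    FROM Sales AS s
    LEFT JOIN Sales AS m
        ON  s.SalesMonth  = m.SalesMonth
        AND s.SalesAmount > m.SalesAmount
    WHERE m.SalesID IS NULL;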

    Now, that worked well in MySQL. However, I'm aware that T-SQL has features that don't exist in the MySQL dialect, so I wonder whether there might be a better way of doing this in T-SQL. I don't exactly trust myself to interpret execution plans, so I'd be very interested in hearing from others whether this can be outperformed by alternatives such as ROW_NUMBER() or other approaches.

    Thanks!

  • Banana-823045 (7/4/2013)



    Do you have an example of the data that is contained in the SalesMonth column?

    Also, could you provide the CREATE TABLE for the Sales table so we can see the datatypes, etc. It could very well make a difference.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Hmm. I had intended it as a general question. However, I acknowledge your point about the schema having an influence. So, needing handy test data, I turned to the AdventureWorks2008R2 database.

    Here are the queries to try out:

    1) Get each salesperson's best sale ever, plus the details of that sale:

    SELECT
        o.SalesOrderID,
        o.OrderDate,
        o.AccountNumber,
        o.TotalDue,
        o.SalesPersonID,
        p.FirstName,
        p.LastName
    FROM Sales.SalesOrderHeader AS o
    INNER JOIN Person.Person AS p
        ON o.SalesPersonID = p.BusinessEntityID
    LEFT JOIN Sales.SalesOrderHeader AS m
        ON  o.SalesPersonID = m.SalesPersonID
        AND o.TotalDue < m.TotalDue
    WHERE m.TotalDue IS NULL;

    2) Get the best sale for each month across years. Note that because of the functions applied in the join criteria, the query is non-sargable, so I expect it to perform badly. Normally, if this were crucial for the business, I would probably have added indexed computed columns to keep the year and month separate (a sketch of that idea follows the query below).

    SELECT
        o.OrderDate,
        o.AccountNumber,
        o.TotalDue,
        o.SalesPersonID,
        p.FirstName,
        p.LastName
    FROM Sales.SalesOrderHeader AS o
    INNER JOIN Person.Person AS p
        ON o.SalesPersonID = p.BusinessEntityID
    LEFT JOIN Sales.SalesOrderHeader AS m
        ON  YEAR(o.OrderDate)  = YEAR(m.OrderDate)
        AND MONTH(o.OrderDate) = MONTH(m.OrderDate)
        AND o.TotalDue < m.TotalDue
    WHERE m.TotalDue IS NULL;
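    For reference, a hypothetical sketch of that computed-column idea (not applied here; the database was left unmodified, as noted below):

    -- Hypothetical: persisted computed columns make the year/month lookup
    -- sargable and indexable (YEAR/MONTH are deterministic, so PERSISTED is allowed).
    ALTER TABLE Sales.SalesOrderHeader
        ADD OrderYear  AS YEAR(OrderDate)  PERSISTED,
            OrderMonth AS MONTH(OrderDate) PERSISTED;

    CREATE INDEX IX_SalesOrderHeader_YearMonth
        ON Sales.SalesOrderHeader (OrderYear, OrderMonth, TotalDue);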

    There were no modifications to the AdventureWorks2008R2 database. Hopefully, that'll provide you with enough sample data.

    Thanks again!

  • Query 1.

    SELECT * FROM (
        SELECT
            o.SalesOrderID,
            o.OrderDate,
            o.AccountNumber,
            o.TotalDue,
            o.SalesPersonID,
            p.FirstName,
            p.LastName,
            ROW_NUMBER() OVER (PARTITION BY o.SalesPersonID ORDER BY o.TotalDue DESC) AS SalesPersonPosition
        FROM Sales.SalesOrderHeader AS o
        INNER JOIN Person.Person AS p
            ON o.SalesPersonID = p.BusinessEntityID
    ) sub
    WHERE SalesPersonPosition = 1
    ORDER BY sub.TotalDue DESC;

    Performance characteristics:

    Yours

    Table 'Person'. Scan count 9, logical reads 310, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'SalesOrderHeader'. Scan count 18, logical reads 1504, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:
    CPU time = 251 ms, elapsed time = 120 ms.

    Mine

    Table 'Person'. Scan count 9, logical reads 310, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'SalesOrderHeader'. Scan count 9, logical reads 752, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:
    CPU time = 16 ms, elapsed time = 15 ms.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Query 2.

    SELECT * FROM (
        SELECT
            o.OrderDate,
            o.AccountNumber,
            o.TotalDue,
            o.SalesPersonID,
            p.FirstName,
            p.LastName,
            ROW_NUMBER() OVER (PARTITION BY YEAR(o.OrderDate), MONTH(o.OrderDate) ORDER BY o.TotalDue DESC) AS MonthPosition
        FROM Sales.SalesOrderHeader AS o
        INNER JOIN Person.Person AS p
            ON o.SalesPersonID = p.BusinessEntityID
    ) sub
    WHERE MonthPosition = 1
    ORDER BY sub.TotalDue DESC;

    Performance characteristics:

    Yours

    Table 'Person'. Scan count 9, logical reads 310, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'SalesOrderHeader'. Scan count 18, logical reads 1504, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 36, logical reads 113790, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:
    CPU time = 2590 ms, elapsed time = 781 ms.

    Mine

    Table 'Person'. Scan count 9, logical reads 310, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'SalesOrderHeader'. Scan count 9, logical reads 752, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:
    CPU time = 47 ms, elapsed time = 17 ms.

    P.S. I added the ORDER BY to both queries to make comparison easier, as the rows come out in different orders without it due to the different ways the query processor executes the two queries.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Depending on the requirements, you might want to use RANK instead of ROW_NUMBER if you need to bring back ties instead of a single best row.
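    For example, a sketch of Gail's Query 1 with RANK swapped in (untested; any orders tied for a salesperson's top TotalDue would all come back with position 1):

    SELECT * FROM (
        SELECT
            o.SalesOrderID,
            o.OrderDate,
            o.AccountNumber,
            o.TotalDue,
            o.SalesPersonID,
            p.FirstName,
            p.LastName,
            RANK() OVER (PARTITION BY o.SalesPersonID ORDER BY o.TotalDue DESC) AS SalesPersonPosition
        FROM Sales.SalesOrderHeader AS o
        INNER JOIN Person.Person AS p
            ON o.SalesPersonID = p.BusinessEntityID
    ) sub
    WHERE SalesPersonPosition = 1   -- RANK returns every row tied for first
    ORDER BY sub.TotalDue DESC;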

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • And now we know why test data is essential for such "general" questions. 😛

    Just to add my 2 cents, you can combine the Year and Month comparison by doing the following...

    ROW_NUMBER() OVER (PARTITION BY DATEDIFF(mm,0,o.OrderDate) ORDER BY o.TotalDue desc) MonthPosition

    I couldn't test it for speed because, right now, I'm on a box that doesn't have the 2k8 version of AdventureWorks. I've usually not found it to make a huge difference in performance but I really like it because it cuts that part of the typing in half. 😀
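    (For anyone new to the trick: DATEDIFF(mm, 0, somedate) counts whole months since the zero date, 1900-01-01, so every row in the same calendar month lands in the same integer bucket. A quick demonstration you can verify yourself:)

    SELECT DATEDIFF(mm, 0, '20130704');  -- 1362
    SELECT DATEDIFF(mm, 0, '20130731');  -- 1362 (same calendar month, same bucket)
    SELECT DATEDIFF(mm, 0, '20130801');  -- 1363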

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Cool, thanks, Gail.

    From your response, I also now wonder if I'm doing it wrong by trying to compare the execution plans, since you seem to be simply looking at the elapsed time and some statistics. Am I to take it that those are more reliable benchmarks than an execution plan? I thought the point of looking at an execution plan was to see what would run generally well in most scenarios?

    Luis - Yes, I'll keep that in mind.

    Jeff - I submit to your superior logic RE: test data. 🙂

  • Since we're speaking hypothetically here and I don't have AdventureWorks to play with, I'll propose another hypothetical approach that I've seen to work well on occasion.

    Using Gail's Query 1, instead of the INNER JOIN try using a CROSS APPLY on a SELECT TOP 1 salesamount with ORDER BY salesamount DESC.

    I would throw together an example but I don't like posting code I haven't tested.

    For an example where I did something similar: http://www.sqlservercentral.com/Forums/FindPost1470314.aspx

    There are even performance results later in the thread.
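    For illustration only, an untested sketch of the shape of that query against the same AdventureWorks tables:

    -- Untested sketch: best order per salesperson via CROSS APPLY ... TOP 1
    SELECT best.SalesOrderID,
           best.OrderDate,
           best.AccountNumber,
           best.TotalDue,
           sp.SalesPersonID,
           p.FirstName,
           p.LastName
    FROM (SELECT DISTINCT SalesPersonID
          FROM Sales.SalesOrderHeader
          WHERE SalesPersonID IS NOT NULL) AS sp
    INNER JOIN Person.Person AS p
        ON sp.SalesPersonID = p.BusinessEntityID
    CROSS APPLY (SELECT TOP 1 o.SalesOrderID, o.OrderDate, o.AccountNumber, o.TotalDue
                 FROM Sales.SalesOrderHeader AS o
                 WHERE o.SalesPersonID = sp.SalesPersonID
                 ORDER BY o.TotalDue DESC) AS best;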

    Edit: Note that this method will not return both rows if there's a tie for winner.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
    Since random numbers are too important to be left to chance, let's generate some!
    Learn to understand recursive CTEs by example.

  • There are multiple ways to do this:

    Rank or row number

    Max in a subquery and join to it (sketched after this list)

    Top 1 in a cross apply

    All will likely be way faster than a triangular join. Which is best among them often depends on the volume and distribution of the data.
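    Of those, the only form not yet shown in the thread is the max-in-a-subquery; a rough sketch against the same tables (illustrative, not tuned):

    SELECT o.SalesOrderID, o.OrderDate, o.AccountNumber, o.TotalDue,
           o.SalesPersonID, p.FirstName, p.LastName
    FROM Sales.SalesOrderHeader AS o
    INNER JOIN Person.Person AS p
        ON o.SalesPersonID = p.BusinessEntityID
    INNER JOIN (SELECT SalesPersonID, MAX(TotalDue) AS MaxTotalDue
                FROM Sales.SalesOrderHeader
                GROUP BY SalesPersonID) AS mx
        ON  o.SalesPersonID = mx.SalesPersonID
        AND o.TotalDue = mx.MaxTotalDue;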

    As for comparing performance characteristics: if you want to see which runs faster and which uses less CPU, then see which runs faster and uses less CPU. The execution plan shows estimated costs of operators and procedures, not durations and CPU.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Thanks. I had the impression that one should use the execution plan to get an idea of how well a query will perform across different applications, whereas timing the results would be too specific to one particular data set/schema.

    Also, now that it's obvious that the strawberry query isn't the best choice, I wonder what else a triangular join would be good for. Jeff Moden had cited them as one of the "hidden RBARs" but also mentioned in passing that they can be useful in rare instances. Does anyone know of such instances?

    Thanks again!

  • Banana-823045 (7/6/2013)



    Like Gail said, the values in execution plans are frequently estimates (unless they say "actual"), which means that "% of Batch" for a multi-query comparison run is almost always an estimate. "% of Batch" should never be used as the definitive measure of which query is "best" for either performance or resource usage. In fact, it's frequently very "wrong" (estimates aren't wrong, just sometimes skewed) and can show exactly the opposite of what is true during a run.

    For example, here's a classic case of why many people come to the erroneous conclusion that a recursive CTE that counts is lightning fast compared to many of the other methods, which are actually much faster than the recursive method. Run the following with the actual execution plan on and note the "% of Batch" for both queries. It shows the rCTE taking 0% of the time and the "Tally" method taking 100% of the time. Yet, if you look at the printouts in the Messages tab, you'll find that the opposite is true.

    /****************************************************************************************
     Purpose:
     This code demonstrates that the estimated and actual execution plans in SQL Server can
     be 100% INCORRECT and that the execution plan should only be relied on to provide hints
     as to what may be wrong with a query rather than an absolute indication. This code runs
     in SQL Server 2005 only.

     The code creates 30 years worth of dates starting with 2000-01-01 using two different
     methods. The first method uses a recursive CTE and the second method uses a "Tally"
     structure. The output of each method is directed to a "throw-away" variable to take
     display delays out of the picture.

     Please check both the actual and estimated execution plans and compare the % of batch.
    ****************************************************************************************/
    SET NOCOUNT ON
    --=======================================================================================
    --      Recursive method shown by (Name with-held)
    --=======================================================================================
    PRINT '========== Recursive method =========='
    --===== Turn on some performance counters ===============================================
    SET STATISTICS IO,TIME ON;
    DECLARE @Bitbucket DATETIME; --Holds display output so display times aren't measured.
    --===== Execute the code being tested ===================================================
    DECLARE @DateVal DATETIME;
    SET @DateVal = '2000-01-01';

    WITH rCTE AS
    (
        SELECT @DateVal AS DateVal
        UNION ALL
        SELECT DateVal = DATEADD(dd,1,DateVal)
        FROM rCTE
        WHERE DATEADD(dd,1,DateVal) < DATEADD(yy, 30, @DateVal)
    )
    SELECT @Bitbucket = d.DateVal
    FROM rCTE d
    OPTION (MAXRECURSION 0);
    --===== Turn off the performance counters and print a separator =========================
    SET STATISTICS TIME,IO OFF;
    PRINT REPLICATE('=',90);
    GO
    --=======================================================================================
    --      "Tally" structure method
    --=======================================================================================
    PRINT '========== Tally table method =========='
    --===== Turn on some performance counters ===============================================
    SET STATISTICS IO,TIME ON;
    DECLARE @Bitbucket DATETIME; --Holds display output so display times aren't measured.
    --===== Execute the code being tested ===================================================
    DECLARE @StartDate AS DATETIME;
    SET @StartDate = '2000-01-01';

    SELECT TOP (DATEDIFF(dd,@StartDate,DATEADD(yy,30,@StartDate)))
           @Bitbucket = DATEADD(dd,ROW_NUMBER() OVER (ORDER BY (SELECT NULL))-1,@StartDate)
    FROM sys.all_columns ac1
    CROSS JOIN sys.all_columns ac2;
    --===== Turn off the performance counters and print a separator =========================
    SET STATISTICS TIME,IO OFF;
    PRINT REPLICATE('=',90);
    GO

    Here's what I get on my older desktop box...

    ========== Recursive method ==========

    SQL Server Execution Times:
    CPU time = 0 ms, elapsed time = 0 ms.
    Table 'Worktable'. Scan count 2, logical reads 65749, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:
    CPU time = 485 ms, elapsed time = 569 ms.
    ==========================================================================================

    ========== Tally table method ==========

    SQL Server Execution Times:
    CPU time = 0 ms, elapsed time = 0 ms.
    Table 'syscolrdb'. Scan count 1, logical reads 98, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'syscolpars'. Scan count 2, logical reads 7, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    SQL Server Execution Times:
    CPU time = 15 ms, elapsed time = 26 ms.
    ==========================================================================================

    I don't have an example for it, but there are also times when some of the nodes in the execution plan will show impossibilities like 114,387% for a value.

    I use the execution plan all the time to help me troubleshoot poorly performing queries. I never use the costs or "% of Batch" to determine which will actually perform better. I also don't always trust SET STATISTICS; it can outright lie depending on what is being done. Please see the following article, where SET STATISTICS actually makes scalar UDFs look much worse than they actually are (they're still much worse than other methods, but not as bad as SET STATISTICS makes them look).

    http://www.sqlservercentral.com/articles/T-SQL/91724/

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (7/6/2013)


    For example, here's a classic case of why many people come to the erroneous conclusion that a recursive CTE that counts is lightning fast compared to many of the other methods which are actually much faster than the recursive method. Run the following with the actual execution plan on and note the "% of Batch" for both queries. It shows the rCTE take 0% of the time and the "Tally" method taking 100% of the time. Yet, if you look at the print outs in the Messages tab, you'll find that the opposite is true.

    In this particular case it's because the optimiser costs the recursive CTE based on either just the anchor member or on the anchor member and one recursive step (can't remember which). Hence the estimates are wildly inaccurate.

    Exec plans tell you how the query runs. Execution statistics tell you how the query performs.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (7/6/2013)



    I couldn't remember which one either and so I didn't bring it up, but you're absolutely correct.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • If you look in the MySQL manual, you'll see that your 'strawberry' query is offered as one method for solving this kind of problem. With the advent of subqueries, however, the other methods provided on the same page are inherently faster (on appropriately indexed tables). In tests, the 'uncorrelated subquery' method usually wins out.
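    A sketch of that uncorrelated-subquery form, using the Sales example from the first post (assumed columns, untested):

    SELECT s.SalesPerson, s.SalesMonth, s.SalesAmount, s.Customer
    FROM Sales AS s
    INNER JOIN (SELECT SalesMonth, MAX(SalesAmount) AS MaxAmount
                FROM Sales
                GROUP BY SalesMonth) AS m
        ON  s.SalesMonth  = m.SalesMonth
        AND s.SalesAmount = m.MaxAmount;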

    Strawberry
