Top N plus

  • I'm trying to make a report for a competitive analysis based on ranking. The requirement's fairly simple: Include the Top N plus a specified item.

    So if the query were on cars and the analysis is for Hyuandai Sonata, the results for "Top 5 plus" mid-size cars might be:

    rank make/model qtr sales other data

    1 Toyota Camry 500,000 ...

    2 Honda Accord 480,000 ...

    3 Ford Fusion 220,000 ...

    4 Nissan Altima 215,000 ...

    5 Chevy Malibu 175,000 ...

    9 Hyuandai Sonata 125,000 ...

    The problem is that we need to include the rank of the target (in this case 'Hyuandai Sonata') in the result set. So far, our solution is to stuff the ranking query for the entire result set into a temp table and then get the top N plus the target:

    SELECT * from #temp WHERE rank <= @topN or column = @plusTarget

    This works fine except some of the query datasets are very large (in excess of 100K items). We'd like to qualify the ranking query with the ranked value (e.g.: where sales >= {Sonata's sales}) but the ranked value is usually a somewhat complex calculation/aggregate. Thus the entire calculation/aggregation would still have to be performed on the entire dataset.

    Any suggestions on a better way?

  • How about using a CTE within a UNION query. Something like this:

    WITH MyCTE AS

    (SELECT SalesOrderId

    ,row_number() over (order by TotalDue DESC) AS TotalRank

    ,TotalDue

    FROM Sales.SalesOrderHeader)

    SELECT TOP 5 *

    FROM MyCTE

    UNION

    SELECT *

    FROM MyCTE

    WHERE SalesOrderId = 879967268

    ORDER BY TotalRank

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Great example but you can enhance this further

    ;WITH MyCTE (SalesOrderID, TotalRank, TotalDue)

    AS (

    SELECTSalesOrderID,

    ROW_NUMBER() OVER (ORDER BY TotalDue DESC),

    TotalDue

    FROMSales.SalesOrderHeader

    )

    SELECTTOP 5

    SalesOrderID,

    TotalRank,

    TotalDue

    FROMMyCTE

    WHERETotalRank <= 5

    OR SalesOrderID = 879967268

    ORDER BYTotalRank


    N 56°04'39.16"
    E 12°55'05.25"

  • The CTE suggestion is really no different than our existing temp table technique. The problem is it still has to calculate the ranked metric for the entire dataset and our datasets can be very large (commonly producing more than 100K items in a ranked query). TOP (N) is supposed to be efficient enough that it sometimes may not need to process the entire dataset but its efficiency is negated by needing the ranked results of the "plus" target.

    Is there a different technique that we can try?

  • Have you performance tested both approaches? I think the CTE approach, especially the second one, not mine, should be much more effecient than loading data into a temp table.

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • From the part of

    -- Stage the CTE

    and down, this approach averages

    at 88 milliseconds for 100,000 records.

    at 131 milliseconds for 1,000,000 records.

    Fast enough? 😉

    -- Create sample data

    CREATE TABLE#Sample

    (

    OrderID INT,

    Due DATETIME

    )

    -- Populate sample data

    INSERT#Sample

    (

    OrderID,

    Due

    )

    SELECTABS(CHECKSUM(NEWID())) % 100000,

    30000 + ABS(CHECKSUM(NEWID())) / 1535.6

    FROMmaster..spt_values AS v1

    INNER JOINmaster..spt_values AS v2 ON v2.Type = 'p'

    WHEREv1.Type = 'p'

    AND v1.Number < 1000

    AND v2.Number < 1000

    -- Create a clustered index

    CREATE CLUSTERED INDEX IX_OrderID ON #Sample (Due DESC)

    -- Initialize the "6th" wheel

    DECLARE@SalesOrderID INT

    SELECTTOP 1

    @SalesOrderID = OrderID

    FROM#Sample

    ORDER BYNEWID()

    -- Stage the CTE

    -- Time this and down beacuse all above is already made.

    -- @SalesOrderID is really a user supplied parameter

    ;WITH MyCTE (SalesOrderID, TotalRank, TotalDue)

    AS (

    SELECTOrderID,

    ROW_NUMBER() OVER (ORDER BY Due DESC),

    Due

    FROM#Sample

    )

    -- Show the expected result

    SELECTTOP 6

    SalesOrderID,

    CASE WHEN SalesOrderID = @SalesOrderID THEN '6th wheel' ELSE 'Top 5' END AS Status,

    TotalRank,

    TotalDue

    FROMMyCTE

    WHERETotalRank <= 5

    OR SalesOrderID = @SalesOrderID

    ORDER BYTotalRank


    N 56°04'39.16"
    E 12°55'05.25"

  • In this particular query, top N is showing no efficiency benefits at all. Here are representative timings from 2 runs of each technique.

    Top N:

    set statistics time on

    go

    select top (10)

    itemId,

    itemName,

    occurrenceDate,

    RANK() OVER (ORDER BY some_metric_summary desc) as rank_metric,

    cast(some_metric_summary as numeric(5,1)) as metric_value

    from

    ... view with group by ...

    where

    ... criteria and date range ...

    order by rank_metric

    go

    It results in:

    (10 row(s) affected)

    SQL Server Execution Times:

    CPU time = 144422 ms, elapsed time = 151149 ms.

    Here's our temp table approach:

    select top (10)

    itemId,

    itemName,

    occurrenceDate,

    RANK() OVER (ORDER BY some_metric_summary desc) as rank_metric,

    cast(some_metric_summary as numeric(5,1)) as metric_value

    into #x

    from

    ... view with group by ...

    where

    ... criteria and date range ...

    go

    create index x0 on #x( rank_metric )

    create index x1 on #x( itemId )

    go

    select * from #x where rank_metric <= 10

    union

    select * from #x where item_id = 322179

    order by rank_metric

    go

    (22098 row(s) affected)

    SQL Server Execution Times:

    CPU time = 145719 ms, elapsed time = 152716 ms.

    SQL Server Execution Times: (index x0)

    CPU time = 31 ms, elapsed time = 39 ms.

    SQL Server Execution Times: (index x1)

    CPU time = 47 ms, elapsed time = 41 ms.

    SQL Server Execution Times: (union 1)

    CPU time = 78 ms, elapsed time = 74 ms.

    SQL Server Execution Times: (union 2)

    CPU time = 78 ms, elapsed time = 77 ms.

    (22 row(s) affected)

    SQL Server parse and compile time:

    CPU time = 0 ms, elapsed time = 1 ms.

    total elapsed time: 152948 ms

    and the CTE:

    with Ranking as (

    select top (10)

    itemId,

    itemName,

    occurrenceDate,

    RANK() OVER (ORDER BY some_metric_summary desc) as rank_metric,

    cast(some_metric_summary as numeric(5,1)) as metric_value

    from

    ... view with group by ...

    where

    ... criteria and date range ...

    )

    select * from Ranking where rank_metric <= 10

    union

    select * from Ranking where item_id = 322179

    order by rank_metric

    go

    (22 row(s) affected)

    SQL Server Execution Times:

    CPU time = 286703 ms, elapsed time = 299400 ms.

    This test indicates that the Top N approach is still calculating the metric_summary for the entire dataset since it's timings are similar to our temp table technique. I'm not sure why the CTE performed so badly since our temp table technique without the index creation was actually faster than the version with the indexes.

  • What index are you using? Are you sure the compiler even used it?

    All of the items to posted had the Top(N) syntax in it...

    Without the RIGHT index (the one the compiler "wants") - it will just ignore the index that is there.

    Also - you DO know that RANK() is NOT the same as ROW_NUMBER(), so RANK()<=10 is not at all the same as ROW_NUMBER()<=10. I'm not sure what you're using it for, but I suspect that you could use ROW_NUMBER() instead, and that should be quite a bit faster.

    And - you're right - with the RANK() function in place, you can pretty much count on the entire table getting scanned unless the index built is right on.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Matt Miller (2/4/2008)


    What index are you using? Are you sure the compiler even used it?

    All of the items to posted had the Top(N) syntax in it...

    Without the RIGHT index (the one the compiler "wants") - it will just ignore the index that is there.

    Also - you DO know that RANK() is NOT the same as ROW_NUMBER(), so RANK()<=10 is not at all the same as ROW_NUMBER()<=10. I'm not sure what you're using it for, but I suspect that you could use ROW_NUMBER() instead, and that should be quite a bit faster.

    And - you're right - with the RANK() function in place, you can pretty much count on the entire table getting scanned unless the index built is right on.

    Sorry, TOP (10) was only used in the first query. (I have to munge the actual SQL since we can't reveal any details about our code or schema.) RANK() is the proper method for us since it accounts for ties, but we could use TOP(N) with a fudge factor and filter it later (e.g.:

    select * from (

    select top (N * 1.3) * from ... order by rank )

    where rank <= N

    ) if that saves some processing time. Right now, we're just trying to find a technique that doesn't have to process the whole dataset.

    And the query is only partially optimized and uses several supporting indexes. It's a quarterly summary of about 1.2M rows for 22K of 471K occurrences so 2.5 minutes isn't bad... we plan on getting it down to the 30 sec range eventually.

    Indexing aside, all three techniques use the same query plan when attacking the base query. We've discovered that the CTE nosedives because it does everything twice as evidenced by the read counts below. Apparantly, a common table expression is just that -- an expression. It is re-evaluated (re-executed) with each use so the UNION doubles the work.

    temp table

    Table '{occurrence source}'. Scan count 0, logical reads 44196 (644 rows)

    Table 'Worktable'. Scan count 1, logical reads 281937

    Table '{times of day}'. Scan count 1, logical reads 2 (30 rows)

    Table '{times of day detail}'. Scan count 1, logical reads 2 (163 rows)

    Table '{detail to 15 minutes}'. Scan count 0, logical reads 3077451 (7M rows)

    Table '{item occurrence}'. Scan count 11604, logical reads 45421169 (471K rows)

    Table '{detail maximums}'. Scan count 0, logical reads 34848 (44K rows)

    Table '{time}'. Scan count 121, logical reads 384 (280K rows)

    Table '{time factors}'. Scan count 121, logical reads 242 (128 rows)

    Table '{daily weightings}'. Scan count 1, logical reads 11 (461 rows)

    Table '{summary level mask}'. Scan count 0, logical reads 2 (32 rows)

    Table '{detail level mask}'. Scan count 0, logical reads 2 (4 rows)

    SQL Server Execution Times:

    CPU time = 143953 ms, elapsed time = 150859 ms.

    CTE

    Table '{occurrence source}'. Scan count 0, logical reads 88392

    Table 'Worktable'. Scan count 2, logical reads 563874

    Table '{times of day}Portion'. Scan count 2, logical reads 4

    Table '{times of day}'. Scan count 2, logical reads 4

    Table '{detail to 15 minutes}'. Scan count 0, logical reads 6154902

    Table '{item occurrence}'. Scan count 23208, logical reads 90842338

    Table '{detail maximums}'. Scan count 0, logical reads 69696

    Table '{time}'. Scan count 242, logical reads 768

    Table '{time factors}'. Scan count 242, logical reads 484

    Table '{daily weightings}'. Scan count 2, logical reads 22

    Table '{summary level mask}'. Scan count 0, logical reads 4

    Table '{detail level mask}'. Scan count 0, logical reads 4

    SQL Server Execution Times:

    CPU time = 290453 ms, elapsed time = 303760 ms.

  • I believe the real problem lies in this part:

    ... view with group by ...

    Queries of views are "VERY" deceiving and indexes may not be appropriately used because you are hiding complexity in the view - which SQL will decompose - and you could be in for a surprise...


    * Noel

  • First - have you looked at using the WITH TIES syntax?

    Second - the fact that you're using single-column indexes to do this points to why everything is slow.

    The view part isn't helping - like Noel mentioned. If the view takes some time to execute you might just be barking up the wrong tree here....

    Just for S's and G's, I ran the same test as you against a 10M row table, and the Top(10) with TIES blew the doors off of the RANK and the temp table both, with the RIGHT index.

    For a point of reference - it came in at 56s, with the building of the index costing 99.9% of that. The actual SELECT top(10) WITH TIES .... executed in 30 ms.

    It looks simply like this:

    select top(10) WITH ties

    *

    from MyTable

    Order by some_metric_summary DESC;

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • My suggestion is either use Analysis Services to pre-aggregate the most common queries, or do so yourself in a separate Reporting database.

    Are you querying on up-to-the-second data? Or would as-of-last-night data work? How about as-of-an-hour-ago?

    If you can create a table or set of tables with some denormalized aggregates stored in them, and update that table nightly/weekly/quarterly/hourly/whatever, you'll get very, very fast reports from it.

    Otherwise, if you have to perform the calculations on-the-fly, you're going to have slower performance on any complex queries. If the server doesn't know which products had the best sales in the first quarter last year, it has to calculate all sales of all products in that quarter, in order to figure out which are the best.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • Indexed View 😉 ... any one ?


    * Noel

  • noeld (2/4/2008)


    Indexed View 😉 ... any one ?

    I'll pass...:hehe:

    Too messy for me, but thank you for asking.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Matt Miller (2/4/2008)


    First - have you looked at using the WITH TIES syntax?

    Second - the fact that you're using single-column indexes to do this points to why everything is slow.

    The view part isn't helping - like Noel mentioned. If the view takes some time to execute you might just be barking up the wrong tree here....

    Just for S's and G's, I ran the same test as you against a 10M row table, and the Top(10) with TIES blew the doors off of the RANK and the temp table both, with the RIGHT index.

    For a point of reference - it came in at 56s, with the building of the index costing 99.9% of that. The actual SELECT top(10) WITH TIES .... executed in 30 ms.

    It looks simply like this:

    select top(10) WITH ties

    *

    from MyTable

    Order by some_metric_summary DESC;

    I think you guys are trying to address performance when that is not the point of my post. My question is about a technique that will calculate a rank within a given set of crtieria and return the top N for the ranked value along with specified items and their rank. There are over 400 metrics that the user can choose so creating an index to support the ranking metric is not viable.

    Also, you have no idea how our indices are structured and your assumption about using single column indexes is incorrect. The views work very well, but we have not optimized them for large date range summarizations yet. So their performance is not important at this point since once we settle on a technique, we will tune the tables/views/indices for that technique.

    Lastly, TOP (N) WITH TIES does not address my question since the proper RANK() of the "plus" target must be included also. Once RANK() is added to the query, it's performance is identical to TOP (N) without ties.

    At this point, I don't see any alternative to ranking the entire result set and then selecting where rank <= N or itemId in (specific item list).

Viewing 15 posts - 1 through 15 (of 41 total)

You must be logged in to reply to this topic. Login to reply