Top N plus

Question

Post reply

Top N plus

antonio.collins

SSCrazy Eights

Points: 9600
More actions
February 3, 2008 at 6:20 pm

#186185

I'm trying to make a report for a competitive analysis based on ranking. The requirement's fairly simple: Include the Top N plus a specified item.
So if the query were on cars and the analysis is for Hyuandai Sonata, the results for "Top 5 plus" mid-size cars might be:
rank make/model qtr sales other data
1 Toyota Camry 500,000 ...
2 Honda Accord 480,000 ...
3 Ford Fusion 220,000 ...
4 Nissan Altima 215,000 ...
5 Chevy Malibu 175,000 ...
9 Hyuandai Sonata 125,000 ...
The problem is that we need to include the rank of the target (in this case 'Hyuandai Sonata') in the result set. So far, our solution is to stuff the ranking query for the entire result set into a temp table and then get the top N plus the target:
SELECT * from #temp WHERE rank <= @topN or column = @plusTarget
This works fine except some of the query datasets are very large (in excess of 100K items). We'd like to qualify the ranking query with the ranked value (e.g.: where sales >= {Sonata's sales}) but the ranked value is usually a somewhat complex calculation/aggregate. Thus the entire calculation/aggregation would still have to be performed on the entire dataset.
Any suggestions on a better way?

Viewing 15 posts - 1 through 15 (of 41 total)

You must be logged in to reply to this topic. Login to reply

Grant Fritchey SSC Guru Points: 398716 More actions · Answer 1

How about using a CTE within a UNION query. Something like this:

WITH MyCTE AS

(SELECT SalesOrderId

,row_number() over (order by TotalDue DESC) AS TotalRank

,TotalDue

FROM Sales.SalesOrderHeader)

SELECT TOP 5 *

FROM MyCTE

UNION

SELECT *

FROM MyCTE

WHERE SalesOrderId = 879967268

ORDER BY TotalRank

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

SwePeso SSC-Dedicated Points: 39752 More actions · Answer 2

Great example but you can enhance this further

;WITH MyCTE (SalesOrderID, TotalRank, TotalDue)

AS (

SELECTSalesOrderID,

ROW_NUMBER() OVER (ORDER BY TotalDue DESC),

TotalDue

FROMSales.SalesOrderHeader

)

SELECTTOP 5

SalesOrderID,

TotalRank,

TotalDue

FROMMyCTE

WHERETotalRank <= 5

OR SalesOrderID = 879967268

ORDER BYTotalRank

N 56°04'39.16"
E 12°55'05.25"

antonio.collins SSCrazy Eights Points: 9600 More actions · Answer 3

The CTE suggestion is really no different than our existing temp table technique. The problem is it still has to calculate the ranked metric for the entire dataset and our datasets can be very large (commonly producing more than 100K items in a ranked query). TOP (N) is supposed to be efficient enough that it sometimes may not need to process the entire dataset but its efficiency is negated by needing the ranked results of the "plus" target.

Is there a different technique that we can try?

Grant Fritchey SSC Guru Points: 398716 More actions · Answer 4

Have you performance tested both approaches? I think the CTE approach, especially the second one, not mine, should be much more effecient than loading data into a temp table.

"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt

Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning

SwePeso SSC-Dedicated Points: 39752 More actions · Answer 5

From the part of

-- Stage the CTE

and down, this approach averages

at 88 milliseconds for 100,000 records.

at 131 milliseconds for 1,000,000 records.

Fast enough? 😉

-- Create sample data

CREATE TABLE#Sample

(

OrderID INT,

Due DATETIME

)

-- Populate sample data

INSERT#Sample

(

OrderID,

Due

)

SELECTABS(CHECKSUM(NEWID())) % 100000,

30000 + ABS(CHECKSUM(NEWID())) / 1535.6

FROMmaster..spt_values AS v1

INNER JOINmaster..spt_values AS v2 ON v2.Type = 'p'

WHEREv1.Type = 'p'

AND v1.Number < 1000

AND v2.Number < 1000

-- Create a clustered index

CREATE CLUSTERED INDEX IX_OrderID ON #Sample (Due DESC)

-- Initialize the "6th" wheel

DECLARE@SalesOrderID INT

SELECTTOP 1

@SalesOrderID = OrderID

FROM#Sample

ORDER BYNEWID()

-- Stage the CTE

-- Time this and down beacuse all above is already made.

-- @SalesOrderID is really a user supplied parameter

;WITH MyCTE (SalesOrderID, TotalRank, TotalDue)

AS (

SELECTOrderID,

ROW_NUMBER() OVER (ORDER BY Due DESC),

Due

FROM#Sample

)

-- Show the expected result

SELECTTOP 6

SalesOrderID,

CASE WHEN SalesOrderID = @SalesOrderID THEN '6th wheel' ELSE 'Top 5' END AS Status,

TotalRank,

TotalDue

FROMMyCTE

WHERETotalRank <= 5

OR SalesOrderID = @SalesOrderID

ORDER BYTotalRank

N 56°04'39.16"
E 12°55'05.25"

antonio.collins SSCrazy Eights Points: 9600 More actions · Answer 6

In this particular query, top N is showing no efficiency benefits at all. Here are representative timings from 2 runs of each technique.

Top N:

set statistics time on

go

select top (10)

itemId,

itemName,

occurrenceDate,

RANK() OVER (ORDER BY some_metric_summary desc) as rank_metric,

cast(some_metric_summary as numeric(5,1)) as metric_value

from

... view with group by ...

where

... criteria and date range ...

order by rank_metric

go

It results in:

(10 row(s) affected)

SQL Server Execution Times:

CPU time = 144422 ms, elapsed time = 151149 ms.

Here's our temp table approach:

select top (10)

itemId,

itemName,

occurrenceDate,

RANK() OVER (ORDER BY some_metric_summary desc) as rank_metric,

cast(some_metric_summary as numeric(5,1)) as metric_value

into #x

from

... view with group by ...

where

... criteria and date range ...

go

create index x0 on #x( rank_metric )

create index x1 on #x( itemId )

go

select * from #x where rank_metric <= 10

union

select * from #x where item_id = 322179

order by rank_metric

go

(22098 row(s) affected)

SQL Server Execution Times:

CPU time = 145719 ms, elapsed time = 152716 ms.

SQL Server Execution Times: (index x0)

CPU time = 31 ms, elapsed time = 39 ms.

SQL Server Execution Times: (index x1)

CPU time = 47 ms, elapsed time = 41 ms.

SQL Server Execution Times: (union 1)

CPU time = 78 ms, elapsed time = 74 ms.

SQL Server Execution Times: (union 2)

CPU time = 78 ms, elapsed time = 77 ms.

(22 row(s) affected)

SQL Server parse and compile time:

CPU time = 0 ms, elapsed time = 1 ms.

total elapsed time: 152948 ms

and the CTE:

with Ranking as (

select top (10)

itemId,

itemName,

occurrenceDate,

RANK() OVER (ORDER BY some_metric_summary desc) as rank_metric,

cast(some_metric_summary as numeric(5,1)) as metric_value

from

... view with group by ...

where

... criteria and date range ...

)

select * from Ranking where rank_metric <= 10

union

select * from Ranking where item_id = 322179

order by rank_metric

go

(22 row(s) affected)

SQL Server Execution Times:

CPU time = 286703 ms, elapsed time = 299400 ms.

This test indicates that the Top N approach is still calculating the metric_summary for the entire dataset since it's timings are similar to our temp table technique. I'm not sure why the CTE performed so badly since our temp table technique without the index creation was actually faster than the version with the indexes.

Matt Miller (4) SSC Guru Points: 124210 More actions · Answer 7

What index are you using? Are you sure the compiler even used it?

All of the items to posted had the Top(N) syntax in it...

Without the RIGHT index (the one the compiler "wants") - it will just ignore the index that is there.

Also - you DO know that RANK() is NOT the same as ROW_NUMBER(), so RANK()<=10 is not at all the same as ROW_NUMBER()<=10. I'm not sure what you're using it for, but I suspect that you could use ROW_NUMBER() instead, and that should be quite a bit faster.

And - you're right - with the RANK() function in place, you can pretty much count on the entire table getting scanned unless the index built is right on.

----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

antonio.collins SSCrazy Eights Points: 9600 More actions · Answer 8

Matt Miller (2/4/2008)
What index are you using? Are you sure the compiler even used it?
All of the items to posted had the Top(N) syntax in it...
Without the RIGHT index (the one the compiler "wants") - it will just ignore the index that is there.
Also - you DO know that RANK() is NOT the same as ROW_NUMBER(), so RANK()<=10 is not at all the same as ROW_NUMBER()<=10. I'm not sure what you're using it for, but I suspect that you could use ROW_NUMBER() instead, and that should be quite a bit faster.
And - you're right - with the RANK() function in place, you can pretty much count on the entire table getting scanned unless the index built is right on.

Sorry, TOP (10) was only used in the first query. (I have to munge the actual SQL since we can't reveal any details about our code or schema.) RANK() is the proper method for us since it accounts for ties, but we could use TOP(N) with a fudge factor and filter it later (e.g.:

select * from (

select top (N * 1.3) * from ... order by rank )

where rank <= N

) if that saves some processing time. Right now, we're just trying to find a technique that doesn't have to process the whole dataset.

And the query is only partially optimized and uses several supporting indexes. It's a quarterly summary of about 1.2M rows for 22K of 471K occurrences so 2.5 minutes isn't bad... we plan on getting it down to the 30 sec range eventually.

Indexing aside, all three techniques use the same query plan when attacking the base query. We've discovered that the CTE nosedives because it does everything twice as evidenced by the read counts below. Apparantly, a common table expression is just that -- an expression. It is re-evaluated (re-executed) with each use so the UNION doubles the work.

temp table
Table '{occurrence source}'. Scan count 0, logical reads 44196 (644 rows)
Table 'Worktable'. Scan count 1, logical reads 281937
Table '{times of day}'. Scan count 1, logical reads 2 (30 rows)
Table '{times of day detail}'. Scan count 1, logical reads 2 (163 rows)
Table '{detail to 15 minutes}'. Scan count 0, logical reads 3077451 (7M rows)
Table '{item occurrence}'. Scan count 11604, logical reads 45421169 (471K rows)
Table '{detail maximums}'. Scan count 0, logical reads 34848 (44K rows)
Table '{time}'. Scan count 121, logical reads 384 (280K rows)
Table '{time factors}'. Scan count 121, logical reads 242 (128 rows)
Table '{daily weightings}'. Scan count 1, logical reads 11 (461 rows)
Table '{summary level mask}'. Scan count 0, logical reads 2 (32 rows)
Table '{detail level mask}'. Scan count 0, logical reads 2 (4 rows)
SQL Server Execution Times:
CPU time = 143953 ms, elapsed time = 150859 ms.
CTE
Table '{occurrence source}'. Scan count 0, logical reads 88392
Table 'Worktable'. Scan count 2, logical reads 563874
Table '{times of day}Portion'. Scan count 2, logical reads 4
Table '{times of day}'. Scan count 2, logical reads 4
Table '{detail to 15 minutes}'. Scan count 0, logical reads 6154902
Table '{item occurrence}'. Scan count 23208, logical reads 90842338
Table '{detail maximums}'. Scan count 0, logical reads 69696
Table '{time}'. Scan count 242, logical reads 768
Table '{time factors}'. Scan count 242, logical reads 484
Table '{daily weightings}'. Scan count 2, logical reads 22
Table '{summary level mask}'. Scan count 0, logical reads 4
Table '{detail level mask}'. Scan count 0, logical reads 4
SQL Server Execution Times:
CPU time = 290453 ms, elapsed time = 303760 ms.

noeld SSC Guru Points: 96590 More actions · Answer 9

I believe the real problem lies in this part:

... view with group by ...

Queries of views are "VERY" deceiving and indexes may not be appropriately used because you are hiding complexity in the view - which SQL will decompose - and you could be in for a surprise...

* Noel

Matt Miller (4) SSC Guru Points: 124210 More actions · Answer 10

First - have you looked at using the WITH TIES syntax?

Second - the fact that you're using single-column indexes to do this points to why everything is slow.

The view part isn't helping - like Noel mentioned. If the view takes some time to execute you might just be barking up the wrong tree here....

Just for S's and G's, I ran the same test as you against a 10M row table, and the Top(10) with TIES blew the doors off of the RANK and the temp table both, with the RIGHT index.

For a point of reference - it came in at 56s, with the building of the index costing 99.9% of that. The actual SELECT top(10) WITH TIES .... executed in 30 ms.

It looks simply like this:

select top(10) WITH ties

*

from MyTable

Order by some_metric_summary DESC;

----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

GSquared SSC Guru Points: 260824 More actions · Answer 11

My suggestion is either use Analysis Services to pre-aggregate the most common queries, or do so yourself in a separate Reporting database.

Are you querying on up-to-the-second data? Or would as-of-last-night data work? How about as-of-an-hour-ago?

If you can create a table or set of tables with some denormalized aggregates stored in them, and update that table nightly/weekly/quarterly/hourly/whatever, you'll get very, very fast reports from it.

Otherwise, if you have to perform the calculations on-the-fly, you're going to have slower performance on any complex queries. If the server doesn't know which products had the best sales in the first quarter last year, it has to calculate all sales of all products in that quarter, in order to figure out which are the best.

- Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
Property of The Thread

"Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

noeld SSC Guru Points: 96590 More actions · Answer 12

noeld

SSC Guru

Points: 96590

February 4, 2008 at 1:49 pm

#774897

Indexed View 😉 ... any one ?

* Noel

Matt Miller (4) SSC Guru Points: 124210 More actions · Answer 13

noeld (2/4/2008)
Indexed View 😉 ... any one ?

I'll pass...:hehe:

Too messy for me, but thank you for asking.

----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

antonio.collins SSCrazy Eights Points: 9600 More actions · Answer 14

Matt Miller (2/4/2008)
First - have you looked at using the WITH TIES syntax?
Second - the fact that you're using single-column indexes to do this points to why everything is slow.
The view part isn't helping - like Noel mentioned. If the view takes some time to execute you might just be barking up the wrong tree here....
Just for S's and G's, I ran the same test as you against a 10M row table, and the Top(10) with TIES blew the doors off of the RANK and the temp table both, with the RIGHT index.
For a point of reference - it came in at 56s, with the building of the index costing 99.9% of that. The actual SELECT top(10) WITH TIES .... executed in 30 ms.
It looks simply like this:
select top(10) WITH ties
*
from MyTable
Order by some_metric_summary DESC;

I think you guys are trying to address performance when that is not the point of my post. My question is about a technique that will calculate a rank within a given set of crtieria and return the top N for the ranked value along with specified items and their rank. There are over 400 metrics that the user can choose so creating an index to support the ranking metric is not viable.

Also, you have no idea how our indices are structured and your assumption about using single column indexes is incorrect. The views work very well, but we have not optimized them for large date range summarizations yet. So their performance is not important at this point since once we settle on a technique, we will tune the tables/views/indices for that technique.

Lastly, TOP (N) WITH TIES does not address my question since the proper RANK() of the "plus" target must be included also. Once RANK() is added to the query, it's performance is identical to TOP (N) without ties.

At this point, I don't see any alternative to ranking the entire result set and then selecting where rank <= N or itemId in (specific item list).