Top N plus

  • Ranking the whole result set is probably your only option.

    One last idea, have you tried:

    select top 5 ...

    from ...

    order by ranking desc

    union

    select ...

    from ...

    where product = @param

    So the first query just gets the top N, and the Union gets the one you want?

    I'm not sure that would solve anything, but have you tried it?

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • You're missing the point I'm trying to make: the two are related. By not giving it something to work with that it "likes" (i.e. appropriate indexing, etc...), and by requiring the RANK() function over the 1.2M to just to get the rankings, it is going to force a scan of all 1.2M rows, just to find the distinct values and create said ranks.

    Depending on how the view is set up - it's very possible that it by itself is forcing the entire view to be materialized just so a really small subset can be used, which is yet another spot to look at.

    Try this on:

    Use the Top(N) with TIES to get you the records you need, and THEN create the rankings on the results of the Top(N).

    such as

    declare @g datetime;

    select @g=getdate();

    select RANK() OVER (ORDER BY orderamount DESC) rnk,

    *

    from (

    select top(10) WITH ties

    *

    from pivottable

    Order by orderamount DESC

    ) pvt;

    select datediff(ms,@g,getdate());

    In this way - at least you're running the RANK() on a much smaller subset, versus creating a hash table that might contain 1.2Million values that then need to be ordered at run-time.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • You're missing the point I'm trying to make: the two are related. By not giving it something to work with that it "likes" (i.e. appropriate indexing, etc...), and by requiring the RANK() function over the 1.2M to just to get the rankings, it is going to force a scan of all 1.2M rows, just to find the distinct values and create said ranks.

    ...snip...

    Use the Top(N) with TIES to get you the records you need, and THEN create the rankings on the results of the Top(N).

    ...snip...

    I understand your point, but you missed mine. The RANK() is already based on the dataset corresponding to the given criteria, not all records in the table(s). So if someone provides criteria as size=mid, engine=v6, qtr=3Q 07 they'll get a different rank for Toyota Camry over 3yr_lease_value than they would for engine=4cyl, qtr=3Q 07 ranked over 2yr_lease_value. TOP (99) WITH TIES will not include Toyota Camry if its rank is > 99. Also nearly all rankable metrics are summarizations or calculations that include business rules so the only way to sort/order/rank is to calculate the metric.

  • I did catch that, and I'm not denying in any way that your metric is required to get to the top 10, 15, 20, etc... choices they might want. What I'm trying to impress upon you is that once you have calculated the personalized rating or metric as you call it, using the Rank() function against the TOP(N) with TIES subquery will return your results faster than the RANK() function against the whole set, since the top(N) is an ordering function, and not an aggregation-type function. And you'll still end up with the same results - just quite a bit faster.

    In other words - if you replicated the query above , and replaced "orderamount" with the name of your calculated metric, by itself you should see a noticeable difference in efficiency in reads and execution time. RANK() is pretty much always going to go out an build a hash table of all of the values, so it's not going to be good for performance against a large set.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • ... using the Rank() function against the TOP(N) with TIES subquery will return your results faster than the RANK() function against the whole set, since the top(N) is an ordering function, and not an aggregation-type function. And you'll still end up with the same results - just quite a bit faster.

    No it won't be any faster because RANK() can only be calculated on the entire candidate data set (what's identified by the where clause). If the where clause boils down to 2000 rows, it's going to calculate RANK() for 2000 items; if its 2M rows, all 2M will be ranked.

    But maybe I'm wrong. So, here's a simple example.

    create table #z ( model varchar(20), metric1 int, metric2 int )

    insert into #z values ('honda accord', 1540, 200 )

    insert into #z values ('honda civic', 1530, 250 )

    insert into #z values ('chevy malibu', 1520, 26 )

    insert into #z values ('toyota camry', 1250, 27 )

    insert into #z values ('toyota avalon', 1350, 30 )

    insert into #z values ('toyota rav4', 1450, 40 )

    insert into #z values ('honda pilot', 150, 50 )

    insert into #z values ('honda odyssey', 550, 55 )

    insert into #z values ('chevy suburban', 750, 600 )

    insert into #z values ('pontiac gto', 850, 100 )

    insert into #z values ('toyota land cruiser', 950, 300 )

    insert into #z values ('pontiac g6', 650, 155 )

    insert into #z values ('ford fusion', 250, 255 )

    insert into #z values ('ford edge', 440, 270 )

    insert into #z values ('mazda 3', 430, 300 )

    insert into #z values ('mazda 6', 180, 205 )

    insert into #z values ('mazda rx7', 70, 200 )

    insert into #z values ('kia spectra', 90, 205 )

    insert into #z values ('hyundai tucson', 80, 25 )

    insert into #z values ('nissan murano', 110, 50 )

    insert into #z values ('nissan altima', 130, 55 )

    Toyota Camry is #6 when ranked by metic1, but #19 when ranked by metric2. What's the query to produce this top 10 by metric1

    honda accord15401

    honda civic15302

    chevy malibu15203

    toyota rav414504

    toyota avalon13505

    toyota camry12506

    toyota land cruiser9507

    pontiac gto8508

    chevy suburban7509

    pontiac g665010

    and this top 10 by metric 2 with Toyota Camry included in both?

    chevy suburban6001

    toyota land cruiser3002

    mazda 33002

    ford edge2704

    ford fusion2555

    honda civic2506

    mazda 62057

    kia spectra2057

    mazda rx72009

    honda accord2009

    toyota camry2719

  • ACtually - no. RANK just needs to be run over the top N and in everything from "camry" upwards. Everything else is worthless.

    with that in mind:

    declare @g datetime;

    --create index ix_pivot4 on testpivot(orderamount) include(coid,prodid)

    declare @camryID int;

    declare @camryamount int;

    select @camryID=5466326; --#50000 in the ranking

    --select @camryID=5466326; --#2415667 in the ranking

    select @g=getdate();

    select @camryamount=orderamount from testpivot where rid=@camryID

    select *

    from (

    select Top(20) RANK() over (order by orderamount desc) rnk,

    *

    from (

    select top(20) with ties

    coid,prodid,orderamount,rid

    fromtestpivot

    wherecoid between 1 and 5

    and prodid between 1 and 40

    order by orderamount desc

    ) pvt

    order by orderamount desc

    ) topN

    UNION

    select * from

    (select Top 1 *

    from(

    select RANK() over (order by orderamount desc) rnk,

    coid,prodid,orderamount,rid

    fromtestpivot

    where(orderamount>@camryamount

    and coid between 1 and 10

    and prodid between 1 and 40)

    or rid=@camryID

    ) camry

    order by orderamount) c;

    select datediff(ms,@g,getdate())

    dbcc freeproccache

    ---and now - for the RANK solution

    select @g=getdate();

    ;With RANKCTE as (

    select RANK() over (order by orderamount desc) rnk,

    coid,prodid,orderamount,rid

    fromtestpivot

    wherecoid between 1 and 10

    and prodid between 1 and 40)

    select * from RANKCTE

    where rnk<=20

    UNION

    select * from RANKCTE where rid=@camryid

    order by orderamount desc;

    select datediff(ms,@g,getdate())

    Results:

    WITH the index in place

    the TOP(n) syntax : from 190ms to 19453ms

    the CTE: 21780 to 24570

    WITHOUT any helper index in place

    the TOP(n) syntax : from 14760ms to 37516ms

    the CTE: 71733 to 78943

    (variations seem to have to do with how far down on the list the "camry" is).

    So - yes it's a little bit more to write, but you can outpace RANK() fairly dramatically if you should so desire.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • This will give you the result you seek.

    set nocount on

    declare @z table ( model varchar(20), metric1 int, metric2 int )

    insert into @z values ('honda accord', 1540, 200 )

    insert into @z values ('honda civic', 1530, 250 )

    insert into @z values ('chevy malibu', 1520, 26 )

    insert into @z values ('toyota camry', 1250, 27 )

    insert into @z values ('toyota avalon', 1350, 30 )

    insert into @z values ('toyota rav4', 1450, 40 )

    insert into @z values ('honda pilot', 150, 50 )

    insert into @z values ('honda odyssey', 550, 55 )

    insert into @z values ('chevy suburban', 750, 600 )

    insert into @z values ('pontiac gto', 850, 100 )

    insert into @z values ('toyota land cruiser', 950, 300 )

    insert into @z values ('pontiac g6', 650, 155 )

    insert into @z values ('ford fusion', 250, 255 )

    insert into @z values ('ford edge', 440, 270 )

    insert into @z values ('mazda 3', 430, 300 )

    insert into @z values ('mazda 6', 180, 205 )

    insert into @z values ('mazda rx7', 70, 200 )

    insert into @z values ('kia spectra', 90, 205 )

    insert into @z values ('hyundai tucson', 80, 25 )

    insert into @z values ('nissan murano', 110, 50 )

    insert into @z values ('nissan altima', 130, 55 )

    ;WITH MyCTE1 (Model, Row, Ranking, Metric1, Metric2)

    AS (

    SELECTModel,

    ROW_NUMBER() OVER (ORDER BY Metric1 DESC),

    RANK() OVER (ORDER BY Metric1 DESC),

    Metric1,

    Metric2

    FROM@z

    )

    -- Show the expected result

    SELECTTOP 10 WITH TIES

    Ranking,

    Model,

    Metric1,

    Metric2

    FROMMyCTE1

    WHERERow <= 10

    OR Model = 'Toyota Camry'

    ORDER BYCASE WHEN Model = 'Toyota Camry' AND Row > 10 THEN 10 ELSE Row END

    ;WITH MyCTE2 (Model, Row, Ranking, Metric1, Metric2)

    AS (

    SELECTModel,

    ROW_NUMBER() OVER (ORDER BY Metric2 DESC),

    RANK() OVER (ORDER BY Metric2 DESC),

    Metric1,

    Metric2

    FROM@z

    )

    -- Show the expected result

    SELECTTOP 10 WITH TIES

    Ranking,

    Model,

    Metric1,

    Metric2

    FROMMyCTE2

    WHERERow <= 10

    OR Model = 'Toyota Camry'

    ORDER BYCASE WHEN Model = 'Toyota Camry' AND Row > 10 THEN 10 ELSE Row END

    Could you please timetest it and post back times here?


    N 56°04'39.16"
    E 12°55'05.25"

  • peso: that's the CTE version that was discussed earlier. it performs the same as the temp table version if the expression is only referred to once.

    as i said before, the bulk of the time is calculating the rank. getting the top N results is a breeze (only 70ms to get them out of a temp table). i don't think there is a way to speed up a rank()ed query.

  • Notice that my latest calls your Temp table solution, the CTE solution.

    You really should give some of this a whirl. Running RANK() to the whole set just to get 10 or 11 records back is ultimately a waste. It's the same as running a count to check for existence - you're using a bazooka to swat a fly.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Here is a fast algorithm. Please notice that NO INDEXES AT ALL is present.

    SET NOCOUNT ON

    -- Create 100000 sample records

    CREATE TABLE#Sample

    (

    Model CHAR(20) NOT NULL,

    Metric1 INT,

    Metric2 INT

    )

    INSERT#Sample

    (

    Model,

    Metric1,

    Metric2

    )

    SELECTLEFT(REPLACE(CAST(NEWID() AS CHAR(36)), '-', ''), 20),

    ABS(CHECKSUM(NEWID())) % 600000,

    ABS(CHECKSUM(NEWID())) % 600000

    FROMmaster..spt_values AS v1

    INNER JOINmaster..spt_values AS v2 ON v2.Type = 'p'

    WHEREv1.Type = 'p'

    AND v1.Number < 100

    AND v1.Number < 1000

    -- Mimic user supplied parameter

    DECLARE@Model VARCHAR(20)

    SELECT TOP 1@Model = Model

    FROM#Sample

    ORDER BYNEWID()

    -- Timetest staging table

    DBCC FREEPROCCACHE

    DBCC DROPCLEANBUFFERS

    DECLARE @Time DATETIME

    SET @Time = GETDATE()

    -- Stage query result

    DECLARE@Stage TABLE (Model VARCHAR(20), Metric INT, Ranking INT)

    INSERT@Stage

    (

    Model,

    Metric

    )

    SELECT TOP 15Model,

    Metric1

    FROM#Sample

    ORDER BYMetric1 DESC

    UPDATEs

    SETs.Ranking = 1 + (SELECT COUNT(*) FROM @Stage AS x WHERE x.Metric > s.Metric)

    FROM@Stage AS s

    IF NOT EXISTS (SELECT * FROM @Stage WHERE Model = @Model)

    BEGIN

    INSERT@Stage

    (

    Model,

    Metric

    )

    SELECTModel,

    Metric1

    FROM#Sample

    WHEREModel = @Model

    UPDATEs

    SETs.Ranking = 1 + (SELECT COUNT(*) FROM #Sample AS x WHERE x.Metric1 > s.Metric)

    FROM@Stage AS s

    WHEREs.Model = @Model

    END

    SELECT*

    FROM@Stage

    -- Average at 80 ms

    SELECT DATEDIFF(MILLISECOND, @Time, GETDATE())

    -- Timetest CTE

    DBCC FREEPROCCACHE

    DBCC DROPCLEANBUFFERS

    SET @Time = GETDATE()

    ;WITH MyCTE (Model, Row, Ranking, Metric1)

    AS (

    SELECTModel,

    ROW_NUMBER() OVER (ORDER BY Metric1 DESC),

    RANK() OVER (ORDER BY Metric2 DESC),

    Metric1

    FROM#Sample

    )

    SELECTTOP 15 WITH TIES

    Ranking,

    Model,

    Metric1

    FROMMyCTE

    WHERERow <= 15

    OR Model = @Model

    ORDER BYCASE WHEN Model = @Model AND Row > 15 THEN 15 ELSE Row END

    -- Average at 983 ms

    SELECT DATEDIFF(MILLISECOND, @Time, GETDATE())


    N 56°04'39.16"
    E 12°55'05.25"

  • Statistics IO for

    "Staging algorithm"

    Table '#2180FB33'. Scan count 0, logical reads 15, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table '#Sample_____________________________________________________________________________________________________________00000000001C'. Scan count 1, logical reads 940, physical reads 0, read-ahead reads 30, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table '#2180FB33'. Scan count 16, logical reads 31, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table 'Worktable'. Scan count 1, logical reads 33, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table '#2180FB33'. Scan count 1, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Table '#2180FB33'. Scan count 1, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    "CTE algorithm"

    Table '#Sample_____________________________________________________________________________________________________________00000000001C'. Scan count 3, logical reads 940, physical reads 0, read-ahead reads 30, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.


    N 56°04'39.16"
    E 12°55'05.25"

  • Matt Miller (2/5/2008)


    Notice that my latest calls your Temp table solution, the CTE solution.

    You really should give some of this a whirl. Running RANK() to the whole set just to get 10 or 11 records back is ultimately a waste. It's the same as running a count to check for existence - you're using a bazooka to swat a fly.

    Hey guys, thanks for all your suggestions. I was hoping we'd find some syntax or technique that could produce a rank() and allow the query engine to skip/ignore some of the candidate rows, but it looks like that's not possible.

    Ranking is extremely valuable for certain uses. Raw figures aren't that useful unless they're put into context with the competition. For example, if sales dropped 15% last month, the first thing a sales manager might do is run a ranking report to figure out where they are relative to everyone else. So if their rank hasn't changed, there's reason to panic (since everyone else's sales dropped also). Top N Plus will let them see the top performers and themselves even if they aren't in the top cut and avoid having to run two seperate reports. This is valuable since once you go past the top 5 or 10, everyone is bunched up very closely.

    Ranking also let's them put a spin on numbers and find out where their strengths are. So they've also asked for reports that will locate the categories in which they're ranked N or higher (we're #3 in the southeast and #2 in Michigan) and alerts that will notify them if their category rank changes.

  • antonio.collins (2/5/2008)


    Hey guys, thanks for all your suggestions. I was hoping we'd find some syntax or technique that could produce a rank() and allow the query engine to skip/ignore some of the candidate rows, but it looks like that's not possible.

    Huh?

    What about the algorithm posted by me 34 minutes earlier?

    http://www.sqlservercentral.com/Forums/FindPost451626.aspx


    N 56°04'39.16"
    E 12°55'05.25"

  • peso: you're trying to speed up the selection of items from the cte/temp table and that's not the issue. even in my original temp table approach, it took 150s to load the temp table and only 70ms to get the proper data out of it.

  • Well, in my temp table algorithm above it takes an average of 80 milliseconds to fill the temptable and get the rankings calculated...

    Are you sure we are talking about the same thing?

    In my previous post, the population of #Sample table is just to mimic your environment, to get the 100000 records to work with.

    Have you checked the code below that?

    From the "Timetest staging" line to before "Timetest CTE" ?

    CREATE PROCEDURE dbo.uspGetMyModels

    (

    @Top TINYINT,

    @Model VARCHAR(20)

    )

    AS

    SET NOCOUNT ON

    -- Stage query result

    DECLARE @Stage TABLE

    (

    Model VARCHAR(20),

    Metric INT,

    Ranking INT

    )

    -- Populate stagin table

    INSERT@Stage

    (

    Model,

    Metric

    )

    SELECTTOP (@Top)

    Model,

    Metric1

    FROM{YourTableNameHere}

    ORDER BYMetric1 DESC

    -- Update ranking for top n records

    UPDATEs

    SETs.Ranking = 1 + (SELECT COUNT(*) FROM @Stage AS x WHERE x.Metric > s.Metric)

    FROM@Stage AS s

    -- If user supplied model is not among the top n records, add it

    IF NOT EXISTS (SELECT * FROM @Stage WHERE Model = @Model)

    BEGIN

    INSERT@Stage

    (

    Model,

    Metric

    )

    SELECTModel,

    Metric1

    FROM{YourTableNameHere}

    WHEREModel = @Model

    UPDATEs

    SETs.Ranking = 1 + (SELECT COUNT(*) FROM {YourTableNameHere} AS x WHERE x.Metric1 > s.Metric)

    FROM@Stage AS s

    WHEREs.Model = @Model

    END

    -- Show the expected result

    SELECTModel,

    Metric

    FROM@Stage


    N 56°04'39.16"
    E 12°55'05.25"

Viewing 15 posts - 16 through 30 (of 41 total)

You must be logged in to reply to this topic. Login to reply