Nasty Fast PERCENT_RANK

  • Comments posted to this topic are about the item Nasty Fast PERCENT_RANK

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Excellent article Alan, very nice work indeed!

    😎

  • Thanks Eirikur!

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

    This is the kind of query that kills some servers - ranking. It's hard to make programmers understand how big an impact the performance of their code can have on server speed - they all seem to think that throwing more hardware at it will solve all the world's problems...

    Unfortunately, it does not work that way. Thanks for the article!

  • Wonderful work and article Alan!

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Thank you Kevin and Mauricio.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

    Thanks. I've been collecting performance counters on several servers. The easy calculations like average, extremes, and deviation tend to wash out performance details. I want to get a "real" picture via a set of several percentiles. Unfortunately, I don't think percentiles are easy to aggregate. For example, if I have averages and deviations for data collected every 5 minutes, I can use them to get the hourly values. From what I can tell, I have to recalculate the percentiles from scratch. I noticed that PERCENT_RANK is costly, so a faster method would be appreciated.
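
    For instance, rolling the 5-minute aggregates up to hourly ones works fine for mean and deviation (a rough sketch only; the counter table and column names below are hypothetical, not from this thread), but there is no equivalent shortcut for a percentile:

    -- Hypothetical 5-minute rollup table with SampleCount, AvgValue, and StDevValue per counter.
    -- Mean and standard deviation can be rebuilt from the partial aggregates via weighted sums;
    -- percentiles would still need the raw samples.
    SELECT
        CounterName,
        SampleHour   = DATEADD(HOUR, DATEDIFF(HOUR, 0, SampleTime), 0),
        HourlyAvg    = SUM(1.*SampleCount*AvgValue)/SUM(SampleCount),
        HourlyStDev  = SQRT(SUM(SampleCount*(StDevValue*StDevValue + AvgValue*AvgValue))/SUM(SampleCount)
                       - POWER(SUM(1.*SampleCount*AvgValue)/SUM(SampleCount), 2))
    FROM dbo.CounterSamples5Min
    GROUP BY CounterName, DATEADD(HOUR, DATEDIFF(HOUR, 0, SampleTime), 0);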

    RandyHelpdesk: Perhaps I'm not the only one that does not know what you are doing. πŸ˜‰

  • Thanks Alan, it's always fun to learn faster ways to do things, even if it's not something I expect to be using πŸ™‚

    While my SQL programming skills are generally sufficient to understand what's going on, there's one thing I'm not seeing the point of. Why are you multiplying the PERCENT_RANK() by 1. ? Or does "@pr = 1.*PERCENT_RANK()..." mean something else I'm not aware of? For that matter, in the line "PercentileRank = 1.0*(rk-1)/(nr-1)", multiplying something by 1 again seems to me to be adding overhead. Does it cause a change in the execution plan, or maybe the formatting of the final result?

    Thanks,

    Ken

  • From the Article:


    Bring on the one million row test harness!!!

    Heh... I love it! It's like a war cry! πŸ˜€

    Very cool article, Alan. Well done.

    Hats off also for honoring Dwain. He was a great man. His passing left a huge gap in this fine community.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

    Usually multiplying by 1. or 1.0 is a shortcut to cast from integer to decimal based on datatype precedence. That is what is being done in the latter example: RANK() and COUNT() return integer values, so dividing one integer by another yields an integer quotient; multiplying by 1.0 or 1. first changes the result to decimal. I am not sure why it is being done in the former example, though. As stated, the PERCENT_RANK() window function returns a float, which is higher in precedence than decimal, so a float multiplied by a decimal is still a float:

    Datatype precedence: https://msdn.microsoft.com/en-us/library/ms190309.aspx
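
    A quick way to confirm this (a minimal sketch; SQL_VARIANT_PROPERTY simply reports the base type of the expression it is given):

    -- PERCENT_RANK() returns float; multiplying it by the decimal literal 1. leaves it float.
    -- Multiplying an integer expression by 1. does change the result type to numeric.
    WITH t(v) AS (SELECT v FROM (VALUES (1),(2),(3)) x(v))
    SELECT TOP (1)
        pr_type     = SQL_VARIANT_PROPERTY(PERCENT_RANK() OVER (ORDER BY v), 'BaseType'),    -- float
        pr_1x_type  = SQL_VARIANT_PROPERTY(1.*PERCENT_RANK() OVER (ORDER BY v), 'BaseType'), -- float
        int_1x_type = SQL_VARIANT_PROPERTY(1.*(v-1)/2, 'BaseType')                           -- numeric
    FROM t;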

  • Ken Hiatt (6/7/2016)


    Thanks Alan, it's always fun to learn faster ways to do things, even if it's not something I expect to be using πŸ™‚

    While my SQL programming skills are generally sufficient to understand what's going on, there's one thing I'm not seeing the point of. Why are you multiplying the PERCENT_RANK() by 1. ? Or does "@pr = 1.*PERCENT_RANK()..." mean something else I'm not aware of? For that matter, in the line "PercentileRank = 1.0*(rk-1)/(nr-1)", multiplying something by 1 again seems to me to be adding overhead. Does it cause a change in the execution plan, or maybe the formatting of the final result?

    Thanks,

    Ken

    It affects the formatting of the final result. The "1.*" trick is something seen around SQLServerCentral to return a decimal value instead of a whole number when dividing whole numbers. Note the difference here:

    WITH SampleValues(nr,rn,SomeValue) AS
    (
        SELECT TOP(5) COUNT(*) OVER (), RANK() OVER (ORDER BY SomeValue), SomeValue
        FROM (VALUES (1),(2),(3),(4),(5)) t(SomeValue)
    )
    SELECT
        SomeValue,
        Percent_Rank_int     = (rn-1)/(nr-1),    -- whole number
        Percent_Rank_numeric = 1.*(rn-1)/(nr-1)  -- numeric(32,11)
    FROM SampleValues;

    Results:

    SomeValue   Percent_Rank_int   Percent_Rank_numeric
    ----------- ------------------ ---------------------
    1           0                  0.00000000000
    2           0                  0.25000000000
    3           0                  0.50000000000
    4           0                  0.75000000000
    5           1                  1.00000000000

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Jeff Moden (6/7/2016)


    From the Article:


    Bring on the one million row test harness!!!

    Heh... I love it! It's like a war cry! πŸ˜€

    Very cool article, Alan. Well done.

    Hats off also for honoring Dwain. He was a great man. His passing left a huge gap in this fine community.

    Thank you very much Jeff!

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Thanks guys, just the info I was missing. Much appreciated.

  • Great article.

    Can someone explain the following to me, please?

    So now we know the window aggregate function is the culprit. Since we’re not partitioning the data yet (but will in a moment), the row count (nr) only needs to be calculated once. Let’s re-write the pre-2012 version like so:

    SELECT SalesPersonID, SaleYear, SaleAmt, -- Below is the (rk – 1) / (nr – 1) calculation:
        PercentileRank = 1.*(RANK() OVER (ORDER BY SaleAmt)-1)/(SELECT COUNT(*)-1 FROM ##Sales)
    FROM ##Sales;

    I was always under the impression that putting a query in the SELECT clause causes the query to be run for every row. According to what the author wrote, it is quite the opposite?

    I would normally store the count in a variable first

    SET @ROWS = (SELECT COUNT(*)-1 FROM ##Sales)

    Is my understanding incorrect?

  • Mike Is Here (6/9/2016)


    Great article.

    Thank you very much. πŸ˜€

    So now we know the window aggregate function is the culprit. Since we’re not partitioning the data yet (but will in a moment), the row count (nr) only needs to be calculated once. Let’s re-write the pre-2012 version like so:

    SELECT SalesPersonID, SaleYear, SaleAmt, -- Below is the (rk – 1) / (nr – 1) calculation:
        PercentileRank = 1.*(RANK() OVER (ORDER BY SaleAmt)-1)/(SELECT COUNT(*)-1 FROM ##Sales)
    FROM ##Sales;

    I was always under the impression that putting a query in the SELECT clause causes the query to be run for every row. According to what the author wrote, it is quite the opposite?

    In some cases that is true, but not here. When you use a correlated subquery, such as the ones demonstrated in this article, then yes. That is why correlated subqueries are often evil. In this case, however, there is no reference in the subquery to the outer query, and the optimizer is smart enough to know that the result only needs to be calculated once, which is what happens.
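
    To illustrate the difference (a quick sketch against the same ##Sales table; the alias names are just for this example):

    -- Correlated: the inner query references the outer row (s.SalesPersonID),
    -- so it is logically evaluated once per outer row.
    SELECT s.SalesPersonID, s.SaleAmt,
           PerPersonCount = (SELECT COUNT(*) FROM ##Sales x WHERE x.SalesPersonID = s.SalesPersonID)
    FROM ##Sales s;

    -- Non-correlated: no reference to the outer query, so the optimizer can
    -- compute the scalar once and reuse it for every row.
    SELECT s.SalesPersonID, s.SaleAmt,
           TotalCount = (SELECT COUNT(*) FROM ##Sales)
    FROM ##Sales s;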

    I would normally store the count in a variable first

    SET @ROWS = (SELECT COUNT(*)-1 FROM ##Sales)

    Is my understanding incorrect?

    This will be equally efficient and is a valid alternative. I did it without a variable for people that would like to use this logic inside a view or inline table valued function (neither of which support variables).
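
    As a minimal sketch of the inline TVF option (the function name and the permanent dbo.Sales table are hypothetical stand-ins here, since functions cannot reference temp tables like ##Sales):

    -- Because the row count comes from a scalar subquery rather than a variable,
    -- the same logic can be wrapped in an inline table-valued function.
    CREATE FUNCTION dbo.SalesPercentileRank()
    RETURNS TABLE
    AS
    RETURN
        SELECT SalesPersonID, SaleYear, SaleAmt,
               PercentileRank = 1.*(RANK() OVER (ORDER BY SaleAmt)-1)/(SELECT COUNT(*)-1 FROM dbo.Sales)
        FROM dbo.Sales;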

    Here are two performance tests against 1,000,000 rows. First, an I/O test.

    SET NOCOUNT ON;
    SET STATISTICS IO ON;

    PRINT CHAR(10)+'Pre-aggregation using a variable:'+CHAR(10)+REPLICATE('-',50);
    DECLARE @ROWS int;
    SET @ROWS = (SELECT COUNT(*)-1 FROM ##Sales);

    SELECT SalesPersonID, SaleYear, SaleAmt, -- Below is the (rk – 1) / (nr – 1) calculation:
        PercentileRank = 1.*(RANK() OVER (ORDER BY SaleAmt)-1)/@ROWS
    FROM ##Sales;

    PRINT 'Method from the article:'+CHAR(10)+REPLICATE('-',50);

    SELECT SalesPersonID, SaleYear, SaleAmt, -- Below is the (rk – 1) / (nr – 1) calculation:
        PercentileRank = 1.*(RANK() OVER (ORDER BY SaleAmt)-1)/(SELECT COUNT(*)-1 FROM ##Sales)
    FROM ##Sales;

    SET STATISTICS IO OFF;

    Results:

    Pre-aggregation using a variable:
    --------------------------------------------------
    Table '##Sales'. Scan count 1, logical reads 2728, physical reads 1, read-ahead reads 48, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table '##Sales'. Scan count 1, logical reads 2731, physical reads 1, read-ahead reads 38, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Method from the article:
    --------------------------------------------------
    Table '##Sales'. Scan count 6, logical reads 5489, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Notice that doing it both ways produces relatively the same number of reads. The technique I wrote about is just producing the reads all in one batch.

    And a 1M row performance test:

    EXEC dbo.GenerateSalesData 1, 1000000, 100, 2000; --1M rows, performance test

    PRINT CHAR(10)+'Pre-aggregation using a variable:'+CHAR(10)+REPLICATE('-',50);
    GO
    DECLARE @st datetime = getdate(), @pr numeric(32,11);
    DECLARE @ROWS int;
    SET @ROWS = (SELECT COUNT(*)-1 FROM ##Sales);

    SELECT @pr = 1.*(RANK() OVER (ORDER BY SaleAmt)-1)/@ROWS
    FROM ##Sales;

    PRINT DATEDIFF(MS,@st,getdate());
    GO 5

    PRINT 'Method from the article:'+CHAR(10)+REPLICATE('-',50);
    GO
    DECLARE @st datetime = getdate(), @pr numeric(32,11);

    SELECT @pr = 1.*(RANK() OVER (ORDER BY SaleAmt)-1)/(SELECT COUNT(*)-1 FROM ##Sales)
    FROM ##Sales;

    PRINT DATEDIFF(MS,@st,getdate());
    GO 5

    Results (in MS):

    Pre-aggregation using a variable:
    --------------------------------------------------
    Beginning execution loop
    560
    566
    564
    573
    550
    Batch execution completed 5 times.

    Method from the article:
    --------------------------------------------------
    Beginning execution loop
    590
    623
    587
    610
    593
    Batch execution completed 5 times.

    Relatively the same.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001
