PERCENTILE_DISC for median calculation

  • Question: is PERCENTILE_DISC for median calculation works for both odd and even number of records?

    In SQL Server 2008 to calculate median we used this code based on modulo (@modulo was calculated before this)

    case @modulo

    when 1 then

    (SELECT MAX([TimeInDepartmentMinutes]) FROM

    (SELECT TOP 50 PERCENT [TimeInDepartmentMinutes] FROM dbo.vw_ED_Time_Calc

    where ArrivalDateTime >= @StartDate and ArrivalDateTime <= @EndDate and Psyc ='yes' and HospitalName = @Hospital

    ORDER BY [TimeInDepartmentMinutes]) AS Median)

    else -- even

    (SELECT

    ( ( SELECT MAX([TimeInDepartmentMinutes]) FROM

    ( SELECT TOP 50 PERCENT [TimeInDepartmentMinutes]

    FROM dbo.vw_ED_Time_Calc where ArrivalDateTime >= @StartDate and ArrivalDateTime <= @EndDate and Psyc ='yes' and HospitalName = @Hospital

    ORDER BY 1--==>

    ) AS m1 )

    + (

    SELECT MIN([TimeInDepartmentMinutes]) FROM

    ( SELECT TOP 50 PERCENT [TimeInDepartmentMinutes] FROM dbo.vw_ED_Time_Calc

    where ArrivalDateTime >= @StartDate and ArrivalDateTime <= @EndDate and Psyc ='yes' and HospitalName = @Hospital

    ORDER BY 1 DESC --==>

    ) AS m2)

    ) / 2 AS Me)

    end AS Median_LOS_All_Psyc_Patients

  • Run a test and find out. 😉  When you're done testing it, tell us what you found. 😀

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Test result (table has 4 records: 1,2,3,4):

    Select distinct PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY [recordvalue] OVER (PARTITION BY null) FROM  __testForMedian
    Result: 2 
    Correct answer: 2.5 (in Excel =MEDIAN(1,2,3,4))
    Hence
    1. PERCENTILE_DISC
    (0.5) is the same as SELECT MAX([recordvalue]) FROM (SELECT TOP 50 PERCENT recordvalue  FROM ... etc
    2.
    PERCENTILE_DISC(0.5) can be used for ODD number of records. For EVEN nuber of records we need to apply two step strategy listed in my original code after ELSE ...
    Thanks

  • So, in reference to your original question of...

    Question: is PERCENTILE_DISC for median calculation works for both odd and even number of records?


    ...what is your conclusion?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • PERCENTILE_DISC (0.5) does not provide the median for EVEN number of records. Only for ODD number of records. Actually this is the same as MAX (Select 50 percent ... - see my original code above
    Thanks Jeff. You are always very helpful.

  • valeryk2000 - Sunday, December 16, 2018 12:50 PM

    PERCENTILE_DISC (0.5) does not provide the median for EVEN number of records. Only for ODD number of records. Actually this is the same as MAX (Select 50 percent ... - see my original code above
    Thanks Jeff. You are always very helpful.

    Interesting.  Thanks for taking the time to test.  I'll have to look into that a bit more because BOL makes no such stipulation (or at least no one that I've seen).  Might just be that BOL needs a correction.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Ah... I think I found it.  From BOL...

    Return Types

    The return type is determined by the order_by_expression type.

    Kind of makes sense.  So, with Integer values of 1,2,3,4, it returns the number closest to the median.  I don't have 2012 or above to play with just now but if you run the same test as you did but with a DECIMAL(9,5) for a datatype, what do you get for a result there?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Same result:
    2.00000

  • valeryk2000 - Sunday, December 16, 2018 2:57 PM

    Same result:
    2.00000

    Yowch.  That's just not right.  Thank you very much for testing.  I should be able to log into my 2016 work box in a couple of minutes from now.  I'll do the same test.

    Actually, for Percentile_Disc, it's absolutely correct.  See my explanation and test code below.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Heh.... lordy.  I just reread BOL for PERCENTILE_DISC ("disc" means "discrete").. It INTENTIONALLY works the way your tests came out.  Here's the description found at the official MS documentation at the following URL.  So, no... BOL is NOT incorrect.  It's spot on.

    https://docs.microsoft.com/en-us/sql/t-sql/functions/percentile-disc-transact-sql?view=sql-server-2017

    Computes a specific percentile for sorted values in an entire rowset or within distinct partitions of a rowset in SQL Server. For a given percentile value P, PERCENTILE_DISC sorts the values of the expression in the ORDER BY clause and returns the value with the smallest CUME_DIST value (with respect to the same sort specification) that is greater than or equal to P. For example, PERCENTILE_DISC (0.5) will compute the 50th percentile (that is, the median) of an expression. PERCENTILE_DISC calculates the percentile based on a discrete distribution of the column values; the result is equal to a specific value in the column.

    It turns out that PERCENTILE_CONT ("cont" means "continuous") will calculate the values that you'd normally expect from a "median".  The two functions are designed to return different things.  PERCENTILE_DISC returns an existing value from the column that is the maximum value less than or equal to the median.  PERCENTILE_CONT returns the calculated median (interpolated value of the two "middle" values for even numbers of rows.

    With all of that in mind, please run the following test code to see what I'm talking about.

    --===== Create the test table with INT and DECIMAL datatypes
       DROP TABLE IF EXISTS #MyHead;
     CREATE TABLE #MyHead
            (
             SomeInt        INT
            ,SomeDecimal    DECIMAL(9,5)
            )
    ;
    --===== Populate the both columns with identical values (1 thru 4) for an EVEN number of rows.
     INSERT INTO #MyHead
            (SomeInt,SomeDecimal)
     SELECT SomeInt,SomeDecimal
       FROM (VALUES (1,1),(2,2),(3,3),(4,4))v(SomeInt,SomeDecimal)
    ;
    --===== Run the test for both PERCENTILE_* functions against both columns
     SELECT  DISTINCT
             PrecentileContINT = PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY SomeInt) OVER ()
            ,PercentileDiscINT = PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY SomeInt) OVER ()
            ,PrecentileContDEC = PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY SomeDecimal) OVER ()
            ,PercentileDiscDEC = PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY SomeDecimal) OVER ()
       FROM #MyHead
    ;
    --===== Add a 5th row to the test table to make the number of rows and ODD number.
     INSERT INTO #MyHead
            (SomeInt,SomeDecimal)
     SELECT SomeInt,SomeDecimal
       FROM (VALUES (10,10))v(SomeInt,SomeDecimal)
    ;
    --===== Run the exact same test again.
     SELECT  DISTINCT
             PrecentileContINT = PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY SomeInt) OVER ()
            ,PercentileDiscINT = PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY SomeInt) OVER ()
            ,PrecentileContDEC = PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY SomeDecimal) OVER ()
            ,PercentileDiscDEC = PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY SomeDecimal) OVER ()
       FROM #MyHead
    ;

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • BTW... the way to get the code to display without all the extra blank lines in it is to copy'n'paste the code into NOTEPAD first.  Then, copy'n'paste from NOTEPAD to in between the SQL Code tags that appear when you hit the "SQL Code" button at the bottom of the edit window.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • One more BTW.... I was going to test performance of the PERCENTILE_CONT function against some other methods.  With the disclaimer that I've not verified the testing (and it DOES need to be verified because of possible "Devils in the Data"), it would appear that PERCENTILE_CONT is the worst performing of many different methods for calculating the MEDIAN according to the following article...
    https://sqlperformance.com/2012/08/t-sql-queries/median

    And, on that note, I'll say that, yet again... "Change is inevitable... change for the better is not". 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Yes,  it is PERCENTILE_CONT(0.5) that returns median. I'll play with it tomorrow.
    Thanks

  • Jeff Moden - Sunday, December 16, 2018 5:37 PM

    One more BTW.... I was going to test performance of the PERCENTILE_CONT function against some other methods.  With the disclaimer that I've not verified the testing (and it DOES need to be verified because of possible "Devils in the Data"), it would appear that PERCENTILE_CONT is the worst performing of many different methods for calculating the MEDIAN according to the following article...
    https://sqlperformance.com/2012/08/t-sql-queries/median

    And, on that note, I'll say that, yet again... "Change is inevitable... change for the better is not". 😉

    One question what about NULL values

  • valeryk2000 - Sunday, December 16, 2018 7:00 PM

    Jeff Moden - Sunday, December 16, 2018 5:37 PM

    One more BTW.... I was going to test performance of the PERCENTILE_CONT function against some other methods.  With the disclaimer that I've not verified the testing (and it DOES need to be verified because of possible "Devils in the Data"), it would appear that PERCENTILE_CONT is the worst performing of many different methods for calculating the MEDIAN according to the following article...
    https://sqlperformance.com/2012/08/t-sql-queries/median

    And, on that note, I'll say that, yet again... "Change is inevitable... change for the better is not". 😉

    One question what about NULL values

    PERCENTILE_CONT and PERCENTILE_DISC both ignore null values.

    But, to be honest, I wouldn't use either.  I was testing some the code in the performance article and got to the last contender and had an extreme bout with deja vue... I'd seen the code before.  I "dialed back" and found the code that Peter Larsson wrote that uses ORDER BY/OFFSET/FETCH that he was given credit for by the author.  Peter wrote it way back in 2009 and it absolutely blows the doors off the PERCENTILE_* functions.  PERCENTILE_CONT took 1 minutes an 23 seconds to find the median on 10 million rows.  While that sounds pretty good, Peter's method finds it in sub-second times.  No... that's not a misprint.  Peter's method takes less than a second!!!  His formulas are absolutely brilliant in their simplicity.

    I tried to find Peter's original article on the subject... unfortunately, the link I had no longer works and the link I found doesn't either.  However, it IS the last contender in the article that Aaron wrote and you can get the code from there.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Viewing 15 posts - 1 through 15 (of 42 total)

You must be logged in to reply to this topic. Login to reply