Calculating Median and Percentiles

  • Hi everyone, I've been looking at calculating the median and percentile values for some of our data. For the purposes of this particular set of calculations, the data in its simplest form is a list of specialties, each with a length of stay (LOS) assigned to it. I have come up with the following and was wondering whether there is a more efficient way of doing it (one of the developers brought up the possible use of the NTILE function, but I wasn't convinced).

    Test Data:

    /*
    For testing purposes, rather than using real specialty names I've generated them as 2 alpha characters.
    The update on the test table converts all 0 values to NULL as those patients with a LOS value of 0
    are deemed as "Daycase" rather than "Inpatient" and are treated slightly differently.
    */
    IF OBJECT_ID(N'tempdb..#Data', N'U') IS NOT NULL
        DROP TABLE #Data;
    GO

    CREATE TABLE #Data
    (
        Specialty VARCHAR(2),
        LOS       INT
    );

    --Generate Number Table
    ;WITH N1(N) AS (
        SELECT N
        FROM (SELECT 0 AS N0, 1 AS N1, 2 AS N2, 3 AS N3, 4 AS N4, 5 AS N5, 6 AS N6, 7 AS N7, 8 AS N8, 9 AS N9) AS N0
        UNPIVOT (N FOR Nx IN (N0, N1, N2, N3, N4, N5, N6, N7, N8, N9)) AS unpvt
    ), --10
    N2(N) AS (SELECT 1 FROM N1 AS A CROSS JOIN N1 AS B), --100
    N3(N) AS (SELECT 1 FROM N2 AS A CROSS JOIN N2 AS B), --10000
    CTETally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY N ASC) FROM N3)
    --Generate Some Test Data
    INSERT INTO #Data (Specialty, LOS)
    SELECT
        CHAR(ABS(CHECKSUM(NEWID())) % 26 + 65)
            + CHAR(ABS(CHECKSUM(NEWID())) % 26 + 65),
        ABS(CHECKSUM(NEWID())) % 30
    FROM CTETally T
    WHERE N <= 10000;

    --Replace 0 LOS with NULL
    UPDATE #Data
    SET LOS = NULLIF(LOS, 0);

    --Check Test Data
    SELECT DA.Specialty, DA.LOS
    FROM #Data AS DA
    ORDER BY DA.Specialty ASC, DA.LOS ASC;

    The Query:

    --Derive additional Test data
    ;WITH Data (Specialty, LOS, Rank, Cnt)
    AS
    (
        SELECT  DA.Specialty,
                DA.LOS,
                ROW_NUMBER() OVER (PARTITION BY DA.Specialty ORDER BY DA.LOS),
                COUNT(*) OVER (PARTITION BY DA.Specialty)
        FROM    #Data AS DA
        WHERE   DA.LOS IS NOT NULL
    )
    --Calculate Percentiles/Median - Include MIN/MAX figures just for sense check
    SELECT  DA1.Specialty,
            MIN(DA1.LOS) AS MinLos,
            MAX(DA1.LOS) AS MaxLos,
            AVG(CASE
                    WHEN DA1.Rank IN ((DA1.Cnt + 1) / 10, (DA1.Cnt / 10) + 1) THEN 1.0 * DA1.LOS
                    ELSE NULL
                END) AS Percentile10,
            AVG(CASE
                    WHEN DA1.Rank IN ((DA1.Cnt + 1) / 4, (DA1.Cnt / 4) + 1) THEN 1.0 * DA1.LOS
                    ELSE NULL
                END) AS Percentile25,
            AVG(CASE
                    WHEN DA1.Rank IN ((DA1.Cnt + 1) / 2, (DA1.Cnt / 2) + 1) THEN 1.0 * DA1.LOS
                    ELSE NULL
                END) AS Percentile50,
            AVG(CASE
                    WHEN DA1.Rank IN (((DA1.Cnt + 1) / 4) * 3, ((DA1.Cnt / 4) * 3) + 1) THEN 1.0 * DA1.LOS
                    ELSE NULL
                END) AS Percentile75
    FROM    Data AS DA1
    GROUP BY DA1.Specialty;

    Any help is much appreciated!

  • That's a pretty interesting way to calculate "Median". Haven't tried it but it looks fast. I know a different way that we can test together tomorrow night. It's 3AM here and I'd better hit the hay so I can get up in 3 hours or so.

    As a sidebar... really nice job on the test table.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I'd certainly be interested to see any other methods people have to compare.

    Also, I'm sure I've read on this forum at some point that it may be slightly more efficient to use 0.0 + Integer rather than 1.0 * Integer to stop the answer coming back as a truncated integer. Is that actually the case?
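
    For illustration, here's a minimal snippet of the kind of thing I mean (nothing to do with the query above, just the two forms side by side; either one stops AVG from returning a truncated INT):

    --AVG over INT values returns an INT; 1.0 * or 0.0 + forces a decimal result
    SELECT  AVG(x)       AS IntAvg,      -- 2   (truncated)
            AVG(1.0 * x) AS MultiplyAvg, -- 2.5
            AVG(0.0 + x) AS AddAvg       -- 2.5
    FROM    (VALUES (2), (3)) AS v(x);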

  • I haven't actually tested it but I believe that +0.0 is more efficient. At the machine language level, a simple add takes about 1-2 clock cycles... whereas a multiply takes about 4-6 depending on the processor.

    I got tied up at work yesterday and didn't have much time for the forum. Perhaps the best thing to do is to refer you to a rather long-winded post I did for someone else concerning MEDIAN calculations. I just need to find it, and I'll try to do that after work tonight.

    --Jeff Moden



  • I was just researching Median on SQL Server last week. Here is a good series of articles from some very qualified folks, where a back-and-forth discussion broke out about which method was fastest:

    1. December 15, 2006 - Aaron Bertrand blog showing Itzik Ben-Gan method for finding median: http://sqlblog.com/blogs/aaron_bertrand/archive/2006/12/15/428.aspx

    2. December 18, 2006 - Adam Machanic blog showing a Joe Celko method faster than Itzik Ben-Gan's method: http://sqlblog.com/blogs/adam_machanic/archive/2006/12/18/437.aspx

    3. December 20, 2006 - Peter DeBetta blog entry showing a revised Itzik Ben-Gan method is faster than the Joe Celko method: http://sqlblog.com/blogs/peter_debetta/archive/2006/12/20/Medians_Actual_Query_Cost_and_Statistics.aspx

    And here is a buffet of methods from Celko for cross-reference:

    - April 05, 2009 - T-SQL median adventure with Celko: http://www.simple-talk.com/sql/t-sql-programming/median-workbench/

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Thanks for those links.

    I've had a bookmark to Ben-Gan's original write-up for some time; you might want to add it to your list of median-computation web pages:

    http://www.sqlmag.com/article/tsql3/calculating-the-median-gets-simpler-in-sql-server-2005

    That article explains how and why the method works. I found the logic confusing at first until I absorbed the fact that integer division in T-SQL truncates remainders. Here are the notes I created for myself after running Ben-Gan's method:

    /*
    HOW IT WORKS (returns the median of [Value]), relying on INTeger division truncating remainders:

    The CTE's ROW_NUMBER function orders the data in sequence of [Value], partitioned here by EmployeeID, though that's not required.
    The CTE also includes a COUNT() function for the number of rows per EmployeeID.
    Then the main query selects AVG([Value]) as the median, from the 2 rows specified by the WHERE clause:
        - RowNum = (Cnt + 1)/2
        - RowNum = (Cnt + 2)/2

    Consider the 2 possible cases - an even or odd number for Cnt:

    EVEN (e.g., Cnt = 16)
        (Cnt + 1)/2 = 8 (not 8.5!) because Cnt is by default an INT and SQL's INT division truncates the remainder of 17/2.
        (Cnt + 2)/2 = 9.
        So the average of the 8th and 9th rows is returned as the median.

    ODD (e.g., Cnt = 15)
        (Cnt + 1)/2 = 8.
        (Cnt + 2)/2 = 8 (again, INTeger math truncation).
        The average of the 8th and 8th rows is returned as the median.
    */
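
    A minimal sketch of the pattern those notes describe (the table name dbo.SomeTable below is just a placeholder, not Ben-Gan's exact code):

    ;WITH C AS
    (
        SELECT  EmployeeID,
                [Value],
                ROW_NUMBER() OVER (PARTITION BY EmployeeID ORDER BY [Value]) AS RowNum,
                COUNT(*) OVER (PARTITION BY EmployeeID) AS Cnt
        FROM    dbo.SomeTable   --placeholder table name
    )
    SELECT   EmployeeID,
             AVG(1.0 * [Value]) AS Median  --average of the middle row(s); 1.0 * avoids INT truncation
    FROM     C
    WHERE    RowNum IN ((Cnt + 1) / 2, (Cnt + 2) / 2)
    GROUP BY EmployeeID;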

    Rich

  • Very cool. Thanks for the additional link and for your notes.

