Recent PostsRecent Posts Popular TopicsPopular Topics
 Home Search Members Calendar Who's On

 Calculating Median and Percentiles Rate Topic Display Mode Topic Options
Author
 Message
 Posted Tuesday, May 18, 2010 9:15 AM
 Old Hand Group: General Forum Members Last Login: 2 days ago @ 10:51 AM Points: 301, Visits: 1,923
 Hi everyone, I've been looking at calculating the median and percentile values for some of our data. For the purpose of this particular set of calculations the data itself in its simplest form is a list of specialties with a length of stay (LOS) assigned to it. I have the following that I have come up with and was wondering if there is a more efficient way of doing it (One of the developers brought up the possibly use of NTILE Function, but I wasn't convinced).Test Data:`/*For testing purposes, rather than using real specialty names I've generated them as 2 alpha characters.The update on the test table converts all 0 values to NULL as those patients with a LOS value of 0are deemed as "Daycase" rather than "Inpatient" and are treated slightly differently.*/IF OBJECT_ID(N'tempdb..#data', N'U') IS NOT NULL DROP TABLE #Data;GOCREATE TABLE #Data ( Specialty VARCHAR(2), LOS INT );--Generate Number Table;WITH N1(N) AS ( SELECT N FROM (SELECT 0 AS N0, 1 AS N1, 2 AS N2, 3 AS N3, 4 AS N4, 5 AS N5, 6 AS N6, 7 AS N7, 8 AS N8, 9 AS N9) AS N0 UNPIVOT (N FOR Nx IN (N0, N1, N2, N3, N4, N5, N6, N7, N8, N9)) AS unpvt ),--10N2(N) AS (SELECT 1 FROM N1 AS A CROSS JOIN N1 AS B), --100N3(N) AS (SELECT 1 FROM N2 AS A CROSS JOIN N2 AS B), --10000CTETally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY N ASC) FROM N3)--Generate Some Test DataINSERT INTO #Data (Specialty, LOS)SELECT CHAR(ABS(CHECKSUM(NEWID()))%26+65) + CHAR(ABS(CHECKSUM(NEWID()))%26+65), ABS(CHECKSUM(NEWID()))%30FROM CTETally TWHERE N <= 10000;--Replace 0 LOS with NULLUPDATE #DataSET LOS = NULLIF(LOS,0);--Check Test DataSELECT DA.Specialty, DA.LOS FROM #Data AS DAORDER BY DA.Specialty ASC, DA.LOS ASC;`The Query:`--Derive additional Test data;WITH Data (Specialty, LOS, Rank, Cnt)AS( SELECT DA.Specialty, DA.LOS, ROW_NUMBER() OVER (PARTITION BY DA.Specialty ORDER BY DA.LOS), COUNT(*) OVER (PARTITION BY DA.Specialty) FROM #Data AS DA WHERE DA.LOS IS NOT NULL)--Calculate Percentiles/Median - Include MIN/MAX figures just for sense checkSELECT DA1.Specialty, MIN(DA1.LOS) AS MinLos, MAX(DA1.LOS) AS MaxLos, AVG(CASE WHEN DA1.Rank IN ((DA1.Cnt+1)/10, (DA1.Cnt/10)+1) THEN 1.0 * DA1.LOS ELSE NULL END ) AS Percentile10, AVG(CASE WHEN DA1.Rank IN ( (DA1.Cnt+1)/4, (DA1.Cnt/4)+1) THEN 1.0 * DA1.LOS ELSE NULL END ) AS Percentile25, AVG(CASE WHEN DA1.Rank IN ( (DA1.Cnt+1)/2, (DA1.Cnt/2)+1) THEN 1.0 * DA1.LOS ELSE NULL END ) AS Percentile50, AVG(CASE WHEN DA1.Rank IN ( ((DA1.Cnt+1)/4)*3, ((DA1.Cnt/4)*3)+1) THEN 1.0 * DA1.LOS ELSE NULL END ) AS Percentile75FROM Data AS DA1GROUP BY DA1.Specialty;`Any help is much appreciated!
Post #923660
 Posted Wednesday, May 19, 2010 1:00 AM
 SSC-Dedicated Group: General Forum Members Last Login: Yesterday @ 11:04 PM Points: 34,706, Visits: 28,897
 That's a pretty interesting way to calculate "Median". Haven't tried it but it looks fast. I know a different way that we can test together tommorow night. It's 3AM here and I better hit the hay so I can get up in 3 hours or so.As a side bar... really nice job on the test table. --Jeff Moden"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".First step towards the paradigm shift of writing Set Based code: Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column." "Change is inevitable. Change for the better is not." -- 04 August 2013(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013Helpful Links:How to post code problemsHow to post performance problems
Post #924083
 Posted Thursday, May 20, 2010 2:45 AM
 Old Hand Group: General Forum Members Last Login: 2 days ago @ 10:51 AM Points: 301, Visits: 1,923
 I'd certainly be interested to see any other methods people have to compare.Also, I'm sure I've read on this forum at some point that it may be slightly more efficient to use 0.0 + Integer rather than 1.0 * Integer to prevent a rounded integer figure as the answer. Is that actually the case?
Post #924907
 Posted Thursday, May 20, 2010 7:38 AM
 SSC-Dedicated Group: General Forum Members Last Login: Yesterday @ 11:04 PM Points: 34,706, Visits: 28,897
 I haven't actually tested it but I believe that +0.0 is more effecient. At the machine language level, a simple add takes about 1-2 clock cycles... whereas a multiply takes about 4-6 depending on the processor. I got tied up at work yesterday and didn't have much time for the forum. Perhaps the best thing to do is to refer you to a rather long winded post I did for someone else concerning MEDIAN calculations. I just need to find it and I'll try to do that after work tonight. --Jeff Moden"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".First step towards the paradigm shift of writing Set Based code: Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column." "Change is inevitable. Change for the better is not." -- 04 August 2013(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013Helpful Links:How to post code problemsHow to post performance problems
Post #925127
 Posted Monday, May 09, 2011 9:45 PM
 SSCertifiable Group: General Forum Members Last Login: Monday, December 16, 2013 5:54 AM Points: 7,062, Visits: 12,468
 I was just researching Median on SQL Server last week. Here is a good series of articles from some very qualified folks where a back-and-forth discussion broke out on the topic about which method was fastest:1. December 15, 2006 - Aaron Bertrand blog showing Itzik Ben-Gan method for finding median: http://sqlblog.com/blogs/aaron_bertrand/archive/2006/12/15/428.aspx2. December 18, 2006 - Adam Machanic blog showing a Joe Celko method faster than Itzik Ben-Gan's method: http://sqlblog.com/blogs/adam_machanic/archive/2006/12/18/437.aspx3. December 20, 2006 - Peter DeBetta blog entry showing a revised Itzik Ben-Gan method is faster than the Joe Celko method: http://sqlblog.com/blogs/peter_debetta/archive/2006/12/20/Medians_Actual_Query_Cost_and_Statistics.aspxAnd here is a buffet of methods from Celko for cross-reference:- April 05, 2009 - T-SQL median adventure with Celko: http://www.simple-talk.com/sql/t-sql-programming/median-workbench/ __________________________________________________________________________________________________There are no special teachers of virtue, because virtue is taught by the whole community. --PlatoBelieve you can and you're halfway there. --Theodore RooseveltEverything Should Be Made as Simple as Possible, But Not Simpler --Albert EinsteinThe significant problems we face cannot be solved at the same level of thinking we were at when we created them. --Albert Einstein1 apple is not exactly 1/8 of 8 apples. Because there are no absolutely identical apples. --Giordy
Post #1105837
 Posted Wednesday, May 11, 2011 6:59 AM
 Say Hey Kid Group: General Forum Members Last Login: Thursday, December 19, 2013 10:52 AM Points: 685, Visits: 2,985
 Thanks for those links.I've had a bookmark for some time to Ben-Gan's original write-up, which you might want to add to your list of median computation web pages:http://www.sqlmag.com/article/tsql3/calculating-the-median-gets-simpler-in-sql-server-2005That article explains how/why the method works. I found the logic confusing at first until I absorbed the fact that integer division in T-SQL truncates remainders. Here are my notes I created for myself after running Ben-Gan's method:`/*HOW IT WORKS (Returns median of [Value]), relying on INTeger division truncating remainders: The CTE's ROW_NUMBER function orders the data in sequence of [Value], partitioned here by EmployeeID, thought that's not required. THE CTE also includes a Count() function for the number of rows per EmployeeID. Then the main query selects AVG([Value]) as the median, from a list of 2 rows specified by WHERE: - RowNum = (Cnt + 1)/2 - RowNum = (Cnt + 2)/2 Consider the 2 cases possible - an even or odd number for Cnt: EVEN (e.g., Cnt=16) (Cnt + 1)/2 = 8! b/c Cnt is by default an INT and SQL's INT division truncates the remainder of 17/2 (Cnt + 2)/2 = 9. So the average of the 8th and 9th rows is returned as the median. ODD (e.g., Cnt=15) (Cnt + 1)/2 = 8 (Cnt + 2)/2 = 8 (again, INTeger math truncation). The average of the 8th and 8th rows is returned as the median.*/`Rich
Post #1106866
 Posted Wednesday, May 11, 2011 7:35 AM
 SSCertifiable Group: General Forum Members Last Login: Monday, December 16, 2013 5:54 AM Points: 7,062, Visits: 12,468
 Very cool. Thanks for the additional link and for your notes. __________________________________________________________________________________________________There are no special teachers of virtue, because virtue is taught by the whole community. --PlatoBelieve you can and you're halfway there. --Theodore RooseveltEverything Should Be Made as Simple as Possible, But Not Simpler --Albert EinsteinThe significant problems we face cannot be solved at the same level of thinking we were at when we created them. --Albert Einstein1 apple is not exactly 1/8 of 8 apples. Because there are no absolutely identical apples. --Giordy
Post #1106918

 Permissions