

Old Hand
Group: General Forum Members
Last Login: Monday, October 13, 2014 4:15 AM
Points: 356,
Visits: 2,616


Hi everyone, I've been looking at calculating the median and percentile values for some of our data. For the purpose of this particular set of calculations the data itself in its simplest form is a list of specialties with a length of stay (LOS) assigned to it. I have the following that I have come up with and was wondering if there is a more efficient way of doing it (One of the developers brought up the possibly use of NTILE Function, but I wasn't convinced).
Test Data:
/* For testing purposes, rather than using real specialty names I've generated them as 2 alpha characters. The update on the test table converts all 0 values to NULL as those patients with a LOS value of 0 are deemed as "Daycase" rather than "Inpatient" and are treated slightly differently. */
IF OBJECT_ID(N'tempdb..#data', N'U') IS NOT NULL DROP TABLE #Data; GO
CREATE TABLE #Data ( Specialty VARCHAR(2), LOS INT );
Generate Number Table ;WITH N1(N) AS ( SELECT N FROM (SELECT 0 AS N0, 1 AS N1, 2 AS N2, 3 AS N3, 4 AS N4, 5 AS N5, 6 AS N6, 7 AS N7, 8 AS N8, 9 AS N9) AS N0 UNPIVOT (N FOR Nx IN (N0, N1, N2, N3, N4, N5, N6, N7, N8, N9)) AS unpvt ),10 N2(N) AS (SELECT 1 FROM N1 AS A CROSS JOIN N1 AS B), 100 N3(N) AS (SELECT 1 FROM N2 AS A CROSS JOIN N2 AS B), 10000 CTETally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY N ASC) FROM N3)
Generate Some Test Data INSERT INTO #Data (Specialty, LOS) SELECT CHAR(ABS(CHECKSUM(NEWID()))%26+65) + CHAR(ABS(CHECKSUM(NEWID()))%26+65), ABS(CHECKSUM(NEWID()))%30 FROM CTETally T WHERE N <= 10000;
Replace 0 LOS with NULL UPDATE #Data SET LOS = NULLIF(LOS,0);
Check Test Data SELECT DA.Specialty, DA.LOS FROM #Data AS DA ORDER BY DA.Specialty ASC, DA.LOS ASC;
The Query:
Derive additional Test data ;WITH Data (Specialty, LOS, Rank, Cnt) AS ( SELECT DA.Specialty, DA.LOS, ROW_NUMBER() OVER (PARTITION BY DA.Specialty ORDER BY DA.LOS), COUNT(*) OVER (PARTITION BY DA.Specialty) FROM #Data AS DA WHERE DA.LOS IS NOT NULL )
Calculate Percentiles/Median  Include MIN/MAX figures just for sense check SELECT DA1.Specialty, MIN(DA1.LOS) AS MinLos, MAX(DA1.LOS) AS MaxLos, AVG(CASE WHEN DA1.Rank IN ((DA1.Cnt+1)/10, (DA1.Cnt/10)+1) THEN 1.0 * DA1.LOS ELSE NULL END ) AS Percentile10, AVG(CASE WHEN DA1.Rank IN ( (DA1.Cnt+1)/4, (DA1.Cnt/4)+1) THEN 1.0 * DA1.LOS ELSE NULL END ) AS Percentile25, AVG(CASE WHEN DA1.Rank IN ( (DA1.Cnt+1)/2, (DA1.Cnt/2)+1) THEN 1.0 * DA1.LOS ELSE NULL END ) AS Percentile50, AVG(CASE WHEN DA1.Rank IN ( ((DA1.Cnt+1)/4)*3, ((DA1.Cnt/4)*3)+1) THEN 1.0 * DA1.LOS ELSE NULL END ) AS Percentile75 FROM Data AS DA1 GROUP BY DA1.Specialty;
Any help is much appreciated!




SSCDedicated
Group: General Forum Members
Last Login: Today @ 11:30 PM
Points: 35,364,
Visits: 31,900


That's a pretty interesting way to calculate "Median". Haven't tried it but it looks fast. I know a different way that we can test together tommorow night. It's 3AM here and I better hit the hay so I can get up in 3 hours or so.
As a side bar... really nice job on the test table.
Jeff Moden "RBAR is pronounced "reebar" and is a "Modenism" for "RowByAgonizingRow".
First step towards the paradigm shift of writing Set Based code: Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."
(play on words) "Just because you CAN do something in TSQL, doesn't mean you SHOULDN'T." 22 Aug 2013
Helpful Links: How to post code problems How to post performance problems




Old Hand
Group: General Forum Members
Last Login: Monday, October 13, 2014 4:15 AM
Points: 356,
Visits: 2,616


I'd certainly be interested to see any other methods people have to compare.
Also, I'm sure I've read on this forum at some point that it may be slightly more efficient to use 0.0 + Integer rather than 1.0 * Integer to prevent a rounded integer figure as the answer. Is that actually the case?




SSCDedicated
Group: General Forum Members
Last Login: Today @ 11:30 PM
Points: 35,364,
Visits: 31,900


I haven't actually tested it but I believe that +0.0 is more effecient. At the machine language level, a simple add takes about 12 clock cycles... whereas a multiply takes about 46 depending on the processor.
I got tied up at work yesterday and didn't have much time for the forum. Perhaps the best thing to do is to refer you to a rather long winded post I did for someone else concerning MEDIAN calculations. I just need to find it and I'll try to do that after work tonight.
Jeff Moden "RBAR is pronounced "reebar" and is a "Modenism" for "RowByAgonizingRow".
First step towards the paradigm shift of writing Set Based code: Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."
(play on words) "Just because you CAN do something in TSQL, doesn't mean you SHOULDN'T." 22 Aug 2013
Helpful Links: How to post code problems How to post performance problems




SSCertifiable
Group: General Forum Members
Last Login: Yesterday @ 7:06 PM
Points: 7,125,
Visits: 12,720





Say Hey Kid
Group: General Forum Members
Last Login: Today @ 10:08 AM
Points: 702,
Visits: 3,013


Thanks for those links.
I've had a bookmark for some time to BenGan's original writeup, which you might want to add to your list of median computation web pages: http://www.sqlmag.com/article/tsql3/calculatingthemediangetssimplerinsqlserver2005
That article explains how/why the method works. I found the logic confusing at first until I absorbed the fact that integer division in TSQL truncates remainders. Here are my notes I created for myself after running BenGan's method:
/* HOW IT WORKS (Returns median of [Value]), relying on INTeger division truncating remainders: The CTE's ROW_NUMBER function orders the data in sequence of [Value], partitioned here by EmployeeID, thought that's not required. THE CTE also includes a Count() function for the number of rows per EmployeeID.
Then the main query selects AVG([Value]) as the median, from a list of 2 rows specified by WHERE:  RowNum = (Cnt + 1)/2  RowNum = (Cnt + 2)/2
Consider the 2 cases possible  an even or odd number for Cnt: EVEN (e.g., Cnt=16) (Cnt + 1)/2 = 8! b/c Cnt is by default an INT and SQL's INT division truncates the remainder of 17/2 (Cnt + 2)/2 = 9. So the average of the 8th and 9th rows is returned as the median. ODD (e.g., Cnt=15) (Cnt + 1)/2 = 8 (Cnt + 2)/2 = 8 (again, INTeger math truncation). The average of the 8th and 8th rows is returned as the median. */ Rich




SSCertifiable
Group: General Forum Members
Last Login: Yesterday @ 7:06 PM
Points: 7,125,
Visits: 12,720


Very cool. Thanks for the additional link and for your notes.
__________________________________________________________________________________________________ There are no special teachers of virtue, because virtue is taught by the whole community. Plato



