Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase

Calculating Median and Percentiles Expand / Collapse
Author
Message
Posted Tuesday, May 18, 2010 9:15 AM


Old Hand

Old HandOld HandOld HandOld HandOld HandOld HandOld HandOld Hand

Group: General Forum Members
Last Login: Monday, July 14, 2014 6:36 AM
Points: 349, Visits: 2,526
Hi everyone, I've been looking at calculating the median and percentile values for some of our data. For the purpose of this particular set of calculations the data itself in its simplest form is a list of specialties with a length of stay (LOS) assigned to it. I have the following that I have come up with and was wondering if there is a more efficient way of doing it (One of the developers brought up the possibly use of NTILE Function, but I wasn't convinced).

Test Data:
/*
For testing purposes, rather than using real specialty names I've generated them as 2 alpha characters.
The update on the test table converts all 0 values to NULL as those patients with a LOS value of 0
are deemed as "Daycase" rather than "Inpatient" and are treated slightly differently.
*/

IF OBJECT_ID(N'tempdb..#data', N'U') IS NOT NULL
DROP TABLE #Data;
GO

CREATE TABLE #Data
(
Specialty VARCHAR(2),
LOS INT
);

--Generate Number Table
;WITH N1(N) AS (
SELECT N
FROM (SELECT 0 AS N0, 1 AS N1, 2 AS N2, 3 AS N3, 4 AS N4, 5 AS N5, 6 AS N6, 7 AS N7, 8 AS N8, 9 AS N9) AS N0
UNPIVOT (N FOR Nx IN (N0, N1, N2, N3, N4, N5, N6, N7, N8, N9)) AS unpvt
),--10
N2(N) AS (SELECT 1 FROM N1 AS A CROSS JOIN N1 AS B), --100
N3(N) AS (SELECT 1 FROM N2 AS A CROSS JOIN N2 AS B), --10000
CTETally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY N ASC) FROM N3)

--Generate Some Test Data
INSERT INTO #Data (Specialty, LOS)
SELECT
CHAR(ABS(CHECKSUM(NEWID()))%26+65)
+ CHAR(ABS(CHECKSUM(NEWID()))%26+65),
ABS(CHECKSUM(NEWID()))%30
FROM
CTETally T
WHERE N <= 10000;

--Replace 0 LOS with NULL
UPDATE #Data
SET LOS = NULLIF(LOS,0);

--Check Test Data
SELECT DA.Specialty, DA.LOS FROM #Data AS DA
ORDER BY DA.Specialty ASC, DA.LOS ASC;

The Query:
--Derive additional Test data
;WITH Data (Specialty, LOS, Rank, Cnt)
AS
(
SELECT DA.Specialty,
DA.LOS,
ROW_NUMBER() OVER (PARTITION BY DA.Specialty ORDER BY DA.LOS),
COUNT(*) OVER (PARTITION BY DA.Specialty)
FROM #Data AS DA
WHERE DA.LOS IS NOT NULL
)

--Calculate Percentiles/Median - Include MIN/MAX figures just for sense check
SELECT DA1.Specialty,
MIN(DA1.LOS) AS MinLos,
MAX(DA1.LOS) AS MaxLos,
AVG(CASE
WHEN DA1.Rank IN ((DA1.Cnt+1)/10, (DA1.Cnt/10)+1) THEN 1.0 * DA1.LOS
ELSE NULL
END
) AS Percentile10,
AVG(CASE
WHEN DA1.Rank IN ( (DA1.Cnt+1)/4, (DA1.Cnt/4)+1) THEN 1.0 * DA1.LOS
ELSE NULL
END
) AS Percentile25,
AVG(CASE
WHEN DA1.Rank IN ( (DA1.Cnt+1)/2, (DA1.Cnt/2)+1) THEN 1.0 * DA1.LOS
ELSE NULL
END
) AS Percentile50,
AVG(CASE
WHEN DA1.Rank IN ( ((DA1.Cnt+1)/4)*3, ((DA1.Cnt/4)*3)+1) THEN 1.0 * DA1.LOS
ELSE NULL
END
) AS Percentile75
FROM Data AS DA1
GROUP BY DA1.Specialty;


Any help is much appreciated!
Post #923660
Posted Wednesday, May 19, 2010 1:00 AM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: Today @ 11:31 AM
Points: 36,728, Visits: 31,180
That's a pretty interesting way to calculate "Median". Haven't tried it but it looks fast. I know a different way that we can test together tommorow night. It's 3AM here and I better hit the hay so I can get up in 3 hours or so.

As a side bar... really nice job on the test table.


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #924083
Posted Thursday, May 20, 2010 2:45 AM


Old Hand

Old HandOld HandOld HandOld HandOld HandOld HandOld HandOld Hand

Group: General Forum Members
Last Login: Monday, July 14, 2014 6:36 AM
Points: 349, Visits: 2,526
I'd certainly be interested to see any other methods people have to compare.

Also, I'm sure I've read on this forum at some point that it may be slightly more efficient to use 0.0 + Integer rather than 1.0 * Integer to prevent a rounded integer figure as the answer. Is that actually the case?
Post #924907
Posted Thursday, May 20, 2010 7:38 AM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: Today @ 11:31 AM
Points: 36,728, Visits: 31,180
I haven't actually tested it but I believe that +0.0 is more effecient. At the machine language level, a simple add takes about 1-2 clock cycles... whereas a multiply takes about 4-6 depending on the processor.

I got tied up at work yesterday and didn't have much time for the forum. Perhaps the best thing to do is to refer you to a rather long winded post I did for someone else concerning MEDIAN calculations. I just need to find it and I'll try to do that after work tonight.


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #925127
Posted Monday, May 9, 2011 9:45 PM


SSCertifiable

SSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiable

Group: General Forum Members
Last Login: Today @ 9:57 AM
Points: 7,080, Visits: 12,571
I was just researching Median on SQL Server last week. Here is a good series of articles from some very qualified folks where a back-and-forth discussion broke out on the topic about which method was fastest:

1. December 15, 2006 - Aaron Bertrand blog showing Itzik Ben-Gan method for finding median: http://sqlblog.com/blogs/aaron_bertrand/archive/2006/12/15/428.aspx
2. December 18, 2006 - Adam Machanic blog showing a Joe Celko method faster than Itzik Ben-Gan's method: http://sqlblog.com/blogs/adam_machanic/archive/2006/12/18/437.aspx
3. December 20, 2006 - Peter DeBetta blog entry showing a revised Itzik Ben-Gan method is faster than the Joe Celko method: http://sqlblog.com/blogs/peter_debetta/archive/2006/12/20/Medians_Actual_Query_Cost_and_Statistics.aspx

And here is a buffet of methods from Celko for cross-reference:

- April 05, 2009 - T-SQL median adventure with Celko: http://www.simple-talk.com/sql/t-sql-programming/median-workbench/


__________________________________________________________________________________________________
There are no special teachers of virtue, because virtue is taught by the whole community. --Plato
Post #1105837
Posted Wednesday, May 11, 2011 6:59 AM


Say Hey Kid

Say Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey Kid

Group: General Forum Members
Last Login: Thursday, July 17, 2014 2:01 PM
Points: 687, Visits: 3,002
Thanks for those links.

I've had a bookmark for some time to Ben-Gan's original write-up, which you might want to add to your list of median computation web pages:
http://www.sqlmag.com/article/tsql3/calculating-the-median-gets-simpler-in-sql-server-2005

That article explains how/why the method works. I found the logic confusing at first until I absorbed the fact that integer division in T-SQL truncates remainders. Here are my notes I created for myself after running Ben-Gan's method:
/*
HOW IT WORKS (Returns median of [Value]), relying on INTeger division truncating remainders:
The CTE's ROW_NUMBER function orders the data in sequence of [Value], partitioned here by EmployeeID, thought that's not required.
THE CTE also includes a Count() function for the number of rows per EmployeeID.

Then the main query selects AVG([Value]) as the median, from a list of 2 rows specified by WHERE:
- RowNum = (Cnt + 1)/2
- RowNum = (Cnt + 2)/2

Consider the 2 cases possible - an even or odd number for Cnt:
EVEN (e.g., Cnt=16)
(Cnt + 1)/2 = 8! b/c Cnt is by default an INT and SQL's INT division truncates the remainder of 17/2
(Cnt + 2)/2 = 9.
So the average of the 8th and 9th rows is returned as the median.
ODD (e.g., Cnt=15)
(Cnt + 1)/2 = 8
(Cnt + 2)/2 = 8 (again, INTeger math truncation).
The average of the 8th and 8th rows is returned as the median.
*/

Rich
Post #1106866
Posted Wednesday, May 11, 2011 7:35 AM


SSCertifiable

SSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiable

Group: General Forum Members
Last Login: Today @ 9:57 AM
Points: 7,080, Visits: 12,571
Very cool. Thanks for the additional link and for your notes.

__________________________________________________________________________________________________
There are no special teachers of virtue, because virtue is taught by the whole community. --Plato
Post #1106918
« Prev Topic | Next Topic »

Add to briefcase

Permissions Expand / Collapse