SQL Clone
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


Calculating Median and Percentiles


Calculating Median and Percentiles

Author
Message
Dohsan
Dohsan
SSC Eights!
SSC Eights! (942 reputation)SSC Eights! (942 reputation)SSC Eights! (942 reputation)SSC Eights! (942 reputation)SSC Eights! (942 reputation)SSC Eights! (942 reputation)SSC Eights! (942 reputation)SSC Eights! (942 reputation)

Group: General Forum Members
Points: 942 Visits: 3482
Hi everyone, I've been looking at calculating the median and percentile values for some of our data. For the purpose of this particular set of calculations the data itself in its simplest form is a list of specialties with a length of stay (LOS) assigned to it. I have the following that I have come up with and was wondering if there is a more efficient way of doing it (One of the developers brought up the possibly use of NTILE Function, but I wasn't convinced).

Test Data:

/*
For testing purposes, rather than using real specialty names I've generated them as 2 alpha characters.
The update on the test table converts all 0 values to NULL as those patients with a LOS value of 0
are deemed as "Daycase" rather than "Inpatient" and are treated slightly differently.
*/

IF OBJECT_ID(N'tempdb..#data', N'U') IS NOT NULL
DROP TABLE #Data;
GO

CREATE TABLE #Data
(
Specialty VARCHAR(2),
LOS INT
);

--Generate Number Table
;WITH N1(N) AS (
SELECT N
FROM (SELECT 0 AS N0, 1 AS N1, 2 AS N2, 3 AS N3, 4 AS N4, 5 AS N5, 6 AS N6, 7 AS N7, 8 AS N8, 9 AS N9) AS N0
UNPIVOT (N FOR Nx IN (N0, N1, N2, N3, N4, N5, N6, N7, N8, N9)) AS unpvt
),--10
N2(N) AS (SELECT 1 FROM N1 AS A CROSS JOIN N1 AS B), --100
N3(N) AS (SELECT 1 FROM N2 AS A CROSS JOIN N2 AS B), --10000
CTETally(N) AS (SELECT ROW_NUMBER() OVER (ORDER BY N ASC) FROM N3)

--Generate Some Test Data
INSERT INTO #Data (Specialty, LOS)
SELECT
CHAR(ABS(CHECKSUM(NEWID()))%26+65)
+ CHAR(ABS(CHECKSUM(NEWID()))%26+65),
ABS(CHECKSUM(NEWID()))%30
FROM
CTETally T
WHERE N <= 10000;

--Replace 0 LOS with NULL
UPDATE #Data
SET LOS = NULLIF(LOS,0);

--Check Test Data
SELECT DA.Specialty, DA.LOS FROM #Data AS DA
ORDER BY DA.Specialty ASC, DA.LOS ASC;



The Query:

--Derive additional Test data
;WITH Data (Specialty, LOS, Rank, Cnt)
AS
(
SELECT DA.Specialty,
DA.LOS,
ROW_NUMBER() OVER (PARTITION BY DA.Specialty ORDER BY DA.LOS),
COUNT(*) OVER (PARTITION BY DA.Specialty)
FROM #Data AS DA
WHERE DA.LOS IS NOT NULL
)

--Calculate Percentiles/Median - Include MIN/MAX figures just for sense check
SELECT DA1.Specialty,
MIN(DA1.LOS) AS MinLos,
MAX(DA1.LOS) AS MaxLos,
AVG(CASE
WHEN DA1.Rank IN ((DA1.Cnt+1)/10, (DA1.Cnt/10)+1) THEN 1.0 * DA1.LOS
ELSE NULL
END
) AS Percentile10,
AVG(CASE
WHEN DA1.Rank IN ( (DA1.Cnt+1)/4, (DA1.Cnt/4)+1) THEN 1.0 * DA1.LOS
ELSE NULL
END
) AS Percentile25,
AVG(CASE
WHEN DA1.Rank IN ( (DA1.Cnt+1)/2, (DA1.Cnt/2)+1) THEN 1.0 * DA1.LOS
ELSE NULL
END
) AS Percentile50,
AVG(CASE
WHEN DA1.Rank IN ( ((DA1.Cnt+1)/4)*3, ((DA1.Cnt/4)*3)+1) THEN 1.0 * DA1.LOS
ELSE NULL
END
) AS Percentile75
FROM Data AS DA1
GROUP BY DA1.Specialty;




Any help is much appreciated!
Jeff Moden
Jeff Moden
SSC Guru
SSC Guru (215K reputation)SSC Guru (215K reputation)SSC Guru (215K reputation)SSC Guru (215K reputation)SSC Guru (215K reputation)SSC Guru (215K reputation)SSC Guru (215K reputation)SSC Guru (215K reputation)

Group: General Forum Members
Points: 215947 Visits: 41984
That's a pretty interesting way to calculate "Median". Haven't tried it but it looks fast. I know a different way that we can test together tommorow night. It's 3AM here and I better hit the hay so I can get up in 3 hours or so.

As a side bar... really nice job on the test table.

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
Dohsan
Dohsan
SSC Eights!
SSC Eights! (942 reputation)SSC Eights! (942 reputation)SSC Eights! (942 reputation)SSC Eights! (942 reputation)SSC Eights! (942 reputation)SSC Eights! (942 reputation)SSC Eights! (942 reputation)SSC Eights! (942 reputation)

Group: General Forum Members
Points: 942 Visits: 3482
I'd certainly be interested to see any other methods people have to compare.

Also, I'm sure I've read on this forum at some point that it may be slightly more efficient to use 0.0 + Integer rather than 1.0 * Integer to prevent a rounded integer figure as the answer. Is that actually the case?
Jeff Moden
Jeff Moden
SSC Guru
SSC Guru (215K reputation)SSC Guru (215K reputation)SSC Guru (215K reputation)SSC Guru (215K reputation)SSC Guru (215K reputation)SSC Guru (215K reputation)SSC Guru (215K reputation)SSC Guru (215K reputation)

Group: General Forum Members
Points: 215947 Visits: 41984
I haven't actually tested it but I believe that +0.0 is more effecient. At the machine language level, a simple add takes about 1-2 clock cycles... whereas a multiply takes about 4-6 depending on the processor.

I got tied up at work yesterday and didn't have much time for the forum. Perhaps the best thing to do is to refer you to a rather long winded post I did for someone else concerning MEDIAN calculations. I just need to find it and I'll try to do that after work tonight.

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
Orlando Colamatteo
Orlando Colamatteo
SSC-Dedicated
SSC-Dedicated (39K reputation)SSC-Dedicated (39K reputation)SSC-Dedicated (39K reputation)SSC-Dedicated (39K reputation)SSC-Dedicated (39K reputation)SSC-Dedicated (39K reputation)SSC-Dedicated (39K reputation)SSC-Dedicated (39K reputation)

Group: General Forum Members
Points: 39860 Visits: 14412
I was just researching Median on SQL Server last week. Here is a good series of articles from some very qualified folks where a back-and-forth discussion broke out on the topic about which method was fastest:

1. December 15, 2006 - Aaron Bertrand blog showing Itzik Ben-Gan method for finding median: http://sqlblog.com/blogs/aaron_bertrand/archive/2006/12/15/428.aspx
2. December 18, 2006 - Adam Machanic blog showing a Joe Celko method faster than Itzik Ben-Gan's method: http://sqlblog.com/blogs/adam_machanic/archive/2006/12/18/437.aspx
3. December 20, 2006 - Peter DeBetta blog entry showing a revised Itzik Ben-Gan method is faster than the Joe Celko method: http://sqlblog.com/blogs/peter_debetta/archive/2006/12/20/Medians_Actual_Query_Cost_and_Statistics.aspx

And here is a buffet of methods from Celko for cross-reference:

- April 05, 2009 - T-SQL median adventure with Celko: http://www.simple-talk.com/sql/t-sql-programming/median-workbench/

__________________________________________________________________________________________________
There are no special teachers of virtue, because virtue is taught by the whole community. --Plato
Rich Mechaber
Rich Mechaber
SSCrazy
SSCrazy (2.8K reputation)SSCrazy (2.8K reputation)SSCrazy (2.8K reputation)SSCrazy (2.8K reputation)SSCrazy (2.8K reputation)SSCrazy (2.8K reputation)SSCrazy (2.8K reputation)SSCrazy (2.8K reputation)

Group: General Forum Members
Points: 2763 Visits: 3671
Thanks for those links.

I've had a bookmark for some time to Ben-Gan's original write-up, which you might want to add to your list of median computation web pages:
http://www.sqlmag.com/article/tsql3/calculating-the-median-gets-simpler-in-sql-server-2005

That article explains how/why the method works. I found the logic confusing at first until I absorbed the fact that integer division in T-SQL truncates remainders. Here are my notes I created for myself after running Ben-Gan's method:
/*
HOW IT WORKS (Returns median of [Value]), relying on INTeger division truncating remainders:
The CTE's ROW_NUMBER function orders the data in sequence of [Value], partitioned here by EmployeeID, thought that's not required.
THE CTE also includes a Count() function for the number of rows per EmployeeID.

Then the main query selects AVG([Value]) as the median, from a list of 2 rows specified by WHERE:
- RowNum = (Cnt + 1)/2
- RowNum = (Cnt + 2)/2

Consider the 2 cases possible - an even or odd number for Cnt:
EVEN (e.g., Cnt=16)
(Cnt + 1)/2 = 8! b/c Cnt is by default an INT and SQL's INT division truncates the remainder of 17/2
(Cnt + 2)/2 = 9.
So the average of the 8th and 9th rows is returned as the median.
ODD (e.g., Cnt=15)
(Cnt + 1)/2 = 8
(Cnt + 2)/2 = 8 (again, INTeger math truncation).
The average of the 8th and 8th rows is returned as the median.
*/



Rich
Orlando Colamatteo
Orlando Colamatteo
SSC-Dedicated
SSC-Dedicated (39K reputation)SSC-Dedicated (39K reputation)SSC-Dedicated (39K reputation)SSC-Dedicated (39K reputation)SSC-Dedicated (39K reputation)SSC-Dedicated (39K reputation)SSC-Dedicated (39K reputation)SSC-Dedicated (39K reputation)

Group: General Forum Members
Points: 39860 Visits: 14412
Very cool. Thanks for the additional link and for your notes.

__________________________________________________________________________________________________
There are no special teachers of virtue, because virtue is taught by the whole community. --Plato
Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum

































































































































































SQLServerCentral


Search