Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase ««12

Calculating Median, Max, Min, Lower Quartile and Upper Quartile for a Boxchart? Expand / Collapse
Author
Message
Posted Tuesday, January 22, 2013 6:00 PM


Hall of Fame

Hall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of Fame

Group: General Forum Members
Last Login: Today @ 6:03 PM
Points: 3,609, Visits: 5,221
DataAnalyst011 (1/17/2013)

The output should be something like below with values filled in for x:
Year   Median   Max   Min  LowerQ   UpperQ
2010 x x x x x
2011 x x x x x
2012 x x x x x


CREATE TABLE EStats    
(
PersonID VARCHAR(30) NOT NULL,
Grade VARCHAR(25) NOT NULL,
CourseDate Date NOT NULL
)
;

INSERT INTO EStats
(
PersonID, Grade, CourseDate
)

VALUES
('100', '91', '2010-03-01'),
('101', '96', '2010-03-01'),
('102', '88', '2010-03-01'),
('103', '92', '2010-03-01'),
('104', '81', '2010-03-01'),
('105', '85', '2010-03-01'),
('106', '91', '2010-03-01'),
('107', '89', '2010-03-01'),
('108', '99', '2010-03-01'),
('109', '88', '2010-03-01'),
('110', '81', '2011-03-02'),
('111', '77', '2011-03-02'),
('112', '88', '2011-03-02'),
('113', '76', '2011-03-02'),
('114', '69', '2011-03-02'),
('115', '70', '2011-03-02'),
('116', '75', '2011-03-02'),
('117', '88', '2011-03-02'),
('118', '76', '2011-03-02'),
('119', '95', '2012-03-01'),
('120', '96', '2012-03-01'),
('121', '90', '2012-03-01'),
('122', '80', '2012-03-01'),
('123', '85', '2012-03-01'),
('124', '94', '2012-03-01'),
('125', '89', '2012-03-01'),
('126', '97', '2012-03-01'),
('127', '94', '2012-03-01'),
('128', '72', '2012-03-01'),
('129', '88', '2012-03-01'),
('130', '91', '2012-03-01')



I would love to try and help but help me out please. Can you fill in the expected results with exactly the real results that you expect to see from the given sample data?

I initially did some fumbling around but I couldn't quite 'ken exactly what you were after so I kind of punted with the suggestion I made earlier.

If I have a clear target, I'm usually pretty good at hitting it.



My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.


Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
Since random numbers are too important to be left to chance, let's generate some!
Learn to understand recursive CTEs by example.
Splitting strings based on patterns can be fast!
Post #1410304
Posted Wednesday, January 23, 2013 7:45 AM
SSC Journeyman

SSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC Journeyman

Group: General Forum Members
Last Login: Thursday, July 10, 2014 3:36 PM
Points: 92, Visits: 338
Thanks a ton, Dwain.

Just for background (may not be needed) here are a couple of links regarding the math for box-and-whisker charts (or box plots)

http://www.purplemath.com/modules/boxwhisk.htm

https://www.khanacademy.org/math/probability/descriptive-statistics/Box-and-whisker%20plots/v/box-and-whisker-plots

Essentially, I need five things: median, upper quartile, lower quartile, min, and max. Median, min, and max are pretty self explanatory. The lower quartile for a box plot takes the median of all of the numbers below the median for the entire set. The upper quartile takes the median for all the numbers above the median for the entire set.

With that in view, here should be the result set for my sample DDL:

Year   Median   Max   Min  LowerQ   UpperQ
2010 90 99 81 88 92
2011 76 88 69 72.5 84.5
2012 90.5 97 72 86.5 94.5


Again, thanks alot for taking this on. If I can provide further explanation, please let me know!
Post #1410589
Posted Wednesday, January 23, 2013 11:30 AM
SSC Journeyman

SSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC Journeyman

Group: General Forum Members
Last Login: Thursday, July 10, 2014 3:36 PM
Points: 92, Visits: 338
EDIT: On review, the original dataset above is correct. I really apologize for the confusion. I'll put in back in when I get back to my office.
Post #1410731
Posted Wednesday, January 23, 2013 1:33 PM
SSC Journeyman

SSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC Journeyman

Group: General Forum Members
Last Login: Thursday, July 10, 2014 3:36 PM
Points: 92, Visits: 338
Okay, two posts above does have the correct dataset (which was the original dataset). I was confused for a moment whether the median in a quartile included the two averaged numbers when total row count is even. It does. In the case of the values 1 through 10 the median would be 5.5, and the quartiles would be the median of everything above 5.5 (6, 7, 8, 9, 10) and below 5.5 (5, 4, 3, 2, 1). So the upper quartile would be 8 and the lower quartile would be 3.

Sorry for the momentary confusion!
Post #1410778
Posted Wednesday, January 23, 2013 2:50 PM
SSC Eights!

SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!

Group: General Forum Members
Last Login: Today @ 8:33 PM
Points: 950, Visits: 2,875
Hi

Thought I would have a go at this. I've used a CTE to order and number the results by course year.
I've also created the statistics in separate queries to try and make it a be easier to manage and read.
To determine the medians I have used rounding to determine low and high row number for each median. These can be the same number. There is probably a nicer way to do this
I've left the columns in the queries that I was using to validate my median choices.
They are all joined together in the final query. I have also added a Geometry to visualize it.

;with cte as (
select ROW_NUMBER() OVER (PARTITION BY year(CourseDate) ORDER BY GRADE) RN,
COUNT(*) OVER (PARTITION BY year(CourseDate)) C,
ROUND(((COUNT(*) OVER (PARTITION BY year(CourseDate)) + 1) / 2.0) - .1, 0) HC1,
ROUND(((COUNT(*) OVER (PARTITION BY year(CourseDate)) + 1) / 2.0) + .1, 0) HC2,
year(CourseDate) CourseYear,
CAST(Grade AS NUMERIC(3)) Grade
from EStats
)
,minmax as (
select courseYear, min(Grade) minGrade, max(Grade) maxGrade, min(c) c
from cte
group by courseYear
)
,median as (
select courseYear, avg(grade) medianGrade
, cast(min(hc1) as int) hc1, cast(min(hc2) as int) hc2
from cte
where rn in (hc1, hc2)
group by courseYear
)
,lowQtr as (
select courseyear, avg(grade) lowQtrGrade
,cast(min(round((hc1 - 1 + abs(hc2 - hc1) + 1) / 2.0 - .1,0)) as int) l1,cast(min(round((hc1 - 1 + abs(hc2 - hc1) + 1) / 2.0 + .1,0)) as int) l2
from cte
where rn in (round((hc1 - 1 + abs(hc2 - hc1) + 1) / 2.0 - .1,0),round((hc1 - 1 + abs(hc2 - hc1) + 1) / 2.0 + .1,0))
group by courseyear
)
,highQtr as (
select courseyear, avg(grade) highQtrGrade
,cast(min(round((hc1 - 1 + abs(hc2 - hc1) + 1) / 2.0 - .1,0) + hc1) as int) h1,cast(min(round((hc1 - 1 + abs(hc2 - hc1) + 1) / 2.0 + .1,0) + hc1) as int) h2
from cte
where rn in (round((hc1 - 1 + abs(hc2 - hc1) + 1) / 2.0 - .1,0) + hc1,round((hc1 - 1 + abs(hc2 - hc1) + 1) / 2.0 + .1,0) + hc1)
group by courseyear
)
select mm.courseYear,
cast(md.medianGrade as numeric(5,2)) Median,
cast(mm.maxGrade as numeric(5,2)) Max,
cast(mm.minGrade as numeric(5,2)) Min,
cast(lq.lowQtrGrade as numeric(5,2)) LowerQ,
cast(hq.highQtrGrade as numeric(5,2)) UpperQ,
Geometry::STGeomFromText(
'MULTILINESTRING((' +
cast(cast(minGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric(5)) - 2000) * 10 - 2.5 as varchar(5)) + ',' + cast(cast(minGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric) - 2000) * 10 + 2.5 as varchar(5)) + '),(' + -- minTick
cast(cast(maxGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric(5)) - 2000) * 10 - 2.5 as varchar(5)) + ',' + cast(cast(maxGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric) - 2000) * 10 + 2.5 as varchar(5)) + '),(' + -- maxTick
cast(cast(medianGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric(5)) - 2000) * 10 - 2.5 as varchar(5)) + ',' + cast(cast(medianGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric) - 2000) * 10 + 2.5 as varchar(5)) + '),(' + -- medianTick
cast(cast(lowQtrGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric(5)) - 2000) * 10 - 2.5 as varchar(5)) + ',' + cast(cast(lowQtrGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric) - 2000) * 10 + 2.5 as varchar(5)) + '),(' + -- lowerTick
cast(cast(highQtrGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric(5)) - 2000) * 10 - 2.5 as varchar(5)) + ',' + cast(cast(highQtrGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric) - 2000) * 10 + 2.5 as varchar(5)) + '),(' + -- upperTick
cast(cast(minGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric(5)) - 2000) * 10 as varchar(5)) + ',' + cast(cast(lowQtrGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric) - 2000) * 10 as varchar(5)) + '),(' + -- lowWhisker
cast(cast(highQtrGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric(5)) - 2000) * 10 as varchar(5)) + ',' + cast(cast(maxGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric) - 2000) * 10 as varchar(5)) + '),(' + -- highWhisker
cast(cast(lowQtrGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric(5)) - 2000) * 10 - 2.5 as varchar(5)) + ',' + cast(cast(highQtrGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric) - 2000) * 10 - 2.5 as varchar(5)) + '),(' + -- box1
cast(cast(lowQtrGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric(5)) - 2000) * 10 + 2.5 as varchar(5)) + ',' + cast(cast(highQtrGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric) - 2000) * 10 + 2.5 as varchar(5)) + -- box2
+ '))'
,0) graph
from minmax mm
inner join median md on md.courseyear = mm.courseyear
inner join lowQtr lq on lq.courseyear = mm.courseyear
inner join highQtr hq on hq.courseyear = mm.courseyear

Post #1410805
Posted Wednesday, January 23, 2013 3:51 PM
Say Hey Kid

Say Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey Kid

Group: General Forum Members
Last Login: Today @ 4:48 PM
Points: 689, Visits: 5,912
Well, definitely not the prettiest thing I've ever written, but it at least works with the data you have. I have no idea how it would scale on a much larger dataset. Give it a try and let me know what you think.

with yearCounts as
(
select DATEPART(year, CourseDate) CourseYear, COUNT(*) GradeCount
from EStats
group by DATEPART(year, CourseDate)
),
qOrder as
(
select *, ROW_NUMBER() over (partition by y.CourseYear order by e.Grade) rn
from EStats e
join yearCounts y on y.CourseYear = DATEPART(year, e.CourseDate)
)
select q.CourseYear
, SUM(case when q.GradeCount % 2 = 1 then case when q.rn = (q.GradeCount+1)/2 then q.Grade else 0 end
else case when q.rn in ((q.GradeCount/2),(q.GradeCount/2)+1) then convert(decimal(4,1), q.Grade)/2 else 0 end end) Median
, SUM(case when q.rn = q.GradeCount then q.Grade else 0 end) [Max]
, SUM(case when q.rn = 1 then q.Grade else 0 end) [Min]
, SUM(case when q.GradeCount % 2 = 1 then case when ((q.GradeCount - 1)/2) % 2 = 1 then
case when q.rn = (((q.GradeCount-1)/2)+1)/2 then q.Grade else 0 end
else case when q.rn in ((((q.GradeCount-1)/2)/2),(((q.GradeCount-1)/2)/2)+1) then convert(decimal(4,1), q.Grade)/2 else 0 end end
else case when (q.GradeCount/2) % 2 = 1 then case when q.rn = ((q.GradeCount/2)+1)/2 then q.Grade else 0 end
else case when q.rn in (((q.GradeCount/2)/2),((q.GradeCount/2)/2)+1) then convert(decimal(4,1), q.Grade)/2 else 0 end end end) LowerQ
, SUM(case when q.GradeCount % 2 = 1 then case when ((q.GradeCount - 1)/2) % 2 = 1 then
case when q.rn = q.GradeCount - ((((q.GradeCount-1)/2)+1)/2) + 1 then q.Grade else 0 end
else case when q.rn in (q.GradeCount - (((q.GradeCount-1)/2)/2) + 1,q.GradeCount - ((((q.GradeCount-1)/2)/2)+1) + 1) then convert(decimal(4,1), q.Grade)/2 else 0 end end
else case when (q.GradeCount/2) % 2 = 1 then case when q.rn = q.GradeCount - (((q.GradeCount/2)+1)/2) + 1 then q.Grade else 0 end
else case when q.rn in (q.GradeCount - ((q.GradeCount/2)/2) + 1,q.GradeCount - (((q.GradeCount/2)/2)+1) + 1) then convert(decimal(4,1), q.Grade)/2 else 0 end end end) UpperQ
from qOrder q
group by q.CourseYear

Post #1410826
Posted Wednesday, January 23, 2013 6:25 PM


Hall of Fame

Hall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of Fame

Group: General Forum Members
Last Login: Today @ 6:03 PM
Points: 3,609, Visits: 5,221
roryp 96873 (1/23/2013)
Well, definitely not the prettiest thing I've ever written...


Are you kidding? Ugly is my middle name...

DataAnalyst011 (1/23/2013)
Year   Median   Max   Min  LowerQ   UpperQ
2010 90 99 81 88 92
2011 76 88 69 72.5 84.5
2012 90.5 97 72 86.5 94.5


I don't think the value in bold is correct. If I understand your computation correctly you're trying to find the median of the lower half of the grades for 2011, which would be in this set from your input stream:

114   69.0   2011-03-02
115 70.0 2011-03-02
116 75.0 2011-03-02
118 76.0 2011-03-02
113 76.0 2011-03-02



And that value is 116's score of 75.

So here is my solution, my ugly baby. Gotta love it cause it's my baby!

;WITH [Stats] AS (
SELECT PersonID, Grade, CourseDate, [Year]
,Median=CASE WHEN Count1%2 = 0 AND rn1 IN (Count1/2, (Count1/2)+1) THEN Grade
WHEN Count1%2 = 1 AND rn1 = (Count1/2)+1 THEN Grade END
-- Used to establish median over quartiles
,rn2=ROW_NUMBER() OVER (PARTITION BY [Year], Quartile ORDER BY Grade)
,Count2=COUNT(*) OVER (PARTITION BY [Year], Quartile)
,Quartile
FROM (
SELECT PersonID, Grade=CAST(Grade AS DECIMAL(10,1)), CourseDate, [Year]
-- Used to establish median over years
,rn1=ROW_NUMBER() OVER (PARTITION BY [Year] ORDER BY Grade)
,Count1=COUNT(*) OVER (PARTITION BY [Year])
-- Break the grades into two groups to establish "median" in a quartile
,Quartile=NTILE(2) OVER (PARTITION BY [Year] ORDER BY Grade)
FROM EStats
CROSS APPLY (SELECT [Year]=LEFT(CourseDate, 4)) a) a
)
SELECT [Year]
,Median=AVG(Median)
,[Max]=MAX(Grade)
,[Min]=MIN(Grade)
,LowerQ=AVG(CASE
WHEN Quartile = 1 AND Count2%2 = 0 AND rn2 IN (Count2/2, (Count2/2)+1) THEN Grade
WHEN Quartile = 1 AND Count2%2 = 1 AND rn2 = (Count2/2)+1 THEN Grade END)
,UpperQ=AVG(CASE
WHEN Quartile = 2 AND Count2%2 = 0 AND rn2 IN (Count2/2, (Count2/2)+1) THEN Grade
WHEN Quartile = 2 AND Count2%2 = 1 AND rn2 = (Count2/2)+1 THEN Grade END)
FROM [Stats]
GROUP BY [Year]


Let me know if this helps.

Edit: Tidied up my solution a bit.



My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.


Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
Since random numbers are too important to be left to chance, let's generate some!
Learn to understand recursive CTEs by example.
Splitting strings based on patterns can be fast!
Post #1410864
Posted Thursday, January 24, 2013 8:07 AM
Say Hey Kid

Say Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey Kid

Group: General Forum Members
Last Login: Today @ 4:48 PM
Points: 689, Visits: 5,912
dwain.c (1/23/2013)


I don't think the value in bold is correct. If I understand your computation correctly you're trying to find the median of the lower half of the grades for 2011, which would be in this set from your input stream:

114   69.0   2011-03-02
115 70.0 2011-03-02
116 75.0 2011-03-02
118 76.0 2011-03-02
113 76.0 2011-03-02



And that value is 116's score of 75.



I know there are a couple different schools of thought when calculating the quartiles. The way I always learned was that when you have an odd number of values (such as this case for the year 2011), the median is the number in the (n+1)/2 position then you do not include that number when breaking up the dataset. So you would exclude 113's grade in your case since it was selected as the median and then calculate the median of 69, 70, 75, and 75 to get the lower quartile which is 72.5.
Post #1411158
Posted Thursday, January 24, 2013 12:52 PM
SSC Journeyman

SSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC JourneymanSSC Journeyman

Group: General Forum Members
Last Login: Thursday, July 10, 2014 3:36 PM
Points: 92, Visits: 338
I LOVE the solutions you all provided! Thanks a ton for the help! I'm learning a lot from playing with what you all wrote in my test environment. Super helpful.

Dwain, I'm going to convert yours over to production and see if this will do the trick. I have high hopes. I also think this thread will be useful for anyone else needing to write something for a box plot. I already don't like them
Post #1411309
Posted Friday, February 15, 2013 11:01 AM


Say Hey Kid

Say Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey Kid

Group: General Forum Members
Last Login: Thursday, July 17, 2014 2:01 PM
Points: 687, Visits: 3,002
Once I changed your table definition for the Grade column from VARCHAR to NUMERIC(10,2), this works for me:

WITH cte AS
(
SELECT CourseDate, PersonID, Grade,
ROW_NUMBER() OVER(PARTITION BY CourseDate ORDER BY Grade) AS RowNum,
COUNT(*) OVER(PARTITION BY CourseDate) As cnt
FROM estats
)
SELECT CourseDate, AVG(Grade) AS Median
FROM cte
WHERE RowNum IN((Cnt + 1) / 2, (Cnt + 2) / 2)
GROUP BY CourseDate
ORDER BY CourseDate;

It returns medians of 90, 76, and 90.5 for CourseDates of 2010-03-01, 2011-03-02, and 2012-03-01, respectively.

I'm not sure why you would define the Grade column to be VARCHAR() when all your data are numerics. I haven't looked into this more, but it may be that this is a source of error. I suspect that you're getting an implicit conversion from VARCHAR() to INT, which leads to loss of necessary accuracy when the final AVG() function is called.

Rich
Post #1420689
« Prev Topic | Next Topic »

Add to briefcase ««12

Permissions Expand / Collapse