Calculating Median, Max, Min, Lower Quartile and Upper Quartile for a Boxchart?

  • Well, definitely not the prettiest thing I've ever written, but it at least works with the data you have. I have no idea how it would scale on a much larger dataset. Give it a try and let me know what you think.

    with yearCounts as

    (

    select DATEPART(year, CourseDate) CourseYear, COUNT(*) GradeCount

    from EStats

    group by DATEPART(year, CourseDate)

    ),

    qOrder as

    (

    select *, ROW_NUMBER() over (partition by y.CourseYear order by e.Grade) rn

    from EStats e

    join yearCounts y on y.CourseYear = DATEPART(year, e.CourseDate)

    )

    select q.CourseYear

    , SUM(case when q.GradeCount % 2 = 1 then case when q.rn = (q.GradeCount+1)/2 then q.Grade else 0 end

    else case when q.rn in ((q.GradeCount/2),(q.GradeCount/2)+1) then convert(decimal(4,1), q.Grade)/2 else 0 end end) Median

    , SUM(case when q.rn = q.GradeCount then q.Grade else 0 end) [Max]

    , SUM(case when q.rn = 1 then q.Grade else 0 end) [Min]

    , SUM(case when q.GradeCount % 2 = 1 then case when ((q.GradeCount - 1)/2) % 2 = 1 then

    case when q.rn = (((q.GradeCount-1)/2)+1)/2 then q.Grade else 0 end

    else case when q.rn in ((((q.GradeCount-1)/2)/2),(((q.GradeCount-1)/2)/2)+1) then convert(decimal(4,1), q.Grade)/2 else 0 end end

    else case when (q.GradeCount/2) % 2 = 1 then case when q.rn = ((q.GradeCount/2)+1)/2 then q.Grade else 0 end

    else case when q.rn in (((q.GradeCount/2)/2),((q.GradeCount/2)/2)+1) then convert(decimal(4,1), q.Grade)/2 else 0 end end end) LowerQ

    , SUM(case when q.GradeCount % 2 = 1 then case when ((q.GradeCount - 1)/2) % 2 = 1 then

    case when q.rn = q.GradeCount - ((((q.GradeCount-1)/2)+1)/2) + 1 then q.Grade else 0 end

    else case when q.rn in (q.GradeCount - (((q.GradeCount-1)/2)/2) + 1,q.GradeCount - ((((q.GradeCount-1)/2)/2)+1) + 1) then convert(decimal(4,1), q.Grade)/2 else 0 end end

    else case when (q.GradeCount/2) % 2 = 1 then case when q.rn = q.GradeCount - (((q.GradeCount/2)+1)/2) + 1 then q.Grade else 0 end

    else case when q.rn in (q.GradeCount - ((q.GradeCount/2)/2) + 1,q.GradeCount - (((q.GradeCount/2)/2)+1) + 1) then convert(decimal(4,1), q.Grade)/2 else 0 end end end) UpperQ

    from qOrder q

    group by q.CourseYear

  • roryp 96873 (1/23/2013)


    Well, definitely not the prettiest thing I've ever written...

    Are you kidding? Ugly is my middle name... 😛

    DataAnalyst011 (1/23/2013)


    Year Median Max Min LowerQ UpperQ

    2010 90 99 81 88 92

    2011 76 88 69 72.5 84.5

    2012 90.5 97 72 86.5 94.5

    I don't think the value in bold is correct. If I understand your computation correctly you're trying to find the median of the lower half of the grades for 2011, which would be in this set from your input stream:

    114 69.0 2011-03-02

    115 70.0 2011-03-02

    116 75.0 2011-03-02

    118 76.0 2011-03-02

    113 76.0 2011-03-02

    And that value is 116's score of 75.

    So here is my solution, my ugly baby. Gotta love it cause it's my baby!

    ;WITH [Stats] AS (

    SELECT PersonID, Grade, CourseDate, [Year]

    ,Median=CASE WHEN Count1%2 = 0 AND rn1 IN (Count1/2, (Count1/2)+1) THEN Grade

    WHEN Count1%2 = 1 AND rn1 = (Count1/2)+1 THEN Grade END

    -- Used to establish median over quartiles

    ,rn2=ROW_NUMBER() OVER (PARTITION BY [Year], Quartile ORDER BY Grade)

    ,Count2=COUNT(*) OVER (PARTITION BY [Year], Quartile)

    ,Quartile

    FROM (

    SELECT PersonID, Grade=CAST(Grade AS DECIMAL(10,1)), CourseDate, [Year]

    -- Used to establish median over years

    ,rn1=ROW_NUMBER() OVER (PARTITION BY [Year] ORDER BY Grade)

    ,Count1=COUNT(*) OVER (PARTITION BY [Year])

    -- Break the grades into two groups to establish "median" in a quartile

    ,Quartile=NTILE(2) OVER (PARTITION BY [Year] ORDER BY Grade)

    FROM EStats

    CROSS APPLY (SELECT [Year]=LEFT(CourseDate, 4)) a) a

    )

    SELECT [Year]

    ,Median=AVG(Median)

    ,[Max]=MAX(Grade)

    ,[Min]=MIN(Grade)

    ,LowerQ=AVG(CASE

    WHEN Quartile = 1 AND Count2%2 = 0 AND rn2 IN (Count2/2, (Count2/2)+1) THEN Grade

    WHEN Quartile = 1 AND Count2%2 = 1 AND rn2 = (Count2/2)+1 THEN Grade END)

    ,UpperQ=AVG(CASE

    WHEN Quartile = 2 AND Count2%2 = 0 AND rn2 IN (Count2/2, (Count2/2)+1) THEN Grade

    WHEN Quartile = 2 AND Count2%2 = 1 AND rn2 = (Count2/2)+1 THEN Grade END)

    FROM [Stats]

    GROUP BY [Year]

    Let me know if this helps.

    Edit: Tidied up my solution a bit.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • dwain.c (1/23/2013)


    I don't think the value in bold is correct. If I understand your computation correctly you're trying to find the median of the lower half of the grades for 2011, which would be in this set from your input stream:

    114 69.0 2011-03-02

    115 70.0 2011-03-02

    116 75.0 2011-03-02

    118 76.0 2011-03-02

    113 76.0 2011-03-02

    And that value is 116's score of 75.

    I know there are a couple different schools of thought when calculating the quartiles. The way I always learned was that when you have an odd number of values (such as this case for the year 2011), the median is the number in the (n+1)/2 position then you do not include that number when breaking up the dataset. So you would exclude 113's grade in your case since it was selected as the median and then calculate the median of 69, 70, 75, and 75 to get the lower quartile which is 72.5.

  • I LOVE the solutions you all provided! Thanks a ton for the help! I'm learning a lot from playing with what you all wrote in my test environment. Super helpful.

    Dwain, I'm going to convert yours over to production and see if this will do the trick. I have high hopes. I also think this thread will be useful for anyone else needing to write something for a box plot. I already don't like them 🙂

  • Once I changed your table definition for the Grade column from VARCHAR to NUMERIC(10,2), this works for me:

    WITH cte AS

    (

    SELECT CourseDate, PersonID, Grade,

    ROW_NUMBER() OVER(PARTITION BY CourseDate ORDER BY Grade) AS RowNum,

    COUNT(*) OVER(PARTITION BY CourseDate) As cnt

    FROM estats

    )

    SELECT CourseDate, AVG(Grade) AS Median

    FROM cte

    WHERE RowNum IN((Cnt + 1) / 2, (Cnt + 2) / 2)

    GROUP BY CourseDate

    ORDER BY CourseDate;

    It returns medians of 90, 76, and 90.5 for CourseDates of 2010-03-01, 2011-03-02, and 2012-03-01, respectively.

    I'm not sure why you would define the Grade column to be VARCHAR() when all your data are numerics. I haven't looked into this more, but it may be that this is a source of error. I suspect that you're getting an implicit conversion from VARCHAR() to INT, which leads to loss of necessary accuracy when the final AVG() function is called.

    Rich

  • thank u

Viewing 6 posts - 16 through 20 (of 20 total)

You must be logged in to reply to this topic. Login to reply