Calculating Median, Max, Min, Lower Quartile and Upper Quartile for a Boxchart?

  • Hi All - My boss asked me to create a boxchart using SSRS for some of our data. I read articles all morning but honestly, I'm at a loss on how accomplish this. I think I'm growing in my SQL knowledge, but this one has me stumped. I created a sample table that looks like what I'll be working with below. If anyone could lend a hand it would really help me!

    The output should be something like below with values filled in for x:

    Year Median Max Min LowerQ UpperQ

    2010 x x x x x

    2011 x x x x x

    2012 x x x x x

    CREATE TABLE EStats

    (

    PersonIDVARCHAR(30)NOT NULL,

    GradeVARCHAR(25)NOT NULL,

    CourseDateDateNOT NULL

    )

    ;

    INSERT INTO EStats

    (

    PersonID, Grade, CourseDate

    )

    VALUES

    ('100', '91', '2010-03-01'),

    ('101', '96', '2010-03-01'),

    ('102', '88', '2010-03-01'),

    ('103', '92', '2010-03-01'),

    ('104', '81', '2010-03-01'),

    ('105', '85', '2010-03-01'),

    ('106', '91', '2010-03-01'),

    ('107', '89', '2010-03-01'),

    ('108', '99', '2010-03-01'),

    ('109', '88', '2010-03-01'),

    ('110', '81', '2011-03-02'),

    ('111', '77', '2011-03-02'),

    ('112', '88', '2011-03-02'),

    ('113', '76', '2011-03-02'),

    ('114', '69', '2011-03-02'),

    ('115', '70', '2011-03-02'),

    ('116', '75', '2011-03-02'),

    ('117', '88', '2011-03-02'),

    ('118', '76', '2011-03-02'),

    ('119', '95', '2012-03-01'),

    ('120', '96', '2012-03-01'),

    ('121', '90', '2012-03-01'),

    ('122', '80', '2012-03-01'),

    ('123', '85', '2012-03-01'),

    ('124', '94', '2012-03-01'),

    ('125', '89', '2012-03-01'),

    ('126', '97', '2012-03-01'),

    ('127', '94', '2012-03-01'),

    ('128', '72', '2012-03-01'),

    ('129', '88', '2012-03-01'),

    ('130', '91', '2012-03-01')

  • Assume it is the median value that has you stumped, try reading this, it works for myself.

    http://www.mssqltips.com/sqlservertip/2523/script-to-calculate-the-median-value-for-sql-server-data/?utm_source=dailynewsletter&utm_medium=email&utm_content=headline&utm_campaign=20111030

    If everything seems to be going well, you have obviously overlooked something.

    Ron

    Please help us, help you -before posting a question please read[/url]
    Before posting a performance problem please read[/url]

  • I had to do this as well, so you get my supply of bookmarks to help you on your way:

    http://www.sqlmag.com/article/tsql3/calculating-the-median-gets-simpler-in-sql-server-2005

    http://sqlblog.com/blogs/adam_machanic/archive/2006/12/18/medians-row-numbers-and-performance.aspx

    http://sqlblog.com/blogs/peter_debetta/archive/2006/12/20/Medians_Actual_Query_Cost_and_Statistics.aspx

    The first one by Itzik Ben-Gan is the one I used.

    And lastly, a link to one of my prior posts here on SSC. It includes some add'l explanatory notes I wrote to remind myself how Ben-Gan's logic actually works:

    http://www.sqlservercentral.com/Forums/Topic923660-338-1.aspx#bm1105837

    HTH,

    Rich

  • Awesome. Thanks a ton!

  • Actually, I'm having some trouble converting over the example to my sample table above. I just took the grades for 03-01-2012. The median should be the average between 90 and 91 (so 90.5), correct? This gets that result set:

    SELECT grade

    FROM estats

    WHERE CourseDate = '2012-03-01'

    ORDER BY grade DESC

    But when I run the below (which is an attempt to convert the example to my sample data above, I get 89.25. I'm sure I'm making a dumb mistake somewhere, but any help is appreciated.

    WITH cte AS

    (

    SELECT EStats.PersonID, EStats.Grade,

    ROW_NUMBER() OVER(PARTITION BY EStats.PersonID ORDER BY EStats.GRADE) AS RN,

    COUNT(*) OVER(PARTITION BY EStats.PersonID) AS Cnt

    FROM EStats

    WHERE EStats.CourseDate = '2012-03-01'

    )

    SELECT AVG(CAST(cte.Grade AS Numeric)) AS Median

    FROM cte

    WHERE RN IN((cte.Cnt + 1) / 2, (Cnt + 2) / 2)

  • Maybe you need to:

    PARTITION BY LEFT(CourseDate, 4)

    ?


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • Alright, its time to confess I'm still not getting this. Would anyone mind taking my original DDL and calculating median and lower/upper quartiles? I think if I can see it with my example data it would click for me. Any help is really appreciated!

  • Below another attempt, but I seem to be running into the same problem. For instance, 2012-03-01 should produce 90.5, but the below gives me 89.25.

    SELECT

    CourseDate,

    AVG(CAST(Grade AS Numeric))

    FROM

    (

    SELECT

    CourseDate,

    Grade,

    ROW_NUMBER() OVER (

    PARTITION BY PersonID

    ORDER BY Grade ASC, CourseDate ASC) AS RowAsc,

    ROW_NUMBER() OVER (

    PARTITION BY PersonID

    ORDER BY Grade DESC, CourseDate DESC) AS RowDesc

    FROM EStats

    )x

    WHERE

    RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)

    GROUP BY CourseDate

    ORDER BY CourseDate

    UPDATE: I think the problem is my PARTITION BY isn't working. In other words, the inner query produces a row_number with all 1's so there is nothing to match. Any ideas on a fix?

  • Okay, using Dwain's suggestion I'm close to getting everything working. My only problem is the row_number in my inner query isn't working well when the numbers are the same. For instance, when you run my inner query...

    SELECT

    CourseDate,

    Grade,

    ROW_NUMBER() OVER (

    PARTITION BY LEFT(CourseDate, 4)

    ORDER BY Grade ASC) AS RowAsc,

    ROW_NUMBER() OVER (

    PARTITION BY LEFT(CourseDate, 4)

    ORDER BY Grade DESC) AS RowDesc

    FROM EStats

    ...you'll notice it does this for 2010-03-01:

    RowAsc

    10

    9

    8

    6

    7

    5

    3

    4

    2

    1

    I think that is throwing off the snippet I added which Dwain suggested (THANKS, by the way!). Any suggestions on how to fix this?

    For those interested, here's what I have. I'm sure its about the worst way I could do this, but its all I've got at this stage in my SQL knowledge! Once I get the row_number ordering issue worked out, I think it will work.

    WITH Q3 AS

    (

    SELECT

    CourseDate,

    AVG(CAST(Grade AS Numeric)) AS Median

    FROM

    (

    SELECT

    CourseDate,

    Grade,

    ROW_NUMBER() OVER (

    PARTITION BY LEFT(CourseDate, 4)

    ORDER BY Grade ASC) AS RowAsc,

    ROW_NUMBER() OVER (

    PARTITION BY LEFT(CourseDate, 4)

    ORDER BY Grade DESC) AS RowDesc

    FROM EStats

    )x

    WHERE

    RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)

    GROUP BY CourseDate

    --ORDER BY CourseDate

    ),

    Q2 AS

    (

    SELECT

    x.CourseDate,

    AVG(CAST(Grade AS Numeric)) AS LowerQuartile

    FROM

    (

    SELECT

    Estats.CourseDate,

    Estats.Grade,

    ROW_NUMBER() OVER (

    PARTITION BY LEFT(EStats.CourseDate, 4)

    ORDER BY Grade ASC) AS RowAsc,

    ROW_NUMBER() OVER (

    PARTITION BY LEFT(Estats.CourseDate, 4)

    ORDER BY Grade DESC) AS RowDesc

    FROM EStats JOIN Q3 on EStats.CourseDate = Q3.CourseDate

    WHERE EStats.Grade < Q3.Median

    )x

    WHERE

    RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)

    GROUP BY x.CourseDate

    ),

    Q4 AS

    (

    SELECT

    x.CourseDate,

    AVG(CAST(Grade AS Numeric)) AS UpperQuartile

    FROM

    (

    SELECT

    Estats.CourseDate,

    Estats.Grade,

    ROW_NUMBER() OVER (

    PARTITION BY LEFT(EStats.CourseDate, 4)

    ORDER BY Grade ASC) AS RowAsc,

    ROW_NUMBER() OVER (

    PARTITION BY LEFT(Estats.CourseDate, 4)

    ORDER BY Grade DESC) AS RowDesc

    FROM EStats JOIN Q3 on EStats.CourseDate = Q3.CourseDate

    WHERE EStats.Grade > Q3.Median

    )x

    WHERE

    RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)

    GROUP BY x.CourseDate

    )

    SELECT Q3.CourseDate, Q3.Median AS Median, Q2.LowerQuartile, Q4.UpperQuartile, MIN(EStats.Grade) AS Min, MAX(EStats.Grade) AS Max

    FROM Q3

    JOIN Q2 ON Q3.CourseDate = Q2.CourseDate

    JOIN Q4 ON Q3.CourseDate = Q4.CourseDate

    JOIN EStats ON Q3.CourseDate = EStats.CourseDate

    GROUP BY Q3.CourseDate, Q3.Median, Q2.LowerQuartile, Q4.UpperQuartile

    ORDER BY Q3.CourseDate

  • I'm just going to bump this because I rewrote the above post and added what I've got to date...

  • DataAnalyst011 (1/17/2013)


    The output should be something like below with values filled in for x:

    Year Median Max Min LowerQ UpperQ

    2010 x x x x x

    2011 x x x x x

    2012 x x x x x

    CREATE TABLE EStats

    (

    PersonIDVARCHAR(30)NOT NULL,

    GradeVARCHAR(25)NOT NULL,

    CourseDateDateNOT NULL

    )

    ;

    INSERT INTO EStats

    (

    PersonID, Grade, CourseDate

    )

    VALUES

    ('100', '91', '2010-03-01'),

    ('101', '96', '2010-03-01'),

    ('102', '88', '2010-03-01'),

    ('103', '92', '2010-03-01'),

    ('104', '81', '2010-03-01'),

    ('105', '85', '2010-03-01'),

    ('106', '91', '2010-03-01'),

    ('107', '89', '2010-03-01'),

    ('108', '99', '2010-03-01'),

    ('109', '88', '2010-03-01'),

    ('110', '81', '2011-03-02'),

    ('111', '77', '2011-03-02'),

    ('112', '88', '2011-03-02'),

    ('113', '76', '2011-03-02'),

    ('114', '69', '2011-03-02'),

    ('115', '70', '2011-03-02'),

    ('116', '75', '2011-03-02'),

    ('117', '88', '2011-03-02'),

    ('118', '76', '2011-03-02'),

    ('119', '95', '2012-03-01'),

    ('120', '96', '2012-03-01'),

    ('121', '90', '2012-03-01'),

    ('122', '80', '2012-03-01'),

    ('123', '85', '2012-03-01'),

    ('124', '94', '2012-03-01'),

    ('125', '89', '2012-03-01'),

    ('126', '97', '2012-03-01'),

    ('127', '94', '2012-03-01'),

    ('128', '72', '2012-03-01'),

    ('129', '88', '2012-03-01'),

    ('130', '91', '2012-03-01')

    I would love to try and help but help me out please. Can you fill in the expected results with exactly the real results that you expect to see from the given sample data?

    I initially did some fumbling around but I couldn't quite 'ken exactly what you were after so I kind of punted with the suggestion I made earlier.

    If I have a clear target, I'm usually pretty good at hitting it.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • Thanks a ton, Dwain.

    Just for background (may not be needed) here are a couple of links regarding the math for box-and-whisker charts (or box plots)

    http://www.purplemath.com/modules/boxwhisk.htm

    https://www.khanacademy.org/math/probability/descriptive-statistics/Box-and-whisker%20plots/v/box-and-whisker-plots

    Essentially, I need five things: median, upper quartile, lower quartile, min, and max. Median, min, and max are pretty self explanatory. The lower quartile for a box plot takes the median of all of the numbers below the median for the entire set. The upper quartile takes the median for all the numbers above the median for the entire set.

    With that in view, here should be the result set for my sample DDL:

    Year Median Max Min LowerQ UpperQ

    2010 90 99 81 88 92

    2011 76 88 69 72.5 84.5

    2012 90.5 97 72 86.5 94.5

    Again, thanks alot for taking this on. If I can provide further explanation, please let me know!

  • EDIT: On review, the original dataset above is correct. I really apologize for the confusion. I'll put in back in when I get back to my office.

  • Okay, two posts above does have the correct dataset (which was the original dataset). I was confused for a moment whether the median in a quartile included the two averaged numbers when total row count is even. It does. In the case of the values 1 through 10 the median would be 5.5, and the quartiles would be the median of everything above 5.5 (6, 7, 8, 9, 10) and below 5.5 (5, 4, 3, 2, 1). So the upper quartile would be 8 and the lower quartile would be 3.

    Sorry for the momentary confusion!

  • Hi

    Thought I would have a go at this. I've used a CTE to order and number the results by course year.

    I've also created the statistics in separate queries to try and make it a be easier to manage and read.

    To determine the medians I have used rounding to determine low and high row number for each median. These can be the same number. There is probably a nicer way to do this:-)

    I've left the columns in the queries that I was using to validate my median choices.

    They are all joined together in the final query. I have also added a Geometry to visualize it.

    ;with cte as (

    select ROW_NUMBER() OVER (PARTITION BY year(CourseDate) ORDER BY GRADE) RN,

    COUNT(*) OVER (PARTITION BY year(CourseDate)) C,

    ROUND(((COUNT(*) OVER (PARTITION BY year(CourseDate)) + 1) / 2.0) - .1, 0) HC1,

    ROUND(((COUNT(*) OVER (PARTITION BY year(CourseDate)) + 1) / 2.0) + .1, 0) HC2,

    year(CourseDate) CourseYear,

    CAST(Grade AS NUMERIC(3)) Grade

    from EStats

    )

    ,minmax as (

    select courseYear, min(Grade) minGrade, max(Grade) maxGrade, min(c) c

    from cte

    group by courseYear

    )

    ,median as (

    select courseYear, avg(grade) medianGrade

    , cast(min(hc1) as int) hc1, cast(min(hc2) as int) hc2

    from cte

    where rn in (hc1, hc2)

    group by courseYear

    )

    ,lowQtr as (

    select courseyear, avg(grade) lowQtrGrade

    ,cast(min(round((hc1 - 1 + abs(hc2 - hc1) + 1) / 2.0 - .1,0)) as int) l1,cast(min(round((hc1 - 1 + abs(hc2 - hc1) + 1) / 2.0 + .1,0)) as int) l2

    from cte

    where rn in (round((hc1 - 1 + abs(hc2 - hc1) + 1) / 2.0 - .1,0),round((hc1 - 1 + abs(hc2 - hc1) + 1) / 2.0 + .1,0))

    group by courseyear

    )

    ,highQtr as (

    select courseyear, avg(grade) highQtrGrade

    ,cast(min(round((hc1 - 1 + abs(hc2 - hc1) + 1) / 2.0 - .1,0) + hc1) as int) h1,cast(min(round((hc1 - 1 + abs(hc2 - hc1) + 1) / 2.0 + .1,0) + hc1) as int) h2

    from cte

    where rn in (round((hc1 - 1 + abs(hc2 - hc1) + 1) / 2.0 - .1,0) + hc1,round((hc1 - 1 + abs(hc2 - hc1) + 1) / 2.0 + .1,0) + hc1)

    group by courseyear

    )

    select mm.courseYear,

    cast(md.medianGrade as numeric(5,2)) Median,

    cast(mm.maxGrade as numeric(5,2)) Max,

    cast(mm.minGrade as numeric(5,2)) Min,

    cast(lq.lowQtrGrade as numeric(5,2)) LowerQ,

    cast(hq.highQtrGrade as numeric(5,2)) UpperQ,

    Geometry::STGeomFromText(

    'MULTILINESTRING((' +

    cast(cast(minGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric(5)) - 2000) * 10 - 2.5 as varchar(5)) + ',' + cast(cast(minGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric) - 2000) * 10 + 2.5 as varchar(5)) + '),(' + -- minTick

    cast(cast(maxGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric(5)) - 2000) * 10 - 2.5 as varchar(5)) + ',' + cast(cast(maxGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric) - 2000) * 10 + 2.5 as varchar(5)) + '),(' + -- maxTick

    cast(cast(medianGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric(5)) - 2000) * 10 - 2.5 as varchar(5)) + ',' + cast(cast(medianGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric) - 2000) * 10 + 2.5 as varchar(5)) + '),(' + -- medianTick

    cast(cast(lowQtrGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric(5)) - 2000) * 10 - 2.5 as varchar(5)) + ',' + cast(cast(lowQtrGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric) - 2000) * 10 + 2.5 as varchar(5)) + '),(' + -- lowerTick

    cast(cast(highQtrGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric(5)) - 2000) * 10 - 2.5 as varchar(5)) + ',' + cast(cast(highQtrGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric) - 2000) * 10 + 2.5 as varchar(5)) + '),(' + -- upperTick

    cast(cast(minGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric(5)) - 2000) * 10 as varchar(5)) + ',' + cast(cast(lowQtrGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric) - 2000) * 10 as varchar(5)) + '),(' + -- lowWhisker

    cast(cast(highQtrGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric(5)) - 2000) * 10 as varchar(5)) + ',' + cast(cast(maxGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric) - 2000) * 10 as varchar(5)) + '),(' + -- highWhisker

    cast(cast(lowQtrGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric(5)) - 2000) * 10 - 2.5 as varchar(5)) + ',' + cast(cast(highQtrGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric) - 2000) * 10 - 2.5 as varchar(5)) + '),(' + -- box1

    cast(cast(lowQtrGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric(5)) - 2000) * 10 + 2.5 as varchar(5)) + ',' + cast(cast(highQtrGrade as numeric(5,2)) as varchar(5)) + ' ' + cast((cast(mm.courseyear as numeric) - 2000) * 10 + 2.5 as varchar(5)) + -- box2

    + '))'

    ,0) graph

    from minmax mm

    inner join median md on md.courseyear = mm.courseyear

    inner join lowQtr lq on lq.courseyear = mm.courseyear

    inner join highQtr hq on hq.courseyear = mm.courseyear

Viewing 15 posts - 1 through 15 (of 20 total)

You must be logged in to reply to this topic. Login to reply