calculating a median for groups

  • I have the following table that I need to get a median for one of the columns

    mytable

    State City Number

    CA LA 20

    CA LA 25

    CA LA 15

    CA SD 35

    CA SD 12

    CA SD 17

    NY NYC 20

    NY NYC 22

    NY NYC 8

    NY NYC 36

    NY NYC 15

    What I want is to create a new table that has the median for the

    column Number like this:

    State City Median

    CA LA 20

    CA SD 17

    NY NYC 20

    Any help would greatly be appreciated.

  • This will get you there:

    SELECT [State],

    City,

    AVG(NUMBER) AS Median

    FROM (SELECT [State],

    City,

    NUMBER,

    ROW_NUMBER() OVER (PARTITION BY [State], City ORDER BY NUMBER)

    AS

    RowNum,

    COUNT(*) OVER (PARTITION BY [State], City)

    AS

    RowCnt

    FROM mytable) x

    WHERE RowNum IN ( ( RowCnt + 1 ) / 2, ( RowCnt + 2 ) / 2 )

    GROUP BY [State],

    City

    ORDER BY [State],

    City

    Mike Scalise, PMP
    https://www.michaelscalise.com

  • That worked great! Thank you.

    Maybe you could help me with this a bit more. What if I had two

    columns that I needed to get the median in the same table

    instead of column Number, what if I had Number1 and Number2.

    How would I group by State and City and also get the median for those two columns?

  • Can you provide an example?

    Mike Scalise, PMP
    https://www.michaelscalise.com

  • An example would be:

    mytable

    State City Sales Qty

    CA LA 1000 5

    CA LA 2000 10

    CA LA 2200 12

    CA SD 3500 23

    CA SD 2300 18

    CA SD 4000 38

    NY NYC 10000 65

    NY NYC 12000 74

    NY NYC 15000 88

    NY NYC 16000 90

    NY NYC 14500 80

    newtable

    state city median sales median qty

    CA LA 2000 10

    CA SD 3500 23

    NY NYC 14500 80

    Thanks for the help!

  • FYI: Mean, median, and mode are three kinds of "averages".

    Find the mean, median, mode, and range for the following list of values:

    13, 18, 13, 14, 13, 16, 14, 21, 13

    Mean = 15

    Median = 14

    Mode = 13

    Mean, Median, Mode, and Range

    http://www.purplemath.com/modules/meanmode.htm

  • wizard_froto (11/14/2011)


    An example would be:

    mytable

    State City Sales Qty

    CA LA 1000 5

    CA LA 2000 10

    CA LA 2200 12

    CA SD 3500 23

    CA SD 2300 18

    CA SD 4000 38

    NY NYC 10000 65

    NY NYC 12000 74

    NY NYC 15000 88

    NY NYC 16000 90

    NY NYC 14500 80

    newtable

    state city median sales median qty

    CA LA 2000 10

    CA SD 3500 23

    NY NYC 14500 80

    Thanks for the help!

    This isn't as efficient as the first one, but it should get you what you want.

    SELECT [State],

    City,

    MAX(CASE WHEN RowNumSales IN ( ( RowCnt + 1 ) / 2, ( RowCnt + 2 ) / 2 ) THEN Sales ELSE NULL END) AS [Median Sales],

    MAX(CASE WHEN RowNumQty IN ( ( RowCnt + 1 ) / 2, ( RowCnt + 2 ) / 2 ) THEN Qty ELSE NULL END) AS [Median Qty]

    FROM (SELECT [State],

    City,

    Sales,

    Qty,

    ROW_NUMBER() OVER (PARTITION BY [State], City ORDER BY Sales) AS RowNumSales,

    ROW_NUMBER() OVER (PARTITION BY [State], City ORDER BY Qty) AS RowNumQty,

    COUNT(*) OVER (PARTITION BY [State], City) AS RowCnt

    FROM mytable) x

    GROUP BY [State],

    City

    ORDER BY [State],

    City

    Mike Scalise, PMP
    https://www.michaelscalise.com

  • Works perfectly. Thanks Mikes84.

  • A good rule of thumb to use is that you should not use code that you find on the Internet unless you understand what it is doing. If a problem arises, you'll need to be able to troubleshoot it yourself, which you won't be able to do if you don't understand the code. It's obvious that the "author" of this code didn't follow this rule.

    "Author" removed


    SELECT [State],

    City,

    MAX(CASE WHEN RowNumSales IN ( ( RowCnt + 1 ) / 2, ( RowCnt + 2 ) / 2 ) THEN Sales ELSE NULL END) AS [Median Sales],

    MAX(CASE WHEN RowNumQty IN ( ( RowCnt + 1 ) / 2, ( RowCnt + 2 ) / 2 ) THEN Qty ELSE NULL END) AS [Median Qty]

    FROM (SELECT [State],

    City,

    Sales,

    Qty,

    ROW_NUMBER() OVER (PARTITION BY [State], City ORDER BY Sales) AS RowNumSales,

    ROW_NUMBER() OVER (PARTITION BY [State], City ORDER BY Qty) AS RowNumQty,

    COUNT(*) OVER (PARTITION BY [State], City) AS RowCnt

    FROM mytable) x

    GROUP BY [State],

    City

    ORDER BY [State],

    City

    In order to calculate the fiscal median, you need to retrieve both the left statistical median and the right statistical median. By changing the aggregate from AVG() to MAX() this code changed from calculating the fiscal median to calculating the right statistical median, so there is no longer a reason to retrieve the left statistical median, but this code is doing so anyhow.

    Due to the way that RowNumSales and RowNumQty are defined, we know the following:

    * The value of the Sales(or Qty) field for RowNumSales(or RowNumQty) for the value (RowCnt + 1)/2 is always going to be the left statistical median.

    * The value of the Sales(or Qty) field for RowNumSales(or RowNumQty) for the value (RowCnt + 2)/2 is always going to be the right statistical median.

    * The Max() of the left and right statistical medians is always going to be equal to the right statistical median.

    Furthermore, there is no reason to retrieve rows that are not the right statistical median for one of the two values, so we can filter out rows that don't meet one of those criteria to produce the following more efficient code.

    SELECT [State],

    City,

    MAX(CASE WHEN RowNumSales = ( RowCnt + 2 ) / 2 THEN Sales ELSE NULL END) AS [Median Sales],

    MAX(CASE WHEN RowNumQty = ( RowCnt + 2 ) / 2 THEN Qty ELSE NULL END) AS [Median Qty]

    FROM (SELECT [State],

    City,

    Sales,

    Qty,

    ROW_NUMBER() OVER (PARTITION BY [State], City ORDER BY Sales) AS RowNumSales,

    ROW_NUMBER() OVER (PARTITION BY [State], City ORDER BY Qty) AS RowNumQty,

    COUNT(*) OVER (PARTITION BY [State], City) AS RowCnt

    FROM mytable) x

    WHERE RowNumSales = ( RowCnt + 2 ) / 2

    OR RowNumQty = ( RowCnt + 2 ) / 2

    GROUP BY [State],

    City

    ORDER BY [State],

    City

    Drew

    J. Drew Allen
    Business Intelligence Analyst
    Philadelphia, PA

  • Hi Mike,

    I wish to thank you for your solution for calculating the median for multiple groupings. I have been trying to do this myself, spending a few hours. Your solution is elegantly succinct, accurate and a tool I can use repeatedly.

    Thanks again.

    Sincerely,

    Lester Chin

  • We began with a succession of efforts to apply Excel macro code entered as a function to obtain medians in a filtered worksheet list of scores of thousands of records. The code is widely-cited on the Internet but did not scale well when the worksheet contained more than a few hundred records. In the end it seemed to make more sense to preprocess in SQL before delivering the data to the end-users. The medians are an essential part of medical screening algorithms in use at our company. Thanks to all for this elegant code.

  • Roy.G.Biv (2/27/2014)


    We began with a succession of efforts to apply Excel macro code entered as a function to obtain medians in a filtered worksheet list of scores of thousands of records. The code is widely-cited on the Internet but did not scale well when the worksheet contained more than a few hundred records. In the end it seemed to make more sense to preprocess in SQL before delivering the data to the end-users. The medians are an essential part of medical screening algorithms in use at our company. Thanks to all for this elegant code.

    You might be interested to know that there have been some advances in calculating median with higher performance:

    Calculating the Median Value within a Partitioned Set Using T-SQL [/url]

    Best approaches for grouped median

    Note that the solution listed as 2005+2 in the second article is actually a solution that only works in SQL 2008. Although the author may at some point get this corrected.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

Viewing 12 posts - 1 through 11 (of 11 total)

You must be logged in to reply to this topic. Login to reply