Calculate the median

  • What is the fastest way to calculate the median

    ThanksSaurabh.D

  • Take the nice barometer to the building manager and ask him?
    https://sqlsunday.com/2016/02/11/median-and-percentile-in-tsql/

  • is this in relation to a median over a very large dataset and what is the most performant way to return the aggregate?

  • I'd use SQL Server's built in PERCENTILE_CONT function. Just call it with the numeric_literal = 0.5 and it will return the median.
    If you have a large number values you and you want a very slightly faster time you might want to use PERCENTILE_DISC, this will pick a value from the set but if you have an even number of items it won't take the average of the middle two.
    drop table #TempTest
    go
    create table #TempTest
    (
        id       int identity(1,1) not null primary key clustered,
        myLetter char(1) not null,
        myNumber int     not null
    )
    go
    insert into #TempTest(myLetter,myNumber)
    select top(103) PartitionBy.x,ABS(CHECKSUM(NewId())) % 1000000
      from master.sys.all_columns a,master.sys.all_columns b,master.sys.all_columns c
     cross apply(values('A'),('B'),('C'),('D'),('E'),('F')) PartitionBy(x)

    go
    select a.myLetter,a.myNumber,
           PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY myNumber) OVER (PARTITION BY myLetter) MedianDISC,
           PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY myNumber) OVER (PARTITION BY myLetter) MedianCONT
      from #TempTest a

  • Saurabh.D - Friday, March 8, 2019 6:07 PM

    What is the fastest way to calculate the median

    At one time in the SQL world, the “programming problem du jour†was finding the median. This was because there were two newsstand database magazines, DBMS and Database Programming & Design, in the 1990s with editorial columns by myself and Chris respectively. The magazines had the same publisher, so to boost sales of both, Chris and I would do “dueling SQL†columns. You had to buy both to get the whole story. We each did several versions of the median with the SQL available at the time starting in 1992 and these articles ran for months. Date and I posted several versions and Rory Murchison and Philip Vaughan of San Jose, CA, and The team of Anatoly Abramovich, Yelena Alexandrova, and Eugene Birger presented a series of articles in SQL Forum magazine on computing the median (SQL Forum 1993, 1994). The late Ken Henderson also added a streamlined version of this code to the discussion.

    The Weighted Median

    The weighted median is an even better measure of central tendency than the plain median. It is also more “set-oriented†than the plain median. It factors in the number of times the two values in the middle subset of a table with an even number of rows appear. The table with (1, 2, 2, 3, 3, 3) has a median of 3, the middle value. It returns the subset (2, 2, 3, 3, 3) in the middle, which gives us (13/5) = 2.6 for the weighted median. The weighted median is a more accurate description of the skew of the data.

    23.07.02. Modern Median

    We now have ROW_NUMBER() and can that function to do the sorting without the complicated self-joins, unions and views of the older methods. Here is an elegant solution from Peso:

    SELECT AVG(X.part_wgt) AS part_wgt_median
    FROM (SELECT part_wgt,
       (2 * ROW_NUMBER() OVER (ORDER BY part_wgt)
       - COUNT(*) OVER ()) AS middle_subset
      FROM Parts) AS X
    WHERE X.middle_subset BETWEEN 0 AND 2;

    This can be modified to give us the weighted median.

    SELECT (1.0 * SUM(X.y)/SUM(X.t)) AS part_wgtmedian
    FROM (SELECT SUM(part_wgt) OVER (PARTITION BY part_wgt) AS y,
       (2 * ROW_NUMBER() OVER (ORDER BY part_wgt)
       - COUNT(*) OVER ()) AS middle_subset,
       COUNT(*) OVER (PARTITION BY part_wgt) AS t
      FROM Parts) AS X
    WHERE X.middle_subset BETWEEN 0 AND 2;
    The 1.0 factor is an old trick to switch from integer math to decimal, but you could use a CAST().
    Today, the PERCENTILE functions are probably better but try it.

    Please post DDL and follow ANSI/ISO standards when asking for help. 

  • I've not tested it personally but the OFFSET/FETCH option that came out in 2012 seems to be both the most effecient and fastest method.  Please see the following article written by one of the Masters of SQL.
    https://www.itprotoday.com/sql-server/sql-server-2012-solutions-median-calculation

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Sunday, March 10, 2019 1:39 PM

    I've not tested it personally but the OFFSET/FETCH option that came out in 2012 seems to be both the most effecient and fastest method.  Please see the following article written by one of the Masters of SQL.
    https://www.itprotoday.com/sql-server/sql-server-2012-solutions-median-calculation

    I have tested it and used in production, and (at least for my situation) it was the most effecient and fastest method.

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply