Calculating median over 2 columns

  • Hello,

    /First post

    I'm new to SQL and i'm in a spot of bother trying to work out how to do something!

    I have a database with 1million+ records in and i'm trying to collect the median values of column(2) for all distinct values in column (1)

    Example DB:

    Column 1 Column 2

    978555 500

    978555 502

    978555 480

    978555 490

    978324 1111

    978324 1102

    978311 122

    978311 120

    978994 804

    978320 359

    and I need it to display on SELECT as

    column 1 column 2

    978555 495

    978324 1106

    978311 121

    978994 804

    978320 359

    Is this possible on 2008 R2 and could anyone help me with this please? thank you in advance!

  • djones 23745 (8/13/2015)


    Hello,

    /First post

    I'm new to SQL and i'm in a spot of bother trying to work out how to do something!

    I have a database with 1million+ records in and i'm trying to collect the median values of column(2) for all distinct values in column (1)

    Example DB:

    Column 1 Column 2

    978555 500

    978555 502

    978555 480

    978555 490

    978324 1111

    978324 1102

    978311 122

    978311 120

    978994 804

    978320 359

    and I need it to display on SELECT as

    column 1 column 2

    978555 495

    978324 1106

    978311 121

    978994 804

    978320 359

    Is this possible on 2008 R2 and could anyone help me with this please? thank you in advance!

    this may help

    https://www.simple-talk.com/sql/t-sql-programming/calculating-the-median-value-within-a-partitioned-set-using-t-sql/

    how did you calculate the following?

    column 1 column 2

    978555 495

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • This article is worth a read too.

    That's a strange median, as JLS points out.

    “Write the query the simplest way. If through testing it becomes clear that the performance is inadequate, consider alternative query forms.” - Gail Shaw

    For fast, accurate and documented assistance in answering your questions, please read this article.
    Understanding and using APPLY, (I) and (II) Paul White
    Hidden RBAR: Triangular Joins / The "Numbers" or "Tally" Table: What it is and how it replaces a loop Jeff Moden

  • Dwain's article is a great resource. Hopefully, it will help the OP.

    Could you explain why do you say that the median is wrong? It's the average of the 2 middle rows. (490+500)/2=495

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Thank you for the reply! I shall have a look now, looks good!

    My example DB was just a quick, off the top of my head example with ballpark medians which is why it is wrong.

    Thanks!

  • hi

    the below code may helpful

    select distinct column1,AVG(column2) maridain

    from table

    group by column1

  • The DISTINCT keyword is redundant as you're using GROUP BY. You're also giving average which is the mean and not the median.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Has anyone considered something like this?

    DECLARE @TEST AS TABLE (

    Column1 int,

    Column2 int

    );

    INSERT INTO @TEST (Column1, Column2)

    VALUES (978555, 500),

    (978555, 502),

    (978555, 480),

    (978555, 490),

    (978324, 1111),

    (978324, 1102),

    (978311, 122),

    (978311, 120),

    (978994, 804),

    (978320, 359);

    WITH RAW_DATA AS (

    SELECT T.Column1, T.Column2,

    COUNT(T.Column2) OVER(PARTITION BY T.Column1) AS THE_COUNT,

    ROW_NUMBER() OVER(PARTITION BY T.Column1 ORDER BY T.Column2) AS RN,

    CASE

    WHEN COUNT(T.Column2) OVER(PARTITION BY T.Column1) % 2 = 0

    AND ROW_NUMBER() OVER(PARTITION BY T.Column1 ORDER BY T.Column2) IN

    (COUNT(T.Column2) OVER(PARTITION BY T.Column1) / 2,

    (COUNT(T.Column2) OVER(PARTITION BY T.Column1) / 2) + 1)

    THEN 1

    WHEN COUNT(T.Column2) OVER(PARTITION BY T.Column1) % 2 = 1

    AND ROW_NUMBER() OVER(PARTITION BY T.Column1 ORDER BY T.Column2) =

    COUNT(T.Column2) OVER(PARTITION BY T.Column1) / 2 + 1

    THEN 1

    ELSE 0

    END AS IS_MIDDLE

    FROM @TEST AS T

    )

    SELECT R.Column1, AVG(R.Column2) AS MEDIAN

    FROM RAW_DATA AS R

    WHERE R.IS_MIDDLE = 1

    GROUP BY R.Column1

    ORDER BY R.Column1

    Results:

    Column1MEDIAN

    978311121

    978320359

    9783241106

    978555495

    978994804

Viewing 8 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply