Calculate Weighted Median

  • Hi,

    Could anyone please help me in writing a query to find the weighted Median for the following data

    CREATE TABLE [dbo].[dr_temp](

    [PK] [int] IDENTITY(1,1) NOT NULL,

    [ID] [int] NOT NULL,

    [Code] [nvarchar](10) NOT NULL,

    [Value] [int] NOT NULL,

    [Volume] [int] NOT NULL

    )

    insert into dr_temp(ID,Code,Volume,Value) values(1,'120',40,30778)

    insert into dr_temp(ID,Code,Volume,Value) values(1,'120',40,24860)

    insert into dr_temp(ID,Code,Volume,Value) values(1,'120',40,82043)

    insert into dr_temp(ID,Code,Volume,Value) values(1,'120',41,136116)

    insert into dr_temp(ID,Code,Volume,Value) values(1,'120',42,106886)

    insert into dr_temp(ID,Code,Volume,Value) values(1,'120',45,179646)

    IDCODEValueVolumeRunningTotalAverage

    1120403077830778

    1120402486055638

    11204082043137681

    112041136116273797

    112042106886380683

    112045179646560329280164.5

    I'm expectiing the Value to be 45... How

    Cumulative/2 value is 280164.5 which actually falls b/w 106886 and 380683 but 380683 is greater than 280164.5 so the value is 45.

    Hope this helps.

    Kindly help me out .

    If this is possible in SSIS also please let me know.

    For any questions do reply me.

    Thanks

  • Would you mind elaborate the business case?

    Especially the part

    falls b/w 106886 and 380683 but 380683 is greater than 280164.5 so the value is 45.

    Why do you compare the median value to a single Value (106886) instead of the previous running toal (273797)?

    Also, whydo you use the value of the next row that is larger than the row that already exceeded your median value? (I would expect the target value to be 42, not 45).

    Finally, your sample data don't match your result set (value and volume are mixed).

    Please clarify.



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • Hi,

    Thanks for the reply

    Here is the business rule

    280164.5 falls in the range b/w 106886 and 380683

    However the ending range i.e 380683 > than 280164.5, due to which we need to select the next column which matched the criteria.

    However this might not be happening to othe Code's

    Hope this helps.

    And oops! :ermm: Column Mismatched.

    Thanks

  • It still is not really clear:

    280164.5 falls in the range b/w 106886 and 380683.

    But it also falls in the range b/w 179646 and 560329. Why not use that one?



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • Hi,

    Yep we can use the last one.

    Can this done in a SQL Query?

    Thanks

  • ashok_raja (3/2/2010)


    Hi,

    Yep we can use the last one.

    Can this done in a SQL Query?

    Thanks

    It depends.

    Usually, a SQL query will rely on some logic requirement(s). If both results are possible, which one has to be used as a base to determin "the next row"?

    Let's assume your next row would look like the following (ignore the wrong math. I didn't bother to calculate the new RunningTotal value).

    ID CODE Value Volume RunningTotal

    1 120 46 379646 660329

    If we would use the logic from your previous statements, the the result would be 45.

    But if we'd also consider your last statement, the result would be either 45 or 46.

    That's possible to do with SQL as well. But the question is: What is the correct answer? And why?



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • Hi,

    Actually 46 will not be there as the next row will be for different ID

    i.e

    Insert into dr_temp(ID,Code,Value,Volume) Values(

    2,'120',31,2311)

    if present then Yes! it matches, I think Let's get the 1st match

    B/w 106886 and 380683. Is this Possible.

    Thanks

  • ashok_raja (3/2/2010)


    Hi,

    Actually 46 will not be there as the next row will be for different ID

    i.e

    Insert into dr_temp(ID,Code,Value,Volume) Values(

    2,'120',31,2311)

    if present then Yes! it matches, I think Let's get the 1st match

    B/w 106886 and 380683. Is this Possible.

    Thanks

    Yes.

    Do you have a clustered index on your table and if yes, what columns are included?

    Is there any correlation between the clustered index and the volume column? (eg. if I sort the table based on the clustered index column, will the values for the volume column per ID be in order as well?)



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • Hi

    Apoligize for the confusion..

    Here is what I want

    ID CODE Value Volume RunningTotal Average

    1 120 40 30778 30778

    1 120 40 24860 55638

    1 120 40 82043 137681

    1 120 41 136116 273797

    1 120 42 106886 380683

    1 120 45 179646 560329 280164.5

    Since 280164.5 falls b/w 273797 and 380683, the result set should be

    1 120 42 106886 380683

    Should be able to find median b/w the running total...

    Hope this clears...

    Thanks

  • ashok_raja (3/3/2010)


    Hi

    Apoligize for the confusion..

    Here is what I want

    ID CODE Value Volume RunningTotal Average

    1 120 40 30778 30778

    1 120 40 24860 55638

    1 120 40 82043 137681

    1 120 41 136116 273797

    1 120 42 106886 380683

    1 120 45 179646 560329 280164.5

    Since 280164.5 falls b/w 273797 and 380683, the result set should be

    1 120 42 106886 380683

    Should be able to find median b/w the running total...

    Hope this clears...

    Thanks

    With your reply, none of my questions in my previous post has been answered. :crying:

    Hard to help...



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

  • Hi,

    There is no index because this is a temp table to find the Median value per Code.

    Hope this clears.

    Thanks

  • Since you're using a temp table - you will want to add a clustered index in order to make the running totals work.

    You should start by reading Jeff's article on the matter (so you can understand why the clustered index is such a hot topic all of a sudden):

    http://www.sqlservercentral.com/articles/T-SQL/68467/[/url]

    Also since you're using temp tables, you should pre-aggregate the data so that there is ONE record per code+value combination. Referring back to your example records, having multiple likes of code=160, value= 40 would make like substantially more complicated than necessary. Praggregating would also allow you to know what your target number is on the running total.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Ok, here's what I came up with.

    It is imporant to understand the quirky update concept as well as the requirements that need to be fulfilled. So, I urge you to carefully read the article Chris mentioned (the same article is referenced in my sample code, too...)

    SET NOCOUNT ON

    --- original data

    CREATE TABLE #dr_temp(

    [PK] [INT] IDENTITY(1,1) NOT NULL,

    [ID] [INT] NOT NULL,

    Code [NVARCHAR](10) NOT NULL,

    [VALUE] [INT] NOT NULL,

    [Volume] [INT] NOT NULL

    )

    INSERT INTO #dr_temp(ID,Code,VALUE,Volume) VALUES(1,'120',40,30778)

    INSERT INTO #dr_temp(ID,Code,VALUE,Volume) VALUES(1,'120',40,24860)

    INSERT INTO #dr_temp(ID,Code,VALUE,Volume) VALUES(1,'120',40,82043)

    INSERT INTO #dr_temp(ID,Code,VALUE,Volume) VALUES(1,'120',41,136116)

    INSERT INTO #dr_temp(ID,Code,VALUE,Volume) VALUES(1,'120',42,106886)

    INSERT INTO #dr_temp(ID,Code,VALUE,Volume) VALUES(1,'120',45,179646)

    -- column added to hold RunningTotal values

    ALTER TABLE #dr_temp ADD RunningTotal INT

    -- add clustered index to guarantee the specific order for the running total

    CREATE CLUSTERED INDEX IX_#dr_temp_ID_Volume ON #dr_temp (ID,PK);

    -- perform the "quirky update"

    -- for details, please see Jeff Modens related article:

    -- Link: http://www.sqlservercentral.com/articles/T-SQL/68467/

    DECLARE @PrevID INT

    DECLARE @RunningTotal INT

    UPDATE #dr_temp

    SET

    @RunningTotal = RunningTotal =

    CASE

    WHEN ID = @PrevID THEN @RunningTotal + VolumeELSE Volume

    END,

    @PrevID = ID

    FROM #dr_temp

    WITH (TABLOCKX) OPTION (MAXDOP 1)

    ;WITH cte AS -- calculate the median per ID

    (

    SELECT

    pk,

    ID,

    MedianDiff = (RunningTotal - (MAX(RunningTotal) OVER (PARTITION BY id))/2.0 )

    FROM #dr_temp

    ), cte2 AS -- find the PK values holding values larger than the median and number thos PK's

    (

    SELECT

    pk AS pkSub,

    ID,

    ROW_NUMBER() OVER(PARTITION BY id ORDER BY pk ) AS ROW

    FROM cte

    WHERE MedianDiff > 0

    )

    -- final output: values of the row that holds the 2nd value larger than the median

    SELECT #dr_temp.*

    FROM cte2

    INNER JOIN #dr_temp

    ON cte2.id = #dr_temp.id

    AND pksub=pk

    WHERE ROW=2

    DROP TABLE #dr_temp



    Lutz
    A pessimist is an optimist with experience.

    How to get fast answers to your question[/url]
    How to post performance related questions[/url]
    Links for Tally Table [/url] , Cross Tabs [/url] and Dynamic Cross Tabs [/url], Delimited Split Function[/url]

Viewing 13 posts - 1 through 12 (of 12 total)

You must be logged in to reply to this topic. Login to reply