Select nn% from top & bottom over grouped data with sum

  • I have read through many of the posts on the grouping, but cannot find anything that fits what I need. I have a solution that involves staging tables, % calculations and an aweful lot of work. I wondered if any of you SQL wizards can help me reduce the amount of work required as I need to re-do some reports with a slight change. Here goes.

    DECLARE @Temp table (price real, vol int, country varchar(3))

    insert into @Temp select 2.00,1,'GB'

    insert into @Temp select 2.00,200,'GB'

    insert into @Temp select 2.00,270,'GB'

    insert into @Temp select 2.00,280,'GB'

    insert into @Temp select 2.00,300,'GB'

    insert into @Temp select 2.00,800,'GB'

    insert into @Temp select 2.00,900,'GB'

    insert into @Temp select 2.00,3000,'GB'

    insert into @Temp select 4.00,1,'GB'

    insert into @Temp select 4.00,1,'GB'

    insert into @Temp select 4.00,270,'GB'

    insert into @Temp select 4.00,280,'GB'

    insert into @Temp select 4.00,300,'GB'

    insert into @Temp select 4.00,800,'GB'

    insert into @Temp select 4.00,8800,'GB'

    insert into @Temp select 4.00,6000,'GB'

    insert into @Temp select 9.00,1,'GB'

    insert into @Temp select 9.00,1,'GB'

    insert into @Temp select 9.00,290,'GB'

    insert into @Temp select 9.00,310,'GB'

    insert into @Temp select 9.00,350,'GB'

    insert into @Temp select 9.00,800,'GB'

    insert into @Temp select 9.00,8800,'GB'

    insert into @Temp select 9.00,6000,'GB'

    This doesn't represent the real data which is approx 1,000,000 rows.

    I want to sum the vol (volume) grouped by price and country. BUT I want to exclude the top nn% and bottom nn% from the sum (this is to remove outliers) of each group.

    Is this possible without several staging tables? All help greatly appreciated.

    TIA

  • dji (8/3/2012)


    I want to sum the vol (volume) grouped by price and country. BUT I want to exclude the top nn% and bottom nn% from the sum (this is to remove outliers) of each group.

    It could be, if you could elaborate more on the above statment!

    1. Can you take one specific example form the sample data and show us your expected output?

    2. Where does nn% come from ? who provides it?

    3. It would be great if you could give the expected results based off of the sample data you provided.

  • ColdCoffee (8/3/2012)


    dji (8/3/2012)


    I want to sum the vol (volume) grouped by price and country. BUT I want to exclude the top nn% and bottom nn% from the sum (this is to remove outliers) of each group.

    It could be, if you could elaborate more on the above statment!

    1. Can you take one specific example form the sample data and show us your expected output?

    2. Where does nn% come from ? who provides it?

    3. It would be great if you could give the expected results based off of the sample data you provided.

    I didn't specify originally that the top and bottom rows I am trying to exclude are the highest xx and lowest xx vol column.

    Using the data above, assuming the top 1 and bottom 1 row represent the nn% of rows (this is 5% in my true life example). I would expect;

    2.00 GB 2750

    4.00 GB 7651

    9.00 GB 7751

    So basically I have the sum(vol) per group by price excluding the top 5% and bottom 5% of rows ordered by vol.

  • Here is a general approach, which I'm sure you can improve.

    -- Declare variables to set exclusion ranges

    Declare @LowerOutlier decimal(4,2)

    Declare @UpperOutlier decimal(4,2)

    Set @LowerOutlier=0.05 -- adjust as needed

    Set @UpperOutlier=0.95 -- adjust as needed

    -- use common table expressions to pre-sort the data

    -- "GroupedData" assigns row numbers to each group

    -- "RowCounts" finds the numbers of rows for each group

    ;WITH GroupedData as (

    Select Country, Price, Vol, Row_Number() Over(Partition by Country,Price order by Vol) as rn From @Temp

    ),

    RowCounts as (

    Select Country, Price, Max(rn) as 'MaxRowCount' From GroupedData

    Group by Country, Price)

    -- Now perform selections and calculations

    -- Adjust data-types as needed to improve accuracy

    Select t.Country, t.Price, SUM(t.Vol)

    From @Temp t JOIN GroupedData g ON t.Country=g.Country and t.Price=g.Price and t.Vol=g.Vol

    JOIN RowCounts rc ON g.Country=rc.Country and g.Price=rc.Price

    WHERE

    CAST(g.rn as decimal(4,2))/CAST(rc.MaxRowCount as decimal(4,2))< @UpperOutlier AND

    CAST(g.rn as decimal(4,2))/CAST(rc.MaxRowCount as decimal(4,2))>@LowerOutlier

    Group by t.Country, t.Price

    HTH

    Elliott

  • dji (8/3/2012)


    ColdCoffee (8/3/2012)


    dji (8/3/2012)


    I want to sum the vol (volume) grouped by price and country. BUT I want to exclude the top nn% and bottom nn% from the sum (this is to remove outliers) of each group.

    It could be, if you could elaborate more on the above statment!

    1. Can you take one specific example form the sample data and show us your expected output?

    2. Where does nn% come from ? who provides it?

    3. It would be great if you could give the expected results based off of the sample data you provided.

    I didn't specify originally that the top and bottom rows I am trying to exclude are the highest xx and lowest xx vol column.

    Using the data above, assuming the top 1 and bottom 1 row represent the nn% of rows (this is 5% in my true life example). I would expect;

    2.00 GB 2750

    4.00 GB 7651

    9.00 GB 7751

    So basically I have the sum(vol) per group by price excluding the top 5% and bottom 5% of rows ordered by vol.

    Okay, I must be missing something very basic, but I'm not sure what you mean by excluding the 5% and bottom 5% or rows ordered by volume. Maybe its because it is Friday.

  • Thanks Elliot, trying this out now and will report back 🙂

    Lynn, consider the following list of rows (obviously this is far removed from the real data which is 1,000,000+ rows;

    price, vol, country

    2.0 1 GB

    2.0 200 GB

    2.0 300 GB

    2.0 400 GB

    2.0 50000 GB

    6.0 2 GB

    6.0 200 GB

    6.0 300 GB

    6.0 400 GB

    6.0 69000 GB

    I am wanting to sum the vol amount grouped by country and price. BUT I want to do the sum without the top 5% and bottom 5% of rows in each group of price, sorted in vol order. What I am effectively trying to do is use the inner 90% of data excluding the lowest and highest vol rows as they contain freak data that affects further calculations I want to do on the resultant data set.

  • try this....it should work where you have a variable number of rows for each group

    IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[temp]') AND type in (N'U'))

    DROP TABLE [dbo].[temp]

    GO

    CREATE TABLE [dbo].[temp](

    [price] [int] NULL,

    [vol] [int] NULL,

    [country] [varchar](3) NULL

    ) ON [PRIMARY]

    GO

    INSERT INTO [dbo].[temp]([price], [vol], [country])

    SELECT 1, 1000, N'GB' UNION ALL

    SELECT 1, 25, N'GB' UNION ALL

    SELECT 1, 26, N'GB' UNION ALL

    SELECT 1, 24, N'GB' UNION ALL

    SELECT 1, 22, N'GB' UNION ALL

    SELECT 1, 15, N'GB' UNION ALL

    SELECT 1, 16, N'GB' UNION ALL

    SELECT 1, 22, N'GB' UNION ALL

    SELECT 1, 24, N'GB' UNION ALL

    SELECT 1, 22, N'GB' UNION ALL

    SELECT 1, 15, N'GB' UNION ALL

    SELECT 1, 16, N'GB' UNION ALL

    SELECT 1, 24, N'GB' UNION ALL

    SELECT 1, 22, N'GB' UNION ALL

    SELECT 1, 15, N'GB' UNION ALL

    SELECT 1, 16, N'GB' UNION ALL

    SELECT 1, 22, N'GB' UNION ALL

    SELECT 1, 16, N'GB' UNION ALL

    SELECT 1, 19, N'GB' UNION ALL

    SELECT 1, 2, N'GB' UNION ALL

    SELECT 2, 144, N'GB' UNION ALL

    SELECT 2, 58, N'GB' UNION ALL

    SELECT 2, 56, N'GB' UNION ALL

    SELECT 2, 55, N'GB' UNION ALL

    SELECT 2, 55, N'GB' UNION ALL

    SELECT 2, 68, N'GB' UNION ALL

    SELECT 2, 91, N'GB' UNION ALL

    SELECT 2, 47, N'GB' UNION ALL

    SELECT 2, 66, N'GB' UNION ALL

    SELECT 2, 66, N'GB' UNION ALL

    SELECT 2, 47, N'GB' UNION ALL

    SELECT 2, 62, N'GB' UNION ALL

    SELECT 2, 55, N'GB' UNION ALL

    SELECT 2, 68, N'GB' UNION ALL

    SELECT 2, 77, N'GB' UNION ALL

    SELECT 2, 66, N'GB' UNION ALL

    SELECT 2, 55, N'GB' UNION ALL

    SELECT 2, 58, N'GB' UNION ALL

    SELECT 2, 59, N'GB' UNION ALL

    SELECT 2, 33, N'GB' UNION ALL

    SELECT 3, 1000, N'GB' UNION ALL

    SELECT 3, 144, N'GB' UNION ALL

    SELECT 3, 91, N'GB' UNION ALL

    SELECT 3, 77, N'GB' UNION ALL

    SELECT 3, 68, N'GB' UNION ALL

    SELECT 3, 68, N'GB' UNION ALL

    SELECT 3, 66, N'GB' UNION ALL

    SELECT 3, 66, N'GB' UNION ALL

    SELECT 3, 66, N'GB' UNION ALL

    SELECT 3, 62, N'GB' UNION ALL

    SELECT 3, 59, N'GB' UNION ALL

    SELECT 3, 58, N'GB' UNION ALL

    SELECT 3, 58, N'GB' UNION ALL

    SELECT 3, 56, N'GB' UNION ALL

    SELECT 3, 55, N'GB' UNION ALL

    SELECT 3, 55, N'GB' UNION ALL

    SELECT 3, 55, N'GB' UNION ALL

    SELECT 3, 55, N'GB' UNION ALL

    SELECT 3, 47, N'GB' UNION ALL

    SELECT 3, 47, N'GB' UNION ALL

    SELECT 3, 33, N'GB' UNION ALL

    SELECT 3, 26, N'GB' UNION ALL

    SELECT 3, 25, N'GB' UNION ALL

    SELECT 3, 24, N'GB' UNION ALL

    SELECT 3, 24, N'GB' UNION ALL

    SELECT 3, 24, N'GB' UNION ALL

    SELECT 3, 22, N'GB' UNION ALL

    SELECT 3, 22, N'GB' UNION ALL

    SELECT 3, 22, N'GB' UNION ALL

    SELECT 3, 22, N'GB' UNION ALL

    SELECT 3, 22, N'GB' UNION ALL

    SELECT 3, 19, N'GB' UNION ALL

    SELECT 3, 16, N'GB' UNION ALL

    SELECT 3, 16, N'GB' UNION ALL

    SELECT 3, 16, N'GB' UNION ALL

    SELECT 3, 16, N'GB' UNION ALL

    SELECT 3, 15, N'GB' UNION ALL

    SELECT 3, 15, N'GB' UNION ALL

    SELECT 3, 5, N'GB' UNION ALL

    SELECT 3, 2, N'GB'

    --=== declare percentage of rows to exclude

    declare @rows decimal(5,2)

    set @rows = 0.05

    ;with cte_Rows as

    (

    SELECT @rows * COUNT(*) AS bottomrows, COUNT(*) - (@rows * COUNT(*) ) AS toprows, price, country

    FROM dbo.temp

    GROUP BY price, country

    )

    ,

    cte_RN as

    (

    SELECT price, country, vol, ROW_NUMBER() OVER (partition BY price, country

    ORDER BY price, country, vol DESC) AS nt

    FROM temp

    )

    SELECT cte_RN.price ,

    cte_RN.country ,

    SUM( cte_RN.vol )AS VolSum

    FROM

    cte_RN INNER JOIN cte_rows ON cte_RN.price = cte_rows.price

    AND cte_RN.country = cte_rows.country

    AND cte_RN.nt > cte_rows.bottomrows

    AND cte_RN.nt <= cte_rows.toprows

    GROUP BY cte_RN.price ,

    cte_RN.country;

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

  • have been seeing how the code would run an a bigger data set.

    pls feedback on whether this gives what I think you are looking for

    regards

    USE [tempdb]

    GO

    --== CREATE SOME DATA TO PLAY WITH

    IF EXISTS (SELECT * FROM sys.objects WHERE object_id = OBJECT_ID(N'[dbo].[temp]') AND type in (N'U'))

    DROP TABLE [dbo].[temp]

    GO

    SELECT TOP 1000000 --=== 1M rows

    Price = ABS(CHECKSUM(NEWID())) % 9 + 1,

    Vol = ABS(CHECKSUM(NEWID())) % 90000 + 1,

    Country = CHAR(Abs(Checksum(Newid())) % 10 + 65)

    into temp

    FROM sys.all_columns ac1

    CROSS JOIN sys.all_columns ac2

    CROSS JOIN sys.all_columns ac3

    CREATE CLUSTERED INDEX [IX1] ON [dbo].[temp]

    (

    [Country] ASC,

    [Price] ASC,

    [Vol] DESC

    ) ON [PRIMARY]

    GO

    --=== cludge to skew the data for varying rowcounst

    DELETE TOP (20) PERCENT

    FROM temp

    WHERE (Price IN ('2', '7', '8'))

    DELETE TOP (30) PERCENT

    FROM temp

    WHERE (Country IN ('B', 'C', 'F'))

    --=== RUN CODE TO COMPARE RESULTS...NB only using Country A in code below

    --=== ALLROWS

    SET STATISTICS TIME,IO ON

    SELECT Country ,

    Price ,

    MAX( Vol )AS MaxVol ,

    MIN( Vol )AS MinVol ,

    SUM( Vol )AS VolSum ,

    COUNT( * )AS CntRows

    FROM temp

    GROUP BY Country ,

    Price

    HAVING Country = 'A'

    ORDER BY Country , Price;

    SET STATISTICS TIME,IO ON

    --=== SUM IGNORE TOP?BOTTOM PERCENT as set by @rows variable

    --=== declare percentage of rows to exclude

    declare @rows decimal(5,2)

    set @rows = 0.05 --- ALTER THIS AS REQUIRED

    SET STATISTICS TIME,IO ON

    ;with cte_Rows as

    (

    SELECT TOP 100 PERCENT @rows * COUNT( * )AS bottomrows ,

    COUNT( * ) - @rows * COUNT( * )AS toprows ,

    price ,

    country

    FROM dbo.temp

    GROUP BY country ,

    price

    ORDER BY country , price

    )

    ,

    cte_RN as

    (

    SELECT country ,

    price ,

    vol ,

    ROW_NUMBER( )OVER( PARTITION BY price ,

    country

    ORDER BY country , price , vol DESC )AS nt

    FROM temp

    )

    SELECT cte_RN.Country ,

    cte_RN.Price ,

    MAX(cte_RN.Vol )AS MaxVol ,

    MIN(cte_RN.Vol )AS MinVol ,

    SUM(cte_RN.Vol )AS VolSum ,

    COUNT( * )AS CntRows

    FROM

    cte_RN INNER JOIN cte_rows ON cte_RN.price = cte_rows.price

    AND cte_RN.country = cte_rows.country

    AND cte_RN.nt > cte_rows.bottomrows

    AND cte_RN.nt <= cte_rows.toprows

    WHERE cte_RN.country IN ( 'A' ) ---, 'B', 'C','D','E')

    GROUP BY cte_RN.country, cte_RN.price

    SET STATISTICS TIME,IO OFF

    ________________________________________________________________
    you can lead a user to data....but you cannot make them think
    and remember....every day is a school day

Viewing 8 posts - 1 through 7 (of 7 total)

You must be logged in to reply to this topic. Login to reply