Trying to find a simpler way to execute this complicated aggregation query

  • Greetings all

    So below I have a query that returns the results exactly as I want them. However, I am wondering if there is a way to do this in one SELECT statement, instead of an inner and outer SELECT where the outer statement runs math on the aggregations.

    Essentially, all I am looking for is the ExpansionRatio result column. There is a SalesID, an AccountID, a FiscalMonth, and a FiscalYear. The data spans 3 years, and not every combination of sales, account, and month appears in all 3 years. The expansion ratio is essentially each account's average revenue divided by the sum, by month, of those averages (hopefully that made sense).

    So in the end, I am hoping there is a simpler way to do this, as the real data set is several hundred million records and the table joins are much more significant than in this query, so I need this particular piece to be as simple as possible 🙂

    USE tempdb;

    IF EXISTS (SELECT * FROM INFORMATION_SCHEMA.TABLES WHERE TABLE_NAME = 'AggTest')
        DROP TABLE dbo.AggTest;

    CREATE TABLE dbo.AggTest
    (
         SalesID     smallint
        ,AccountID   smallint
        ,FiscalMonth smallint
        ,FiscalYear  nvarchar(4)
        ,Revenue     int
    );

    INSERT dbo.AggTest
    VALUES  (1, 1, 1, 'FY12',  10)
           ,(1, 3, 1, 'FY12',  20)
           ,(1, 1, 1, 'FY13',  20)
           ,(1, 2, 1, 'FY13',  50)
           ,(1, 3, 2, 'FY13',   1)
           ,(1, 1, 1, 'FY14',  30)
           ,(1, 2, 1, 'FY14', 100)
           ,(1, 3, 1, 'FY14',  20)
           ,(1, 1, 2, 'FY14',   5)
           ,(1, 2, 2, 'FY14',  50)
           ,(1, 3, 2, 'FY14', 499);

    SELECT
         Agg.SalesID
        ,Agg.AccountID
        ,Agg.FiscalMonth
        ,Agg.[#ofYears]
        ,Agg.Revenue
        ,AvgByAccount
        ,SUM(AvgByAccount) OVER (PARTITION BY FiscalMonth) AS totavg
        ,(AvgByAccount * 1.000) / SUM(AvgByAccount) OVER (PARTITION BY FiscalMonth) AS ExpansionRatio
    FROM
    (
        SELECT
             SalesID
            ,AccountID
            ,FiscalMonth
            ,COUNT(DISTINCT FiscalYear)                AS [#ofYears]
            ,SUM(Revenue)                              AS Revenue
            ,SUM(Revenue) / COUNT(DISTINCT FiscalYear) AS AvgByAccount
        FROM dbo.AggTest
        GROUP BY
             SalesID
            ,AccountID
            ,FiscalMonth
    ) AS Agg;

    Link to my blog http://notyelf.com/

    You say "...of average revenue by account...", so why are you averaging it based on FiscalYear?

    Sum(revenue)/count(distinct fiscalyear) AvgByAccount

    You could easily replace it with the average function, AVG(revenue), which will average the revenue grouped on SalesID, AccountID and FiscalMonth.
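    A hedged sketch of the inner query with that substitution. Note that AVG(revenue) averages over rows, so it only matches SUM(revenue)/COUNT(DISTINCT FiscalYear) when each group has at most one row per fiscal year, as in the sample data:

```sql
-- Sketch only: AVG over rows equals SUM/COUNT(DISTINCT FiscalYear)
-- only when each (SalesID, AccountID, FiscalMonth) group has at most
-- one row per fiscal year, as in the sample data above.
SELECT
     SalesID
    ,AccountID
    ,FiscalMonth
    ,COUNT(DISTINCT FiscalYear) AS [#ofYears]
    ,SUM(Revenue)               AS Revenue
    ,AVG(Revenue)               AS AvgByAccount
FROM dbo.AggTest
GROUP BY SalesID, AccountID, FiscalMonth;
```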

    --------------------------------------------------------------------------------------------------
    I am just an another naive wannabe DBA trying to learn SQL Server

  • @Shannon,

    The code that you have is fine for what you need. There are no filter criteria in the code, so it's going to do a full table scan no matter what. You might be able to improve performance a fair bit if you change the clustered index from whatever it is now to the columns that you have in the GROUP BY and, of course, in that order. Understanding that it will be a fairly large duplication of data, a non-clustered index based on the GROUP BY columns along with INCLUDEs for the other columns in the query (a "covering" index, in other words) might be a better alternative, especially since it appears that the GROUP BY columns are neither unique nor ever-increasing.
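    A sketch of those two indexing options (the index names are made up, and option 1 assumes any existing clustered index has been dropped first):

```sql
-- Option 1: re-create the clustered index on the GROUP BY columns
CREATE CLUSTERED INDEX IX_AggTest_Grouping
    ON dbo.AggTest (SalesID, AccountID, FiscalMonth);

-- Option 2: a "covering" non-clustered index -- the key matches the
-- GROUP BY and the INCLUDE carries the remaining query columns
CREATE NONCLUSTERED INDEX IX_AggTest_Covering
    ON dbo.AggTest (SalesID, AccountID, FiscalMonth)
    INCLUDE (FiscalYear, Revenue);
```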

    The big problem will occur when you do the joins you speak of. I have no idea what they may be so there's no way that we can help with that for now but...

    It would also appear that all but the current month of data is truly non-modifiable, historical data. My recommendation would be to pre-aggregate and store each month's worth of data (by the same columns as in the GROUP BY, with the addition of the fiscal year) in a summary table. As each month rolls by, you would only have to summarize the month just ended and add that data to your summary table. Think of it as a pre-aggregated index. Of course, you'd also need to store whatever data is necessary to join to the other tables (which you currently don't have listed) and, perhaps, do those joins after you calculate and store each month's summary to update the summary with the joined information.
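    A minimal sketch of such a summary table and its month-end load; the table name and the @FiscalYear/@FiscalMonth parameters are hypothetical:

```sql
CREATE TABLE dbo.AggSummary
(
     SalesID     smallint    NOT NULL
    ,AccountID   smallint    NOT NULL
    ,FiscalMonth smallint    NOT NULL
    ,FiscalYear  nvarchar(4) NOT NULL
    ,Revenue     int         NOT NULL
    ,CONSTRAINT PK_AggSummary
        PRIMARY KEY CLUSTERED (SalesID, AccountID, FiscalMonth, FiscalYear)
);

-- Month-end load: summarize only the month just ended.
INSERT dbo.AggSummary (SalesID, AccountID, FiscalMonth, FiscalYear, Revenue)
SELECT SalesID, AccountID, FiscalMonth, FiscalYear, SUM(Revenue)
FROM dbo.AggTest
WHERE FiscalYear  = @FiscalYear
  AND FiscalMonth = @FiscalMonth
GROUP BY SalesID, AccountID, FiscalMonth, FiscalYear;
```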

    Considering the nature of all of this, an Indexed View (also known as a materialized view) might be appropriate, with the warning that it could slow down current-month inserts and updates quite a bit.
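    A hedged sketch of what such an indexed view could look like. Indexed views carry restrictions: SCHEMABINDING is required, COUNT_BIG(*) must be present, and DISTINCT aggregates are not allowed, so the distinct-year count has to come from grouping by FiscalYear and counting downstream:

```sql
CREATE VIEW dbo.vAggSummary
WITH SCHEMABINDING
AS
SELECT  SalesID, AccountID, FiscalMonth, FiscalYear
       ,SUM(Revenue) AS Revenue
       ,COUNT_BIG(*) AS RowCnt   -- required in an indexed view
FROM dbo.AggTest
GROUP BY SalesID, AccountID, FiscalMonth, FiscalYear;
GO

-- Materializing the view:
CREATE UNIQUE CLUSTERED INDEX IX_vAggSummary
    ON dbo.vAggSummary (SalesID, AccountID, FiscalMonth, FiscalYear);
```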

    As a bit of a side bar, if you really need to do this over "hundreds of millions" of rows, I'd definitely take the pre-aggregated summary route and calculate just one month at a time for the original build, but there's something else to consider: backups, DR restores, and index maintenance. With that in mind, consider partitioning the table, using either a partitioned view (Enterprise Edition and Standard Edition), a partitioned table (Enterprise Edition only), or just plain ol' "hack'n'stack" methods, to put each month of data in a separate file in a separate filegroup. This will keep you from having to back up static data every night (you'll only need to back up the current month, which will be a HUGE nightly time savings), allow you to do "online piecemeal" restores if any part of the table ever goes corrupt, and greatly reduce the time for nightly index maintenance. Depending on the nature of your queries and the criteria used, it can also improve the performance of those queries, depending on what your partitioning and query search criteria are.
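    A bare-bones sketch of the partitioned-table route, keyed on a numeric fiscal period; the boundary values and filegroup names below are made-up examples:

```sql
-- One boundary per month; RANGE RIGHT puts each boundary value
-- into the partition to its right.
CREATE PARTITION FUNCTION pf_FiscalPeriod (int)
    AS RANGE RIGHT FOR VALUES (201401, 201402, 201403);

-- One filegroup per month (plus one for everything older).
CREATE PARTITION SCHEME ps_FiscalPeriod
    AS PARTITION pf_FiscalPeriod
    TO (FG_Archive, FG_201401, FG_201402, FG_201403);

-- The table (or its clustered index) would then be created
-- ON ps_FiscalPeriod(FiscalPeriodKey).
```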

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • BTW... in anticipation of you using any form of partitioning of one month per file per filegroup, if you think it was easy, you probably did it wrong and wasted a huge amount of space per file. If you decide to go the route of partitioning of the base data (regardless of the use of a summary table or not), let us know... I have some tricks that I've used when I've had to partition such mostly-static tables.

    --Jeff Moden



  • Thanks for the responses everyone :).

    @sachin: you are right, I named that strangely. It should be called AvgByYearByAccount. I called it that because the account is the most granular level in the aggregation and it is averaging across the years of data.

    @jeff: thank you for all the detailed information! In the final queries there are quite a lot of filters, so it executes fairly well (about 10 minutes across 100 million rows, so not too bad).

    I have actually done massive partitioning schemes before, and it was anything but easy! However, now that you bring it up, that was pretty much a custom-developed solution I built, so I am wondering what you have as far as partitioning goes.

    Also, that is really good information regarding the indexes; I did not know some of that. I am going to have to go back, check the execution plan, see what it is doing now, and see if I can tweak the indexes to make it go faster ;).


    As an idea for an alternative, here is a window-function-based solution. It still needs a CTE, as a window function cannot be used directly as a parameter to another window function.

    😎

    USE tempdb;
    GO

    WITH REP_BASE AS
    (
        SELECT
             AT.SalesID
            ,AT.AccountID
            ,AT.FiscalMonth
            ,DENSE_RANK() OVER
             (
                 PARTITION BY AT.SalesID, AT.AccountID, AT.FiscalMonth
                 ORDER BY AT.SalesID, AT.AccountID, AT.FiscalYear
             ) AS NUM_YEAR_FCM
            ,SUM(AT.Revenue) OVER
             (
                 PARTITION BY AT.SalesID, AT.AccountID, AT.FiscalMonth
             ) AS ACNT_FM_SUM
            ,ROW_NUMBER() OVER
             (
                 PARTITION BY AT.SalesID, AT.AccountID, AT.FiscalMonth
                 ORDER BY AT.SalesID ASC, AT.AccountID ASC, AT.FiscalYear DESC
             ) AS ACNT_FM_RID
        FROM dbo.AggTest AT
    )
    SELECT
         RB.SalesID
        ,RB.AccountID
        ,RB.FiscalMonth
        ,RB.NUM_YEAR_FCM AS [#ofYears]
        ,RB.ACNT_FM_SUM AS Revenue
        ,(RB.ACNT_FM_SUM / RB.NUM_YEAR_FCM) AS AvgByAccount
        ,SUM(RB.ACNT_FM_SUM / RB.NUM_YEAR_FCM) OVER
         (
             PARTITION BY RB.FiscalMonth
         ) AS totavg
        ,(RB.ACNT_FM_SUM / RB.NUM_YEAR_FCM * 1.0)
         / SUM(RB.ACNT_FM_SUM / RB.NUM_YEAR_FCM * 1.0) OVER
         (
             PARTITION BY RB.FiscalMonth
         ) AS ExpansionRatio
    FROM REP_BASE RB
    WHERE RB.ACNT_FM_RID = 1;

    Results

    SalesID AccountID FiscalMonth #ofYears Revenue AvgByAccount totavg ExpansionRatio
    ------- --------- ----------- -------- ------- ------------ ------ ------------------
          1         1           1        3      60           20    115 0.1739130434782608
          1         2           1        2     150           75    115 0.6521739130434782
          1         3           1        2      40           20    115 0.1739130434782608
          1         1           2        1       5            5    305 0.0163934426229508
          1         2           2        1      50           50    305 0.1639344262295081
          1         3           2        2     500          250    305 0.8196721311475409
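    As a quick sanity check of the arithmetic in those results, here is a small Python sketch that recomputes AvgByAccount, totavg, and ExpansionRatio from the sample rows; the ratios within each month should sum to 1:

```python
from collections import defaultdict

# Sample rows: (SalesID, AccountID, FiscalMonth, FiscalYear, Revenue)
rows = [
    (1, 1, 1, 'FY12', 10), (1, 3, 1, 'FY12', 20),
    (1, 1, 1, 'FY13', 20), (1, 2, 1, 'FY13', 50), (1, 3, 2, 'FY13', 1),
    (1, 1, 1, 'FY14', 30), (1, 2, 1, 'FY14', 100), (1, 3, 1, 'FY14', 20),
    (1, 1, 2, 'FY14', 5), (1, 2, 2, 'FY14', 50), (1, 3, 2, 'FY14', 499),
]

# Inner aggregation: revenue total and distinct-year set per group.
revenue = defaultdict(int)
years = defaultdict(set)
for sales, acct, month, year, rev in rows:
    key = (sales, acct, month)
    revenue[key] += rev
    years[key].add(year)

# AvgByAccount = SUM(Revenue) / COUNT(DISTINCT FiscalYear); integer
# division mirrors the int arithmetic in the T-SQL inner query.
avg_by_account = {k: revenue[k] // len(years[k]) for k in revenue}

# totavg = SUM(AvgByAccount) OVER (PARTITION BY FiscalMonth)
totavg = defaultdict(int)
for (sales, acct, month), avg in avg_by_account.items():
    totavg[month] += avg

# ExpansionRatio = AvgByAccount / totavg of the group's month
expansion = {k: avg_by_account[k] / totavg[k[2]] for k in avg_by_account}

print(totavg[1], totavg[2])            # -> 115 305
print(round(expansion[(1, 2, 1)], 4))  # -> 0.6522
```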
