Calculating running totals for partitions of data

  • I have a table named #t1 with three columns: CODE, COLUMN1, and COLUMN2.

    create table #t1 (
        CODE NVARCHAR(20),
        COLUMN1 NUMERIC(18,2),
        COLUMN2 NUMERIC(18,2)
    )

    And I have some data:

    INSERT INTO #t1 (CODE, COLUMN1, COLUMN2)
    VALUES ('432', 0, 100),
           ('TOTAL FOR 432', 0, 100),
           ('4320001', 0, 250),
           ('4320001', 50, 0),
           ('4320001', 0, 140),
           ('4320001', 300, 0),
           ('TOTAL FOR 4320001', 350, 390),
           ('432002', 200, 0),
           ('432002', 0, 100),
           ('TOTAL FOR 432002', 200, 100)

    drop table #t1 -- clean-up when done

    I want a fourth column (named BALANCE). BALANCE must be a running total of (COLUMN1 - COLUMN2) for each group of data. For each group the total must start from zero.

    Output:

    Take a look at the picture!

    After TOTAL FOR 432 the balance starts counting again for TOTAL FOR 4320001, and again for TOTAL FOR 432002. How can I get this result?

    I'm using MS SQL SERVER 2014

  • We don't use SQL 2014 where I work, but from a quick search the following link may be of use:

    https://msdn.microsoft.com/en-us/library/hh231256.aspx

    That said, I think based on your test data the "TOTAL FOR" rows may cause an issue.

    Below is a quick attempt at it in 2008 using a self join; no doubt someone will come along with a much more efficient way to do it 😉

    The row count and rank part was the only way I could think of to keep the TOTAL FOR rows from affecting the running total and to ensure they always come at the end of the data for each group.

    DECLARE @SomeData TABLE
        (
          CODE NVARCHAR(20) ,
          COLUMN1 NUMERIC(18, 2) ,
          COLUMN2 NUMERIC(18, 2)
        );

    INSERT INTO @SomeData
            ( CODE, COLUMN1, COLUMN2 )
    VALUES  ( N'432', 0, 100 ),
            ( 'TOTAL FOR 432', 0, 100 ),
            ( N'4320001', 0, 250 ),
            ( N'4320001', 50, 0 ),
            ( N'4320001', 0, 140 ),
            ( N'4320001', 300, 0 ),
            ( 'TOTAL FOR 4320001', 350, 390 ),
            ( '432002', 200, 0 ),
            ( '432002', 0, 100 ),
            ( 'TOTAL FOR 432002', 200, 100 );

    WITH DataRanked
    AS ( SELECT sd.CODE ,
                sd.COLUMN1 ,
                sd.COLUMN2 ,
                sd.COLUMN1 - sd.COLUMN2 AS PreTotal ,
                ROW_NUMBER() OVER ( PARTITION BY sd.CODE ORDER BY sd.CODE ) AS RCount ,
                RANK() OVER ( PARTITION BY REPLACE(sd.CODE, 'TOTAL FOR ', '') ORDER BY sd.CODE ) AS FinalSorter
         FROM @SomeData AS sd
       )
    SELECT b.CODE ,
           b.COLUMN1 ,
           b.COLUMN2 ,
           SUM(ISNULL(b2.PreTotal, 0)) AS RTotal
    FROM DataRanked b
    LEFT OUTER JOIN DataRanked b2 ON b.CODE = b2.CODE
                                 AND b2.RCount <= b.RCount
    GROUP BY b.CODE ,
             b.RCount ,
             b.FinalSorter ,
             b.COLUMN1 ,
             b.COLUMN2
    ORDER BY REPLACE(b.CODE, 'TOTAL FOR ', '') ,
             b.FinalSorter ,
             b.RCount;

  • You can use an OVER() to help here, but you have no ordering, so how can you get a running total? How do you know that ( N'4320001', 0, 140 ) comes before ( N'4320001', 300, 0 )? The order of insert has zero impact on how the rows are stored.
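
    For illustration, a minimal sketch of one way to give the rows an order (the RowID column name is an assumption here; the later replies use the same idea): if the data can be loaded into a table that has an IDENTITY column, the generated value records arrival order and gives the running total something to ORDER BY.

    -- Hypothetical sketch: rebuild the staging table with an IDENTITY column
    -- so every row carries its insert order for the running total to ORDER BY.
    create table #t1 (
        RowID   INT IDENTITY(1,1) PRIMARY KEY CLUSTERED, -- records insert order
        CODE    NVARCHAR(20),
        COLUMN1 NUMERIC(18,2),
        COLUMN2 NUMERIC(18,2)
    )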

  • I did notice no ordering on the data in the table but failed to mention it. My bad.

  • What do you think about this:

    DECLARE @t1 TABLE (
        CODE nvarchar(20),
        COLUMN1 numeric(18, 2),
        COLUMN2 numeric(18, 2)
    )

    INSERT INTO @t1 (CODE, COLUMN1, COLUMN2)
    VALUES ('432', 0, 100),
           ('TOTAL FOR 432', 0, 100),
           ('4320001', 0, 250),
           ('4320001', 50, 0),
           ('4320001', 0, 140),
           ('4320001', 300, 0),
           ('TOTAL FOR 4320001', 350, 390),
           ('432002', 200, 0),
           ('432002', 0, 100),
           ('TOTAL FOR 432002', 200, 100)

    -- CTE
    ;WITH CTE
    AS (SELECT
            *,
            ROW_NUMBER() OVER (ORDER BY @@ROWCOUNT) RowNum -- Order as per the input
        FROM @t1)
    -- get from CTE
    SELECT
        *,
        SUM(COLUMN1 - COLUMN2) OVER (PARTITION BY code ORDER BY RowNum) AS Balance -- sum using PARTITION
    FROM CTE
    ORDER BY RowNum

  • Here is a method that is usually faster.

    But first, here is the data, adding a few columns and not inserting those total rows:

    IF OBJECT_ID('tempdb.dbo.#t1') IS NOT NULL DROP TABLE #t1;

    CREATE TABLE #t1 (
        RowID INTEGER IDENTITY PRIMARY KEY CLUSTERED,
        CODE NVARCHAR(20),
        COLUMN1 NUMERIC(18,2),
        COLUMN2 NUMERIC(18,2),
        RunningTotal NUMERIC(18,2)
    );

    INSERT INTO #t1 (CODE, COLUMN1, COLUMN2)
    VALUES ('432', 0, 100),
           --('TOTAL FOR 432', 0, 100),
           ('4320001', 0, 250),
           ('4320001', 50, 0),
           ('4320001', 0, 140),
           ('4320001', 300, 0),
           --('TOTAL FOR 4320001', 350, 390),
           ('432002', 200, 0),
           ('432002', 0, 100)--,
           --('TOTAL FOR 432002', 200, 100);

    This next part does a "Quirky Update". See remarks within the code.

    -- this part uses a "Quirky Update" - the fastest way to do running totals.
    -- but this has a bunch of rules that need to be followed.
    -- these rules can be found at http://www.sqlservercentral.com/articles/T-SQL/68467/
    DECLARE @Sequence INTEGER,   -- for the safety check
            @RowID    INTEGER,   -- for the anchor column
            @Code     NVARCHAR(20),
            @Balance  NUMERIC(18,2);

    SET @Sequence = 0;
    SET @Code = N'';

    WITH SafetyCheck AS
    (
        SELECT RowID,
               CODE,
               COLUMN1,
               COLUMN2,
               LineTotal = COLUMN1 - COLUMN2,
               RunningTotal,
               [Sequence] = ROW_NUMBER() OVER (ORDER BY RowID)
        FROM #t1
    )
    UPDATE sc
    SET @RowID = sc.RowID,
        @Balance = CASE WHEN @Code = CODE THEN @Balance + LineTotal ELSE LineTotal END, -- reset the balance when the code changes, otherwise accumulate the running total
        RunningTotal = @Balance,
        @Code = CODE,
        @Sequence = CASE WHEN @Sequence + 1 = [Sequence] THEN @Sequence + 1 ELSE 1/0 END -- safety check: aborts if rows are not being calculated in the proper order
    FROM SafetyCheck sc;

    SELECT *
    FROM #t1;

    Here is what I would use on SQL Server 2012 or higher. It's simpler, almost as fast, and doesn't have all of those undocumented (by MS) rules to make it work:

    -- SQL SERVER 2012 version:
    SELECT RowID,
           CODE,
           COLUMN1,
           COLUMN2,
           RunningTotal = SUM(COLUMN1 - COLUMN2) OVER (PARTITION BY CODE ORDER BY RowID)
    FROM #t1
    ORDER BY RowID;

    Finally:

    IF OBJECT_ID('tempdb.dbo.#t1') IS NOT NULL DROP TABLE #t1;

    Both of these methods produce the following results:

    RowID CODE    COLUMN1 COLUMN2 RunningTotal
    ----- ------- ------- ------- ------------
    1     432     0.00    100.00  -100.00
    2     4320001 0.00    250.00  -250.00
    3     4320001 50.00   0.00    -200.00
    4     4320001 0.00    140.00  -340.00
    5     4320001 300.00  0.00    -40.00
    6     432002  200.00  0.00    200.00
    7     432002  0.00    100.00  100.00

    Does this look like what you want?

    Wayne

  • View fourth answer and tell me your opinion

  • kdejan87 (9/23/2015)


    What do you think about this:

    [kdejan87's query quoted in full above]

    Don't do that! SUM OVER will crush your performance. Even if you remove the ORDER BY at the end of your query (which is not required), you'll see that the query plan produces an ugly sort. With the ORDER BY you get an additional sort.

    Below is the sample data I was working with. Like Steve suggested and as Wayne did, I added a ROWID so we have something to sort by.

    -- Added ROWID so that we have something to sort by
    -- made it a primary key so that the query plan does not produce a sort
    IF OBJECT_ID('tempdb..#t1') IS NOT NULL DROP TABLE #t1
    GO

    CREATE TABLE #t1
    (
        ROWID int identity primary key,
        CODE NVARCHAR(20),
        COLUMN1 NUMERIC(18,2),
        COLUMN2 NUMERIC(18,2)
    )

    INSERT INTO #t1 (CODE, COLUMN1, COLUMN2)
    VALUES ('432', 0, 100),
           ('TOTAL FOR 432', 0, 100),
           ('4320001', 0, 250),
           ('4320001', 50, 0),
           ('4320001', 0, 140),
           ('4320001', 300, 0),
           ('TOTAL FOR 4320001', 350, 390),
           ('432002', 200, 0),
           ('432002', 0, 100),
           ('TOTAL FOR 432002', 200, 100);

    Below is your solution (a simplified version that produces the same query plan as what you posted). I also included two correlated subqueries, one with an ORDER BY and one without. If you run this you'll see a few things:

    SET NOCOUNT ON;
    SET STATISTICS IO ON;

    PRINT 'QUERY 1: SUM + PARTITION';
    SELECT
        *,
        SUM(COLUMN1 - COLUMN2) OVER (PARTITION BY code ORDER BY ROWID) AS Balance -- sum using PARTITION
    FROM #t1;

    PRINT 'QUERY 2: SUM + PARTITION + ORDER BY';
    SELECT
        *,
        SUM(COLUMN1 - COLUMN2) OVER (PARTITION BY code ORDER BY ROWID) AS Balance -- sum using PARTITION
    FROM #t1
    ORDER BY ROWID; -- slows things down...

    PRINT 'QUERY 3: CORRELATED SUBQUERY';
    SELECT *, (SELECT SUM(COLUMN1 - COLUMN2) FROM #t1 t2 WHERE t2.ROWID <= t1.ROWID AND t2.CODE = t1.CODE) AS Balance
    FROM #t1 t1;

    PRINT 'QUERY 4: CORRELATED SUBQUERY with sort';
    SELECT *, (SELECT SUM(COLUMN1 - COLUMN2) FROM #t1 t2 WHERE t2.ROWID <= t1.ROWID AND t2.CODE = t1.CODE) AS Balance
    FROM #t1 t1
    ORDER BY ROWID;

    SET STATISTICS IO OFF;

    First, SUM OVER requires a sort (two with the ORDER BY statement) even though the ORDER BY is sorting on the clustered index. The two correlated subqueries make better use of the index. Plus they work on any version of SQL Server.

    Correlated subqueries are NOT the way to go. Wayne showed a couple of better solutions. I just used them to show how inefficient SUM OVER is.

    Lastly (this is the reason for my edit)... You'll also notice that I turned STATISTICS IO ON. If you examine the statistics for the four queries I posted, you'll notice that the SUM OVER solutions both scan #t1 16 times (and that number will keep going up the more rows you add). The correlated subqueries scan the table twice, once for the outer query and once for the inner query. SUM OVER has nearly three times as many reads, and that number will keep going up as you add rows.

    In summary, Correlated subquery = bad; SUM OVER = very, very bad.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • @Wayne

    Was this your intended 2012 solution:

    -- SQL SERVER 2012 version:
    SET STATISTICS IO ON;

    SELECT RowID,
           CODE,
           COLUMN1,
           COLUMN2,
           RunningTotal = SUM(COLUMN1 - COLUMN2) OVER
                          (
                              PARTITION BY CODE ORDER BY RowID
                              ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
                          )
    FROM #t1
    ORDER BY RowID;

    @kdejan87 - using 2012 framing (ROWS BETWEEN UNBOUNDED....) makes SUM OVER okay... Referring back to my queries with Statistics IO on, this solution blows the correlated subquery away

    Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
    Table '#t1_________________________________________________________________________________________________________________00000000000D'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • kdejan87 (9/23/2015)


    View fourth answer and tell me your opinion

    ALL the answers are incorrect because there is no column that enforces the order of the data in the original data. Adding a ROWID won't help to do it right. If neither an incrementing number nor a transaction date/time is available in the original data, all of the answers will continue to be incorrect. 😉
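
    For illustration only, a minimal sketch of the kind of column Jeff is describing (the table and column names here are assumptions, not part of the original schema). Note that this only records the order of rows inserted from now on; it cannot recover the order of rows that are already in the table:

    -- Hypothetical sketch: give the source table something that records arrival order,
    -- so a running total has a deterministic ORDER BY. Names are illustrative only.
    CREATE TABLE dbo.Transactions (
        TranID   INT IDENTITY(1,1) PRIMARY KEY CLUSTERED,      -- incrementing number
        TranDate DATETIME2(3) NOT NULL DEFAULT SYSDATETIME(),  -- transaction date/time
        CODE     NVARCHAR(20),
        COLUMN1  NUMERIC(18,2),
        COLUMN2  NUMERIC(18,2)
    );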

    --Jeff Moden



  • kdejan87,

    Does the table have any kind of column(s) that would indicate the correct order of the transactions?

    --Jeff Moden



  • Alan.B (9/23/2015)


    @Wayne

    Was this your intended 2012 solution:

    [Alan's query and STATISTICS IO output quoted in full above]

    Alan,

    good catch, I did forget the ROWS clause.

    Wayne

  • Jeff Moden (9/23/2015)


    [Jeff's reply quoted above: ALL the answers are incorrect because no column enforces the order of the data]

    +1. This.
