RE: Solving the "Running Total" & "Ordinal Rank" Problems (Rewritten)

SSC Guru

Points: 1003888

November 10, 2009 at 10:03 pm

Hugo Kornelis (11/10/2009)
I then changed the options from LOCAL FORWARD_ONLY READ_ONLY to LOCAL STATIC READ_ONLY and shaved off another half a mintue - 3:58

Hi Hugo,

Heh... just some feedback so you know... I think your machine hates cursors worse than I do 😛 Your optimized cursor only took 2:13 (~133 seconds) on my machine. You've gotten it down to only 18 times worse and that's not a slam... well done.

As a side bar, I was never good at writing cursors and I guess it's a good thing I didn't take them up. :hehe:

Back to the subject...

I added some "timer" code to your "Ninja" code for the book so we don't have to do any manual calculations to determine the run time for each section. Here's the code with all changes clearly marked...

DECLARE @RunStartTime DATETIME, --Added

@SectionStartTime DATETIME; --Added

SELECT @RunStartTime = CURRENT_TIMESTAMP, --Added

@SectionStartTime = CURRENT_TIMESTAMP; --Added

--SELECT CURRENT_TIMESTAMP; --Removed

DECLARE @Results TABLE

(AccountID int NOT NULL,

Date datetime NOT NULL,

TransactionDetailID int NOT NULL,

Amount money NOT NULL,

RunningTotal money NULL,

Rnk int NOT NULL,

PRIMARY KEY (Rnk, AccountID));

INSERT INTO @Results(AccountID, Date, TransactionDetailID, Amount,

RunningTotal, Rnk)

SELECT AccountID, Date, TransactionDetailID, Amount, Amount,

RANK() OVER (PARTITION BY AccountID

ORDER BY Date,

TransactionDetailID)

FROM dbo.TransactionDetail;

SELECT STR(DATEDIFF(ss,@SectionStartTime,CURRENT_TIMESTAMP),4) --Added

+ ' Seconds to INSERT';

SET @SectionStartTime = CURRENT_TIMESTAMP; --Modified

DECLARE @Rank int,

@RowCount int;

SET @Rank = 1;

SET @RowCount = 1;

WHILE @RowCount > 0

BEGIN;

SET @Rank = @Rank + 1;

UPDATE nxt

SET RunningTotal = prv.RunningTotal + nxt.Amount

FROM @Results AS nxt

INNER JOIN @Results AS prv

ON prv.AccountID = nxt.AccountID

AND prv.Rnk = @Rank- 1

WHERE nxt.Rnk = @Rank;

SET @RowCount = @@ROWCOUNT;

END;

SELECT STR(DATEDIFF(ss,@SectionStartTime,CURRENT_TIMESTAMP),4) --Added

+ ' Seconds for Running Total UPDATE';

SET @SectionStartTime = CURRENT_TIMESTAMP; --Modified

UPDATE td

SET AccountRunningTotal = r.RunningTotal,

AccountRunningCount = r.Rnk

FROM dbo.TransactionDetail AS td

INNER JOIN @Results AS r

ON r.TransactionDetailID = td.TransactionDetailID;

SELECT STR(DATEDIFF(ss,@SectionStartTime,CURRENT_TIMESTAMP),4) --Added

+ ' Seconds for Final UPDATE';

SELECT STR(DATEDIFF(ss,@RunStartTime,CURRENT_TIMESTAMP),4) --Added

+ ' Seconds for Total Run';

On my machine, your good code took the following times...

16 Seconds to INSERT

29 Seconds for Running Total UPDATE

51 Seconds for Final UPDATE

----

96 Seconds for Total Run

That's incredible. Folks now have an alternative not only to the slothfulness of a cursor, but it's fast enough (well, for some folks) to make the future guaranteed stability of the method worth the wait.

Heh... of course, I have to tweek on that code a bit :-D. The first thing I noticed was that there were a bunch of row counts being output. Rumor has it that that costs extra time even if they never make it to the screen. I added a SET NOCOUNT ON to your good code and it chopped about 1 second off the run...

16 Seconds to INSERT

29 Seconds for Running Total UPDATE

50 Seconds for Final UPDATE

----

95 Seconds for Total Run

Surprisingly, the time savings didn't occur on the Running Total UPDATE where all the row counts are generated, so I ran it a couple of times again and the times (on my very quiet test machine) varied from 95 to 97 seconds so it doesn't look like the SET NOCOUNT ON actually helped much here, probably due to the fact that there were only 20,555 lines printed in the message window and half of them were empty string rows.

The 2 UPDATE sections of the code are pretty tight so I couldn't do much there. But, that initial insert and that Table Variable needed a change...

SET NOCOUNT ON

DECLARE @RunStartTime DATETIME, --Added

@SectionStartTime DATETIME; --Added

SELECT @RunStartTime = CURRENT_TIMESTAMP, --Added

@SectionStartTime = CURRENT_TIMESTAMP; --Added

--SELECT CURRENT_TIMESTAMP; --Removed

--DECLARE @Results TABLE

-- (AccountID int NOT NULL,

-- Date datetime NOT NULL,

-- TransactionDetailID int NOT NULL,

-- Amount money NOT NULL,

-- RunningTotal money NULL,

-- Rnk int NOT NULL,

-- PRIMARY KEY (Rnk, AccountID));

--

--INSERT INTO @Results(AccountID, Date, TransactionDetailID, Amount,

-- RunningTotal, Rnk)

--SELECT AccountID, Date, TransactionDetailID, Amount, Amount,

-- RANK() OVER (PARTITION BY AccountID

-- ORDER BY Date,

-- TransactionDetailID)

--FROM dbo.TransactionDetail;

--===== Replaced Create Table/Insert/Select from above

SELECT ISNULL(AccountID,0) AS AccountID, --ISNULL makes a NOT NULL column

Date,

ISNULL(TransactionDetailID,0) AS TransactionDetailID, --ISNULL makes a NOT NULL column

--and strips IDENTITY off

Amount,

Amount AS RunningTotal,

ISNULL(RANK() OVER (PARTITION BY AccountID --ISNULL makes a NOT NULL column

ORDER BY Date,

TransactionDetailID),

0) AS Rnk

INTO #Results

FROM dbo.TransactionDetail;

ALTER TABLE #Results --Added to replace the PK we didn't make during the table build.

ADD PRIMARY KEY (Rnk, AccountID);

SELECT STR(DATEDIFF(ss,@SectionStartTime,CURRENT_TIMESTAMP),4) --Added

+ ' Seconds to INSERT';

SET @SectionStartTime = CURRENT_TIMESTAMP; --Modified

DECLARE @Rank int,

@RowCount int;

SET @Rank = 1;

SET @RowCount = 1;

WHILE @RowCount > 0

BEGIN;

SET @Rank = @Rank + 1;

UPDATE nxt

SET RunningTotal = prv.RunningTotal + nxt.Amount

FROM #Results AS nxt --Changed to temp table

INNER JOIN #Results AS prv --Changed to temp table

ON prv.AccountID = nxt.AccountID

AND prv.Rnk = @Rank- 1

WHERE nxt.Rnk = @Rank;

SET @RowCount = @@ROWCOUNT;

END;

SELECT STR(DATEDIFF(ss,@SectionStartTime,CURRENT_TIMESTAMP),4) --Added

+ ' Seconds for Running Total UPDATE';

SET @SectionStartTime = CURRENT_TIMESTAMP; --Modified

UPDATE td

SET AccountRunningTotal = r.RunningTotal,

AccountRunningCount = r.Rnk

FROM dbo.TransactionDetail AS td

INNER JOIN #Results AS r --Changed to temp table

ON r.TransactionDetailID = td.TransactionDetailID;

SELECT STR(DATEDIFF(ss,@SectionStartTime,CURRENT_TIMESTAMP),4) --Added

+ ' Seconds for Final UPDATE';

SELECT STR(DATEDIFF(ss,@RunStartTime,CURRENT_TIMESTAMP),4) --Added

+ ' Seconds for Total Run';

Take a look at those times, Hugo...

9 Seconds to INSERT

27 Seconds for Running Total UPDATE

28 Seconds for Final UPDATE

----

64 Seconds for Total Run

NOW we're cooking with gas! Just a bit over a minute to do a million row grouped running total and grouped running count and it was done without the "Quirky Update" and it beat the heck out of the best cursor anyone could hope to write. And, it's only 8 times slower than the "Quirky Update" on my 7 year old desktop. On a real server, it should fly! This is getting interesting...

Hmmmm... I wonder how it treats IO... let's see what profiler says about the "Quirky Update" vs the tweeked Hugo code...

Ack... almost 1,000 times more IO than the "Quirky Update" even if some of it is memory based. Despite the fact that some folks are concerned about MS doing something to the core code of SQL Server (which I don't think to be likely, too expensive), I'll bank on the fact that the "Quirky Update" worked on SyBase (the original source of SQL Server) and SQL Server versions 6.5, 7, 2000, 2005, and 2008 and has made it through all the service packs and other updates for those major versions. Everything prior to 2008 is pretty much locked down so I don't believe we have to worry about them breaking that code. And, if they manage to break some of the core code (UPDATE) in 2008 or further in the future, it's great that, thanks to you, we have a more performant alternative than using a RBAR cursor to solve the running total problem.

For those that have been playing along with Hugo and I, every run except the profiler runs have gone through the "Verify" code just to be sure that nothing was broken along the way.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)