Solving the Running Total and Ordinal Rank Problems (Rewritten)

• Hugo Kornelis (10/7/2016)

Jeff Moden (10/7/2016)

Hugo Kornelis (10/7/2016)

You could wrap the fast but unreliable method ...

Heh... maybe for you but it's never been unreliable for me. You must be doing something wrong. 😉

Are we really going to have that discussion again?

Only if you keep saying that it's unreliable. 😉

Let's just say that you and I have different definitions of "reliable".

Maybe. My definition of "reliable" is that it works correctly when used correctly. You might differ with that last part.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

• Lynn Pettis (10/7/2016)

Hugo Kornelis (10/7/2016)

Jeff Moden (10/7/2016)

Hugo Kornelis (10/7/2016)

You could wrap the fast but unreliable method ...

Heh... maybe for you but it's never been unreliable for me. You must be doing something wrong. 😉

Are we really going to have that discussion again?

Let's just say that you and I have different definitions of "reliable".

Heh... it'll be almost as much fun as talking about Nulls or Bankers Rounding. 😛

--Jeff Moden


• Jeff Moden (10/7/2016)

Hugo Kornelis (10/7/2016)

Jeff Moden (10/7/2016)

Hugo Kornelis (10/7/2016)

You could wrap the fast but unreliable method ...

Heh... maybe for you but it's never been unreliable for me. You must be doing something wrong. 😉

Are we really going to have that discussion again?

Only if you keep saying that it's unreliable. 😉

Let's just say that you and I have different definitions of "reliable".

Maybe. My definition of "reliable" is that it works correctly when used correctly. You might differ with that last part.

Just to save everyone a lot of time: Jeff and I had this discussion before. I cannot convince him, he cannot convince me. That's fine.

Anyone who considers using this code should really go back in this discussion (and probably a few others as well) to find the arguments. Remember that copying code off the internet does not change your responsibility. When it's good, you'll get credit (deservedly, because finding the right example and adapting it correctly deserves credit as well). But when it's bad, you'll get the blame.

If you implement this code after reading the arguments Jeff and I exchanged, then you made a conscious decision to accept (in my opinion) some level of risk for the benefit of fast performance with easy code - and you will be able to defend your decision in case it does go wrong.

If you decide to use this code just because Jeff says so - or if you decide to not use this code just because I say so - then you are guilty of being gullible and lazy, and of not taking your work seriously.

Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)
Visit my SQL Server blog: https://sqlserverfast.com/blog/
SQL Server Execution Plan Reference: https://sqlserverfast.com/epr/

• on more practical matters - several critical stored procedures for my client, the largest employer in our county - rely on the quirky method to load daily tables - it's 2008 R2 now, but at some point in the misty future they'll have to upgrade

will there be a problem?

• Seggerman-675349 (10/8/2016)

on more practical matters - several critical stored procedures for my client, the largest employer in our county - rely on the quirky method to load daily tables - it's 2008 R2 now, but at some point in the misty future they'll have to upgrade

will there be a problem?

As with any code that relies on undocumented and unsupported behaviour, the answer is that nobody can tell for sure. The existing code can start to behave differently today, even before upgrading. It can also behave differently after an upgrade.

Is it likely to cause issues? If you follow all the guidelines Jeff outlines (which are basically the collection of all situations where issues have been observed so far), then I would say that it is pretty unlikely. But pretty unlikely is not a full guarantee.

In the past your client has apparently decided that it's okay to take some (small but non-zero) risk in order to gain both execution and development speed. At this time they will have to decide if they'll make the same decision again, or not. If the supported system has become more critical and the potential damage of incorrect results has gone up, then perhaps they will choose to invest the time to convert to using the OVER clause for the running totals, or to another supported solution. If they deem the risk(**) acceptable, then they might decide to keep the current code.

(**) Remember that risk is defined as "chance of failure occurring" multiplied by "damage caused by the failure". An upgrade increases the chance of failure, especially when using undocumented behaviour. The potential damage will have to be assessed by you or your superiors.

Hugo Kornelis, SQL Server/Data Platform MVP (2006-2016)

• Seggerman-675349 (10/8/2016)

on more practical matters - several critical stored procedures for my client, the largest employer in our county - rely on the quirky method to load daily tables - it's 2008 R2 now, but at some point in the misty future they'll have to upgrade

will there be a problem?

None that I've seen or heard of. Some folks have done testing all the way through 2016 and it hasn't changed.

Of course, any upgrade brings risks. For example, the use of FOR XML PATH('') to concatenate values (a seriously documented and supported method to do something not necessarily intended) has broken for me due to changes in the optimizer not once but 3 times now.

Shifting gears a bit, though (and this much I do agree with Hugo and others on)... if you're using SQL Server 2012 and up, there's no longer a need to use the undocumented method known as the "Quirky Update" unless you absolutely can't afford any degradation of performance, no matter how little. My recommendation would be to go with what's supported if possible. If not possible, then be sure to include the "safety check" in the code. If you don't know what that is, post back. It's not difficult to add if you don't have it in place already.
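[Editor's note: a minimal sketch of the supported 2012+ approach mentioned above, using a windowed SUM for a running total. The table and column names here are hypothetical, for illustration only.]

SELECT  AccountID,
        TranDate,
        Amount,
        RunningTotal = SUM(Amount) OVER (PARTITION BY AccountID
                                         ORDER BY TranDate
                                         ROWS UNBOUNDED PRECEDING)
FROM dbo.Transactions
ORDER BY AccountID, TranDate;

Spelling out ROWS UNBOUNDED PRECEDING matters for performance: if no frame is given, the default is RANGE UNBOUNDED PRECEDING, which uses an on-disk worktable and is noticeably slower.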

--Jeff Moden


• thanks

I suspect it's going to be a while before it gets upgraded from 2008 R2 and I will have long since retired by then

for now it's a 20 minute process instead of an hour and a half (I shudder to think what it would have taken with the original code - a cursor within a cursor processing 3 million records in an early morning job)

• First, I'll apologize for having edited this comment so much. I hope it didn't cause headaches for those who monitor. I thoroughly appreciate all of you.

I've found a problem for which I can find no solution other than the Quirky Update method, even in v2017. Unless someone knows a solution that performs at least within an order of magnitude of the one below, it seems the usefulness of the Quirky Update will not disappear anytime soon.

It's a gaps and islands problem, except I need to allow/ignore gaps smaller than a specific size, and duplicates must be OK. For example, 1,2,2,3,5,8,10 needs to coalesce into two groups: 1-5 and 8-10. In that example I'm ignoring gaps of 1, so a gap of 2 will cause a split, but typically I'd ignore bigger gaps. Below, gaps as large as 9 are ignored but a gap of 10 will cause a new island to form.

If another method exists, I'm totally interested. Also, given that the validity check discussed earlier doesn't work in this case, I'm interested in any ideas on that, too!

DROP TABLE #t;
GO
CREATE TABLE #t (
 CreateDate date NOT NULL,
 TufpID     int  NOT NULL,
 Cnt        int  NOT NULL,
 FuzzyGroup int  NULL);
ALTER TABLE #t ADD CONSTRAINT PK_temp PRIMARY KEY CLUSTERED (CreateDate, TufpID);

-- The source table has an index on (TufpID,CreateDate), which is backwards, so
-- this takes 40 seconds to write 4.4M rows from a source of 70M rows.
-- I'm only showing this to give a better idea of the actual source data.
INSERT INTO #T
SELECT X.CreateDate
,X.TufpID
,Cnt        = COUNT(*)
,NULL
FROM SourceTable X
GROUP BY X.CreateDate
,X.TufpID
ORDER BY CreateDate,TufpID;      -- SELECT TOP 100 * FROM #T ORDER BY CreateDate,TufpID;

-- This takes 3 seconds to update the 4.4M rows created above
DECLARE @FuzzFactor int = 10;  -- Ignore gaps smaller than this.
DECLARE @Group int = -@FuzzFactor*2;
DECLARE @Prior int = -@FuzzFactor*2;
DECLARE @CDate date;
UPDATE #T
SET @Group = FuzzyGroup = CASE WHEN t.TufpID - @Prior < @FuzzFactor AND t.CreateDate = @CDate
THEN @Group ELSE t.TufpID END
,@CDate = CASE WHEN @CDate = t.CreateDate THEN @CDate ELSE t.CreateDate END
,@Prior = CASE WHEN @Prior = t.TufpID-1   THEN @Prior + 1 ELSE t.TufpID END
FROM #t t WITH (TABLOCKX) OPTION(MAXDOP 1);       -- SELECT TOP 100 * FROM #t ORDER BY CreateDate,TufpID;

-- Summarize fuzzy islands
WITH X AS (
SELECT CreateDate
,RowCnt     = SUM(Cnt)
,TufpIDCnt  = COUNT(*)
FROM #t
GROUP BY CreateDate,FuzzyGroup
), Y AS (
SELECT CreateDate
,RowCnt     = SUM(RowCnt)
,Clusters   = COUNT(*)
,MinTufps   = MIN(TufpIDCnt)
,MaxTufps   = MAX(TufpIDCnt)
,AvgTufps   = AVG(TufpIDCnt)
,TufpIDCnt  = SUM(TufpIDCnt)
FROM X
GROUP BY CreateDate
)
SELECT *
,IdsPerCluster = TufpIDCnt/Clusters
,AvgAsPct      = (AvgTufps*100)/TufpIDCnt
,MaxAsPct      = (MaxTufps*100)/TufpIDCnt
FROM Y
WHERE RowCnt > 10000
ORDER BY RowCnt desc;
Also, I realize this particular example can't have duplicates but other more complex cases do.

Thanks again for the technique!!
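[Editor's note: for comparison, here is a sketch of a supported 2012+ rewrite of the fuzzy grouping above, using the standard gaps-and-islands pattern with LAG and a windowed SUM. It is written as a SELECT against the #t table defined earlier; it is untested against the poster's data and almost certainly slower than the quirky update, but it could also serve as an independent cross-check of its results.]

DECLARE @FuzzFactor int = 10;  -- Same meaning as above: split on gaps of this size or larger.

WITH Marked AS (
    SELECT  CreateDate, TufpID, Cnt,
            -- 1 when this row starts a new island (first row of the date,
            -- or the gap to the previous TufpID is too big), else 0.
            IslandStart = CASE WHEN TufpID - LAG(TufpID) OVER
                                    (PARTITION BY CreateDate ORDER BY TufpID)
                                    < @FuzzFactor
                               THEN 0 ELSE 1 END
    FROM #t
), Numbered AS (
    SELECT  CreateDate, TufpID, Cnt,
            -- A running sum of the start flags numbers the islands per date.
            IslandNo = SUM(IslandStart) OVER (PARTITION BY CreateDate
                                              ORDER BY TufpID
                                              ROWS UNBOUNDED PRECEDING)
    FROM Marked
)
SELECT  CreateDate, TufpID, Cnt,
        -- Label each island by its lowest TufpID, matching FuzzyGroup above.
        FuzzyGroup = MIN(TufpID) OVER (PARTITION BY CreateDate, IslandNo)
FROM Numbered
ORDER BY CreateDate, TufpID;

Note that LAG returns NULL for the first row of each date, so the CASE falls through to the ELSE branch and correctly starts a new island there.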
