The Set-Based Limit

  • I played in concerts back in high school... and if and when you hit a wrong note, you can't correct it without starting the whole thing over... quite similar to changing the code :).

  • Jeff,

    You'll be happy to know that your understanding of set based code and mine are 100% the same, going by what you posted in this thread. But still I cannot subscribe to your conclusions, and no doubt you will be interested to know why, as that might highlight something not often brought up! I see myself a little as a bridge person here, as I am a thoroughbred developer with over 20 years under my belt and am quite deep into SQL as well. Maybe not to the level of some of the DBAs here, but I do like SQL Server a lot and it offers amazing performance most of the time. That said, its implementation has a serious flaw, and just ignoring it because I am a "fan" and like SQL will not make it go away!

    So here goes... I will try to articulate the problem as best I can... I owe you that much (we all do)!

    The problem as I see it is SQL Server's ** pardon me ** horrible function support. As soon as you use functions (even deterministic scalar ones), pure set based solutions seem to break down performance wise. There is an unreasonable cost to keeping set based solutions readable when they involve even moderate amounts of "just beyond simple" logic that is best broken down into multiple lines of code to ease understanding and improve expressiveness.

    Lack of good function support is a problem because it prevents formulating repeating logic into human readable deterministic scalar functions that can be used in queries like any other logic encapsulated in built-in functions. The current way to keep performance up is to write the logic as complex expressions, much like a formula, that simulate the logic and can be used in set based operations without the performance hit functions take for no good reason.

    It would be preferable to rewrite code such as this

    select
        case
            when st.a > st.b then st.a
            else st.b
        end as BiggestNumberInRow
    from
        dbo.SomeTable as st
    where
        case
            when st.a > st.b then st.a
            else st.b
        end < 100

    into code such as

    select
        dbo.maxNum( st.a, st.b ) as BiggestNumberInRow
    from
        dbo.SomeTable as st
    where
        dbo.maxNum( st.a, st.b ) < 100

    Without having to incur a performance hit, as the second snippet of code is much easier to understand. It demonstrates clearly how much more readable SQL code becomes when you use functions to hide details that derail you from understanding what the query as a whole tries to accomplish. And as soon as you need something even a little beyond what the built-in functions offer, you will run into situations like this one.
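    For completeness, dbo.maxNum here would be a trivial deterministic scalar function along these lines (a sketch; its body is not part of the original example):

    create function dbo.maxNum( @a as int, @b as int ) returns int
    with schemabinding as
    begin
        return case when @a > @b then @a else @b end
    end
    go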

    Here is an example of the formula-like notation I recently used in a project.

    select
        count(1) as TotalConsumers
    ,   sum( sign( isnull( l.IdConsumer, 0 ) ) ) as ActivatedConsumers
    ,   sum( 1 - sign( isnull( l.IdConsumer, 0 ) ) ) as PendingConsumers
        -- l.IdConsumer is null
    ,   sum( ( 1 - sign( isnull( l.IdConsumer, 0 ) ) )
           * sign( isnull( l.EstateAgentMailSend, 0 ) ) ) as IncompleteConsumers
        -- l.IdConsumer is null and l.EstateAgentMailSend is not null
    ,   sum( sign( isnull( l.IdConsumer, 0 ) )
           * sign( isnull( l.EstateAgentMailSend, 0 ) ) ) as CompletedConsumers
        -- l.IdConsumer is not null and l.EstateAgentMailSend is not null
    from
        import.feedlog as l
    where
        l.idestateagent = 20

    The code touches each record only once, like proper set based code does. It uses formula-like expressions to avoid 'case when ... then ... else ... end' code. The intention of the logic, which I wrote as a comment after each expression, is totally lost if you just read the formulas. Making a modification in such code is a pain if you do not fully understand what is going on or have only moderate SQL knowledge.
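    To make the contrast concrete, here is what just the ActivatedConsumers and PendingConsumers aggregates might look like with explicit case expressions instead of the sign/isnull formulas (a sketch, not code from the project; it assumes IdConsumer is a positive key, just as the formulas do):

    select
        sum( case when l.IdConsumer is not null then 1 else 0 end ) as ActivatedConsumers
    ,   sum( case when l.IdConsumer is null then 1 else 0 end ) as PendingConsumers
    from
        import.feedlog as l
    where
        l.idestateagent = 20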

    BTW, this is actual code I wrote for a colleague who first used a query with a sub select for each aggregate... obviously touching every record multiple times. Seeing this, I think it is hard to uphold that set based code is by definition more elegant and easier to write and understand.

    Here is an older production case where I decided to use deterministic scalar functions to encapsulate some logic which, written as formulas, would be incomprehensible.

    -- Returns the day of week, with monday being day 0, sunday being day 6!
    --
    create function dbo.dayOfWeek( @day as datetime ) returns int
    with schemabinding as
    begin
        declare @delta as int
        set @delta = datediff( d, {d '2007-01-01'}, @day ) % 7
        return case when @delta < 0 then 7 + @delta else @delta end
    end
    go

    -- Returns the number of week days (mo..fr) in a date range!
    --
    create function dbo.fnWeekDaysInDateRange( @firstday as datetime, @lastday as datetime ) returns int
    with schemabinding as
    begin
        -- Widen date range to encompass full calendar weeks
        declare @firstMonday as datetime
        set @firstMonday = dateadd( d, 0 - dbo.dayOfWeek( @firstday ), @firstday )

        declare @lastMonday as datetime
        set @lastMonday = dateadd( d, 7 - dbo.dayOfWeek( @lastday ), @lastday )

        declare @excessdays as int
        declare @daysdiff as int

        -- Compensate for extended days at start of daterange
        set @daysdiff = datediff( d, @firstMonday, @firstday )
        set @excessdays = case when @daysdiff > 5 then 5 else @daysdiff end

        -- Compensate for extended days at end of daterange
        set @daysdiff = datediff( d, @lastday, @lastMonday )
        set @excessdays = @excessdays + case
            when @daysdiff >= 3 then @daysdiff - 2
            when @daysdiff != 0 then 1
            else 0 end

        -- Calculate total number of workdays!
        return ( datediff( ww, @firstMonday, @lastMonday ) * 5 ) - @excessdays + 1
    end
    go

    These functions are then used in some views and many queries. I show this code because the encapsulated logic cannot realistically be deployed without functions. These particular functions are deterministic, meaning that for any specific input the outcome will always be the same, and they do no IO at all. As such they should be evaluated at close to the same speed as built-in functions in an optimized system.
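    For example, a query using them would simply read like this (illustrative; dbo.Projects and its columns are made up for the example):

    select
        p.ProjectName
    ,   dbo.fnWeekDaysInDateRange( p.StartDate, p.EndDate ) as WeekDays
    from
        dbo.Projects as p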

    Unfortunately they do not perform as well as they could, due to the way functions are implemented in SQL Server... but they demonstrate a realistic need for better function support. You do not want to rewrite these functions into one big formula just to evade a SQL Server ** implementation ** performance limitation. It can be done, but I lack the time now to do so; the result, however, would be much faster, and you wouldn't understand how it does its work at all, no matter how many detailed comments I added to the code! Let alone be able to maintain it or spot a problem!

    I think I have made enough of a point to have a meaningful discussion about the role of functions in all this. The common perception is that functions are slow due to them being procedural, but I tell you... built-in functions are also procedural. And many SQL tricks deployed to make fast set based code rely heavily on inefficient use of relatively fast built-in functions. This IMHO makes the claim that set based code wins because it contains no procedural parts... a shaky claim at best, in many of the cases where it is victorious over other coding.

    There is more to this than meets the eye... it is benchmarking, after all. I have seen many arguments where it is claimed that on an old server this solution beats that other solution. Seldom do I see the counter argument that benchmarking on older hardware is a bad way to test how things will run on newer hardware. Characteristics of servers change over generations: memory is dirt cheap and plentiful these days, and addressing large amounts of it is easy with 64 bit systems too. Disks are about to undergo a revolution with the advance of solid state disks... current server CPUs are incredibly fast and their power often still goes underutilized. Testing on hardware from the last generation is shaky; testing on hardware two or more generations old, or on a laptop, is just fooling oneself.

    Someone on this forum made a sharpish comment a while back that I believe deserves more attention. It was about how servers under load respond differently than when testing on a nearly dedicated server, such as a development server. Proper set based code, following the definition Jeff mentioned, will work best across different server loads. This is because each set based operation can be sort of 'streamed', and resource use in general is low and can be throttled by the server. The worst that can happen, when the query plan is good, is that temporary results need to be stored in tempdb.

    Which brings us to the optimizer... complex set based code relies heavily on the optimizer to perform well. If it goes tits up, prepare for performance far worse than you can imagine based on normal experience. You need to do serious testing to make sure it works as efficiently as you expect... check the query plan, IO statistics and time statistics when evaluating several solutions. Really complex operations are often best separated into several set based steps. The optimizer is pretty good but not almighty, and it will let you down at some point if you rely on it blindly. As an added advantage, you can put in knowledge about the job/data that an optimizer simply cannot extract from the statement or from statistics (or it simply gives up trying too soon).

    It would be nice if Microsoft would give us more tools to help us optimize, like some control to simulate the conditions a query runs under (memory, locks, optimizer aggressiveness). It would make tuning solutions a lot easier and also let us explore how far the optimizer can be pushed, to extract the most out of it.

    I have written too much already, I believe... shoot!

  • To Jeff mostly,

    To bolster my claim about function overhead being the great killer in this debate... I read your article about the function you wrote for determining the number of workdays in a date range. I read it AFTER my previous post and did some quick benchmarks.

    I used two functions:

    1. One for calculating the day of the week in a deterministic way (no side effects)

    2. One doing the actual work.

    Your solution used one function that did everything on its own. Guess what... your solution was faster on my system (roughly 9 vs 15 seconds on 100,000 date ranges).

    Then I switched to one function as well, by duplicating the work from my first function into the calling function instead of making the two calls. This function then became faster still (roughly 7 vs 9 seconds on 100,000 date ranges).

    Thus for performing the exact same work, my original solution went from 15 to 7 seconds!!! The time difference is totally attributable to the function call overhead! This IMHO proves the whole point about functions being poorly implemented in SQL Server, and showcases exactly why in many instances set based code is unreasonably much faster than procedural code.
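    For anyone wanting to reproduce this kind of measurement, a harness of roughly this shape will do (a sketch, not my original test code; dbo.TestDateRanges stands for a hypothetical table with 100,000 firstday/lastday pairs, and the max() keeps client result transfer out of the timing):

    set statistics time on

    select max( dbo.fnWeekDaysInDateRange( r.firstday, r.lastday ) ) as WeekDays
    from dbo.TestDateRanges as r

    set statistics time off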

    As I said before, I could rewrite the code to fit on one line and do it all, and it would be even faster (maybe I will, just to stun you all)... but would anyone still comprehend it?

    New code:

    create function dbo.fnWeekDaysInDateRange_solo( @firstday as datetime, @lastday as datetime ) returns int
    with schemabinding as
    begin
        declare @delta as int
        declare @excessdays as int
        declare @daysdiff as int
        declare @firstMonday as datetime
        declare @lastMonday as datetime

        -- Widen date range to encompass full calendar weeks
        set @delta = datediff( d, {d '2007-01-01'}, @firstday ) % 7
        set @firstMonday = dateadd( d, 0 - case when @delta < 0 then 7 + @delta else @delta end, @firstday )
        set @delta = datediff( d, {d '2007-01-01'}, @lastday ) % 7
        set @lastMonday = dateadd( d, 7 - case when @delta < 0 then 7 + @delta else @delta end, @lastday )

        -- Compensate for extended days at start of daterange
        set @daysdiff = datediff( d, @firstMonday, @firstday )
        set @excessdays = case when @daysdiff > 5 then 5 else @daysdiff end

        -- Compensate for extended days at end of daterange
        set @daysdiff = datediff( d, @lastday, @lastMonday )
        set @excessdays = @excessdays + case when @daysdiff >= 3 then @daysdiff - 2 when @daysdiff != 0 then 1 else 0 end

        -- Calculate total number of workdays!
        return ( datediff( ww, @firstMonday, @lastMonday ) * 5 ) - @excessdays + 1
    end
    go

    Inline table valued functions fare better, until they overwhelm the optimizer's capacity of course... this version does the same job in 3-4 seconds and needs to be used in conjunction with cross apply (see the usage sketch after the function).

    create function dbo.fnWeekDaysInDateRange_solo_inline( @firstday as datetime, @lastday as datetime ) returns table
    as
    return
    (
        with
        -- Widen date range to encompass full calendar weeks - to first monday
        firstDeltaQ (N) as ( select datediff( d, {d '2007-01-01'}, @firstday ) % 7 )
    ,   firstMondayQ (N) as
        (
            select dateadd( d, 0 - case when delta.N < 0 then 7 + delta.N else delta.N end, @firstday )
            from firstDeltaQ as delta
        )
        -- Widen date range to encompass full calendar weeks - to next monday
    ,   lastDeltaQ (N) as ( select datediff( d, {d '2007-01-01'}, @lastday ) % 7 )
    ,   lastMondayQ (N) as
        (
            select dateadd( d, 7 - case when delta.N < 0 then 7 + delta.N else delta.N end, @lastday )
            from lastDeltaQ as delta
        )
    ,   excessdaysQ (N) as
        (
            select
                -- Compensate for extended days at start of daterange
                case
                    when datediff( d, firstMonday.N, @firstday ) > 5 then 5
                    else datediff( d, firstMonday.N, @firstday )
                end
                -- Compensate for extended days at end of daterange
              + case
                    when datediff( d, @lastday, lastMonday.N ) >= 3 then datediff( d, @lastday, lastMonday.N ) - 2
                    when datediff( d, @lastday, lastMonday.N ) != 0 then 1
                    else 0
                end
            from
                firstMondayQ as firstMonday
                cross join lastMondayQ as lastMonday
        )
        select
            -- Calculate total number of workdays!
            ( datediff( ww, firstMonday.N, lastMonday.N ) * 5 ) - excessdays.N + 1 as n
        from
            firstMondayQ as firstMonday
            cross join lastMondayQ as lastMonday
            cross join excessdaysQ as excessdays
    )
    ;
    go
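    Usage then looks like this (illustrative; dbo.TestDateRanges stands for a table of date ranges, as in the benchmark sketch above):

    select
        r.firstday
    ,   r.lastday
    ,   wd.n as WeekDays
    from
        dbo.TestDateRanges as r
        cross apply dbo.fnWeekDaysInDateRange_solo_inline( r.firstday, r.lastday ) as wd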

    Annoying smileys all over my code, why?

    BTW, I was mentioning seconds, but the last function does the job in about 0.8 seconds; the rest is moving the results over to the client (grid mode). Corrected for this time (a little over 3 seconds in my test cases), the inline function is about 6 times faster than your original scalar function and 12 times faster than my own scalar original.

    The overhead of scalar functions is, as you can see, monstrous!

  • Two comments -

    Steve - interesting model of the solar system you have posted in the article, haven't seen that one before 😀

    Everyone else (except Darth & Ninja, you guys are lost already) - Jeff's right, if you hit a wrong note on the piano you have to just play it off like it's a right one. Not learning a more effective way to do something because it's harder or takes longer just means that you're never going to be as good as you could be. That said, we all have to play the instruments that we have in our repertoire, so if you have to get something done now and don't have the luxury of learning something new, get it done.

    Then rebuild it using the new way, and learn what the differences are.

    ---------------------------------------------------------
    How best to post your question
    How to post performance problems
    Tally Table: What it is and how it replaces a loop

    "stewsterl 80804 (10/16/2009)I guess when you stop and try to understand the solution provided you not only learn, but save yourself some headaches when you need to make any slight changes."

    I guess I object to cursors because of their awkward looking syntax more than anything, although their performance, compared side by side with equivalent rewritten set-based solutions or even conventional procedural loops (e.g. a while loop over @rownum), is also bad.
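    By a conventional procedural loop I mean something along these lines (a sketch; dbo.SomeTable and rowID are illustrative names):

    declare @rownum int
    declare @maxrow int
    select @rownum = 1, @maxrow = max( rowID ) from dbo.SomeTable

    while @rownum <= @maxrow
    begin
        -- process the single row where rowID = @rownum here
        set @rownum = @rownum + 1
    end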

    Databases quickly become complex beasts, no matter how simple the requirements or which product is used (SQL Server, Oracle, MySQL, etc.). It helps to make a solid design at the beginning a worthy goal, because later on, god help us when we have to do a join on a billion rows and the query optimizer decides to do a full table scan.


    Gaby________________________________________________________________"In theory, theory and practice are the same. In practice, they are not." - Albert Einstein

    Here's a scenario where I had to use both procedural and set-based solutions (although I did avoid cursors). I get a request from an analyst every week or two to do an update on at least 40,000 rows. It's a simple enough query where one of the DateChanged columns is set to the current date. That change fires a trigger (which updates five other tables). I typically insert into a temp table via a whole bunch of inserts or bcp. The table looks like:

    create table update_table
    (
        batchnum int default cast( ( rand() * 1000 ) as int ) % 5,
        rowval varchar(13)
    )

    This gives five batches of more or less equal size, with batch numbers of 0-4. The reason for batches is that this query runs on a live production server, and I would like to be able to roll back anything that may fail without having to backtrack across all 40,000+ rows. I then run a short while loop like so:

    declare @ctr int
    select @ctr = 0

    while @ctr < 5
    begin
        update MasterDataTable
        set DateUpdatedColumn = getdate()
        where rowID in ( select rowID from update_table where batchnum = @ctr )

        select @ctr = @ctr + 1
    end

    I think this is a perfectly acceptable combination of procedural and set-based solutions. I don't want to run the update all at once, for safety and perhaps audit reasons, but I still want to do it in chunks large enough not to slow down the server. I guess this could be extrapolated to billions of rows as well. You would have to decide on an acceptable batch size (in my case between 9 and 10 thousand rows for this server).
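    As a sketch of that extrapolation, the same idea with a configurable batch size could look like the following (assumes SQL 2005+ for TOP with a variable; table and column names follow the loop above):

    declare @batchsize int
    set @batchsize = 10000

    while exists ( select 1 from update_table )
    begin
        begin transaction

        update MasterDataTable
        set DateUpdatedColumn = getdate()
        where rowID in ( select top ( @batchsize ) rowID from update_table order by rowID )

        -- remove the processed rows so the loop terminates
        delete from update_table
        where rowID in ( select top ( @batchsize ) rowID from update_table order by rowID )

        commit transaction
    end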

    Gaby________________________________________________________________"In theory, theory and practice are the same. In practice, they are not." - Albert Einstein

  • I definitely agree that we are often hitting the limits of set-based processing. A lot of the tips and tricks that fly around here often have to do with "when to cheat" and NOT use "true" Set-Based processing.

    We could get quite a bit further along if we could simply expand support to sets other than just naive sets. Ordered sets, remembered sets, etc. all have their own mathematical theory backing them up, and could vastly grow the expressiveness limit of the sets we apply today (and in critical areas naive set theory is lacking).

    Of course - that will require an entirely new optimizer/execution engine to get us there.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

    On hints, I have to agree with the article that states that they are part of procedural programming and should not be considered set-based processing. The difference here is the state of the environment. In "optimal" set-based solutions, the data in the set either qualifies for the operation or it does not. This lends set-based solutions to being favored when the environment (in this case the database, including both DDL and DML) is in a state of flux.

    Procedural operations, on the other hand, consist of step-by-step operations that often rely on the fact that the previous operation was successful, with each step making assumptions about the conditions that exist at the time of execution. This is largely why I disagree with Jeff: even though each row is touched once, and only once, a series of set-based statements still constitutes procedural programming. It is a good application of procedural programming, but it is procedural programming all the same.

    Hints merely tie into my argument about the "state of the environment". Hints are certainly too far into the details, and rely on the environment remaining constant in order to maintain their effectiveness. This does not make them bad. Their use simply has to be managed, and that effort should be weighed against their benefit.

    I've had countless arguments over set-based versus procedural approaches, both for and against, depending on the solution that was needed. Most of the time when I was arguing against set-based solutions, I ended up being wrong, wholly or partially. That's because in databases you will most likely have to deal with multiple records simultaneously, and procedural solutions typically aren't expansive enough to handle a variable number of records.

    Consider the one exception I've encountered - generating a variable number of PIN codes according to a proprietary algorithm. There are easily several set-based approaches, but they run exponentially slower than the procedural solution implemented, especially as the number of records increases.

  • Kit Brandner (12/22/2008)


    Consider the one exception I've encountered - generating a variable number of PIN codes according to a proprietary algorithm. There are easily several set-based approaches, but they run exponentially slower than the procedural solution implemented, especially as the number of records increases.

    Dang... too bad it's proprietary... I'd like to take a crack at that.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • And I'd like to see you try.

    I wonder how generating pin codes would be faster procedurally? Is it some randomness being introduced somehow?

  • Kit Brandner (12/22/2008)


    Consider the one exception I've encountered - generating a variable number of PIN codes according to a proprietary algorithm. There are easily several set-based approaches, but they run exponentially slower than the procedural solution implemented, especially as the number of records increases.

    That would definitely be interesting. I can't see how a set based approach would be slower. At worst, I would expect it to perform exactly the same, as if the optimizer just did a loop in the background and ran the function over and over.

    ---
    Timothy A Wiseman
    SQL Blog: http://timothyawiseman.wordpress.com/

    If you had to get some outside input for each one, that might not lend itself to a set based solution, especially on an older version of SQL Server. However, if it were rewritten in SQLCLR, perhaps not.

  • Steve Jones - Editor (12/22/2008)


    And I'd like to see you try.

    I wonder how generating pin codes would be faster procedurally? Is it some randomness being introduced somehow?

    Yeah, it uses the RAND() function. Nothing too special, but I found set-based methods would hang when trying to generate 1000+ records. Perhaps my conclusion was a bit short-sighted, but after spending A LOT of time with various set-based procedures, I wasn't able to get anything that even came close to the procedural alternative. The machine this database is on is only running S2K; if it were moved to another server with 2K5 on it, there would possibly be other options available to me (like SQLCLR), but since the application that uses it is in full swing, such a migration is unfortunately not a priority.
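    Worth noting, and possibly part of what bit my set based attempts: RAND() without a seed is evaluated only once per statement, so every row of a set based select gets the same value. A quick sketch against any numbers table shows it (dbo.Tally as used elsewhere in this thread):

    -- RAND() is evaluated once per statement: all five rows get the same value
    select top 5 rand() as SameEveryRow from dbo.Tally

    -- Seeding it per row works around that; the NEWID()/CHECKSUM trick below avoids RAND() altogether
    select top 5 rand( checksum( newid() ) ) as DifferentPerRow from dbo.Tally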

  • Steve Jones - Editor (12/22/2008)


    And I'd like to see you try.

    I wonder how generating pin codes would be faster procedurally? Is it some randomness being introduced somehow?

    Straight-up numeric PIN codes are a breeze... for example... here's some code to generate 1000 PINs... one type doesn't allow leading zeroes, the other does...

    SELECT ABS(CHECKSUM(NEWID()))%8999+1000, --No leading zeroes
           RIGHT('0000'+CAST(ABS(CHECKSUM(NEWID()))%9999+1 AS VARCHAR(4)),4) --Allows leading zeroes
      FROM dbo.Tally
     WHERE N <= 1000
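    If you don't already have a Tally table, a minimal one big enough for this example can be built something like this (a sketch, assuming SQL 2005+; see the Tally table article linked in the signatures for the full treatment):

    --Build a small Tally table (1 through 10000) - sketch only
    SELECT TOP 10000 IDENTITY(INT,1,1) AS N
      INTO dbo.Tally
      FROM master.sys.all_columns ac1
     CROSS JOIN master.sys.all_columns ac2

    ALTER TABLE dbo.Tally
      ADD CONSTRAINT PK_Tally_N PRIMARY KEY CLUSTERED (N)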

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
