The Set-Based Limit

  • I played in concerts back in high school... and if and when you hit a wrong note, you can't correct it without starting the whole thing over... quite similar to changing the code :).

  • Jeff,

    You'll be happy to know that your understanding of set based code and mine are 100% the same, going by what you posted in this thread. But still I cannot subscribe to your conclusions, and no doubt you will be interested to know why, as that might highlight something not often brought up! I see myself a little as a bridge person here, as I am a thoroughbred developer with over 20 years under my belt and am quite deep into SQL as well. Maybe not to the level of some of the DBAs here, but I do like SQL Server a lot and it offers amazing performance most of the time. That said, its implementation has a serious flaw, and just ignoring it because I am a "fan" and like SQL will not make it go away!

    So here goes... I will try to articulate the problem as best I can... I owe you that much (we all do)!

    The problem as I see it is SQL Server's ** pardon me ** horrible function support. As soon as you use functions (even deterministic scalar ones), pure set based solutions seem to break down performance wise. There is an unreasonable cost to keeping set based solutions readable when they involve even moderate amounts of "just beyond simple" logic that is best broken down into multiple lines of code to ease understanding and improve expressiveness.

    Lack of good function support is a problem because it prevents formulating repeating logic into human readable deterministic scalar functions that can be used in queries like any other logic encapsulated in built-in functions. The current way to keep performance up is to write the logic as complex expressions, much like a formula, that simulate the logic and can be used in set based operations without the performance hit functions take for no good reason.

    It would be preferable to rewrite code such as this

    select
        case
            when st.a > st.b then st.a
            else st.b
        end as BiggestNumberInRow
    from
        dbo.SomeTable as st
    where
        case
            when st.a > st.b then st.a
            else st.b
        end < 100

    into code such as

    select
        dbo.maxNum( st.a, st.b ) as BiggestNumberInRow
    from
        dbo.SomeTable as st
    where
        dbo.maxNum( st.a, st.b ) < 100

    Without having to incur a performance hit, as the second snippet of code is much easier to understand. It demonstrates clearly how much more readable SQL code becomes when you use functions to hide details that derail you from understanding what the query as a whole tries to accomplish. And as soon as you need something even a little beyond what the built-in functions offer, you will run into situations like this one.
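    For completeness, dbo.maxNum here would be a trivial deterministic scalar function along these lines (a sketch; its body is not part of the original example):

    create function dbo.maxNum( @a as int, @b as int ) returns int
    with schemabinding as
    begin
        return case when @a > @b then @a else @b end
    end
    go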

    Here is an example of the formula-like notation I recently used in a project.

    select
        count(1) as TotalConsumers
    ,   sum( sign( isnull( l.IdConsumer, 0 ) ) ) as ActivatedConsumers
    ,   sum( 1 - sign( isnull( l.IdConsumer, 0 ) ) ) as PendingConsumers
        -- l.IdConsumer is null
    ,   sum( ( 1 - sign( isnull( l.IdConsumer, 0 ) ) )
           * sign( isnull( l.EstateAgentMailSend, 0 ) ) ) as IncompleteConsumers
        -- l.IdConsumer is null and l.EstateAgentMailSend is not null
    ,   sum( sign( isnull( l.IdConsumer, 0 ) )
           * sign( isnull( l.EstateAgentMailSend, 0 ) ) ) as CompletedConsumers
        -- l.IdConsumer is not null and l.EstateAgentMailSend is not null
    from
        import.feedlog as l
    where
        l.idestateagent = 20

    The code touches each record only once, like proper set based code does. It uses formula-like expressions to avoid 'case when ... then ... else ... end' code. The intention of the logic, which I wrote as a comment after each expression, is totally lost if you just read the formulas. Making a modification in such code is a pain if you do not fully understand what is going on or have only moderate SQL knowledge.
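    To make the contrast concrete, here is what just the ActivatedConsumers and PendingConsumers aggregates might look like with explicit case expressions instead of the sign/isnull formulas (a sketch, not code from the project; it assumes IdConsumer is a positive key, just as the formulas do):

    select
        sum( case when l.IdConsumer is not null then 1 else 0 end ) as ActivatedConsumers
    ,   sum( case when l.IdConsumer is null then 1 else 0 end ) as PendingConsumers
    from
        import.feedlog as l
    where
        l.idestateagent = 20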

    BTW, this is actual code I wrote for a colleague who first used a query with a sub select for each aggregate... obviously touching every record multiple times. Seeing this, I think it is hard to uphold that set based code is by definition more elegant and easier to write and understand.

    Here is an older production case where I decided to use deterministic scalar functions to encapsulate some logic which, written as formulas, would be incomprehensible.

    -- Returns the day of week, with monday being day 0, sunday being day 6!
    --
    create function dbo.dayOfWeek( @day as datetime ) returns int
    with schemabinding as
    begin
        declare @delta as int
        set @delta = datediff( d, {d '2007-01-01'}, @day ) % 7
        return case when @delta < 0 then 7 + @delta else @delta end
    end
    go

    -- Returns the number of week days (mo..fr) in a date range!
    --
    create function dbo.fnWeekDaysInDateRange( @firstday as datetime, @lastday as datetime ) returns int
    with schemabinding as
    begin
        -- Widen date range to encompass full calendar weeks
        declare @firstMonday as datetime
        set @firstMonday = dateadd( d, 0 - dbo.dayOfWeek( @firstday ), @firstday )

        declare @lastMonday as datetime
        set @lastMonday = dateadd( d, 7 - dbo.dayOfWeek( @lastday ), @lastday )

        declare @excessdays as int
        declare @daysdiff as int

        -- Compensate for extended days at start of daterange
        set @daysdiff = datediff( d, @firstMonday, @firstday )
        set @excessdays = case when @daysdiff > 5 then 5 else @daysdiff end

        -- Compensate for extended days at end of daterange
        set @daysdiff = datediff( d, @lastday, @lastMonday )
        set @excessdays = @excessdays + case
            when @daysdiff >= 3 then @daysdiff - 2
            when @daysdiff != 0 then 1
            else 0 end

        -- Calculate total number of workdays!
        return ( datediff( ww, @firstMonday, @lastMonday ) * 5 ) - @excessdays + 1
    end
    go

    These functions are then used in some views and many queries. I show this code because the encapsulated logic cannot realistically be deployed without functions. These particular functions are deterministic, meaning that for any specific input the outcome will always be the same, and they do no IO at all. As such they should be evaluated at close to the same speed as built-in functions in an optimized system.
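    For example, a query using them would simply read like this (illustrative; dbo.Projects and its columns are made up for the example):

    select
        p.ProjectName
    ,   dbo.fnWeekDaysInDateRange( p.StartDate, p.EndDate ) as WeekDays
    from
        dbo.Projects as p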

    Unfortunately they do not perform as well as they could, due to the way functions are implemented in SQL Server... but they demonstrate a realistic need for better function support. You do not want to rewrite these functions into one big formula just to evade a SQL Server ** implementation ** performance limitation. It can be done, but I lack the time now to do so; the result, however, would be much faster, and you wouldn't understand how it does its work at all, no matter how many detailed comments I added to the code! Let alone be able to maintain it or spot a problem!

    I think I have made enough of a point to have a meaningful discussion about the role of functions in all this. The common perception is that functions are slow due to them being procedural, but I tell you... built-in functions are also procedural. And many SQL tricks deployed to make fast set based code rely heavily on inefficient use of relatively fast built-in functions. This IMHO makes the claim that set based code wins because it contains no procedural parts... a shaky claim at best, in many of the cases where it is victorious over other coding.

    There is more to this than meets the eye... it is benchmarking, after all. I have seen many arguments where it is claimed that on an old server this solution beats that other solution. Seldom do I see the counter argument that benchmarking on older hardware is a bad way to test how things will run on newer hardware. Characteristics of servers change over generations: memory is dirt cheap and plentiful these days, and addressing large amounts of it is easy with 64 bit systems too. Disks are about to undergo a revolution with the advance of solid state disks... current server CPUs are incredibly fast and their power often still goes underutilized. Testing on hardware from the last generation is shaky; testing on hardware two or more generations old, or on a laptop, is just fooling oneself.

    Someone on this forum made a sharpish comment a while back that I believe deserves more attention. It was about how servers under load respond differently than when testing on a nearly dedicated server, such as a development server. Proper set based code, following the definition Jeff mentioned, will work best across different server loads. This is because each set based operation can be sort of 'streamed', and resource use in general is low and can be throttled by the server. The worst that can happen, when the query plan is good, is that temporary results need to be stored in tempdb.

    Which brings us to the optimizer... complex set based code relies heavily on the optimizer to perform well. If it goes tits up, prepare for performance far worse than you can imagine based on normal experience. You need to do serious testing to make sure it works as efficiently as you expect... check the query plan, IO statistics and time statistics when evaluating several solutions. Really complex operations are often best separated into several set based steps. The optimizer is pretty good but not almighty, and it will let you down at some point if you rely on it blindly. As an added advantage, you can put in knowledge about the job/data that an optimizer simply cannot extract from the statement or from statistics (or it simply gives up trying too soon).

    It would be nice if Microsoft would give us more tools to help us optimize, like some control to simulate the conditions a query runs under (memory, locks, optimizer aggressiveness). It would make tuning solutions a lot easier and also let us explore how far the optimizer can be pushed, to extract the most out of it.

    I have written too much already, I believe... shoot!

  • To Jeff mostly,

    To bolster my claim about function overhead being the great killer in this debate... I read your article about the function you wrote for determining the number of workdays in a date range. I read it AFTER my previous post and did some quick benchmarks.

    I used two functions:

    1. One for calculating the day of the week in a deterministic way (no side effects)

    2. One doing the actual work.

    Your solution used one function that did everything on its own. Guess what... your solution was faster on my system (roughly 9 vs 15 seconds on 100,000 date ranges).

    Then I switched to one function as well, by duplicating the work from my first function into the calling function instead of making the two calls. This function then became faster still (roughly 7 vs 9 seconds on 100,000 date ranges).

    Thus for performing the exact same work, my original solution went from 15 to 7 seconds!!! The time difference is totally attributable to the function call overhead! This IMHO proves the whole point about functions being poorly implemented in SQL Server, and showcases exactly why in many instances set based code is unreasonably much faster than procedural code.
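    For anyone wanting to reproduce this kind of measurement, a harness of roughly this shape will do (a sketch, not my original test code; dbo.TestDateRanges stands for a hypothetical table with 100,000 firstday/lastday pairs, and the max() keeps client result transfer out of the timing):

    set statistics time on

    select max( dbo.fnWeekDaysInDateRange( r.firstday, r.lastday ) ) as WeekDays
    from dbo.TestDateRanges as r

    set statistics time off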

    As I said before, I could rewrite the code to fit on one line and do it all, and it would be even faster (maybe I will, just to stun you all)... but would anyone still comprehend it?

    New code:

    create function dbo.fnWeekDaysInDateRange_solo( @firstday as datetime, @lastday as datetime ) returns int
    with schemabinding as
    begin
        declare @delta as int
        declare @excessdays as int
        declare @daysdiff as int
        declare @firstMonday as datetime
        declare @lastMonday as datetime

        -- Widen date range to encompass full calendar weeks
        set @delta = datediff( d, {d '2007-01-01'}, @firstday ) % 7
        set @firstMonday = dateadd( d, 0 - case when @delta < 0 then 7 + @delta else @delta end, @firstday )
        set @delta = datediff( d, {d '2007-01-01'}, @lastday ) % 7
        set @lastMonday = dateadd( d, 7 - case when @delta < 0 then 7 + @delta else @delta end, @lastday )

        -- Compensate for extended days at start of daterange
        set @daysdiff = datediff( d, @firstMonday, @firstday )
        set @excessdays = case when @daysdiff > 5 then 5 else @daysdiff end

        -- Compensate for extended days at end of daterange
        set @daysdiff = datediff( d, @lastday, @lastMonday )
        set @excessdays = @excessdays + case when @daysdiff >= 3 then @daysdiff - 2 when @daysdiff != 0 then 1 else 0 end

        -- Calculate total number of workdays!
        return ( datediff( ww, @firstMonday, @lastMonday ) * 5 ) - @excessdays + 1
    end
    go

    Inline table valued functions fare better, until they overwhelm the optimizer's capacity of course... this version does the same job in 3-4 seconds and needs to be used in conjunction with cross apply (see the usage sketch after the function).

    create function dbo.fnWeekDaysInDateRange_solo_inline( @firstday as datetime, @lastday as datetime ) returns table
    as
    return
    (
        with
        -- Widen date range to encompass full calendar weeks - to first monday
        firstDeltaQ (N) as ( select datediff( d, {d '2007-01-01'}, @firstday ) % 7 )
    ,   firstMondayQ (N) as
        (
            select dateadd( d, 0 - case when delta.N < 0 then 7 + delta.N else delta.N end, @firstday )
            from firstDeltaQ as delta
        )
        -- Widen date range to encompass full calendar weeks - to next monday
    ,   lastDeltaQ (N) as ( select datediff( d, {d '2007-01-01'}, @lastday ) % 7 )
    ,   lastMondayQ (N) as
        (
            select dateadd( d, 7 - case when delta.N < 0 then 7 + delta.N else delta.N end, @lastday )
            from lastDeltaQ as delta
        )
    ,   excessdaysQ (N) as
        (
            select
                -- Compensate for extended days at start of daterange
                case
                    when datediff( d, firstMonday.N, @firstday ) > 5 then 5
                    else datediff( d, firstMonday.N, @firstday )
                end
                -- Compensate for extended days at end of daterange
              + case
                    when datediff( d, @lastday, lastMonday.N ) >= 3 then datediff( d, @lastday, lastMonday.N ) - 2
                    when datediff( d, @lastday, lastMonday.N ) != 0 then 1
                    else 0
                end
            from
                firstMondayQ as firstMonday
                cross join lastMondayQ as lastMonday
        )
        select
            -- Calculate total number of workdays!
            ( datediff( ww, firstMonday.N, lastMonday.N ) * 5 ) - excessdays.N + 1 as n
        from
            firstMondayQ as firstMonday
            cross join lastMondayQ as lastMonday
            cross join excessdaysQ as excessdays
    )
    ;
    go
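    Usage then looks like this (illustrative; dbo.TestDateRanges stands for a table of date ranges, as in the benchmark sketch above):

    select
        r.firstday
    ,   r.lastday
    ,   wd.n as WeekDays
    from
        dbo.TestDateRanges as r
        cross apply dbo.fnWeekDaysInDateRange_solo_inline( r.firstday, r.lastday ) as wd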

    Annoying smileys all over my code, why?

    BTW, I was mentioning seconds, but the last function does the job in about 0.8 seconds; the rest is moving the results over to the client (grid mode). Corrected for this time (a little over 3 seconds in my test cases), the inline function is about 6 times faster than your original scalar function and 12 times faster than my own scalar original.

    The overhead of scalar functions is, as you can see, monstrous!

  • Two comments -

    Steve - interesting model of the solar system you have posted in the article, haven't seen that one before 😀

    Everyone else (except Darth & Ninja, you guys are lost already) - Jeff's right, if you hit a wrong note on the piano you have to just play it off like it's a right one. Not learning a more effective way to do something because it's harder or takes longer just means that you're never going to be as good as you could be. That said, we all have to play the instruments that we have in our repertoire, so if you have to get something done now and don't have the luxury of learning something new, get it done.

    Then rebuild it using the new way, and learn what the differences are.

    ---------------------------------------------------------
    How best to post your question
    How to post performance problems
    Tally Table: What it is and how it replaces a loop

    "stewsterl 80804 (10/16/2009)I guess when you stop and try to understand the solution provided you not only learn, but save yourself some headaches when you need to make any slight changes."

    I guess I object to cursors because of their awkward looking syntax more than anything, although their performance, compared side by side with equivalent rewritten set-based solutions or even conventional procedural loops (e.g. a while loop over @rownum), is also bad.
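    By a conventional procedural loop I mean something along these lines (a sketch; dbo.SomeTable and rowID are illustrative names):

    declare @rownum int
    declare @maxrow int
    select @rownum = 1, @maxrow = max( rowID ) from dbo.SomeTable

    while @rownum <= @maxrow
    begin
        -- process the single row where rowID = @rownum here
        set @rownum = @rownum + 1
    end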

    Databases quickly become complex beasts, no matter how simple the requirements or which product is used (SQL Server, Oracle, MySQL, etc.). It helps to make a solid design at the beginning a worthy goal, because later on, god help us when we have to do a join on a billion rows and the query optimizer decides to do a full table scan.


    Gaby________________________________________________________________"In theory, theory and practice are the same. In practice, they are not." - Albert Einstein

    Here's a scenario where I had to use both procedural and set-based solutions (although I did avoid cursors). I get a request from an analyst every week or two to do an update on at least 40,000 rows. It's a simple enough query where one of the DateChanged columns is set to the current date. That change fires a trigger (which updates five other tables). I typically insert into a temp table via a whole bunch of inserts or bcp. The table looks like:

    create table update_table
    (
        batchnum int default cast( ( rand() * 1000 ) as int ) % 5,
        rowval varchar(13)
    )

    This gives five batches of more or less equal size, with batch numbers of 0-4. The reason for batches is that this query runs on a live production server, and I would like to be able to roll back anything that may fail without having to backtrack across all 40,000+ rows. I then run a short while loop like so:

    declare @ctr int
    select @ctr = 0

    while @ctr < 5
    begin
        update MasterDataTable
        set DateUpdatedColumn = getdate()
        where rowID in ( select rowID from update_table where batchnum = @ctr )

        select @ctr = @ctr + 1
    end

    I think this is a perfectly acceptable combination of procedural and set-based solutions. I don't want to run the update all at once, for safety and perhaps audit reasons, but I still want to do it in chunks large enough not to slow down the server. I guess this could be extrapolated to billions of rows as well. You would have to decide on an acceptable batch size (in my case between 9 and 10 thousand rows for this server).
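    As a sketch of that extrapolation, the same idea with a configurable batch size could look like the following (assumes SQL 2005+ for TOP with a variable; table and column names follow the loop above):

    declare @batchsize int
    set @batchsize = 10000

    while exists ( select 1 from update_table )
    begin
        begin transaction

        update MasterDataTable
        set DateUpdatedColumn = getdate()
        where rowID in ( select top ( @batchsize ) rowID from update_table order by rowID )

        -- remove the processed rows so the loop terminates
        delete from update_table
        where rowID in ( select top ( @batchsize ) rowID from update_table order by rowID )

        commit transaction
    end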

    Gaby________________________________________________________________"In theory, theory and practice are the same. In practice, they are not." - Albert Einstein

  • I definitely agree that we are often hitting the limits of set-based processing. A lot of the tips and tricks that fly around here often have to do with "when to cheat" and NOT use "true" Set-Based processing.

    We could get quite a bit further along if we could simply expand support to sets other than just naive sets. Ordered sets, remembered sets, etc. all have their own mathematical theory backing them up, and could vastly grow the expressiveness limit of the sets we apply today (and in critical areas naive set theory is lacking).

    Of course - that will require an entirely new optimizer/execution engine to get us there.

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

    On hints, I have to agree with the article that states that they are part of procedural programming and should not be considered set-based processing. The difference here is the state of the environment. In "optimal" set-based solutions, the data in the set either qualifies for the operation or it does not. This lends set-based solutions to being favored when the environment (in this case the database, including both DDL and DML) is in a state of flux.

    Procedural operations, on the other hand, consist of step-by-step operations that often rely on the fact that the previous operation was successful, with each step making assumptions about the conditions that exist at the time of execution. This is largely why I disagree with Jeff: even though each row is touched once, and only once, a series of set-based statements still constitutes procedural programming. It is a good application of procedural programming, but it is procedural programming all the same.

    Hints merely tie into my argument about the "state of the environment". Hints are certainly too far into the details, and rely on the environment remaining constant in order to maintain their effectiveness. This does not make them bad. Their use simply has to be managed, and that effort should be weighed against their benefit.

    I've had countless arguments over set-based versus procedural approaches, both for and against, depending on the solution that was needed. Most of the time when I was arguing against set-based solutions, I ended up being wrong, wholly or partially. That's because in databases you will most likely have to deal with multiple records simultaneously, and procedural solutions typically aren't expansive enough to handle a variable number of records.

    Consider the one exception I've encountered - generating a variable number of PIN codes according to a proprietary algorithm. There are easily several set-based approaches, but they run exponentially slower than the procedural solution implemented, especially as the number of records increases.

  • Kit Brandner (12/22/2008)


    Consider the one exception I've encountered - generating a variable number of PIN codes according to a proprietary algorithm. There are easily several set-based approaches, but they run exponentially slower than the procedural solution implemented, especially as the number of records increases.

    Dang... too bad it's proprietary... I'd like to take a crack at that.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • And I'd like to see you try.

    I wonder how generating pin codes would be faster procedurally? Is it some randomness being introduced somehow?

  • Kit Brandner (12/22/2008)


    Consider the one exception I've encountered - generating a variable number of PIN codes according to a proprietary algorithm. There are easily several set-based approaches, but they run exponentially slower than the procedural solution implemented, especially as the number of records increases.

    That would definitely be interesting. I can't see how a set based approach would be slower. At worst, I would expect it to perform exactly the same, as if the optimizer just did a loop in the background and ran the function over and over.

    ---
    Timothy A Wiseman
    SQL Blog: http://timothyawiseman.wordpress.com/

    If you had to get some outside input for each one, that might not lend itself to a set based solution, especially on an older version of SQL Server. However, if it were rewritten in SQLCLR, perhaps not.

  • Steve Jones - Editor (12/22/2008)


    And I'd like to see you try.

    I wonder how generating pin codes would be faster procedurally? Is it some randomness being introduced somehow?

    Yeah, it uses the RAND() function. Nothing too special, but I found set-based methods would hang when trying to generate 1000+ records. Perhaps my conclusion was a bit short-sighted, but after spending A LOT of time with various set-based procedures, I wasn't able to get anything that even came close to the procedural alternative. The machine this database is on is only running S2K; if it were moved to another server with 2K5 on it, there would possibly be other options available to me (like SQLCLR), but since the application that uses it is in full swing, such a migration is unfortunately not a priority.
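    Worth noting, and possibly part of what bit my set based attempts: RAND() without a seed is evaluated only once per statement, so every row of a set based select gets the same value. A quick sketch against any numbers table shows it (dbo.Tally as used elsewhere in this thread):

    -- RAND() is evaluated once per statement: all five rows get the same value
    select top 5 rand() as SameEveryRow from dbo.Tally

    -- Seeding it per row works around that; the NEWID()/CHECKSUM trick below avoids RAND() altogether
    select top 5 rand( checksum( newid() ) ) as DifferentPerRow from dbo.Tally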

  • Steve Jones - Editor (12/22/2008)


    And I'd like to see you try.

    I wonder how generating pin codes would be faster procedurally? Is it some randomness being introduced somehow?

    Straight-up numeric PIN codes are a breeze... for example... here's some code to generate 1000 PINs... one type doesn't allow leading zeroes, the other does...

    SELECT ABS(CHECKSUM(NEWID()))%8999+1000, --No leading zeroes
           RIGHT('0000'+CAST(ABS(CHECKSUM(NEWID()))%9999+1 AS VARCHAR(4)),4) --Allows leading zeroes
      FROM dbo.Tally
     WHERE N <= 1000
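    If you don't already have a Tally table, a minimal one big enough for this example can be built something like this (a sketch, assuming SQL 2005+; see the Tally table article linked in the signatures for the full treatment):

    --Build a small Tally table (1 through 10000) - sketch only
    SELECT TOP 10000 IDENTITY(INT,1,1) AS N
      INTO dbo.Tally
      FROM master.sys.all_columns ac1
     CROSS JOIN master.sys.all_columns ac2

    ALTER TABLE dbo.Tally
      ADD CONSTRAINT PK_Tally_N PRIMARY KEY CLUSTERED (N)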

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
