How to avoid using scalar User Defined Functions?

  • Following the horrendous performance of Scalar UDFs demonstrated in this thread

    http://www.sqlservercentral.com/Forums/Topic1719045-9-6.aspx#bm1723248

    (Scalar UDF 10x slower than writing the code longhand in the SELECT) I now worry that their widespread usage in my code needs to be refactored 🙁

    What's the best way to go about this?

    Here's an example, but I guess pretty much any single-line non-table-valued scalar-function would do

    CREATE FUNCTION dbo.FNSafeString
    (
        @strValue varchar(8000), -- STRING Value to convert to SAFE string
        @intMode  int = 0        -- 0=NULL as '[NULL]', 1=NULL as ''
    )
    RETURNS varchar(8000)
    WITH SCHEMABINDING
    AS
    BEGIN
        RETURN COALESCE(@strValue,
                        CASE WHEN @intMode = 0
                             THEN '[NULL]'
                             ELSE ''
                        END
                       )
    END

    It's a little bit too big to really want to write it out in each place where I need it, it's just big enough to be worth centralising, and it may be that I could code it more efficiently / differently / handle more "edge cases" in the future.

    I do have even more straightforward UDFs - just to avoid typing something "Lengthy"

    CREATE FUNCTION dbo.FNDateMidnight
    (
        @dtDateTime datetime -- Date/Time Value to adjust
    )
    RETURNS datetime
    WITH SCHEMABINDING
    AS
    BEGIN
        RETURN DATEADD(Day, DATEDIFF(Day, 0, @dtDateTime)+1, 0)
    END

    Perhaps I should consider changing them to inline table-valued functions, or CROSS APPLY them? But I can't think of a way to centralise them and re-use the code WITHOUT a function, can I?

    All I am really after is something like a preprocessor directive.
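
    For illustration, I guess the iTVF version of FNSafeString would look something like this - the table and column names (dbo.SomeTable, SomeColumn) are made up for the example:

    ```sql
    -- Inline table-valued version of the scalar function above (name is illustrative)
    CREATE FUNCTION dbo.ITVF_SafeString
    (
        @strValue varchar(8000),
        @intMode  int
    )
    RETURNS TABLE
    WITH SCHEMABINDING
    AS
    RETURN
        SELECT COALESCE(@strValue,
                        CASE WHEN @intMode = 0 THEN '[NULL]' ELSE '' END) AS SafeString;
    GO

    -- The call site changes from a scalar expression to a CROSS APPLY:
    SELECT f.SafeString
    FROM dbo.SomeTable AS t
    CROSS APPLY dbo.ITVF_SafeString(t.SomeColumn, 0) AS f;
    ```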

  • Kristen,

    Yes, an inline table-valued function would be your desired usage for code reuse. However, as you pointed out, not having the function at all is even better - but that comes with a maintenance nightmare of its own if the logic needs to be changed. Furthermore, it's not as simple as just changing the SF to an ITVF - you need to change the way that the query calls it.

    Check out the performance test that I did (similar to what you did in the linked thread) at my blog.

    Wayne
    Microsoft Certified Master: SQL Server 2008
    Author - SQL Server T-SQL Recipes


    If you can't explain to another person how the code that you're copying from the internet works, then DON'T USE IT on a production system! After all, you will be the one supporting it!
    Links:
    For better assistance in answering your questions
    Performance Problems
    Common date/time routines
    Understanding and Using APPLY Part 1 & Part 2

  • Finding everything that calls your SF is going to be difficult. You can use sys.sql_modules for everything in the database, but if you have any code outside the database, that's where you're going to run into problems. If you have a centralized database of functions, that's where the real pain begins.
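
    For the in-database search, something along these lines should work - it's a plain LIKE match against the module text, so expect the odd false positive from comments:

    ```sql
    -- List every module whose definition mentions the function (false positives possible)
    SELECT OBJECT_SCHEMA_NAME(m.object_id) AS SchemaName,
           OBJECT_NAME(m.object_id)        AS ModuleName
    FROM sys.sql_modules AS m
    WHERE m.definition LIKE '%FNSafeString%';
    ```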

    You can probably change the function itself easily enough, but like Wayne said, you have to change the way it's called. Refactoring could be a relatively simple or painful process, depending on scope and from where it's called.

    Write the simpler stuff in-line in the select; write the data-accessing stuff as an inline table-valued function. Yes, you're going to duplicate code. Your choice: duplicate code or poor performance.

    Scalar functions are one of the first things I look at when I'm doing client system audits, removing them often gives a huge performance improvement, though I tend to focus more on data-accessing ones.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Thanks all, not what I wanted to hear! Magic Bullet would have been much more palatable 🙂

    Some of our functions (like "round date to midnight") are probably only used to adjust a MAX RANGE @Parameter which is then used in a query. I can live with that.

    Some of the "Safe String" functions - to allow concatenation without a NULL being propagated - are used often, but mostly in error messages (which are rarely executed), e.g. where we deliberately want to concatenate "Error" + dbo.FNInt2StringSafe(@MayBeNullNumeric) + " occurred on " + dbo.FNDate2StringSafe(@MayBeNullDate). We wouldn't normally do that in a SELECT (the application end usually deals with formatting), so it MIGHT not be as bad as I first thought ...

    We also have a uniform name for our UDFs, "XX_FN_", where "XX" is one of only 2 or 3 choices, and we always have "_FN_". All our code is in external files, so setting a programmer's editor loose searching the source code files will help determine how bad the problem is.

    It seems a bit nuts to use a Table Valued Function to get a simple string manipulation!

    Any chance of a pre-processor directive in future SQL versions do you think? That would help with my code centralisation whilst giving me the same performance as inline coding.

    Perhaps **I** should write one? I wonder if:

    I have a function called "XX_PRE_FN_SomeName". The "PRE" indicates that it is a preprocessor directive. I will create the actual scalar function, so the code will work as-is, but I can use some fancy Find & Replace to actually replace the function call with some inline-code.

    dbo.XX_PRE_FN_StringOrBlank(@MightBeNullString)

    would be easy to globally replace with

    ISNULL(@MightBeNullString, '')

    I could do that using sys.sql_modules or on my individual source code files immediately prior to producing a Release Script for QA testing (which would then go into Production). [We do other processing at that time - e.g. remove all whitespace and comments.]
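
    The sys.sql_modules side of that Find & Replace could be sketched like this - it only handles the exact literal call shown, so calls with different arguments would still need the editor pass:

    ```sql
    -- Preview the rewritten module definitions before touching any source files
    SELECT OBJECT_NAME(m.object_id) AS ModuleName,
           REPLACE(m.definition,
                   'dbo.XX_PRE_FN_StringOrBlank(@MightBeNullString)',
                   'ISNULL(@MightBeNullString, '''')') AS NewDefinition
    FROM sys.sql_modules AS m
    WHERE m.definition LIKE '%XX_PRE_FN_StringOrBlank%';
    ```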

    Perhaps a pre-processor utility already exists?

  • The real question is whether the UDFs are causing any performance problems. 10x sounds horrible, but in your system does that result in a query that runs 10 minutes vs. 1 minute or is it (more likely) a run-time difference measured in micro- or milliseconds?

    Are the functions used in the Where clause and thus may be executed against a high percentage of the source data, or are they in the Select clause and only applied to the final result set? Or - as you noted on one of them - are they only used once per query to prep a parameter value?

    I usually vote in favor of centralizing common functions for ease of maintenance - unless they cause a measurable performance issue, and then you may need to refactor them or replicate in-line code. But I wouldn't start refactoring everything just because a given UDF may be 10x slower than in-line code.

    I tend to avoid both scalar and table-valued functions, because they can be a bottleneck when used within a SELECT statement or tight loop (not that I code loops either). Where I tend to leverage scalar functions is within conditional branching statements of the batch itself, mainly to abstract the implementation details of some business logic or lookups that are commonly re-used across multiple procedures.

    For example:

    if dbo.IsCustomerActive( @CustomerID ) = 1
       and dbo.IsProductInStock( @ProductID ) = 1
    begin
        ...
    end
    else
        ...

    If you have queries with computed columns that are repeated across multiple stored procedures, applications, or reports, perhaps something like masking personal identifiers or fudging dates and times, then consider abstracting the underlying table with a view or add computed columns to the table. That way the computation is inline, but it's also reusable.

    Also, computed table columns can be persisted, which means the computation is performed when the row is inserted (and re-computed only when the referenced columns change), and the result can then be referenced anywhere else without repeating the expression or paying for re-computation.
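
    A minimal sketch of both ideas, using made-up table and column names (dbo.Customer, CustomerName):

    ```sql
    -- Option 1: a schema-bound view that keeps the expression inline but reusable
    CREATE VIEW dbo.vCustomerDisplay
    WITH SCHEMABINDING
    AS
    SELECT CustomerID,
           COALESCE(CustomerName, '[NULL]') AS SafeName
    FROM dbo.Customer;
    GO

    -- Option 2: a persisted computed column - evaluated on INSERT/UPDATE and stored
    ALTER TABLE dbo.Customer
        ADD SafeNamePersisted AS COALESCE(CustomerName, '[NULL]') PERSISTED;
    ```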

    "Do not seek to follow in the footsteps of the wise. Instead, seek what they sought." - Matsuo Basho

  • Kim Crosser (9/28/2015)


    The real question is whether the UDFs are causing any performance problems. 10x sounds horrible, but in your system does that result in a query that runs 10 minutes vs. 1 minute or is it (more likely) a run-time difference measured in micro- or milliseconds?

    Are the functions used in the Where clause and thus may be executed against a high percentage of the source data, or are they in the Select clause and only applied to the final result set? Or - as you noted on one of them - are they only used once per query to prep a parameter value?

    I usually vote in favor of centralizing common functions for ease of maintenance - unless they cause a measurable performance issue, and then you may need to refactor them or replicate in-line code. But I wouldn't start refactoring everything just because a given UDF may be 10x slower than in-line code.

    Just because it isn't a problem right now doesn't mean that it won't be a problem in the future. If Kristen starts refactoring the functions now, it can be a nice, unhurried process with enough time to test, instead of the nightmare of having to fix the performance issues as soon as possible. It will also prevent the problem from growing.

    Luis C.
    General Disclaimer:
    Are you seriously taking the advice and code from someone from the internet without testing it? Do you at least understand it? Or can it easily kill your server?

    How to post data/code on a forum to get the best help: Option 1 / Option 2
  • Luis Cazares (9/28/2015)

    Just because it isn't a problem right now doesn't mean that it won't be a problem in the future. If Kristen starts refactoring the functions now, it can be a nice, unhurried process with enough time to test, instead of the nightmare of having to fix the performance issues as soon as possible. It will also prevent the problem from growing.

    I didn't propose ignoring the UDFs - but we all have finite time available, and a mass refactoring of code to eliminate all UDFs is likely a giant waste of time.

    A function that is used once per query to prep a parameter can be optimized to infinity with virtually zero overall effect on a system.

    Further - refactoring working queries just because a function *could* be slower is an excellent opportunity to introduce new bugs.

    Identify the places where a UDF is applied a LOT - such as in a Where clause (shudder!) - and work on those. On the other hand, UDFs used to do some fancy formatting of an output result value aren't likely to have any significant impact on performance.

    Sometime - "good enough is"...

  • Kim Crosser (9/28/2015)


    On the other hand, UDFs used to do some fancy formatting of an output result value aren't likely to have any significant impact on performance.

    Let's not guess... Post a couple of such functions and let's find out.

    Sometime - "good enough is"...

    ...until someone uses it somewhere else and that IS not only the nature of functions, but their primary purpose.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (9/28/2015)


    Kim Crosser (9/28/2015)


    On the other hand, UDFs used to do some fancy formatting of an output result value aren't likely to have any significant impact on performance.

    Let's not guess... Post a couple of such functions and let's find out.

    I hope it doesn't use FORMAT inside the scalar function. 😛

  • Luis Cazares (9/28/2015)


    Just because it isn't a problem right now doesn't mean that it won't be a problem in the future. If Kristen starts refactoring the functions now, it can be a nice, unhurried process with enough time to test, instead of the nightmare of having to fix the performance issues as soon as possible. It will also prevent the problem from growing.

    Seriously, Kristen, getting back to your original post, Luis is right. Take a good, long look at them and determine if they can be translated into ITVFs. Yes, it's going to be an arduous task to find them, fix them and then fix everything that calls them. But like Luis said, the task will be a whole lot better now than having to do it when people are screaming about performance problems in production.

  • Ed Wagner (9/28/2015)


    Jeff Moden (9/28/2015)


    Kim Crosser (9/28/2015)


    On the other hand, UDFs used to do some fancy formatting of an output result value aren't likely to have any significant impact on performance.

    Let's not guess... Post a couple of such functions and let's find out.

    I hope it doesn't use FORMAT inside the scalar function. 😛

    Now THAT would be funny!

    --Jeff Moden



  • All I am really after is something like a preprocessor directive.

    A preprocessor would be awesome! The one thing I'd want, aside from a decent feature set, is for it to be mainstream enough to find common usage in many SQL shops; otherwise it could be an odd tool that saddles an organization with code that a replacement developer couldn't quickly get up to speed on.

    I used the M4 macro processor for a while to manipulate some boilerplate code, but while I like the idea of a macro processor, the rather severe downside (in my opinion) is that it's not a skill set often associated with T-SQL, and I soon felt I could be burdening the company with odd-looking code, so I ditched it.

    If the SQL community could rally around a preprocessor that afforded us some more modern coding techniques, maybe even just for inlining duplicate code, that would be pretty darn useful!

    I think developers could get up to speed with this sort of thing, but about T-SQL folks I'm not so sure.

  • Jeff Moden (9/28/2015)


    Kim Crosser (9/28/2015)


    On the other hand, UDFs used to do some fancy formatting of an output result value aren't likely to have any significant impact on performance.

    Let's not guess... Post a couple of such functions and let's find out.

    Sometime - "good enough is"...

    ...until someone uses it somewhere else and that IS not only the nature of functions, but their primary purpose.

    Example where I would use a UDF for formatting - Phone numbers.

    Given a table where the phone numbers are stored as:

    ...
    CC   int,                 -- Country Code
    NPA  int,                 -- Numbering plan area/region code
    NXX  varchar(9),          -- Exchange code
    SNUM varchar(9) not null, -- Subscriber number
    EXT  varchar(9),          -- Extension
    ...

    (Yes - they could be INT values, except that a lot of people and companies like to use the keyboard alpha strings instead.)

    Customer wants phone numbers to display in results as:

    if CC is not null and CC <> 1 ("US"), display as "+CC"
    if NPA is not null, display as "(NPA)"
    if NXX is not null, display as "NXX-"
    display SNUM
    if EXT is not null, display as " xEXT"

    Yes - you can write an inline Isnull/Coalesce/Case expression to format this, and repeat it in every column in every query that accesses the data, or you can write one simple UDF.

    So - which is easier to maintain?

    select
        ...
        case isnull(CC,1) when 1 then '' else '+' + convert(varchar(5),CC) + ' ' end
        + coalesce('(' + convert(varchar(5),NPA) + ') ','')
        + coalesce(NXX + '-','')
        + coalesce(SNUM,'') -- yeah, I know it is a not null column, but I am paranoid...
        + coalesce(' x' + EXT,'') as nicenum
        ...

    or

    select
        ...
        dbo.myUDFNicePhone(CC,NPA,NXX,SNUM,EXT) as nicenum

    I loaded a table with 800,000 records (with different field values) and queried it both ways multiple times. The function version took 3.4-4.5 seconds to process the 800,000 records, while the inline expression took 1.1-1.8 seconds.

    Thus, the function was 1.6-3.4 seconds slower over 800,000 records - an average of at most about 4.25 microseconds per record.

    In real life, I have had to implement functions like this to handle multiple foreign telephone formats, where some country formats have dashes and some just spaces, and some have spaces at fixed intervals, while others can vary.

    A similar issue arises with Postal (Zip) codes and formatting of address lines, which vary in interesting ways in different countries. Yes, you could write a big complex Case/Coalesce expression and then copy/paste it wherever you wanted to output these values, or you can write (and fully debug) one UDF and use it where needed. I know which of those customer systems I would rather support when some country decides to change its postal code format (like when the US went to Zip+4).

    Code should be efficient and clean, but IMO replicating complex expressions in multiple locations should only be done when necessary - not just because you can squeeze a few more microseconds out of a query that is already performing well.
