How to Make Scalar UDFs Run Faster (SQL Spackle)

  • milldyce (8/1/2012)


    A bit of a tangent, but I remember a number of queries for which response times exploded when trying to view the Execution Plan.

    Requesting an actual execution plan causes SQL Server to insert extra (invisible) operators into the executable plan to measure things like actual row counts, rebinds, rewinds...and so on. The output from all this data collection is converted into XML at the end of execution and returned to the client. Understandably, all these extra operators and activity generate significant overhead.
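
    A quick way to see that overhead for yourself is to time the same query with and without an actual plan requested. This is only a rough sketch, and the table name is a placeholder:

    SET NOCOUNT ON;
    DECLARE @StartTime DATETIME;

    --===== Plain run: no plan collection
    SELECT @StartTime = GETDATE();
    SELECT COUNT(*) FROM dbo.SomeLargeTable;    -- placeholder table name
    PRINT DATEDIFF(ms, @StartTime, GETDATE());

    --===== Same query with the actual plan requested (adds the profiling operators)
    SET STATISTICS XML ON;
    SELECT @StartTime = GETDATE();
    SELECT COUNT(*) FROM dbo.SomeLargeTable;
    SET STATISTICS XML OFF;
    PRINT DATEDIFF(ms, @StartTime, GETDATE());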

  • Jeff Moden (9/23/2012)


    SQL Kiwi (9/23/2012)


    The C# code will indeed be compiled down to just one machine-code instruction (which may take less than one CPU cycle to execute)...

    Are you sure, Paul? IIRC, even a simple JMP instruction takes at least 2 clock cycles.

    I was referring to the fact that modern processors employ tricks like prefetch, pipelining, branch prediction, speculative execution, out-of-order execution...and so on...meaning that multiple instructions can be retired per clock. A simple instruction that multiplies by two (which may well be optimized to a left shift as Tom mentions) may be timed at 1 clock but might well take less than that. e.g. http://stackoverflow.com/questions/692718/how-to-find-cpu-cycle-for-an-assembly-instruction

  • L' Eomot Inversé (9/23/2012)


    I think you misunderstood me Paul - my fault for not expressing myself more clearly.

    rragno seemed to me to be talking about two distinct things - a chunk of pure C# measured on its own (his first paragraph) and a CLR function embedded into SQL (his second paragraph). I was concerned only with his second paragraph. When an attempt is made to do the whole thing in C# (or Fortran, or any other procedural language) there's a risk that the optimiser notices that the first one million minus one function calls are irrelevant, so it leaves them out. This doesn't happen with SQL, currently (but it might in the future), so generally there's an attempt to prevent the optimiser from doing this to avoid meaningless comparisons. My "one multiplication" was for the whole process, which needed only one multiplication (probably actually a left shift, depending on the target machine, as it's a multiplication by 2) instead of a million multiplications. If that sort of optimisation was happening, we would be comparing a forest with a single tree.

    Oh I see! Yes, sorry, I did misread your question. You're right that this sort of optimization doesn't happen now, and I doubt it ever will - unless SQL Server's interpreted T-SQL language is completely replaced by .NET (unlikely, perhaps sadly).

  • Right. Working back through the comments completed 🙂

    Thank you Jeff for so many mentions in your fine article!

  • SQL Kiwi (9/23/2012)


    Jeff Moden (9/23/2012)


    SQL Kiwi (9/23/2012)


    The C# code will indeed be compiled down to just one machine-code instruction (which may take less than one CPU cycle to execute)...

    Are you sure, Paul? IIRC, even a simple JMP instruction takes at least 2 clock cycles.

    I was referring to the fact that modern processors employ tricks like prefetch, pipelining, branch prediction, speculative execution, out-of-order execution...and so on...meaning that multiple instructions can be retired per clock. A simple instruction that multiplies by two (which may well be optimized to a left shift as Tom mentions) may be timed at 1 clock but might well take less than that. e.g. http://stackoverflow.com/questions/692718/how-to-find-cpu-cycle-for-an-assembly-instruction

    Thanks for the explanation, Paul. It's been a long time (6502 days) since I've even gone near machine language. The processors certainly have come a long way at that level.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (9/23/2012)


    Thanks for the explanation, Paul. It's been a long time (6502 days) since I've even gone near machine language.

    Me too! My first ever computer was a Commodore VIC-20 (1MHz 8-bit 6502 CPU) I bought with paper-round money.

  • SQL Kiwi (9/23/2012)


    Right. Working back through the comments completed 🙂

    Thank you Jeff for so many mentions in your fine article!

    The articles you wrote on CROSS APPLY are the best on the net IMHO. Thank you for writing them. They give the rest of us a good spot to refer folks to on the subject.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Yes, I was indeed talking about two different things. One is the CLR-in-SQL function, and the other is a pure C# application.

    I think the C# application was fairly well protected against being optimized out. I didn't realize I never attached it; I can see if I have it around. I inspected the generated IL and the runtime execution traces. I also observed that it scaled linearly with the number of iterations. Just to be certain, I also accumulate a value in the loop body, and print out that accumulated value at the end.

    In my experience, the CLR JIT optimizer rarely optimizes out entire loop bodies. It just does not go through the work to do the analysis, likely under the assumption that it has little benefit for real world (non-benchmark) cases. (The Oracle JVM actually does do such eliminations fairly often. In fact, I find the Oracle JVM to perform stronger optimizations across the board. However, with C# and the CLR, you have pointers and non-heap types and similar things, and thus have more ability to optimize what you find necessary. Basically, as a rule of thumb, performance almost always follows: simple C# < Java < optimized C#.)

    It seems like there are further limits on really making the SQL CLR functionality match the performance of pure CLR code for numerical computation. For example, when passing over a large set of primitive values, it is not fast to make any call at all for each and every entry. It would commonly be faster to first shuttle the values into a primitive array and then execute on that, for reasons of memory locality and call overhead. Further, SQL Server could in theory persist such copies at will, as long as it treated them as a cache and invalidated them. For data warehousing scenarios this would work well, since updates are rare, and that is where high-performance numerical computation is common. Unfortunately, this kind of usage has not been a priority. I suppose there just isn't much customer demand for a unified system... As it is, I would be cautious about overdoing any complex computations that are not core SQL Server scenarios. For many such cases of numerical computation, the optimal design (often by orders of magnitude) will be: dump the data from SQL Server, compute, spew data into SQL Server.

    Now, optimizing large-scale extraction and import is a different topic, and is also far from ideal performance with standard mechanisms...

    I wish we didn't have to consider all of this complexity. There is no reason why we should have to worry about making CLR versions of TSQL functions that could be created by a template transform, to get high performance. We also shouldn't have to worry that the best we can do for some large-scale computation on a data warehouse might be 1000x slower than what we could do if we dump the data, even when highly-optimized. This really reduces the flexibility and simplicity of systems built around SQL Server for a wide variety of cases. Of course, again, these are not the core customer cases, so...

    I assume the standalone C# code I was measuring was something like (although more JIT warming, etc. would be desirable):

    using System;
    using System.Diagnostics;

    namespace Test
    {
        class Program
        {
            static void Main(string[] args)
            {
                // Fill a 10-million element array with 0..9,999,999.
                int[] data = new int[10000000];
                for (int i = 0; i < data.Length; i++)
                {
                    data[i] = i;
                }

                // Warm-up pass so JIT compilation is not included in the timings.
                int acc = 0;
                acc += Mult(data);

                // Single timed pass.
                acc = 0;
                Stopwatch time = Stopwatch.StartNew();
                acc += Mult(data);
                time.Stop();
                Console.WriteLine("Time: " + time.ElapsedMilliseconds + " ms [" + acc + "]");

                // Second single timed pass.
                acc = 0;
                time.Restart();
                acc += Mult(data);
                time.Stop();
                Console.WriteLine("Time: " + time.ElapsedMilliseconds + " ms [" + acc + "]");

                // Ten back-to-back passes.
                acc = 0;
                time.Restart();
                acc += Mult(data);
                acc += Mult(data);
                acc += Mult(data);
                acc += Mult(data);
                acc += Mult(data);
                acc += Mult(data);
                acc += Mult(data);
                acc += Mult(data);
                acc += Mult(data);
                acc += Mult(data);
                time.Stop();
                Console.WriteLine("Time: " + time.ElapsedMilliseconds + " ms [" + acc + "]");
            }

            public static int Mult(int[] data)
            {
                int res = 0;
                foreach (int d in data)
                {
                    //res += Mult(d);
                    res = Mult(d);
                }
                return res;
            }

            public static int Mult(int d)
            {
                return d * 2;
            }
        }
    }

    (Incidentally, it is out of scope for this forum, but to back up my aside above... Here is the rough code in Java:

    public class Test
    {
        public static void main(String[] args)
        {
            // Fill a 10-million element array with 0..9,999,999.
            int[] data = new int[10000000];
            for (int i = 0; i < data.length; i++)
            {
                data[i] = i;
            }

            // Warm-up pass so JIT compilation is not included in the timings.
            int acc = 0;
            acc += Mult(data);

            // Single timed pass.
            acc = 0;
            long time = System.nanoTime();
            acc += Mult(data);
            time = System.nanoTime() - time;
            System.out.println("Time: " + (time / 1000000) + " ms [" + acc + "]");

            // Second single timed pass.
            acc = 0;
            time = System.nanoTime();
            acc += Mult(data);
            time = System.nanoTime() - time;
            System.out.println("Time: " + (time / 1000000) + " ms [" + acc + "]");

            // Ten back-to-back passes.
            acc = 0;
            time = System.nanoTime();
            acc += Mult(data);
            acc += Mult(data);
            acc += Mult(data);
            acc += Mult(data);
            acc += Mult(data);
            acc += Mult(data);
            acc += Mult(data);
            acc += Mult(data);
            acc += Mult(data);
            acc += Mult(data);
            time = System.nanoTime() - time;
            System.out.println("Time: " + (time / 1000000) + " ms [" + acc + "]");
        }

        public static int Mult(int[] data)
        {
            int res = 0;
            for (int d : data)
            {
                //res += Mult(d);
                res = Mult(d);
            }
            return res;
        }

        public static int Mult(int d)
        {
            return d * 2;
        }
    }

    Unlike the C# code, this always takes 0 ms to execute. It appears to simply analyze the code and conclude that only the last iteration of the loop will matter. If you change the Mult function to accumulate instead, it no longer can do that. Still, it then takes only 6 ms to execute, which is much faster than the C# version.)

  • rragno (9/23/2012)


    Unlike the C# code, this always takes 0 ms to execute. It appears to simply analyze the code and conclude that only the last iteration of the loop will matter. If you change the Mult function to accumulate instead, it no longer can do that. Still, it then takes only 6 ms to execute, which is much faster than the C# version.)

    So the Java optimiser is somewhat better than the C# optimiser.

    Of course I would have expected pre-compiled code in either of these languages to run in 0 ms for the version with the accumulation as well, so these numbers suggest that optimisation techniques (for procedural languages like these) have deteriorated badly in the last four decades, or everything is doing recompilation all the time, or people have decided that some of the possible optimisations are rather pointless.

    And it seems pretty clear that the C# optimisation is really bad.

    Tom

  • SQL Kiwi (9/23/2012)


    Removed. (I quoted Jeff partially quoting Barry without reading all the way back to see that Barry had already referenced the document I wanted to mention).

    Hey! I got something write! 😀

    -- RBarryYoung, (302)375-0451, blog: MovingSQL.com, Twitter: @RBarryYoung
    Proactive Performance Solutions, Inc.
    "Performance is our middle name."

  • RBarryYoung (9/25/2012)


    SQL Kiwi (9/23/2012)


    Removed. (I quoted Jeff partially quoting Barry without reading all the way back to see that Barry had already referenced the document I wanted to mention).

    Hey! I got something write right! 😀

    (Doggone it! :angry:)

    -- RBarryYoung, (302)375-0451, blog: MovingSQL.com, Twitter: @RBarryYoung
    Proactive Performance Solutions, Inc.
    "Performance is our middle name."

  • Jeff Moden (8/5/2012)


    mmilodragovich (8/1/2012)


    Thanks for the detailed and METHODICAL analysis. I once was troubleshooting a performance problem in a large, complicated application and it came down to a "set quoted identifier on" statement in a scalar function. The statement was not needed, and was probably generated by a developer who used a right-click-on-object to generate the "create" statement. The function was being called a massive number of times, so even the slightest performance difference was hugely magnified. I've been wary ever since but know that knowledge is power, so thanks for the article.

    You're welcome and thanks for the feedback. I've had similar not-easy-to-explain things happen in this wonderful world of T-SQL.

    Actually, it's not so hard to explain.

    We all forget from time to time that SQL is an interpreter, not a compiler.

    Therefore any piece of code is at least re-evaluated (worst case scenario - recompiled) every time it's invoked.

    A scalar function is applied to the final recordset: once for each returned row.

    If you trace a query with a scalar function in it you'll see this:

    SP:Starting      SELECT @Result = dbo.TimesTwo(SomeNumber) FROM #TestTable ;
    SP:StmtStarting  RETURN @SomeINT * 2
    SP:StmtCompleted RETURN @SomeINT * 2
    SP:Completed     SELECT @Result = dbo.TimesTwo(SomeNumber) FROM #TestTable ;

    repeated as many times as the number of rows returned by the query.

    At the same time, a table function (a parametrised view) in a JOIN is invoked ONCE - so it's re-evaluated (recompiled) once.

    That's where the performance difference comes from.
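
    For comparison, here is a minimal sketch of the inline table-valued function approach with CROSS APPLY, which the optimizer expands into the calling query rather than invoking per row. The function name below is illustrative only, standing in for the article's iTVF counterpart of dbo.TimesTwo:

    CREATE FUNCTION dbo.TimesTwoITVF (@SomeINT INT)
    RETURNS TABLE WITH SCHEMABINDING AS
     RETURN SELECT Doubled = @SomeINT * 2
    GO

    DECLARE @Result INT;
    SELECT @Result = ca.Doubled
      FROM #TestTable
     CROSS APPLY dbo.TimesTwoITVF(SomeNumber) ca;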

    SQL Server is just busy creating and re-creating the scalar function in memory, that's all.

    In real life it should not make much difference, unless your system (querying) design is really stupid.

    There is no way a user wants to see more than a hundred rows in a UI.

    Therefore you should never have a million-row scenario in a real-life application.

    And if you apply a WHERE clause to all those queries, leaving under 1000 rows in the final recordsets, you'll see that the difference is really minor - less than 5%.

    Not worth any attention, really.

    It's the same case as explicitly dropping #tables in SP code - if you do a test run of only the create/drop operations in an SP and execute it 10,000 times, the overhead will be really noticeable.

    But in real-life procedures the impact will be totally unnoticeable. Well, there will be a negative impact on performance when you explicitly drop a #table at the end of an SP, but it's totally forgivable for the sake of one's style (not my style! :-))

    In real life you should not have million-row SELECT statements, so the overhead of a scalar function can be ignored if it's more comfortable to code.

    Unless you're doing some kind of Data Transformation task, but even then - it's a one-off, and even a big overhead can be tolerated.

    Views used for reporting - yes, that's where you should be concerned.

    And not using UDFs in the SELECT part of those queries must be a mandatory rule.

    _____________
    Code for TallyGenerator

  • Ah, Sergiy... my ol' friend. Long time no see. Thank you for stopping by and thanks for the great explanations.

    Yeah... I agree... most folks using front end code aren't going to notice any difference between using a Scalar UDF or an iTVF. Chances are, many won't even be able to measure the differences because of the extremely low rowcounts, especially since I showed them why not to use SET STATISTICS to measure Scalar UDF performance.

    I do try to avoid Scalar UDFs, though, because most of the stuff I do is on staging tables, where the cumulative times of multiple functions across many columns do start to make a difference, especially when loading hundreds of large files in such a manner. 5% performance differences can really start to stack up in batch code.

    For GUI code, I write it as if it were large volume batch code because so many people will "leverage" any code they can find in a schedule pinch for their own batch code. If the iTVFs turn out to be a bit more complicated than some will understand, a well placed comment here and there in the code takes care of the problem. If not, then they probably shouldn't be trying to modify the code to begin with. 🙂

    Of course, it's still possible to write crap code in an iTVF. iTVFs aren't a panacea.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Can't believe it.

    Turns out the SQL2000 solution is not that straightforward.

    Or, should I say, it's very limited.

    If we change the function a little bit:

    CREATE FUNCTION dbo.TimesN
            (@SomeINT int, @Multiplier int)
    RETURNS TABLE WITH SCHEMABINDING AS
     RETURN SELECT Doubled = @SomeINT * @Multiplier
    GO

    this still works fine:

    SELECT * FROM dbo.TimesN(10, 2)

    Returns 20 as expected.

    But if you try to use it against a table:

    DECLARE @Result INT
    --===== Begin measuring duration using GETDATE
    DECLARE @StartTime DATETIME;
    SELECT @StartTime = GETDATE()
    --===== Test the code using the "throw-away" variable
    SELECT @Result = (SELECT Doubled FROM dbo.TimesN(SomeNumber, 2))
      FROM #TestTable
    --===== "Stop the timer" and report the duration
    PRINT DATEDIFF(ms,@StartTime,GETDATE())

    You are getting this:

    Msg 155, Level 15, State 1, Line 7

    'SomeNumber' is not a recognized OPTIMIZER LOCK HINTS option.

    It does not like 2 or more parameters, and that's it.

    The same code works perfectly in SQL2008.
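
    On SQL Server 2005 and later, the APPLY operator is another way to feed a column into a multi-parameter iTVF; a minimal sketch using the same function and test table:

    DECLARE @Result INT;
    SELECT @Result = tn.Doubled
      FROM #TestTable
     CROSS APPLY dbo.TimesN(SomeNumber, 2) tn;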

    _____________
    Code for TallyGenerator

  • Hi Jeff,

    I know this is an old article reprinted and you might not see this, but I thought I'd ask anyway.

    If you're using a Scalar UDF to define a Computed Column, can you still use an iTVF?
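
    For context, the kind of definition I mean looks something like this (the table and column names are made up; dbo.TimesTwo is the scalar UDF from the article):

    CREATE TABLE dbo.SampleTable
    (
        SomeNumber    INT NOT NULL,
        DoubledNumber AS dbo.TimesTwo(SomeNumber)   -- computed column bound to the scalar UDF
    );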

    --------------------------------------
    When you encounter a problem, if the solution isn't readily evident go back to the start and check your assumptions.
    --------------------------------------
    It's unpleasantly like being drunk.
    What's so unpleasant about being drunk?
    You ask a glass of water. -- Douglas Adams
