• Yes, I was indeed talking about two different things. One is the CLR-in-SQL function, and the other is a pure C# application.

    I think the C# application was fairly well protected against being optimized out. I didn't realize I had never attached it; I can see if I still have it around. I inspected the generated IL and the runtime execution traces, and I also observed that it scaled linearly with the number of iterations. Just to be certain, I also accumulated a value in the loop body and printed out that accumulated value at the end.

    In my experience, the CLR JIT optimizer rarely optimizes out entire loop bodies. It simply does not go through the work of doing that analysis, likely on the assumption that it has little benefit for real-world (non-benchmark) cases. (The Oracle JVM actually does perform such eliminations fairly often. In fact, I find the Oracle JVM to perform stronger optimizations across the board. However, with C# and the CLR, you have pointers, non-heap types, and the like, and thus more ability to hand-optimize where you find it necessary. Basically, as a rule of thumb, performance almost always follows: simple C# < Java < optimized C#.)

    It seems there are further limits on making the SQL CLR functionality match the performance of pure CLR code for numerical computation. For example, when passing over a large set of primitive values, making a separate call for each and every entry is inherently slow. It would commonly be faster to first shuttle the values into a primitive array and then execute on that, for reasons of memory locality and call overhead. Further, SQL Server could in theory persist such copies at will, as long as it treated them as a cache and invalidated them on updates. This would work well for data warehousing scenarios, since updates there are rare, and that is where high-performance numerical computation is common. Unfortunately, this kind of usage has not been a priority. I suppose there just isn't much customer demand for a unified system... As it is, I would be cautious about overdoing any complex computations that are not core SQL Server scenarios. For many such cases of numerical computation, the optimal design (often by orders of magnitude) will be: dump the data from SQL Server, compute, and spew the results back into SQL Server.
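    The copy-then-compute point can be sketched in Java. This is a toy illustration only, not SQL CLR code: the `RowSource` interface is invented here to stand in for a source that yields one value per call, the way a per-row function invocation would.

    ```java
    // Toy Java sketch of the call-overhead point above; not SQL CLR code.
    // "RowSource" is an invented stand-in for per-row function calls.
    public class CallOverhead
    {
        interface RowSource
        {
            boolean hasNext();
            int next();
        }

        static class ArrayRowSource implements RowSource
        {
            private final int[] data;
            private int pos = 0;
            ArrayRowSource(int[] data) { this.data = data; }
            public boolean hasNext() { return pos < data.length; }
            public int next() { return data[pos++]; }
        }

        // One virtual call per value, on every pass.
        static long sumPerCall(RowSource rows)
        {
            long sum = 0;
            while (rows.hasNext())
            {
                sum += rows.next();
            }
            return sum;
        }

        // Shuttle the values into a primitive array once; subsequent
        // passes are tight loops with good memory locality.
        static int[] toArray(RowSource rows, int count)
        {
            int[] buffer = new int[count];
            int n = 0;
            while (rows.hasNext() && n < count)
            {
                buffer[n++] = rows.next();
            }
            return buffer;
        }

        static long sumArray(int[] data)
        {
            long sum = 0;
            for (int d : data)
            {
                sum += d;
            }
            return sum;
        }

        public static void main(String[] args)
        {
            int[] data = new int[10000000];
            for (int i = 0; i < data.length; i++)
            {
                data[i] = i;
            }

            // Ten passes via per-entry calls.
            long t1 = System.nanoTime();
            long a = 0;
            for (int pass = 0; pass < 10; pass++)
            {
                a += sumPerCall(new ArrayRowSource(data));
            }
            t1 = System.nanoTime() - t1;

            // One copy up front, then ten tight-loop passes.
            long t2 = System.nanoTime();
            int[] copy = toArray(new ArrayRowSource(data), data.length);
            long b = 0;
            for (int pass = 0; pass < 10; pass++)
            {
                b += sumArray(copy);
            }
            t2 = System.nanoTime() - t2;

            System.out.println("per-call: " + (t1 / 1000000) + " ms, batched: "
                    + (t2 / 1000000) + " ms, equal: " + (a == b));
        }
    }
    ```

    The copy is not free, of course; it pays off once its cost is amortized over repeated passes, which is exactly the cached-copy scenario described above.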

    Now, optimizing large-scale extraction and import is a different topic, and is also far from ideal performance with standard mechanisms...

    I wish we didn't have to consider all of this complexity. There is no reason we should have to hand-write CLR versions of T-SQL functions, which could be generated by a template transform, just to get high performance. Nor should we have to worry that the best we can do for some large-scale computation on a data warehouse might be, even when highly optimized, 1000x slower than what we could achieve by dumping the data out. This really reduces the flexibility and simplicity of systems built around SQL Server for a wide variety of cases. Of course, again, these are not the core customer cases, so...

    I assume the standalone C# code I was measuring was something like (although more JIT warming, etc. would be desirable):

    using System;
    using System.Diagnostics;

    namespace Test
    {
        class Program
        {
            static void Main(string[] args)
            {
                int[] data = new int[10000000];
                for (int i = 0; i < data.Length; i++)
                {
                    data[i] = i;
                }

                // Warm up the JIT before timing.
                int acc = 0;
                acc += Mult(data);

                acc = 0;
                Stopwatch time = Stopwatch.StartNew();
                acc += Mult(data);
                time.Stop();
                Console.WriteLine("Time: " + time.ElapsedMilliseconds + " ms [" + acc + "]");

                acc = 0;
                time.Restart();
                acc += Mult(data);
                time.Stop();
                Console.WriteLine("Time: " + time.ElapsedMilliseconds + " ms [" + acc + "]");

                acc = 0;
                time.Restart();
                acc += Mult(data);
                acc += Mult(data);
                acc += Mult(data);
                acc += Mult(data);
                acc += Mult(data);
                acc += Mult(data);
                acc += Mult(data);
                acc += Mult(data);
                acc += Mult(data);
                acc += Mult(data);
                time.Stop();
                Console.WriteLine("Time: " + time.ElapsedMilliseconds + " ms [" + acc + "]");
            }

            public static int Mult(int[] data)
            {
                int res = 0;
                foreach (int d in data)
                {
                    //res += Mult(d);
                    res = Mult(d);
                }
                return res;
            }

            public static int Mult(int d)
            {
                return d * 2;
            }
        }
    }

    (Incidentally, it is out of scope for this forum, but to back up my aside above... here is the rough equivalent in Java:

    public class Test
    {
        public static void main(String[] args)
        {
            int[] data = new int[10000000];
            for (int i = 0; i < data.length; i++)
            {
                data[i] = i;
            }

            // Warm up the JIT before timing.
            int acc = 0;
            acc += Mult(data);

            acc = 0;
            long time = System.nanoTime();
            acc += Mult(data);
            time = System.nanoTime() - time;
            System.out.println("Time: " + (time / 1000000) + " ms [" + acc + "]");

            acc = 0;
            time = System.nanoTime();
            acc += Mult(data);
            time = System.nanoTime() - time;
            System.out.println("Time: " + (time / 1000000) + " ms [" + acc + "]");

            acc = 0;
            time = System.nanoTime();
            acc += Mult(data);
            acc += Mult(data);
            acc += Mult(data);
            acc += Mult(data);
            acc += Mult(data);
            acc += Mult(data);
            acc += Mult(data);
            acc += Mult(data);
            acc += Mult(data);
            acc += Mult(data);
            time = System.nanoTime() - time;
            System.out.println("Time: " + (time / 1000000) + " ms [" + acc + "]");
        }

        public static int Mult(int[] data)
        {
            int res = 0;
            for (int d : data)
            {
                //res += Mult(d);
                res = Mult(d);
            }
            return res;
        }

        public static int Mult(int d)
        {
            return d * 2;
        }
    }

    Unlike the C# code, this always takes 0 ms to execute. It appears to simply analyze the code and conclude that only the last iteration of the loop matters. If you change the Mult function to accumulate instead, it can no longer do that. Still, it then takes only 6 ms to execute, which is much faster than the C# version.)
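    For completeness, the accumulating variant referenced there (the commented-out line in the listing) looks like this; every iteration now contributes to the result, so the loop cannot be reduced to its last iteration:

    ```java
    // Accumulating variant of the Java Mult(int[]) above: every iteration
    // contributes to the result, so the JIT can no longer reduce the loop
    // to its final iteration.
    public class TestAcc
    {
        public static int Mult(int[] data)
        {
            int res = 0;
            for (int d : data)
            {
                res += Mult(d); // accumulate instead of overwrite
            }
            return res;
        }

        public static int Mult(int d)
        {
            return d * 2;
        }

        public static void main(String[] args)
        {
            int[] data = new int[10000000];
            for (int i = 0; i < data.length; i++)
            {
                data[i] = i;
            }
            long time = System.nanoTime();
            int acc = Mult(data);
            time = System.nanoTime() - time;
            // acc overflows int here, as in the original listings; it is
            // printed only to keep the computation observable.
            System.out.println("Time: " + (time / 1000000) + " ms [" + acc + "]");
        }
    }
    ```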