User Defined Functions Investigation

  • Here's the late-night (quiet server) run times for 25K rows (25,000 rows affected in every test; the second pass reverses the order of the comparisons)...

    First pass:
        Formula in SELECT    00:00:00:080
        Function in SELECT   00:00:00:157
        Formula in WHERE     00:00:00:593
        Function in WHERE    00:00:00:580

    Second pass:
        Function in SELECT   00:00:00:123
        Formula in SELECT    00:00:00:140
        Function in WHERE    00:00:00:580
        Formula in WHERE     00:00:00:547

    And here's this morning's run for a quarter-million (250K) rows (250,000 rows affected in every test):

    First pass:
        Formula in SELECT    00:00:01:453
        Function in SELECT   00:00:01:470
        Formula in WHERE     00:00:05:530
        Function in WHERE    00:00:06:170

    Second pass:
        Function in SELECT   00:00:01:390
        Formula in SELECT    00:00:01:410
        Function in WHERE    00:00:05:513
        Formula in WHERE     00:00:06:890

    You'll notice that the Function frequently wins out in speed over the clear code (Formula), as previously noted in this thread by Sergiy and me.  It kinda destroys Carl's hypothesis and refutes Greg's note about UDFs "always being harmful".  It certainly supports what Sergiy has been saying, and I believe the following is the finding of this study:

    The myth that UDFs have any significant impact on performance is just that: a myth, at least within the bounds of this brief study.  Poorly written code is what saps performance, not the mere use of UDFs, scalar or not.

    To summarily dismiss the use of UDFs without having done these initial proofs would have been a horrible mistake, especially since a well-written library of UDFs for even the simplest of tasks and conversions can save developers many hours of, if nothing else, typing and embedding documentation.  Properly named, intuitive-to-use UDFs add clarity to code, reduce "code clutter", and keep developers from having to reinvent the wheel in the heat of battle, also known as a "rush project".  It is a fact, as Greg did mention, that careful planning and the strategic deployment of UDFs to assist developers can certainly lend itself to speedy development.  However, not everything can be thought of beforehand, and the ad hoc development of needed UDFs (submitted for code review and approval) can only make the completion of any given project more expedient.

    Carl, you started this and have been amazingly quiet... are you just taking notes and waiting for the dust to settle or what?

    Sergiy, man this has been fun!  Thanks for going back and forth with me on these tests.  And, yeah, I changed my mind... I'll probably make a UDF that does nothing but strip the time off a date and add it to the "library".  

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff,

    How can you claim that UDFs are NOT harmful?  Let me see.  If the cost of a query is high enough that parallelism would be used, a scalar UDF will prevent that parallelism, which could be bad.  I am not saying that parallelism is always a good thing, but I have seen large ETL processes suffer at the hands of scalar UDFs.  Another point is the FACT that UDFs carry an internal cursor with them.  Defending them amounts to saying that an explicit CURSOR around a large dataset is OK, or that CURSORS are OK.  I would rather say cursors have a time and a place.  Be smart and play it safe; this is what I mean by strategic planning with UDFs.  The way I see it, there is no dispute over scalar UDFs being harmful.  They are.  You may not see any direct impact, but they nonetheless have an influence on performance.  Go ahead and use your scalar UDFs on large datasets, and one day, if the set becomes large enough, you will be re-evaluating those UDFs.
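
    To make the parallelism point concrete, here is a sketch of the kind of side-by-side check I mean; the function and table names below are only illustrative, and you would need a table large enough to warrant a parallel plan in the first place.  Compare the two actual execution plans: the inline expression is free to use a parallel plan, while the scalar UDF version is not.

        -- Scalar UDF version: the optimizer cannot use a parallel plan
        SELECT dbo.DateOnly(OrderDate) AS OrderDate
          FROM dbo.Orders

        -- Same logic written inline: free to go parallel on a big enough table
        SELECT convert(datetime, convert(int, OrderDate - 0.5)) AS OrderDate
          FROM dbo.Orders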

    I know the MS guys can speak to this more clearly.

    I agree, this is fun reading everyone's thoughts and seeing what you came up with.  Most of my day consists of evaluating scalar UDFs; we just went through all of our code doing performance evaluations, and most of the UDFs in the ETL process were rewritten as straight T-SQL instead of scalar UDFs.  I have seen UDFs cause so much harm that removing a single UDF changed the run time of a load script from 4 hours to 50 minutes.

    Thanks

    ~ Greg

  • "Carl, you started this and have been amazingly quiet... are you just taking notes and waiting for the dust to settle or what?"

    Someone has been out sick for the last 2 days, so I got stuck with the "on-call" and production support. Production support mostly means answering foolish questions like:

    "Please reset my database password. The answers are:

    1. You are attempting to connect to the wrong server.

    2. You are using Windows authentication so please call the help desk.

    Second favorite request: "Can I have the SA password?"

    I will have some results later using a benchmark approach to simulate an OLTP environment.  This is done by running 24 simultaneous isql.exe command files against an 8-CPU cluster, with the output going to files.  There are 3 groups of 8 command files that do:

    select with no "Date Only" as a baseline

    select with UDF

    select with In-Line calculation

    Cheers

    SQL = Scarcely Qualifies as a Language

  • UDFs and the impact on performance.

    Abstract:

    Most SQL database management systems allow the user to implement an algorithm by writing their own functions, either in a SQL dialect or in a host language.  These functions are referred to as User Defined Functions (UDFs).  A series of tests simulating a high-OLTP workload was performed using Microsoft SQL Server 2000 on a high-end, server-class machine, measuring the performance of a SQL statement without the algorithm, with the algorithm written in-line within the statement, and with the algorithm encapsulated in a UDF.  The throughput and resource utilization of the alternatives are then analyzed.

    Background:

    Most comparisons previously performed concentrate on elapsed time when the needed algorithm is written in-line using vendor-provided functions versus encapsulating those same functions within a UDF.  Additionally, the tests were performed sequentially, which allows environmental factors, such as an undetected workload on the DBMS, to make the results questionable.  This comparison tracks elapsed time, logical reads and CPU utilization with the various alternative SQL statements running in parallel over a fixed time span.  The SQL Server tracing facility is used to capture these statistics for each statement.  To ensure that only the steady-state resource utilization is analyzed, the first 15 seconds of start-up and the last 15 seconds of shut-down of the workload run are not included in the analysis.

    Environment:

    The DBMS used is Microsoft SQL Server 2000 Enterprise Edition with Service Pack 4, running under Windows 2003 Server Enterprise Edition.  The server hardware is an 8-CPU Xeon 3.0 GHz with 4 GB of memory.  No other processes were using the server or the DBMS, and pre- and post-test checks showed CPU utilization of less than one percent.  Disk drives were removed as a factor by forcing the tables accessed to be resident in memory.  To allow others to run the same benchmark, the SQL Server demo database named Northwind was used, which contains the table "Orders" having 830 rows and three datetime columns.

    Test Case:

    The test case used an algorithm to remove the time portion of a column that holds both the date and time. For example, given a column with the value of '2005-12-07 15:46:34.373', the algorithm should return '2005-12-07 00:00:00.000'.

    The SQL Server algorithm is convert(datetime, convert(int, @DateTime - 0.5 ) )

    The User Defined Function SQL source is:

        create function dbo.DateOnly
            ( @DateTime datetime )
        RETURNS datetime
        AS
        BEGIN
            RETURN convert(datetime, convert(int, @DateTime - 0.5 ) )
        END
        go
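
    As a quick sanity check of the expression against the sample value from the test case above (an illustrative one-off, not part of the benchmark scripts):

        -- Both should return 2005-12-07 00:00:00.000
        select convert(datetime, convert(int, convert(datetime, '2005-12-07 15:46:34.373') - 0.5))
        select dbo.DateOnly( convert(datetime, '2005-12-07 15:46:34.373') )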

    Test Run SQL statements:

    Three sets of test run scripts were created.

    1. Select from the table without using any algorithm.

        select 'TC-NOT'
              ,Orders.OrderId
              ,Orders.OrderDate
              ,Orders.RequiredDate
              ,Orders.ShippedDate
          from dbo.Orders

    2. Select from the table with the algorithm in-line within the SQL statement executed.

        select 'TC-SQL'
              ,Orders.OrderId
              ,convert(datetime, convert(int, Orders.OrderDate - 0.5))    AS OrderDate
              ,convert(datetime, convert(int, Orders.RequiredDate - 0.5)) AS RequiredDate
              ,convert(datetime, convert(int, Orders.ShippedDate - 0.5))  AS ShippedDate
          from dbo.Orders

    3. Select from the table with the algorithm within a UDF.

        select 'TC-UDF'
              ,Orders.OrderId
              ,dbo.DateOnly( Orders.OrderDate )    AS OrderDate
              ,dbo.DateOnly( Orders.RequiredDate ) AS RequiredDate
              ,dbo.DateOnly( Orders.ShippedDate )  AS ShippedDate
          from dbo.Orders

    Workload scripts:

    Three sets of workload scripts were written so that each SQL statement would begin at the same time and run for a fixed duration of two minutes.  The UDF workload script has the source below, and the other two scripts are similar.  The "WAITFOR TIME" command causes a pause until the specified time is reached, and then the remaining SQL statements are executed.

        WAITFOR TIME '13:10:00'
        declare @EndTS datetime
        set @EndTS = dateadd(mi, 2, current_timestamp)
        WHILE current_timestamp < @EndTS
        begin
            select 'TC-UDF'
                  ,Orders.OrderId
                  ,dbo.DateOnly( Orders.OrderDate )    AS OrderDate
                  ,dbo.DateOnly( Orders.RequiredDate ) AS RequiredDate
                  ,dbo.DateOnly( Orders.ShippedDate )  AS ShippedDate
              from dbo.Orders
        end
        go

    Simultaneous execution of workload scripts:

    For each workload script, four Windows command files were created to invoke the script using the SQL Server isql utility, with the output written to a file.  See the Microsoft SQL Server Books Online for the meaning of the parameters.  Each command file specifies a different output file name.

    isql.exe -E -SUSAG551IOMT1-4 -dNorthwind -h-1 -w1000 -i"R:\UDF\Source\Extend_Orders_DateOnly_UDF.sql" -o"R:\UDF\OutputFiles\Extend_Orders_DateOnly_01.txt"

    Resource Utilization Statistics:

        Method           Base     SQL      UDF      SQL - Base   UDF - Base
        Executions       2401     2280     1880
        Total Duration   718685   712109   715065
        Total Reads      62426    59280    52640
        Total CPU        3235     4342     58779
        Avg Duration MS  299.327  312.329  380.354  13.00        81.03
        Average Reads    26.000   26.000   28.000   0.00         2.00
        Average CPU      1.347    1.904    31.265   0.56         29.92

    Analysis:

    The use of a UDF instead of in-line SQL resulted in a 21% increase in duration and a 15-fold increase in CPU utilization.

    Recommendation:

    SQL Server User Defined Functions should be avoided due to the huge performance impact, especially the effect on CPU utilization.

    SQL = Scarcely Qualifies as a Language

    Thanks Carl, good work.  Anyone who has been performance tuning with SQL Server and has worked with scalar UDFs would recognize the performance hit.  It is amazing how harmful they can be.  Careful planning is needed when using scalar UDFs.

    Carl, do you think you can run a test comparing a SQL Server view and an inline table-valued UDF?  I have seen similar execution plans between the two and cannot see a negative effect on performance, unless you have 20 joins, lots of column output, and nested sub-queries.
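
    For anyone who wants to set that comparison up, a minimal sketch with made-up object names might look like the following; in most cases the optimizer expands both the view and the inline table-valued UDF into the outer query, which is why the plans tend to come out looking similar.

        -- A plain view over the test table
        CREATE VIEW dbo.OrderDatesView
        AS
        SELECT OrderId, OrderDate, RequiredDate, ShippedDate
          FROM dbo.Orders
        GO

        -- The equivalent inline table-valued UDF
        CREATE FUNCTION dbo.OrderDatesTVF ()
        RETURNS TABLE
        AS
        RETURN
        (
            SELECT OrderId, OrderDate, RequiredDate, ShippedDate
              FROM dbo.Orders
        )
        GO

        -- Compare the execution plans of these two:
        SELECT * FROM dbo.OrderDatesView
        SELECT * FROM dbo.OrderDatesTVF()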

    Thanks

    Greg

    I have at least 5 other sets of comparisons to run to be included in the final article, and the results make the DATEONLY UDF look good.  A lot of the time spent so far was on infrastructure, such as setting up traces and confirming that the results reflect the actual resource utilization.

    1. Deterministic UDFs vs In-Line SQL in the SELECT section of SQL statement. These will include math, date and string algorithms.

    2. Non-Deterministic UDFs vs In-Line SQL in the SELECT section of SQL statement. These include single row, the SUM aggregation, the MAX aggregation and the LIST.

    3. Deterministic UDFs vs In-Line SQL in the WHERE section of SQL statement

    4. Non-Deterministic UDFs vs In-Line SQL in the WHERE section of SQL statement

    5. Deterministic UDFs vs In-Line SQL as a CHECK Constraint

    6. Non-Deterministic UDFs as CHECK Constraint vs In-Line SQL in a Trigger

    Cheers.

    For references, I have found:

    Ken Henderson's "The Guru's Guide to SQL Server Stored Procedures, XML and HTML", which offers some interesting test cases.  Since all of the source is on a CD, this is very convenient.

    Joseph Gama's "TSQL functions" at http://www.planet-source-code.com/vb/scripts/ShowCode.asp?txtCodeId=502&lngWId=5 These UDFs appear to be all deterministic.

    Andrew Novick's UDF library at http://www.novicksoftware.com/

    SQL = Scarcely Qualifies as a Language

  • Do you actually believe that???

    Average CPU      1.347    1.904    31.265   0.56   29.92

    Analysis:

    The use of a UDF instead of in-line SQL resulted in a 21% increase in duration and a 15-fold increase in CPU utilization.

    You are making a common beginner's mistake:

    you do not take into consideration the simple fact that the process of collecting statistics takes time and resources as well.

    SQL Server does not record statistics for built-in functions, simply because they are not subject to investigation.

    But every execution of a UDF adds an extra record to the statistics log.

    It's not the UDF that is using your CPU, it's your statistics!

    Your conclusions are wrong because your experiment is badly designed.

    In a real system you do not record statistics on every executed statement, so a UDF will not have any noticeable impact on performance.

     

    _____________
    Code for TallyGenerator

  • Regarding your comments on the trace invalidating the statistics.

    "But every execution of UDF adds extra record into statistics log."

    It depends on how the trace is configured.

    The trace used to capture the statistics only includes Event ID 41 - SQL:StmtCompleted.

    If the trace had included Event ID 45 - SP:StmtCompleted, then you would be correct that each of the statements within the UDF would additionally be written to the trace file.  But this is not occurring.

    Additionally:

    1. Separate tests were run without the trace active that captured the number of SQL statements run within the two-minute window for each of the 12 connections.  These counts match the counts found by the trace.

    2. The resources used by the trace are incurred at the server level, not at the statement level.  Since all of the SQL statements being benchmarked run at the same time, they are all equally impacted by the resources the trace uses.

    I stand by the statistics reported and the conclusions.

    SQL = Scarcely Qualifies as a Language

  • An additional comment on tracing.

    If one uses SQL Profiler to capture the trace information, I have found that the tool does have an effect on the overall performance of the server.

    That is why the trace is set up to run in the background and write to a file, not to a GUI.
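
    For anyone who wants to reproduce that kind of background trace, here is a minimal sketch; the output path and maximum file size are illustrative, and the event and column IDs are the standard SQL:StmtCompleted ones documented in Books Online.

        -- Server-side trace: write SQL:StmtCompleted (event 41) to a file, no GUI attached.
        DECLARE @TraceId     int
               ,@MaxFileSize bigint
               ,@On          bit
        SET @MaxFileSize = 50      -- MB; the .trc extension is appended automatically
        SET @On = 1

        EXEC sp_trace_create @TraceId OUTPUT, 0, N'R:\UDF\Traces\StmtCompleted', @MaxFileSize, NULL

        -- Columns used in the analysis: TextData, SPID, Duration, Reads, CPU
        EXEC sp_trace_setevent @TraceId, 41, 1,  @On
        EXEC sp_trace_setevent @TraceId, 41, 12, @On
        EXEC sp_trace_setevent @TraceId, 41, 13, @On
        EXEC sp_trace_setevent @TraceId, 41, 16, @On
        EXEC sp_trace_setevent @TraceId, 41, 18, @On

        -- Start the trace; use status 0 to stop it and status 2 to close it after the run.
        EXEC sp_trace_setstatus @TraceId, 1

    The finished .trc file can then be loaded into a table with ::fn_trace_gettable for analysis.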

    SQL = Scarcely Qualifies as a Language

  • Switch off all SQL statistics and run Performance Monitor from Windows "Administrative Tools"

    And then try to catch your 15 times difference in CPU load.

    _____________
    Code for TallyGenerator

  • Greg Wrote: Go ahead and use your Scalar UDFs on large datasets, and one day if the set becomes large enough, you will be re-evaluating those UDFs.

    Allow me to quote from one of my earlier posts:  "AND, each test was on 2 MILLION records using the same table."  I really appreciate your concerns, but my question would be, how big is large enough?  An observation might also be that if you're processing more than that using OLTP in a single batch, perhaps a change in processing schedule is in order.  That notwithstanding, I'd be happy to run a test with whatever size record set you can provide once you've shown the results from the same tests.  I would also ask that you share the test code so we can run it and make our own evaluation of your tests.

    Carl wrote: Additionally, the tests were performed sequentially which allows environmental factors, such as an un-detected work load on the DBMS, to make the results questionable.

    Not in my case.  As I stated previously... here are the results from my test runs on a stand-alone, single-user, non-networked server: my little ol' 1.8GHz, 2GB RAM, IDE-disk desktop running SQL Server 2000 Developer's Edition with SP3a on an XP SP2 box... I ran it a couple of times to be sure... very similar results on all runs... Only the later runs were done on a multi-user server.  In all cases, the comparisons were reversed within each run to detect that possibility.  That notwithstanding, I'll be happy to run parallel tests on a server and a stand-alone box.

    Shifting gears a bit, folks, I don't give a rat's patooti about whether or not SQL decides to use an internal cursor.  In many cases, those internal cursors beat the clear code for performance.  Am I suggesting that someone write cursors, or that I condone the use of developer-made cursors?  Not on your life...

    I can't speak to the parallelism problems that Greg is warning us about because I've not done any testing for the alleged fault, yet.  You can bet I will.

    Sergiy, once again, I agree with you... to measure something is to change it.  That's why I wrapped each instance of code to be measured in simple elapsed-time code instead of allowing embedded features to do the measurements in an unpredictable fashion... I have noticed that SET STATISTICS TIME and traces can and frequently do change the elapsed time of otherwise good code.  I will, however, be doing some testing with tracing and stats on, just to see how much they affect identical code.
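
    For clarity, that elapsed-time wrapper is nothing fancier than this kind of sketch (variable names are illustrative):

        -- Minimal elapsed-time wrapper: no traces, no SET STATISTICS, just GETDATE()
        DECLARE @StartTime DATETIME
            SET @StartTime = GETDATE()

        -- ... code being measured goes here ...

         SELECT DATEDIFF(ms, @StartTime, GETDATE()) AS Elapsed_ms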

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

    I have re-done the same performance comparison, but captured resource utilization from the sysprocesses table instead of using a trace.

    On the average CPU milliseconds, sysprocesses versus trace:

    For UDFs, 29.642 vs 31.265 for a 5% decrease in CPU without the trace.

    For SQLs, 1.720 vs 1.904 for a 10% decrease in CPU without the trace.

    For None, 1.518 vs 1.347 for a 12% INCREASE in CPU without the trace.

    The conclusions remain the same for the two-minute window:

    1. Execution count was 3,160 for the inline SQL versus 2,417 for the UDF.  This equates to a 24% drop in throughput.

    2. Average CPU usage in milliseconds was 1.72 for the inline SQL versus 29.642 for the UDF.  This equates to roughly 17 times more CPU.

    DENY CREATE FUNCTION TO PUBLIC !!!

    SQL Scripts:

        CREATE TABLE dbo.ProcessResources
            ( Condition   varchar(255) NOT NULL
            , Spid        smallint     NOT NULL
            , SQLCnt      int          NOT NULL
            , CpuMs       int          NOT NULL
            , PhysicalIo  int          NOT NULL
            )
        go

        -- Revised SQL batch to capture statistics from the sysprocesses table.
        WAITFOR TIME '9:49:00'
        declare @EndTS   datetime
               ,@SQLCnt  integer
        set @SQLCnt = 0
        set @EndTS  = dateadd(mi, 2, current_timestamp)
        while current_timestamp < @EndTS
        begin
            select 'TC-UDF'
                  ,Orders.OrderId
                  ,dbo.DateOnly( Orders.OrderDate )    AS OrderDate
                  ,dbo.DateOnly( Orders.RequiredDate ) AS RequiredDate
                  ,dbo.DateOnly( Orders.ShippedDate )  AS ShippedDate
              from dbo.Orders

            SET @SQLCnt = @SQLCnt + 1
        end

        Insert into udf_traces.dbo.ProcessResources
            (Condition, spid, SQLCnt, CpuMs, PhysicalIo)
        select 'TC-UDF'
              ,spid
              ,@SQLCnt
              ,CPU
              ,Physical_IO
          from master.dbo.sysprocesses
         where spid = @@spid
        go

        -- Summarize by test case
        select Condition, SUM(SQLCnt), SUM(CpuMs), SUM(PhysicalIo)
          FROM UDF_Traces.dbo.ProcessResources
         Group by Condition
        go

        -- Show the resources used by each process
        SELECT Condition, Spid, SQLCnt, CpuMs, PhysicalIo
          FROM UDF_Traces.dbo.ProcessResources
         order by Condition, spid

    The details for each process (SQLCnt = statements executed in the two-minute window, CPU in milliseconds, PIO = physical I/O):

        Case    SPID  SQLCnt  CPU    PIO
        TC-NOT  64    328     250    0
        TC-NOT  65    236     453    0
        TC-NOT  66    383     656    0
        TC-NOT  67    468     734    0
        TC-NOT  68    549     921    0
        TC-NOT  69    552     907    0
        TC-NOT  70    250     328    0
        TC-NOT  71    394     547    0
        TC-SQL  72    319     469    0
        TC-SQL  73    248     469    0
        TC-SQL  74    392     688    0
        TC-SQL  75    140     265    0
        TC-SQL  76    309     484    0
        TC-SQL  77    463     781    0
        TC-SQL  78    307     563    0
        TC-SQL  79    239     438    0
        TC-UDF  80    458     13547  0
        TC-UDF  81    517     15250  0
        TC-UDF  82    253     7312   0
        TC-UDF  83    510     14875  0
        TC-UDF  84    272     8219   0
        TC-UDF  85    474     13813  0
        TC-UDF  86    222     6844   0
        TC-UDF  87    383     11703  0
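
    For reference, the per-statement averages quoted above can be reproduced from ProcessResources with a query along these lines (a sketch, not one of the scripts above):

        -- Average CPU ms per statement and total statements, by test case.
        -- The * 1.0 avoids integer division.
        SELECT Condition
              ,SUM(SQLCnt)                    AS TotalStatements
              ,SUM(CpuMs) * 1.0 / SUM(SQLCnt) AS AvgCpuMsPerStatement
          FROM UDF_Traces.dbo.ProcessResources
         GROUP BY Condition
         ORDER BY Condition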

    SQL = Scarcely Qualifies as a Language

  • BANG!   I'm dead.  You guys finally proved your point with some cold, hard facts.  I didn't like Carl's test the way it was because he left out two very important statistics.... number of rows and total elapsed time for each run.  And, what I'm really interested in is elapsed time because I know it sometimes takes resources to get performance.   So, using Carl's example, but rewritten to give me what I'm looking for, I ran two separate tests using the same UDF we've been using all along... here's the code...

    Test #1 using a UDF:

        drop table MyTimes

        CREATE TABLE dbo.MyTimes
               (
               RunNum    INT  IDENTITY(1,1) PRIMARY KEY CLUSTERED,
               TimeStart DATETIME,
               TimeEnd   DATETIME,
               TheCount  INT
               )

        DECLARE @Limit      SMALLINT
            SET @Limit    = 5
        DECLARE @Counter    SMALLINT
            SET @Counter  = 1
        DECLARE @TimeStart  DATETIME
        DECLARE @TimeEnd    DATETIME
        DECLARE @MyRowCount INT

         WHILE @Counter <= @Limit
         BEGIN
                SET @TimeStart = GETDATE()

             SELECT 'TC-UDF',
                    OrderId,
                    dbo.DateOnly(OrderDate)    AS OrderDate,
                    dbo.DateOnly(RequiredDate) AS RequiredDate,
                    dbo.DateOnly(ShippedDate)  AS ShippedDate
               FROM dbo.Orders

             SELECT @TimeEnd = GETDATE(), @MyRowCount = @@ROWCOUNT

             INSERT INTO MyTimes
                    (TimeStart, TimeEnd, TheCount)
             SELECT @TimeStart, @TimeEnd, @MyRowCount

                SET @Counter = @Counter + 1
           END

        SELECT *, DATEDIFF(ms, TimeStart, TimeEnd) AS Elapsed_ms FROM MyTimes

    -------------------------------------------------------------------------

     
    Test #2 using "clear code":
     
     CREATE TABLE dbo.MyTimes

            (

            RunNum    INT  IDENTITY(1,1) PRIMARY KEY CLUSTERED,

            TimeStart DATETIME,

            TimeEnd   DATETIME,

            TheCount  INT

            )

     
    DECLARE @Limit      SMALLINT

        SET @Limit    = 5

    DECLARE @Counter    SMALLINT

        SET @Counter  = 1

    DECLARE @TimeStart  DATETIME

    DECLARE @TimeEnd    DATETIME

    DECLARE @MyRowCount INT

     
      WHILE @Counter <= @Limit

      BEGIN

                SET @TimeStart = GETDATE()

             SELECT 'TC-UDF',

                    OrderId,

                    convert(datetime, convert(int, OrderDate - 0.5))  AS OrderDate,

                    convert(datetime, convert(int, RequiredDate - 0.5)) AS RequiredDate,

                    convert(datetime, convert(int, ShippedDate - 0.5))  AS ShippedDate

               FROM dbo.Orders

     
             SELECT @TimeEnd = GETDATE(),@MyRowCount = @@ROWCOUNT

             INSERT INTO MyTimes

                    (TimeStart,TimeEnd,TheCount)

             SELECT @TimeStart,@TimeEnd,@MyRowCount

        SET @Counter = @Counter + 1

        END

     
    SELECT *,DATEDIFF(ms,TimeStart,TimeEnd) AS Elapsed_ms FROM MyTimes

     

    I also didn't like the small number of rows (only 830) offered by the Orders table in the Northwind database... I mean this IS supposed to be a performance test, isn't it?  So, I made my own and changed the TOP number to produce various sizes for the different tests... Here's the code:

        --===================================================================
        --      Create the test table (Careful!!! Has a drop!!!!)
        --===================================================================
        --===== If the test table exists, drop it (CAREFUL!!!!)
             IF OBJECT_ID('dbo.Orders') IS NOT NULL
                DROP TABLE dbo.Orders

        --===== Create and populate the test table on the fly
         SELECT TOP 2500000 --Change this number to change the final size
                IDENTITY(INT,1,1) AS OrderId,
                GETDATE() AS OrderDate,
                GETDATE() AS RequiredDate,
                GETDATE() AS ShippedDate,
                'THEY WILL BOTH PERFORM THE SAME' AS SomeString,
                'MYTH BUSTED!!!' AS SomeOtherString,
                NEWID() AS GUID
           INTO dbo.Orders
           FROM dbo.SYSCOLUMNS sc1,
                dbo.SYSCOLUMNS sc2,
                dbo.SYSCOLUMNS sc3

        --===== Add a primary key to the test table
          ALTER TABLE dbo.Orders
                ADD PRIMARY KEY CLUSTERED (OrderId)
        GO

    Here's the results for Test #1 using 2.5 million records (uses the UDF):

        RunNum  TimeStart                TimeEnd                  TheCount  Elapsed_ms
        1       2005-12-08 22:32:01.580  2005-12-08 22:33:57.393  2500000   115813
        2       2005-12-08 22:33:57.393  2005-12-08 22:35:52.580  2500000   115186
        3       2005-12-08 22:35:52.580  2005-12-08 22:37:47.300  2500000   114720
        4       2005-12-08 22:37:47.300  2005-12-08 22:39:41.613  2500000   114313
        5       2005-12-08 22:39:41.613  2005-12-08 22:41:35.893  2500000   114280

    Here's the results for Test #2 using 2.5 million records (uses clear code):

        RunNum  TimeStart                TimeEnd                  TheCount  Elapsed_ms
        1       2005-12-08 22:46:11.270  2005-12-08 22:47:06.300  2500000   55030
        2       2005-12-08 22:47:06.300  2005-12-08 22:48:00.643  2500000   54343
        3       2005-12-08 22:48:00.677  2005-12-08 22:48:54.630  2500000   53953
        4       2005-12-08 22:48:54.643  2005-12-08 22:49:48.207  2500000   53563
        5       2005-12-08 22:49:48.207  2005-12-08 22:50:41.817  2500000   53610

    Clearly, the clear code beat the pants off the UDF version.  But we saw this in the last test code... on the larger tests the UDF lost, but on the smaller tests the UDF won.  So, I reran the code for only 2,500 records, and here are the results...

    Here's the results for Test #1 using 2,500 records (uses the UDF):

        RunNum  TimeStart                TimeEnd                  TheCount  Elapsed_ms
        1       2005-12-08 22:58:44.940  2005-12-08 22:58:45.050  2500      110
        2       2005-12-08 22:58:45.050  2005-12-08 22:58:45.177  2500      126
        3       2005-12-08 22:58:45.177  2005-12-08 22:58:45.283  2500      106
        4       2005-12-08 22:58:45.283  2005-12-08 22:58:45.410  2500      126
        5       2005-12-08 22:58:45.410  2005-12-08 22:58:45.533  2500      123

    Here's the results for Test #2 using 2,500 records (uses clear code):

        RunNum  TimeStart                TimeEnd                  TheCount  Elapsed_ms
        1       2005-12-08 23:00:48.707  2005-12-08 23:00:48.753  2500      46
        2       2005-12-08 23:00:48.753  2005-12-08 23:00:48.817  2500      63
        3       2005-12-08 23:00:48.817  2005-12-08 23:00:48.880  2500      63
        4       2005-12-08 23:00:48.880  2005-12-08 23:00:48.927  2500      46
        5       2005-12-08 23:00:48.927  2005-12-08 23:00:48.987  2500      60

    Again, and very clearly, the clear code beat the pants off the UDF.  It wasn't by the 17 times that Carl claimed (I never trust CPU times, just elapsed times), but it is very consistent... the clear code is about twice as fast as the UDF.

    This test is also a bit more real-world than my previous test that showed the UDF sometimes winning... so crows beware... I've gotta eat some here.

    For the books, this was done on a stand-alone, single 1.8 GHz CPU copy of MS SQL Server 2000 SP3a Developer's Edition with 2GB of RAM and twin 80GB IDE hard drives.  In other words, there's no chance that anyone else ran something while I was doing this, and I didn't even move the mouse during the tests.  There were no scheduled tasks anywhere near the runs, either.  All the tests were run several times with nearly identical results.

    Sergiy, I don't think I'll be making that UDF after all.  It's not quite clear to me why the other tests sometimes showed the UDF winning, but it clearly did not win in these tests.  Is there anything you can see that I've done wrong that may give undue favor to the clear code in the test code above?  And, yes, I did run each test without the loop several times to ensure that the loop code wasn't what was doing it.

    Now, where's that Crow?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
