The Case for Scalar-valued, User-defined Functions in T-SQL

,

Scalar-valued, user-defined functions (sUDFs) in T-SQL are often maligned, with people normally pointing to their lack of performance.  This is a reputation that is unfortunately well-deserved.  Anyone that knows me also knows of my preference for in-line, table-valued functions (iTVFs), because they’ll generally perform better.  In fact, I’ve written and published articles on several utility iTVFs that are an essential part of my own, personal T-SQL tool kit.

Before anyone goes and makes generalized assertions on the performance of sUDFs, one must be familiar with one quirk of measuring their performance.  This was well documented by SQL MVP Jeff Moden in his SQL Spackle article whimsically entitled How to Make Scalar UDFs Run Faster, where he doesn’t really “make them run faster,” but rather shows how to more accurately measure their true performance.

Regardless of their performance, or lack thereof, there remains some use cases where the ability to create a sUDF is essential, because no other method will truly suffice as a replacement.  So today I’d like to look at a few of these cases, because perhaps you haven’t seen some of them before.  For each of our use cases, we’ll construct a somewhat contrived, simple example, but where we can we’ll also try to show you something that is perhaps a little more useful in real-world practice.

An Example Table and Some Test Data

In order to expound upon our sUDF use cases, we’ll need to create a test table and populate it with some sample data.

  IF OBJECT_ID(N'dbo.SampleTableforSUDFUseCases', N'U') IS NOT NULL
      DROP TABLE dbo.SampleTableforSUDFUseCases;
  CREATE TABLE dbo.SampleTableforSUDFUseCases
      (
        ID INT PRIMARY KEY
      , C1 INT NOT NULL
      ); 
   GO
  DECLARE @NoRows INT = 10000;
  WITH    Tally ( n )
            AS ( SELECT TOP ( @NoRows )
                      ROW_NUMBER() OVER ( ORDER BY ( SELECT
                                                          NULL
                                                   ) )
                  FROM
                      sys.all_columns a
                      CROSS JOIN sys.all_columns b
               )
      INSERT INTO dbo.SampleTableforSUDFUseCases
              ( ID
              , C1
              )
              SELECT
                      n
                  ,   1 + ABS(CHECKSUM(NEWID())) % 50
                  FROM
                      Tally;
   GO

This creates 10,000 rows of sample data where column C1 contains a random number between 1 and 50.

Using a sUDF in a DEFAULT CONSTRAINT

For an example of using a sUDF in a DEFAULT CONSTRAINT, we’ll first need to construct a function for this purpose.  This function will calculate the statistical mode.  To put it simply, the mode in a set of sample data is the most frequently occurring value.

We can construct a sUDF that calculates mode:

  CREATE FUNCTION dbo.CalculateMode ( )
  RETURNS INT
  AS
      BEGIN
          DECLARE @ModeOfC1 INT = ( SELECT TOP 1
                                          C1
                                      FROM
                                          ( SELECT
                                                  C1
                                              ,   C = COUNT(C1)
                                              FROM
                                                  dbo.SampleTableforSUDFUseCases
                                              GROUP BY
                                                  C1
                                          ) a
                                      ORDER BY
                                          C DESC
                                  );
          RETURN @ModeOfC1;
      END;

Note that it is possible that in our 10,000 sample points there might be several actual values that share the same highest frequency count, so we’ll assume that any of them can be used to represent the mode.

Let’s run a quick query to show us the current mode in our sample data (note that if you’re following along you’ll get a different value for C1 and C).

SELECT TOP 1 C1, C
FROM
(
    SELECT C1, C=COUNT(C1)
    FROM dbo.SampleTableforSUDFUseCases
    GROUP BY C1
) a
ORDER BY C DESC;
-- Results (our mode is 20 and there are 226 rows where C1 is 20)
C1 C
20 226

So now let’s use our sUDF to create a named, DEFAULT CONSTRAINT for column C1.

ALTER TABLE dbo.SampleTableforSUDFUseCases
ADD CONSTRAINT Tst_df_c1 DEFAULT (dbo.CalculateMode()) FOR C1;
GO

We can then insert a new row and see that the result that was inserted (since it was not specified on the INSERT statement) defaulted to the current mode.

INSERT INTO dbo.SampleTableforSUDFUseCases (ID) 
 SELECT 10001;
SELECT *
 FROM dbo.SampleTableforSUDFUseCases 
 WHERE ID = 10001; -- Results ID C1 10001 20

We have just demonstrated a way to set a non-deterministic DEFAULT CONSTRAINT for a column in a table.

Note that an UPDATE statement that uses DEFAULT for the value being set will also call our sUDF to assign the mode to the updated row(s).  For example:

UPDATE dbo.SampleTableforSUDFUseCases
SET C1 = DEFAULT
WHERE ID = 10000;

Using a sUDF in a Computed Column

For our second example of using a sUDF, let’s create a FUNCTION that calculates a running total.  May the gods of SQL performance forgive me this transgression, but I’ll get back to that in a minute.

  IF OBJECT_ID(N'dbo.RunningTotal', N'FN') IS NOT NULL
      DROP FUNCTION dbo.RunningTotal;
    GO
  CREATE FUNCTION dbo.RunningTotal ( @ID INT )
  RETURNS INT
  AS
      BEGIN
          DECLARE @RT INT = ( SELECT
                                      SUM(C1)
                                  FROM
                                      dbo.SampleTableforSUDFUseCases
                                  WHERE
                                      ID <= @ID
                            );
          RETURN @RT; 
      END;

We can now add a computed column to our table (RT) and call this value to show the running total (ordered by the ID column).

ALTER TABLE dbo.SampleTableforSUDFUseCases
 ADD RT AS (dbo.RunningTotal(ID)); GO
SELECT TOP 5 ID, C1, RT
 FROM dbo.SampleTableforSUDFUseCases
 ORDER BY ID; 
-- Results 
ID   C1    RT 
1    8     8 
2    19    27 
3    14    41 
4    26    67 
5    43    110

Here is where the pundits are gonna thrash me for using such a poor-performing method of calculating a running total!  Let me state categorically and for the record, this is not how I would calculate a running total in actual practice.  This is merely a demonstration of a technique (using a sUDF in a computed column).  If I were to actually calculate a running total, my preference would be:

Just for fun though, we’re going to create another FUNCTION, this time a schema bound iTVF that will also calculate a running total.

IF OBJECT_ID (N'dbo.RunningTotal2', N'TVF') IS NOT NULL
     DROP FUNCTION dbo.RunningTotal2; 
GO
CREATE FUNCTION dbo.RunningTotal2(@ID INT) RETURNS TABLE WITH SCHEMABINDING 
RETURN
     SELECT RT2=SUM(C1)
     FROM dbo.SampleTableforSUDFUseCases
     WHERE ID <= @ID;

The reason we’re going to do this (besides for fun) is to demonstrate the supposition that sUDFs perform more slowly than an equivalent iTVF.  Try the following test harness.

DECLARE @RunningTotal   BIGINT = 0
       ,@StartDT        DATETIME = GETDATE()
       ,@DumpVar        BIGINT;
SELECT @DumpVar = RT2 FROM dbo.SampleTableforSUDFUseCases CROSS APPLY dbo.RunningTotal2(ID);
SELECT ElapsedMSiTVF = DATEDIFF(millisecond, @StartDT, GETDATE()); 
SELECT @StartDT = GETDATE();
SELECT @DumpVar = RT FROM dbo.SampleTableforSUDFUseCases;
SELECT ElapsedMSsUDF = DATEDIFF(millisecond, @StartDT, GETDATE());

When I ran this I got the following results on my 10,000 (or so) rows of data.

ElapsedMSiTVF 6676

ElapsedMSsUDF 7436

This demonstrates not only that the iTVF version of our running totals function is faster, but also the correct way to calculate timing results for a sUDF (as noted again by Jeff Moden in the article I mentioned in the second paragraph).

I recently ran across a really interesting case for using a sUDF in a computed column.  SQL MVP Steve Jones, who’s probably better known as the editor of the SQL Server Central web site, recently posted to his Voice of the DBA blog site this article: “T-SQL Tuesday #65 – Learning Computed Columns for XML.”

In that short blog, Steve shows how you can create a sUDF to use in a computed column, which shreds a bit of information out of an XML column.  Rather than reproduce that here, I’ll let you read his blog.  Suffice it to say, I thought that was a pretty darned cool use case!

Using a sUDF in a CHECK CONSTRAINT

For our third and final use case for sUDFs, let’s create one more function.

IF OBJECT_ID (N'dbo.PctVarianceFromMedian', N'FN') IS NOT NULL
     DROP FUNCTION dbo.PctVarianceFromMedian; 
GO
CREATE FUNCTION dbo.PctVarianceFromMedian(@C INT) RETURNS FLOAT 
AS 
BEGIN
     DECLARE @Median FLOAT =
         (
             SELECT AVG(0.+b.C1)
             FROM
             (
                 SELECT c1=(c+1)/2
                     ,c2=CASE c%2 WHEN 0 THEN 1+c/2 ELSE (c+1)/2 END
                 FROM
                 (
                     SELECT c=COUNT(*)
                     FROM dbo.SampleTableforSUDFUseCases
                 ) a
             ) a
             JOIN
             (
                 SELECT C1, rn=ROW_NUMBER() OVER (ORDER BY C1)
                 FROM dbo.SampleTableforSUDFUseCases
             ) b ON b.rn BETWEEN a.c1 AND a.c2
         );
    RETURN 100.*ABS(@C-@Median)/@Median; 
END

This function first calculates the median for our sample data using a technique I describe in An Even Faster Method of Calculating the Median on a Partitioned Heap.  Note that this might not be the fastest way to calculate median for this INDEXing case (see Best approaches for grouped median by SQL MVP Aaron Bertrand for other ways), but it will suffice for our purposes.

After calculating the median, it compares this against the argument to the function doing a percentage over/under type calculation.  We’ll now set up a CHECK CONSTRAINT on C1 that will ensure that any new value inserted for C1 is within ±10% of the calculated median.

ALTER TABLE dbo.SampleTableforSUDFUseCases
 ADD CONSTRAINT Tst_ck_InsertedValueExceedsVarianceFromMedian
 CHECK (10>dbo.PctVarianceFromMedian(C1));
GO

We can run a script to calculate the median four our sample data, using the same technique:

SELECT Median=AVG(0.+b.C1)
FROM
(
    SELECT c1=(c+1)/2
        ,c2=CASE c%2 WHEN 0 THEN 1+c/2 ELSE (c+1)/2 END
    FROM
    (
        SELECT c=COUNT(*)
        FROM dbo.SampleTableforSUDFUseCases
    ) a
) a
JOIN
(
    SELECT C1, rn=ROW_NUMBER() OVER (ORDER BY C1)
    FROM dbo.SampleTableforSUDFUseCases
) b ON b.rn BETWEEN a.c1 AND a.c2;
-- Results
Median
26.000000

Now we’ll try to insert a new sample data item (12) that is outside of the allowable 10% range (calculated as: 100*ABS(12-26)/26 = 53.8%), and we’ll get an error indicating that the insertion violates our newly imposed CHECK CONSTRAINT.

INSERT dbo.SampleTableforSUDFUseCases (ID, C1)
 VALUES (10002, 12);
-- Resulting error
The ALTER TABLE statement conflicted with the CHECK constraint "Tst_ck_InsertedValueExceedsVarianceFromMedian". The conflict occurred in database "YOUR-DATABASE", table "dbo.SampleTableforSUDFUseCases", column 'C1'.

So using sUDFs we can impose more elaborate checking on the data we insert into table columns.

I once saw a very interesting use case for this approach having to do with hierarchies that are stored in adjacency lists.  I’ll apologize in advance for not having a link to the original source for this.

To demonstrate this case, we’ll refer you to yet another article by Jeff Moden called Hierarchies on Steroids #1: Convert an Adjacency List to Nested Sets.  If you run the stored procedure he provides there named BuildSmallEmployeeTable, it builds a small adjacency list hierarchy, which once you select results from it you’ll see this.

SELECT EmployeeID, ManagerID, EmployeeName
FROM dbo.Employee;
-- Results
EmployeeID   ManagerID   EmployeeName
2            26          Lynne
3            26          Bob
6            17          Eric
7            3           Vivian
8            3           Bill
12           8           Megan
14           8           Kim
17           2           Butch
18           39          Lisa
20           3           Natalie
21           39          Homer
26           NULL        Jim
39           26          Ken
40           26          Marge

I’ll also refer you now to the very nice graphic Jeff provides in his article showing the four levels that this hierarchy/adjacency list represents.

In any adjacency list that you intend for any useful purpose, when inserting or updating nodes you must ensure that you do not introduce a cyclic reporting relationship.  In other words (for example) if manager A has employee B, employee B cannot also be the manager of A.  That’s a pretty simple cycle but they can become much more convoluted the deeper your hierarchy becomes.

So let’s create a sUDF that will (hopefully) check for cyclic reporting relationships.  Since it utilizes a recursive Common Table Expression (rCTE), you may want to use the MAXRECURSION OPTION if you’ve got a hierarchy that is more than 100 levels deep (the default).

CREATE FUNCTION dbo.CheckManagerCycle (@managerID INT) RETURNS INT AS BEGIN
    DECLARE @cycleExists INT = 0;
    WITH RecurseHierarchy as     (         SELECT EmployeeID, ManagerID, EmployeeName         FROM dbo.Employee          WHERE EmployeeID = @managerID
        UNION ALL
        SELECT a.EmployeeID, a.ManagerID, a.EmployeeName         FROM dbo.Employee a         JOIN RecurseHierarchy b         ON b.ManagerID = a.EmployeeID and a.EmployeeID <> @managerID     )     SELECT @cycleExists = count(*)     FROM RecurseHierarchy      WHERE ManagerID = @managerID;
    RETURN @cycleExists;
END

We can then call this sUDF in the following CHECK CONSTRAINT, where WITH CHECK causes it to run for all data currently saved in the table.

ALTER TABLE dbo.Employee WITH CHECK
ADD CONSTRAINT ManagerCycleCheck CHECK ((dbo.CheckManagerCycle(ManagerID)=0));
GO

Since executing the above did not fail, we’ll assume the hierarchy has no cyclic reporting relationships at the outset.

Looking again at Jeff’s nice hierarchy graphic, we see that if we change Bob’s (whose EmployeeID = 3) manager from Jim (EmployeeID = 26) to Megan (EmployeeID = 12), we’ll introduce a cycle that should not be allowed to occur because Megan already reports to Bob through Bill (EmployeeID = 8).

In SQL terms, that change (UPDATE) is shown below.

UPDATE dbo.Employee
SET ManagerID = 12
WHERE EmployeeID = 3;
-- Resulting error
The UPDATE statement conflicted with the CHECK constraint "ManagerCycleCheck". The conflict occurred in database " YOUR-DATABASE", table "dbo.Employee", column 'ManagerID'.

The failure of this seems to verify that this function works as advertised.

At this point I need to let you know that I have not thoroughly tested this sUDF/CHECK CONSTRAINT against all possible operations that could cause a cyclic reporting relationship, as I would if I was going to introduce it into a production system.  However I will say that it does seem to work correctly for the limited cases that I have tested.  Certainly it worked for our example case.  So before you use it, be sure to run it through its paces using your adjacency list hierarchy.

Still, it does make for quite an interesting use case for a sUDF in a CHECK CONSTRAINT.

Learning Recap and Conclusions

Today we have learned that sUDFs can be used in DEFAULT CONSTRAINTs, CHECK CONSTRAINTs and to calculate the value for a computed column.  In fact, using a sUDF for these purposes is about the only way you’ll be able to introduce rather complex calculations for any of those scenarios.

While performance of sUDFs may not be as swift as for other methods, if they are your only recourse to get the job done, then you probably won’t want to hesitate to use them.

I’d also like to acknowledge and thank SQL MVPs Jeff Moden, Aaron Bertrand and Steve Jones for their prior work, which greatly helped in the development of this article.  I pride myself on learning best of breed solutions from the best of the best.

You may need to clean up your sandbox if you’ve been following along, so here is a SQL script you can use to do that (except for the Employee table, where Jeff’s article will take care of you).

DROP FUNCTION dbo.RunningTotal2;
ALTER TABLE dbo.SampleTableforSUDFUseCases DROP CONSTRAINT Tst_df_c1;
ALTER TABLE dbo.SampleTableforSUDFUseCases DROP COLUMN RT;
DROP FUNCTION dbo.RunningTotal;
DROP FUNCTION dbo.PctVarianceFromMedian;
DROP FUNCTION dbo.CalculateMode;
DROP TABLE dbo.SampleTableforSUDFUseCases;

Follow me on Twitter: @DwainCSQL and check out my blog: SQL Tips for the Up and Comers

© Copyright Dwain Camps 21 Apr 2015.  All rights reserved.

Rate

4.95 (21)

Share

Share

Rate

4.95 (21)