Why Scalar Functions Can Be Costly

  • Comments posted to this topic are about the item Why Scalar Functions Can Be Costly

    Mickey Stuewe
    Sr Database Developer
    My blog
    Follow me on twitter: @SQLMickey
    Connect with me on LinkedIn
    --------------------------------------------------------------------------
    I laugh loudly and I laugh often. Just ask anyone who knows me.

  • Thank you Mickey for this article, good Monday morning reading.

    The part I find hard to agree with is "When to Use Scalar Functions", particularly when the example given can easily be converted into a much more efficient iTVFN with hardly any changes in the syntax. In my opinion, the only time scalar UDF should be used is when that is the only choice, such as on calculated columns etc..

    😎

    The following example shows that the iTVFN version of the function is roughly 15 times faster than the scalar version, the former adding roughly 0.0006 milliseconds per row while the latter adds 0.009 milliseconds per row on my old i5 laptop. In other words the iTVFN takes 0.6 seconds on a 1,000,000 row sample set and the scalar 9 seconds on the same set.

    Test data

    USE tempdb;

    GO

    SET NOCOUNT ON;

    IF OBJECT_ID(N'dbo.USP_CREATE_PERSON_ADDRESS') IS NOT NULL DROP PROCEDURE dbo.USP_CREATE_PERSON_ADDRESS;

    GO

    CREATE PROCEDURE dbo.USP_CREATE_PERSON_ADDRESS

    (

    @SAMPLE_SIZE INT

    ,@MSGLANGID SMALLINT = 1033 -- Default to English

    )

    AS

    /******************************************************************************

    2015 Eirikur Eiriksson

    ---------------------------------------

    Create "realistic" sample data set (SDP_ID, FirstName, LastName, Address)

    in any of the languages found in sys.syslanguages.

    ---------------------------------------

    Parameters

    1. @SAMPLE_SIZE INT, number of rows produced, max 10^9

    2. @MSGLANGID SMALLINT, msglangid of the language to use from the

    sys.syslanguages table.

    ---------------------------------------

    Note: to view the applicable languages and corresponding msglangid run the

    following query.

    ---------------------------------------

    SELECT

    SL.alias

    ,SL.msglangid

    FROM sys.syslanguages SL;

    ---------------------------------------

    SELECT DISTINCT

    SL.msglangid

    FROM sys.syslanguages SL;

    ---------------------------------------

    Output structure:

    ---------------------------------------

    SDP_ID BIGINT NOT NULL

    FirstName NVARCHAR(50) NOT NULL

    LastName NVARCHAR(50) NOT NULL

    [Address] NVARCHAR(100) NOT NULL

    ---------------------------------------

    *******************************************************************************/

    WITH T(N) AS (SELECT N FROM (VALUES (0),(0),(0),(0),(0),(0),(0),(0),(0),(0)) AS X(N))

    , NUMS(N) AS (SELECT TOP (@SAMPLE_SIZE) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS N

    FROM T T1,T T2,T T3,T T4,T T5,T T6,T T7,T T8,T T9)

    ,NAME_STRING(SXNAME) AS

    (

    SELECT

    (SELECT

    REPLACE( -- 32

    REPLACE( -- 34 "

    REPLACE( -- 37 %

    REPLACE( -- 39 '

    REPLACE( -- 40 (

    REPLACE( -- 41 )

    REPLACE( -- 42 *

    REPLACE( -- 44 ,

    REPLACE( -- 45 -

    REPLACE( -- 46 .

    REPLACE( -- 47 /

    REPLACE( -- 58 :

    REPLACE( -- 59 ;

    REPLACE( -- 61 =

    REPLACE( -- 64 @

    REPLACE( -- 95 _

    REPLACE( -- 124 |

    SM.text

    ,NCHAR(32),N'') -- 32

    ,NCHAR(34),N'') -- 34 "

    ,NCHAR(37),N'') -- 37 %

    ,NCHAR(39),N'') -- 39 '

    ,NCHAR(40),N'') -- 40 (

    ,NCHAR(41),N'') -- 41 )

    ,NCHAR(42),N'') -- 42 *

    ,NCHAR(44),N'') -- 44 ,

    ,NCHAR(45),N'') -- 45 -

    ,NCHAR(46),N'') -- 46 .

    ,NCHAR(47),N'') -- 47 /

    ,NCHAR(58),N'') -- 58 :

    ,NCHAR(59),N'') -- 59 ;

    ,NCHAR(61),N'') -- 61 =

    ,NCHAR(64),N'') -- 64 @

    ,NCHAR(95),N'') -- 95 _

    ,NCHAR(124),N'') -- 124 |

    FROM sys.messages SM

    WHERE SM.language_id = @MSGLANGID

    FOR XML PATH(''), TYPE).value('.[1]','NVARCHAR(4000)') AS SXNAME

    )

    ,BASE_DATA AS

    (

    SELECT

    NCHAR(65 + ABS(CHECKSUM(NEWID())) % 25)

    + LOWER(SUBSTRING(NS.SXNAME,ABS(CHECKSUM(NEWID())) % 3800,5 + ABS(CHECKSUM(NEWID())) % 10)) AS FirstName

    ,NCHAR(65 + ABS(CHECKSUM(NEWID())) % 25)

    + LOWER(SUBSTRING(NS.SXNAME,ABS(CHECKSUM(NEWID())) % 3800,5 + ABS(CHECKSUM(NEWID())) % 10)) AS LastName

    ,NCHAR(65 + ABS(CHECKSUM(NEWID())) % 25)

    + LOWER(SUBSTRING(NS.SXNAME,ABS(CHECKSUM(NEWID())) % 3800,7 + ABS(CHECKSUM(NEWID())) % 10))

    + NCHAR(32) + CONVERT(NVARCHAR(4),1 + (ABS(CHECKSUM(NEWID())) % 1000),0) AS [Address]

    FROM NUMS NM

    CROSS APPLY NAME_STRING NS

    )

    SELECT

    ROW_NUMBER() OVER

    (

    ORDER BY (SELECT NULL)

    ) AS SDP_ID

    ,BD.FirstName

    ,BD.LastName

    ,BD.Address

    FROM BASE_DATA BD

    ;

    GO

    IF OBJECT_ID(N'dbo.SAMPLE_Person') IS NOT NULL DROP TABLE dbo.SAMPLE_Person;

    CREATE TABLE dbo.SAMPLE_Person

    (

    SDP_ID BIGINT NOT NULL CONSTRAINT PK_DBO_PERSON_PERSON_ID PRIMARY KEY CLUSTERED

    ,FirstName NVARCHAR(50) NOT NULL

    ,LastName NVARCHAR(50) NOT NULL

    ,[Address] NVARCHAR(100) NOT NULL

    );

    GO

    INSERT INTO dbo.SAMPLE_Person (SDP_ID,FirstName,LastName,[Address])

    EXEC dbo.USP_CREATE_PERSON_ADDRESS 1000000,1033;

    The two functions

    USE tempdb;

    GO

    SET NOCOUNT ON;

    IF OBJECT_ID(N'dbo.ITVFN_PROPER_CASE') IS NOT NULL DROP FUNCTION dbo.ITVFN_PROPER_CASE;

    GO

    CREATE FUNCTION dbo.ITVFN_PROPER_CASE

    (

    @INPUT NVARCHAR(50)

    )

    RETURNS TABLE

    WITH SCHEMABINDING

    AS

    RETURN

    SELECT

    STUFF(LOWER(@INPUT),1,1,UPPER(LEFT(@INPUT,1))) AS PROPER_CASED

    ;

    GO

    IF OBJECT_ID(N'dbo.SFN_PROPER_CASE') IS NOT NULL DROP FUNCTION dbo.SFN_PROPER_CASE;

    GO

    CREATE FUNCTION dbo.SFN_PROPER_CASE

    (

    @INPUT NVARCHAR(50)

    )

    RETURNS NVARCHAR(50)

    WITH SCHEMABINDING

    AS

    BEGIN

    RETURN

    (SELECT

    STUFF(LOWER(@INPUT),1,1,UPPER(LEFT(@INPUT,1))))

    ;

    END

    GO

    Test harness

    DECLARE @NCHAR_BUCKET_01 NVARCHAR(50) = N'';

    DECLARE @NCHAR_BUCKET_02 NVARCHAR(50) = N'';

    DECLARE @timer TABLE (T_TEXT VARCHAR(100) NOT NULL, T_TS DATETIME2(7) NOT NULL DEFAULT (SYSDATETIME()));

    INSERT INTO @timer(T_TEXT) VALUES ('DRY RUN');

    SELECT

    @NCHAR_BUCKET_01 = SP.FirstName

    ,@NCHAR_BUCKET_02 = SP.LastName

    FROM dbo.SAMPLE_Person SP;

    INSERT INTO @timer(T_TEXT) VALUES ('DRY RUN');

    INSERT INTO @timer(T_TEXT) VALUES ('SCALAR');

    SELECT

    @NCHAR_BUCKET_01 = dbo.SFN_PROPER_CASE(SP.FirstName)

    ,@NCHAR_BUCKET_02 = dbo.SFN_PROPER_CASE(SP.LastName)

    FROM dbo.SAMPLE_Person SP;

    INSERT INTO @timer(T_TEXT) VALUES ('SCALAR');

    INSERT INTO @timer(T_TEXT) VALUES ('ITVFN');

    SELECT

    @NCHAR_BUCKET_01 = X.PROPER_CASED

    ,@NCHAR_BUCKET_02 = Y.PROPER_CASED

    FROM dbo.SAMPLE_Person SP

    CROSS APPLY dbo.ITVFN_PROPER_CASE(SP.FirstName) AS X

    CROSS APPLY dbo.ITVFN_PROPER_CASE(SP.FirstName) AS Y;

    INSERT INTO @timer(T_TEXT) VALUES ('ITVFN');

    -- Test results

    SELECT

    T.T_TEXT

    ,DATEDIFF(millisecond,MIN(T.T_TS),MAX(T.T_TS)) AS DURATION

    FROM @timer T

    GROUP BY T.T_TEXT

    ORDER BY DURATION ASC;

    Results

    T_TEXT DURATION

    ---------- -----------

    DRY RUN 230

    ITVFN 547

    SCALAR 8994

    Edit: typo

  • Small correction required - UDFs came in as part of 2000 didn't they?

  • bjh1977 (1/4/2016)


    Small correction required - UDFs came in as part of 2000 didn't they?

    I think they started earlier with stored procs ('90): sql 6.0

  • 1) UDFs existed at least as early as SQL 2000: https://technet.microsoft.com/en-us/library/aa175085(v=sql.80).aspx

    2) This article fails to mention an EXCEPTIONALLY important flaw (among many others that aren't covered such as preventing the acquisition of good statistics) and that is single-threading. Concatenate a string like the article says using a UDF and your big-fact-table-hitting-query is going to be running without parallelism. http://sqlblog.com/blogs/paul_white/archive/2011/12/23/forcing-a-parallel-query-execution-plan.aspx

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Thanks for the good article.

  • I'm not understanding here, doesn't the summing operation still have to occur for each row in the "cross apply" version?

    edit: Another thing that bothers me about the article is that the author implies that the "execution" count is the ultimate measure of T-SQL performance goodness, whereas, as shown by Eirikur Eiriksson, even with the "better" execution count with that second scalar UDF example, theres obviously more to the story here.

  • Eirikur's response rendered what I was going to write mostly redundant. But I do have some minor points:

    1) If you 'obfuscate' something, you make it unclear or difficult to interpret. The execution plan does not do that: instead it completely hides the multi-execution issue.

    2) It would be interesting to see how you created/configured that extended events session. Your 'how to identify the cost' comment made me think that you were going to find a way to identify the 'real' cost of using scalar functions within an execution plan.

    3) (typo) I think it's 'formatted' rather than 'formated'.

    If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

  • TheSQLGuru (1/4/2016)


    1) UDFs existed at least as early as SQL 2000: https://technet.microsoft.com/en-us/library/aa175085(v=sql.80).aspx

    2) This article fails to mention an EXCEPTIONALLY important flaw (among many others that aren't covered such as preventing the acquisition of good statistics) and that is single-threading. Concatenate a string like the article says using a UDF and your big-fact-table-hitting-query is going to be running without parallelism. http://sqlblog.com/blogs/paul_white/archive/2011/12/23/forcing-a-parallel-query-execution-plan.aspx

    Both Kevin and Eirikur beat me to it. As Eirikur demonstrated, an inline table valued function is almost always the way to go. Killing the possibility of a parallel query plan is the #1 problem work scalr UDF's.

    That said, great work Micky. I particularly like how you used Extended Events to show what was going on under the hood.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • The example given selects rows based on the result of a mathematical operation performed on two columns within each row. This derived result does not exist on its own; it must be calculated for each row, so it actually seems to make sense that each row must be read. It appears that the engine then creates a tally table with the same number of rows, includes the result of the calculation as a key column, then scans this table for qualifying results.

    But if the selection criteria was based on an existing column value - and if this column was properly indexed - would that not reduce the number of reads to just the number of qualifying rows? Even if the function then performed a similar computation using existing columns to derive a new result, it seems to me that the rows read should still be held to just "qualifying" rows, avoiding a full-table scan.

    My point is not to contradict your assertion that scalar functions "can be" expensive, but to suggest that your example is a worst-case scenario. Is it possible that your recommendation to never query a table or view inside of a scalar function might be overly broad?

  • lucien.jacquet (1/4/2016)


    My point is not to contradict your assertion that scalar functions "can be" expensive, but to suggest that your example is a worst-case scenario. Is it possible that your recommendation to never query a table or view inside of a scalar function might be overly broad?

    NO, IT IS ABSOLUTELY NOT OVERLY BROAD TO SAY NEVER QUERY A TABLE OR VIEW INSIDE OF A SCALAR FUNCTION!!!!!!!! It is actually more appropriate to say NEVER EVER USER SCALAR UDFS - PERIOD!! There are few absolutes in the world, but that is one of them for me. There is almost ALWAYS a way to avoid them. Please find and read my "Death by UDF" chapter in the SQL Server MVP Deep Dives 2 book to learn more about why they must be avoided.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • TheSQLGuru (1/4/2016)


    lucien.jacquet (1/4/2016)


    My point is not to contradict your assertion that scalar functions "can be" expensive, but to suggest that your example is a worst-case scenario. Is it possible that your recommendation to never query a table or view inside of a scalar function might be overly broad?

    NO, IT IS ABSOLUTELY NOT OVERLY BROAD TO SAY NEVER QUERY A TABLE OR VIEW INSIDE OF A SCALAR FUNCTION!!!!!!!! It is actually more appropriate to say NEVER EVER USER SCALAR UDFS - PERIOD!! There are few absolutes in the world, but that is one of them for me. There is almost ALWAYS a way to avoid them. Please find and read my "Death by UDF" chapter in the SQL Server MVP Deep Dives 2 book to learn more about why they must be avoided.

    Not even for CHECK constraints?

    If you haven't even tried to resolve your issue, please don't expect the hard-working volunteers here to waste their time providing links to answers which you could easily have found yourself.

  • Phil Parkin (1/4/2016)


    TheSQLGuru (1/4/2016)


    lucien.jacquet (1/4/2016)


    My point is not to contradict your assertion that scalar functions "can be" expensive, but to suggest that your example is a worst-case scenario. Is it possible that your recommendation to never query a table or view inside of a scalar function might be overly broad?

    NO, IT IS ABSOLUTELY NOT OVERLY BROAD TO SAY NEVER QUERY A TABLE OR VIEW INSIDE OF A SCALAR FUNCTION!!!!!!!! It is actually more appropriate to say NEVER EVER USER SCALAR UDFS - PERIOD!! There are few absolutes in the world, but that is one of them for me. There is almost ALWAYS a way to avoid them. Please find and read my "Death by UDF" chapter in the SQL Server MVP Deep Dives 2 book to learn more about why they must be avoided.

    Not even for CHECK constraints?

    That's about the only exception

    😎

    Worse still are the multi-statement table-valued functions, in fact there is no excuse or justification for using those.

  • bjh1977 (1/4/2016)


    Small correction required - UDFs came in as part of 2000 didn't they?

    That's correct.

    😎

  • I remember that Simon Sabin did an article called "Scalar functions are evil", but whenever I needed to send someone a link to it I'd Google for "Simon Sabin is evil" 😛

Viewing 15 posts - 1 through 15 (of 53 total)

You must be logged in to reply to this topic. Login to reply