Long execution times when including simple SQL functions into a join

  • Florian Reischl (5/20/2009)


    @J-F:

    I stopped your function at two minutes.

    Greets

    Flo

    Ouch... And I thought it would perform better... Do you have the exec plan Flo, just for reference.

    Thanks,

    Cheers,

    J-F

  • First of all, thanks every one for your help so far. Its clear that I need to learn a few more things about performance of functions.

    Let me post some findings after using the SET STATISTICS. I simplified the problem to a small query calling a ID_field and phone from a table that is 170mln records (and yes, we receive data extracts at 6-8 week intervals requiring me to redo all the validations on data such as phone numbers etc.).

    See queries at bottom of message

    Time stats:

    Query 1: CPU time = 655 ms, elapsed time = 406 ms.

    Query 2 (Scaler): CPU time = 11562 ms, elapsed time = 22861 ms.

    Query 3 (CLR, C#): CPU time = 5281 ms, elapsed time = 7296 ms.

    IO stats:

    Query 1: Scan count 1, logical reads 0, physical reads 0, read-ahead reads 23786, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Query 2: Scan count 1, logical reads 16824, physical reads 0, read-ahead reads 21784, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Query 3: Scan count 1, logical reads 0, physical reads 0, read-ahead reads 22378, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

    Discussion:

    The data indicates that functions carry a substantial performance hit. The CLR appears most efficient but it will still take ~18x longer processing time. In the literature, I found that the processing time scale linearly with the number of records. My queries without any functions routinely take about 20 hours, this wold indicate that adding a function would extent the processing time from 20 to 360 hours (that is 15 days)! Any thoughts about that assumption??

    One more tidbit, when invoking a scalar function and just pass the input to the output, the CPU processing time increased the query ~12x (data not shown), that was quite an eye opener.

    I have not tried the tables-valued function that was presented because I need to learn about the m first, just never looked into it and will do now.

    Thanks for your help guys, looking forward to your comments.

    Queries

    --SET STATISTICS IO on

    SET STATISTICS time on

    --Query 1; baseline (no function)

    SELECT TOP 100000

    mita_orig_id,

    isnull(PHONE,'') as new_phone

    FROM dbo.input_table

    WHERE LEN(PRIMARY_GIVEN_NAME) 0 AND LEN(SURNAME) 0 AND LEN(MTC_ADDRESS) 0 AND LEN(ZIP) 0

    dbcc freeproccache

    dbcc dropcleanbuffers

    -- Query 2; Scalar-valued function

    SELECT TOP 100000

    mita_orig_id,

    PDW_DL_DATA.DBO.VALID_PHONE(isnull(PHONE,'')) as new_phone

    FROM dbo.input_table

    WHERE LEN(PRIMARY_GIVEN_NAME) 0 AND LEN(SURNAME) 0 AND LEN(MTC_ADDRESS) 0 AND LEN(ZIP) 0

    dbcc freeproccache

    dbcc dropcleanbuffers

    -- Query 3; CLR function (C#)

    SELECT TOP 100000

    mita_orig_id,

    PDW_DL_DATA.DBO.UDF_VALID_PHONE(isnull(PHONE,'')) as new_phone

    FROM dbo.input_table

    WHERE LEN(PRIMARY_GIVEN_NAME) 0 AND LEN(SURNAME) 0 AND LEN(MTC_ADDRESS) 0 AND LEN(ZIP) 0

    SET STATISTICS IO off

    SET STATISTICS time off

  • J-F Bergeron (5/21/2009)


    Florian Reischl (5/20/2009)


    @J-F:

    I stopped your function at two minutes.

    Greets

    Flo

    Ouch... And I thought it would perform better... Do you have the exec plan Flo, just for reference.

    Thanks,

    I just tried again on my home system which performs much better than my box at work. Now the inline function has almost same duration as the scalar function but the CPU usage is much more.

    Here my duration results:

    ---====== SQL scalar ===================

    SQL Server Execution Times:

    CPU time = 19890 ms, elapsed time = 21106 ms.

    ---====== SQL inline ===================

    SQL Server Execution Times:

    CPU time = 65084 ms, elapsed time = 21283 ms.

    ---====== CLR scalar ===================

    SQL Server Execution Times:

    CPU time = 16926 ms, elapsed time = 5782 ms.

    The execution plans are attached.

    I will post the CLR function and the test environment in a separate post.

  • @Pieter

    Since you didn't answer my question for C# or VB.Net I use C# because this is my preference 😀

    Ensure that you use "Release" compiled assembly. The "Debug" build is about 30% slower.

    The CLR function:

    using System;

    using System.Data;

    using System.Data.SqlClient;

    using System.Data.SqlTypes;

    using System.Text;

    using System.Text.RegularExpressions;

    using System.Xml;

    using Microsoft.SqlServer.Server;

    public partial class UserDefinedFunctions

    {

    private static readonly Regex _phoneRegEx = new Regex("[(]|[)]|[-]|[ ]", RegexOptions.Compiled);

    [Microsoft.SqlServer.Server.SqlFunction()]

    public static SqlString ufn_clr_ValidPhone(

    [SqlFacet(MaxSize = 4000)]

    SqlString strIn

    )

    {

    if (strIn.IsNull)

    return new SqlString("");

    string str = _phoneRegEx.Replace(strIn.Value, "");

    string item;

    if (str.Length == 11)

    {

    if (str[0] == '1')

    str = str.Substring(1);

    else

    return new SqlString("");

    }

    if (str.Length != 10)

    return new SqlString("");

    if (str[0] == '0' || str[0] == '1')

    return new SqlString("");

    if (str[1] == '9')

    return new SqlString("");

    if (str.Substring(1, 2) == "11")

    return new SqlString("");

    if (str[3] == '0' || str[3] == '1')

    return new SqlString("");

    if (str.Substring(4, 2) == "11")

    return new SqlString("");

    item = str.Substring(0, 3);

    if (item == "456" || item == "500" || item == "555" || item == "600" || item == "700" || item == "710" || item == "800" || item == "822" || item == "833" || item == "844" || item == "855" || item == "866" || item == "877" || item == "888" || item == "898" || item == "900" || item == "976" || item == "999")

    return new SqlString("");

    item = str.Substring(3);

    if (item == "2222222" || item == "3333333" || item == "4444444" || item == "5555555" || item == "6666666" || item == "7777777" || item == "8888888" || item == "9999999")

    return new SqlString("");

    return new SqlString(str);

    }

    };

    I currently use a RegEx just to remove the special characters. I will try to cover the complete function within a reg-ex. I'm no great RegEx guy but I'll try. Are there any Perl pros? :laugh:

    To reduce the query execution time to the real phone validation I moved the functions to the WHERE clause.

    [font="Courier New"]SET NOCOUNT ON

    GO

    ---======================================

    -- -> Create some test data for phone number validation

    --DROP TABLE PhoneNumbers

    IF (OBJECT_ID('dbo.PhoneNumbers') IS NULL)

    BEGIN

       CREATE TABLE dbo.PhoneNumbers

       (

          Id INT NOT NULL IDENTITY,

          Phone VARCHAR(1000),

          PRIMARY KEY CLUSTERED

             (Id)

       )

       --TRUNCATE TABLE PhoneNumbers

       INSERT INTO PhoneNumbers

          SELECT TOP(1000000)

                CASE WHEN ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) % 10 = 0 THEN '0' ELSE '1' END +

                '(' + LEFT(CONVERT(VARCHAR(20), ABS(CHECKSUM(NEWID()))), 3) + ') ' +

                LEFT(CONVERT(VARCHAR(20), ABS(CHECKSUM(NEWID()))), 3) + '-' +

                LEFT(CONVERT(VARCHAR(20), ABS(CHECKSUM(NEWID()))), 4)

             FROM MASTER.sys.all_columns c1

                CROSS JOIN MASTER.sys.all_columns c2

       ALTER INDEX ALL ON PhoneNumbers REBUILD

    END

    -- <- Create some test data for phone number validation

    ---======================================

    GO

    DBCC FREEPROCCACHE WITH NO_INFOMSGS

    DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS

    GO

    ---====== SQL scalar ===================

    PRINT '---====== SQL scalar ==================='

    DECLARE @count INT

    SET STATISTICS TIME ON

    SELECT

          @count = COUNT(*)

       FROM PhoneNumbers

       WHERE

          dbo.Valid_Phone(Phone) != ''

    SET STATISTICS TIME OFF

    PRINT ''

    PRINT ''

    GO

    DBCC FREEPROCCACHE WITH NO_INFOMSGS

    DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS

    GO

    ---====== SQL inline ===================

    PRINT '---====== SQL inline ==================='

    DECLARE @count INT

    SET STATISTICS TIME ON

    SELECT

          @count = COUNT(*)

       FROM PhoneNumbers

          CROSS APPLY dbo.Valid_Phone_Inline(Phone) l

       WHERE

          l.OutputPhone != ''

    SET STATISTICS TIME OFF

    PRINT ''

    PRINT ''

    GO

    DBCC FREEPROCCACHE WITH NO_INFOMSGS

    DBCC DROPCLEANBUFFERS WITH NO_INFOMSGS

    GO

    ---====== CLR scalar ===================

    PRINT '---====== CLR scalar ==================='

    DECLARE @count INT

    SET STATISTICS TIME ON

    SELECT

          @count = COUNT(*)

       FROM PhoneNumbers

       WHERE

          dbo.ufn_clr_ValidPhone(Phone) != ''

    SET STATISTICS TIME OFF

    PRINT ''

    PRINT ''

    [/font]

    @J-F: Thanks a lot for the phone number samples since I had no idea about US phone numbers!

    Greets

    Flo

  • Hi Pieter

    Pieter (5/21/2009)


    First of all, thanks every one for your help so far. Its clear that I need to learn a few more things about performance of functions.

    Always welcome. Those kind of threads are the most interesting in my opinion 🙂

    Time stats:

    Query 1: CPU time = 655 ms, elapsed time = 406 ms.

    Query 2 (Scaler): CPU time = 11562 ms, elapsed time = 22861 ms.

    Query 3 (CLR, C#): CPU time = 5281 ms, elapsed time = 7296 ms.

    Could you please try my CLR function?

    Queries

    --SET STATISTICS IO on

    SET STATISTICS time on

    --Query 1; baseline (no function)

    SELECT TOP 100000

    mita_orig_id,

    isnull(PHONE,'') as new_phone

    FROM dbo.input_table

    WHERE LEN(PRIMARY_GIVEN_NAME) 0 AND LEN(SURNAME) 0 AND LEN(MTC_ADDRESS) 0 AND LEN(ZIP) 0

    dbcc freeproccache

    dbcc dropcleanbuffers

    -- Query 2; Scalar-valued function

    SELECT TOP 100000

    mita_orig_id,

    PDW_DL_DATA.DBO.VALID_PHONE(isnull(PHONE,'')) as new_phone

    FROM dbo.input_table

    WHERE LEN(PRIMARY_GIVEN_NAME) 0 AND LEN(SURNAME) 0 AND LEN(MTC_ADDRESS) 0 AND LEN(ZIP) 0

    dbcc freeproccache

    dbcc dropcleanbuffers

    -- Query 3; CLR function (C#)

    SELECT TOP 100000

    mita_orig_id,

    PDW_DL_DATA.DBO.UDF_VALID_PHONE(isnull(PHONE,'')) as new_phone

    FROM dbo.input_table

    WHERE LEN(PRIMARY_GIVEN_NAME) 0 AND LEN(SURNAME) 0 AND LEN(MTC_ADDRESS) 0 AND LEN(ZIP) 0

    SET STATISTICS IO off

    SET STATISTICS time off

    Maybe there are some further issues with your statement. Just to be sure could you please try to replace your "LEN(x) 0" with "x ''" statements and try to call the functions without the ISNULL()?

    Thanks

    Flo

  • For some info on scalar functions, including the difficulting in seeing how they perform

    http://sqlinthewild.co.za/index.php/2009/04/29/functions-io-statistics-and-the-execution-plan/

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (5/21/2009)


    For some info on scalar functions, including the difficulting in seeing how they perform

    http://sqlinthewild.co.za/index.php/2009/04/29/functions-io-statistics-and-the-execution-plan/

    Gail, generally you are correct. But there is a huge difference between Pieters scalar function and the one you use in your blog. It does not query any data.

    Maybe the performance could be optimized with a direct CTE which removes the special characters and a does the SUBSTRING checks. I did not yet try this.

    I just did not yet get an answer why Pieter doesn't add a new column to his table which contains the already formatted phone numbers.

    Greets

    Flo

  • Florian Reischl (5/21/2009)


    Gail, generally you are correct. But there is a huge difference between Pieters scalar function and the one you use in your blog. It does not query any data.

    True but, as far as I know, it will still be getting called on each execution (use profiler to check) and there's still overhead on each call. No where near as significant as when there's data queried, but still there.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (5/21/2009)


    Florian Reischl (5/21/2009)


    Gail, generally you are correct. But there is a huge difference between Pieters scalar function and the one you use in your blog. It does not query any data.

    True but, as far as I know, it will still be getting called on each execution (use profiler to check) and there's still overhead on each call. No where near as significant as when there's data queried, but still there.

    Completely agreed.

  • I've tried using scalar functions that don't access data and Gail is correct that most of the time they perform poorly.

    I've tried to break this query down for the mil rows supplied further up as test data.

    I'm trying to split out all the substring lefts etc and do comparisons at the end on a CTE and it performs very badly at the moment but still working on it 🙂

    ----------------------------------------------
    Try to learn something about everything and everything about something. - Thomas Henry Huxley

    :w00t:
    Posting Best Practices[/url]
    Numbers / Tally Tables[/url]

    SQL-4-Life
  • IS this query only supposed to return the correct phone numbers?

    If this is case I think I will work on a short circuit case statement in the where clause to do that might make it faster than having to do a function ...

    ----------------------------------------------
    Try to learn something about everything and everything about something. - Thomas Henry Huxley

    :w00t:
    Posting Best Practices[/url]
    Numbers / Tally Tables[/url]

    SQL-4-Life
  • J-F,

    Although I haven't worked with table-value functions, I am not sure that they work for my scenario. The validated phone number needs to be linked up to the original data, therefore at minimum I need a result giving me ID_field and validated_phone.

  • Florian,

    thanks for submitting the CLR (C# is fine) I compared it against mine and a baseline query without the function and found the following, see below. Looks like we use similar CPU times and you have more read-ahead reads. I am still trying to figure what that means.

    I will however have to continue this tomorrow, but I look forward to the comments.

    Thanks you all for your patience.

    Pieter

    -- Baseline query (no functions)

    CPU time = 388 ms, elapsed time = 487 ms.

    -- Piet's CLR

    CPU time = 5341 ms, elapsed time = 6771 ms.

    -- Florian Reischl CLR

    CPU time = 4978 ms, elapsed time = 6887 ms.

    -- Baseline query (no functions)

    Scan count 1,

    logical reads 0,

    physical reads 0,

    read-ahead reads 1879,

    lob logical reads 0,

    lob physical reads 0,

    lob read-ahead reads 0.

    -- Piet's CLR

    Scan count 1,

    logical reads 0,

    physical reads 0,

    read-ahead reads 3090,

    lob logical reads 0,

    lob physical reads 0,

    lob read-ahead reads 0.

    -- Florian Reischl CLR

    Scan count 1,

    logical reads 0,

    physical reads 0,

    read-ahead reads 22506,

    lob logical reads 0,

    lob physical reads 0,

    lob read-ahead reads 0.

    using System;

    using System.Data;

    using System.Data.SqlClient;

    using System.Data.SqlTypes;

    using Microsoft.SqlServer.Server;

    using System.Text.RegularExpressions;

    // This function will validate phone numbers and correct minor problems.

    public partial class UserDefinedFunctions

    {

    [Microsoft.SqlServer.Server.SqlFunction]

    public static SqlString UDF_VALID_PHONE(String pInput)

    {

    String vInput; //Place holder for the input

    String vOutput; //Place holder for the output

    String AreaCode;

    String PhoneNumber;

    // Assign input variable

    vInput = pInput;

    // Remove punctuations

    vInput = vInput.Replace("(", "");

    vInput = vInput.Replace(")", "");

    vInput = vInput.Replace("-", "");

    vInput = vInput.Replace(".", "");

    vInput = vInput.Replace(" ", "");

    //Validate the input string, remove 1 for long distance

    if (vInput.Length == 11)

    {

    // If input length is 11 and start with 1 then remove the first byte

    vInput = vInput.Substring(1, 10);

    }

    // further evaluate the string. If its too long or too short make it blank

    if (vInput.Length > 11)

    {

    vInput = "";

    }

    else if (vInput.Length < 10)

    {

    vInput = "";

    }

    //Validate the improved string that is 10 bytes

    else if ((vInput.Length == 10) && (Regex.IsMatch(vInput, @"[0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9][0-9]") == false))

    {

    // if input length = 10 but not numeric, discard the input

    vInput = "";

    }

    ////////////////////////////////////////////////////////////////////////////////////////

    // validate whether the new string has valid area code (first 3 bytes)

    else if (vInput.Length == 10)

    {

    // validate whether the new string has valid area code (first 3 bytes)

    AreaCode = vInput.Substring(0, 3);

    //Validate secod part of phone number

    PhoneNumber = vInput.Substring(3, 7);

    if ((AreaCode == "456")

    || (AreaCode == "500")

    || (AreaCode == "555")

    || (AreaCode == "600")

    || (AreaCode == "700")

    || (AreaCode == "710")

    || (AreaCode == "800")

    || (AreaCode == "822")

    || (AreaCode == "833")

    || (AreaCode == "844")

    || (AreaCode == "855")

    || (AreaCode == "866")

    || (AreaCode == "877")

    || (AreaCode == "888")

    || (AreaCode == "898")

    || (AreaCode == "900")

    || (AreaCode == "976")

    || (AreaCode == "999")

    || (PhoneNumber == "2222222")

    || (PhoneNumber == "3333333")

    || (PhoneNumber == "4444444")

    || (PhoneNumber == "5555555")

    || (PhoneNumber == "6666666")

    || (PhoneNumber == "7777777")

    || (PhoneNumber == "8888888")

    //|| (PhoneNumber == "9999999")

    || (vInput.Substring(0, 1) == "0") // check first byte

    || (vInput.Substring(0, 1) == "1") // check first byte

    || (vInput.Substring(1, 1) == "9") // check second byte

    || (vInput.Substring(1, 2) == "11") // check first byte

    || (vInput.Substring(3, 1) == "0") // check third byte

    || (vInput.Substring(3, 1) == "1") // check first byte

    || (vInput.Substring(4, 2) == "11") // check first byte

    )

    {

    vInput = "";

    }

    }

    //Assign modified vInput to vOutput

    vOutput = vInput;

    // Put your code here

    return vOutput ;

    }

    };

  • Hi Piet

    I just figured out. The problem of my (and your) CLR was the RegEx usage. It's much slower than your simple string.Replace. I tried your function and it was much faster than mine. After I removed the RegEx my function seems to be 3 times faster than yours. This depends on your RegEx.Match.

    Replace the RegEx.Replace line of my code with this:

    //string str = _phoneRegEx.Replace(strIn.Value, "");

    string str = strIn.Value;

    str = str.Replace("(", "");

    str = str.Replace(")", "");

    str = str.Replace("-", "");

    str = str.Replace(" ", "");

    I forgot the check if everything is numeric. Place the following code after the check for "str.Length != 10":

    if (str.Length != 10)

    return new SqlString("");

    for (int i = 0; i < str.Length; i++)

    {

    if (str < '0' || str > '9')

    return new SqlString("");

    }

    ... I tried a long.Parse but it seems to be slower. Don't use a foreach enumerator because this is also slower than a index based for-loop.

    An additional think I figured out in some other SQL CLR tests:

    Use the SqlString instead of the Syste.String. SQL Server seems to perform much better with this.

    Greets

    Flo

  • Pieter (5/21/2009)


    J-F,

    Although I haven't worked with table-value functions, I am not sure that they work for my scenario. The validated phone number needs to be linked up to the original data, therefore at minimum I need a result giving me ID_field and validated_phone.

    Pieter, the reason you are not able to link it to your data, is probably because you need to use the "Outer Apply" function, to get your new phone results in your query. You can also use the "Cross Apply", but in a simple select, you don't need to ensure the data is correct, just get it, so Outer apply will work perfectly.

    You can see Outer Apply as a left join, and Cross Apply as an Inner Join, for your information.

    Now, I think Florian has tested my function, and it does not seem to perform well, sometimes, Inline functions are more effective then Scalar, but since there is no data-access on your function, that might not be the case.

    Anyway, you can still test it, you can do something like this:

    select Phone from PhoneNumbers outer apply dbo.[Valid_Phone_Inline](Phone)

    Thanks,

    Cheers,

    J-F

Viewing 15 posts - 16 through 30 (of 41 total)

You must be logged in to reply to this topic. Login to reply