Long execution times when including simple SQL functions in a join

  • Again, thanks for all your input, everyone. I have to do production work today and will not be able to work on this problem ;-(

    I did talk to one of our hardcore programmers, and he suggested moving the loop to the top of the code, with the rationale that REPLACE is expensive: because each REPLACE involves its own loop over the string, calling it four times wastes four passes.

    The other train of thought is to use the WHERE clause to restrict the number of records passing through the function: 'Time not spent is truly time saved'.

    Enjoy the weekend!

  • Hi Piet

    Pieter (5/22/2009)


    Again, thanks for all your input, everyone. I have to do production work today and will not be able to work on this problem ;-(

    I know what you mean about wanting to work on a problem, but there is always somebody or something else...

    I did talk to one of our hardcore programmers, and he suggested moving the loop to the top of the code, with the rationale that REPLACE is expensive: because each REPLACE involves its own loop over the string, calling it four times wastes four passes.

    I'm also primarily a C# developer and had the same idea ;-). The previous version of my validation function seems to be overkill for most business cases. Anyway, your problem apparently needs the best possible performance, so a more complicated solution seems justified. I changed the input type from SqlString to SqlChars, which is faster, and used two char arrays instead of a string. If you find a faster version without interop, please let me know.

    The other train of thought is to use the WHERE clause to restrict the number of records passing through the function: 'Time not spent is truly time saved'.

    I understand the intention, but your WHERE clause could probably be optimized too. Your current clause:

    WHERE LEN(PRIMARY_GIVEN_NAME) <> 0 AND LEN(SURNAME) <> 0 AND LEN(MTC_ADDRESS) <> 0 AND LEN(ZIP) <> 0

    Function calls within a WHERE clause are always a potential problem for indexes. Try this instead:

    WHERE PRIMARY_GIVEN_NAME <> '' AND SURNAME <> '' AND MTC_ADDRESS <> '' AND ZIP <> ''
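
    To see why this matters, here is a minimal sketch (the table and index below are hypothetical, not your real schema): wrapping the column in LEN() forces the function to be evaluated for every row, while the bare column comparison can be matched against an index.

    -- Hypothetical repro; dbo.Person and IX_Person_Surname are placeholders.
    CREATE TABLE dbo.Person
    (
        PersonID INT IDENTITY(1,1) PRIMARY KEY,
        SURNAME  VARCHAR(50) NOT NULL
    );
    CREATE INDEX IX_Person_Surname ON dbo.Person (SURNAME);

    -- Non-sargable: LEN(SURNAME) must be computed for every row,
    -- so the optimizer typically scans instead of seeking.
    SELECT PersonID FROM dbo.Person WHERE LEN(SURNAME) <> 0;

    -- Sargable: the bare column comparison can be satisfied from the index
    -- (the optimizer can rewrite <> '' as two seekable ranges).
    SELECT PersonID FROM dbo.Person WHERE SURNAME <> '';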

    Your current call of the function:

    PDW_DL_DATA.DBO.VALID_PHONE(isnull(PHONE,'')) as new_phone

    The ISNULL function causes an extra memory copy for each phone number. Let .NET do the NULL check instead (you can see this in my function below). Just call it like this:

    PDW_DL_DATA.DBO.VALID_PHONE(PHONE) as new_phone

    Here is my current version of the C# validation function:

    using System;
    using System.Data;
    using System.Data.SqlTypes;
    using Microsoft.SqlServer.Server;

    public partial class UserDefinedFunctions
    {
        [Microsoft.SqlServer.Server.SqlFunction()]
        public static SqlString ufn_clr_ValidPhone(
            // Take a char array instead of a string
            [SqlFacet(MaxSize = 4000)]
            SqlChars strIn
            )
        {
            // Return "" if the input was NULL
            if (strIn.IsNull)
                return new SqlString("");

            // Work with a char array instead of a string
            char[] input = strIn.Value;

            // Initialize the output buffer
            char[] output = new char[input.Length];

            // Start position and length within the output array
            int start;
            int len = 0;

            // Buffers to avoid repeating the same memory copies, since we do not use unsafe code
            char c;
            string item;

            // Copy only the numeric characters into the output array
            for (int i = 0; i < input.Length; i++)
            {
                char current = input[i];
                if (current >= '0' && current <= '9')
                {
                    output[len++] = current;
                }
            }

            // If we have a len of 11, skip the leading digit (country code)
            if (len == 11)
                start = 1;
            else
                start = 0;

            // If the remaining length is not 10, return ""
            if (len - start != 10)
                return new SqlString("");

            // 1st digit must not be 0 or 1
            c = output[start + 0];
            if (c == '0' || c == '1')
                return new SqlString("");

            // 2nd digit must not be 9
            c = output[start + 1];
            if (c == '9')
                return new SqlString("");

            // 2nd and 3rd digits must not both be 1
            if (c == '1' && output[start + 2] == '1')
                return new SqlString("");

            // 4th digit must not be 0 or 1
            c = output[start + 3];
            if (c == '0' || c == '1')
                return new SqlString("");

            // 5th and 6th digits must not both be 1
            if (output[start + 4] == '1' && output[start + 5] == '1')
                return new SqlString("");

            // Get the first three digits (the area code) and validate them
            item = new string(output, start, 3);
            if (item == "456" || item == "500" || item == "555" || item == "600" ||
                item == "700" || item == "710" || item == "800" || item == "822" ||
                item == "833" || item == "844" || item == "855" || item == "866" ||
                item == "877" || item == "888" || item == "898" || item == "900" ||
                item == "976" || item == "999")
                return new SqlString("");

            // Get the last seven digits and validate them
            item = new string(output, start + 3, 7);
            if (item == "2222222" || item == "3333333" || item == "4444444" ||
                item == "5555555" || item == "6666666" || item == "7777777" ||
                item == "8888888" || item == "9999999")
                return new SqlString("");

            // All checks passed: return the 10 significant digits
            // (len - start, not len, so a leading country code is not re-read)
            return new SqlString(new string(output, start, len - start));
        }
    };

    On my box this version is again twice as fast as the previous one (the version without the RegEx).
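
    In case it helps, registering such a CLR function looks roughly like this (a sketch only: the assembly name, DLL path, and SQL-side function name are placeholders, not from this thread):

    -- Sketch: assembly name and path are placeholders.
    CREATE ASSEMBLY ValidPhoneAssembly
    FROM 'C:\clr\ValidPhone.dll'
    WITH PERMISSION_SET = SAFE;
    GO
    CREATE FUNCTION dbo.VALID_PHONE (@strIn NVARCHAR(4000))
    RETURNS NVARCHAR(4000)
    AS EXTERNAL NAME ValidPhoneAssembly.UserDefinedFunctions.ufn_clr_ValidPhone;
    GO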

    Enjoy the weekend!

    Wishing you a nice weekend as well.

    Flo

  • After seeing how fast the various renditions of the function run on a million rows, I'd say that it is NOT the function that is the problem. Adding such a function to a SELECT list would not turn a 20-hour run into a 3-day run or a run that never finishes.

    Flo started to identify the problem when he pointed at the OP's use of functions against columns. Something else was added somewhere besides the SELECT list.

    I also say that there's a much, much bigger problem at hand. I used to assemble all of the monthly long-distance calls for a half million customers, import their CDRs, rate the CDRs, and put them in EMI format for delivery to a billing house in less than 4 hours on an old P3-350, and THAT included 3 legacy reel-to-reel tape changes! It also included To/From City, State, and Country assignments along with some very complex validations. And all of that was before I even knew about the real power of indexes or that you could tune queries to use them.

    My point isn't to brag... my point is that I've been doing this stuff for the last 14 years, and a 20-hour run for anything having to do with telephony has something seriously wrong with it.

    Pieter, you need to start looking for performance problems in the code somewhere besides how this function was built. Look for non-sargable WHERE clauses, joins that return more internal rows than any of your tables, joined updates where the target of the update isn't contained in the FROM clause, etc., etc., etc.
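
    To make that last item concrete, here is a sketch of the joined-update problem (table names are hypothetical, not from this thread). When the update target never appears in the FROM clause, SQL Server in effect cross joins it in, and each target row gets an arbitrary matching value:

    -- Anti-pattern: dbo.Target is updated but never joined in the FROM clause.
    UPDATE dbo.Target
    SET    Phone = s.Phone
    FROM   dbo.Source s;   -- no join back to dbo.Target

    -- Corrected: the update target appears in the FROM clause with an explicit join.
    UPDATE t
    SET    Phone = s.Phone
    FROM   dbo.Target t
    JOIN   dbo.Source s
      ON   s.PersonID = t.PersonID;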

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff,

    You will get no argument from me that the problems are not just the functions that we run. Grinding through 200+ million records (400-500 GB) does take some effort. Other software has shown that we have I/O issues, but that is not something I can change myself. Although the impact may be smaller, I can try to work with the various SQL developers to add WHERE clauses and to optimize queries, procedures, and functions to get the most out of the current configuration.

    We observed a performance drop on this specific machine that happened almost overnight, without a good explanation, and it was most noticeable when running functions. Digging into the functions was quite an eye-opener, and I intend to play with this some more once I knock out the production work.

  • Are the 200 million rows all CDR's?

    --Jeff Moden



  • Sorry, I am not familiar with the term "CDR's"

  • CDR = "Call Data Record"

    I used to work in that field too... 🙂

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Pieter,

    did you try the CLR function?

    Greets

    Flo

  • RBarryYoung (5/26/2009)


    CDR = "Call Data Record"

    I used to work in that field too... 🙂

    We used to call them "Call Detail Record". Either will do.

    --Jeff Moden



  • Pieter (5/26/2009)


    Sorry, I am not familiar with the term "CDR's"

    "Call Detail Records" or "Call Data Records". Based on all the talk about phone numbers, I thought you were taking raw CDR's and rating/billing them.

    Back to my original question in a slightly different format... are the 200 million rows all in the same table? Either way, what kind of rows are you talking about?

    Also, you say that you grind through 200 million rows. That's not really that many rows... are you trying to update all 200 million rows at the same time? I'm asking because we might be able to help in a fairly simple manner, but we need to know just a bit more about the process.

    That's if you're still interested. 😀
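
    For example, one common pattern, just as a sketch and not necessarily the right fix here, is to break a huge update into batches so the transaction log and locking stay manageable (the table and column names below are placeholders):

    -- Hypothetical batching sketch; dbo.BigTable and NEW_PHONE are placeholders.
    DECLARE @rows INT;
    SET @rows = 1;
    WHILE @rows > 0
    BEGIN
        UPDATE TOP (50000) dbo.BigTable
        SET    NEW_PHONE = PDW_DL_DATA.DBO.VALID_PHONE(PHONE)
        WHERE  NEW_PHONE IS NULL;   -- only rows not yet processed

        SET @rows = @@ROWCOUNT;
    END;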

    --Jeff Moden



  • Jeff Moden (5/26/2009)


    RBarryYoung (5/26/2009)


    CDR = "Call Data Record"

    I used to work in that field too... 🙂

    We used to call them "Call Detail Record". Either will do.

    Actually, I think you're right; it was over 20 years ago (and I was more on the generating side than the post-processing side).

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • Hi again,

    I had to take care of some work here. I did try the second CLR function that Florian proposed and compared it to his first version and to mine; see the results below. There seems to be some reduction in I/O at the expense of some CPU time; I'm not sure it is significant (that would require further testing on my part).

    To give you some background on our system: I work with marketing data, and the large datasets contain personal information and attributes (credit, geographical location, housing, etc.; yes, everything is tracked nowadays ;-). Most of these tables come in at 100-250 million records by 400-800 columns each! Even when you use the correct data types, these tables span 100-500 GB initially. I try to minimize direct access to these tables by making small subsets, but there are always individuals who feel a need to join the full tables, preferably via a linked server connection.

    Anyway, Jeff, I am always interested in making this run faster. I know our I/O is the big bottleneck, not our 4 quad-core CPUs, but addressing that requires capital to physically change the setup. I would also need to propose an alternative to the PERC6 cards and the 4 drives (PowerVaults) that I am currently utilizing (1 drive for TEMPDB and three drives on 2 PERC6 cards). Any suggestions are welcome and appreciated.
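
    For reference, the figures below come from SQL Server's statistics output; the harness looked roughly like this (the real test table and query aren't shown here, so dbo.PhoneTest is a placeholder):

    -- Measurement harness sketch; dbo.PhoneTest stands in for the real test table.
    DBCC DROPCLEANBUFFERS;   -- cold buffer cache, so read-ahead reads are comparable
    SET STATISTICS TIME ON;
    SET STATISTICS IO ON;

    SELECT PDW_DL_DATA.DBO.VALID_PHONE(PHONE) AS new_phone
    FROM   dbo.PhoneTest;

    SET STATISTICS TIME OFF;
    SET STATISTICS IO OFF;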

    --Results

    --Query 1: baseline (no function)
    CPU time = 478 ms, elapsed time = 749 ms.
    Scan count 1, logical reads 0, physical reads 0, read-ahead reads 23978

    --Query 3a: CLR function (Piet)
    CPU time = 5205 ms, elapsed time = 7904 ms.
    Scan count 1, logical reads 0, physical reads 0, read-ahead reads 22442

    --Query 3b: CLR function (Florian Reischl, 1)
    CPU time = 5463 ms, elapsed time = 7961 ms.
    Scan count 1, logical reads 0, physical reads 0, read-ahead reads 22506

    --Query 3c: CLR function (Florian Reischl, 2)
    CPU time = 5604 ms, elapsed time = 7305 ms.
    Scan count 1, logical reads 0, physical reads 0, read-ahead reads 22250
