Performance issue with tally solution

  • Hey Paul again!

    Sorry! Didn't see your new post...

    I did not know that CLR functions can return IEnumerable. Thanks for that!

    I tested your function against mine:

    Start clr xml solution

    Milliseconds: 790 | Lines: 28545

    Start clr tvf (PW) solution

    Milliseconds: 1083 | Lines: 25661

    I don't really understand why the way over XML is faster than a direct TVF. The only idea I have is that SQL Server calls the FillRowMethod over reflection, what would be really ugly implementation! I usually handle those kinds of dynamic method calls over dynamic classes and delegates. It's a little more source code but the performance improvement is up to factor 1000.

    Greets

    Flo

    PS: Sorry also for the late answer. I had to restart my rooter about twenty times... Currently I'm doin' something like "stock posting" :hehe: in a text editor until internet gives me a minute or two...

  • Flo,

    More great stuff from you again...!

    I hope your internet connection is fixed soon.

    Off to bed now, but thanks for the XML CLR solution - neat (though I am naturally XML-sceptic).

    I disliked writing the CLR TVF intensely so it's good to see an alternative (and by all measures apparent to me) better solution.

    The documentation regarding IEnumerable and CLR TVFs in general is not fantastic in BOL - there must be something better out there on the interweb...

    Cheers,

    Paul

  • Paul White (4/13/2009)


    Flo,

    More great stuff from you again...!

    I hope your internet connection is fixed soon.

    Off to bed now, but thanks for the XML CLR solution - neat (though I am naturally XML-sceptic).

    I disliked writing the CLR TVF intensely so it's good to see an alternative (and by all measures apparent to me) better solution.

    The documentation regarding IEnumerable and CLR TVFs in general is not fantastic in BOL - there must be something better out there on the interweb...

    Cheers,

    Paul

    Currently I'm online 😉

    I already saw some other solutions working with XML in SQL Server. It seems MS really optimized it for performance! Since now there is no fantastic documentation about CLR in BOL for anything...

    Thanks again for all your help!

    Good night 🙂

    Flo

  • Florian Reischl (4/13/2009)


    I know Jeff will shoot me for this... 🙂

    @jeff

    The day to activate CLR on your servers is close 😀

    Heh... you may be right. Might also have to install C# and learn it, too. I've been waiting for someone to do these types of tests. Thanks, Flo.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Say it ain't so. This is the first I've heard a set based solution losing to RBAR.

    I don't have the desire to check right now, but what about using a CTE instead for the tally table?

    aktikt

  • aktikt (4/13/2009)


    Say it ain't so. This is the first I've heard a set based solution losing to RBAR.

    I don't have the desire to check right now, but what about using a CTE instead for the tally table?

    aktikt

    Hi aktikt

    Yes it is :-). .Net is more optimized for things like string operations and formatting in most other cases the set based solutions should be the better way.

    Which kind of CTE approach do you mean? A recursive CTE will miss because there are possibly more than 100 lines in the text. I would be glad if you have any other approach!

    Greets

    Flo

  • aktikt (4/13/2009)


    Say it ain't so. This is the first I've heard a set based solution losing to RBAR.

    I don't have the desire to check right now, but what about using a CTE instead for the tally table?

    aktikt

    Heh... actually, this is the second time. The While Loop beats the Tally table when "STUFF" is used to clean a string.

    The CTE method was one of the things I quickly checked... it still looses to the While Loop. The CLR smokes all of the methods. Glad someone finally tested that.

    I've got a lot of testing to do between VARCHAR(MAX) and VARCHAR(8000), but I don't know why anyone would write something that couldn't handle the MAX anymore. I'm thinking that the hey-day of the Tally table have finally been replaced by faster methods.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (4/13/2009)


    I'm thinking that the hey-day of the Tally table have finally been replaced by faster methods.

    Don't say that. There might be actually two business cases where the tally table looses but there are many more things to do with than only some string manipulations.

    Tally for ever! 😀

  • This reminds me a bit of the blog I did called 'Never say never to the 'While' loop.

    A Charindex and WHILE loop will always outperform a 'helper table' in certain circumstances. However, these cases are usually fairly esoteric string searches. They don't invalidate the value of the use of helper tables for the common task.

    Is there some test data to try this on? (I couldn't see any) I've been experimenting quite a bit with the NET Regex class for this sort of task, and would love to benchmark these techniques along with regex Split. I'm also very aware that every technique can be improved. For example, one can do the Charindex without the WHILE loop (though the code is a bit opaque, and one needs to have an upper limit to the final table)

    Best wishes,
    Phil Factor

  • Florian Reischl (4/13/2009)


    Jeff Moden (4/13/2009)


    I'm thinking that the hey-day of the Tally table have finally been replaced by faster methods.

    Don't say that. There might be actually two business cases where the tally table looses but there are many more things to do with than only some string manipulations.

    Tally for ever! 😀

    Thanks for the vote of confidence... Like you said, two business cases aren't necessarily a killer, but they do indicate serious potential can be had... to wit, I've got a whole lot of testing to do across a whole lot of scenarios. As Phil mentioned, that would also include some testing on things not having to do with a While Loop.

    You've done a couple of very cool tricks in the While Loop that most have not... like calculating the CHARINDEX only once instead of twice like a good number of folks end up doing. Others have tried methods similar to yours but have failed because of the doubled CHARINDEX and a couple of other nuances that you've very cleverly avoided. It's brilliant code in its simplicity and my hat's off to you.

    You should write an article about this, Flo... this is good stuff.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • [font="Verdana"]Flo, I'll add my kudos as well: it's good to see some numbers showing how CLR rates in comparison with T-SQL. And I'll look forward to the article. 😉

    BTW, in your while loop:

    WHILE EXISTS (SELECT TOP(1) 1 FROM @procs)

    BEGIN

    SELECT TOP(1) @name = name, @definition = definition FROM @procs

    [...]

    END

    Did you try this approach:

    while (1=1) begin

    select top (1) @name = name, @definition = definition from @procs (order by?)

    if @@rowcount = 0 break;

    [...]

    end; -- while

    In the times where I resort to a while loop (er, trying to do it less, honest Jeff!) I often find myself using that format, because otherwise I end up writing more code and making SQL Server do more work. So taking out the additional select may make your while loop a little faster. Just curious as to whether it changes the numbers a little.

    P.S. Phil Factor's blog post makes interesting reading, particularly for this little gem:

    Jeff Moden


    P.S. Heh... Always say "Never" to a RBAR While loop... if the While loop beats the set based method, there's something else wrong. 😉

    :w00t:

    [/font]

  • Jeff Moden (4/13/2009)


    Florian Reischl (4/13/2009)


    Jeff Moden (4/13/2009)


    I'm thinking that the hey-day of the Tally table have finally been replaced by faster methods.

    Don't say that. There might be actually two business cases where the tally table looses but there are many more things to do with than only some string manipulations.

    Tally for ever! 😀

    Thanks for the vote of confidence... Like you said, two business cases aren't necessarily a killer, but they do indicate serious potential can be had... to wit, I've got a whole lot of testing to do across a whole lot of scenarios. As Phil mentioned, that would also include some testing on things not having to do with a While Loop.

    You've done a couple of very cool tricks in the While Loop that most have not... like calculating the CHARINDEX only once instead of twice like a good number of folks end up doing. Others have tried methods similar to yours but have failed because of the doubled CHARINDEX and a couple of other nuances that you've very cleverly avoided. It's brilliant code in its simplicity and my hat's off to you.

    You should write an article about this, Flo... this is good stuff.

    My business case was definitely a special case depending on the length of the text. Sure in most other cases the set based solution will still be the faster and definitely guideline conform solution.

    I did not think specially on the once called CHARINDEX function within the loop. I think the was because I'm an old C guy and always think about "every method call has its price" :-).

    I will start next days with some more investigations about split functions and maybe some other good reasons for CLR integrating approaches. After this I will write an article about this with all test data.

    Thanks a lot for all this praise!!! :blush:

    Flo

  • Hi Bruce

    Bruce W Cassidy (4/13/2009)


    [font="Verdana"]Flo, I'll add my kudos as well: it's good to see some numbers showing how CLR rates in comparison with T-SQL. And I'll look forward to the article. 😉

    Thank you also!!!

    BTW, in your while loop:

    WHILE EXISTS (SELECT TOP(1) 1 FROM @procs)

    BEGIN

    SELECT TOP(1) @name = name, @definition = definition FROM @procs

    [...]

    END

    Did you try this approach:

    while (1=1) begin

    select top (1) @name = name, @definition = definition from @procs (order by?)

    if @@rowcount = 0 break;

    [...]

    end; -- while

    In the times where I resort to a while loop (er, trying to do it less, honest Jeff!) I often find myself using that format, because otherwise I end up writing more code and making SQL Server do more work. So taking out the additional select may make your while loop a little faster. Just curious as to whether it changes the numbers a little.

    P.S. Phil Factor's blog post makes interesting reading, particularly for this little gem:

    Jeff Moden


    P.S. Heh... Always say "Never" to a RBAR While loop... if the While loop beats the set based method, there's something else wrong. 😉

    :w00t:

    Thanks also for this performance improvement!

    Anyway, it's very confusing to discuss all this optimizations for loops with all of you... :Wow:

    Greets

    Flo

  • Jeff may not remember it but he helped me out with some tally-based line-splitting code over the summer and there were several unusual results to come out of it:

    1) For splitting the lines in stored procedure definitions, there is a subtle bug that exists in almost every line-splitting routine, including the ones so far in this thread: there is actually two different kinds of line-breaks incorporated in the system stored procedures: CR-LF, and LF alone. To correctly split them, you must take both into account (which is not trivial).

    2) The fastest tally-method for line-splitting turned out to be, neither an in-line join (as is being tested here), nor even an inline table-valued Function, but suprisingly a multi-line TVF.

    3) The final observation that I had at that time was that the loop-based routines start to catch up with the tally-based methods if the average distance between each separator gets large enough, and for line-splitting system procedures, they tend to be quite large of course (as opposed to comma-separated strings, for instance). This is because the loop-based methods can skip the ahead of "dead" characters in-between with CHARINDEX(). Because Florian figured out how to call CHARINDEX only once per loop, that lowered where that threshold is.

    So anyway, I wanted to test my line splitting functions against these also, so here are the results:

    Start tally solution

    Milliseconds: 6546 | Lines: 27106

    Start Tally-function 3 solution (correct line-splitting)

    Milliseconds: 3830 | Lines: 28519

    Start Tally-function 4 solution (has same bug as the rest)

    Milliseconds: 3653 | Lines: 28119

    Start cursor solution

    Milliseconds: 2376 | Lines: 28119

    these results are pretty stable on my system.

    I'll post my code shortly...

    [font="Times New Roman"]-- RBarryYoung[/font], [font="Times New Roman"] (302)375-0451[/font] blog: MovingSQL.com, Twitter: @RBarryYoung[font="Arial Black"]
    Proactive Performance Solutions, Inc.
    [/font]
    [font="Verdana"] "Performance is our middle name."[/font]

  • RBarryYoung (4/13/2009)


    1) For splitting the lines in stored procedure definitions, there is a subtle bug that exists in almost every line-splitting routine, including the ones so far in this thread: there is actually two different kinds of line-breaks incorporated in the system stored procedures: CR-LF, and LF alone. To correctly split them, you must take both into account (which is not trivial).

    It's pretty trivial in the CLR case 🙂

    RBarryYoung (4/13/2009)


    3) The final observation that I had at that time was that the loop-based routines start to catch up with the tally-based methods if the average distance between each separator gets large enough, and for line-splitting system procedures, they tend to be quite large of course (as opposed to comma-separated strings, for instance). This is because the loop-based methods can skip the ahead of "dead" characters in-between with CHARINDEX(). Because Florian figured out how to call CHARINDEX only once per loop, that lowered where that threshold is.

    *cough*

    Paul White


    The tally or numbers table solution is, to an extent, a brute-force approach. In the case of many long strings where the frequency of the searched-for string is low, a solution based on a loop/CHARINDEX approach should out-perform the tally approach.

    RBarryYoung (4/13/2009)


    So anyway, I wanted to test my line splitting functions against these also, so here are the results:

    Flo


    Start clr xml solution

    Milliseconds: 790 | Lines: 28545

    Start clr tvf (PW) solution

    Milliseconds: 1083 | Lines: 25661

    Still slower than both CLR implementations 😉

    I'm still wondering why the CLR TVF is so much slower than the XML. Note that the line counts are different simply because the TVF omits empty results:

    string[] items = toSplit.Split(new string[] { delimeter }, StringSplitOptions.RemoveEmptyEntries);)

Viewing 15 posts - 16 through 30 (of 522 total)

You must be logged in to reply to this topic. Login to reply