Performance issue with tally solution

  • Eugene Elutin (9/5/2012)


    paul.knibbs (9/5/2012)


    Adam Machanic (9/4/2012)

    The code that follows is my crack at a more memory efficient solution. It works only for a single delimiter (I never did see the point of multi-character delimiters, to be honest)

    The point of multi-character delimiters is to delimit strings which may legitimately contain the delimiter character itself as data. It's quite common for a combination of non-alphanumeric characters to be used to delimit text data. And of course there is the example of mixed line-ending delimiters: CRLF, CR, LF, or even LFCR.

    Also, I would like to add my two pence into discussion of CLR performance:

    1. You may declare the same CLR function (operating with strings) multiple times in SQL. What will it give you? You will find that the performance of VARCHAR(100) and VARCHAR(MAX) is quite different... So, the same function can be "optimised" for use with smaller strings.

    2. If you use RegEx in CLR, you will find that declaring it as static with the Compiled option boosts its performance. Yes, it is less flexible, as you cannot pass the regular expression in as a parameter, but it's significantly faster because the Regex object is initialised only once.
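    A minimal sketch of the contrast in item 2 (the names and pattern here are illustrative only, not code from this thread):

    using System.Text.RegularExpressions;

    public class RegexStyles
    {
        // Re-created on every call: the pattern is re-parsed each time.
        public static bool MatchPerCall(string input)
        {
            return new Regex(@"^[0-9]+$").IsMatch(input);
        }

        // Static and compiled once: parsed and compiled a single time when
        // the type is loaded, then reused by every subsequent call.
        static readonly Regex _compiled = new Regex(@"^[0-9]+$", RegexOptions.Compiled);

        public static bool MatchStatic(string input)
        {
            return _compiled.IsMatch(input);
        }
    }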

    Perhaps we actually agree on this, but I'm trying to find out. Just like I said in my previous post, and just like you implied in item 1 above, there are certain optimizations you can make for each type of delimiter (single or multi-character), just as you can for the different datatypes.

    In other words, while general-purpose split code makes life easy for developers, general-purpose code is almost never as fast as purposed code. To wit, I don't believe that a splitter should be made to handle both single and multi-character delimiters, just as I wouldn't simply change the DelimitedSplit8K function to handle NVARCHAR and blob data types.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • test...

    --
    Adam Machanic
    whoisactive

  • Pong

  • paul.knibbs (9/5/2012)


    What about a DOS-formatted text file where the delimiter at the end of lines is a carriage return/line feed pair? 🙂

    Not a problem. Choose either character as your delimiter. Both characters are considered to be white space (in .NET -- not in SQL Server for some reason), and so the character other than the one you've chosen will be trimmed off by the splitter.
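    A quick standalone check of the .NET side of that claim (a minimal sketch; not code from this thread):

    using System;

    class WhitespaceCheck
    {
        static void Main()
        {
            // .NET classifies both CR and LF as white space, which is why the
            // splitter trims off whichever of the pair you did not choose.
            Console.WriteLine(char.IsWhiteSpace('\r')); // True
            Console.WriteLine(char.IsWhiteSpace('\n')); // True
        }
    }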

    Next! 😀

    --
    Adam Machanic
    whoisactive

  • Eugene Elutin (9/5/2012)


    Also, I would like to add my two pence into discussion of CLR performance:

    1. You may declare the same CLR function (operating with strings) multiple times in SQL. What will it give you? You will find that the performance of VARCHAR(100) and VARCHAR(MAX) is quite different... So, the same function can be "optimised" for use with smaller strings.

    2. If you use RegEx in CLR, you will find that declaring it as static with the Compiled option boosts its performance. Yes, it is less flexible, as you cannot pass the regular expression in as a parameter, but it's significantly faster because the Regex object is initialised only once.

    I do not agree that there is any point in trying to optimize for smaller strings. Especially given the streaming solution, I think you'd be hard pressed to find a case where you'd get any benefit from having more than one function.

    RegEx, same deal. We tried that upthread and found that it was actually slower than doing a simple string comparison. But maybe we missed something? Show us a test case where it's faster!

    In both cases I'd question the benefit over the obvious maintainability issues. One function per character on the compiled RegEx side, multiplied by one function each for small and large strings. That's a lot of functions. I'd rather just come up with one function that does everything as well as it possibly can. And I think we've done that already, but there's always room for a bit more optimization.
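    For reference, a streaming CLR splitter along the lines discussed here might look roughly like this (a sketch assuming a single-character, non-null delimiter; the names and table definition are illustrative, not the actual code from earlier in the thread):

    using System.Collections;
    using System.Data.SqlTypes;
    using Microsoft.SqlServer.Server;

    public partial class UserDefinedFunctions
    {
        // A minimal streaming splitter: rows are yielded one at a time as the
        // string is scanned, so the full result set is never held in memory.
        [SqlFunction(FillRowMethodName = "FillSplitRow", TableDefinition = "item NVARCHAR(4000)")]
        public static IEnumerable SplitStream(SqlString input, SqlString delimiter)
        {
            if (input.IsNull) yield break;
            string s = input.Value;
            char d = delimiter.Value[0];
            int start = 0;
            for (int i = 0; i < s.Length; i++)
            {
                if (s[i] == d)
                {
                    yield return s.Substring(start, i - start);
                    start = i + 1;
                }
            }
            yield return s.Substring(start);
        }

        public static void FillSplitRow(object row, out SqlString item)
        {
            item = new SqlString((string)row);
        }
    }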

    --
    Adam Machanic
    whoisactive

  • I do not agree that there is any point in trying to optimize for smaller strings. Especially given the streaming solution, I think you'd be hard pressed to find a case where you'd get any benefit from having more than one function.

    You may not agree. But if you need to split both NVARCHAR(MAX) and NVARCHAR(100), a CLR function declared with NVARCHAR(100) as the input parameter will perform faster than one declared with NVARCHAR(MAX).

    RegEx, same deal. We tried that upthread and found that it was actually slower than doing a simple string comparison. But maybe we missed something? Show us a test case where it's faster!

    In both cases I'd question the benefit over the obvious maintainability issues. One function per character on the compiled RegEx side, multiplied by one function each for small and large strings. That's a lot of functions. I'd rather just come up with one function that does everything as well as it possibly can. And I think we've done that already, but there's always room for a bit more optimization.

    Same here. I've not stated that RegEx should be used for splitting strings or for finding an exact string within a string. RegEx is not a tool for that. It is for finding a PATTERN. Let's say you want to find every occurrence of a GUID (any GUID) within a string... I don't think you will get too far with "simple string comparison".
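    For illustration, a static, compiled Regex for that GUID case might look like this (the pattern covers the common 8-4-4-4-12 hex form; an illustrative sketch, not code from this thread):

    using System.Text.RegularExpressions;

    public class GuidFinder
    {
        // Static and compiled once, per the earlier advice; matches the
        // common 8-4-4-4-12 hexadecimal GUID layout.
        static readonly Regex _guidRegex = new Regex(
            @"\b[0-9A-Fa-f]{8}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{4}-[0-9A-Fa-f]{12}\b",
            RegexOptions.Compiled);

        public static MatchCollection FindGuids(string input)
        {
            return _guidRegex.Matches(input);
        }
    }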

    My point was about cases where you do use RegEx for what it is really best at, and it is a very valid point in terms of performance. Right now the only thing we have in T-SQL is PATINDEX, which can only find a pattern within a string. I think even if MS implements full RegEx functionality in T-SQL (e.g. a RegEx.Replace method), using CLR where the Regex is defined as static will still be faster.

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help

  • Eugene Elutin (9/5/2012)


    You may not agree. But if you need to split both NVARCHAR(MAX) and NVARCHAR(100), a CLR function declared with NVARCHAR(100) as the input parameter will perform faster than one declared with NVARCHAR(MAX).

    Can you prove that?

    Same here. I've not stated that RegEx should be used for splitting strings or for finding an exact string within a string. RegEx is not a tool for that. It is for finding a PATTERN. Let's say you want to find every occurrence of a GUID (any GUID) within a string... I don't think you will get too far with "simple string comparison".

    This is a thread about string splitting.

    --
    Adam Machanic
    whoisactive

  • Adam Machanic (9/5/2012)


    Eugene Elutin (9/5/2012)


    You may not agree. But if you need to split both NVARCHAR(MAX) and NVARCHAR(100), a CLR function declared with NVARCHAR(100) as the input parameter will perform faster than one declared with NVARCHAR(MAX).

    Can you prove that?

    Same here. I've not stated that RegEx should be used for splitting strings or for finding an exact string within a string. RegEx is not a tool for that. It is for finding a PATTERN. Let's say you want to find every occurrence of a GUID (any GUID) within a string... I don't think you will get too far with "simple string comparison".

    This is a thread about string splitting.

    1. In short: yes, I can.

    2. There are over 52 pages in this thread discussing many different aspects, not only string splitting. I've just read through a few pages and found that a few people mentioned using RegEx. Hence my advice...

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help

  • Eugene Elutin (9/5/2012)


    1. In short: yes, I can.

    Great, I look forward to seeing your repro.

    --
    Adam Machanic
    whoisactive

  • I took my function from this one: http://www.sqlservercentral.com/Forums/Topic1348575-338-1.aspx

    CLR:

    using System;
    using System.Data;
    using System.Data.SqlClient;
    using System.Data.SqlTypes;
    using Microsoft.SqlServer.Server;
    using System.Text.RegularExpressions;

    namespace CLRPlay
    {
        public partial class UserDefinedFunctions
        {
            // Static and compiled once: strips leading and trailing zeros.
            static readonly Regex _regex = new Regex(@"^[0]+|[0]+$", RegexOptions.Compiled);

            [Microsoft.SqlServer.Server.SqlFunction]
            public static SqlString TrimLTZr(SqlString val)
            {
                if (val.IsNull) return SqlString.Null;
                return _regex.Replace(val.ToString(), String.Empty);
            }
        }
    }

    Registration:

    /****** Object: UserDefinedFunction [dbo].[TrimLTZr] Script Date: 09/05/2012 16:51:51 ******/
    CREATE FUNCTION [dbo].[TrimLTZr](@val [nvarchar](550))
    RETURNS [nvarchar](550) WITH EXECUTE AS CALLER
    AS
    EXTERNAL NAME [CLRPlay].[CLRPlay.UserDefinedFunctions].[TrimLTZr]
    GO

    CREATE FUNCTION [dbo].[TrimLTZrM](@val [nvarchar](max))
    RETURNS [nvarchar](max) WITH EXECUTE AS CALLER
    AS
    EXTERNAL NAME [CLRPlay].[CLRPlay.UserDefinedFunctions].[TrimLTZr]
    GO

    Test (using Jeff Moden's method of time measurement):

    --generate 1,000,000 rows of test data
    SELECT TOP 1000000
           SomeString = REPLICATE('0',ABS(CHECKSUM(NEWID()))%5) -- 0 to 4 leading zeros
                      + 'temp'+RIGHT('0000000'+CAST(ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) AS VARCHAR(10)),7)
      INTO #TestTable
      FROM sys.all_columns ac1
     CROSS JOIN sys.all_columns ac2
    --=======================================================================================
    -- Run the tests using a duration timer because SET STATISTICS TIME ON induces
    -- large delay times because of the RBAR nature of Scalar UDFs.
    --=======================================================================================
    --===== Declare some obviously named variables.
    DECLARE @Bitbucket  VARCHAR(8000),
            @StartTime  DATETIME,
            @DurationMS INT
    --===== Run the tests on each function
    RAISERROR('=======================================',0,1) WITH NOWAIT;
    RAISERROR('============ Running Tests ============',0,1) WITH NOWAIT;
    RAISERROR('=======================================',0,1) WITH NOWAIT;
    -- Note that @Bitbucket allows us to take display and disk time out of the equation.
    -----------------------------------------------------------------------------------------
    SELECT @StartTime = GETDATE();
    RAISERROR('============ TrimLTZr Varchar(550)',0,1) WITH NOWAIT;
    SELECT @Bitbucket = dbo.TrimLTZr(SomeString)
      FROM #TestTable;
    SELECT @DurationMS = DATEDIFF(ms,@StartTime,GETDATE())
    RAISERROR('Duration ms: %u',0,1,@DurationMS) WITH NOWAIT;
    -----------------------------------------------------------------------------------------
    SELECT @StartTime = GETDATE();
    RAISERROR('============ TrimLTZr Varchar(MAX)',0,1) WITH NOWAIT;
    SELECT @Bitbucket = dbo.TrimLTZrM(SomeString)
      FROM #TestTable;
    SELECT @DurationMS = DATEDIFF(ms,@StartTime,GETDATE())
    RAISERROR('Duration ms: %u',0,1,@DurationMS) WITH NOWAIT;

    Results:

    =======================================
    ============ Running Tests ============
    =======================================
    ============ TrimLTZr Varchar(550)
    Duration ms: 2610
    ============ TrimLTZr Varchar(MAX)
    Duration ms: 14843

    Sorry to keep you waiting...

    Please note: the input test data is the same for both functions and it is not NVARCHAR(MAX).

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help

  • Interesting repro.

    I can see the same difference you do on both SQL Server 2008 and 2012, but I'm not sure the concerns are the same across problem domains. Your repro shows a scalar operation ("trim"), whereas this thread is primarily concerned with a table-valued operation ("split"). We're talking about what amounts to, on average, around a 9 microsecond difference per call on my end; a bit more on yours, since (14843 - 2610) ms spread over 1,000,000 calls is roughly 12 microseconds each. When making a scalar call on every row of a table that certainly adds up, but table-valued functions, especially split, aren't used that way -- at least, not in my experience. Have you seen a real-world workload that would actually benefit enough to justify the maintainability overhead that having multiple split functions would introduce?

    --
    Adam Machanic
    whoisactive

    I have seen overhead with operations on NVARCHAR(MAX) in all types of CLR functions, e.g. table-valued ones. I think it is related to the way memory is reserved for the MAX types. I would not be surprised if the real boundary is NVARCHAR(4000).

    ...

    Have you seen a real-world workload that would actually benefit enough to justify the maintainability overhead that having multiple split functions would introduce?

    Yes, in my experience I have seen plenty of real-world database projects where such a performance difference (on large data sets) would easily justify the maintenance overhead. What we are talking about here is just an extra declaration in SQL Server; there is not much to maintain, really.

    _____________________________________________
    "The only true wisdom is in knowing you know nothing"
    "O skol'ko nam otkrytiy chudnyh prevnosit microsofta duh!":-D
    (So many miracle inventions provided by MS to us...)

    How to post your question to get the best and quick help

  • Eugene Elutin (9/5/2012)


    Yes, in my experience I have seen plenty of real-world database projects where such a performance difference (on large data sets) would easily justify the maintenance overhead. What we are talking about here is just an extra declaration in SQL Server; there is not much to maintain, really.

    How are you seeing string splits used where such a difference would matter? (Again, I am *only* talking about string split operations.)

    Generally when I see string splitters used it's for passing a set of arguments to a stored procedure, and it will end up in some query like:

    SELECT *
      FROM Table
     WHERE ID IN (SELECT CONVERT(INT, output) FROM dbo.StringSplitFunction(@values))

    In this case the string split function will be called once per query, not once per row, and the 9 microsecond difference will never amount to anything.

    This is not just a separate declaration that you need to maintain. You need to maintain separate calls, and somewhere your code needs to understand what to do with different strings.

    IF LEN(@values) <= 4000
        ... /* call the version for small strings */
    ELSE
        ... /* call the version for larger strings */

    And now we need to put these checks everywhere we do string splitting.

    Or perhaps you can wrap your two declarations in a single outer TVF, if you're very careful with that code and you trust the query optimizer to produce adequate startup filters so that both functions aren't called behind the scenes. And after all of that you still have three objects sitting in the database where you had only one before...

    --
    Adam Machanic
    whoisactive

  • Adam Machanic (9/5/2012)


    How are you seeing string splits used where such a difference would matter? (Again, I am *only* talking about string split operations.)

    Generally when I see string splitters used it's for passing a set of arguments to a stored procedure, and it will end up in some query like:

    SELECT *

    FROM Table

    WHERE ID IN (SELECT CONVERT(INT, output) FROM dbo.StringSplitFunction(@values))

    Adam,

    I have cases where I have a table with ~20 million rows, each containing a string that needs to be split. That is a case where the splitter would be called 20 million times in a single INSERT query, so any optimization that makes it faster can make a big difference.

  • UMG Developer (9/5/2012)


    I have cases where I have a table with ~20 million rows and each row contains a string that needs to be split, so that is a case where the splitter would be called 20 million times in a single INSERT query, so any optimization to make it faster can make a big difference.

    Well, the only logical way to respond to that is to ask you why you're storing these strings in a table. Dare I ask?

    And, going deeper down the rabbit hole... How big are the strings in question?

    --
    Adam Machanic
    whoisactive
