Tally OH! An Improved SQL 8K “CSV Splitter” Function

  • mark hutchinson (8/12/2011)


    @Peter

Jeff's new article describes this method and states that when the number of rows needed is less than what would be produced at or above one of the joining expressions, the compiler doesn't evaluate that expression.

    Mark

Got a link to that article? I can't believe I missed the new varchar(max) string splitter he's been working on for so long. My early experiences with avoiding joins showed a severe degenerative effect (it was one of the first things I tested with a fully inline generated tally table). Granted, I was more aggressive and applied it to all but the first join.

  • Apologies for my apparent absence on this thread. Looks like I have a bit more testing to do. Thanks John, Mark, and Peter.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • @peter-2

    Thanks for the feedback. I've run the test several different ways (current way posted, putting values into a table variable, and single execution statements) and found they produced the same results. If you create a better test harness in your testing, I would love to see it. Let me know how it goes with the udf testing.

    /* Anything is possible but is it worth it? */

  • Gatekeeper (8/12/2011)


    @Peter

    Thanks for the feedback. I've run the test several different ways (current way posted, putting values into a table variable, and single execution statements) and found they produced the same results. If you create a better test harness in your testing, I would love to see it. Let me know how it goes with the udf testing.

Currently I do not have the spare time to perform any performance testing myself, as I am just before a delivery date.

    You know how these things go :).

When the full varchar(max) splitter article is out and I have time (it is not out yet, is it?), I will certainly perform some tests, including on your posted version here.

  • peter-757102 (8/12/2011)


    You know how these things go

    I sure do. I don't think Jeff has the varchar(max) article posted but I'm sure we'll see it featured when it is.

    /* Anything is possible but is it worth it? */

  • @Gatekeeper

The article to which I referred is about rCTEs (recursive CTEs).

    Mark

This is a great article, Jeff, and it is going to be extremely useful.

    But I'm wondering why you appear to be generating a recordset with 10,000 rows and then limiting that recordset to the number of rows based on the DATALENGTH of the value passed in. Would it not be better to remove the E4 CTE and then limit the cteTally CTE with the cross join on E2, as below, that way never generating more rows than you need?

    Replace:

E4(N) AS (SELECT 1 FROM E2 a, E2 b), --10E+4 or 10,000 rows max
    cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
                    -- for both a performance gain and prevention of accidental "overruns"
     SELECT TOP (ISNULL(DATALENGTH(@pString),0)) ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
    ),

    With:

cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
                    -- for both a performance gain and prevention of accidental "overruns",
                    -- up to a maximum of 10,000
     SELECT TOP (ISNULL(DATALENGTH(@pString), 1))
            ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
       FROM E2 a, E2 b
    ),

  • Mushroom (9/30/2011)


This is a great article, Jeff, and it is going to be extremely useful.

    But I'm wondering why you appear to be generating a recordset with 10,000 rows and then limiting that recordset to the number of rows based on the DATALENGTH of the value passed in. Would it not be better to remove the E4 CTE and then limit the cteTally CTE with the cross join on E2, as below, that way never generating more rows than you need?

    ... snip ...

Thanks for the feedback. :-)

    The "overruns" I was talking about are when there's a "bad join" between the function and some data. That doesn't happen with TOP.

    Also, as I explained how the cascaded CTEs work in the article, they don't generate 10,000 rows and then select from those... TOP is "reflected" back into the cascaded CTEs and the system won't actually generate any more rows than are needed. For example, if you only need the values from 1 to 5, only E1 will generate any numbers... E2 through E4 won't even come into play.
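For anyone following along, here is a minimal stand-alone sketch of that behavior. It is not the article's exact code; the CTE bodies are abbreviated for illustration:

    ```sql
    -- Minimal sketch of the cascaded-CTE tally pattern under discussion
    -- (not the exact code from the article). The TOP clause sets a
    -- "row goal", so the optimizer stops driving the cross joins as
    -- soon as enough rows have been produced; asking for 5 rows never
    -- materializes 10,000.
    WITH E1(N) AS (
        SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
        UNION ALL
        SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
    ),                                        -- 10 rows
    E2(N) AS (SELECT 1 FROM E1 a, E1 b),      -- 10E+2 or 100 rows max
    E4(N) AS (SELECT 1 FROM E2 a, E2 b)       -- 10E+4 or 10,000 rows max
    SELECT TOP (5)
           N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
    FROM E4;  -- returns only the values 1 through 5
    ```

    Checking the actual execution plan for a small TOP value should show only E1 producing rows, which is the "reflected back" effect described above.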

    --Jeff Moden



  • Jeff,

    I made a few tweaks to your function that I think improve performance a little more:

    • I noticed that the SELECT 0 could be moved from the cteTally table to the cteStart table, eliminating the OR clause. Then I realized that it could be moved out of the CTE completely. So in my function below, the first element is pulled out entirely separately from the rest. It adds some unfortunate duplication in the code, but seems to help.
    • Since the delimiter will be implicitly converted to varchar anyway, it seems more consonant with its use to declare it varchar at the start. I doubt this has any real performance implication.
    • I think that using a binary collation in all the CharIndexes and the WHERE condition could possibly improve performance just a tiny bit more.
    • There's an implicit conversion from integer to string inside DataLength(). Using ' ' seems clearer.
    • If I were going to pursue any further improvements I would experiment with varying numbers of UNION ALLs in the E1 CTE and varying numbers of CROSS JOIN CTEs following that, to see if something fell out.

Here's my function--sorry about the wild reformatting and renaming; I like to make code match my conventions when I use it. 🙂

CREATE FUNCTION dbo.DelimitedSplit8KErik (
        @String varchar(8000),
        @Delimiter varchar(1)
    ) RETURNS TABLE WITH SCHEMABINDING
    AS
    RETURN
    WITH E1 (N) AS (
        SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
        UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
    ), E2 (N) AS (
        SELECT 1 FROM E1 a CROSS JOIN E1 b
    ), E4 (N) AS (
        SELECT 1 FROM E2 a CROSS JOIN E2 b
    ), T (N) AS (
        SELECT TOP (DataLength(IsNull(@String, ' ')))
            Row_Number() OVER (ORDER BY (SELECT NULL))
        FROM E4
    ), S (N) AS (
        SELECT N + 1
        FROM T
        WHERE Substring(@String, N, 1) = @Delimiter
    )
    SELECT
        1 ItemNumber,
        Substring(@String, 1, IsNull(NullIf(CharIndex(@Delimiter, @String), 0) - 1, 8000)) Item
    UNION ALL
    SELECT
        Row_Number() OVER (ORDER BY N) + 1,
        Substring(@String, N, IsNull(NullIf(CharIndex(@Delimiter, @String, N), 0) - N, 8000))
    FROM S;
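For anyone trying this out, a quick smoke test of the function above (with a made-up input string) might look like:

    ```sql
    -- Smoke test for the splitter above; assumes the function has been
    -- created in the current database. Adjacent delimiters should yield
    -- an empty-string element, matching the original function's behavior.
    SELECT ItemNumber, Item
    FROM dbo.DelimitedSplit8KErik('abc,def,,ghi', ',');
    -- 4 rows: 'abc', 'def', '' (empty element), 'ghi'
    ```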

    P.S. You didn't touch on why you're using the WITH SCHEMABINDING option. There are no real external database objects referenced, so it seems superfluous. Would you explain?

  • ErikEckhardt (3/13/2012)

    P.S. You didn't touch on why you're using the WITH SCHEMABINDING option. There are no real external database objects referenced, so it seems superfluous. Would you explain?

    PMFJI: SCHEMABINDING can speed up execution if you know ahead of time that no database objects will be touched during execution of the UDF: http://blogs.msdn.com/b/sqlprogrammability/archive/2006/05/12/596424.aspx and http://beyondrelational.com/blogs/mike/archive/2011/01/17/the-benefits-of-schemabinding.aspx.
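To illustrate the point, a minimal, hypothetical inline TVF with the option applied might look like this (dbo.TinyTally is an invented name, not code from the article):

    ```sql
    -- Hypothetical minimal example of WITH SCHEMABINDING on an inline TVF.
    -- Because the function is schema-bound and touches no tables, the
    -- engine can determine it does not access user data, which can avoid
    -- extra spool operators in plans that call it.
    CREATE FUNCTION dbo.TinyTally (@Max int)
    RETURNS TABLE WITH SCHEMABINDING
    AS RETURN
        SELECT TOP (@Max) N = ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
        FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) AS E1(N); -- max 10 rows
    ```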

    Rich

  • ErikEckhardt (3/13/2012)


    Jeff,

    I made a few tweaks to your function that I think improve performance a little more:

    • I noticed that the SELECT 0 could be moved from the cteTally table to the cteStart table, eliminating the OR clause. Then I realized that it could be moved out of the CTE completely. So in my function below, the first element is pulled out entirely separately from the rest. It adds some unfortunate duplication in the code, but seems to help.
    • Since the delimiter will be implicitly converted to varchar anyway, it seems more consonant with its use to declare it varchar at the start. I doubt this has any real performance implication.
    • I think that using a binary collation in all the CharIndexes and the WHERE condition could possibly improve performance just a tiny bit more.
    • There's an implicit conversion from integer to string inside DataLength(). Using ' ' seems clearer.
    • If I were going to pursue any further improvements I would experiment with varying numbers of UNION ALLs in the E1 CTE and varying numbers of CROSS JOIN CTEs following that, to see if something fell out.

    Here's my function--sorry about the wild reformatting and renaming, I like to make code match my conventions when I use it. 🙂

    ... snip ...

    P.S. You didn't touch on why you're using the WITH SCHEMABINDING option. There are no real external database objects referenced, so it seems superfluous. Would you explain?

If you'll have a look at the prolog of the article, some of these things were also discovered by others and have been included in the updated scripts in the "Resources" section near the end of the article. Collectively, we managed to squeak out almost an additional 20% (IIRC).

    The reason for the WITH SCHEMABINDING was covered in the previous post.

I do very much appreciate the tips and will review them against the updated code to see if we missed any of your suggestions.

    --Jeff Moden



  • My apologies if I wasted anyone's time! I didn't know there was somewhere else to look for additional improvements. This thread has 28 pages--gadzooks!

It's never a waste of time to suggest possible improvements, and I thank you for them. Actually, I'm sorry you may have wasted some of your time, but the article did have an "UPDATE" leader that said the attached code had been updated with the suggested improvements. The "attached" code can be found in the "Resources" links near the bottom of the article itself.

    Sincere thanks again and glad to meet you here and on the other forum, Erik.

    --Jeff Moden



  • Jeff, did you receive any news regarding the performance characteristics of the code (or variations) on SQL Server 2012?

And on the extra bright side, in a few months I will be working in a .NET-driven development environment; then it will be time for me to do some experiments with using that in SQL Server :).

  • peter-757102 (3/15/2012)


    Jeff, did you receive any news regarding the performance characteristics of the code (or variations) on SQL Server 2012?

And on the extra bright side, in a few months I will be working in a .NET-driven development environment; then it will be time for me to do some experiments with using that in SQL Server :).

    Haven't tried it there yet but I don't see why any of this would change.

So far as the .NET-driven development environment goes, take a look at the CLR splitter that Paul White wrote for the comparison testing in this article.

    --Jeff Moden



Viewing 15 posts - 271 through 285 (of 990 total)
