Tally OH! An Improved SQL 8K “CSV Splitter” Function

  • peter-757102 (3/24/2016)


    @Jeff Moden,

    These are two new functions I like to see, with a broader scope then just string splitting (but they are suited for it).

    If I had these I would not even care how limited string_split() turns out to be, as long as it is efficient at what it does.

    For general use string_split looks inadequate to me, like every construct that has to be surrounded by a lot of code to make it do what is actually needed. Making a fast function that does little is easy. Speed is easily lost when follow up code is needed. Precisely because the output of the function is an intermediate result or the input is being too limited and pre-processing is required.

    As for extra fixed length columns causing data columns.

    I believe that has virtually no impact when streaming rows.

    Whatever column is not consumed upon use, is discarded after every row iteration.

    Missing the extra column on the other hand causes extra processing and storage, it is not even a fair fight in the slightest.

    How often do YOU use the length of an element when you split strings? I can say that I've never seen the need (so far) so it must be a fairly rare situation.

    Adding the extra LEN column, as fast as it may be, isn't going to be free. Although the system may need to calculate it to actually be able to split the string behind the scenes, it still has to do the return only to be discarded in most cases. If it's as rare a thing to need it, I'd just as soon see it as an outside process rather than causing any extra resource usage.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (3/24/2016)


    peter-757102 (3/24/2016)


    @Jeff Moden,

    These are two new functions I like to see, with a broader scope then just string splitting (but they are suited for it).

    If I had these I would not even care how limited string_split() turns out to be, as long as it is efficient at what it does.

    For general use string_split looks inadequate to me, like every construct that has to be surrounded by a lot of code to make it do what is actually needed. Making a fast function that does little is easy. Speed is easily lost when follow up code is needed. Precisely because the output of the function is an intermediate result or the input is being too limited and pre-processing is required.

    As for extra fixed length columns causing data columns.

    I believe that has virtually no impact when streaming rows.

    Whatever column is not consumed upon use, is discarded after every row iteration.

    Missing the extra column on the other hand causes extra processing and storage, it is not even a fair fight in the slightest.

    How often do YOU use the length of an element when you split strings? I can say that I've never seen the need (so far) so it must be a fairly rare situation.

    Adding the extra LEN column, as fast as it may be, isn't going to be free. Although the system may need to calculate it to actually be able to split the string behind the scenes, it still has to do the return only to be discarded in most cases. If it's as rare a thing to need it, I'd just as soon see it as an outside process rather than causing any extra resource usage.

    Maybe you misunderstood me.

    I am talking about being able to know the length of a sub-string based on the output of a find() function.

    Without this, there is no simple (read fast and clean) way to perform sub-string actions.

    A workaround would be window functions and then it becomes complicated and inefficient.

    Rich versions of Find functions, such those supporting patterns/regular expressions, have to deal with variable length separators.

    Maybe this what causes the confusion.

  • PS.

    My issue of not being able to post or comment on connect issues has been solved.

    They lost part of my registration, after going trough it again, it works.

  • Thanks Jeff for pointing that out.

    Yes the version in my gist is broken. My mistake I've since updated it in my database project but not in this actual test script. I've updated the gist and rerun the test script (sans string_split as I don't have 2016 at this workstation).

    The performance is as expected fn_split screams.

    SplitterNametotal_davg_d

    -----------------------------------------

    delimsplit8k22.2612260.377308915254237

    fn_split2.8222850.0478353389830508

    SqlBaseline4.1824130.0708883559322034

  • Something I don't understand:

    SELECT * FROM dbo.DelimitedSplit8K('xxx,d','');

    Returns the whole string. This makes sense, the function found 0 occurrences of the delimiter.

    SELECT * FROM dbo.DelimitedSplit8K(NULL,',');

    Returns a NULL. This makes sense, a NULL input should return a NULL output.

    SELECT * FROM dbo.DelimitedSplit8K('xxx,d',NULL);

    Also returns the whole string. Shouldn't it return a NULL just like the previous query does?

    I'm not implying in any way that the logic is wrong, I just don't understand it.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Alan.B (5/13/2016)


    Something I don't understand:

    SELECT * FROM dbo.DelimitedSplit8K('xxx,d','');

    Returns the whole string. This makes sense, the function found 0 occurrences of the delimiter.

    SELECT * FROM dbo.DelimitedSplit8K(NULL,',');

    Returns a NULL. This makes sense, a NULL input should return a NULL output.

    SELECT * FROM dbo.DelimitedSplit8K('xxx,d',NULL);

    Also returns the whole string. Shouldn't it return a NULL just like the previous query does?

    I'm not implying in any way that the logic is wrong, I just don't understand it.

    IMHO, no. It shouldn't return a NULL when the delimiter is NULL. It didn't find any NULLs as delimiters and, just like not finding a comma as a delimiter, should return whatever it was passed... which it does.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (5/14/2016)


    Alan.B (5/13/2016)


    Something I don't understand:

    SELECT * FROM dbo.DelimitedSplit8K('xxx,d','');

    Returns the whole string. This makes sense, the function found 0 occurrences of the delimiter.

    SELECT * FROM dbo.DelimitedSplit8K(NULL,',');

    Returns a NULL. This makes sense, a NULL input should return a NULL output.

    SELECT * FROM dbo.DelimitedSplit8K('xxx,d',NULL);

    Also returns the whole string. Shouldn't it return a NULL just like the previous query does?

    I'm not implying in any way that the logic is wrong, I just don't understand it.

    IMHO, no. It shouldn't return a NULL when the delimiter is NULL. It didn't find any NULLs as delimiters and, just like not finding a comma as a delimiter, should return whatever it was passed... which it does.

    That makes sense. Thank you sir.

    "I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code."

    -- Itzik Ben-Gan 2001

  • Jeff Moden (5/14/2016)


    IMHO, no. It shouldn't return a NULL when the delimiter is NULL. It didn't find any NULLs as delimiters and, just like not finding a comma as a delimiter, should return whatever it was passed... which it does.

    That looks like treating NULL as a value, rather than the absence of any value. Heresy:exclamation::sick: :w00t:

    Perhaps it would be better to say "if you don't tell it what the delimiter is it can't find a delimiter, so just like any other case of not finding a delimiter it returns whatever string it was passed".

    That gives you the same result without pretending NULL is a value.

    Tom

  • TomThomson (5/14/2016)


    Jeff Moden (5/14/2016)


    IMHO, no. It shouldn't return a NULL when the delimiter is NULL. It didn't find any NULLs as delimiters and, just like not finding a comma as a delimiter, should return whatever it was passed... which it does.

    That looks like treating NULL as a value, rather than the absence of any value. Heresy:exclamation::sick: :w00t:

    Perhaps it would be better to say "if you don't tell it what the delimiter is it can't find a delimiter, so just like any other case of not finding a delimiter it returns whatever string it was passed".

    That gives you the same result without pretending NULL is a value.

    Heh... NULL... source of many hours of sometimes zealous and always entertaining reading in more discussions that I care to admit reading. 😛 There are even people that think there shouldn't be a "NULL" in databases and others that think there should be more than one kind of NULL.

    For the record and, as you probably know from many of my posts in the past, I absolutely agree. NULL is the [font="Arial Black"]absence [/font]of any value and is frequently interpreted simply as "unknown", which cannot have any value. Another way of thinking is that it's a "condition" rather than a value.

    As they say, "But I digress".

    Shifting gears back to your good comments, my original statement was "It didn't find any NULLs as delimiters (and it never will because NULL is an absence of a value) and..." but thought the better of it because I didn't want to start a classic long winded debate on NULLs and whether you can find the absence of a value in a string that has a value. I should have left it as I originally had it.

    But, let's have a bit of fun. Oddly enough (and you know this but it's another source of entertainment), the classic ASCII character set (http://www.asciitable.com/) has a character (CHAR(0)) that is used to represent NULL. It actually does come into play in the way that SQL Server handles some characters, depending on the display mode.

    For example and based on what we've both stated, what would you expect the following to actually return?

    --===== NULL delimiter with embedded ASCII "NULL" characters.

    DECLARE @String VARCHAR(100);

    SELECT @String = 'AAA'+CHAR(0)+'BBB'+CHAR(0)+'CCC';

    SELECT * FROM dbo.DelimitedSplit8K(@String,NULL);

    In the "Results to Grid" mode, it returns the following...

    ItemNumber Item

    1 AAA

    This has led quite a few people down the primrose path into thinking that the ASCII "NULL" character actually does play the part of a real "NULL" as a delimiter (although only partially), including me when I first started a couple of decades ago.

    Now, run the exact same code in the "Results to TEXT" mode, which returns and see the truth...

    ItemNumber Item

    -------------------- -----------

    1 AAA BBB CCC

    (1 row(s) affected)

    It returns the whole string because it couldn't find a real NULL as a delimiter which, as I should have left in my reply, would be impossible to do because NULL isn't a value... it's the absence of a value.

    So, what's up with the "Results to GRID" mode? It's nothing more than a "display problem" based on how the Microsoft programmers decided to handle CHAR(0) in the grid output. The giveaway is that the splitter still only returned one row in the grid mode. It simply didn't display the rest of the string starting with the first CHAR(0) character. You can easily prove that by running the following code in each of the two display modes. In the "Results to GRID" mode, the truncation occurs. In the "Results to TEXT" mode, it does not and the CHAR(0) characters are simply treated as "white space" characters.

    DECLARE @String VARCHAR(100);

    SELECT @String = 'AAA'+CHAR(0)+'BBB'+CHAR(0)+'CCC';

    SELECT @String;

    In the grid mode, it returns...

    AAA

    ... and it the text mode, it returns...

    -------------------

    AAA BBB CCC

    (1 row(s) affected)

    CHAR(0) does act as a delimiter though, which is actually proof that it's not a real NULL. The following displays correctly in both the grid and text modes because the CHAR(0) isn't a part of the output...

    --===== CHAR(0) delimiter with embedded ASCII "NULL" CHAR(0) characters.

    DECLARE @String VARCHAR(100);

    SELECT @String = 'AAA'+CHAR(0)+'BBB'+CHAR(0)+'CCC';

    SELECT * FROM dbo.DelimitedSplit8K(@String,CHAR(0));

    Just to say it again, I agree... NULL has no value. NULL is NOT nothing. It's the absence of any value and the apparent appearance of "NULL" is actually the result of a flag in the "null bitmap" on the page. Here's a pretty detailed article with coded proof on that subject. http://weblogs.sqlteam.com/mladenp/archive/2007/09/06/How_does_SQL_Server_really_store_NULL-s.aspx

    As a favor to everyone on this thread, if someone want's to go into a deeper discussion on NULLs, please start a different thread and post the link here. Thanks.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • To further Jeff's point about the possible "unexpected" behavior of ASCII character 0, another reason to not use it unless you have a specific need for it is that it behaves differently between being in VARCHAR form in a SQL_ Collation and in either VARCHAR form in a non-SQL_ Collation or in NVARCHAR form.

    SELECT 1 WHERE CHAR(0) = ''; -- Collation assumed to be the current Database's default

    SELECT 2 WHERE CHAR(0) = N''; -- Collation assumed to be the current Database's default

    SELECT 3 WHERE NCHAR(0) = ''; -- Collation assumed to be the current Database's default

    SELECT 4 WHERE NCHAR(0) = N''; -- Collation assumed to be the current Database's default

    SELECT 5 WHERE CHAR(0) = '' COLLATE SQL_Latin1_General_CP1_CI_AS;

    SELECT 6 WHERE CHAR(0) = N'' COLLATE SQL_Latin1_General_CP1_CI_AS;

    SELECT 7 WHERE NCHAR(0) = '' COLLATE SQL_Latin1_General_CP1_CI_AS;

    SELECT 8 WHERE NCHAR(0) = N'' COLLATE SQL_Latin1_General_CP1_CI_AS;

    SELECT 9 WHERE CHAR(0) = '' COLLATE Latin1_General_100_CI_AS;

    SELECT 10 WHERE CHAR(0) = N'' COLLATE Latin1_General_100_CI_AS;

    SELECT 11 WHERE NCHAR(0) = '' COLLATE Latin1_General_100_CI_AS;

    SELECT 12 WHERE NCHAR(0) = N'' COLLATE Latin1_General_100_CI_AS;

    "5" will not return, and if the current Database's default Collation is a SQL_ Collation, then neither will "1".

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • Piling on on the CHAR(0), here are some of my favorites:

    😎

    SET NOCOUNT ON;

    DECLARE @STR_WITH_ZERO_CHAR VARCHAR(20) = '123456789' + CHAR(0) + '1234567890';

    -- CHARINDEX will not return the position of the CHAR(0)

    SELECT CHARINDEX(CHAR(0),@STR_WITH_ZERO_CHAR,1) AS CHARINDEX_OUT

    -- PATINDEX will return 1 as it is effectively only searching for any character (%)

    SELECT PATINDEX('%' + CHAR(0) + '%',@STR_WITH_ZERO_CHAR) AS PATINDEX_OUT

    -- Removing the CHAR(0) does not work using REPLACE

    SELECT REPLACE(@STR_WITH_ZERO_CHAR,CHAR(0),'') AS REPLACE_OUT

    -- which obviously makes it impossible to count the number of CHAR(0) in a string

    -- using the replace trick.

    SELECT LEN(@STR_WITH_ZERO_CHAR) - LEN(REPLACE(@STR_WITH_ZERO_CHAR,CHAR(0),'')) AS LEN_REPLACE_OUT

    Output

    CHARINDEX_OUT

    -------------

    0

    PATINDEX_OUT

    ------------

    1

    REPLACE_OUT

    ---------------------

    123456789 1234567890

    LEN_REPLACE_OUT

    ---------------

    0

  • Eirikur Eiriksson (5/15/2016)


    Piling on on the CHAR(0), here are some of my favorites:

    😎

    SET NOCOUNT ON;

    DECLARE @STR_WITH_ZERO_CHAR VARCHAR(20) = '123456789' + CHAR(0) + '1234567890';

    -- CHARINDEX will not return the position of the CHAR(0)

    SELECT CHARINDEX(CHAR(0),@STR_WITH_ZERO_CHAR,1) AS CHARINDEX_OUT

    -- PATINDEX will return 1 as it is effectively only searching for any character (%)

    SELECT PATINDEX('%' + CHAR(0) + '%',@STR_WITH_ZERO_CHAR) AS PATINDEX_OUT

    -- Removing the CHAR(0) does not work using REPLACE

    SELECT REPLACE(@STR_WITH_ZERO_CHAR,CHAR(0),'') AS REPLACE_OUT

    -- which obviously makes it impossible to count the number of CHAR(0) in a string

    -- using the replace trick.

    SELECT LEN(@STR_WITH_ZERO_CHAR) - LEN(REPLACE(@STR_WITH_ZERO_CHAR,CHAR(0),'')) AS LEN_REPLACE_OUT

    Output

    CHARINDEX_OUT

    -------------

    0

    PATINDEX_OUT

    ------------

    1

    REPLACE_OUT

    ---------------------

    123456789 1234567890

    LEN_REPLACE_OUT

    ---------------

    0

    You should check the default Collation for the Database that was current when you ran that test. I suspect that it is using a Windows Collation (i.e. non-SQL_), which is generally a good thing, but behaves as you are showing in those results. A SQL_ Collation would allow CHAR(0) to match just like any other character.

    Also, you can get the REPLACE and other built-in functions to treat CHAR(0) consistent with the other characters by temporarily overriding the Collation to use a binary one via the COLLATE clause (please use _BIN2 when available as the _BIN Collations have been deprecated):

    SET NOCOUNT ON;

    DECLARE @STR_WITH_ZERO_CHAR VARCHAR(20) = '123456789' + CHAR(0) + '1234567890';

    SELECT CHARINDEX(CHAR(0),@STR_WITH_ZERO_CHAR COLLATE Latin1_General_100_BIN2,1) AS CHARINDEX_OUT

    SELECT PATINDEX('%' + CHAR(0) + '%',@STR_WITH_ZERO_CHAR COLLATE Latin1_General_100_BIN2) AS PATINDEX_OUT

    SELECT REPLACE(@STR_WITH_ZERO_CHAR COLLATE Latin1_General_100_BIN2,CHAR(0),'') AS REPLACE_OUT

    SELECT LEN(@STR_WITH_ZERO_CHAR) - LEN(REPLACE(@STR_WITH_ZERO_CHAR COLLATE Latin1_General_100_BIN2,CHAR(0),'')) AS LEN_REPLACE_OUT

    Results:

    CHARINDEX_OUT

    10

    PATINDEX_OUT

    10

    REPLACE_OUT

    1234567891234567890

    LEN_REPLACE_OUT

    1

    BUT, neither of these clarifications is meant to take anything away from those additional instances in which CHAR(0) has potentially unexpected behavior :-).

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • Solomon Rutzky (5/15/2016)


    You should check the default Collation for the Database that was current when you ran that test. I suspect that it is using a Windows Collation (i.e. non-SQL_), which is generally a good thing, but behaves as you are showing in those results. A SQL_ Collation would allow CHAR(0) to match just like any other character.

    Thanks for this Solomon,

    my bad not mentioning the difference between the behaviour using different Collations.

    😎

  • The trick for counting CHAR(0) in a string is to count the number of length one substrings that are equal to CHAR(0). Gloriously inefficient, of course, but I don't know another way of doing it.

    And when comparing varchars which represent encrypted strings the collation needs to be forced to a BIN one or wrong results will happen because encrypted strings may contain char(0) somewhere even if none of the plaintext does (whether it's an SQL collation or a windows collation doesn't matter).

    Of course there's not much use for CHAR(0) in text data these days. A (rather long) while back it was quite common - many flying head print devices needed a few NULLs after NL (ebcdic) or CR LF (ascii) to avoid printing a character in the middle of head flyback, so for example a text document that was to be printed on a devide with a 300 bit/sec interface might use CR LF NULL at the end of lines containing more than about 64 characters (in the ebcdic world, NL NULL in lines longer than about 30 characters and NL NULL NULL in lines longer than about 64 characters). But in those days no-one had to worry about the handling of CHAR(0) (or indeed about anything at all) in SQL.

    Tom

  • TomThomson (5/16/2016)


    The trick for counting CHAR(0) in a string is to count the number of length one substrings that are equal to CHAR(0). Gloriously inefficient, of course, but I don't know another way of doing it.

    As shown in my previous post above, using a _BIN2 Collation via the COLLATE clause (for columns that are not set to a binary Collation and when using literals while not in a database that has a binary default Collation) would allow you to handle CHAR(0) without it having any "special" behaviors. Reworking the example code from that prior post (to focus only on the counting of CHAR(0) query), we get:

    SET NOCOUNT ON;

    DECLARE @STR_WITH_ZERO_CHAR VARCHAR(20) = '123' + CHAR(0) + '45' + CHAR(0)

    + '678' + CHAR(0) + '90';

    SELECT LEN(@STR_WITH_ZERO_CHAR)

    - LEN(REPLACE(@STR_WITH_ZERO_CHAR COLLATE Latin1_General_100_BIN2, CHAR(0), ''))

    AS [CountOfChar0]

    The query above returns: 3.

    In the case of using a VARCHAR column to store the hex bytes of encrypted or hashed values, you might as well set the Collation of that column to Latin1_General_100_BIN2 which will allow the comparisons to be even more efficient than specifying COLLATE per each query.

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

Viewing 15 posts - 886 through 900 (of 990 total)

You must be logged in to reply to this topic. Login to reply