Tally OH! An Improved SQL 8K “CSV Splitter” Function

  • @jeff Moden,

    I understand your position.

    And I do have good reasons to want it a certain way:

    1. Storing empty strings really is an edge case.

    2. Casting null values to a final data-type is easier and more precise/controllable then doing the same for zero-length strings (= less code and faster).

    3. As an exception, I find using IsNULL( column, 'some-value' ) more elegant then using NullIf( column, '' ), followed by possibly even more processing (it is less direct so to speak).

    An example is in order

    Lets say we want to fill up a simple TABLE datatype that has two columns, both of type int.

    Column 1 is the key and column 2 is a value which can be NULL.

    We have as input two strings we have to split and each contains the same amount of elements.

    declare @T TABLE

    (

    ID int not null primary key

    , Value int

    );

    If the split function returns null for zero-length int values, a simple cast to the appropriate data type will do.

    In fact, even implicit conversions would work out of the box here.

    But what if we get zero-length string values instead?

    Any straight up conversions of zero-length strings would result in a 0 value for datatype int and that is not what we want here.

    In order to remedy this we first need to convert the item value to a null whenever it is a zero-length string using NullIf( value, '' ).

    The reason to often want a table with typed columns instead of string columns can vary, but speed is one of the primary goals here.

    Other reasons:

    1. Reuse of values in some complex batch or procedure

    2. Application indexes and constraints

    3. Data-type validation and one time conversion costs.

    4. Collation control

    5. Enriching of the worktable, possibly in multiple steps, before final output is returned.

    I know many of the tests and examples surrounding splitting deal with splitting concatenated sting values. But in my experience dealing with int values and dates is actually more common. Regardless of the used data-types, treating them all the same way is preferable. This is why I lean to returning null values over zero-length strings. It works data-type agnostic, is the best fit for more types and incidentally those types are also the most common in my experience.

    If you follow all this, you probably see why I consider a split function not really a string function.

    It is rather more a table function that has string as input, but often the result does not end up being used as string values.

    Practical applications go beyond splitting books into words simply for kicks :).

  • @Peter,

    MS is making just one splitter. Because of that, it has to be generic. You would prefer it to return NULLs for empty elements. Others would prefer 0's to be returned if the elements are determined to be numeric. Others would prefer it to return empty strings, datatype be damned (my preference, by the way). Still others would prefer that empty elements not be returned at all and still others want the strings to be sorted, deduplicated, or both. And the real crazies are going to want it to handle multiple levels of delimiters in an XML-like fashion. Well, except on Tuesdays. 🙂 Where does it end?

    Differences between CHAR, VARCHAR, NCHAR, NVARCHAR, and SQL_Variant not withstanding, the function (IMHO) should return exactly what it sees either as a single element or as a single level multi-element string and it should be done as fast as possible. Due to the nature of typical delimited data (I say "typical" to avoid yet another controversy that considers binary data and an additional discussion on whether it should be Little-Endian or not or choose-able between the two, etc, ad infinitum), the only thing it can see is a string or a NULL. Making internal ancillary decisions to return the flavor of the day will do nothing but slow it down punishing one or another use case for other use cases insofar as performance is concerned.

    A good string splitter is all I want from MS and, to me, a good string splitter faithfully returns what's between delimiters (and start of string/end of string are permanent and immutable delimiters to me) or what's passed as a single element string even if that's a NULL or empty string (and they ARE different). If I need to accommodate anything else, I'll customize the output punishing only that use case for performance.

    As a bit of a sidebar, if find it ironic that a NULL in ASCII is actually CHAR(0) whereas an empty string truly has no value and is, therefor, unknown. I wonder if that's why Oracle considers adjacent single quotes to be a NULL. 😛 To quote a man much wiser than I, "The good part about standards is that there are so many to choose from". :hehe:

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Ed Wagner (3/12/2016)


    Eirikur Eiriksson (3/11/2016)


    TheSQLGuru (3/11/2016)


    Someone needs to benchmark the STRING_SPLIT() function coming out in SQL Server 2016 and update the main article!

    https://msdn.microsoft.com/en-us/library/mt684588.aspx

    Very quick on Azure, only caveat is that it simply returns

    Msg 208, Level 16, State 1, Line 5

    Invalid object name 'STRING_SPLIT'.

    😎

    Do you have compatibility mode of 130 for your database? I don't have access to 2016 to test, but the BOL page said it's required.

    The Azure SQL Database is in compatibility_level 130, still returns Invalid object name 'STRING_SPLIT'

    😎

    Wouldn't be the first time BOL isn't exactly correct.

  • Jeff Moden (3/12/2016)


    since there's a work around for that, as well, I can live with the idea that the relational purists have of not returning anything because a string wasn't actually provided. It just seems so counter intuitive to have a string function not return a NULL if provided a NULL.

    My objection here is not about relational purity, but about conceptual / programmatic consistency. This is not a string that contains string data like a name, address, description, SKU, etc. It is a CSV, which is an array temporarily disguised as a string. And so the string contains elements, and is not an element itself; it is the container of elements. A NULL input string indicates a NULL container, so talking about a "single element of NULL" is irrelevant since no elements were passed in. Having a NULL input string return a single row of NULL is inconsistent because in that one special case it is evaluating the container on the same level that it evaluates what is inside of the container in all other cases. To state it in pseudo-code:

    int[] _Stuff = new int[];

    foreach (int _Something in _Stuff)

    {

    // do something;

    }

    Wanting a single row of NULL for a NULL input is the same as saying that the above loop should be entered and iterate one time. But that doesn't make sense because the array is empty.

    That being said, I do fully understand that a) there are workarounds for both cases, and b) that plenty of people see things the other way and have coded to that for quite some time and it seems to be working for them. I am just trying to clarify why I don't get the "NULL input returns 1 row of NULL" interpretation.

    So, I would say that it just seems so counter-intuitive to return anything if the input is NULL ;-).

    I should also point out that there is little purpose in bemoaning an extra optional parameter given that you would never even know it was there because you would never use it, and so it would still be a function call with just 2 parameters.

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • FWIW, I agree with Solomon here.

    -- Gianluca Sartori

  • Solomon Rutzky (3/13/2016)


    Jeff Moden (3/12/2016)


    since there's a work around for that, as well, I can live with the idea that the relational purists have of not returning anything because a string wasn't actually provided. It just seems so counter intuitive to have a string function not return a NULL if provided a NULL.

    My objection here is not about relational purity, but about conceptual / programmatic consistency. This is not a string that contains string data like a name, address, description, SKU, etc. It is a CSV, which is an array temporarily disguised as a string. And so the string contains elements, and is not an element itself; it is the container of elements. A NULL input string indicates a NULL container, so talking about a "single element of NULL" is irrelevant since no elements were passed in. Having a NULL input string return a single row of NULL is inconsistent because in that one special case it is evaluating the container on the same level that it evaluates what is inside of the container in all other cases. To state it in pseudo-code:

    int[] _Stuff = new int[];

    foreach (int _Something in _Stuff)

    {

    // do something;

    }

    Wanting a single row of NULL for a NULL input is the same as saying that the above loop should be entered and iterate one time. But that doesn't make sense because the array is empty.

    That being said, I do fully understand that a) there are workarounds for both cases, and b) that plenty of people see things the other way and have coded to that for quite some time and it seems to be working for them. I am just trying to clarify why I don't get the "NULL input returns 1 row of NULL" interpretation.

    So, I would say that it just seems so counter-intuitive to return anything if the input is NULL ;-).

    I should also point out that there is little purpose in bemoaning an extra optional parameter given that you would never even know it was there because you would never use it, and so it would still be a function call with just 2 parameters.

    I'll have to agree to strongly disagree with most everything you've said and the reasons you've said them.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • spaghettidba (3/14/2016)


    FWIW, I agree with Solomon here.

    Heh... for once, we disagree. 😛

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • Jeff Moden (3/14/2016)


    I'll have to agree to strongly disagree with most everything you've said and the reasons you've said them.

    Well, for reasons heretofore stated and unstated, whether expressly or implied, I will just have to disagree with that which have you said, with the stated exception that I am in agreement with respect to your disagreement regarding what I have said, with the exception of you being in agreement with what I am saying here with respect to our respective agreement to hold to our prior mutual disagreement on this matter. Agreed? 😉

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • Solomon Rutzky (3/14/2016)


    Jeff Moden (3/14/2016)


    I'll have to agree to strongly disagree with most everything you've said and the reasons you've said them.

    Well, for reasons heretofore stated and unstated, whether expressly or implied, I will just have to disagree with that which have you said, with the stated exception that I am in agreement with respect to your disagreement regarding what I have said, with the exception of you being in agreement with what I am saying here with respect to our respective agreement to hold to our prior mutual disagreement on this matter. Agreed? 😉

    I think everybody will agree we need paracetamol for this headache. 🙂

    -- Gianluca Sartori

  • spaghettidba (3/14/2016)


    Solomon Rutzky (3/14/2016)


    Jeff Moden (3/14/2016)


    I'll have to agree to strongly disagree with most everything you've said and the reasons you've said them.

    Well, for reasons heretofore stated and unstated, whether expressly or implied, I will just have to disagree with that which have you said, with the stated exception that I am in agreement with respect to your disagreement regarding what I have said, with the exception of you being in agreement with what I am saying here with respect to our respective agreement to hold to our prior mutual disagreement on this matter. Agreed? 😉

    I think everybody will agree we need paracetamol for this headache. 🙂

    Agreed. 😉 I never thought a string splitter would make such a stink - unless it didn't work.

  • Ed Wagner (3/14/2016)


    spaghettidba (3/14/2016)


    Solomon Rutzky (3/14/2016)


    Jeff Moden (3/14/2016)


    I'll have to agree to strongly disagree with most everything you've said and the reasons you've said them.

    Well, for reasons heretofore stated and unstated, whether expressly or implied, I will just have to disagree with that which have you said, with the stated exception that I am in agreement with respect to your disagreement regarding what I have said, with the exception of you being in agreement with what I am saying here with respect to our respective agreement to hold to our prior mutual disagreement on this matter. Agreed? 😉

    I think everybody will agree we need paracetamol for this headache. 🙂

    Agreed. 😉 I never thought a string splitter would make such a stink - unless it didn't work.

    It depends on how strictly / loosely you are defining "work". If you mean working "properly", then that is actually the issue at hand: the new STRING_SPLIT function's behavior of not returning a single row of NULL when supplied with a NULL string is considered to be "not working properly" by some, hence a bug was filed against this behavior. On the other hand, to return a single row of NULL upon being supplied with a NULL string would be considered "not working properly" by some, and a bug would be filled against that behavior.

    Still, I wouldn't consider this discussion to be that much of a stink. I mean, certainly not to the level that one finds with discussions related to using ORMs, GOTO statements, CURSORs, embedding business logic in the DB (especially within Triggers), using GUIDs as PKs, NoSQL, (most likely to be added to this list soon enough: SQL Server on Linux), etc, etc 😉

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • Solomon Rutzky (3/14/2016)


    Jeff Moden (3/14/2016)


    I'll have to agree to strongly disagree with most everything you've said and the reasons you've said them.

    Well, for reasons heretofore stated and unstated, whether expressly or implied, I will just have to disagree with that which have you said, with the stated exception that I am in agreement with respect to your disagreement regarding what I have said, with the exception of you being in agreement with what I am saying here with respect to our respective agreement to hold to our prior mutual disagreement on this matter. Agreed? 😉

    Didn't know you were into politics. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.
    "Change is inevitable... change for the better is not".

    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)
    Intro to Tally Tables and Functions

  • spaghettidba (3/14/2016)


    FWIW, I agree with Solomon here.

    I agree with him that returning a single row is not the right answer, but I believe that he is completely wrong in claiming that returning an empty rowset is the right answer. If SQL were a language that treated relations as first class objects (ie a relational language in the same sense that Haskell is a functional language) we would be able to return a NULL rowset and that would be a correct solution: the input was unspecified/unknown so the output should also be unspecified/unknown. But we don't have that option.

    I believe that what Solomon is advocating is actually a lot worse than what Jeff is advocating, because his justification for it is that he wants to introduce yet another case where NULL is used to encode a particular known value rather than indicating that knowledge of the value is not provided (and he amazes me when he appears to claim in one post that there is never any need to have a bunch of strings in which NULLs are distinnguished from zero-length strings; I've done a lot of work processing text in complicated multi-lingual situations and I know from experience that distinguishing NULL from zero-length strings is sometimes very important indeed). Another case where NULL is used to encode a genuine known value would be just a pure disaster. And some of us have years of experience using splitters that behave like Jeff's splitter, so there's a backward compatibility argument too (of course backwards compatibility may work the other way for people who have been been using lots of encodings that treat an empty string and a NULL as the same thing).

    Of course another option would be to make the splitter throw an error on NULL input; I wonder what impact that would have on performance? It would be unpleasant for usability, though, having to enclose it in a try-catch block all the time.

    Tom

  • TomThomson (3/15/2016)


    I believe that what Solomon is advocating is actually a lot worse than what Jeff is advocating, because his justification for it is that he wants to introduce yet another case where NULL is used to encode a particular known value rather than indicating that knowledge of the value is not provided (and he amazes me when he appears to claim in one post that there is never any need to have a bunch of strings in which NULLs are distinnguished from zero-length strings; I've done a lot of work processing text in complicated multi-lingual situations and I know from experience that distinguishing NULL from zero-length strings is sometimes very important indeed). Another case where NULL is used to encode a genuine known value would be just a pure disaster. And some of us have years of experience using splitters that behave like Jeff's splitter, so there's a backward compatibility argument too (of course backwards compatibility may work the other way for people who have been been using lots of encodings that treat an empty string and a NULL as the same thing).

    That "Solomon" dude sounds pretty meshugenah ;-). I, on the other hand, have been saying pretty much the exact opposite of all that. I have been advocating that we specifically do not treat NULL as having a special meaning of a "known value of 1 element of NULL". But to be fair, I am the "someone" who suggested adding a parameter (optional or not) to specify how to treat a NULL input, given that there are two distinct and unwavering interpretations of what should be done. That would not only allow everyone to get what they wanted functionally, but it would also let us continue this discussion as we debate the merits/evils of additional parameters; a win-win situation ;-).

    I also stated that with regards to needing to distinguish NULL strings from empty strings, the CSV format was lacking such ability. I pointed out that this problem has been solved in other formats, such as XML, both in terms of elements and attributes.

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • I'm still waiting for the performance review of the new 2016 string splitter function I asked for long ago before all this mish-mash about NULL sidetracked the topic. 😀

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

Viewing 15 posts - 826 through 840 (of 981 total)

You must be logged in to reply to this topic. Login to reply