Tally OH! An Improved SQL 8K “CSV Splitter” Function

  • mark hutchinson (3/29/2015)


    Null is a something that means nothing

    I'll have to disagree with that. To me, an empty string or the number 0 is nothing. Null can have no value like nothing because it's not a value. It's a state that marks places where the value is unknown and even the possibility of something having a value may be unknown.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • mpescod (3/29/2015)


    Very nice and well explained article, I enjoyed reading it and just wanted to add my findings which include handling the case where the string to be parsed is blank. In this instance I would expect to get no rows back but the solution in this article seems to return a row with a blank string.

    Here's the version I've been using for a while now and it seems to match performance with yours with the included handling of a blank source string.

    I also use the values method of a row generator/tally table as I don't need to support prior to 2008 and it makes it more readable IMO.

    ;with e(n) as ( select 0 from (values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) T(n))

    ,rg(n) as (select 0 from e,e e2,e e3,e e4)

    select Item=substring(@pString,n,isnull(nullif(CHARINDEX(@pDelimiter,@pString,n),0),len(@pString)+1)-n)

    from (

    select n=1 where len(@pString) > 0

    union all

    select n+1 from (select top(len(@pString)) n=ROW_NUMBER() over (order by (select null)) from rg) T where SUBSTRING(@pString,n,1) = @pDelimiter

    ) StartPositions

    Thank you for the feedback. What data did you use to test the performance with?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (3/29/2015)


    mark hutchinson (3/29/2015)


    Null is a something that means nothing

    I'll have to disagree with that. To me, an empty string or the number 0 is nothing. Null can have no value like nothing because it's not a value. It's a state that marks places where the value is unknown and even the possibility of something having a value may be unknown.

    I must agree with the disagreement. NULL is not nothing, nor is it true, false, zero or an empty string. NULL is unknown.

  • Ed Wagner (3/30/2015)


    Jeff Moden (3/29/2015)


    mark hutchinson (3/29/2015)


    Null is a something that means nothing

    I'll have to disagree with that. To me, an empty string or the number 0 is nothing. Null can have no value like nothing because it's not a value. It's a state that marks places where the value is unknown and even the possibility of something having a value may be unknown.

    I must agree with the disagreement. NULL is not nothing, nor is it true, false, zero or an empty string. NULL is unknown.

    I must disagree with your agreement, "Nothing" doesn't exist in the database, however it does exist in other languages and is quite often used to indicate a NULL.

    An example would be in an SSRS report, where a parameter which is set to "not allow NULLS", it will also not accept the value Nothing - they are synonymous in that context.

    To say the NULL is not Nothing - or indeed that it is Nothing - within the context of TSQL/DML queries has no meaning as Nothing is not defined.

    As to the latest point about correct behaviour when splitting a NULL string, my preference is that it works as Jeff has written it - If I decide to pass in a NULL, all the function should do is return a NULL - I am not asking it to filter my rows, just to split a NULLABLE string - and there is no reasonable output from that function, other than a single NULL return - is there?

    MM



    select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

  • Forum Etiquette: How to post Reporting Services problems
  • [/url]
  • Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
  • [/url]
  • How to Post Performance Problems - by Gail Shaw
  • [/url]

  • mister.magoo (3/30/2015)


    Ed Wagner (3/30/2015)


    Jeff Moden (3/29/2015)


    mark hutchinson (3/29/2015)


    Null is a something that means nothing

    I'll have to disagree with that. To me, an empty string or the number 0 is nothing. Null can have no value like nothing because it's not a value. It's a state that marks places where the value is unknown and even the possibility of something having a value may be unknown.

    I must agree with the disagreement. NULL is not nothing, nor is it true, false, zero or an empty string. NULL is unknown.

    I must disagree with your agreement, "Nothing" doesn't exist in the database, however it does exist in other languages and is quite often used to indicate a NULL.

    An example would be in an SSRS report, where a parameter which is set to "not allow NULLS", it will also not accept the value Nothing - they are synonymous in that context.

    To say the NULL is not Nothing - or indeed that it is Nothing - within the context of TSQL/DML queries has no meaning as Nothing is not defined.

    As to the latest point about correct behaviour when splitting a NULL string, my preference is that it works as Jeff has written it - If I decide to pass in a NULL, all the function should do is return a NULL - I am not asking it to filter my rows, just to split a NULLABLE string - and there is no reasonable output from that function, other than a single NULL return - is there?

    Regarding your disagreement with my agreement (:-D) you raise a good point about Nothing not being defined in SQL. In other languages, Nothing is defined, but not T-SQL. I will, however, continue to regard NULL as unknown.

    And I'm in complete agreement that the function already works as it should.

  • The forum ate my post due to a parsing error...grrr...not going to type it all again...just this:

    Set types and scalar types follow different logic and sets are not null-able, nor do they need to be!

    The idea of "null in", "null out" really does not compute when sets are involved.

    A set works with its own logic such as: union, intersect, except and joins.

    Logic tailored to fit its "structure" and "purpose" best.

    It is easier to accept it for what it is and get "null in", "null out" behavior using an empty set!

    Just use "left outer join" or "outer apply" to get the record with a null column in there when you want to.

  • mister.magoo (3/30/2015)


    Ed Wagner (3/30/2015)


    Jeff Moden (3/29/2015)


    mark hutchinson (3/29/2015)


    Null is a something that means nothing

    I'll have to disagree with that. To me, an empty string or the number 0 is nothing. Null can have no value like nothing because it's not a value. It's a state that marks places where the value is unknown and even the possibility of something having a value may be unknown.

    I must agree with the disagreement. NULL is not nothing, nor is it true, false, zero or an empty string. NULL is unknown.

    I must disagree with your agreement, "Nothing" doesn't exist in the database, however it does exist in other languages and is quite often used to indicate a NULL.

    An example would be in an SSRS report, where a parameter which is set to "not allow NULLS", it will also not accept the value Nothing - they are synonymous in that context.

    To say the NULL is not Nothing - or indeed that it is Nothing - within the context of TSQL/DML queries has no meaning as Nothing is not defined.

    As to the latest point about correct behaviour when splitting a NULL string, my preference is that it works as Jeff has written it - If I decide to pass in a NULL, all the function should do is return a NULL - I am not asking it to filter my rows, just to split a NULLABLE string - and there is no reasonable output from that function, other than a single NULL return - is there?

    I understand that many languages and products treat NULLs differently and that we have to jump whatever hoop is held up for us in that area but I'll stand by the definition of NULL not being equal to Nothing. The fact that some languages define otherwise won't change that for me but I'm flexible enough to use whatever hoop is presented. 😛

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • @jeff

    I didn't say "Null was equal to nothing". I wrote that Null was something that "means nothing".

  • mister.magoo (3/30/2015)


    As to the latest point about correct behaviour when splitting a NULL string, my preference is that it works as Jeff has written it - If I decide to pass in a NULL, all the function should do is return a NULL - I am not asking it to filter my rows, just to split a NULLABLE string - and there is no reasonable output from that function, other than a single NULL return - is there?

    The question at the end is incorrectly stated as it excludes the possibility of the correct answer. It cannot be answered with a "yes" or a "no" since there is actually no reasonable output at all from that function given a NULL input.

    But I think I am finally understanding the source of the confusion on how this function should work. I see two main areas that are tripping people up, and both deal with the nature of the input data.

    I am not asking it to filter my rows

    The first area of confusion is what the input value/data represents. The input is not a Row: it is a Table (or array or collection, etc). True, the input value might come from a row in a table, but it is not itself a row. If the data truly represented a row, then the function would only ever return a single row.

    Wanting a split function to return a single NULL row for NULL input is asking it to behave like a scalar function, yet this is a set-based function (this is what Peter just mentioned). If a split function returns a NULL row for NULL input, then it is behaving differently based on the input. Meaning, it would be acting as a set-based function (TVF) for NOT NULL input, but acting like a scalar function (UDF) for NULL input.

    It might help to put this in non-SQL terms. Consider that we have an array of strings in C#. How would it behave between having 1 or more elements, and having no elements?

    string[] _Stuffs = new string[] { "one", "two" };

    foreach (string _Thing in _Stuffs)

    {

    Console.Writeline("Value: {0}", _Thing);

    }

    The output would be:

    Value: one

    Value: two

    Now what if there were no defined elements:

    string[] _Stuffs = new string[];

    foreach (string _Thing in _Stuffs)

    {

    Console.Writeline("Value: {0}", _Thing);

    }

    The output would be:

    There is no output. The output would not be:

    Value:

    Because there is nothing to iterate over. So now back to looking at this in terms of SQL:

    CREATE TABLE #Stuffs (Thing VARCHAR(50));

    INSERT INTO #Stuffs (Thing) VALUES ('one');

    INSERT INTO #Stuffs (Thing) VALUES ('two');

    SELECT [Thing]

    FROM #Stuffs;

    The output would be:

    Thing

    -----

    one

    two

    Now what if there were no rows:

    CREATE TABLE #Stuffs (Thing VARCHAR(50));

    SELECT [Thing]

    FROM #Stuffs;

    The output would be:

    Thing

    -----

    Again, I agree that NULL means "unknown". But that is why no rows can be returned from a TVF / set-based operation since any rows equate to "known".

    just to split a NULLABLE string

    The second area of confusion is that people seem to be getting caught up in the actual datatype of the input value. Yes, in this case it is a string that is being passed in, but it doesn't need to be. This operation should be viewed as "deserializing a serialized array". The only reason that the format of the serialization matters (in this case it is a delimited string) is that it determines the physical implementation of how to deserialize it. But this operation should be viewed first in more general / conceptual terms:

    I will supply an array serialized in some fashion (delimited string, XML, binary, etc).

    I need to receive the array in its natural state.

    So what if our same table / array of strings was serialized in a binary format, maybe one that allowed for compression even? If the following returns those same two rows (i.e. "one" and "two"):

    SELECT * FROM dbo.SplitSerializedArray(0xa30f56bed72c);

    Then would we still expect the following to return a single row of NULL?

    SELECT * FROM dbo.SplitSerializedArray(NULL);

    I sure hope not.

    I hope this helps explain. And I understand that it is still a matter of preference and people will choose one way or the other, I am just saying that we should be clear that it is conceptually wrong, even if we choose to ignore that ;-).

    Take care,

    Solomon..

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • A multi-valued parameter within SSRS would have a string list that is comma separated. When nothing is selected you would get a blank string or '' not NULL and for this I would expect to have 0 rows returned for normal set based calculations.

    I am assuming here that NULL means I have not specified any selection yet, blank means I have selected nothing and if I specifically want to select a blank string then that would appear as 'a,,b' in my string.

    To return a single row with a null value does not make any sense and requires special handling whereas an empty set, as Peter mentioned, fits nicely into set handling routines.

    Or have I mis-read what this function was expected to do?

    If I have a blank that means I have selected "nothing" or 0 items and would expect 0 rows in the results.

    However, this is just how I use this kind of function so only mentioned the different way of handling blank strings in case someone else uses it in the same way. Still a great article to recommend to those who haven't quite made it into the set based way of thinking yet.

    And my understanding of NULL is that it is the "lack of a known value", if you have a known value why would you ever suggest you don't?

    If you are implying a special meaning to NULL then your mystical value should be paired with good comments or a reason why it is NULL using a status value.

  • Solomon Rutzky (3/29/2015)


    TomThomson (3/26/2015)


    Actually this is a nice illustration of a problem with SQL: there's no such thing as a NULL table (ie a table whose type we know but we don't know whether it contains any columns).

    Hmm. I think you meant to say "rows" at the end there instead of "columns". But if you did mean "columns", then how are you defining "type"?

    Yes, I meant rows.

    Although I can define a table type I'm not even allowed to cast NULL to that type. :angry:

    With the splitter we know that any rows in the table (or rowset if you prefer) that it returns can't contain a NULL string, since a NULL string has no representation.

    Agreed, but requires further qualification (more on that in the last section).

    So returning a single row containing NULL can't be misinterpreted as meaning "there's one row containing a null string", so it's reasonable to say that this function returns that single row when in an ideal world it would return a NULL table

    This concept introduces the problem of requiring interpretation of the row in the first place. Isn't that a flawed design? You get one row back and it might be valid and it might not be? Easy enough to distinguish, sure, but seems odd to require it.

    And we are talking about sets here, not scalar values. Any row returned represents a member of that set. But there is no set. If simply doing a COUNT(*) of the results you wouldn't be able to trust a result of 1 without adding a WHERE clause for Value IS NOT NULL. And this isn't a data warehouse where we might have a place-holder record to represent "Unknown" so as to avoid LEFT JOINs.

    The operation of this function is to convert a given string into a set of strings. If no string is provided, no set can be returned (or presented differently, an empty set is returned). But NULL could mean that there is something to convert? Ok, so then shouldn't the following return a row of NULLs for each column:

    DECLARE @ObjectID INT;

    SELECT * FROM sys.objects WHERE [object_id] = @ObjectID;

    So no, not reasonable.

    and when it doesn't know what the input string is (because the parameter is NULL) it can't know how many rows the resulting table would be if it did know, so returning no rows would be misleading. So I think returning the single row with a NULL in it is the right thing to do for this particular function.

    So then why not return 2 rows of NULL? Saying that a NULL input isn't specifically "nothing" because it really means "unknown" is not the right way to work with NULLs. Just look at how NULLs are handled in IF statements.

    DECLARE @ObjectID INT;

    IF (@ObjectID = 1)

    BEGIN

    PRINT 'True!';

    END;

    ELSE

    BEGIN

    PRINT 'False?';

    END;

    -- False?

    NULL doesn't really equate to False, although it does follow that path. So the IF question isn't really simplistically "IF True" as the NULL could be True. The question is really "IF known-to-be-True", and a NULL, being "unknown", cannot match that. Along those same lines, returning any rows at all requires passing the "IF known-to-exist" test, not "IF could possibly exist". So both "IF known to not exist" and "IF not known to exist" should return no rows. Certainty is required for proceeding to the "THEN" branch of an "IF" and for returning rows from a set-based operation. Hence, returning a row on NULL input (at least for this operation) is the misleading path.

    Additionally, regarding your statement that a single row of NULL "can't be misinterpreted as meaning `there's one row containing a null string` ", I would say that:

    • this isn't necessarily true (or more accurately, this doesn't need to be the case; explained in the next section)
    • that no rows returned cannot be misinterpreted considering that the only possible means of returning no rows should be by having a NULL input. And that provides a nice segue to......

    Another interesting question is what to do when the input string is empty - return an empty table (no rows), or return a single row containing a zero-length string. The splitter returns the single row in this case, so there is no case where the splitter returns an empty table. That of course means that it could use the 0 rows return to indicate a NULL table instead of an empty table, doing that wouldn't cause any problem. But I don't think that would be the right thing to do.

    What to do with an empty, NOT NULL string shouldn't be a question of returning 0 rows or 1 row. An empty string is a valid string. There is no way to determine that an empty string is no value at all (as opposed to a single empty value) but a string with just the delimiter and nothing more is two empty values. So, in the current conception and discussion, the only way to get 0 rows returned is by having a NULL input. But, this is due to dealing with the issue of splitting in a rather simplistic fashion.

    When I say that the problem of splitting is being dealt with too simplistically, I am mean that we have so far been entirely focused on "how to split" and have not really dealt with "what is being split"; we have been focused on the mechanics and not so much the concept.

    So far we have been talking about splitting a delimited string, which is, of course, what we are doing, but a delimited string of what, exactly? It was stated above that:

    • a NULL string has no representation
    • returning a single row containing NULL can't be misinterpreted as meaning "there's one row containing a null string"

    These statements assume that the collection of items being represented as a delimited string is a collection of strings. But this is certainly not always the case, and just because something is in string form now doesn't mean that it truly is a string. Sometimes we pass in a collection of numbers (whether INT or FLOAT or MONEY), sometimes it is a collection of dates (date / time / datetime), sometimes it is a collection of GUIDs, etc. So sure, a NULL string has no representation, but these other types can validly use an empty string to represent being NULL. So it probably should be allowed, via input parameter, to determine whether or not empty elements will return empty string rows or NULL rows. And if this is allowed, then a single element (i.e. no delimiter) that is empty could pass back a single, NULL row which cannot be interpreted as the input was NULL.

    Additionally, it is sometimes desirable to skip empty elements, regardless of the original entity type being a string, number, date, GUID, or whatever. If allowing for this it would then be possible to have a NOT NULL input string that returns zero rows. But introducing this ambiguity into the meaning of no rows returned should not result in returning a single NULL row as it still makes more sense to just check if the input IS NULL or IS NOT NULL.

    In either case, it seems that in order to properly handle the actual input values (i.e. what it was that ended up as the delimited string), there needs to be an option for how to handle empty elements by returning either an empty string, a NULL, or not returning a row for that element.

    At one level I agree with all that.

    Unfortunately, though, what we want is a splitter that caters for SQL's deficiencies without causing us too much trouble. In my view a significant loss of performance would be too much trouble, I don't see and decent way of amending the code to return an empty table when it has a null input string parameter without making it cease to be an inline table valued function, which would be a significant performance loss any time teh function was used. Testing outside the function to see if the first row returned contains NULL is not a significant performance loss (in some contexts it can use a case statement, which would not be possible inside the function, and in other contexts there's no problem with using "if") nor is it difficult to understand if the interface is defined return that - which it is. If ever tables become first class entities in T-SQL, so that their types can be used everywhere a type can be used, we might have to look again; but SQL treats tables as second-class citizens and clearly isn't going to change anytime in the forseeable future. Similarly if we ever had a string representation for a null string within a list of separated strings we would have to look again.

    We are stuck with choosing between two clearly invalid answers to represent a NULL table (because NULL tables are not possible in T-SQL, although they are in the real world). I think the best choice is the one that delivers good performance. The other choice, returning an empty table, is (as you are arguing and as I have said elsewhere) no less valid as things currently stand, except for the performance issue. But we might want to define a different splitter, in which either all strings were required to be terminated by a terminator (so the separator becomes a terminator), and for that the empty table return can't be used for a NULL input, because an empty table would then be a valid result from non-null parameters. Equally we might have the case where all strings require a terminator. If any of that ever happens (perhaps it won't, perhaps it will - perhaps it would even come in the form of a multi-purpose splitter, with a third parameter indicating whether zero length strings require termination or all strings or none - that's a pretty trivial extension for the current code) it would I believe be useful to use the same representation of NULL Table in both splitters - so not only does the single row with NULL offer better performance but it's probably better future-proofed too.

    Tom

  • mark hutchinson (3/30/2015)


    @Jeff

    I didn't say "Null was equal to nothing". I wrote that Null was something that "means nothing".

    Oh yes. I understood that from the git. I still disagree. Null is NOT something that means "nothing", IMHO.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • peter-757102 (3/30/2015)


    The forum ate my post due to a parsing error...grrr...not going to type it all again...just this:

    Set types and scalar types follow different logic and sets are not null-able, nor do they need to be!

    No they aren't nullable in SQL. Yes, they do need to be (but that's far from its biggest problem). Try actually working in a multi-valued logic framework instead of just paying it lip-service round the edges.

    The idea of "null in", "null out" really does not compute when sets are involved.

    The idea that if you don't know what goes in you can't say what comes out works perfectly well for all functions whose results depend on the values of their arguments - including set-valued functions with set-valued arguments.

    A set works with its own logic such as: union, intersect, except and joins.

    Logic tailored to fit its "structure" and "purpose" best.

    It is easier to accept it for what it is and get "null in", "null out" behavior using an empty set!

    Just use "left outer join" or "outer apply" to get the record with a null column in there when you want to.

    I haven't a clue why you imagine that outer join or outer apply has any relevance at all to the case where the value of the set itself is unknown. Of course that's a case that can't occur in SQL because SQL can't handle three-valued logic for anything other than comparisons of atomic values which have system-defined types and hence can't conceive of anything else being allowed not to have a known value.

    Tom

  • Jeff Moden (3/30/2015)


    I understand that many languages and products treat NULLs differently and that we have to jump whatever hoop is held up for us in that area but I'll stand by the definition of NULL not being equal to Nothing. The fact that some languages define otherwise won't change that for me but I'm flexible enough to use whatever hoop is presented. 😛

    I think maybe someone was making the observation that some languages use the reserved word NOTHING to mean pretty much what the reserved word NULL means in SQL. I can remember NILL being used for that too, but can't remember which language[s?] that was in (NILL usually meant "the invalid pointer" in the bad old days when we had pointers in our languages instead of references).

    Tom

  • TomThomson (3/30/2015)


    Jeff Moden (3/30/2015)


    I understand that many languages and products treat NULLs differently and that we have to jump whatever hoop is held up for us in that area but I'll stand by the definition of NULL not being equal to Nothing. The fact that some languages define otherwise won't change that for me but I'm flexible enough to use whatever hoop is presented. 😛

    I think maybe someone was making the observation that some languages use the reserved word NOTHING to mean pretty much what the reserved word NULL means in SQL. I can remember NILL being used for that too, but can't remember which language[s?] that was in (NILL usually meant "the invalid pointer" in the bad old days when we had pointers in our languages instead of references).

    Last one I remember is XML, xsi:nil, few of the others are Ruby, Lisp, Elixir and Objective-C where in the last one it is limited to a class pointer.

    😎

    Edit:typo

  • Viewing 15 posts - 736 through 750 (of 990 total)

    You must be logged in to reply to this topic. Login to reply