Tally OH! An Improved SQL 8K “CSV Splitter” Function

  • Dwain,

    found the cost of the #Temp inserts being the greatest actual cost variant, much stabler results using "bucket" variables, regardless of the function used.

    Did some slight changes to your original test harness including creating a "best case" as a include/index on a calculated column, think that would be close to a realistic target.

    😎

    USE tempdb;

    GO

    SET NOCOUNT ON;

    DECLARE @Delim CHAR(1) = ',';

    DECLARE @INT_BUCKET INT = 0;

    DECLARE @CHR_BUCKET VARCHAR(2000) = '';

    IF OBJECT_ID(N'tempdb..#SampleData') IS NULL

    BEGIN

    CREATE TABLE #SampleData

    (

    s_ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED

    ,s VARCHAR(8000) NULL

    ,OPT_FLAG AS (SIGN(ISNULL(CHARINDEX(',',s,1),0))) PERSISTED

    );

    CREATE NONCLUSTERED INDEX NCLIDX_SAMPLE_DATA_CALC_COL ON #SampleData (OPT_FLAG) INCLUDE (s);

    WITH Tally (n) AS

    (

    SELECT TOP (10000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

    FROM sys.all_columns a CROSS JOIN sys.all_columns b CROSS JOIN sys.all_columns ab

    )

    INSERT INTO #SampleData (s)

    SELECT CASE

    -- 70% of the strings contain no delimiter (no split required)

    WHEN rn BETWEEN 11 AND 80

    THEN REPLICATE('abcd', ABS(CHECKSUM(NEWID()))%2000)

    -- 20% of the strings contain a delimiter (split required)

    WHEN rn BETWEEN 81 AND 100

    THEN REPLICATE('abc,', ABS(CHECKSUM(NEWID()))%2000)

    -- Remaining 10% of the strings are NULL

    END

    FROM (SELECT rn=1+ABS(CHECKSUM(NEWID()))%100) a

    CROSS JOIN Tally b;

    END

    RAISERROR('-----------------------

    Don''t split strings that don''t need splitting

    -----------------------',0,0) WITH NOWAIT

    SET STATISTICS TIME ON;

    SELECT @INT_BUCKET = ItemNumber, @CHR_BUCKET = Item

    FROM #SampleData a

    CROSS APPLY

    (

    SELECT Switch=CASE WHEN s IS NULL OR CHARINDEX(@Delim, s) = 0 THEN 0 ELSE 1 END

    ) b

    CROSS APPLY

    (

    SELECT ItemNumber=1, Item=s

    WHERE Switch = 0

    UNION ALL

    SELECT ItemNumber, Item

    FROM dbo.DelimitedSplit8K(a.s, @Delim) c

    WHERE Switch <> 0

    ) c;

    SET STATISTICS TIME OFF;

    RAISERROR('-----------------------

    Run DelimitedSplit8K on all strings if they need splitting or not

    -----------------------',0,0) WITH NOWAIT

    SET STATISTICS TIME ON;

    SELECT @INT_BUCKET = ItemNumber, @CHR_BUCKET = Item

    FROM #SampleData a

    CROSS APPLY dbo.DelimitedSplit8K(a.s, @Delim) b;

    SET STATISTICS TIME OFF;

    RAISERROR('-----------------------

    CTE

    -----------------------',0,0) WITH NOWAIT

    SET STATISTICS TIME ON;

    ;WITH FLAGGED_SET AS

    (

    SELECT

    SD.s_ID

    ,SIGN(ISNULL(CHARINDEX(@Delim,SD.s,1),0)) AS OPT_FLAG

    ,SD.s

    FROM #SampleData SD

    )

    ,NO_SPLITS AS

    (

    SELECT

    FS.s_ID

    ,CONVERT(INT,1,0) AS ItemNumber

    ,FS.s AS Item

    FROM FLAGGED_SET FS

    WHERE FS.OPT_FLAG = 0

    )

    ,SPLITS AS

    (

    SELECT

    FS.s_ID

    ,X.ItemNumber

    ,X.Item

    FROM FLAGGED_SET FS

    CROSS APPLY dbo.DelimitedSplit8K(FS.s,@Delim) AS X

    WHERE FS.OPT_FLAG = 1

    )

    ,CROSS_BREEDS AS

    (

    SELECT

    NS.s_ID

    ,NS.ItemNumber

    ,NS.Item

    FROM NO_SPLITS NS

    UNION ALL

    SELECT

    S.s_ID

    ,S.ItemNumber

    ,S.Item

    FROM SPLITS S

    )

    SELECT

    @INT_BUCKET = CB.s_ID

    ,@INT_BUCKET = CB.ItemNumber

    ,@CHR_BUCKET = CB.Item

    FROM CROSS_BREEDS CB

    SET STATISTICS TIME OFF;

    RAISERROR('-----------------------

    CALC COLUMN

    -----------------------',0,0) WITH NOWAIT

    SET STATISTICS TIME ON;

    ;WITH CALCCOL_SPLIT AS

    (

    SELECT

    SD.s_ID

    ,CONVERT(INT,1,0) AS ItemNumber

    ,SD.s AS Item

    FROM #SampleData SD

    WHERE SD.OPT_FLAG = 0

    UNION ALL

    SELECT

    SD.s_ID

    ,X.ItemNumber

    ,X.Item

    FROM #SampleData SD

    CROSS APPLY dbo.DelimitedSplit8K(SD.s,@Delim) AS X

    WHERE SD.OPT_FLAG = 1

    )

    SELECT

    @INT_BUCKET = CS.s_ID

    ,@INT_BUCKET = CS.ItemNumber

    ,@CHR_BUCKET = CS.Item

    FROM CALCCOL_SPLIT CS

    SET STATISTICS TIME OFF;

    GO

    /*

    DROP TABLE #SampleData;

    */

    -----------------------

    Don't split strings that don't need splitting

    -----------------------

    SQL Server Execution Times:

    CPU time = 15569 ms, elapsed time = 15593 ms.

    -----------------------

    Run DelimitedSplit8K on all strings if they need splitting or not

    -----------------------

    SQL Server Execution Times:

    CPU time = 14835 ms, elapsed time = 15470 ms.

    -----------------------

    CTE

    -----------------------

    SQL Server Execution Times:

    CPU time = 13526 ms, elapsed time = 13519 ms.

    -----------------------

    CALC COLUMN

    -----------------------

    SQL Server Execution Times:

    CPU time = 8829 ms, elapsed time = 8847 ms.

    Server Version

    ------------------------------------------------------------------------------------

    Microsoft SQL Server 2012 - 11.0.5058.0 (X64)

    May 14 2014 18:34:29

    Copyright (c) Microsoft Corporation

    Enterprise Edition (64-bit) on Windows NT 6.1 <X64> (Build 7601: Service Pack 1)

  • If it will be helpful, I recently wrote an article about cleaning up internal spaces from a string in the Office VBA environment (VB classic). As part of my test, I created strings with random spaces between the words. It was important to be able to create these 'stuffed' strings reliably.

    http://www.experts-exchange.com/A_17559.html

  • Well isn't that spiffy, every week ending date from a single Saturday.

    SELECT CAST(DateAdd("d",Value * 7,'3/28/2014') as Date) as WE_Date FROM TallyTable

  • Solomon Rutzky (9/4/2014)


    Jeff Moden (9/3/2014)


    I guess I'd have to disagree. Nulls mean "Unknown" and it is definitely known that there are no values on either side of the space delimiter so, IMHO, empty strings are the proper thing to return. I also respect disagreement on such a controversial subject so if you want NULLs, just wrap the ITEM being returned from the function in an ISNULL.

    I guess I'd have to agree with Jeff's disagreement, but probably disagree with any need to respect disagreement as it shouldn't be a controversial subject ;-).

    First, if [space] is the delimiter and it just happens to be the only character in the string to be split, that is no different than splitting a string of ',' with a delimiter of ','; it is a set of two empty strings. A single empty string would not have any delimiter present in the string to split. And a NULL string to split would return a NULL.

    Second, this shouldn't be controversial since a NULL concatenated into a non-NULL string will (well, should) result in a NULL. This behavior can be changed via SET CONCAT_NULL_YIELDS_NULL OFF, but the MSDN documentation even says that in a future version it will always be ON. Basically, NULL does not have a string representation since an empty string is a valid string with a length of 0.

    Take care,

    Solomon...

    Sorry to embed a six-month-old quote, but the bit that caught my eye was "a NULL string to split would return a NULL".

    All the variants I've seen of Jeff's DelimitedSplit8K function return a single row, value NULL, if the string parameter passed in is NULL.

    Is this strictly the correct behaviour, or should the function return no rows in this scenario, i.e. an empty table?

    One for the logicians...

  • Gary Harding (3/26/2015)


    Solomon Rutzky (9/4/2014)


    Jeff Moden (9/3/2014)


    I guess I'd have to disagree. Nulls mean "Unknown" and it is definitely known that there are no values on either side of the space delimiter so, IMHO, empty strings are the proper thing to return. I also respect disagreement on such a controversial subject so if you want NULLs, just wrap the ITEM being returned from the function in an ISNULL.

    I guess I'd have to agree with Jeff's disagreement, but probably disagree with any need to respect disagreement as it shouldn't be a controversial subject ;-).

    First, if [space] is the delimiter and it just happens to be the only character in the string to be split, that is no different than splitting a string of ',' with a delimiter of ','; it is a set of two empty strings. A single empty string would not have any delimiter present in the string to split. And a NULL string to split would return a NULL.

    Second, this shouldn't be controversial since a NULL concatenated into a non-NULL string will (well, should) result in a NULL. This behavior can be changed via SET CONCAT_NULL_YIELDS_NULL OFF, but the MSDN documentation even says that in a future version it will always be ON. Basically, NULL does not have a string representation since an empty string is a valid string with a length of 0.

    Take care,

    Solomon...

    Sorry to embed a six-month-old quote, but the bit that caught my eye was "a NULL string to split would return a NULL".

    All the variants I've seen of Jeff's DelimitedSplit8K function return a single row, value NULL, if the string parameter passed in is NULL.

    Is this strictly the correct behaviour, or should the function return no rows in this scenario, i.e. an empty table?

    One for the logicians...

    Because a row functionally constitutes an identified sub-string in compliance with the declared separator, I would expect no rows at all with NULL as input.

    An empty set is in my view by far the best set representation of no sub-strings found and I think other behavior can and will complicate higher-level logic that depends on the results.

    Thus my expectations for edge cases are:

    NULL

    no records

    Empty string

    one record with empty sub-string

    String with only a separator character

    two records with empty sub-string

    This makes sense when you consider any not null input has always at least one sub-string, regardless of the declared separator.

  • peter-757102 (3/26/2015)


    Gary Harding (3/26/2015)


    Solomon Rutzky (9/4/2014)


    Jeff Moden (9/3/2014)


    I guess I'd have to disagree. Nulls mean "Unknown" and it is definitely known that there are no values on either side of the space delimiter so, IMHO, empty strings are the proper thing to return. I also respect disagreement on such a controversial subject so if you want NULLs, just wrap the ITEM being returned from the function in an ISNULL.

    I guess I'd have to agree with Jeff's disagreement, but probably disagree with any need to respect disagreement as it shouldn't be a controversial subject ;-).

    First, if [space] is the delimiter and it just happens to be the only character in the string to be split, that is no different than splitting a string of ',' with a delimiter of ','; it is a set of two empty strings. A single empty string would not have any delimiter present in the string to split. And a NULL string to split would return a NULL.

    Second, this shouldn't be controversial since a NULL concatenated into a non-NULL string will (well, should) result in a NULL. This behavior can be changed via SET CONCAT_NULL_YIELDS_NULL OFF, but the MSDN documentation even says that in a future version it will always be ON. Basically, NULL does not have a string representation since an empty string is a valid string with a length of 0.

    Take care,

    Solomon...

    Sorry to embed a six-month-old quote, but the bit that caught my eye was "a NULL string to split would return a NULL".

    All the variants I've seen of Jeff's DelimitedSplit8K function return a single row, value NULL, if the string parameter passed in is NULL.

    Is this strictly the correct behaviour, or should the function return no rows in this scenario, i.e. an empty table?

    One for the logicians...

    Because a row functionally constitutes an identified sub-string in compliance with the declared separator, I would expect no rows at all with NULL as input.

    An empty set is in my view by far the best set representation of no sub-strings found and I think other behavior can and will complicate higher-level logic that depends on the results.

    Thus my expectations for edge cases are:

    NULL

    no records

    Empty string

    one record with empty sub-string

    String with only a separator character

    two records with empty sub-string

    This makes sense when you consider any not null input has always at least one sub-string, regardless of the declared separator.

    Peter and Gary: I actually have no clue as to why I said that a NULL input string should return a NULL row. I agree that a NULL input should not return any row as nothing exists to split. Most likely I was just thinking in terms of a scalar function at that moment in time. I'm gonna go with that ;-). I will also update that prior post to correct that statement. Thanks for pointing it out!

    Take care,

    Solomon..

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • Solomon Rutzky (3/26/2015)


    peter-757102 (3/26/2015)


    Gary Harding (3/26/2015)


    Solomon Rutzky (9/4/2014)


    Jeff Moden (9/3/2014)


    I guess I'd have to disagree. Nulls mean "Unknown" and it is definitely known that there are no values on either side of the space delimiter so, IMHO, empty strings are the proper thing to return. I also respect disagreement on such a controversial subject so if you want NULLs, just wrap the ITEM being returned from the function in an ISNULL.

    I guess I'd have to agree with Jeff's disagreement, but probably disagree with any need to respect disagreement as it shouldn't be a controversial subject ;-).

    First, if [space] is the delimiter and it just happens to be the only character in the string to be split, that is no different than splitting a string of ',' with a delimiter of ','; it is a set of two empty strings. A single empty string would not have any delimiter present in the string to split. And a NULL string to split would return a NULL.

    Second, this shouldn't be controversial since a NULL concatenated into a non-NULL string will (well, should) result in a NULL. This behavior can be changed via SET CONCAT_NULL_YIELDS_NULL OFF, but the MSDN documentation even says that in a future version it will always be ON. Basically, NULL does not have a string representation since an empty string is a valid string with a length of 0.

    Take care,

    Solomon...

    Sorry to embed a six-month-old quote, but the bit that caught my eye was "a NULL string to split would return a NULL".

    All the variants I've seen of Jeff's DelimitedSplit8K function return a single row, value NULL, if the string parameter passed in is NULL.

    Is this strictly the correct behaviour, or should the function return no rows in this scenario, i.e. an empty table?

    One for the logicians...

    Because a row functionally constitutes an identified sub-string in compliance with the declared separator, I would expect no rows at all with NULL as input.

    An empty set is in my view by far the best set representation of no sub-strings found and I think other behavior can and will complicate higher-level logic that depends on the results.

    Thus my expectations for edge cases are:

    NULL

    no records

    Empty string

    one record with empty sub-string

    String with only a separator character

    two records with empty sub-string

    This makes sense when you consider any not null input has always at least one sub-string, regardless of the declared separator.

    Peter and Gary: I actually have no clue as to why I said that a NULL input string should return a NULL row. I agree that a NULL input should not return any row as nothing exists to split. Most likely I was just thinking in terms of a scalar function at that moment in time. I'm gonna go with that ;-). I will also update that prior post to correct that statement. Thanks for pointing it out!

    Take care,

    Solomon..

    Actually this is a nice illustration of a problem with SQL: there's no such thing as a NULL table (ie a table whose type we know but we don't know whether it contains any columns).

    Given this omission, when what one really wants to do is return a NULL table it can be quite hard to decide whether to return the single row with a single column whose value is NULL or to return nothing. In some cases it is sufficiently hard that designers decide to throw an error rather than do either. But the splitter isn't a hard case, it's a case where effectively there's a free choice.

    With the splitter we know that any rows in the table (or rowset if you prefer) that it returns can't contain a NULL string, since a NULL string has no representation. So returning a single row containing NULL can't be misinterpreted as meaning "there's one row containing a null string", so it's reasonable to say that this function returns that single row when in an ideal world it would return a NULL table - and when it doen't know what the input string is (because the parameter is NULL) it can't know how many rows the resulting table would be if it did know, so returning no rows would be misleading. So I think returning the single row with a NULL in it is the right thing to do for this particular function.

    Another interesting question is what to do when the input string is empty - return an empty table (no rows), or return a single row containing a zero-length string. The splitter returns the single row in this case, so there is no case where the splitter returns an empty table. That of course means that it could use the 0 rows return to indicate a NULL table instead of an empty table, doing that wouldn't cause any problem. But I don't think that would be the right thing to do.

    Tom

  • Solomon Rutzky (3/26/2015)


    peter-757102 (3/26/2015)


    Gary Harding (3/26/2015)


    Solomon Rutzky (9/4/2014)


    Jeff Moden (9/3/2014)


    I guess I'd have to disagree. Nulls mean "Unknown" and it is definitely known that there are no values on either side of the space delimiter so, IMHO, empty strings are the proper thing to return. I also respect disagreement on such a controversial subject so if you want NULLs, just wrap the ITEM being returned from the function in an ISNULL.

    I guess I'd have to agree with Jeff's disagreement, but probably disagree with any need to respect disagreement as it shouldn't be a controversial subject ;-).

    First, if [space] is the delimiter and it just happens to be the only character in the string to be split, that is no different than splitting a string of ',' with a delimiter of ','; it is a set of two empty strings. A single empty string would not have any delimiter present in the string to split. And a NULL string to split would return a NULL.

    Second, this shouldn't be controversial since a NULL concatenated into a non-NULL string will (well, should) result in a NULL. This behavior can be changed via SET CONCAT_NULL_YIELDS_NULL OFF, but the MSDN documentation even says that in a future version it will always be ON. Basically, NULL does not have a string representation since an empty string is a valid string with a length of 0.

    Take care,

    Solomon...

    Sorry to embed a six-month-old quote, but the bit that caught my eye was "a NULL string to split would return a NULL".

    All the variants I've seen of Jeff's DelimitedSplit8K function return a single row, value NULL, if the string parameter passed in is NULL.

    Is this strictly the correct behaviour, or should the function return no rows in this scenario, i.e. an empty table?

    One for the logicians...

    Because a row functionally constitutes an identified sub-string in compliance with the declared separator, I would expect no rows at all with NULL as input.

    An empty set is in my view by far the best set representation of no sub-strings found and I think other behavior can and will complicate higher-level logic that depends on the results.

    Thus my expectations for edge cases are:

    NULL

    no records

    Empty string

    one record with empty sub-string

    String with only a separator character

    two records with empty sub-string

    This makes sense when you consider any not null input has always at least one sub-string, regardless of the declared separator.

    Peter and Gary: I actually have no clue as to why I said that a NULL input string should return a NULL row. I agree that a NULL input should not return any row as nothing exists to split. Most likely I was just thinking in terms of a scalar function at that moment in time. I'm gonna go with that ;-). I will also update that prior post to correct that statement. Thanks for pointing it out!

    Take care,

    Solomon..

    I've wondered about the logic behind the behavior of returning one row with a NULL Item for a NULL string. Perhaps adding a flag to the FUNCTION call to allow it to alternate behaviors might be the ticket if performance doesn't take too much of a hit.

    Certainly using the short-circuit technique I offered above would allow that degree of control.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • My recommendation would be to do those kinds of checks externally rather than adding them to the function. Changes for convenience will slow things down when it's not convenient. It's built to do just one thing as fast as possible.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (3/26/2015)


    My recommendation would be to do those kinds of checks externally rather than adding them to the function. Changes for convenience will slow things down when it's not convenient. It's built to do just one thing as fast as possible.

    I have to agree. There's no logical reason to choose a behaviour other than the current one (as I said before, looking at the requirements and the "science" and the deficiencies of SQL leaves effectively a free choice of behaviour for the NULL parameter case) and there's a very good performance reason not to change that behaviour.

    From my personal point of view, it's a bonus that the current behaviour seems more logical to me wearing my scientst hat than the alternatives do. So my engineer instinct and my science instinct agree (and I hate it when they don't :-)).

    Tom

  • TomThomson (3/27/2015)


    Jeff Moden (3/26/2015)


    My recommendation would be to do those kinds of checks externally rather than adding them to the function. Changes for convenience will slow things down when it's not convenient. It's built to do just one thing as fast as possible.

    I have to agree. There's no logical reason to choose a behaviour other than the current one (as I said before, looking at the requirements and the "science" and the deficiencies of SQL leaves effectively a free choice of behaviour for the NULL parameter case) and there's a very good performance reason not to change that behaviour.

    From my personal point of view, it's a bonus that the current behaviour seems more logical to me wearing my scientst hat than the alternatives do. So my engineer instinct and my science instinct agree (and I hate it when they don't :-)).

    I agree 100%. A NULL value still exists in a row. Not only does the function make sense to me with its current behavior, but the performance impact of implementing that parameter simply isn't worth the flexibility it provides. You can always put a WHERE clause on your calling query when you need it.

  • TomThomson (3/26/2015)


    Solomon Rutzky (3/26/2015)


    peter-757102 (3/26/2015)


    Because a row functionally constitutes an identified sub-string in compliance with the declared separator, I would expect no rows at all with NULL as input.

    An empty set is in my view by far the best set representation of no sub-strings found and I think other behavior can and will complicate higher-level logic that depends on the results.

    Thus my expectations for edge cases are:

    NULL

    no records

    Empty string

    one record with empty sub-string

    String with only a separator character

    two records with empty sub-string

    This makes sense when you consider any not null input has always at least one sub-string, regardless of the declared separator.

    Peter and Gary: I actually have no clue as to why I said that a NULL input string should return a NULL row. I agree that a NULL input should not return any row as nothing exists to split. Most likely I was just thinking in terms of a scalar function at that moment in time. I'm gonna go with that ;-). I will also update that prior post to correct that statement. Thanks for pointing it out!

    Take care,

    Solomon..

    Actually this is a nice illustration of a problem with SQL: there's no such thing as a NULL table (ie a table whose type we know but we don't know whether it contains any columns).

    Hmm. I think you meant to say "rows" at the end there instead of "columns". But if you did mean "columns", then how are you defining "type"?

    With the splitter we know that any rows in the table (or rowset if you prefer) that it returns can't contain a NULL string, since a NULL string has no representation.

    Agreed, but requires further qualification (more on that in the last section).

    So returning a single row containing NULL can't be misinterpreted as meaning "there's one row containing a null string", so it's reasonable to say that this function returns that single row when in an ideal world it would return a NULL table

    This concept introduces the problem of requiring interpretation of the row in the first place. Isn't that a flawed design? You get one row back and it might be valid and it might not be? Easy enough to distinguish, sure, but seems odd to require it.

    And we are talking about sets here, not scalar values. Any row returned represents a member of that set. But there is no set. If simply doing a COUNT(*) of the results you wouldn't be able to trust a result of 1 without adding a WHERE clause for Value IS NOT NULL. And this isn't a data warehouse where we might have a place-holder record to represent "Unknown" so as to avoid LEFT JOINs.

    The operation of this function is to convert a given string into a set of strings. If no string is provided, no set can be returned (or presented differently, an empty set is returned). But NULL could mean that there is something to convert? Ok, so then shouldn't the following return a row of NULLs for each column:

    DECLARE @ObjectID INT;

    SELECT * FROM sys.objects WHERE [object_id] = @ObjectID;

    So no, not reasonable.

    and when it doesn't know what the input string is (because the parameter is NULL) it can't know how many rows the resulting table would be if it did know, so returning no rows would be misleading. So I think returning the single row with a NULL in it is the right thing to do for this particular function.

    So then why not return 2 rows of NULL? Saying that a NULL input isn't specifically "nothing" because it really means "unknown" is not the right way to work with NULLs. Just look at how NULLs are handled in IF statements.

    DECLARE @ObjectID INT;

    IF (@ObjectID = 1)

    BEGIN

    PRINT 'True!';

    END;

    ELSE

    BEGIN

    PRINT 'False?';

    END;

    -- False?

    NULL doesn't really equate to False, although it does follow that path. So the IF question isn't really simplistically "IF True" as the NULL could be True. The question is really "IF known-to-be-True", and a NULL, being "unknown", cannot match that. Along those same lines, returning any rows at all requires passing the "IF known-to-exist" test, not "IF could possibly exist". So both "IF known to not exist" and "IF not known to exist" should return no rows. Certainty is required for proceeding to the "THEN" branch of an "IF" and for returning rows from a set-based operation. Hence, returning a row on NULL input (at least for this operation) is the misleading path.

    Additionally, regarding your statement that a single row of NULL "can't be misinterpreted as meaning `there's one row containing a null string` ", I would say that:

    • this isn't necessarily true (or more accurately, this doesn't need to be the case; explained in the next section)
    • that no rows returned cannot be misinterpreted considering that the only possible means of returning no rows should be by having a NULL input. And that provides a nice segue to......

    Another interesting question is what to do when the input string is empty - return an empty table (no rows), or return a single row containing a zero-length string. The splitter returns the single row in this case, so there is no case where the splitter returns an empty table. That of course means that it could use the 0 rows return to indicate a NULL table instead of an empty table, doing that wouldn't cause any problem. But I don't think that would be the right thing to do.

    What to do with an empty, NOT NULL string shouldn't be a question of returning 0 rows or 1 row. An empty string is a valid string. There is no way to determine that an empty string is no value at all (as opposed to a single empty value) but a string with just the delimiter and nothing more is two empty values. So, in the current conception and discussion, the only way to get 0 rows returned is by having a NULL input. But, this is due to dealing with the issue of splitting in a rather simplistic fashion.

    When I say that the problem of splitting is being dealt with too simplistically, I am mean that we have so far been entirely focused on "how to split" and have not really dealt with "what is being split"; we have been focused on the mechanics and not so much the concept.

    So far we have been talking about splitting a delimited string, which is, of course, what we are doing, but a delimited string of what, exactly? It was stated above that:

    • a NULL string has no representation
    • returning a single row containing NULL can't be misinterpreted as meaning "there's one row containing a null string"

    These statements assume that the collection of items being represented as a delimited string is a collection of strings. But this is certainly not always the case, and just because something is in string form now doesn't mean that it truly is a string. Sometimes we pass in a collection of numbers (whether INT or FLOAT or MONEY), sometimes it is a collection of dates (date / time / datetime), sometimes it is a collection of GUIDs, etc. So sure, a NULL string has no representation, but these other types can validly use an empty string to represent being NULL. So it probably should be allowed, via input parameter, to determine whether or not empty elements will return empty string rows or NULL rows. And if this is allowed, then a single element (i.e. no delimiter) that is empty could pass back a single, NULL row which cannot be interpreted as the input was NULL.

    Additionally, it is sometimes desirable to skip empty elements, regardless of the original entity type being a string, number, date, GUID, or whatever. If allowing for this it would then be possible to have a NOT NULL input string that returns zero rows. But introducing this ambiguity into the meaning of no rows returned should not result in returning a single NULL row as it still makes more sense to just check if the input IS NULL or IS NOT NULL.

    In either case, it seems that in order to properly handle the actual input values (i.e. what it was that ended up as the delimited string), there needs to be an option for how to handle empty elements by returning either an empty string, a NULL, or not returning a row for that element.

    Take care,

    Solomon..

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • Heh... to quote someone far wise than I, "NULL is not nothing".

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Null is a something that means nothing

  • Very nice and well explained article, I enjoyed reading it and just wanted to add my findings which include handling the case where the string to be parsed is blank. In this instance I would expect to get no rows back but the solution in this article seems to return a row with a blank string.

    Here's the version I've been using for a while now and it seems to match performance with yours with the included handling of a blank source string.

    I also use the values method of a row generator/tally table as I don't need to support prior to 2008 and it makes it more readable IMO.

    ;with e(n) as ( select 0 from (values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) T(n))

    ,rg(n) as (select 0 from e,e e2,e e3,e e4)

    select Item=substring(@pString,n,isnull(nullif(CHARINDEX(@pDelimiter,@pString,n),0),len(@pString)+1)-n)

    from (

    select n=1 where len(@pString) > 0

    union all

    select n+1 from (select top(len(@pString)) n=ROW_NUMBER() over (order by (select null)) from rg) T where SUBSTRING(@pString,n,1) = @pDelimiter

    ) StartPositions

Viewing 15 posts - 721 through 735 (of 990 total)

You must be logged in to reply to this topic. Login to reply