Tally OH! An Improved SQL 8K “CSV Splitter” Function

Question

Tally OH! An Improved SQL 8K “CSV Splitter” Function

Viewing 15 posts - 721 through 735 (of 990 total)

You must be logged in to reply to this topic. Login to reply

Eirikur Eiriksson SSC Guru Points: 182882 More actions · Answer 1

Dwain,

found the cost of the #Temp inserts being the greatest actual cost variant, much stabler results using "bucket" variables, regardless of the function used.

Did some slight changes to your original test harness including creating a "best case" as a include/index on a calculated column, think that would be close to a realistic target.

😎

USE tempdb;

GO

SET NOCOUNT ON;

DECLARE @Delim CHAR(1) = ',';

DECLARE @INT_BUCKET INT = 0;

DECLARE @CHR_BUCKET VARCHAR(2000) = '';

IF OBJECT_ID(N'tempdb..#SampleData') IS NULL

BEGIN

CREATE TABLE #SampleData

(

s_ID INT IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED

,s VARCHAR(8000) NULL

,OPT_FLAG AS (SIGN(ISNULL(CHARINDEX(',',s,1),0))) PERSISTED

);

CREATE NONCLUSTERED INDEX NCLIDX_SAMPLE_DATA_CALC_COL ON #SampleData (OPT_FLAG) INCLUDE (s);

WITH Tally (n) AS

(

SELECT TOP (10000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

FROM sys.all_columns a CROSS JOIN sys.all_columns b CROSS JOIN sys.all_columns ab

)

INSERT INTO #SampleData (s)

SELECT CASE

-- 70% of the strings contain no delimiter (no split required)

WHEN rn BETWEEN 11 AND 80

THEN REPLICATE('abcd', ABS(CHECKSUM(NEWID()))%2000)

-- 20% of the strings contain a delimiter (split required)

WHEN rn BETWEEN 81 AND 100

THEN REPLICATE('abc,', ABS(CHECKSUM(NEWID()))%2000)

-- Remaining 10% of the strings are NULL

END

FROM (SELECT rn=1+ABS(CHECKSUM(NEWID()))%100) a

CROSS JOIN Tally b;

END

RAISERROR('-----------------------

Don''t split strings that don''t need splitting

-----------------------',0,0) WITH NOWAIT

SET STATISTICS TIME ON;

SELECT @INT_BUCKET = ItemNumber, @CHR_BUCKET = Item

FROM #SampleData a

CROSS APPLY

(

SELECT Switch=CASE WHEN s IS NULL OR CHARINDEX(@Delim, s) = 0 THEN 0 ELSE 1 END

) b

CROSS APPLY

(

SELECT ItemNumber=1, Item=s

WHERE Switch = 0

UNION ALL

SELECT ItemNumber, Item

FROM dbo.DelimitedSplit8K(a.s, @Delim) c

WHERE Switch <> 0

) c;

SET STATISTICS TIME OFF;

RAISERROR('-----------------------

Run DelimitedSplit8K on all strings if they need splitting or not

-----------------------',0,0) WITH NOWAIT

SET STATISTICS TIME ON;

SELECT @INT_BUCKET = ItemNumber, @CHR_BUCKET = Item

FROM #SampleData a

CROSS APPLY dbo.DelimitedSplit8K(a.s, @Delim) b;

SET STATISTICS TIME OFF;

RAISERROR('-----------------------

CTE

-----------------------',0,0) WITH NOWAIT

SET STATISTICS TIME ON;

;WITH FLAGGED_SET AS

(

SELECT

SD.s_ID

,SIGN(ISNULL(CHARINDEX(@Delim,SD.s,1),0)) AS OPT_FLAG

,SD.s

FROM #SampleData SD

)

,NO_SPLITS AS

(

SELECT

FS.s_ID

,CONVERT(INT,1,0) AS ItemNumber

,FS.s AS Item

FROM FLAGGED_SET FS

WHERE FS.OPT_FLAG = 0

)

,SPLITS AS

(

SELECT

FS.s_ID

,X.ItemNumber

,X.Item

FROM FLAGGED_SET FS

CROSS APPLY dbo.DelimitedSplit8K(FS.s,@Delim) AS X

WHERE FS.OPT_FLAG = 1

)

,CROSS_BREEDS AS

(

SELECT

NS.s_ID

,NS.ItemNumber

,NS.Item

FROM NO_SPLITS NS

UNION ALL

SELECT

S.s_ID

,S.ItemNumber

,S.Item

FROM SPLITS S

)

SELECT

@INT_BUCKET = CB.s_ID

,@INT_BUCKET = CB.ItemNumber

,@CHR_BUCKET = CB.Item

FROM CROSS_BREEDS CB

SET STATISTICS TIME OFF;

RAISERROR('-----------------------

CALC COLUMN

-----------------------',0,0) WITH NOWAIT

SET STATISTICS TIME ON;

;WITH CALCCOL_SPLIT AS

(

SELECT

SD.s_ID

,CONVERT(INT,1,0) AS ItemNumber

,SD.s AS Item

FROM #SampleData SD

WHERE SD.OPT_FLAG = 0

UNION ALL

SELECT

SD.s_ID

,X.ItemNumber

,X.Item

FROM #SampleData SD

CROSS APPLY dbo.DelimitedSplit8K(SD.s,@Delim) AS X

WHERE SD.OPT_FLAG = 1

)

SELECT

@INT_BUCKET = CS.s_ID

,@INT_BUCKET = CS.ItemNumber

,@CHR_BUCKET = CS.Item

FROM CALCCOL_SPLIT CS

SET STATISTICS TIME OFF;

GO

/*

DROP TABLE #SampleData;

*/

-----------------------

Don't split strings that don't need splitting

-----------------------

SQL Server Execution Times:

CPU time = 15569 ms, elapsed time = 15593 ms.

-----------------------

Run DelimitedSplit8K on all strings if they need splitting or not

-----------------------

SQL Server Execution Times:

CPU time = 14835 ms, elapsed time = 15470 ms.

-----------------------

CTE

-----------------------

SQL Server Execution Times:

CPU time = 13526 ms, elapsed time = 13519 ms.

-----------------------

CALC COLUMN

-----------------------

SQL Server Execution Times:

CPU time = 8829 ms, elapsed time = 8847 ms.

Server Version

------------------------------------------------------------------------------------

Microsoft SQL Server 2012 - 11.0.5058.0 (X64)

May 14 2014 18:34:29

Copyright (c) Microsoft Corporation

Enterprise Edition (64-bit) on Windows NT 6.1 <X64> (Build 7601: Service Pack 1)

mark hutchinson SSC Eights! Points: 878 More actions · Answer 2

If it will be helpful, I recently wrote an article about cleaning up internal spaces from a string in the Office VBA environment (VB classic). As part of my test, I created strings with random spaces between the words. It was important to be able to create these 'stuffed' strings reliably.

http://www.experts-exchange.com/A_17559.html

david.c.holley SSCrazy Points: 2423 More actions · Answer 3

Well isn't that spiffy, every week ending date from a single Saturday.

SELECT CAST(DateAdd("d",Value * 7,'3/28/2014') as Date) as WE_Date FROM TallyTable

Sporidium SSCommitted Points: 1550 More actions · Answer 4

Solomon Rutzky (9/4/2014)
Jeff Moden (9/3/2014)
I guess I'd have to disagree. Nulls mean "Unknown" and it is definitely known that there are no values on either side of the space delimiter so, IMHO, empty strings are the proper thing to return. I also respect disagreement on such a controversial subject so if you want NULLs, just wrap the ITEM being returned from the function in an ISNULL.
I guess I'd have to agree with Jeff's disagreement, but probably disagree with any need to respect disagreement as it shouldn't be a controversial subject ;-).
First, if [space] is the delimiter and it just happens to be the only character in the string to be split, that is no different than splitting a string of ',' with a delimiter of ','; it is a set of two empty strings. A single empty string would not have any delimiter present in the string to split. And a NULL string to split would return a NULL.
Second, this shouldn't be controversial since a NULL concatenated into a non-NULL string will (well, should) result in a NULL. This behavior can be changed via SET CONCAT_NULL_YIELDS_NULL OFF, but the MSDN documentation even says that in a future version it will always be ON. Basically, NULL does not have a string representation since an empty string is a valid string with a length of 0.
Take care,
Solomon...

Sorry to embed a six-month-old quote, but the bit that caught my eye was "a NULL string to split would return a NULL".

All the variants I've seen of Jeff's DelimitedSplit8K function return a single row, value NULL, if the string parameter passed in is NULL.

Is this strictly the correct behaviour, or should the function return no rows in this scenario, i.e. an empty table?

One for the logicians...

peter-757102 SSCertifiable Points: 6877 More actions · Answer 5

Gary Harding (3/26/2015)
Solomon Rutzky (9/4/2014)
Jeff Moden (9/3/2014)
I guess I'd have to disagree. Nulls mean "Unknown" and it is definitely known that there are no values on either side of the space delimiter so, IMHO, empty strings are the proper thing to return. I also respect disagreement on such a controversial subject so if you want NULLs, just wrap the ITEM being returned from the function in an ISNULL.
I guess I'd have to agree with Jeff's disagreement, but probably disagree with any need to respect disagreement as it shouldn't be a controversial subject ;-).
First, if [space] is the delimiter and it just happens to be the only character in the string to be split, that is no different than splitting a string of ',' with a delimiter of ','; it is a set of two empty strings. A single empty string would not have any delimiter present in the string to split. And a NULL string to split would return a NULL.
Second, this shouldn't be controversial since a NULL concatenated into a non-NULL string will (well, should) result in a NULL. This behavior can be changed via SET CONCAT_NULL_YIELDS_NULL OFF, but the MSDN documentation even says that in a future version it will always be ON. Basically, NULL does not have a string representation since an empty string is a valid string with a length of 0.
Take care,
Solomon...
Sorry to embed a six-month-old quote, but the bit that caught my eye was "a NULL string to split would return a NULL".
All the variants I've seen of Jeff's DelimitedSplit8K function return a single row, value NULL, if the string parameter passed in is NULL.
Is this strictly the correct behaviour, or should the function return no rows in this scenario, i.e. an empty table?
One for the logicians...

Because a row functionally constitutes an identified sub-string in compliance with the declared separator, I would expect no rows at all with NULL as input.

An empty set is in my view by far the best set representation of no sub-strings found and I think other behavior can and will complicate higher-level logic that depends on the results.

Thus my expectations for edge cases are:

NULL

no records

Empty string

one record with empty sub-string

String with only a separator character

two records with empty sub-string

This makes sense when you consider any not null input has always at least one sub-string, regardless of the declared separator.

Solomon Rutzky SSCoach Points: 16430 More actions · Answer 6

peter-757102 (3/26/2015)
Gary Harding (3/26/2015)
Solomon Rutzky (9/4/2014)
Jeff Moden (9/3/2014)
I guess I'd have to disagree. Nulls mean "Unknown" and it is definitely known that there are no values on either side of the space delimiter so, IMHO, empty strings are the proper thing to return. I also respect disagreement on such a controversial subject so if you want NULLs, just wrap the ITEM being returned from the function in an ISNULL.
I guess I'd have to agree with Jeff's disagreement, but probably disagree with any need to respect disagreement as it shouldn't be a controversial subject ;-).
First, if [space] is the delimiter and it just happens to be the only character in the string to be split, that is no different than splitting a string of ',' with a delimiter of ','; it is a set of two empty strings. A single empty string would not have any delimiter present in the string to split. And a NULL string to split would return a NULL.
Second, this shouldn't be controversial since a NULL concatenated into a non-NULL string will (well, should) result in a NULL. This behavior can be changed via SET CONCAT_NULL_YIELDS_NULL OFF, but the MSDN documentation even says that in a future version it will always be ON. Basically, NULL does not have a string representation since an empty string is a valid string with a length of 0.
Take care,
Solomon...
Sorry to embed a six-month-old quote, but the bit that caught my eye was "a NULL string to split would return a NULL".
All the variants I've seen of Jeff's DelimitedSplit8K function return a single row, value NULL, if the string parameter passed in is NULL.
Is this strictly the correct behaviour, or should the function return no rows in this scenario, i.e. an empty table?
One for the logicians...
Because a row functionally constitutes an identified sub-string in compliance with the declared separator, I would expect no rows at all with NULL as input.
An empty set is in my view by far the best set representation of no sub-strings found and I think other behavior can and will complicate higher-level logic that depends on the results.
Thus my expectations for edge cases are:
NULL
no records
Empty string
one record with empty sub-string
String with only a separator character
two records with empty sub-string
This makes sense when you consider any not null input has always at least one sub-string, regardless of the declared separator.

Peter and Gary: I actually have no clue as to why I said that a NULL input string should return a NULL row. I agree that a NULL input should not return any row as nothing exists to split. Most likely I was just thinking in terms of a scalar function at that moment in time. I'm gonna go with that ;-). I will also update that prior post to correct that statement. Thanks for pointing it out!

Take care,

Solomon..

SQL# — https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
Sql Quantum Lift — https://SqlQuantumLift.com/ ( company )
Sql Quantum Leap — https://SqlQuantumLeap.com/ ( blog )
Info sites — Collations • Module Signing • SQLCLR

TomThomson SSC Guru Points: 104773 More actions · Answer 7

Solomon Rutzky (3/26/2015)
peter-757102 (3/26/2015)
Gary Harding (3/26/2015)
Solomon Rutzky (9/4/2014)
Jeff Moden (9/3/2014)
I guess I'd have to disagree. Nulls mean "Unknown" and it is definitely known that there are no values on either side of the space delimiter so, IMHO, empty strings are the proper thing to return. I also respect disagreement on such a controversial subject so if you want NULLs, just wrap the ITEM being returned from the function in an ISNULL.
I guess I'd have to agree with Jeff's disagreement, but probably disagree with any need to respect disagreement as it shouldn't be a controversial subject ;-).
First, if [space] is the delimiter and it just happens to be the only character in the string to be split, that is no different than splitting a string of ',' with a delimiter of ','; it is a set of two empty strings. A single empty string would not have any delimiter present in the string to split. And a NULL string to split would return a NULL.
Second, this shouldn't be controversial since a NULL concatenated into a non-NULL string will (well, should) result in a NULL. This behavior can be changed via SET CONCAT_NULL_YIELDS_NULL OFF, but the MSDN documentation even says that in a future version it will always be ON. Basically, NULL does not have a string representation since an empty string is a valid string with a length of 0.
Take care,
Solomon...
Sorry to embed a six-month-old quote, but the bit that caught my eye was "a NULL string to split would return a NULL".
All the variants I've seen of Jeff's DelimitedSplit8K function return a single row, value NULL, if the string parameter passed in is NULL.
Is this strictly the correct behaviour, or should the function return no rows in this scenario, i.e. an empty table?
One for the logicians...
Because a row functionally constitutes an identified sub-string in compliance with the declared separator, I would expect no rows at all with NULL as input.
An empty set is in my view by far the best set representation of no sub-strings found and I think other behavior can and will complicate higher-level logic that depends on the results.
Thus my expectations for edge cases are:
NULL
no records
Empty string
one record with empty sub-string
String with only a separator character
two records with empty sub-string
This makes sense when you consider any not null input has always at least one sub-string, regardless of the declared separator.
Peter and Gary: I actually have no clue as to why I said that a NULL input string should return a NULL row. I agree that a NULL input should not return any row as nothing exists to split. Most likely I was just thinking in terms of a scalar function at that moment in time. I'm gonna go with that ;-). I will also update that prior post to correct that statement. Thanks for pointing it out!
Take care,
Solomon..

Actually this is a nice illustration of a problem with SQL: there's no such thing as a NULL table (ie a table whose type we know but we don't know whether it contains any columns).

Given this omission, when what one really wants to do is return a NULL table it can be quite hard to decide whether to return the single row with a single column whose value is NULL or to return nothing. In some cases it is sufficiently hard that designers decide to throw an error rather than do either. But the splitter isn't a hard case, it's a case where effectively there's a free choice.

With the splitter we know that any rows in the table (or rowset if you prefer) that it returns can't contain a NULL string, since a NULL string has no representation. So returning a single row containing NULL can't be misinterpreted as meaning "there's one row containing a null string", so it's reasonable to say that this function returns that single row when in an ideal world it would return a NULL table - and when it doen't know what the input string is (because the parameter is NULL) it can't know how many rows the resulting table would be if it did know, so returning no rows would be misleading. So I think returning the single row with a NULL in it is the right thing to do for this particular function.

Another interesting question is what to do when the input string is empty - return an empty table (no rows), or return a single row containing a zero-length string. The splitter returns the single row in this case, so there is no case where the splitter returns an empty table. That of course means that it could use the 0 rows return to indicate a NULL table instead of an empty table, doing that wouldn't cause any problem. But I don't think that would be the right thing to do.

Tom

Dwain Camps SSC Guru Points: 86908 More actions · Answer 8

Solomon Rutzky (3/26/2015)
peter-757102 (3/26/2015)
Gary Harding (3/26/2015)
Solomon Rutzky (9/4/2014)
Jeff Moden (9/3/2014)
I guess I'd have to disagree. Nulls mean "Unknown" and it is definitely known that there are no values on either side of the space delimiter so, IMHO, empty strings are the proper thing to return. I also respect disagreement on such a controversial subject so if you want NULLs, just wrap the ITEM being returned from the function in an ISNULL.
I guess I'd have to agree with Jeff's disagreement, but probably disagree with any need to respect disagreement as it shouldn't be a controversial subject ;-).
First, if [space] is the delimiter and it just happens to be the only character in the string to be split, that is no different than splitting a string of ',' with a delimiter of ','; it is a set of two empty strings. A single empty string would not have any delimiter present in the string to split. And a NULL string to split would return a NULL.
Second, this shouldn't be controversial since a NULL concatenated into a non-NULL string will (well, should) result in a NULL. This behavior can be changed via SET CONCAT_NULL_YIELDS_NULL OFF, but the MSDN documentation even says that in a future version it will always be ON. Basically, NULL does not have a string representation since an empty string is a valid string with a length of 0.
Take care,
Solomon...
Sorry to embed a six-month-old quote, but the bit that caught my eye was "a NULL string to split would return a NULL".
All the variants I've seen of Jeff's DelimitedSplit8K function return a single row, value NULL, if the string parameter passed in is NULL.
Is this strictly the correct behaviour, or should the function return no rows in this scenario, i.e. an empty table?
One for the logicians...
Because a row functionally constitutes an identified sub-string in compliance with the declared separator, I would expect no rows at all with NULL as input.
An empty set is in my view by far the best set representation of no sub-strings found and I think other behavior can and will complicate higher-level logic that depends on the results.
Thus my expectations for edge cases are:
NULL
no records
Empty string
one record with empty sub-string
String with only a separator character
two records with empty sub-string
This makes sense when you consider any not null input has always at least one sub-string, regardless of the declared separator.
Peter and Gary: I actually have no clue as to why I said that a NULL input string should return a NULL row. I agree that a NULL input should not return any row as nothing exists to split. Most likely I was just thinking in terms of a scalar function at that moment in time. I'm gonna go with that ;-). I will also update that prior post to correct that statement. Thanks for pointing it out!
Take care,
Solomon..

I've wondered about the logic behind the behavior of returning one row with a NULL Item for a NULL string. Perhaps adding a flag to the FUNCTION call to allow it to alternate behaviors might be the ticket if performance doesn't take too much of a hit.

Certainly using the short-circuit technique I offered above would allow that degree of control.

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St

Jeff Moden SSC Guru Points: 1003865 More actions · Answer 9

My recommendation would be to do those kinds of checks externally rather than adding them to the function. Changes for convenience will slow things down when it's not convenient. It's built to do just one thing as fast as possible.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

TomThomson SSC Guru Points: 104773 More actions · Answer 10

Jeff Moden (3/26/2015)
My recommendation would be to do those kinds of checks externally rather than adding them to the function. Changes for convenience will slow things down when it's not convenient. It's built to do just one thing as fast as possible.

I have to agree. There's no logical reason to choose a behaviour other than the current one (as I said before, looking at the requirements and the "science" and the deficiencies of SQL leaves effectively a free choice of behaviour for the NULL parameter case) and there's a very good performance reason not to change that behaviour.

From my personal point of view, it's a bonus that the current behaviour seems more logical to me wearing my scientst hat than the alternatives do. So my engineer instinct and my science instinct agree (and I hate it when they don't :-)).

Tom

Ed Wagner SSC Guru Points: 287024 More actions · Answer 11

TomThomson (3/27/2015)
Jeff Moden (3/26/2015)
My recommendation would be to do those kinds of checks externally rather than adding them to the function. Changes for convenience will slow things down when it's not convenient. It's built to do just one thing as fast as possible.
I have to agree. There's no logical reason to choose a behaviour other than the current one (as I said before, looking at the requirements and the "science" and the deficiencies of SQL leaves effectively a free choice of behaviour for the NULL parameter case) and there's a very good performance reason not to change that behaviour.
From my personal point of view, it's a bonus that the current behaviour seems more logical to me wearing my scientst hat than the alternatives do. So my engineer instinct and my science instinct agree (and I hate it when they don't :-)).

I agree 100%. A NULL value still exists in a row. Not only does the function make sense to me with its current behavior, but the performance impact of implementing that parameter simply isn't worth the flexibility it provides. You can always put a WHERE clause on your calling query when you need it.

Tally Tables - Performance Personified
String Splitting with True Performance
Best practices on how to ask questions

Solomon Rutzky SSCoach Points: 16430 More actions · Answer 12

TomThomson (3/26/2015)
Solomon Rutzky (3/26/2015)
peter-757102 (3/26/2015)
Because a row functionally constitutes an identified sub-string in compliance with the declared separator, I would expect no rows at all with NULL as input.
An empty set is in my view by far the best set representation of no sub-strings found and I think other behavior can and will complicate higher-level logic that depends on the results.
Thus my expectations for edge cases are:
NULL
no records
Empty string
one record with empty sub-string
String with only a separator character
two records with empty sub-string
This makes sense when you consider any not null input has always at least one sub-string, regardless of the declared separator.
Peter and Gary: I actually have no clue as to why I said that a NULL input string should return a NULL row. I agree that a NULL input should not return any row as nothing exists to split. Most likely I was just thinking in terms of a scalar function at that moment in time. I'm gonna go with that ;-). I will also update that prior post to correct that statement. Thanks for pointing it out!
Take care,
Solomon..
Actually this is a nice illustration of a problem with SQL: there's no such thing as a NULL table (ie a table whose type we know but we don't know whether it contains any columns).

Hmm. I think you meant to say "rows" at the end there instead of "columns". But if you did mean "columns", then how are you defining "type"?

With the splitter we know that any rows in the table (or rowset if you prefer) that it returns can't contain a NULL string, since a NULL string has no representation.

Agreed, but requires further qualification (more on that in the last section).

So returning a single row containing NULL can't be misinterpreted as meaning "there's one row containing a null string", so it's reasonable to say that this function returns that single row when in an ideal world it would return a NULL table

This concept introduces the problem of requiring interpretation of the row in the first place. Isn't that a flawed design? You get one row back and it might be valid and it might not be? Easy enough to distinguish, sure, but seems odd to require it.

And we are talking about sets here, not scalar values. Any row returned represents a member of that set. But there is no set. If simply doing a COUNT(*) of the results you wouldn't be able to trust a result of 1 without adding a WHERE clause for Value IS NOT NULL. And this isn't a data warehouse where we might have a place-holder record to represent "Unknown" so as to avoid LEFT JOINs.

The operation of this function is to convert a given string into a set of strings. If no string is provided, no set can be returned (or presented differently, an empty set is returned). But NULL could mean that there is something to convert? Ok, so then shouldn't the following return a row of NULLs for each column:

DECLARE @ObjectID INT;

SELECT * FROM sys.objects WHERE [object_id] = @ObjectID;

So no, not reasonable.

and when it doesn't know what the input string is (because the parameter is NULL) it can't know how many rows the resulting table would be if it did know, so returning no rows would be misleading. So I think returning the single row with a NULL in it is the right thing to do for this particular function.

So then why not return 2 rows of NULL? Saying that a NULL input isn't specifically "nothing" because it really means "unknown" is not the right way to work with NULLs. Just look at how NULLs are handled in IF statements.

DECLARE @ObjectID INT;

IF (@ObjectID = 1)

BEGIN

PRINT 'True!';

END;

ELSE

BEGIN

PRINT 'False?';

END;

-- False?

NULL doesn't really equate to False, although it does follow that path. So the IF question isn't really simplistically "IF True" as the NULL could be True. The question is really "IF known-to-be-True", and a NULL, being "unknown", cannot match that. Along those same lines, returning any rows at all requires passing the "IF known-to-exist" test, not "IF could possibly exist". So both "IF known to not exist" and "IF not known to exist" should return no rows. Certainty is required for proceeding to the "THEN" branch of an "IF" and for returning rows from a set-based operation. Hence, returning a row on NULL input (at least for this operation) is the misleading path.

Additionally, regarding your statement that a single row of NULL "can't be misinterpreted as meaning `there's one row containing a null string` ", I would say that:

this isn't necessarily true (or more accurately, this doesn't need to be the case; explained in the next section)
that no rows returned cannot be misinterpreted considering that the only possible means of returning no rows should be by having a NULL input. And that provides a nice segue to......

Another interesting question is what to do when the input string is empty - return an empty table (no rows), or return a single row containing a zero-length string. The splitter returns the single row in this case, so there is no case where the splitter returns an empty table. That of course means that it could use the 0 rows return to indicate a NULL table instead of an empty table, doing that wouldn't cause any problem. But I don't think that would be the right thing to do.

What to do with an empty, NOT NULL string shouldn't be a question of returning 0 rows or 1 row. An empty string is a valid string. There is no way to determine that an empty string is no value at all (as opposed to a single empty value) but a string with just the delimiter and nothing more is two empty values. So, in the current conception and discussion, the only way to get 0 rows returned is by having a NULL input. But, this is due to dealing with the issue of splitting in a rather simplistic fashion.

When I say that the problem of splitting is being dealt with too simplistically, I am mean that we have so far been entirely focused on "how to split" and have not really dealt with "what is being split"; we have been focused on the mechanics and not so much the concept.

So far we have been talking about splitting a delimited string, which is, of course, what we are doing, but a delimited string of what, exactly? It was stated above that:

a NULL string has no representation
returning a single row containing NULL can't be misinterpreted as meaning "there's one row containing a null string"

These statements assume that the collection of items being represented as a delimited string is a collection of strings. But this is certainly not always the case, and just because something is in string form now doesn't mean that it truly is a string. Sometimes we pass in a collection of numbers (whether INT or FLOAT or MONEY), sometimes it is a collection of dates (date / time / datetime), sometimes it is a collection of GUIDs, etc. So sure, a NULL string has no representation, but these other types can validly use an empty string to represent being NULL. So it probably should be allowed, via input parameter, to determine whether or not empty elements will return empty string rows or NULL rows. And if this is allowed, then a single element (i.e. no delimiter) that is empty could pass back a single, NULL row which cannot be interpreted as the input was NULL.

Additionally, it is sometimes desirable to skip empty elements, regardless of the original entity type being a string, number, date, GUID, or whatever. If allowing for this it would then be possible to have a NOT NULL input string that returns zero rows. But introducing this ambiguity into the meaning of no rows returned should not result in returning a single NULL row as it still makes more sense to just check if the input IS NULL or IS NOT NULL.

In either case, it seems that in order to properly handle the actual input values (i.e. what it was that ended up as the delimited string), there needs to be an option for how to handle empty elements by returning either an empty string, a NULL, or not returning a row for that element.

Take care,

Solomon..

SQL# — https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
Sql Quantum Lift — https://SqlQuantumLift.com/ ( company )
Sql Quantum Leap — https://SqlQuantumLeap.com/ ( blog )
Info sites — Collations • Module Signing • SQLCLR

Jeff Moden SSC Guru Points: 1003865 More actions · Answer 13

Heh... to quote someone far wise than I, "NULL is not nothing".

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

mark hutchinson SSC Eights! Points: 878 More actions · Answer 14

mark hutchinson

SSC Eights!

Points: 878

March 29, 2015 at 3:32 pm

#1787188

Null is a something that means nothing

mpescod Grasshopper Points: 22 More actions · Answer 15

Very nice and well explained article, I enjoyed reading it and just wanted to add my findings which include handling the case where the string to be parsed is blank. In this instance I would expect to get no rows back but the solution in this article seems to return a row with a blank string.

Here's the version I've been using for a while now and it seems to match performance with yours with the included handling of a blank source string.

I also use the values method of a row generator/tally table as I don't need to support prior to 2008 and it makes it more readable IMO.

;with e(n) as ( select 0 from (values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) T(n))

,rg(n) as (select 0 from e,e e2,e e3,e e4)

select Item=substring(@pString,n,isnull(nullif(CHARINDEX(@pDelimiter,@pString,n),0),len(@pString)+1)-n)

from (

select n=1 where len(@pString) > 0

union all

select n+1 from (select top(len(@pString)) n=ROW_NUMBER() over (order by (select null)) from rg) T where SUBSTRING(@pString,n,1) = @pDelimiter

) StartPositions