Tally OH! An Improved SQL 8K “CSV Splitter” Function

Question

Tally OH! An Improved SQL 8K “CSV Splitter” Function

Viewing 15 posts - 811 through 825 (of 991 total)

You must be logged in to reply to this topic. Login to reply

Jeff Moden SSC Guru Points: 1004686 More actions · Answer 1

peter-757102 (3/12/2016)
When treating NULL as an "empty string" instead of "no string" and thus return rows for it, it is going to complicate and limits its use for serialization/de-serialization tasks (which is like its primary use).

I think that's the whole problem with the way a whole lot of people think. NULL should never be treated as an "empty string" or "no string". Certainly, it should never be treated as Nothing. It should be treated only as "unknown" and, IMHO, if you give the function an unknown, it should return a confirmation that unknown was passed to it. This doesn't even violate the COUNT(*) problem that someone posted on the CONNECT because the NULL will be disregarded by COUNT(*).

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

peter-757102 SSCertifiable Points: 6877 More actions · Answer 2

Jeff Moden (3/12/2016)
peter-757102 (3/12/2016)
When treating NULL as an "empty string" instead of "no string" and thus return rows for it, it is going to complicate and limits its use for serialization/de-serialization tasks (which is like its primary use).
I think that's the whole problem with the way a whole lot of people think. NULL should ever be treated as an "empty string" or "no string". Certainly, it should never be treated as Nothing. It should be treated only as "unknown" and, IMHO, if you give the function an unknown, it should return a confirmation that unknown was passed to it. This doesn't even violate the COUNT(*) problem that someone posted on the CONNECT because the NULL will be disregarded by COUNT(*).

It comes down what states of input there are and how can they be unique mapped to process-able output.

With scalar functions it is easy: NULL in is NULL out.

With the transformation between scalar and complex types things are well, more complex.

De-serialization tasks need to work exactly in reverse to the serialization, which is:

Empty array/set/list --> NULL

One empty string or ID --> ""

Two empty strings or ID ---> ","

Array with one ID 10 ---> "10"

Array with two ID 10 and 20 ---> "10,20"

Array with unknown ID and ID 20 ---> ",20"

Try to reverse this serialization and there is no way around it that NULL means empty set (else it would be indistinguishable against a string of length 0 or a nullable integer represented as such).

This mapping works well and supports every behavior one can desire.

Want to handle an empty set as a single row containing NULL?

We can do that at the price of loosing support for NULL values in the serialized array/set/list!

Use a "left join" or "outer apply" and that will result in a single row with NULL values for each column in the result.

Code something like this:

select

S.value

from

TableA as A

outer apply splitter( NULL, ',' ) as S

;

Ultimately it comes down to how broad the use of the function needs to be.

Personally I use splitters almost exclusively for de-serialization of serialized integer ID and other value types that come from an application layer.

Ordinal positions and support for nullable values in the serialization are highly desirable in this case.

Having to filter out records with ordinal position NULL would work too I think, but would more often be in the way then help.

Such a ** conditionally existing ** record would just be artificial and represent a state that is already supported naturally as an empty set.

Where we seem to disagree is how we view the result of our function and what it represents.

I view it as a set representation of the original array/set/list, where NULL means no values were serialized.

Scalar function behavior does not come into my reasoning, the output is a set.

And I view serializing nullable values as a perfectly valid use case.

Jeff Moden SSC Guru Points: 1004686 More actions · Answer 3

You have to trust me when I say that I already know the OUTER APPLY work around so, ultimately, it doesn't matter which way they go on the NULL thing but, to be sure, NULL will never mean "empty string" or "no string" to me. NULL is not NOTHING. NULL means one and only one thing... UNKNOWN. 😉

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

peter-757102 SSCertifiable Points: 6877 More actions · Answer 4

Jeff Moden (3/12/2016)
You have to trust me when I say that I already know the OUTER APPLY work around so, ultimately, it doesn't matter which way they go on the NULL thing but, to be sure, NULL will never mean "empty string" or "no string" to me. NULL is not NOTHING. NULL means one and only one thing... UNKNOWN. 😉

I know you know, etc...

As for NULL only meaning unknown.

This is a valid viewpoint approached purely from an RDBMS perspective.

The world does not end there however :).

Applications routinely give more meaning to the NULL state, for example:

1. Not filled in yet.

2. Not filled in yet, supply own default.

3. Not applicable / forever unknown.

4. Same as empty string (for loosely/duck typed languages that work primarily with strings, this is common and works fine as input is usually in the form of text anyway)

This all does not matter to the relational logic and from that perspective every NULL can safely be treated as UNKNOWN, even if it means something slightly different one tier higher.

TomThomson SSC Guru Points: 104768 More actions · Answer 5

Jeff Moden (3/12/2016)
peter-757102 (3/12/2016)
When treating NULL as an "empty string" instead of "no string" and thus return rows for it, it is going to complicate and limits its use for serialization/de-serialization tasks (which is like its primary use).
I think that's the whole problem with the way a whole lot of people think. NULL should ever be treated as an "empty string" or "no string". Certainly, it should never be treated as Nothing. It should be treated only as "unknown" and, IMHO, if you give the function an unknown, it should return a confirmation that unknown was passed to it. This doesn't even violate the COUNT(*) problem that someone posted on the CONNECT because the NULL will be disregarded by COUNT(*).

Tht's a bit careless, Jeff. No, count(*) doesn't disregard NULL.

select count(*) from db.|delimitedSplit8k(NULL,',') returns 1, not 0.

Since we can't get an iTVF to return NULL when it has a NULL parameter, since the single statement rule effectively prevents that and even if it didn't there's no such thing as a table-type NULL, and since it's illogical nonsense to return a defined rowset instead of a NULL rowset (if we aren't given the input string, we can't determine the output rowset, so we should return a NULL rowset, not a wel defined rowset - whether the empty rowset or one containing a single record containing a NULL string), there is no clean way of resolving this without introducing NULL table types into SQL.

I don't know why peter-757102 is arguing that the illogical substitution of an empty set for NULL is better that returning a single record containing a NULL string, which is equally illogical and can be interpreted as indicating a NULL input parameter just as easily as could an empty rowset. Claiming that one is relationally more correct or less illogical than the other is just nonsense, and if he gets his way it will probably mean that many people will come to believe that using an empty set to represent not knowing whether there is any data or not is rational simply because that's the argument being presented for picking the empty set. Arguing to keep it as we have had it for the last few years - for backwards compatibility and ease of adoption is a very different sort of argument, in two ways: it's a valid argument, and it's not misleading.

Tom

TomThomson SSC Guru Points: 104768 More actions · Answer 6

peter-757102 (3/12/2016)
.....
De-serialization tasks need to work exactly in reverse to the serialization, which is:
Empty array/set/list --> NULL
One empty string or ID --> ""
Two empty strings or ID ---> ","
Array with one ID 10 ---> "10"
Array with two ID 10 and 20 ---> "10,20"
Array with unknown ID and ID 20 ---> ",20"

Anyone who use that serialisation is evidently crazy.

The second and third line clearly indicate that you have to encode empty strings, and encode them as empty strings.

The sixth line indicates that you have to encode unknown values (presumably NULLs), and encode them as empty strings.

So you presumably have a great volume of encoding that you haven't told us about, which is clearly not context free (if it's context free the 3rd and 6th lines can't both be correct) so is probably going to have pretty abysmal performance in both encoding and decoding.

And then you decide to use NULL to encode something where you know exactly hat you have, rather than using NULL for the case where you don't know something. That might be rather embarrassing sometimes unless you are clever enough never to lose anything.

Try to reverse this serialization and there is no way around it that NULL means empty set (else it would be indistinguishable against a string of length 0 or a nullable integer represented as such).

It's not at all surprising that if you misuse NULL to indicte somethig on which you have full interprettion (the first line of your encoding above) you will need to treat it as meaning that.

This mapping works well and supports every behavior one can desire.

That is clearly not true. You are representing both zero length strings and NULLs. Using what you have shown us, it is impossible (without a lot more information being encoded by means you haven't bothered to mention) to represent a set of just two values, an empty string and a null, in either order.

Want to handle an empty set as a single row containing NULL?
We can do that at the price of loosing support for NULL values in the serialized array/set/list!

Depends whether the NULL goes in the string column or in the position column of the output - if you use the first position column (which presumably always contains 1 if the input isn't NULL) you can handle nulls as long as you have some means of encoding them; and if you don't have any means of encoding them (your encoding instructions above don't really indicate that you actually do have any such means) then you have nothing to lose anyway.

Tom

Jeff Moden SSC Guru Points: 1004686 More actions · Answer 7

TomThomson (3/12/2016)
Jeff Moden (3/12/2016)
peter-757102 (3/12/2016)
When treating NULL as an "empty string" instead of "no string" and thus return rows for it, it is going to complicate and limits its use for serialization/de-serialization tasks (which is like its primary use).
I think that's the whole problem with the way a whole lot of people think. NULL should ever be treated as an "empty string" or "no string". Certainly, it should never be treated as Nothing. It should be treated only as "unknown" and, IMHO, if you give the function an unknown, it should return a confirmation that unknown was passed to it. This doesn't even violate the COUNT(*) problem that someone posted on the CONNECT because the NULL will be disregarded by COUNT(*).
Tht's a bit careless, Jeff. No, count(*) doesn't disregard NULL.
select count(*) from db.|delimitedSplit8k(NULL,',') returns 1, not 0.

Ah, crud. Agreed, careless on my part. I did mean select count(Item) from dbo.delimitedSplit8k(NULL,',') as the workaround and didn't finish it. Thanks for the correction, Tom.

Since we can't get an iTVF to return NULL when it has a NULL parameter, since the single statement rule effectively prevents that and even if it didn't there's no such thing as a table-type NULL, and since it's illogical nonsense to return a defined rowset instead of a NULL rowset (if we aren't given the input string, we can't determine the output rowset, so we should return a NULL rowset, not a wel defined rowset - whether the empty rowset or one containing a single record containing a NULL string), there is no clean way of resolving this without introducing NULL table types into SQL.

If you run the following, the result from the iTVF returns a NULL so I'm confused by what you mean by "we can't get an iTVF to return a NULL.

select Item from dbo.delimitedSplit8k(NULL,',')

If what you mean is that you can't have a null table, then I agree but I don't actually care about such rules for a string function. The function is actually nothing more than multiple logical substrings based on a delimiter and, like the SUBSTRING function, I believe the function should return a NULL as a single element return from the iTVF just as if you used SUBSTRING(NULL,x,y).

Of course, others have a different opinion on that and, although I prefer the new STRING_SPLIT() function to return a NULL when given a NULL, it doesn't actually matter that much to me because I can, in fact, use OUTER APPLY instead of CROSS APPLY to duplicate the functionality that I need.

My real concern, especially on the CONNECT posts, is the old "user wants to build a swing" engineering diagram and I hope MS doesn't screw it up based on a bazillion suggestions. All it really needs to do is what it's already doing quite correctly (NULL return not withstanding) but with a column for the ordinal position to be returned. It should have two and only two parameters... the string to split and the string to use as a delimiter. It needs to be lean, easy to use, and fast as hell. It can and should be made to blow the doors off of even a CLR. According to some early testing that Wayne Sheffield has done, it does just exactly that.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

peter-757102 SSCertifiable Points: 6877 More actions · Answer 8

TomThomson (3/12/2016)
peter-757102 (3/12/2016)
.....
De-serialization tasks need to work exactly in reverse to the serialization, which is:
Empty array/set/list --> NULL
One empty string or ID --> ""
Two empty strings or ID ---> ","
Array with one ID 10 ---> "10"
Array with two ID 10 and 20 ---> "10,20"
Array with unknown ID and ID 20 ---> ",20"
Anyone who use that serialisation is evidently crazy.
The second and third line clearly indicate that you have to encode empty strings, and encode them as empty strings.
The sixth line indicates that you have to encode unknown values (presumably NULLs), and encode them as empty strings.
So you presumably have a great volume of encoding that you haven't told us about, which is clearly not context free (if it's context free the 3rd and 6th lines can't both be correct) so is probably going to have pretty abysmal performance in both encoding and decoding.
And then you decide to use NULL to encode something where you know exactly hat you have, rather than using NULL for the case where you don't know something. That might be rather embarrassing sometimes unless you are clever enough never to lose anything.
Try to reverse this serialization and there is no way around it that NULL means empty set (else it would be indistinguishable against a string of length 0 or a nullable integer represented as such).
It's not at all surprising that if you misuse NULL to indicte somethig on which you have full interprettion (the first line of your encoding above) you will need to treat it as meaning that.
This mapping works well and supports every behavior one can desire.
That is clearly not true. You are representing both zero length strings and NULLs. Using what you have shown us, it is impossible (without a lot more information being encoded by means you haven't bothered to mention) to represent a set of just two values, an empty string and a null, in either order.
Want to handle an empty set as a single row containing NULL?
We can do that at the price of loosing support for NULL values in the serialized array/set/list!
Depends whether the NULL goes in the string column or in the position column of the output - if you use the first position column (which presumably always contains 1 if the input isn't NULL) you can handle nulls as long as you have some means of encoding them; and if you don't have any means of encoding them (your encoding instructions above don't really indicate that you actually do have any such means) then you have nothing to lose anyway.

First the obvious

An empty set should be serializable and subsequently de-serializable without changing set representation.

Support for null values in serialized data

Supporting this, is a choice.

I for one desire the concept of allowing NULL values encoded as empty strings in the serialized input.

Doing so increases the functions usefulness and compatibility with sets in SQL as well as complex data structures in other languages.

Encoding nulls

While there is a theoretical problem when both zero-length strings and null strings being encoded the same, this is in practice not an issue.

For character types, nulls are typically mutually exclusive with zero length strings and an application uses either one or the other.

In practice no fancy or costly encoding/escaping is needed to support nulls in strings.

Even in databases where both nulls and zero-length strings exist in a column, it is pretty much always intended to mean exactly the same and just a result of sloppy coding. Therefore the distinction in the serialized input is not required and IsNull() can be required in case of exceptions.

I can go in great detail of why zero-length strings and null values are typically mutual-exclusive, but the following will suffice:

User input is typically text to begin with and only during processing is it converted to a strict data-type.

During this process, empty strings are typically converted to NULL for columns that support it.

And for non-text datatypes a zero length string is the perfect serialized representation of a NULL.

It is compact, easy to recognize and quick to process.

Finally

A bit of an open mind helps and evidently crazy and pretty abysmal performance are not that.

In T-SQL and most certainly C#, detecting a zero-length string and conditionally returning a null value instead of an empty string is not a cost I would worry about.

In T-SQL it would be either a NULLIF on the value or a CASE WHEN length = 0 THEN .. ELSE ... END style of code.

In C# it is absolutely without cost, essentially free, contrary to use of escape characters and such.

peter-757102 SSCertifiable Points: 6877 More actions · Answer 9

@jeff Moden,

I fully agree with you on the number of parameters.

More then two configuration options on this function will make it harder to use and code containing it, harder to interpret.

It is the main reason why I rather see more result columns that are automatically optimized away when not used.

Having the ordinal position, serialized item position, serialized item length available allows for all sorts of function enhancing post-processing, like:

* Correlate multiple split operations

* Use the character position/length detection as a starting point for custom logic

* Remove rows with length 0 values (nulls)

I have come to appreciatie having separate specialized functions over functions with more arguments.

And would support more then one split function when something useful cannot be incorporated neatly in the new function.

Any new function should map well to a broad spectrum of use cases either directly or as a new building block.

In this respect the past releases have disappointed one way or another and I am afraid 2016 will not be that different.

Jeff Moden SSC Guru Points: 1004686 More actions · Answer 10

Hi Peter,

Good to "see" you again.

To be honest, I cannot think of a single instance where I wanted to remove zero length (or as you refer to them, NULL) elements.

As for the length of the returned elements in each column, that's fine provided that it does nothing to slow down the function. To be honest, I've also not needed to know the length of each element except in one case and it was easy enough to add that to the SELECT list. Since it's not possible to add the length column to the function without slowing it down, I'd just as soon that not even be an option because it would "punish" the other use cases where knowing the length wasn't a requirement.

It's a bit like the NULL thing that has become so controversial. I'd prefer it to return a NULL when called with a NULL but that's not essential to me because there is a work around. Having the ordinal position, however, is an absolute must because there is no external method to get the order of the split elements reliably.

I have to say it again, I do absolutely love your idea of passing an empty string for the delimiter parameter to invoke splitting by individual character. That's a great idea.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

TomThomson SSC Guru Points: 104768 More actions · Answer 11

Jeff Moden (3/12/2016)
My real concern, especially on the CONNECT posts, is the old "user wants to build a swing" engineering diagram and I hope MS doesn't screw it up based on a bazillion suggestions. All it really needs to do is what it's already doing quite correctly (NULL return not withstanding) but with a column for the ordinal position to be returned. It should have two and only two parameters... the string to split and the string to use as a delimiter. It needs to be lean, easy to use, and fast as hell. It can and should be made to blow the doors off of even a CLR. According to some early testing that Wayne Sheffield has done, it does just exactly that.

I certainly agree on only two parameters, the only reaon for having another one would be to placate some people who will otherwise make trouble by arguing for a different solution (the only other parameter any one has suggested is one to say whether no rows or one row with a null in it should be returned for the null input string case).

The column for ordinal position is important, and I'm glad it's in a separate connect item because I think that will get a lot of support and not get caught up in the "it should return no rows" argument.

I remember that old diagram, first saw it in 1975; it and the "you want it when" picture were side by side on the project pin-board of a project I took over (with orders to get it sorted out - it had been failing for a long time, I was the fifth or sixth project manager in less than two years). The previous project managers had failed because although they could see that those cartoons represented reality the were too damed scared to act on that knowledge. I love those cartoons. They inspired me to be absolutely bloody minded to the sort of idiot who thought they could design without understanding what was wanted or create schedules without a clue what had to be done, and fortunately for me some of the top management understood that being bloody-minded about that was actually useful, even if it did piss off a lot of marketeers and space-wasting yesmen.

I used to get pissed off every time I saw some American plagiarist claiming that he/she created one of/both of those cartoons in 1979 or 1985, years after I first saw them - and even more years after they were transmitted over UKC's (or was it UCL's?) transtlantic link (via Norway) to what was already called arpanet (not yet internet, although it used IP) and began to spread around the US defense and academic network. But no more, it's all too long ago to matter now.

Tom

Jeff Moden SSC Guru Points: 1004686 More actions · Answer 12

peter-757102 (3/12/2016)
TomThomson (3/12/2016)
peter-757102 (3/12/2016)
.....
De-serialization tasks need to work exactly in reverse to the serialization, which is:
Empty array/set/list --> NULL
One empty string or ID --> ""
Two empty strings or ID ---> ","
Array with one ID 10 ---> "10"
Array with two ID 10 and 20 ---> "10,20"
Array with unknown ID and ID 20 ---> ",20"
Anyone who use that serialisation is evidently crazy.
The second and third line clearly indicate that you have to encode empty strings, and encode them as empty strings.
The sixth line indicates that you have to encode unknown values (presumably NULLs), and encode them as empty strings.
So you presumably have a great volume of encoding that you haven't told us about, which is clearly not context free (if it's context free the 3rd and 6th lines can't both be correct) so is probably going to have pretty abysmal performance in both encoding and decoding.
And then you decide to use NULL to encode something where you know exactly hat you have, rather than using NULL for the case where you don't know something. That might be rather embarrassing sometimes unless you are clever enough never to lose anything.
Try to reverse this serialization and there is no way around it that NULL means empty set (else it would be indistinguishable against a string of length 0 or a nullable integer represented as such).
It's not at all surprising that if you misuse NULL to indicte somethig on which you have full interprettion (the first line of your encoding above) you will need to treat it as meaning that.
This mapping works well and supports every behavior one can desire.
That is clearly not true. You are representing both zero length strings and NULLs. Using what you have shown us, it is impossible (without a lot more information being encoded by means you haven't bothered to mention) to represent a set of just two values, an empty string and a null, in either order.
Want to handle an empty set as a single row containing NULL?
We can do that at the price of loosing support for NULL values in the serialized array/set/list!
Depends whether the NULL goes in the string column or in the position column of the output - if you use the first position column (which presumably always contains 1 if the input isn't NULL) you can handle nulls as long as you have some means of encoding them; and if you don't have any means of encoding them (your encoding instructions above don't really indicate that you actually do have any such means) then you have nothing to lose anyway.
First the obvious
An empty set should be serializable and subsequently de-serializable without changing set representation.
Support for null values in serialized data
Supporting this, is a choice.
I for one desire the concept of allowing NULL values encoded as empty strings in the serialized input.
Doing so increases the functions usefulness and compatibility with sets in SQL as well as complex data structures in other languages.
Encoding nulls
While there is a theoretical problem when both zero-length strings and null strings being encoded the same, this is in practice not an issue.
For character types, nulls are typically mutually exclusive with zero length strings and an application uses either one or the other.
In practice no fancy or costly encoding/escaping is needed to support nulls in strings.
Even in databases where both nulls and zero-length strings exist in a column, it is pretty much always intended to mean exactly the same and just a result of sloppy coding. Therefore the distinction in the serialized input is not required and IsNull() can be required in case of exceptions.
I can go in great detail of why zero-length strings and null values are typically mutual-exclusive, but the following will suffice:
User input is typically text to begin with and only during processing is it converted to a strict data-type.
During this process, empty strings are typically converted to NULL for columns that support it.
And for non-text datatypes a zero length string is the perfect serialized representation of a NULL.
It is compact, easy to recognize and quick to process.
Finally
A bit of an open mind helps and evidently crazy and pretty abysmal performance are not that.
In T-SQL and most certainly C#, detecting a zero-length string and conditionally returning a null value instead of an empty string is not a cost I would worry about.
In T-SQL it would be either a NULLIF on the value or a CASE WHEN length = 0 THEN .. ELSE ... END style of code.
In C# it is absolutely without cost, essentially free, contrary to use of escape characters and such.

I agree. There are times when adjacent delimiters, leading delimiters, trailing delimiters, and even only delimiters are passed and have meaning and should return either an empty string or a NULL. Empirically, since it's not possible to include a NULL string in a non-null string that contains non-null delimiters (at the very least), I personally prefer for empty strings to be returned for such elements rather than nulls, but can handle it either way so long as they are handled consistently from the git. If, however, a space or multiple spaces are passed between delimiters or as a single element, I don't want the function to do me any favors by trimming the string to an empty string because the spaces may have some purpose.

My simple rule for splitters is that anything and everything between delimiters should be returned as an element including spaces unless spaces are the delimiter. Of course, the other simple rule for splitters that I have is that if a single element is passed, no matter what it is and providing that it does not contain a delimiter (which would make it multi-element), then it should return that element unscathed as element #1.

That's also why I'm a bit adamant about if it is provided a NULL, that's a single element that should be returned precisely as it was received and that means a single element "set" should be returned containing a NULL. But, since there's a work around for that, as well, I can live with the idea that the relational purists have of not returning anything because a string wasn't actually provided. It just seems so counter intuitive to have a string function not return a NULL if provided a NULL.

It hasn't come up yet but I'll also state that if the use of multi-character delimiters slows it down in the slightest, then it should only accept single character delimiters because that's what most of the use cases require. The string can be pre-groomed for multi-character delimiters at the expense of someone else's time. 😉

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

peter-757102 SSCertifiable Points: 6877 More actions · Answer 13

Jeff Moden (3/12/2016)
It hasn't come up yet but I'll also state that if the use of multi-character delimiters slows it down in the slightest, then it should only accept single character delimiters because that's what most of the use cases require. The string can be pre-groomed for multi-character delimiters at the expense of someone else's time. 😉

In our own CLI implementations, single character-character delimiters can be recognized at run-time and the most optimized implementation for the use-case executed. There we declare and decorate functions visible to and accessible by SQL Server. When invoked the function routes to a specific hidden implementation (which can be dynamic). There can even be multiple functions that point to the same implementation but have slightly different decorations. Decorations such as deterministic or what types the result-set or arguments contain.

One advantage of having a function build into SQL Server is collation and optimizer awareness, plus the improved memory management and reduced iteration overhead. A query plan generator can automatically select the most suited implementation directly (or generate one on demand). It knows the datatypes in advance and quite often also the separator. So I wouldn't worry about that too much :).

Jeff Moden SSC Guru Points: 1004686 More actions · Answer 14

TomThomson (3/12/2016)
Jeff Moden (3/12/2016)
My real concern, especially on the CONNECT posts, is the old "user wants to build a swing" engineering diagram and I hope MS doesn't screw it up based on a bazillion suggestions. All it really needs to do is what it's already doing quite correctly (NULL return not withstanding) but with a column for the ordinal position to be returned. It should have two and only two parameters... the string to split and the string to use as a delimiter. It needs to be lean, easy to use, and fast as hell. It can and should be made to blow the doors off of even a CLR. According to some early testing that Wayne Sheffield has done, it does just exactly that.
I certainly agree on only two parameters, the only reaon for having another one would be to placate some people who will otherwise make trouble by arguing for a different solution (the only other parameter any one has suggested is one to say whether no rows or one row with a null in it should be returned for the null input string case).
The column for ordinal position is important, and I'm glad it's in a separate connect item because I think that will get a lot of support and not get caught up in the "it should return no rows" argument.
I remember that old diagram, first saw it in 1975; it and the "you want it when" picture were side by side on the project pin-board of a project I took over (with orders to get it sorted out - it had been failing for a long time, I was the fifth or sixth project manager in less than two years). The previous project managers had failed because although they could see that those cartoons represented reality the were too damed scared to act on that knowledge. I love those cartoons. They inspired me to be absolutely bloody minded to the sort of idiot who thought they could design without understanding what was wanted or create schedules without a clue what had to be done, and fortunately for me some of the top management understood that being bloody-minded about that was actually useful, even if it did piss off a lot of marketeers and space-wasting yesmen.
I used to get pissed off every time I saw some American plagiarist claiming that he/she created one of/both of those cartoons in 1979 or 1985, years after I first saw them - and even more years after they were transmitted over UKC's (or was it UCL's?) transtlantic link (via Norway) to what was already called arpanet (not yet internet, although it used IP) and began to spread around the US defense and academic network. But no more, it's all too long ago to matter now.

Heh... I still love those cartoons, as well, and they're still (unfortunately) very appropriate. And, yeah... right there with you on being bloody-minded and having managers that actually appreciate someone with the nads to understand that teamwork does NOT always mean being a "Yes" man.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1004686 More actions · Answer 15

peter-757102 (3/12/2016)
Jeff Moden (3/12/2016)
It hasn't come up yet but I'll also state that if the use of multi-character delimiters slows it down in the slightest, then it should only accept single character delimiters because that's what most of the use cases require. The string can be pre-groomed for multi-character delimiters at the expense of someone else's time. 😉
In our own CLI implementations, single character-character delimiters can be recognized at run-time and the most optimized implementation for the use-case executed. There we declare and decorate functions visible to and accessible by SQL Server. When invoked the function routes to a specific hidden implementation (which can be dynamic). There can even be multiple functions that point to the same implementation but have slightly different decorations. Decorations such as deterministic or what types the result-set or arguments contain.
One advantage of having a function build into SQL Server is collation and optimizer awareness, plus the improved memory management and reduced iteration overhead. A query plan generator can automatically select the most suited implementation directly (or generate one on demand). It knows the datatypes in advance and quite often also the separator. So I wouldn't worry about that too much :).

Having seen similar go awry in very short order, I always worry about such things. That's why I wrote the article that this thread is attached to and why I appreciated the extra 20 or so percent performance we were able to get out of even my improvements thanks to your great tip way back when.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)