Tally OH! An Improved SQL 8K “CSV Splitter” Function

Question

Tally OH! An Improved SQL 8K “CSV Splitter” Function

Forum Etiquette: How to post Reporting Services problems

[/url]

Forum Etiquette: How to post data/code on a forum to get the best help - by Jeff Moden
[/url]

How to Post Performance Problems - by Gail Shaw
[/url]

Ed Wagner SSC Guru Points: 287026 More actions · Answer 1

TheSQLGuru (3/16/2016)
I'm still waiting for the performance review of the new 2016 string splitter function I asked for long ago before all this mish-mash about NULL sidetracked the topic. 😀

Here you go: http://www.sqlservercentral.com/Forums/FindPost1769742.aspx

Tally Tables - Performance Personified
String Splitting with True Performance
Best practices on how to ask questions

mister.magoo SSC-Forever Points: 47174 More actions · Answer 2

If this new 2016 splitter does get changed to return an ordinal position, (which I hope it does), then I would have to say that returning an empty set for a NULL input would seem most logical to me, otherwise, it would have to return a set with one row, with an Ordinal Position column containing either a 1 or a NULL.

I see three possible return values for a NULL input :

Empty Set:

ItemNumber Item

---------------

One Row With ItemNumber 1:

ItemNumber Item

---------------

1 NULL

One Row With ItemNumber NULL:

ItemNumber Item

---------------

NULL NULL

I can only really get on board with the Empty Set, as both the other options, if COUNTed would tell me the string had split into 1 (one) result, which it hasn't.

I don't think I can therefore agree with returning a NULL in the current version, as it implies 1 (one) result which is NULL, when there is no result really.

MM

select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

Ed Wagner SSC Guru Points: 287026 More actions · Answer 3

If it doesn't get updated to return the ordinal position, I won't use it much. I don't see why they would limit it like that, but I don't work there.

Tally Tables - Performance Personified
String Splitting with True Performance
Best practices on how to ask questions

Jeff Moden SSC Guru Points: 1004706 More actions · Answer 4

mister.magoo (3/16/2016)
If this new 2016 splitter does get changed to return an ordinal position, (which I hope it does), then I would have to say that returning an empty set for a NULL input would seem most logical to me, otherwise, it would have to return a set with one row, with an Ordinal Position column containing either a 1 or a NULL.
I see three possible return values for a NULL input :
Empty Set:
ItemNumber Item
---------------
One Row With ItemNumber 1:
ItemNumber Item
---------------
1 NULL
One Row With ItemNumber NULL:
ItemNumber Item
---------------
NULL NULL
I can only really get on board with the Empty Set, as both the other options, if COUNTed would tell me the string had split into 1 (one) result, which it hasn't.
I don't think I can therefore agree with returning a NULL in the current version, as it implies 1 (one) result which is NULL, when there is no result really.

So, what would you prefer to see with the following (using a cte as if a Table Source)...

WITH cteTestData (RowNum,CSV) AS

(

SELECT 1, '1,2,3' UNION ALL

SELECT 2, NULL UNION ALL

SELECT 3, '6,7'

)

SELECT *

FROM cteTestData td

CROSS APPLY dbo.STRING_SPLIT(td.CSV)

;

My basic rule for any function, whether it returns a scalar value or a result set, is the same as any other function. Pass a NULL, get a NULL back. Just because this function returns a result set, I don't believe it should return nothing for a NULL. Since this function is logically nothing more than the execution of multiple SUBSTRINGs, imagine for a minute if SUBSTRING returned nothing instead of a NULL when given a NULL.

Here's the result set that I'd expect from the above...

RowNum ItemNumber Value

1 1 1

1 2 2

1 3 3

2 1 NULL

3 1 6

3 2 7

That would protect the "typical" people that use such functions without reading about nuances. Sure, the work around would be to use OUTER APPLY but most people expect functions to return a NULL if given a NULL. With the understanding that this has little to do with Codd-like functionality for relational engines, I'd expect this function to behave like other functions when given a NULL.

Since there is such a work around (which is why I also don't want to see a 3rd parameter), I don't actually give a rat's patooti which camp wins on this subject but I'll still call it wrong if NULL returns anything other than a NULL, like it does nearly everywhere else.

What I'm really concerned about is the current astigmatic solution doesn't return an ordinal position for the elements. Let's stop the mutual floccinaucinihilipilification, join forces, and urge everyone to spread the word to get THAT change voted up on CONNECT.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Steve Jones - SSC Editor SSC Guru Points: 741485 More actions · Answer 5

Vote count. You need lots of them, but they do matter. There are some Connect items on split_string(), but you may be able to influence how it evolves.

mister.magoo SSC-Forever Points: 47174 More actions · Answer 6

Jeff Moden (3/16/2016)
So, what would you prefer to see with the following (using a cte as if a Table Source)...
WITH cteTestData (RowNum,CSV) AS
(
SELECT 1, '1,2,3' UNION ALL
SELECT 2, NULL UNION ALL
SELECT 3, '6,7'
)
SELECT *
FROM cteTestData td
CROSS APPLY dbo.STRING_SPLIT(td.CSV)
;
My basic rule for any function, whether it returns a scalar value or a result set, is the same as any other function. Pass a NULL, get a NULL back. Just because this function returns a result set, I don't believe it should return nothing for a NULL. Since this function is logically nothing more than the execution of multiple SUBSTRINGs, imagine for a minute if SUBSTRING returned nothing instead of a NULL when given a NULL.
Here's the result set that I'd expect from the above...
RowNum ItemNumber Value
1 1 1
1 2 2
1 3 3
2 1 NULL
3 1 6
3 2 7

Hi Jeff,

Because it returns a rowset, not a scalar, I would expect this:

RowNum ItemNumber Value

1 1 1

1 2 2

1 3 3

3 1 6

3 2 7

And, as you stated, if I wanted a NULL for RowNum 2, I would use OUTER APPLY.

I don't think the comparison to SUBSTRING is particularly valid as it is comparing a SCALAR function to a TVF.

A more reasonable (in my mind) comparison would be to something like this:

select [text]

from [sys].[dm_exec_sql_text](null)

Which, quite correctly, returns an empty rowset for a null input, but is expected to return a string for a valid input.

However, I'm not arguing against your point of view (wanting a null for a null), just trying to expand on the discussion 🙂

MM

select geometry::STGeomFromWKB(0x0106000000020000000103000000010000000B0000001000000000000840000000000000003DD8CCCCCCCCCC0840000000000000003DD8CCCCCCCCCC08408014AE47E17AFC3F040000000000104000CDCCCCCCCCEC3F9C999999999913408014AE47E17AFC3F9C99999999991340000000000000003D0000000000001440000000000000003D000000000000144000000000000000400400000000001040000000000000F03F100000000000084000000000000000401000000000000840000000000000003D0103000000010000000B000000000000000000143D000000000000003D009E99999999B93F000000000000003D009E99999999B93F8014AE47E17AFC3F400000000000F03F00CDCCCCCCCCEC3FA06666666666FE3F8014AE47E17AFC3FA06666666666FE3F000000000000003D1800000000000040000000000000003D18000000000000400000000000000040400000000000F03F000000000000F03F000000000000143D0000000000000040000000000000143D000000000000003D, 0);

Jeff Moden SSC Guru Points: 1004706 More actions · Answer 7

mister.magoo (3/16/2016)
Jeff Moden (3/16/2016)
So, what would you prefer to see with the following (using a cte as if a Table Source)...
WITH cteTestData (RowNum,CSV) AS
(
SELECT 1, '1,2,3' UNION ALL
SELECT 2, NULL UNION ALL
SELECT 3, '6,7'
)
SELECT *
FROM cteTestData td
CROSS APPLY dbo.STRING_SPLIT(td.CSV)
;
My basic rule for any function, whether it returns a scalar value or a result set, is the same as any other function. Pass a NULL, get a NULL back. Just because this function returns a result set, I don't believe it should return nothing for a NULL. Since this function is logically nothing more than the execution of multiple SUBSTRINGs, imagine for a minute if SUBSTRING returned nothing instead of a NULL when given a NULL.
Here's the result set that I'd expect from the above...
RowNum ItemNumber Value
1 1 1
1 2 2
1 3 3
2 1 NULL
3 1 6
3 2 7
Hi Jeff,
Because it returns a rowset, not a scalar, I would expect this:
RowNum ItemNumber Value
1 1 1
1 2 2
1 3 3
3 1 6
3 2 7
And, as you stated, if I wanted a NULL for RowNum 2, I would use OUTER APPLY.
I don't think the comparison to SUBSTRING is particularly valid as it is comparing a SCALAR function to a TVF.
A more reasonable (in my mind) comparison would be to something like this:
select [text]
from [sys].[dm_exec_sql_text](null)
Which, quite correctly, returns an empty rowset for a null input, but is expected to return a string for a valid input.
However, I'm not arguing against your point of view (wanting a null for a null), just trying to expand on the discussion 🙂

Heh... as Adam Savage would say, "I reject your reality and substitute my own." 😉

Seriously, though, I can see it either way (although I prefer the NULL return for reasons previously stated). I just want them to fix the very real problem of not having an ordinal position column.

Anyone reading this... if you haven't gone over to CONNECT to vote up the additional column, please do. For your convenience, here's the link.

https://connect.microsoft.com/SQLServer/Feedback/Details/2433171

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Ed Wagner SSC Guru Points: 287026 More actions · Answer 8

Jeff Moden (3/16/2016)
mister.magoo (3/16/2016)
Jeff Moden (3/16/2016)
So, what would you prefer to see with the following (using a cte as if a Table Source)...
WITH cteTestData (RowNum,CSV) AS
(
SELECT 1, '1,2,3' UNION ALL
SELECT 2, NULL UNION ALL
SELECT 3, '6,7'
)
SELECT *
FROM cteTestData td
CROSS APPLY dbo.STRING_SPLIT(td.CSV)
;
My basic rule for any function, whether it returns a scalar value or a result set, is the same as any other function. Pass a NULL, get a NULL back. Just because this function returns a result set, I don't believe it should return nothing for a NULL. Since this function is logically nothing more than the execution of multiple SUBSTRINGs, imagine for a minute if SUBSTRING returned nothing instead of a NULL when given a NULL.
Here's the result set that I'd expect from the above...
RowNum ItemNumber Value
1 1 1
1 2 2
1 3 3
2 1 NULL
3 1 6
3 2 7
Hi Jeff,
Because it returns a rowset, not a scalar, I would expect this:
RowNum ItemNumber Value
1 1 1
1 2 2
1 3 3
3 1 6
3 2 7
And, as you stated, if I wanted a NULL for RowNum 2, I would use OUTER APPLY.
I don't think the comparison to SUBSTRING is particularly valid as it is comparing a SCALAR function to a TVF.
A more reasonable (in my mind) comparison would be to something like this:
select [text]
from [sys].[dm_exec_sql_text](null)
Which, quite correctly, returns an empty rowset for a null input, but is expected to return a string for a valid input.
However, I'm not arguing against your point of view (wanting a null for a null), just trying to expand on the discussion 🙂
Heh... as Adam Savage would say, "I reject your reality and substitute my own." 😉
Seriously, though, I can see it either way (although I prefer the NULL return for reasons previously stated). I just want them to fix the very real problem of not having an ordinal position column.
Anyone reading this... if you haven't gone over to CONNECT to vote up the additional column, please do. For your convenience, here's the link.
https://connect.microsoft.com/SQLServer/Feedback/Details/2433171

Thanks for the link, Jeff. Voted and commented. The vote is currently 15 to 0. I'm glad I got it in before they closed it. 😉

Tally Tables - Performance Personified
String Splitting with True Performance
Best practices on how to ask questions

TomThomson SSC Guru Points: 104768 More actions · Answer 9

mister.magoo (3/16/2016)
I can only really get on board with the Empty Set, as both the other options, if COUNTed would tell me the string had split into 1 (one) result, which it hasn't.
I don't think I can therefore agree with returning a NULL in the current version, as it implies 1 (one) result which is NULL, when there is no result really.

I agree, if we don't have count sequence number we have to return the empty set when the input is null. But if we don't get the sequence number I almost certainly won't be using it anyway. I also agree that returning the sequence number as null (or both the sequence number and the string) is the way to handle null input, rather than returning sequence number 1 (although that would be rather hard to get to work fast if done in SQL rather than in native code).

Tom

Jeff Moden SSC Guru Points: 1004706 More actions · Answer 10

TomThomson (3/17/2016)
mister.magoo (3/16/2016)
I can only really get on board with the Empty Set, as both the other options, if COUNTed would tell me the string had split into 1 (one) result, which it hasn't.
I don't think I can therefore agree with returning a NULL in the current version, as it implies 1 (one) result which is NULL, when there is no result really.
I agree, if we don't have count sequence number we have to return the empty set when the input is null. But if we don't get the sequence number I almost certainly won't be using it anyway. I also agree that returning the sequence number as null (or both the sequence number and the string) is the way to handle null input, rather than returning sequence number 1 (although that would be rather hard to get to work fast if done in SQL rather than in native code).

Personally and whether or not it's Codd-like or not, I'd like the consistency of returning a sequence number even if a NULL were passed just for the sake of consistently.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Ed Wagner SSC Guru Points: 287026 More actions · Answer 11

Jeff Moden (3/17/2016)
TomThomson (3/17/2016)
mister.magoo (3/16/2016)
I can only really get on board with the Empty Set, as both the other options, if COUNTed would tell me the string had split into 1 (one) result, which it hasn't.
I don't think I can therefore agree with returning a NULL in the current version, as it implies 1 (one) result which is NULL, when there is no result really.
I agree, if we don't have count sequence number we have to return the empty set when the input is null. But if we don't get the sequence number I almost certainly won't be using it anyway. I also agree that returning the sequence number as null (or both the sequence number and the string) is the way to handle null input, rather than returning sequence number 1 (although that would be rather hard to get to work fast if done in SQL rather than in native code).
Personally and whether or not it's Codd-like or not, I'd like the consistency of returning a sequence number even if a NULL were passed just for the sake of consistently.

If it doesn't have a sequence number, I wont be using it. 'Nuff said.

Tally Tables - Performance Personified
String Splitting with True Performance
Best practices on how to ask questions

Eirikur Eiriksson SSC Guru Points: 182951 More actions · Answer 12

Just added the string_split function to Jeff's harness and Ouch :pinch:, what a mess! Not only is it slower than anything else in the harness, the Lead() version comes second last which tells me that the optimized worktable limit is still relatively low (10000 rows). Not using the string_split function if this isn't improved.

😎

mburbea Right there with Babe Points: 781 More actions · Answer 13

Can you post your test script? I suspect that since delimited8k did so well and string_split did so poorly, this might be related to that optimization that can be disabled with undocumented traceflag 8690.

In sql 2012 or 2014, Sql server got a little bit smarter, and the trick with the view being used to allow calling a non-deterministic function gets constant optimized and reduces the randomness of the rows. Since sTVF getting invoked is a blackbox step, sql server can't assume each row will yield the same result like it can with a schemabound inline tvf.

I have a test script here that should resolve this and has my own clr splitter

https://gist.github.com/mburbea/5e142f846d0141c714a1 (basically the same as Jeff's just slightly tweaked to be more dynamic and add in the traceflag).

Eirikur Eiriksson SSC Guru Points: 182951 More actions · Answer 14

mburbea (3/18/2016)
Can you post your test script? I suspect that since delimited8k did so well and string_split did so poorly, this might be related to that optimization that can be disabled with undocumented traceflag 8690.
In sql 2012 or 2014, Sql server got a little bit smarter, and the trick with the view being used to allow calling a non-deterministic function gets constant optimized and reduces the randomness of the rows. Since sTVF getting invoked is a blackbox step, sql server can't assume each row will yield the same result like it can with a schemabound inline tvf.
I have a test script here that should resolve this and has my own clr splitter
https://gist.github.com/mburbea/5e142f846d0141c714a1 (basically the same as Jeff's just slightly tweaked to be more dynamic and add in the traceflag).

The test script is as stated in the post Jeff Moden's test splitter test harness and IIRCC the same as you have on github. What puzzles me is why you are suggesting that disabling the spool operator which is what the traceflag 8690 does will affect the execution plan when it doesn't even appear in the plan. Sounds to me more like a poking a needle in a clockwork approach, hope you can enlighten me on this one. Although the traceflag 8690 isn't documented it has been covered in few blogs such as blogs.msdn.microsoft.com by Microsoft Engineers.

Unfortunately I don't have 2016 on the same / equal hardware as the earlier versions so judging or estimating improvements would be guestimates or finger in the air at the best, working on it though. One thing I've been able to test on equal hardware/os is the tempdb select into #, very promising as 2016 operations are more than 10x faster than 2012/14.

😎

The output trees with and without the traceflag are identical

WITH QUERYTRACEON 8690

*** Output Tree: ***

PhyOp_Apply (x_jtInner)

PhyOp_ComputeScalar

PhyOp_Range TBL: dbo.Csv8K(alias TBL: X)(0) ASC Bmk ( COL: Bmk1000 ) IsRow: COL: IsBaseRow1001

AncOp_PrjList

AncOp_PrjEl COL: Expr1002

ScaOp_Convert int,Null,ML=4

ScaOp_Identifier QCOL: [X].RowNum

PhyOp_StreamingTabUDF RESULT(QCOL: STRING_SPLIT.value)

ScaOp_Identifier QCOL: [X].CSV

ScaOp_Const TI(varchar collate 53256,Var,Trim,ML=1) XVAR(varchar,Not Owned,Value=Len,Data = (1,,))

WITHOUT QUERYTRACEON 8690

*** Output Tree: ***

PhyOp_Apply (x_jtInner)

PhyOp_ComputeScalar

PhyOp_Range TBL: dbo.Csv8K(alias TBL: X)(0) ASC Bmk ( COL: Bmk1000 ) IsRow: COL: IsBaseRow1001

AncOp_PrjList

AncOp_PrjEl COL: Expr1002

ScaOp_Convert int,Null,ML=4

ScaOp_Identifier QCOL: [X].RowNum

PhyOp_StreamingTabUDF RESULT(QCOL: STRING_SPLIT.value)

ScaOp_Identifier QCOL: [X].CSV

ScaOp_Const TI(varchar collate 53256,Var,Trim,ML=1) XVAR(varchar,Not Owned,Value=Len,Data = (1,,))

Goldie Lesser SSCertifiable Points: 6155 More actions · Answer 15

Ed Wagner (3/17/2016)
If it doesn't have a sequence number, I wont be using it. 'Nuff said.

I use Jeff's function all the time, and I can't even think of a single case where I've used the sequence number.

Can someone give a case of where the ordinal number is important?