Tally OH! An Improved SQL 8K “CSV Splitter” Function

Question

Tally OH! An Improved SQL 8K “CSV Splitter” Function

Viewing 15 posts - 16 through 30 (of 990 total)

You must be logged in to reply to this topic. Login to reply

Jeff Moden SSC Guru Points: 1003894 More actions · Answer 1

David Data (5/2/2011)
All these methods use CHARINDEX to efficiently see where the next delimiter is. But what about handling CSV strings with quoted text, e.g. where "John Smith","23, The High Street","Sometown" has to be split into 3 strings? Do you have a good method for this sort of data?
(This is why I prefer tab separated to comma separated, with all its different ways to handle embedded comas, but usually you have to process the client's data as they give it to you, not as you'd like it :rolleyes: )

That would be "True" CSV as it was meant to be instead of the this "Comedy Separated Value" stuff that comma separated is referred to as. You're correct that this splitter won't handle it especially when the quotes may be conditional. There IS a way to handle true CSV's using ACE drivers (64 bit) or Jet drivers (32 bit). I guess a "Spackle" article on those would do.

I also usually prefer TAB delimited or, even better, ASCII characters 28 through 31 (yep, there are characters designed for this type of stuff but no one uses them anymore). But, again, you're correct that you usually have to take whatever some 3rd party is giving you. Of course, that's not a problem because they always do it "right", huh? 😛

Thanks for stopping by, David. I'll see what I can do for the true CSV stuff, soon. 🙂

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Solomon Rutzky SSCoach Points: 16430 More actions · Answer 2

Jeff Moden (5/2/2011)
David Data (5/2/2011)
All these methods use CHARINDEX to efficiently see where the next delimiter is. But what about handling CSV strings with quoted text, e.g. where "John Smith","23, The High Street","Sometown" has to be split into 3 strings? Do you have a good method for this sort of data?
(This is why I prefer tab separated to comma separated, with all its different ways to handle embedded comas, but usually you have to process the client's data as they give it to you, not as you'd like it :rolleyes: )
That would be "True" CSV as it was meant to be instead of the this "Comedy Separated Value" stuff that comma separated is referred to as. You're correct that this splitter won't handle it especially when the quotes may be conditional. There IS a way to handle true CSV's using ACE drivers (64 bit) or Jet drivers (32 bit). I guess a "Spackle" article on those would do.
I also usually prefer TAB delimited or, even better, ASCII characters 28 through 31 (yep, there are characters designed for this type of stuff but no one uses them anymore). But, again, you're correct that you usually have to take whatever some 3rd party is giving you. Of course, that's not a problem because they always do it "right", huh? 😛

David and Jeff: the real issue here is that ANY delimiter, whether single or multiple characters, can be embedded in the data itself. The amount of code it would take to turn off and back on again the parsing within text-qualified sections of the data would likely render the code much less efficient (and I am not sure if it would truly be set-based). However, there is a way: Regular Expressions via SQLCLR ;-). A RegEx Split function can handle text-qualified data with embedded delimiters. You can find some examples around the net on doing this yourself or you can simply download an already done free set of functions at: http://www.SQLsharp.com/.

That said, if Jeff is able to make a set-based function to do this then I would love to see it as I know that I would learn something from it :-).

Take care,

Solomon...

SQL# — https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
Sql Quantum Lift — https://SqlQuantumLift.com/ ( company )
Sql Quantum Leap — https://SqlQuantumLeap.com/ ( blog )
Info sites — Collations • Module Signing • SQLCLR

Rich Mechaber SSChampion Points: 11130 More actions · Answer 3

Jeff, your exposition of how you mentally wrestle with a problem was exceptional. That alone made your article worth reading! You've got a rare combination of technical skill, lucid writing style, and humor.

One question: where's the CLR function? I didn't see it in the attachments.

Thanks,

Rich

Robert Livermore SSC-Addicted Points: 403 More actions · Answer 4

Congratulations, You just rediscovered a LRLA(1) parser (left to right lookahead one), invented in 20 years ago by DICK GRUNE and CERIEL JACOBS. PARSING TECHNIQUES A Practical Guide. I am impress you implemented it in TSQL. Here is another experiment you can try. I think it should come out equal in speed; but lower in memory because same parsing technique but manages memory with CLR types. Take a look at http://irony.codeplex.com/. This is a very good implementation of a LRLA(1) parser. It should be able to run in the SQLCCLR as safe code there are no external dependencies. Grab the BNF grammar for CSV from http://www.ietf.org/rfc/rfc4180.txt

One more idea to explore.

The LRLA(1) parser can take generic grammars. You could easily parse JSON http://www.ietf.org/rfc/rfc4627.txt into SQL table. Source data is data type richer compared to CSV because JSON values are type, not just tuples of strings for CSV.

Jeff Moden SSC Guru Points: 1003894 More actions · Answer 5

Gianluca Sartori (5/2/2011)
Jeff, your article is great!
The final splitter function could have been 3 times slower than the original one, and the article would have been memorable anyway. It's a tale of stubborn resistance to surrender and a "thinking outside the box" lesson for all of us.
Thank you very much.

I'm humbled by your observation, Gianluca. :blush: I didn't think that anyone would understand that especially with the silly and fun metaphors I made about my angst on the subject. In real life, it was killing me that the original function performed so terribly on the high end of strings and I had to either fix it or apologize for it. Thank you for the very kind words.

For the record, I really do have the artwork with the cat in the box. I commissioned a friend to make it for me several years ago to remind me that "Before you can think outside the box, you must first realize... you're in a box!"

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

mtassin SSC-Insane Points: 23099 More actions · Answer 6

Lol I finally made a mention in one of Jeff's articles.

One day I'll have to collaborate with Jeff to define just what this alliance is.

On another note though, I notice something about the final graph of performance around the 270 mark on the X axis. The pure SQL set based method seems to continue on a straight line path, while the CLR methods seems to be starting to curve upwards similarly to what Jeff was complaining about the original Tally splitter doing, just later in the X series than the original Tally did.

Btw, before I start experimenting with Jeff's downloads, if you were to concatenate the deliminters onto the string before running it through the splitter, wouldn't that help with the performance issue? As I see it, the CHARINDEX searches to find the delimiters is repeating the concatenation over and over within the where clause, but I could be seeing this wrong.

SELECT ItemNumber = ROW_NUMBER() OVER (ORDER BY t.N),

Item = SUBSTRING(@pString, t.N, CHARINDEX(@pDelimiter, @pString + @pDelimiter, t.N) - t.N)

FROM cteTally t

WHERE t.N <= DATALENGTH(@pString)+1 --DATATLENGTH allows for trailing space delimiters

AND SUBSTRING(@pDelimiter + @pString, t.N, 1) = @pDelimiter

;

If that were rewritten as

SET @pString = ',' + @pString + ',';

SELECT ItemNumber = ROW_NUMBER() OVER (ORDER BY t.N),

Item = SUBSTRING(@pString, t.N, CHARINDEX(@pDelimiter, @pString, t.N) - t.N)

FROM cteTally t

WHERE t.N <= DATALENGTH(@pString)+1 --DATATLENGTH allows for trailing space delimiters

AND SUBSTRING(@pString, t.N, 1) = @pDelimiter

;

Doesn't that help the performance hit from the concatenation in the where clause?

--Mark Tassin
MCITP - SQL Server DBA
Proud member of the Anti-RBAR alliance.
For help with Performance click this link[/url]
For tips on how to post your problems[/url]

Jeff Moden SSC Guru Points: 1003894 More actions · Answer 7

rmechaber (5/2/2011)
Jeff, your exposition of how you mentally wrestle with a problem was exceptional. That alone made your article worth reading! You've got a rare combination of technical skill, lucid writing style, and humor.
One question: where's the CLR function? I didn't see it in the attachments.
Thanks,
Rich

Very kind words. Thanks Rich.

I kind of blew it on the CLR function. I asked someone to write it for me and then he came down with a really nasty flu and couldn't get to it. As a result, I had actually submitted the article without a chart or any info on the CLR except to say that it would probably still be faster. When I did finally get the CLR code, I was in a dead rush to update the article before the publishing date and I forgot to ask if I could include his actual code. Although I'm sure he'll say it's OK to attach his code, I still have to ask the question as a professional courtesy. If he says "yes", I'll be sure to add it to the attachments.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Rich Mechaber SSChampion Points: 11130 More actions · Answer 8

Jeff Moden (5/2/2011)
I kind of blew it on the CLR function. I asked someone to write it for me and then he came down with a really nasty flu and couldn't get to it. As a result, I had actually submitted the article without a chart or any info on the CLR except to say that it would probably still be faster. When I did finally get the CLR code, I was in a dead rush to update the article before the publishing date and I forgot to ask if I could include his actual code. Although I'm sure he'll say it's OK to attach his code, I still have to ask the question as a professional courtesy. If he says "yes", I'll be sure to add it to the attachments.

Ah, great, that makes sense. I look forward to seeing it.

And here I was thinking you'd intentionally left out that nasty-non-SQL-not-set-based-procedural-dingo-droppings code. ("If I don't post it, no one can use it?") 😉

I am lucky (or unlucky for my goals of professional development) to work with small- and moderate-sized data sets that perform well using most of these approaches, but better code is .... better. Having just used CLR to solve a difficult problem ("which students can earn a special degree certificate?") involving a lot of forwards and backwards computations, scenario splitting, etc., I have to say I'm a convert. I don't see that having this CLR TVF is any more or less secure than a SQL-based TVF, as I still have to GRANT users rights to execute it.

Thanks again for another great article, Jeff,

Rich

Jeff Moden SSC Guru Points: 1003894 More actions · Answer 9

mtassin (5/2/2011)
Lol I finally made a mention in one of Jeff's articles.
One day I'll have to collaborate with Jeff to define just what this alliance is.
On another note though, I notice something about the final graph of performance around the 270 mark on the X axis. The pure SQL set based method seems to continue on a straight line path, while the CLR methods seems to be starting to curve upwards similarly to what Jeff was complaining about the original Tally splitter doing, just later in the X series than the original Tally did.
Btw, before I start experimenting with Jeff's downloads, if you were to concatenate the deliminters onto the string before running it through the splitter, wouldn't that help with the performance issue? As I see it, the CHARINDEX searches to find the delimiters is repeating the concatenation over and over within the where clause, but I could be seeing this wrong.
SELECT ItemNumber = ROW_NUMBER() OVER (ORDER BY t.N),
Item = SUBSTRING(@pString, t.N, CHARINDEX(@pDelimiter, @pString + @pDelimiter, t.N) - t.N)
FROM cteTally t
WHERE t.N <= DATALENGTH(@pString)+1 --DATATLENGTH allows for trailing space delimiters
AND SUBSTRING(@pDelimiter + @pString, t.N, 1) = @pDelimiter
;
If that were rewritten as
SET @pString = ',' + @pString + ',';
SELECT ItemNumber = ROW_NUMBER() OVER (ORDER BY t.N),
Item = SUBSTRING(@pString, t.N, CHARINDEX(@pDelimiter, @pString, t.N) - t.N)
FROM cteTally t
WHERE t.N <= DATALENGTH(@pString)+1 --DATATLENGTH allows for trailing space delimiters
AND SUBSTRING(@pString, t.N, 1) = @pDelimiter
;
Doesn't that help the performance hit from the concatenation in the where clause?

Hi Mark,

Heh... I'm thinking T-Shirts with the classic "No RBAR" symbol on them to start with! 😀

Shifting gears... I actually tried the pre-concatenation but it left the original two problems. Sure, it makes the splitter function look like it's doing a better job if you only consider the splitter function but, overall, you still have the time consumed by the concatenation. The other problem that I was working on but didn't emphasize in the article is that such concatenation means that you can never split a full 8K bytes. Depending on whether you concatenate 1 or 2 delimiters (the original function used 1 on one end or the other at different times), you'll always fall 1 or 2 bytes short of a full 8K. The new splitter can handle a full 8K.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1003894 More actions · Answer 10

Robert Livermore (5/2/2011)
Congratulations, You just rediscovered a LRLA(1) parser (left to right lookahead one), invented in 20 years ago by DICK GRUNE and CERIEL JACOBS. PARSING TECHNIQUES A Practical Guide. I am impress you implemented it in TSQL. Here is another experiment you can try. I think it should come out equal in speed; but lower in memory because same parsing technique but manages memory with CLR types. Take a look at http://irony.codeplex.com/. This is a very good implementation of a LRLA(1) parser. It should be able to run in the SQLCCLR as safe code there are no external dependencies. Grab the BNF grammar for CSV from http://www.ietf.org/rfc/rfc4180.txt
One more idea to explore.
The LRLA(1) parser can take generic grammars. You could easily parse JSON http://www.ietf.org/rfc/rfc4627.txt into SQL table. Source data is data type richer compared to CSV because JSON values are type, not just tuples of strings for CSV.

Nah... didn't "just" discover it. The old splitter, the basis of which has been around for years before I came along, used the same approach. It's good to know the history behind it, though. And thanks for the kudo.

I took a look at the code you pointed to. Very clever indeed. Will it parse "True" CSV with double-quoted text qualifiers?

Also, you should send a note to Phil Factor. He's been working on splitters for JSON and he'd be very interested in the "Irony" code.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

mtassin SSC-Insane Points: 23099 More actions · Answer 11

Jeff Moden (5/2/2011)
mtassin (5/2/2011)
Lol I finally made a mention in one of Jeff's articles.
Doesn't that help the performance hit from the concatenation in the where clause?
Hi Mark,
Heh... I'm thinking T-Shirts with the classic "No RBAR" symbol on them to start with! 😀
Shifting gears... I actually tried the pre-concatenation but it left the original two problems. Sure, it makes the splitter function look like it's doing a better job if you only consider the splitter function but, overall, you still have the time consumed by the concatenation. The other problem that I was working on but didn't emphasize in the article is that such concatenation means that you can never split a full 8K bytes. Depending on whether you concatenate 1 or 2 delimiters (the original function used 1 on one end or the other at different times), you'll always fall 1 or 2 bytes short of a full 8K. The new splitter can handle a full 8K.

I suppose, though these days if I suspect I'm going to exceed or even come close to exceeding the 8k, I'd switch to varchar(max). Which I know this article doesn't cover (because you said you'd get to it).

Today I get to learn something, I figured the issue with the concatenation was that as it was part of a WHERE clause that was re-applied as the tally table progressed through the set, that this was causing the concatentation to be reapplied for each value of N in the tally table.

I just don't understand how concatenating a pair of commas to a string can be such a performance impact, or how it can cause the performance to change slope like that later in the graph. I can see it causing a lag as the concatenation is done, just not that lag increasing as the size of the string does.

--Mark Tassin
MCITP - SQL Server DBA
Proud member of the Anti-RBAR alliance.
For help with Performance click this link[/url]
For tips on how to post your problems[/url]

Jeff Moden SSC Guru Points: 1003894 More actions · Answer 12

Solomon Rutzky (5/2/2011)
Jeff Moden (5/2/2011)
David Data (5/2/2011)
All these methods use CHARINDEX to efficiently see where the next delimiter is. But what about handling CSV strings with quoted text, e.g. where "John Smith","23, The High Street","Sometown" has to be split into 3 strings? Do you have a good method for this sort of data?
(This is why I prefer tab separated to comma separated, with all its different ways to handle embedded comas, but usually you have to process the client's data as they give it to you, not as you'd like it :rolleyes: )
That would be "True" CSV as it was meant to be instead of the this "Comedy Separated Value" stuff that comma separated is referred to as. You're correct that this splitter won't handle it especially when the quotes may be conditional. There IS a way to handle true CSV's using ACE drivers (64 bit) or Jet drivers (32 bit). I guess a "Spackle" article on those would do.
I also usually prefer TAB delimited or, even better, ASCII characters 28 through 31 (yep, there are characters designed for this type of stuff but no one uses them anymore). But, again, you're correct that you usually have to take whatever some 3rd party is giving you. Of course, that's not a problem because they always do it "right", huh? 😛
David and Jeff: the real issue here is that ANY delimiter, whether single or multiple characters, can be embedded in the data itself. The amount of code it would take to turn off and back on again the parsing within text-qualified sections of the data would likely render the code much less efficient (and I am not sure if it would truly be set-based). However, there is a way: Regular Expressions via SQLCLR ;-). A RegEx Split function can handle text-qualified data with embedded delimiters. You can find some examples around the net on doing this yourself or you can simply download an already done free set of functions at: http://www.SQLsharp.com/.
That said, if Jeff is able to make a set-based function to do this then I would love to see it as I know that I would learn something from it :-).
Take care,
Solomon...

Several years ago, I actually did make a pure T-SQL solution for "True" CSV which allowed not only embedded delimiters, but also allowed for "escaped" double quotes to be embedded. It worked very nicely but was a fair bit slower as you would expect it be even though a Tally Table certainly came into play.

I set it aside in favor of the more efficient splitters available in Jet (and now, ACE) although those two methods are a bit more difficult to use. The algorithm I used would make a heck of a nice CLR, though. Since it's "dedicated code" instead of "general purpose" such as RegEx is, it may be even faster.

I'm impressed by the wide array of functions available in SQL# but also a bit disappointed. I'm definitely NOT an anti_CLR zealot, but there are a lot of functions (random numbers, for example) that are just too simple to do correctly in T-SQL. I AM, however, seriously interested in the BULK EXPORT I see there. I wish MS had thought of such a thing and it's always perplexed by as to why they didn't.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

mtassin SSC-Insane Points: 23099 More actions · Answer 13

Jeff Moden (5/2/2011)
Hi Mark,
Heh... I'm thinking T-Shirts with the classic "No RBAR" symbol on them to start with! 😀

For just $19.95 and if you're one of the first 20 callers we'll send you another one absolutely free (except for an additional $8.95/shirt S&H of course).

🙂

--Mark Tassin
MCITP - SQL Server DBA
Proud member of the Anti-RBAR alliance.
For help with Performance click this link[/url]
For tips on how to post your problems[/url]

Jeff Moden SSC Guru Points: 1003894 More actions · Answer 14

mtassin (5/2/2011)
Today I get to learn something, I figured the issue with the concatenation was that as it was part of a WHERE clause that was re-applied as the tally table progressed through the set, that this was causing the concatentation to be reapplied for each value of N in the tally table.

Ah... let me make a correction here. I was talking about pre-concatenation within a function and you're talking about doing it outside the function. You are correct... pre-concatenation does definitely speed things up for the reasons you stated and allows things like the XML method to run fairly quickly and in a linear fashion. I see no reason why it wouldn't also do so for the code you offered.

To be absolutely honest, I didn't try pre-concatenation outside of the function. It's likely going to be very fast... maybe even faster than this new function because the new function does an ISNULL/NULLIF on every element whereas your fine example does not. But, can you imagine the outcry if I settled for pre-concatenation outside of the function? :Whistling:

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)

Jeff Moden SSC Guru Points: 1003894 More actions · Answer 15

@mark-3,

IIRC, I believe I tried doing pre-aggregation in an mTVF and it slowed things down too much (I stated something to that effect in the article). I can't find my notes on it so I must have dismissed it pretty early in the game but I'll revisit it. Still, even with using a zero based Tally Table and only concatenating a delimiter at the end, it'll still only allow 7999 bytes which may prove to be a distraction to some because it won't error on any truncation that will occur when passed a full 8K string.

--Jeff Moden

RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

Change is inevitable... Change for the better is not.

Helpful Links:
How to post code problems
How to Post Performance Problems
Create a Tally Function (fnTally)