Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase ««12345»»»

Tally OH! An Improved SQL 8K “CSV Splitter” Function Expand / Collapse
Author
Message
Posted Monday, May 2, 2011 3:33 AM


SSCertifiable

SSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiableSSCertifiable

Group: General Forum Members
Last Login: Yesterday @ 4:31 AM
Points: 5,014, Visits: 10,514
Jeff, your article is great!
The final splitter function could have been 3 times slower than the original one, and the article would have been memorable anyway. It's a tale of stubborn resistance to surrender and a "thinking outside the box" lesson for all of us.
Thank you very much.


--
Gianluca Sartori

Get your two-cent-answer quickly
spaghettidba.com
@spaghettidba
Post #1101480
Posted Monday, May 2, 2011 4:29 AM


SSC-Insane

SSC-InsaneSSC-InsaneSSC-InsaneSSC-InsaneSSC-InsaneSSC-InsaneSSC-InsaneSSC-InsaneSSC-InsaneSSC-InsaneSSC-Insane

Group: General Forum Members
Last Login: Yesterday @ 4:59 PM
Points: 23,000, Visits: 31,482
Just wanted to add my 2 cents as well. Very well done. I will be adding these routines to my toolbox as well.

I'm curious, I'm wondering if I could make them work in Oracle as well. Maybe I'll give it a shot when I have some time.



Lynn Pettis

For better assistance in answering your questions, click here
For tips to get better help with Performance Problems, click here
For Running Totals and its variations, click here or when working with partitioned tables
For more about Tally Tables, click here
For more about Cross Tabs and Pivots, click here and here
Managing Transaction Logs

SQL Musings from the Desert Fountain Valley SQL (My Mirror Blog)
Post #1101495
Posted Monday, May 2, 2011 5:31 AM


Mr or Mrs. 500

Mr or Mrs. 500Mr or Mrs. 500Mr or Mrs. 500Mr or Mrs. 500Mr or Mrs. 500Mr or Mrs. 500Mr or Mrs. 500Mr or Mrs. 500

Group: General Forum Members
Last Login: Tuesday, June 24, 2014 12:24 PM
Points: 514, Visits: 1,717
Awesome article. Thanks for the effort, Jeff.
Post #1101505
Posted Monday, May 2, 2011 6:11 AM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: Today @ 9:38 AM
Points: 36,726, Visits: 31,176
Solomon Rutzky (5/1/2011)
Very thorough analysis. Great job!

Take care,
Solomon...


Thanks for stopping by, Solomon.


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1101518
Posted Monday, May 2, 2011 6:12 AM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: Today @ 9:38 AM
Points: 36,726, Visits: 31,176
Abrar Ahmad_ (5/2/2011)
Solomon Rutzky (5/1/2011)
Very thorough analysis. Great job!

Take care,
Solomon...


But it seems at least a week long activity !!!


Goodish!


Heh... it seemed like a year to me. Thanks for stopping by, Abrar.


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1101519
Posted Monday, May 2, 2011 6:21 AM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: Today @ 9:38 AM
Points: 36,726, Visits: 31,176
David Data (5/2/2011)
All these methods use CHARINDEX to efficiently see where the next delimiter is. But what about handling CSV strings with quoted text, e.g. where "John Smith","23, The High Street","Sometown" has to be split into 3 strings? Do you have a good method for this sort of data?

(This is why I prefer tab separated to comma separated, with all its different ways to handle embedded comas, but usually you have to process the client's data as they give it to you, not as you'd like it )


That would be "True" CSV as it was meant to be instead of the this "Comedy Separated Value" stuff that comma separated is referred to as. You're correct that this splitter won't handle it especially when the quotes may be conditional. There IS a way to handle true CSV's using ACE drivers (64 bit) or Jet drivers (32 bit). I guess a "Spackle" article on those would do.

I also usually prefer TAB delimited or, even better, ASCII characters 28 through 31 (yep, there are characters designed for this type of stuff but no one uses them anymore). But, again, you're correct that you usually have to take whatever some 3rd party is giving you. Of course, that's not a problem because they always do it "right", huh?

Thanks for stopping by, David. I'll see what I can do for the true CSV stuff, soon.


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1101523
Posted Monday, May 2, 2011 6:41 AM


Old Hand

Old HandOld HandOld HandOld HandOld HandOld HandOld HandOld Hand

Group: General Forum Members
Last Login: Wednesday, July 9, 2014 3:30 PM
Points: 352, Visits: 1,868
Jeff Moden (5/2/2011)
David Data (5/2/2011)
All these methods use CHARINDEX to efficiently see where the next delimiter is. But what about handling CSV strings with quoted text, e.g. where "John Smith","23, The High Street","Sometown" has to be split into 3 strings? Do you have a good method for this sort of data?

(This is why I prefer tab separated to comma separated, with all its different ways to handle embedded comas, but usually you have to process the client's data as they give it to you, not as you'd like it )


That would be "True" CSV as it was meant to be instead of the this "Comedy Separated Value" stuff that comma separated is referred to as. You're correct that this splitter won't handle it especially when the quotes may be conditional. There IS a way to handle true CSV's using ACE drivers (64 bit) or Jet drivers (32 bit). I guess a "Spackle" article on those would do.

I also usually prefer TAB delimited or, even better, ASCII characters 28 through 31 (yep, there are characters designed for this type of stuff but no one uses them anymore). But, again, you're correct that you usually have to take whatever some 3rd party is giving you. Of course, that's not a problem because they always do it "right", huh?


David and Jeff: the real issue here is that ANY delimiter, whether single or multiple characters, can be embedded in the data itself. The amount of code it would take to turn off and back on again the parsing within text-qualified sections of the data would likely render the code much less efficient (and I am not sure if it would truly be set-based). However, there is a way: Regular Expressions via SQLCLR . A RegEx Split function can handle text-qualified data with embedded delimiters. You can find some examples around the net on doing this yourself or you can simply download an already done free set of functions at: http://www.SQLsharp.com/.

That said, if Jeff is able to make a set-based function to do this then I would love to see it as I know that I would learn something from it .

Take care,
Solomon...





SQL# - http://www.SQLsharp.com/
Post #1101535
Posted Monday, May 2, 2011 6:49 AM


Say Hey Kid

Say Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey KidSay Hey Kid

Group: General Forum Members
Last Login: Thursday, July 17, 2014 2:01 PM
Points: 687, Visits: 3,002
Jeff, your exposition of how you mentally wrestle with a problem was exceptional. That alone made your article worth reading! You've got a rare combination of technical skill, lucid writing style, and humor.

One question: where's the CLR function? I didn't see it in the attachments.

Thanks,
Rich
Post #1101541
Posted Monday, May 2, 2011 7:06 AM
SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: Thursday, July 17, 2014 5:57 AM
Points: 196, Visits: 59
Congratulations, You just rediscovered a LRLA(1) parser (left to right lookahead one), invented in 20 years ago by DICK GRUNE and CERIEL JACOBS. PARSING TECHNIQUES A Practical Guide. I am impress you implemented it in TSQL. Here is another experiment you can try. I think it should come out equal in speed; but lower in memory because same parsing technique but manages memory with CLR types. Take a look at http://irony.codeplex.com/. This is a very good implementation of a LRLA(1) parser. It should be able to run in the SQLCCLR as safe code there are no external dependencies. Grab the BNF grammar for CSV from http://www.ietf.org/rfc/rfc4180.txt
One more idea to explore.
The LRLA(1) parser can take generic grammars. You could easily parse JSON http://www.ietf.org/rfc/rfc4627.txt into SQL table. Source data is data type richer compared to CSV because JSON values are type, not just tuples of strings for CSV.



Post #1101549
Posted Monday, May 2, 2011 7:27 AM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: Today @ 9:38 AM
Points: 36,726, Visits: 31,176
Gianluca Sartori (5/2/2011)
Jeff, your article is great!
The final splitter function could have been 3 times slower than the original one, and the article would have been memorable anyway. It's a tale of stubborn resistance to surrender and a "thinking outside the box" lesson for all of us.
Thank you very much.


I'm humbled by your observation, Gianluca. I didn't think that anyone would understand that especially with the silly and fun metaphors I made about my angst on the subject. In real life, it was killing me that the original function performed so terribly on the high end of strings and I had to either fix it or apologize for it. Thank you for the very kind words.

For the record, I really do have the artwork with the cat in the box. I commissioned a friend to make it for me several years ago to remind me that "Before [I] can think outside the box, [I] first have to realize... [I'm] in a box!"


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1101562
« Prev Topic | Next Topic »

Add to briefcase ««12345»»»

Permissions Expand / Collapse