|
|
|
SSCarpal Tunnel
       
Group: General Forum Members
Last Login: Yesterday @ 8:17 AM
Points: 4,804,
Visits: 8,074
|
|
Jeff, your article is great! The final splitter function could have been 3 times slower than the original one, and the article would have been memorable anyway. It's a tale of stubborn resistance to surrender and a "thinking outside the box" lesson for all of us. Thank you very much.
Get your two-cent-answer quickly The Spaghetti DBA
|
|
|
|
|
SSC-Insane
         
Group: General Forum Members
Last Login: Today @ 2:45 AM
Points: 21,630,
Visits: 27,485
|
|
|
|
|
|
SSC-Addicted
      
Group: General Forum Members
Last Login: Monday, May 20, 2013 7:10 AM
Points: 485,
Visits: 1,568
|
|
| Awesome article. Thanks for the effort, Jeff.
|
|
|
|
|
SSC-Dedicated
           
Group: General Forum Members
Last Login: Yesterday @ 4:51 PM
Points: 32,923,
Visits: 26,811
|
|
Solomon Rutzky (5/1/2011) Very thorough analysis. Great job!
Take care, Solomon...
Thanks for stopping by, Solomon.
--Jeff Moden "RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".
First step towards the paradigm shift of writing Set Based code: Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."
For better, quicker answers on T-SQL questions, click on the following... http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better answers on performance questions, click on the following... http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
|
|
|
|
|
SSC-Dedicated
           
Group: General Forum Members
Last Login: Yesterday @ 4:51 PM
Points: 32,923,
Visits: 26,811
|
|
Abrar Ahmad_ (5/2/2011)
Solomon Rutzky (5/1/2011) Very thorough analysis. Great job!
Take care, Solomon... But it seems at least a week long activity !!!

Goodish!
Heh... it seemed like a year to me. Thanks for stopping by, Abrar.
--Jeff Moden "RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".
First step towards the paradigm shift of writing Set Based code: Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."
For better, quicker answers on T-SQL questions, click on the following... http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better answers on performance questions, click on the following... http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
|
|
|
|
|
SSC-Dedicated
           
Group: General Forum Members
Last Login: Yesterday @ 4:51 PM
Points: 32,923,
Visits: 26,811
|
|
David Data (5/2/2011)
All these methods use CHARINDEX to efficiently see where the next delimiter is. But what about handling CSV strings with quoted text, e.g. where "John Smith","23, The High Street","Sometown" has to be split into 3 strings? Do you have a good method for this sort of data? (This is why I prefer tab separated to comma separated, with all its different ways to handle embedded comas, but usually you have to process the client's data as they give it to you, not as you'd like it  )
That would be "True" CSV as it was meant to be instead of the this "Comedy Separated Value" stuff that comma separated is referred to as. You're correct that this splitter won't handle it especially when the quotes may be conditional. There IS a way to handle true CSV's using ACE drivers (64 bit) or Jet drivers (32 bit). I guess a "Spackle" article on those would do.
I also usually prefer TAB delimited or, even better, ASCII characters 28 through 31 (yep, there are characters designed for this type of stuff but no one uses them anymore). But, again, you're correct that you usually have to take whatever some 3rd party is giving you. Of course, that's not a problem because they always do it "right", huh? 
Thanks for stopping by, David. I'll see what I can do for the true CSV stuff, soon. 
--Jeff Moden "RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".
First step towards the paradigm shift of writing Set Based code: Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."
For better, quicker answers on T-SQL questions, click on the following... http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better answers on performance questions, click on the following... http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
|
|
|
|
|
SSC Veteran
      
Group: General Forum Members
Last Login: Wednesday, May 22, 2013 9:37 AM
Points: 285,
Visits: 1,378
|
|
Jeff Moden (5/2/2011)
David Data (5/2/2011)
All these methods use CHARINDEX to efficiently see where the next delimiter is. But what about handling CSV strings with quoted text, e.g. where "John Smith","23, The High Street","Sometown" has to be split into 3 strings? Do you have a good method for this sort of data? (This is why I prefer tab separated to comma separated, with all its different ways to handle embedded comas, but usually you have to process the client's data as they give it to you, not as you'd like it  ) That would be "True" CSV as it was meant to be instead of the this "Comedy Separated Value" stuff that comma separated is referred to as. You're correct that this splitter won't handle it especially when the quotes may be conditional. There IS a way to handle true CSV's using ACE drivers (64 bit) or Jet drivers (32 bit). I guess a "Spackle" article on those would do. I also usually prefer TAB delimited or, even better, ASCII characters 28 through 31 (yep, there are characters designed for this type of stuff but no one uses them anymore). But, again, you're correct that you usually have to take whatever some 3rd party is giving you. Of course, that's not a problem because they always do it "right", huh? 
David and Jeff: the real issue here is that ANY delimiter, whether single or multiple characters, can be embedded in the data itself. The amount of code it would take to turn off and back on again the parsing within text-qualified sections of the data would likely render the code much less efficient (and I am not sure if it would truly be set-based). However, there is a way: Regular Expressions via SQLCLR . A RegEx Split function can handle text-qualified data with embedded delimiters. You can find some examples around the net on doing this yourself or you can simply download an already done free set of functions at: http://www.SQLsharp.com/.
That said, if Jeff is able to make a set-based function to do this then I would love to see it as I know that I would learn something from it .
Take care, Solomon...
SQL# - http://www.SQLsharp.com/
|
|
|
|
|
SSChasing Mays
      
Group: General Forum Members
Last Login: Wednesday, May 15, 2013 7:43 AM
Points: 656,
Visits: 2,921
|
|
Jeff, your exposition of how you mentally wrestle with a problem was exceptional. That alone made your article worth reading! You've got a rare combination of technical skill, lucid writing style, and humor.
One question: where's the CLR function? I didn't see it in the attachments.
Thanks, Rich
|
|
|
|
|
SSC-Enthusiastic
      
Group: General Forum Members
Last Login: Thursday, May 09, 2013 9:20 AM
Points: 196,
Visits: 49
|
|
Congratulations, You just rediscovered a LRLA(1) parser (left to right lookahead one), invented in 20 years ago by DICK GRUNE and CERIEL JACOBS. PARSING TECHNIQUES A Practical Guide. I am impress you implemented it in TSQL. Here is another experiment you can try. I think it should come out equal in speed; but lower in memory because same parsing technique but manages memory with CLR types. Take a look at http://irony.codeplex.com/. This is a very good implementation of a LRLA(1) parser. It should be able to run in the SQLCCLR as safe code there are no external dependencies. Grab the BNF grammar for CSV from http://www.ietf.org/rfc/rfc4180.txt One more idea to explore. The LRLA(1) parser can take generic grammars. You could easily parse JSON http://www.ietf.org/rfc/rfc4627.txt into SQL table. Source data is data type richer compared to CSV because JSON values are type, not just tuples of strings for CSV.
|
|
|
|
|
SSC-Dedicated
           
Group: General Forum Members
Last Login: Yesterday @ 4:51 PM
Points: 32,923,
Visits: 26,811
|
|
Gianluca Sartori (5/2/2011) Jeff, your article is great! The final splitter function could have been 3 times slower than the original one, and the article would have been memorable anyway. It's a tale of stubborn resistance to surrender and a "thinking outside the box" lesson for all of us. Thank you very much.
I'm humbled by your observation, Gianluca. I didn't think that anyone would understand that especially with the silly and fun metaphors I made about my angst on the subject. In real life, it was killing me that the original function performed so terribly on the high end of strings and I had to either fix it or apologize for it. Thank you for the very kind words.
For the record, I really do have the artwork with the cat in the box. I commissioned a friend to make it for me several years ago to remind me that "Before [I] can think outside the box, [I] first have to realize... [I'm] in a box!"
--Jeff Moden "RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".
First step towards the paradigm shift of writing Set Based code: Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."
For better, quicker answers on T-SQL questions, click on the following... http://www.sqlservercentral.com/articles/Best+Practices/61537/
For better answers on performance questions, click on the following... http://www.sqlservercentral.com/articles/SQLServerCentral/66909/
|
|
|
|