October 10, 2009 at 8:54 pm
Hi,
I have been looking at this post (http://sqlblog.com/blogs/adam_machanic/archive/2009/04/26/faster-more-scalable-sqlclr-string-splitting.aspx) for an effective SQL-CLR equivalent of the xp-based fn_regex_split.
That solution returns a 1-column table where each row contains each split segment in the sequence.
For example, parsing string 'a,b,c,d' (delimiter ",") with Adam's functionality returns the following 1-column table:
a
b
c
d
My problem is that I need to replace fn_regex_split code where the UDF is passed 3 arguments: the string to parse, the delimiter, and the ordinal of the segment in the sequence to be returned.
In the above example, specifying "2" for the 3rd parameter would return "b".
So I need a scalar-UDF equivalent of Adam's code in the above URL that returns to me a specific segment after the split and not the entire sequence as a column of values.
If anyone could help me with this, it would be very much appreciated.
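For what it's worth, a minimal sketch of the kind of scalar UDF being asked for, using plain CHARINDEX/SUBSTRING rather than CLR (the function name is hypothetical):

```sql
-- Scalar UDF returning the Nth delimited segment of a string.
-- Returns NULL if the string has fewer segments than requested.
CREATE FUNCTION dbo.fn_GetSplitSegment
(
    @String    VARCHAR(8000),
    @Delimiter CHAR(1),
    @Ordinal   INT
)
RETURNS VARCHAR(8000)
AS
BEGIN
    DECLARE @Start INT = 1, @End INT, @Counter INT = 1;

    -- Walk forward one delimiter at a time until the requested segment
    WHILE @Counter < @Ordinal
    BEGIN
        SET @Start = CHARINDEX(@Delimiter, @String, @Start) + 1;
        IF @Start = 1 RETURN NULL;  -- ran out of segments
        SET @Counter = @Counter + 1;
    END;

    SET @End = CHARINDEX(@Delimiter, @String, @Start);
    IF @End = 0 SET @End = LEN(@String) + 1;  -- last segment

    RETURN SUBSTRING(@String, @Start, @End - @Start);
END;
GO

SELECT dbo.fn_GetSplitSegment('a,b,c,d', ',', 2);  -- returns 'b'
```

This is a sketch only; a production version would want the set-based or CLR approaches discussed below for performance.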
__________________________________________________________________________________
SQL Server 2016 Columnstore Index Enhancements - System Views for Disk-Based Tables
Persisting SQL Server Index-Usage Statistics with MERGE
Turbocharge Your Database Maintenance With Service Broker: Part 2
October 10, 2009 at 11:36 pm
You don't need a CLR for this...
http://www.sqlservercentral.com/articles/T-SQL/63003/
--Jeff Moden
Change is inevitable... Change for the better is not.
October 13, 2009 at 3:28 am
Jeff Moden (10/10/2009)
You don't need a CLR for this...
Thanks, I will certainly take a close look; I just took a quick glance at your article, and it looks very interesting.
I have to ask the question, however, which solution would scale better for long strings, the T-SQL one or a SQL-CLR one?
October 13, 2009 at 11:53 pm
I think that you'll find the best string splitting routines, CLR, Tally Table and otherwise, in this thread here. It's long, but well worth the read.
-- RBarryYoung, (302)375-0451, blog: MovingSQL.com, Twitter: @RBarryYoung
Proactive Performance Solutions, Inc. "Performance is our middle name."
October 14, 2009 at 12:16 am
Marios Philippopoulos (10/13/2009)
Jeff Moden (10/10/2009)
You don't need a CLR for this...

Thanks, I will certainly take a close look; I just took a quick glance at your article, and it looks very interesting.
I have to ask the question, however, which solution would scale better for long strings, the T-SQL one or a SQL-CLR one?
The Tally table splitter makes sucking sounds on VARCHAR(MAX) because it doesn't like doing joins on "out of row" datatypes. On things that require VARCHAR(8000) or less, the Tally table does well and comes close to the CLR solution on the really short stuff.
Surprisingly, a well-formed While Loop solution seems to do well for VARCHAR(MAX), but it still can't hold a candle to the CLR. And make no mistake about it... it's a very tight, nicely written CLR. If you are really compelled to use a CLR, use the one that Flo posted on the thread Barry mentioned, written by one of his programmers.
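The Tally-table technique Jeff is referring to can be sketched roughly as follows (the table and column names are illustrative; dbo.Tally is assumed to be a table of sequential integers N = 1, 2, 3, ... at least as long as the longest string to split):

```sql
DECLARE @String VARCHAR(8000) = 'a,b,c,d';

-- Pad the string with delimiters on both ends, then pull out the
-- substring following every delimiter position in one set-based pass.
SELECT
    SUBSTRING(',' + @String + ',', t.N + 1,
              CHARINDEX(',', ',' + @String + ',', t.N + 1) - t.N - 1) AS Item
FROM dbo.Tally t
WHERE t.N <= LEN(',' + @String + ',') - 1
  AND SUBSTRING(',' + @String + ',', t.N, 1) = ',';
-- returns one row per segment: a, b, c, d
```

Because the work is a single set-based join against the number table rather than a loop, this performs well up to VARCHAR(8000); the join against a MAX type is where it degrades, as described above.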
--Jeff Moden
October 14, 2009 at 9:08 am
Thanks guys, I will take a look at that thread.
October 16, 2009 at 2:12 am
I just want to add:
MAX data types like VARCHAR(MAX) are not 'out of row' data types. By specifying 'MAX' you are letting the storage engine choose how best to store the data. If the actual data stored is 8000 bytes or less, it will be stored exactly as if you had declared the column as a non-MAX data type. So, in the case of VARCHAR(MAX), the data would be stored exactly as for VARCHAR(n) - even going as far as storing data in ROW_OVERFLOW_DATA allocation units if required.
If the data exceeds 8000 bytes, it is stored as a LOB - exactly as for the deprecated text/ntext/image data types. Of course, you can modify this behaviour using sp_tableoption to set large value types out of row so that MAX data types are always stored as a LOB - but that is not the default.
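The default Paul describes can be flipped per table with sp_tableoption; a quick sketch (the table name is hypothetical):

```sql
-- Force MAX data types in dbo.SomeTable to always be stored as LOBs,
-- instead of in-row when the value fits (the default behaviour):
EXEC sp_tableoption 'dbo.SomeTable', 'large value types out of row', 1;

-- Verify the current setting:
SELECT name, large_value_types_out_of_row
FROM sys.tables
WHERE name = 'SomeTable';
```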
Finally, string manipulation using T-SQL pretty much always sucks - at least in non-trivial scenarios. If you are competent in a .NET language (like C#, C++, or VB), writing the routine as a common language runtime (CLR) routine is often the 'best' solution.
Paul
October 17, 2009 at 12:35 am
Ummm... ok... take any split function that works on a VARCHAR(8000) and test it for performance using something less than 8K to split (sorry for being obvious). Then just change all the VARCHAR(8000)'s to VARCHAR(MAX) and watch the code suddenly take twice as long with the same data you used to test the VARCHAR(8000) code. If what you say is true, why does that happen?
Heh... "not being snarky" here... I really want to know. 😀
And except for a really well written split CLR function, I disagree on the CLR thing... take a look at how badly the CLR got beaten in the recent article on concatenation.
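The VARCHAR(8000) vs VARCHAR(MAX) comparison Jeff describes can be sketched along these lines (the two function names are hypothetical stand-ins for the same splitter declared with each parameter type):

```sql
-- Time the identical splitter code, differing only in parameter type,
-- over the same sub-8K test data.
DECLARE @s VARCHAR(8000) = REPLICATE('x,', 3000),  -- 6000 characters
        @d DATETIME;

SET @d = GETDATE();
SELECT COUNT(*) FROM dbo.Split8K(@s, ',');    -- VARCHAR(8000) version
PRINT DATEDIFF(ms, @d, GETDATE());

SET @d = GETDATE();
SELECT COUNT(*) FROM dbo.SplitMax(@s, ',');   -- same code, VARCHAR(MAX)
PRINT DATEDIFF(ms, @d, GETDATE());
```

The data never exceeds 8000 bytes in either run, so any difference comes from the execution code paths for MAX types, not from storage.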
--Jeff Moden
October 17, 2009 at 4:12 am
Jeff Moden (10/17/2009)
take any split function...
Ah right. It's really important to distinguish between storage and execution behaviour with MAX data types. The comments I made previously were exclusively related to physical storage. It was the 'out of row' statement that prompted that - since talking about 'out of row' only makes sense when talking about physical storage.
When it comes to execution, SQL Server has to use different strategies and code paths when it deals with variables, parameters and the like which could potentially be up to 2GB in size. So in code it certainly does make sense to define things using non-MAX data types unless structures over 8000 bytes are actually required. I suppose it's worth mentioning that I include XML and the old LOB types when I say 'MAX datatypes'.
Jeff Moden (10/17/2009)
And except for a really well written split CLR function, I disagree on the CLR thing... take a look at how badly the CLR got beaten on the recent article on concatenation.
I do wish you would come to see CLR routines as complementary to T-SQL 🙂
T-SQL, as I have said many times before, is always the default choice. It has the distinct advantage of being local to the data and being very heavily optimized for set-based manipulation of that data.
CLR objects extend T-SQL and provide the SQL Server developer with access to much of the power of the .NET framework. Specifically, it allows us to do things that previously would have been hard or dumb to try in the first place. I for one will never go back to extended stored procedures, sp_OA methods, or xp_cmdshell if I can help it!
In the case of string manipulation, on a level playing field, a .NET language routine will absolutely knock the socks off any T-SQL implementation, including the XML trick. The only reason that concatenation is faster using the XML trick (and let's not forget that non-XML T-SQL methods suck too) is that the data is local to it.
The cost of passing the input strings to the CLR row by row more than outweighs the fact that the actual work - the concatenation - is much faster in a .NET language. The situation with string splitting is much more favourable to the CLR code: a large string is passed to the code once. The cost of that extra step, and streaming the results back, is small compared to the massively more efficient splitting possible in .NET code. Splitting, after all, is computationally intensive.
Forget Adam Machanic's very highly optimized CLR code (awesome as it is). The first CLR solution in the splitting thread came from me - not optimized for anything at all, a totally naive and amateurish C# effort. Nevertheless, it beat all the well-established T-SQL efforts in very nearly every test case. Considering the head start the T-SQL has, this should say something about how inefficient string manipulation is in SQL Server.
If the data to be concatenated/split existed in a text file, you wouldn't expect T-SQL to be a good choice since the overhead of importing the data in the first place would likely be the dominant factor. With that in mind, it is amazing that the CLR stuff is ever faster than T-SQL methods! But it is 😉
Concatenation is a really tough ask for a CLR routine - lots of data access and comparatively low computation costs make this a less than ideal candidate for CLR. The best T-SQL method has been very widely researched and tested, so the interest for me is in finding a good CLR solution which comes as close as possible to the T-SQL version.
As I said elsewhere tonight, I am somewhat uncomfortable with the XML solution. (1) Why the heck are we converting strings to XML and back just to join them together; (2) the syntax is just plain ugly; and (3) the XML method is incomplete - try concatenating a string containing CHAR(29) for example:
Msg 6841, Level 16, State 1, Line 1
FOR XML could not serialize the data for node 'NoName' because it contains a character (0x001D) which is not allowed in XML. To retrieve this data using FOR XML, convert it to binary, varbinary or image data type and use the BINARY BASE64 directive.
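The XML trick Paul is objecting to, and the input that breaks it, look roughly like this (table and column names are illustrative):

```sql
-- The FOR XML PATH('') concatenation trick: build a delimited string,
-- then STUFF away the leading comma.
SELECT STUFF(
    (SELECT ',' + Name
     FROM dbo.SomeTable
     FOR XML PATH('')), 1, 1, '');

-- Fails with Msg 6841 when a value contains a character illegal in XML,
-- such as CHAR(29) (0x001D):
SELECT ',' + CHAR(29) FOR XML PATH('');
```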
Paul
October 17, 2009 at 12:04 pm
Paul White (10/17/2009)
I do wish you would come to see CLR routines as complementary to T-SQL 🙂
Heh... Actually, I do and I also understand the power there. The problem I have is that too many people use it as a crutch to do things that they don't know how to do in T-SQL and it usually ends up costing quite a bit in a lot of areas including performance.
In the case of string manipulation, on a level playing field, a .NET language routine will absolutely knock the socks off any T-SQL implementation, including the XML trick. The only reason that concatenation is faster using the XML trick (and let's not forget that non-XML T-SQL methods suck too) is that the data is local to it.
Agreed... but as you very well point out, it's not a level playing field.
If the data to be concatenated/split existed in a text file, you wouldn't expect T-SQL to be a good choice since the overhead of importing the data in the first place would likely be the dominant factor. With that in mind, it is amazing that the CLR stuff is ever faster than T-SQL methods! But it is 😉
Agreed... in fact, forget CLR stuff... precompiled programs in the form of DLL's and EXE's are the berries when it comes to speed if you don't have to go through layers of API's to get to the data. BCP is a pretty good example of that. What I'd really like to see is for MS to acknowledge that concatenating/splitting and other string manipulation is a very common task and to write a decent CONCATENATE and SPLIT function for T-SQL (for starters).
As I said elsewhere tonight, I am somewhat uncomfortable with the XML solution. (1) Why the heck are we converting strings to XML and back just to join them together; (2) the syntax is just plain ugly; and (3) the XML method is incomplete - try concatenating a string containing CHAR(29) for example:
No argument there... it's stupid. Like I said, a decent CONCATENATE function by MS would keep people from doing it all wrong because, right now, none of the methods are correct in my opinion.
Just to be absolutely clear, and as contrary as it sounds to many of my posts, I'm not anti-CLR. But it has to make sense, like a really fast, well-written split CLR does (as proven by testing). Writing a CLR to make up for a lack of knowledge in T-SQL, especially if the CLR turns out to be slower or less efficient, doesn't make sense to me.
As a side bar, I wish MS would do a couple of other things like allow parameterized BULK INSERTs, FROM clauses, and USE statements instead of having to write it as dynamic SQL. A good DIR function would be nice instead of having to resort to xp_CmdShell, sp_OA*, or undocumented/unsupported features like the 3rd operand of xp_DirTree. I forget who it is but one of the frequent posters says it very well in his signature line... "SQL = Scarcely Qualifies as a Language". 😀 There wouldn't actually be a need for CLR's if SQL did what folks needed.
--Jeff Moden
October 17, 2009 at 12:37 pm
I'll duck in here for a minute.
I have to disagree for a second there with extending T-SQL into areas it "doesn't belong in". Jeff - I was with you all of the way up to building in a dir command. Using parameters for USE, FROM clauses, etc (by the way - I'd throw just plain "saved parameterized queries" to the pile), all extend SQL in the way it's "meant to be extended", i.e. as a DATA manipulation language. Anything involving the "outside world" IMO just incentivizes all of those cruel ways some folks love to torture SQL Server (e.g. let me use my SQL server to automate EVERYTHING).
In short - let's not add any more procedural capabilities, and keep that stuff out. The dev team has enough things to deal with in the DML realm (like windowing aggregates, making functions actually "work" worth a damn, better IntelliSense for SQL devs), instead of having to worry about why Johnny's directory of record covers can't talk to the OS.
----------------------------------------------------------------------------------
Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?
October 17, 2009 at 5:11 pm
Matt Miller (#4) (10/17/2009)
Anything involving the "outside world" IMO just incentivizes all of those cruel ways some folks love to torture SQL Server (e.g. let me use my SQL server to automate EVERYTHING).
Heh... nah... I wanted that stuff to make life a little easier in the ETL world for me. As a side benefit, people would abuse that stuff just like they have Cursors and CLR's... makes more work for folks like me to go in and clean stuff up at a pretty good rate. 😉
--Jeff Moden
October 17, 2009 at 10:18 pm
Jeff Moden (10/17/2009)
Matt Miller (#4) (10/17/2009)
Anything involving the "outside world" IMO just incentivizes all of those cruel ways some folks love to torture SQL Server (e.g. let me use my SQL server to automate EVERYTHING).

Heh... nah... I wanted that stuff to make life a little easier in the ETL world for me. As a side benefit, people would abuse that stuff just like they have Cursors and CLR's... makes more work for folks like me to go in and clean stuff up at a pretty good rate. 😉
I think "benefit" is dependent on whether you get to clean it up for "free" or for a fee. For the first time in many long years, I find myself not consulting at all, so the only messes I get stuck cleaning are the "free" ones, so they're not nearly as fun...:)
That said - I could see helping the ETL tools (outside of SSIS), but I'd almost prefer it be a separate utility kind of like BCP. So you can externalize the security, etc.... But I will stop there, it's starting to sound like a CLR command-line routine.:)
October 17, 2009 at 11:58 pm
Matt Miller (#4) (10/17/2009)
Jeff Moden (10/17/2009)
Matt Miller (#4) (10/17/2009)
Anything involving the "outside world" IMO just incentivizes all of those cruel ways some folks love to torture SQL Server (e.g. let me use my SQL server to automate EVERYTHING).

Heh... nah... I wanted that stuff to make life a little easier in the ETL world for me. As a side benefit, people would abuse that stuff just like they have Cursors and CLR's... makes more work for folks like me to go in and clean stuff up at a pretty good rate. 😉

I think "benefit" is dependent on whether you get to clean it up for "free" or for a fee. For the first time in many long years, I find myself not consulting at all, so the only messes I get stuck cleaning are the "free" ones, so they're not nearly as fun...:)
Ouch! That's a real shame, Matt, as that has always been my favorite thing about consulting.
...
That said - I could see helping the ETL tools (outside of SSIS), but I'd almost prefer it be a separate utility kind of like BCP. So you can externalize the security, etc.... But I will stop there, it's starting to sound like a CLR command-line routine.:)
Actually, I did write a .NET exe just like that; I could probably convert it to CLR (of course, I never know when SQLCLR's restrictions are going to bite me :-)). The problem, though, was that I could never get my .NET version to run as fast as whatever archaic framework they wrote BCP in.
-- RBarryYoung
October 18, 2009 at 12:07 am
Good response there Jeff - and Matt makes good points too.
I'm glad we are much closer in agreement on the CLR stuff than I suspected.
As far as extending T-SQL is concerned...I do agree that CONCATENATE and SPLIT are glaring omissions, along with many of the other suggestions you make (USE is particularly vexing).
Integrating file system stuff? Less convinced on that one...I think I lean more toward Matt's views there, but it's not clear cut - none of the available options are entirely satisfactory except maybe CLR stuff on 64-bit platforms.
Paul
edit: Oh! And Barry too - who snuck in a response while I was typing. IIRC the BCP interfaces are already exposed in .NET...?
2nd edit: Ah yes, it was SqlBulkCopy I was thinking of: http://msdn.microsoft.com/en-us/library/system.data.sqlclient.sqlbulkcopy.aspx