RE: Splitting Strings Based on Patterns

SSC Guru

Points: 86908

December 24, 2012 at 10:30 pm

SQL Kiwi (11/29/2012)
dwain.c (11/29/2012)
fregatepallada (11/29/2012)
IMHO this task would be better implemented in CLR UDF rather than in pure T-SQL. CLR UDF easily allows to apply REGULAR EXPRESSIONS.
So how about an example?
I'd love to put it through its paces and compare the performance against the pure SQL version.
Hi Dwain,
SSC author Solomon Rutzky[/url] has the free SQL# library available at http://www.sqlsharp.com. Among the many features available in the free version running under the SAFE permission set is the RegEx_Replace function. Taking the example from the original thread and the start of this article:
Given an input string such as: 1234ABC123456XYZ1234567890ADS, I would like to replace any string of digits that is longer than 5 characters with some other character (e.g., ‘x’), while leaving the remaining characters in the string unchanged.
The whole solution is:
SELECT
SQL#.RegEx_Replace(N'1234ABC123456XYZ1234567890ADS', N'\d{5,}', N'x', -1, 1, NULL);
Producing the result:
1234ABCxXYZxADS
There are all sorts of powerful and useful functions in this library, for example, if you want to list the matches for a regular expression (which is a much more powerful superset of the CHARINDEX and PATINDEX syntax), you can use:
SELECT * FROM SQL#.RegEx_Matches(N'1234ABC123456XYZ1234567890ADS', N'\d{5,}', 1, NULL);
+----------------------------------------------------+
¦ MatchNum ¦ Value ¦ StartPos ¦ EndPos ¦ Length ¦
¦----------+------------+----------+--------+--------¦
¦ 1 ¦ 123456 ¦ 8 ¦ 13 ¦ 6 ¦
¦ 2 ¦ 1234567890 ¦ 17 ¦ 26 ¦ 10 ¦
+----------------------------------------------------+

Finally found some time to play around with the SQL# library, so I thought I'd try a bit of a performance test against PatternSplitCM. So I constructed the below test harness, after verifying that I was giving it the right RegEx match argument and that returned results were identical:

CREATE TABLE #t1 (MyString VARCHAR(8000))

DECLARE @Pattern VARCHAR(500) = '[0-9]'

;WITH Tally (n) AS (

SELECT TOP 1000 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))

FROM sys.all_columns a CROSS JOIN sys.all_columns b)

INSERT INTO #t1

SELECT REPLICATE('abc0123hij456nopq789uvw1y', 1 + ABS(CHECKSUM(NEWID())) % 315)

FROM Tally

DECLARE @MyString VARCHAR(8000), @ItemNumber INT, @Item VARCHAR(8000)

PRINT 'PatternSplitCM'

SET STATISTICS TIME ON

SELECT @MyString=MyString, @ItemNumber=ItemNumber, @Item=Item

--SELECT MyString, ItemNumber, Item, [Matched]

FROM #t1

CROSS APPLY PatternSplitCM(MyString, @Pattern)

WHERE [Matched]=1

SET STATISTICS TIME OFF

PRINT 'SQL#.RegEx_Matches'

SET STATISTICS TIME ON

SELECT @MyString=MyString, @ItemNumber=MatchNum, @Item=Value

--SELECT MyString, MatchNum, Value

FROM #t1

CROSS APPLY SQL#.RegEx_Matches(MyString, N'\d{1,}', 1, NULL);

SET STATISTICS TIME OFF

DROP TABLE #t1

Of course, the RegEx match function doesn't return the unmatched strings, however you can limit the strings returned from PatternSplitCM using the [Matched] column. Here are the results.

(1000 row(s) affected)

PatternSplitCM

SQL Server Execution Times:

CPU time = 5102 ms, elapsed time = 5876 ms.

SQL#.RegEx_Matches

SQL Server Execution Times:

CPU time = 28673 ms, elapsed time = 28997 ms.

Even thought PatternSplitCM emerged victorious in the CPU/elapsed time wars, I do concede that the SQL# RegEx match will be much more flexible.

I really like some of the other functions in this library too. Anybody that hasn't downloaded it, should do so now!

Edit: As an afterthought, I realized I may have biased the tests using VARCHAR(8000) so I changed all the strings defined as VARCHAR(8000) to NVARCHAR(4000) [used by the SQL# function] but not the internal variable type of PatternSplitCM. This was the revised result (SQL# was now a little faster but not enough to beat PatternSplitCM, most likely because it is not as generalized):

(1000 row(s) affected)

PatternSplitCM

SQL Server Execution Times:

CPU time = 4009 ms, elapsed time = 4198 ms.

SQL#.RegEx_Matches

SQL Server Execution Times:

CPU time = 16957 ms, elapsed time = 17392 ms.

My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]
My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
Since random numbers are too important to be left to chance, let's generate some![/url]
Learn to understand recursive CTEs by example.[/url]
[url url=http://www.sqlservercentral.com/articles/St