SQL Clone
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


Splitting Strings Based on Patterns


Splitting Strings Based on Patterns

Author
Message
dwain.c
dwain.c
SSC Guru
SSC Guru (79K reputation)SSC Guru (79K reputation)SSC Guru (79K reputation)SSC Guru (79K reputation)SSC Guru (79K reputation)SSC Guru (79K reputation)SSC Guru (79K reputation)SSC Guru (79K reputation)

Group: General Forum Members
Points: 79693 Visits: 6432
SQL Kiwi (11/29/2012)
dwain.c (11/29/2012)
fregatepallada (11/29/2012)
IMHO this task would be better implemented in CLR UDF rather than in pure T-SQL. CLR UDF easily allows to apply REGULAR EXPRESSIONS.

So how about an example?

I'd love to put it through its paces and compare the performance against the pure SQL version.

Hi Dwain,

SSC author Solomon Rutzky has the free SQL# library available at http://www.sqlsharp.com. Among the many features available in the free version running under the SAFE permission set is the RegEx_Replace function. Taking the example from the original thread and the start of this article:

Given an input string such as: 1234ABC123456XYZ1234567890ADS, I would like to replace any string of digits that is longer than 5 characters with some other character (e.g., ‘x’), while leaving the remaining characters in the string unchanged.

The whole solution is:


SELECT
SQL#.RegEx_Replace(N'1234ABC123456XYZ1234567890ADS', N'\d{5,}', N'x', -1, 1, NULL);



Producing the result:

1234ABCxXYZxADS

There are all sorts of powerful and useful functions in this library, for example, if you want to list the matches for a regular expression (which is a much more powerful superset of the CHARINDEX and PATINDEX syntax), you can use:


SELECT * FROM SQL#.RegEx_Matches(N'1234ABC123456XYZ1234567890ADS', N'\d{5,}', 1, NULL);




╔══════════╦════════════╦══════════╦════════╦════════╗
║ MatchNum ║ Value ║ StartPos ║ EndPos ║ Length ║
╠══════════╬════════════╬══════════╬════════╬════════╣
║ 1 ║ 123456 ║ 8 ║ 13 ║ 6 ║
║ 2 ║ 1234567890 ║ 17 ║ 26 ║ 10 ║
╚══════════╩════════════╩══════════╩════════╩════════╝


Finally found some time to play around with the SQL# library, so I thought I'd try a bit of a performance test against PatternSplitCM. So I constructed the below test harness, after verifying that I was giving it the right RegEx match argument and that returned results were identical:


CREATE TABLE #t1 (MyString VARCHAR(8000))
DECLARE @Pattern VARCHAR(500) = '[0-9]'

;WITH Tally (n) AS (
SELECT TOP 1000 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_columns a CROSS JOIN sys.all_columns b)
INSERT INTO #t1
SELECT REPLICATE('abc0123hij456nopq789uvw1y', 1 + ABS(CHECKSUM(NEWID())) % 315)
FROM Tally

DECLARE @MyString VARCHAR(8000), @ItemNumber INT, @Item VARCHAR(8000)

PRINT 'PatternSplitCM'
SET STATISTICS TIME ON
SELECT @MyString=MyString, @ItemNumber=ItemNumber, @Item=Item
--SELECT MyString, ItemNumber, Item, [Matched]
FROM #t1
CROSS APPLY PatternSplitCM(MyString, @Pattern)
WHERE [Matched]=1
SET STATISTICS TIME OFF

PRINT 'SQL#.RegEx_Matches'
SET STATISTICS TIME ON
SELECT @MyString=MyString, @ItemNumber=MatchNum, @Item=Value
--SELECT MyString, MatchNum, Value
FROM #t1
CROSS APPLY SQL#.RegEx_Matches(MyString, N'\d{1,}', 1, NULL);
SET STATISTICS TIME OFF

DROP TABLE #t1




Of course, the RegEx match function doesn't return the unmatched strings, however you can limit the strings returned from PatternSplitCM using the [Matched] column. Here are the results.


(1000 row(s) affected)

PatternSplitCM
SQL Server Execution Times:
CPU time = 5102 ms, elapsed time = 5876 ms.

SQL#.RegEx_Matches
SQL Server Execution Times:
CPU time = 28673 ms, elapsed time = 28997 ms.



Even thought PatternSplitCM emerged victorious in the CPU/elapsed time wars, I do concede that the SQL# RegEx match will be much more flexible.

I really like some of the other functions in this library too. Anybody that hasn't downloaded it, should do so now!

Edit: As an afterthought, I realized I may have biased the tests using VARCHAR(8000) so I changed all the strings defined as VARCHAR(8000) to NVARCHAR(4000) [used by the SQL# function] but not the internal variable type of PatternSplitCM. This was the revised result (SQL# was now a little faster but not enough to beat PatternSplitCM, most likely because it is not as generalized):


(1000 row(s) affected)

PatternSplitCM
SQL Server Execution Times:
CPU time = 4009 ms, elapsed time = 4198 ms.

SQL#.RegEx_Matches
SQL Server Execution Times:
CPU time = 16957 ms, elapsed time = 17392 ms.




My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.


Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
Since random numbers are too important to be left to chance, let's generate some!
Learn to understand recursive CTEs by example.
Splitting strings based on patterns can be fast!
My temporal SQL musings: Calendar Tables, an Easter SQL, Time Slots and Self-maintaining, Contiguous Effective Dates in Temporal Tables
Paul White
Paul White
SSC Guru
SSC Guru (138K reputation)SSC Guru (138K reputation)SSC Guru (138K reputation)SSC Guru (138K reputation)SSC Guru (138K reputation)SSC Guru (138K reputation)SSC Guru (138K reputation)SSC Guru (138K reputation)

Group: General Forum Members
Points: 138341 Visits: 11449
Dwain,

I see rather different performance using your test rig on SQL Server 2012 (build 3321) and SQL Server 2008 R2 (build 4266):

╔════════════╦════════════════╦════════════════════╗
║ SQL Server ║ PatternSplitCM ║ SQL#.RegEx_Matches ║
╠════════════╬════════════════╬════════════════════╣
║ 2012 ║ 24203 ║ 7889 ║
║ 2008 R2 ║ 23019 ║ 24303 ║
╚════════════╩════════════════╩════════════════════╝


Taking the 2008 R2 result first the RegEx is slightly slower, but there is a very important reason for this: the test rig does not produce unique strings. In a typical test, the count of distinct strings generated was 303 (out of 1000 rows total). The optimizer notices this and introduces a 'performance spool' into the query plan for the PatternSplitCM execution only. This optimization introduces a sort on the MyString column above the nested loops join, and a lazy table spool below it. The effect is to group identical MyString occurrences together, so only the first of any set of duplicates is processed by the PatternSplitCM logic. For subsequent rows with the same value of MyString, the spool replays the spooled result.

You can see this in a post-execution plan by inspecting the rewinds and rebinds counter of the table spool. Rewinds occur where the spool replays saved results; a rebind occurs when the correlated parameter (MyString) changes and the PatternSplitCM logic needs to be performed for real. In the test where there were 303 distinct MyString values, the table spool showed 303 rebinds and 697 rewinds. Likewise, all the operators below the table spool show 303 actual executions, not 1000. It is possible to disable the 'performance spool' optimization with trace flag 8690. Doing this forces PatternSplitCM to run on all 1000 input rows, degrading performance from 23019 ms to 40588 ms.

The optimizer does not apply this same optimization to the RegEx execution, because it has no idea how expensive that function is. By default it assumes a small cost and calculates (wrongly) that the performance spool would cost more (due to the sort and spool to tempdb) than it would save by avoiding cheap-looking function executions. We can introduce the 'performance spool' optimization to the RegEx execution with trace flag 8691. Doing this reduces the number of executions of the RegEx function from 1000 to 303, with a corresponding reduction in execution time from 24303 ms to 9978 ms.

The situation with SQL Server 2012 is slightly different because returning rows from a CLR streaming table-valued functions has been significantly optimized in this release. Even without the performance spool, processing all 1000 rows through the RegEx takes only 7520 ms. With a performance spool added using TF 8691, the elapsed time is reduced further to 4886 ms.

A secondary consideration is that I see significant sort spills to tempdb when running the PatternSplitCM solution. This is easy to see with the new warning symbols in SQL Server 2012; in previous releases we need to monitor for sort warnings using Profiler. The underlying reason is the guessed 100-row cardinality from the Top operator, which is used to estimate the sort memory requirement. The strings in this test are quite long, and so require more than 100 numbers from the on-the-fly numbers table.



Paul White
SQLPerformance.com
SQLblog.com
@SQL_Kiwi
dwain.c
dwain.c
SSC Guru
SSC Guru (79K reputation)SSC Guru (79K reputation)SSC Guru (79K reputation)SSC Guru (79K reputation)SSC Guru (79K reputation)SSC Guru (79K reputation)SSC Guru (79K reputation)SSC Guru (79K reputation)

Group: General Forum Members
Points: 79693 Visits: 6432
Paul,

You have aptly demonstrated why you are an MVP and I am but a plebe. :-)

My mistake, using the same string replicated for the test. I knew I shouldn't - have read that at least a couple of times. But I didn't think to apply it here for some reason (too much holiday spirit I suspect).

New test harness (excuse my crude attempt at generating more unique strings, there's probably a better way) and results.


CREATE TABLE #t1 (MyString VARCHAR(8000))
DECLARE @Pattern VARCHAR(500) = '[0-9]'

;WITH Tally (n) AS (
SELECT TOP 1000 ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_columns a CROSS JOIN sys.all_columns b)
INSERT INTO #t1
SELECT REPLICATE(
SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZ', 1 + ABS(CHECKSUM(NEWID())) % 10, 1 + ABS(CHECKSUM(NEWID())) % 16)
+ SUBSTRING('00112233445566778899', 1 + ABS(CHECKSUM(NEWID())) % 10, 1 + ABS(CHECKSUM(NEWID())) % 10)
+ REVERSE(SUBSTRING('ABCDEFGHIJKLMNOPQRSTUVWXYZ', 1 + ABS(CHECKSUM(NEWID())) % 10, 1 + ABS(CHECKSUM(NEWID())) % 16))
+ REVERSE(SUBSTRING('00112233445566778899', 1 + ABS(CHECKSUM(NEWID())) % 10, 1 + ABS(CHECKSUM(NEWID())) % 10))
+ SUBSTRING('JUYWXFZHQAMLKNOPIRSTBVDECG', 1 + ABS(CHECKSUM(NEWID())) % 10, 1 + ABS(CHECKSUM(NEWID())) % 16)
+ SUBSTRING('01012323454567678989', 1 + ABS(CHECKSUM(NEWID())) % 10, 1 + ABS(CHECKSUM(NEWID())) % 10)
+ REVERSE(SUBSTRING('JUYWXFZHQAMLKNOPIRSTBVDECG', 1 + ABS(CHECKSUM(NEWID())) % 10, 1 + ABS(CHECKSUM(NEWID())) % 16))
+ REVERSE(SUBSTRING('01012323454567678989', 1 + ABS(CHECKSUM(NEWID())) % 10, 1 + ABS(CHECKSUM(NEWID())) % 10))
+ SUBSTRING('AAABBBCCCDDDEEEFFFGGGHHHII', 1 + ABS(CHECKSUM(NEWID())) % 10, 1 + ABS(CHECKSUM(NEWID())) % 16)
+ SUBSTRING('00011122233344455566677', 1 + ABS(CHECKSUM(NEWID())) % 10, 1 + ABS(CHECKSUM(NEWID())) % 10)
+ REVERSE(SUBSTRING('AAABBBCCCDDDEEEFFFGGGHHHII', 1 + ABS(CHECKSUM(NEWID())) % 10, 1 + ABS(CHECKSUM(NEWID())) % 16))
+ REVERSE(SUBSTRING('00011122233344455566677', 1 + ABS(CHECKSUM(NEWID())) % 10, 1 + ABS(CHECKSUM(NEWID())) % 10))
, 1 + ABS(CHECKSUM(NEWID())) % 75)
--SELECT REPLICATE('abc0123hij456nopq789uvw1y', 1 + ABS(CHECKSUM(NEWID())) % 315)
FROM Tally

--SELECT * FROM #t1
--SELECT MAX(DATALENGTH(MyString)) FROM #t1

DECLARE @MyString VARCHAR(8000), @ItemNumber INT, @Item VARCHAR(8000)

PRINT 'PatternSplitCM'
SET STATISTICS TIME ON
SELECT @MyString=MyString, @ItemNumber=ItemNumber, @Item=Item
--SELECT MyString, ItemNumber, Item, [Matched]
FROM #t1
CROSS APPLY PatternSplitCM(MyString, @Pattern)
WHERE [Matched]=1
SET STATISTICS TIME OFF

PRINT 'SQL#.RegEx_Matches'
SET STATISTICS TIME ON
SELECT @MyString=MyString, @ItemNumber=MatchNum, @Item=Value
--SELECT MyString, MatchNum, Value
FROM #t1
CROSS APPLY SQL#.RegEx_Matches(MyString, N'\d{1,}', 1, NULL);
SET STATISTICS TIME OFF

DROP TABLE #t1





(1000 row(s) affected)
PatternSplitCM

SQL Server Execution Times:
CPU time = 11343 ms, elapsed time = 6576 ms.
SQL#.RegEx_Matches

SQL Server Execution Times:
CPU time = 5975 ms, elapsed time = 6008 ms.



Unfortunately, I don't have SQL 2012 to confirm your results on. Not that they need confirmation given the source.

I find it diabolically clever that the SQL optimizer is able to identify replicated patterns in the strings when constructing the execution plan. Your analysis of that was not only spot on, but extremely easy to comprehend.

Thanks for taking the time to look into this and correct an inappropriate conclusion.

Happy New Year to you sir!


My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.


Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
Since random numbers are too important to be left to chance, let's generate some!
Learn to understand recursive CTEs by example.
Splitting strings based on patterns can be fast!
My temporal SQL musings: Calendar Tables, an Easter SQL, Time Slots and Self-maintaining, Contiguous Effective Dates in Temporal Tables
fregatepllada
fregatepllada
SSCommitted
SSCommitted (1.5K reputation)SSCommitted (1.5K reputation)SSCommitted (1.5K reputation)SSCommitted (1.5K reputation)SSCommitted (1.5K reputation)SSCommitted (1.5K reputation)SSCommitted (1.5K reputation)SSCommitted (1.5K reputation)

Group: General Forum Members
Points: 1506 Visits: 322
Does anybody try regex and XQUERY?
dwain.c
dwain.c
SSC Guru
SSC Guru (79K reputation)SSC Guru (79K reputation)SSC Guru (79K reputation)SSC Guru (79K reputation)SSC Guru (79K reputation)SSC Guru (79K reputation)SSC Guru (79K reputation)SSC Guru (79K reputation)

Group: General Forum Members
Points: 79693 Visits: 6432
fregatepllada (9/12/2014)
Does anybody try regex and XQUERY?


For this problem I personally have not. The solution is designed for folks that can't use CLR on their servers.

Thanks for taking a look though.


My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.


Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
Since random numbers are too important to be left to chance, let's generate some!
Learn to understand recursive CTEs by example.
Splitting strings based on patterns can be fast!
My temporal SQL musings: Calendar Tables, an Easter SQL, Time Slots and Self-maintaining, Contiguous Effective Dates in Temporal Tables
fregatepllada
fregatepllada
SSCommitted
SSCommitted (1.5K reputation)SSCommitted (1.5K reputation)SSCommitted (1.5K reputation)SSCommitted (1.5K reputation)SSCommitted (1.5K reputation)SSCommitted (1.5K reputation)SSCommitted (1.5K reputation)SSCommitted (1.5K reputation)

Group: General Forum Members
Points: 1506 Visits: 322
Dwain you do not need CLR to use regex Smile
There is an old way to call Vbscript library
From t-SQL code Wink
Jeff Moden
Jeff Moden
SSC Guru
SSC Guru (896K reputation)SSC Guru (896K reputation)SSC Guru (896K reputation)SSC Guru (896K reputation)SSC Guru (896K reputation)SSC Guru (896K reputation)SSC Guru (896K reputation)SSC Guru (896K reputation)

Group: General Forum Members
Points: 896796 Visits: 48286
fregatepllada (9/12/2014)
Dwain you do not need CLR to use regex Smile
There is an old way to call Vbscript library
From t-SQL code Wink


sp_OA???

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

When you put the right degree of spin on it, the number 318 is also a glyph that describes the nature of a DBAs job. Wink

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
Alan Burstein
Alan Burstein
SSC Guru
SSC Guru (55K reputation)SSC Guru (55K reputation)SSC Guru (55K reputation)SSC Guru (55K reputation)SSC Guru (55K reputation)SSC Guru (55K reputation)SSC Guru (55K reputation)SSC Guru (55K reputation)

Group: General Forum Members
Points: 55563 Visits: 9048
Jeff Moden (9/12/2014)
fregatepllada (9/12/2014)
Dwain you do not need CLR to use regex Smile
There is an old way to call Vbscript library
From t-SQL code Wink


sp_OA???


I suspect they were talking about: http://www.sqlservercentral.com/scripts/Miscellaneous/30963/

Either way, I don't think we can get that into an iTVF.

-- Alan Burstein


Helpful links:

Best practices for getting help on SQLServerCentral -- Jeff Moden
How to Post Performance Problems -- Gail Shaw

Nasty fast set-based string manipulation functions:
For splitting strings try DelimitedSplit8K or DelimitedSplit8K_LEAD (SQL Server 2012+)
To split strings based on patterns try PatternSplitCM
Need to clean or transform a string? try NGrams, PatExclude8K, PatReplace8K, DigitsOnlyEE, or Translate8K

I cant stress enough the importance of switching from a sequential files mindset to set-based thinking. After you make the switch, you can spend your time tuning and optimizing your queries instead of maintaining lengthy, poor-performing code. -- Itzik Ben-Gan 2001

fregatepllada
fregatepllada
SSCommitted
SSCommitted (1.5K reputation)SSCommitted (1.5K reputation)SSCommitted (1.5K reputation)SSCommitted (1.5K reputation)SSCommitted (1.5K reputation)SSCommitted (1.5K reputation)SSCommitted (1.5K reputation)SSCommitted (1.5K reputation)

Group: General Forum Members
Points: 1506 Visits: 322
Yes, an antiquated COM way Smile
Gerald Britton
Gerald Britton
SSChampion
SSChampion (12K reputation)SSChampion (12K reputation)SSChampion (12K reputation)SSChampion (12K reputation)SSChampion (12K reputation)SSChampion (12K reputation)SSChampion (12K reputation)SSChampion (12K reputation)

Group: General Forum Members
Points: 12359 Visits: 1828
Just thought I add my minimalist CTE-based tally table generator:


declare @howmany int = 50;
with n1(n) as (select 1 from (values (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) n(n)),
n2(n) as (select 1 from n1, n1 n),
n4(n) as (select 1 from n2, n2 n),
n8(n) as (select 1 from n4, n4 n),
N(n) as (select top(@howmany) ROW_NUMBER() over(order by (select 1)) from n8)
select n from N



Note that I prefer (1),(1),... to (0), (0), ... since the latter makes my eyes go funny.

Also note that order by (select 1) works as well as order by (select null) and is shorter to write (Ok, so I'm lazy!)

Finally, my resulting table is called N to reflect its origin in the set ℕ -- the natural numbers

Gerald Britton, MCSE-DP, MVPToronto PASS Chapter
Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum









































































































































































SQLServerCentral


Search