RE: RegEx for DBAs – SQLServerCentral

SSCoach

Points: 16430

May 13, 2012 at 4:25 pm

Cadavre (5/10/2012)
What I like about this site, what keeps me coming back, is that I can have a discussion with complete strangers that often have far more experience than me (and greater knowledge!), where I disagree with what they've said and instead of getting angry they will discuss the topic. Kudos.

Thank you. I see this as a good discussion and there is rarely, if ever, a reason to get angry about this stuff. It is either a difference of opinion or a miscommunication. And showing frustration and/or anger certainly isn't going to help mitigate those. I see the current issue as being one of communication so I will again try to clarify 😉 (and I also plan on using the explanation/examples below as part of my forth-coming article).

Cadavre (5/10/2012)
Anyway, back to the topic at hand. You're right of course, the WHERE OR combination is simpler (I overlooked it to be honest).

I think the OR structure was slightly faster as it allows for short-circuiting and does not need to process all 6 patterns if any prior ones in the list produce a TRUE.

Cadavre (5/10/2012)
As for the more complex patterns, well if you have anything to hand then I'd love to see some performance results. The fact is that I'd love to use regular expressions instead of building up more complicated T-SQL, it'd make my job much easier. But I'm paid for doing the best I can in the time frame available, which means I can't justify using something that I have never seen outperform T-SQL.
I'm not arguing against using RegEx per se, I just wanted to see some performance results to give me a reason to use it. Don't get me wrong here, I have a RegEx CLR that I wrote on my server. I do tests to see if it's worth using, it's part of my process to explore a few options to every problem. I've just not come across a place where it beats the T-SQL alternative (except for string splitting, and it was so close that the T-SQL alternative was implemented so that we wouldn't need to convince our clients to enable CLR on their servers).

Ok. So the heart of the matter and confusion here (and this goes for Jeff and others here as well) is not in how Regular Expressions work, but what they can REALLY do. As has been said a few times already, the rather simple patterns discussed so far are more for educational purposes than real-world usage. Please (please!) do not get caught up on performance differences between these examples and what can be done in not-so-complicated T-SQL. Nobody is recommending that even moderately complex T-SQL be replaced by Regular Expressions. Regular Expressions are used to go beyond what T-SQL can do and looking at uniform patterns masks much of the power of RegEx.

Regular Expressions really shine when looking for non-uniform patterns. SSNs, Phone Numbers, Postal Codes, etc. all have a somewhat fixed pattern. But when the width, or even existence of segments of patterns is variable, the number of LIKE or PATINDEX can become impossible to do. To put it another way, if you have a pattern that can be tested side-by-side with a T-SQL equivalent then maybe you should just use the T-SQL. Hence, what RegEx should be used for is situations that are not going to be able to be performance tested because there will be no T-SQL equivalent to compare against.

In T-SQL we have the ability to do a single-character range using square brackets: [ and ]. But specifying that range does not allow for situations where that positions doesn't exist OR that single-character spec repeats to a variable degree. It also does not handle very well (or at all) specifying control characters, tab, newlines, unicode, etc. in that range whereas Regular Expressions allow for everything. But again, the variable nature of the pattern is really the power here. To simply say you have a pattern of "1 or more alpha characters (case-sensitive) followed by 1 or more white-spaces characters followed by 5 to 5000 digits followed by a single, optional ^ or % followed by 1 or more word characters (non-case-sensitive)" is represented by the following Regular Expression:

[a-z]+\s+\d{5,5000}[^%]?\w+

Now please represent that using straight T-SQL. Keep in mind that this is a subset of whatever string is being tested. Meaning it can either be the entire string or a fragment anywhere within a larger string. And there might be multiple instance of that pattern within the string.

But we also are not just testing for the existence of the pattern (i.e. IsMatch). We typically want to extract that pattern from the larger string, similar to using SUBSTRING (i.e. Match, Matches, and CaptureGroup).

I have some code below that shows a working example to hopefully give a more visual idea of what Regular Expressions really do for us. The example is very contrived so it might seem a bit silly but it does show a few different things that can be done not only with patterns, but also how Match and CaptureGroup work to go beyond mere existence testing.

The pattern is (emoticons are ": (" without the space between them):

([a-y]+)\d+(?:(?:(test|demo)-\d+)|(real|[abc]+\s{3,}[xyz]+)):\1:SKU{(?:\2|\3)}\1\b

and it means:

One or more alpha a-y characters (saved as Group1)
followed by one or more digits
followed by EITHER the word "test" xor "demo" followed by a dash followed by one or more digits (saved as Group2)
XOR the word "real" xor one or more characters a,b, or c followed by three or more white-space characters
followed by one or more characters x, y, or z (saved as Group3)
followed by the string ":"
followed by whatever was captured as Group1
followed by the string ":SKU{"
followed by whatever was captured as Group2 or Group3
followed by the string "}"
followed by whatever was captured as Group1
followed by a word-boundary (i.e. white-space, end of line, end of string, most punctuation)

If you have SQL# installed you should be able to just copy and paste the code below (emoticons are ": (" without the space between them) to get a sense of what we are really talking about with regards to Regular Expressions.

DECLARE @RegularExpression NVARCHAR(4000)

DECLARE @TestCases TABLE

(

TestID INT NOT NULL IDENTITY(1, 1) PRIMARY KEY,

Test NVARCHAR(4000) NOT NULL,

IsMatchResult BIT

)

SET @RegularExpression =

N'([a-y]+)\d+(?:(?:(test|demo)-\d+)|(real|[abc]+\s{3,}[xyz]+)):\1:SKU{(?:\2|\3)}\1\b'

INSERT INTO @TestCases (Test, IsMatchResult)

SELECT'zzz bob12312test-65:bob:SKU{test}bob' AS [Test], 1 AS [IsMatchResult]

UNION ALL

SELECT'zzz bob12312real:bob:SKU{real}bob', 1

UNION ALL

SELECT'zzz bob12312real:bob:SKU{real}bob-', 1

UNION ALL

SELECT'zzz bob12312real:bob:SKU{real}boby', 0 -- last character negates the word-boundary

UNION ALL

SELECT'intro struffs..bob12312real:bob:SKU{real}bob.extra stuffs at the end', 1 -- intro and extra stuffs excluded from match

UNION ALL

SELECT'zzz bob12312aba z:bob:SKU{aba z}bob', 1

UNION ALL

SELECT'zzz bob12312aba

zyzyzyzyzyzyzyzyzyzyzxxzyzyzyzyzyzyzyzyzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzxx:bob:SKU{aba

zyzyzyzyzyzyzyzyzyzyzxxzyzyzyzyzyzyzyzyzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzxx}bob', 1

UNION ALL

SELECT'zzz bob12312aba z:bob:SKU{aba z}bob', 0 -- only 2 spaces between "aba" and "z" for group 3

UNION ALL

SELECT'zzzbob12312aba z:bob:SKU{aba z}bob', 1 -- still works as "z" is not part of group 1 pattern

UNION ALL

SELECT'yyybob12312aba z:bob:SKU{aba z}bob', 1 -- "y" could be part of group 1 but ignored as "yyybob" doesn't repeat

UNION ALL

SELECT'yyyyybob12312aba z:yyybob:SKU{aba z}yyybob', 1 -- first 3 "y" now included in group 1 as that pattern does repeat

SELECTtc.*,

'---' AS [---],

match.*,

'---' AS [---],

SQL#.RegEx_IsMatch(tc.Test, @RegularExpression, 1, '') AS [IsMatch],

SQL#.RegEx_CaptureGroup(tc.Test, @RegularExpression, 1, '', 1, -1, '') AS [CaptureGroup1],

SQL#.RegEx_CaptureGroup(tc.Test, @RegularExpression, 3, '', 1, -1, '') AS [CaptureGroup3]

FROM @TestCases tc

CROSS APPLY SQL#.RegEx_Match(tc.Test, @RegularExpression, 1, '') match

Assuming you have run the code above and can see what IsMatch, Match, and CaptureGroup are doing as well as what does and does not match that pattern, please represent that in straight T-SQL so we can do a performance comparison. And be sure that your code allows for the pattern to exist multiple times within the larger string and can extract each instance. This is done via the Matches function which I have not shown here but will in my future article.

Hopefully it is a bit clearer now as to why Regular Expressions are such a powerful tool. It is doubtful that you will need to use them frequently, but just knowing about them and what they can do will hopefully help when encountering such situations :-).

Take care,

Solomon...

SQL# — https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
Sql Quantum Lift — https://SqlQuantumLift.com/ ( company )
Sql Quantum Leap — https://SqlQuantumLeap.com/ ( blog )
Info sites — Collations • Module Signing • SQLCLR