Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase «««23456»»

RegEx for DBAs Expand / Collapse
Author
Message
Posted Sunday, May 13, 2012 4:25 PM


Old Hand

Old HandOld HandOld HandOld HandOld HandOld HandOld HandOld Hand

Group: General Forum Members
Last Login: Tuesday, October 28, 2014 3:00 PM
Points: 368, Visits: 1,957
Cadavre (5/10/2012)
What I like about this site, what keeps me coming back, is that I can have a discussion with complete strangers that often have far more experience than me (and greater knowledge!), where I disagree with what they've said and instead of getting angry they will discuss the topic. Kudos.


Thank you. I see this as a good discussion and there is rarely, if ever, a reason to get angry about this stuff. It is either a difference of opinion or a miscommunication. And showing frustration and/or anger certainly isn't going to help mitigate those. I see the current issue as being one of communication so I will again try to clarify (and I also plan on using the explanation/examples below as part of my forth-coming article).

Cadavre (5/10/2012)
Anyway, back to the topic at hand. You're right of course, the WHERE OR combination is simpler (I overlooked it to be honest).


I think the OR structure was slightly faster as it allows for short-circuiting and does not need to process all 6 patterns if any prior ones in the list produce a TRUE.

Cadavre (5/10/2012)
As for the more complex patterns, well if you have anything to hand then I'd love to see some performance results. The fact is that I'd love to use regular expressions instead of building up more complicated T-SQL, it'd make my job much easier. But I'm paid for doing the best I can in the time frame available, which means I can't justify using something that I have never seen outperform T-SQL.

I'm not arguing against using RegEx per se, I just wanted to see some performance results to give me a reason to use it. Don't get me wrong here, I have a RegEx CLR that I wrote on my server. I do tests to see if it's worth using, it's part of my process to explore a few options to every problem. I've just not come across a place where it beats the T-SQL alternative (except for string splitting, and it was so close that the T-SQL alternative was implemented so that we wouldn't need to convince our clients to enable CLR on their servers).


Ok. So the heart of the matter and confusion here (and this goes for Jeff and others here as well) is not in how Regular Expressions work, but what they can REALLY do. As has been said a few times already, the rather simple patterns discussed so far are more for educational purposes than real-world usage. Please (please!) do not get caught up on performance differences between these examples and what can be done in not-so-complicated T-SQL. Nobody is recommending that even moderately complex T-SQL be replaced by Regular Expressions. Regular Expressions are used to go beyond what T-SQL can do and looking at uniform patterns masks much of the power of RegEx.

Regular Expressions really shine when looking for non-uniform patterns. SSNs, Phone Numbers, Postal Codes, etc. all have a somewhat fixed pattern. But when the width, or even existence of segments of patterns is variable, the number of LIKE or PATINDEX can become impossible to do. To put it another way, if you have a pattern that can be tested side-by-side with a T-SQL equivalent then maybe you should just use the T-SQL. Hence, what RegEx should be used for is situations that are not going to be able to be performance tested because there will be no T-SQL equivalent to compare against.

In T-SQL we have the ability to do a single-character range using square brackets: [ and ]. But specifying that range does not allow for situations where that positions doesn't exist OR that single-character spec repeats to a variable degree. It also does not handle very well (or at all) specifying control characters, tab, newlines, unicode, etc. in that range whereas Regular Expressions allow for everything. But again, the variable nature of the pattern is really the power here. To simply say you have a pattern of "1 or more alpha characters (case-sensitive) followed by 1 or more white-spaces characters followed by 5 to 5000 digits followed by a single, optional ^ or % followed by 1 or more word characters (non-case-sensitive)" is represented by the following Regular Expression:

[a-z]+\s+\d{5,5000}[^%]?\w+

Now please represent that using straight T-SQL. Keep in mind that this is a subset of whatever string is being tested. Meaning it can either be the entire string or a fragment anywhere within a larger string. And there might be multiple instance of that pattern within the string.

But we also are not just testing for the existence of the pattern (i.e. IsMatch). We typically want to extract that pattern from the larger string, similar to using SUBSTRING (i.e. Match, Matches, and CaptureGroup).

I have some code below that shows a working example to hopefully give a more visual idea of what Regular Expressions really do for us. The example is very contrived so it might seem a bit silly but it does show a few different things that can be done not only with patterns, but also how Match and CaptureGroup work to go beyond mere existence testing.

The pattern is (emoticons are ": (" without the space between them):

([a-y]+)\d+(??test|demo)-\d+)|(real|[abc]+\s{3,}[xyz]+)):\1:SKU{(?:\2|\3)}\1\b

and it means:


  • One or more alpha a-y characters (saved as Group1)

  • followed by one or more digits

  • followed by EITHER the word "test" xor "demo" followed by a dash followed by one or more digits (saved as Group2)
    XOR the word "real" xor one or more characters a,b, or c followed by three or more white-space characters
    followed by one or more characters x, y, or z (saved as Group3)

  • followed by the string ":"

  • followed by whatever was captured as Group1

  • followed by the string ":SKU{"

  • followed by whatever was captured as Group2 or Group3

  • followed by the string "}"

  • followed by whatever was captured as Group1

  • followed by a word-boundary (i.e. white-space, end of line, end of string, most punctuation)




If you have SQL# installed you should be able to just copy and paste the code below (emoticons are ": (" without the space between them) to get a sense of what we are really talking about with regards to Regular Expressions.

DECLARE @RegularExpression NVARCHAR(4000)
DECLARE @TestCases TABLE
(
TestID INT NOT NULL IDENTITY(1, 1) PRIMARY KEY,
Test NVARCHAR(4000) NOT NULL,
IsMatchResult BIT
)

SET @RegularExpression =
N'([a-y]+)\d+(??test|demo)-\d+)|(real|[abc]+\s{3,}[xyz]+)):\1:SKU{(?:\2|\3)}\1\b'

INSERT INTO @TestCases (Test, IsMatchResult)
SELECT 'zzz bob12312test-65:bob:SKU{test}bob' AS [Test], 1 AS [IsMatchResult]
UNION ALL
SELECT 'zzz bob12312real:bob:SKU{real}bob', 1
UNION ALL
SELECT 'zzz bob12312real:bob:SKU{real}bob-', 1
UNION ALL
SELECT 'zzz bob12312real:bob:SKU{real}boby', 0 -- last character negates the word-boundary
UNION ALL
SELECT 'intro struffs..bob12312real:bob:SKU{real}bob.extra stuffs at the end', 1 -- intro and extra stuffs excluded from match
UNION ALL
SELECT 'zzz bob12312aba z:bob:SKU{aba z}bob', 1
UNION ALL
SELECT 'zzz bob12312aba

zyzyzyzyzyzyzyzyzyzyzxxzyzyzyzyzyzyzyzyzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzxx:bob:SKU{aba

zyzyzyzyzyzyzyzyzyzyzxxzyzyzyzyzyzyzyzyzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzzxx}bob', 1
UNION ALL
SELECT 'zzz bob12312aba z:bob:SKU{aba z}bob', 0 -- only 2 spaces between "aba" and "z" for group 3
UNION ALL
SELECT 'zzzbob12312aba z:bob:SKU{aba z}bob', 1 -- still works as "z" is not part of group 1 pattern
UNION ALL
SELECT 'yyybob12312aba z:bob:SKU{aba z}bob', 1 -- "y" could be part of group 1 but ignored as "yyybob" doesn't repeat
UNION ALL
SELECT 'yyyyybob12312aba z:yyybob:SKU{aba z}yyybob', 1 -- first 3 "y" now included in group 1 as that pattern does repeat


SELECT tc.*,
'---' AS [---],
match.*,
'---' AS [---],
SQL#.RegEx_IsMatch(tc.Test, @RegularExpression, 1, '') AS [IsMatch],
SQL#.RegEx_CaptureGroup(tc.Test, @RegularExpression, 1, '', 1, -1, '') AS [CaptureGroup1],
SQL#.RegEx_CaptureGroup(tc.Test, @RegularExpression, 3, '', 1, -1, '') AS [CaptureGroup3]
FROM @TestCases tc
CROSS APPLY SQL#.RegEx_Match(tc.Test, @RegularExpression, 1, '') match

Assuming you have run the code above and can see what IsMatch, Match, and CaptureGroup are doing as well as what does and does not match that pattern, please represent that in straight T-SQL so we can do a performance comparison. And be sure that your code allows for the pattern to exist multiple times within the larger string and can extract each instance. This is done via the Matches function which I have not shown here but will in my future article.

Hopefully it is a bit clearer now as to why Regular Expressions are such a powerful tool. It is doubtful that you will need to use them frequently, but just knowing about them and what they can do will hopefully help when encountering such situations .


Take care,
Solomon...





SQL# - http://www.SQLsharp.com/
Post #1299300
Posted Sunday, May 13, 2012 6:09 PM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: Today @ 12:23 PM
Points: 35,403, Visits: 31,965
I very much appreciate the time you've spent on that great explanation, Solomon. And I absolutely believe that you've hit the nail on the head when you said...

...if you have a pattern that can be tested side-by-side with a T-SQL equivalent then maybe you should just use the T-SQL. Hence, what RegEx should be used for is situations that are not going to be able to be performance tested because there will be no T-SQL equivalent to compare against.


That's the whole crux of the matter and the very much needed point that people tend to leave out when they write articles about SQL Server and Regex. Without mentioning that point, there's going to be a lot of inexperienced folks that use Regex for things that they really should be using T-SQL for. I might go so far as to say that if youre trying to find items like the example you gave, then the data probably shouldn't be allowed anywhere near SQL Server until it's been cleaned up and sorted out a bit. It's kind of like people trying to use Full Text Search to do the things that you really need a more dedicated product for.


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1299311
Posted Monday, May 14, 2012 3:16 PM
SSCrazy

SSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazy

Group: General Forum Members
Last Login: 2 days ago @ 2:31 AM
Points: 2,909, Visits: 1,837
Jeff Moden (5/13/2012)
I might go so far as to say that if youre trying to find items like the example you gave, then the data probably shouldn't be allowed anywhere near SQL Server.


...and then there is "Big Data".


LinkedIn Profile
Newbie on www.simple-talk.com
Post #1299909
Posted Monday, May 14, 2012 4:02 PM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: Today @ 12:23 PM
Points: 35,403, Visits: 31,965
David.Poole (5/14/2012)
Jeff Moden (5/13/2012)
I might go so far as to say that if youre trying to find items like the example you gave, then the data probably shouldn't be allowed anywhere near SQL Server.


...and then there is "Big Data".


Not sure what you mean by that, David but "Big Data" is nothing more than little data on steroids.


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1299927
Posted Tuesday, May 15, 2012 7:27 AM


SSCrazy

SSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazy

Group: General Forum Members
Last Login: Saturday, October 25, 2014 11:42 AM
Points: 2,380, Visits: 7,591
Thanks for the explanation Solomon, I haven't forgotten but I'm quite busy at the moment. I intend to attempt to convert your more complicated RegEx into T-SQL, but I suspect that the amount of manipulation required may be too much for a particularly performant solution.


Not a DBA, just trying to learn

For better, quicker answers on T-SQL questions, click on the following...
http://www.sqlservercentral.com/articles/Best+Practices/61537/

For better, quicker answers on SQL Server performance related questions, click on the following...
http://www.sqlservercentral.com/articles/SQLServerCentral/66909/



If you litter your database queries with nolock query hints, are you aware of the side effects?
Try reading a few of these links...

(*) Missing rows with nolock
(*) Allocation order scans with nolock
(*) Consistency issues with nolock
(*) Transient Corruption Errors in SQL Server error log caused by nolock
(*) Dirty reads, read errors, reading rows twice and missing rows with nolock


LinkedIn | Blog coming soon (for sufficiently large values of "soon" )!
Post #1300205
Posted Wednesday, May 16, 2012 2:08 PM
SSCrazy

SSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazy

Group: General Forum Members
Last Login: 2 days ago @ 2:31 AM
Points: 2,909, Visits: 1,837
You are half right Jeff,

We'll leave out the vagueness of the term "Big Data" which is more marketing than useful. I think Buck Woody has the closest to a useful definition http://blogs.msdn.com/b/buckwoody/archive/2011/10/18/big-data-and-the-cloud-more-hype-or-a-real-workload.aspx

Volume and velocity of data are the bits that will be most familiar to DBAs. Variety and variability are the bits where we start pushing the envelope.
Microsoft are making a big play around SQOOP allowing SQL2012 to play nicely with Hadoop and I think DBAs in the DW space need to start looking at Hadoop.

For me the most interesting aspect of "Big Data" is solving the problems of mining useful information out of the "unstructured" stuff. For example, mining bulleting boards and forums to derive useful and leveragable information by means of an automated process. To be able to do that sort of stuff you need "Big Brains" and this is where the discipline of "Data Scientist" comes in.

Text parsing and term extraction form part of it and this is where RegEx becomes useful. You and others are absolutely right to point out that it has a big performance penalty and I can't imagine anyone sane using it in the data tier for OLTP. In a data warehouse, and in particular the staging area of a data warehouse it is a very powerful weapon.

If you are worried about performance then you'll blow a gasket when you see what a data profiler does to a database server. Horrible it may be but entirely necessary in the DW space.

Steve Jones posted an editorial a while back on the evolution and possible future evolution of SQL Server that used the analogy of Grandpa's axe. It's had 4 handles and 5 heads but it is still Grandpa's axe. Similarly SQL Server started out as a RDBMS but is now touching all sorts of things outside of the RDBMS space but is still SQL Server.


LinkedIn Profile
Newbie on www.simple-talk.com
Post #1301343
Posted Thursday, May 17, 2012 4:20 AM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: Today @ 12:23 PM
Points: 35,403, Visits: 31,965
I used to work in the "E Discovery" business where millions of documents where compared and screened for similarities, key words, and other patterns. We had tools such as Equivio that would do all of that and tools such a Lucene for certain full text searches and other pattern recognition. Those tools may have used Regex but it seems to may that with the speed they actually had, they probably used something a whole lot more directed and effecient and that makes a whole lot of sense because code to do something specific is almost always a whole lot faster than generic code.

I appreciate the thoughts you folks have written down but, except for possible ad hoc one off searches where performance doesn't really seem to matter to some folks, I'm not seeing much that I like in Regex. I'll keep an open mind about certain aspects of it but I'm not seeing it as a valuable large volume processing tool.

Heh... and you might be able to cut yourself or the guy standing behind you when you swing a 4 handled, 5 headed axe and it'll look really cool on the shelf, but you're not going to cut much wood with it.


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1301618
Posted Wednesday, December 26, 2012 12:30 AM


Hall of Fame

Hall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of FameHall of Fame

Group: General Forum Members
Last Login: Today @ 3:02 AM
Points: 3,421, Visits: 5,361
Gentlemen... so sorry I'm late to the party!

Paul White recently turned me on to SQL# and I just happened to do a comparison of the SQL# RegEx string splitter against a pattern splitting function in the article I authored (fourth link in my signature line).

You can see those performance results here: http://www.sqlservercentral.com/Forums/Topic1390297-3122-5.aspx

Just hoping to make a very minor contribution to this most interesting article and thread. Thanks to a good friend (you know who you are) for tuning me in to this.



My mantra: No loops! No CURSORs! No RBAR! Hoo-uh!

My thought question: Have you ever been told that your query runs too fast?

My advice:
INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.


Need to UNPIVOT? Why not CROSS APPLY VALUES instead?
Since random numbers are too important to be left to chance, let's generate some!
Learn to understand recursive CTEs by example.
Splitting strings based on patterns can be fast!
Post #1400164
Posted Thursday, September 4, 2014 11:12 PM
SSC Rookie

SSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC RookieSSC Rookie

Group: General Forum Members
Last Login: Thursday, October 2, 2014 5:17 PM
Points: 35, Visits: 172
OLD School (compliments of Stack Overflow forum) for less fortunate users who do not allow to have a CLR-enabled SQL Server:
http://stackoverflow.com/questions/15523789/sql-email-verification-function-using-regex
Solution is based on legacy COM VBScript library, but still works.

My $0.02 - I personally would use Regex expression for data quality check, in production just to validate a scalar values.
Post #1610738
Posted Friday, September 5, 2014 7:01 AM


SSC Eights!

SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!SSC Eights!

Group: General Forum Members
Last Login: 2 days ago @ 12:11 PM
Points: 822, Visits: 1,470
Nice article and nice experiment, but unfortunately from security perspective I am not able to even give the needed SQL# a try simply because SQL# has lots highly restricted features like CLR, Assembly, ...

But good work!
Post #1610880
« Prev Topic | Next Topic »

Add to briefcase «««23456»»

Permissions Expand / Collapse