Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase «««23456»»»

RegEx for DBAs Expand / Collapse
Author
Message
Posted Wednesday, May 9, 2012 4:07 PM


Old Hand

Old HandOld HandOld HandOld HandOld HandOld HandOld HandOld Hand

Group: General Forum Members
Last Login: Tuesday, October 28, 2014 3:00 PM
Points: 368, Visits: 1,957
Cadavre (5/9/2012)
Thanks for the reply, although I think I need more convincing. I don't see it as being a "win" for RegEx due to it being more complicated to do in T-SQL. I'd always rather have a fast method than a slow method, complications can be well documented and everything goes through rigorous testing to ensure bugs are found.


Thank you for doing that test (really). I simply don't have the time these days and it is good to see solid numbers. For this particular example I do agree that the T-SQL equivalent is not unreadable and in fact I figured it would be an even simpler form along the lines of:

WHERE (postCode LIKE '[A-Z][A-Z0-9] [0-9][A-Z][A-Z]'
OR postCode LIKE '[A-Z][A-Z0-9][0-9] [0-9][A-Z][A-Z]'
OR postCode LIKE '[A-Z][A-Z][0-9] [0-9][A-Z][A-Z]'
OR postCode LIKE '[A-Z][A-Z][0-9][0-9] [0-9][A-Z][A-Z]'
OR postCode LIKE '[A-Z][0-9][A-Z] [0-9][A-Z][A-Z]'
OR postCode LIKE '[A-Z][A-Z][0-9][A-Z] [0-9][A-Z][A-Z]')

But again, to be fair to the intent of David's article, this was an intro to Regular Expressions for people and not a shining example of the full power of super-complex patterns (Corey Lawson, in the post following your post that I am replying to, alluded to some of the more complex operations). The PostalCode example is just that: something that is easily digestible for people to comprehend what the syntax is doing. For the most part, I did kinda expect that the pure T-SQL condition would perform better, even if not as pretty.

And again, I don't think David is trying to convince anyone that they simply must use Regular Expressions. It is a matter of exposing people to a powerful and flexible tool that does come in handy sometimes. There are definitely some situations at work, typically ad-hoc debugging (i.e. functionality outweighs performance), where I would have GREATLY benefited from some RegEx functions, especially where the patterns in the data are not uniform like they are here. I am working on a related article (albeit very slowly due to being unable to teach the kids how to change their own diapers ) that goes over various situations that I, and some others, have encountered where RegEx proved indispensable. Maybe when I get farther along with that I can have you and Jeff try to come up with pure T-SQL equivalents. Might be kinda fun .

Take care,
Solomon...





SQL# - http://www.SQLsharp.com/
Post #1297491
Posted Wednesday, May 9, 2012 7:05 PM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: Yesterday @ 10:15 PM
Points: 35,399, Visits: 31,959
corey lawson (5/9/2012)
We should all try not to be too dogmatic about things. Even CURSORs and cross-joins have their uses, after all (and not everything tastes better with tabasco sauce on it).


Who's being dogmatic here? I saw an article that was praising some RegEx SQLLRs and thought that might be a good thing but also saw nothing of comparative performance tests.

As many others have stated, there are many things to consider when considering which tools to use and I know that very well because I've been burned by the best of them! Instead of getting lectured (yet again, I may add!) on that fact that there are indeed other things to consider other than performance, can't you Regex folks just meet my simple request to do some broad spectrum million row performance tests so people don't get burned?

It's what any self respecting vendor, author, or endorser would do.


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1297528
Posted Wednesday, May 9, 2012 7:28 PM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: Yesterday @ 10:15 PM
Points: 35,399, Visits: 31,959
sherifffruitfly (5/8/2012)
As for the pissing match - meh. There's times to use stuff and times not to. Never been a fan of the religious fanatics on either the ALWAYSUSEIT and NEVERUSEIT sides. Both sides seem to me to substitute easy-to-remember rules in place of actually thinking about the needs of the current task.

-sff


You're totally out of line. There's no pissing match when something gets its butt kicked 45 to 1. You don't need to use inflamatory terms like "religious fanatics" when good people have found an extreme performance problem. And if you're so proud of Expresso, you might want to provide a couple of your own performance tests.


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1297534
Posted Wednesday, May 9, 2012 7:56 PM


SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: 2 days ago @ 6:05 PM
Points: 117, Visits: 582
Jeff Moden (5/9/2012)
sherifffruitfly (5/8/2012)
As for the pissing match - meh. There's times to use stuff and times not to. Never been a fan of the religious fanatics on either the ALWAYSUSEIT and NEVERUSEIT sides. Both sides seem to me to substitute easy-to-remember rules in place of actually thinking about the needs of the current task.

-sff


You're totally out of line. There's no pissing match when something gets its butt kicked 45 to 1. You don't need to use inflamatory terms like "religious fanatics" when good people have found an extreme performance problem. And if you're so proud of Expresso, you might want to provide a couple of your own performance tests.

Thanks, Jeff. I was unsure what relevance the post had to a suggestion for a simple RegEx testing tool...

The next bit isn't in any way argumentative, just illustrative...

I'd add that the RegEx I use is on low row-count but high string-count-per-row data - data that's completely isolated to our custom app. Nasty stuff like pulling any and all of four possible custom operators and their parameters from a string pulled from XML built by the UI. Yuk but it works very well.

e.g.

using '\[\d+[\:\-*\d+]+\]' to shred:
WHERE site_id = 81 AND ([3154:2]-[3154:1]=0 AND [3154:3]-[3154:2]=0 AND [3154:4]-[3154:3]=0 AND [3154:5]-[3154:4]=0 AND [3154:6]-[3154:5]=0 AND [3154:7]-[3154:6]=0 AND [3154:8]-[3154:7]=0 AND [3154:9]-[3154:8]=0)

or 'DailySum\(\[\d+[\:\-*\d+]+\]\)' and '\[\d+[\:\-*\d+]+\]' to shred:
WHERE site_id = 29 AND (DailySum([1116]) > 0 or DailySum([1048]) > 0 and [1117] > 0 or DailySum([1027]) > 0 or DailySum([1026]) > 0 or DailySum([1126]) > 0 or DailySum([1141]) > 0 or DailySum([1088]) > 0 or DailySum([1090]) > 0 and DailySum([1092]) > 0 or DailySum([1128]) > 0)

Can't quite bring myself to make it go faster with custom T-SQL for the sake of a few updates per day.

Another great "it depends" article and series of comments! (EXCEPT FOR THE MORONS POLLUTING THE SITE WITH THEIR CRASS ATTEMPTS AT ATTRACTING BUSINESS - THEIR ONLY SUCCESS: ATTRACTING VILIFICATION!)
Post #1297545
Posted Wednesday, May 9, 2012 8:07 PM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: Yesterday @ 10:15 PM
Points: 35,399, Visits: 31,959
Solomon Rutzky (5/9/2012)
And again, I don't think David is trying to convince anyone that they simply must use Regular Expressions. It is a matter of exposing people to a powerful and flexible tool that does come in handy sometimes.


I couldn't agree more! I just wanted to know what the performance curves on these "powerful and flexible" tools actually are because I've been burned by people that brought such tools into the shop.

Although performance certainly isn't the only consideration in the proper evaluation of any tool, it becomes a major negative consideration when the sophisticated tool gets beat 45 to 1 by the simple and much less elegant native tool. Would YOU knowingly use a tool for a simple task if you knew it was going to be 45 times slower than another method? I would hope not.

The other part of the problem is that if that tool is so slow on such a simple task, people really want to know what it's going to do on a more complex task where that "Powerful and Flexible tool" is supposed to really show it's stuff. Because there isn't much performance testing available on these tools (never mind any comparative testing), this thread looks like we've been introduced to a tool that's guaranteed to be 45 times slower according to the bit of testing done on this thread. That's a LOT slower in anyone's book and all other beneficial claims will not withstand that major shortcoming in most people's eyes.


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1297548
Posted Wednesday, May 9, 2012 8:21 PM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: Yesterday @ 10:15 PM
Points: 35,399, Visits: 31,959
SAinCA (5/9/2012)
Jeff Moden (5/9/2012)
sherifffruitfly (5/8/2012)
As for the pissing match - meh. There's times to use stuff and times not to. Never been a fan of the religious fanatics on either the ALWAYSUSEIT and NEVERUSEIT sides. Both sides seem to me to substitute easy-to-remember rules in place of actually thinking about the needs of the current task.

-sff


You're totally out of line. There's no pissing match when something gets its butt kicked 45 to 1. You don't need to use inflamatory terms like "religious fanatics" when good people have found an extreme performance problem. And if you're so proud of Expresso, you might want to provide a couple of your own performance tests.

Thanks, Jeff. I was unsure what relevance the post had to a suggestion for a simple RegEx testing tool...

The next bit isn't in any way argumentative, just illustrative...

I'd add that the RegEx I use is on low row-count but high string-count-per-row data - data that's completely isolated to our custom app. Nasty stuff like pulling any and all of four possible custom operators and their parameters from a string pulled from XML built by the UI. Yuk but it works very well.

e.g.

using '\[\d+[\:\-*\d+]+\]' to shred:
WHERE site_id = 81 AND ([3154:2]-[3154:1]=0 AND [3154:3]-[3154:2]=0 AND [3154:4]-[3154:3]=0 AND [3154:5]-[3154:4]=0 AND [3154:6]-[3154:5]=0 AND [3154:7]-[3154:6]=0 AND [3154:8]-[3154:7]=0 AND [3154:9]-[3154:8]=0)

or 'DailySum\(\[\d+[\:\-*\d+]+\]\)' and '\[\d+[\:\-*\d+]+\]' to shred:
WHERE site_id = 29 AND (DailySum([1116]) > 0 or DailySum([1048]) > 0 and [1117] > 0 or DailySum([1027]) > 0 or DailySum([1026]) > 0 or DailySum([1126]) > 0 or DailySum([1141]) > 0 or DailySum([1088]) > 0 or DailySum([1090]) > 0 and DailySum([1092]) > 0 or DailySum([1128]) > 0)

Can't quite bring myself to make it go faster with custom T-SQL for the sake of a few updates per day.

Another great "it depends" article and series of comments! (EXCEPT FOR THE MORONS POLLUTING THE SITE WITH THEIR CRASS ATTEMPTS AT ATTRACTING BUSINESS - THEIR ONLY SUCCESS: ATTRACTING VILIFICATION!)


Gosh. Admittedly, I've spent no time learning Regex (just haven't needed it yet, I guess) and can only guess at what such parsing formulas do. That's why I was so happy that Craig posted some simple tests that I actually understand. I am definitely with you on not working on something for the sake of a few updates per day. There are much bigger fish to fry. But, as Solomon suggested, if I knew what these did, it might be fun to try to figure out a T-SQL way to do it.


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1297549
Posted Thursday, May 10, 2012 1:51 AM
SSCrazy

SSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazy

Group: General Forum Members
Last Login: 2 days ago @ 2:31 AM
Points: 2,909, Visits: 1,837
Guys, the reason I wrote the article was not to say RegEx is king or you must use it. I've got away with not using it for a very long time.

I wrote the article with two intents
1. To show a basic primer to RegEx
2. To open up the possibility of it use as a weapon in the DBA arsenal.

The examples I have given are VERY simple. This is deliberate to illustrate the basics of RegEx syntax.

The performance tests posted by Cadavre
are interesting. If I am reading them correctly they are saying the RegEx CPU usage is more intensive but that reads and scans are significantly lower.

RegEx CPU = 52625 Reads = 2964
T-SQL CPU = 22672 Reads = 11856

After indexing the T-SQL reads drop to 3252

If you are CPU bound then be very careful using RegEx.
If you are IO bound then perhaps its got legs.
If you have a tight execution window then a longer running process isn't good news.

It is all about understanding the constraints that are applicable to your environment.

When I use RegEx it tends to be for RegEx.Replace().
I use normal T-SQL in the WHERE clause to get to close to the recordset that should be affected.

As a DBA the natural inclination is to tune everything to its limits. Great if you have the time and resource but sometimes you have to take the triage approach.

The puzzles on the BeyondRealtional site illustrate what is possible from T-SQL but a lot of it falls into the realms of expert T-SQL. Judging by the T-SQL I see day in day out most developers and DBAs would struggle to produce some of that code. Elegant and efficient it may be but that is because it is produced by master craftsman. I've done a lot of interviewing and there ain't many of those about.



LinkedIn Profile
Newbie on www.simple-talk.com
Post #1297671
Posted Thursday, May 10, 2012 1:53 AM


SSCrazy

SSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazy

Group: General Forum Members
Last Login: Saturday, October 25, 2014 11:42 AM
Points: 2,380, Visits: 7,591
Solomon Rutzky (5/9/2012)
Thank you for doing that test (really). I simply don't have the time these days and it is good to see solid numbers. For this particular example I do agree that the T-SQL equivalent is not unreadable and in fact I figured it would be an even simpler form along the lines of:

WHERE (postCode LIKE '[A-Z][A-Z0-9] [0-9][A-Z][A-Z]'
OR postCode LIKE '[A-Z][A-Z0-9][0-9] [0-9][A-Z][A-Z]'
OR postCode LIKE '[A-Z][A-Z][0-9] [0-9][A-Z][A-Z]'
OR postCode LIKE '[A-Z][A-Z][0-9][0-9] [0-9][A-Z][A-Z]'
OR postCode LIKE '[A-Z][0-9][A-Z] [0-9][A-Z][A-Z]'
OR postCode LIKE '[A-Z][A-Z][0-9][A-Z] [0-9][A-Z][A-Z]')

But again, to be fair to the intent of David's article, this was an intro to Regular Expressions for people and not a shining example of the full power of super-complex patterns (Corey Lawson, in the post following your post that I am replying to, alluded to some of the more complex operations). The PostalCode example is just that: something that is easily digestible for people to comprehend what the syntax is doing. For the most part, I did kinda expect that the pure T-SQL condition would perform better, even if not as pretty.

And again, I don't think David is trying to convince anyone that they simply must use Regular Expressions. It is a matter of exposing people to a powerful and flexible tool that does come in handy sometimes. There are definitely some situations at work, typically ad-hoc debugging (i.e. functionality outweighs performance), where I would have GREATLY benefited from some RegEx functions, especially where the patterns in the data are not uniform like they are here. I am working on a related article (albeit very slowly due to being unable to teach the kids how to change their own diapers ) that goes over various situations that I, and some others, have encountered where RegEx proved indispensable. Maybe when I get farther along with that I can have you and Jeff try to come up with pure T-SQL equivalents. Might be kinda fun .

Take care,
Solomon...


What I like about this site, what keeps me coming back, is that I can have a discussion with complete strangers that often have far more experience than me (and greater knowledge!), where I disagree with what they've said and instead of getting angry they will discuss the topic. Kudos.

Anyway, back to the topic at hand. You're right of course, the WHERE OR combination is simpler (I overlooked it to be honest).

Here's some performance results: -
SET STATISTICS IO ON;
SET STATISTICS TIME ON;
SELECT COUNT(*)
FROM dbo.testEnvironment
WHERE (postCode LIKE '[A-Z][A-Z0-9] [0-9][A-Z][A-Z]'
OR postCode LIKE '[A-Z][A-Z0-9][0-9] [0-9][A-Z][A-Z]'
OR postCode LIKE '[A-Z][A-Z][0-9] [0-9][A-Z][A-Z]'
OR postCode LIKE '[A-Z][A-Z][0-9][0-9] [0-9][A-Z][A-Z]'
OR postCode LIKE '[A-Z][0-9][A-Z] [0-9][A-Z][A-Z]'
OR postCode LIKE '[A-Z][A-Z][0-9][A-Z] [0-9][A-Z][A-Z]')
SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;

No index
-----------
530604

Table 'testEnvironment'. Scan count 1, logical reads 3212, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 1422 ms, elapsed time = 1413 ms.


With index
-----------
530604

Table 'testEnvironment'. Scan count 1, logical reads 2606, physical reads 0, read-ahead reads 16, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 1453 ms, elapsed time = 1447 ms.


As for the more complex patterns, well if you have anything to hand then I'd love to see some performance results. The fact is that I'd love to use regular expressions instead of building up more complicated T-SQL, it'd make my job much easier. But I'm paid for doing the best I can in the time frame available, which means I can't justify using something that I have never seen outperform T-SQL.


--edit--


David.Poole (5/10/2012)
Guys, the reason I wrote the article was not to say RegEx is king or you must use it. I've got away with not using it for a very long time.

I wrote the article with two intents
1. To show a basic primer to RegEx
2. To open up the possibility of it use as a weapon in the DBA arsenal.

The examples I have given are VERY simple. This is deliberate to illustrate the basics of RegEx syntax.


Sorry, you must have posted while I was typing.

I'm not arguing against using RegEx per se, I just wanted to see some performance results to give me a reason to use it. Don't get me wrong here, I have a RegEx CLR that I wrote on my server. I do tests to see if it's worth using, it's part of my process to explore a few options to every problem. I've just not come across a place where it beats the T-SQL alternative (except for string splitting, and it was so close that the T-SQL alternative was implemented so that we wouldn't need to convince our clients to enable CLR on their servers).

David.Poole (5/10/2012)
If I am reading them correctly they are saying the RegEx CPU usage is more intensive but that reads and scans are significantly lower.


Yes, I read the results the same. So in an environment where CPU is king and memory is limited, it's probably better to use a CLR. Perhaps I'll set up a VM to simulate such an environment at a later date to test that theory.

David.Poole (5/10/2012)
When I use RegEx it tends to be for RegEx.Replace().
I use normal T-SQL in the WHERE clause to get to close to the recordset that should be affected.


I'll give this sort of task a go at some point, but I suspect that the T-SQL will have a very good chance of keeping up with the RegEx.

Shall we take an example of an address string where we want to extract the post-code? Or do you have something else in mind? I'll set up a test for this scenario later on and we'll take a look.

David.Poole (5/10/2012)
The puzzles on the BeyondRealtional site illustrate what is possible from T-SQL but a lot of it falls into the realms of expert T-SQL. Judging by the T-SQL I see day in day out most developers and DBAs would struggle to produce some of that code. Elegant and efficient it may be but that is because it is produced by master craftsman. I've done a lot of interviewing and there ain't many of those about.


I've been a developer for about 3 years now. I'm competent, but certainly not a master (or even a journeyman). I've had days where "right now" was more important that elegance and efficiency (not in my current job, where performance is king), but even in those cases it'd just mean that I'd go back to it later when time was more on my side.



Not a DBA, just trying to learn

For better, quicker answers on T-SQL questions, click on the following...
http://www.sqlservercentral.com/articles/Best+Practices/61537/

For better, quicker answers on SQL Server performance related questions, click on the following...
http://www.sqlservercentral.com/articles/SQLServerCentral/66909/



If you litter your database queries with nolock query hints, are you aware of the side effects?
Try reading a few of these links...

(*) Missing rows with nolock
(*) Allocation order scans with nolock
(*) Consistency issues with nolock
(*) Transient Corruption Errors in SQL Server error log caused by nolock
(*) Dirty reads, read errors, reading rows twice and missing rows with nolock


LinkedIn | Blog coming soon (for sufficiently large values of "soon" )!
Post #1297672
Posted Sunday, May 13, 2012 10:03 AM
SSCrazy

SSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazy

Group: General Forum Members
Last Login: 2 days ago @ 2:31 AM
Points: 2,909, Visits: 1,837
Jeff Moden (5/12/2012)
SPAM reported.


Time for a spot of RegEx pattern matching


LinkedIn Profile
Newbie on www.simple-talk.com
Post #1299259
Posted Sunday, May 13, 2012 10:11 AM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: Yesterday @ 10:15 PM
Points: 35,399, Visits: 31,959
David.Poole (5/13/2012)
Jeff Moden (5/12/2012)
SPAM reported.


Time for a spot of RegEx pattern matching


Only if it doesn't take too long.


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1299265
« Prev Topic | Next Topic »

Add to briefcase «««23456»»»

Permissions Expand / Collapse