Click here to monitor SSC
SQLServerCentral is supported by Red Gate Software Ltd.
 
Log in  ::  Register  ::  Not logged in
 
 
 
        
Home       Members    Calendar    Who's On


Add to briefcase ««12345»»»

RegEx for DBAs Expand / Collapse
Author
Message
Posted Tuesday, May 8, 2012 7:40 AM


Old Hand

Old HandOld HandOld HandOld HandOld HandOld HandOld HandOld Hand

Group: General Forum Members
Last Login: Friday, August 29, 2014 3:16 PM
Points: 355, Visits: 1,881
To david.rowland and Jeff Moden:

With regards to performance, whether compared to LIKE / PATINDEX or even a custom SQLCLR function, please keep in mind that the main "win" here for Regular Expressions is functionality (and to a lesser degree: readability). Yes, a custom .Net string parsing function will be faster than the generalized Regular Expression functions, but there are pros and cons to most decisions. Likely people have a great variety of patterns to look for and it might not be feasible to have 30 or so custom functions. But it would be quite reasonable to do general matching via RegEx and when you find a place that truly needs performance, then make that situation a custom solution, just as you (david) did. I mean, if you absolutely need the performance then by all means take the approach that you did, but that does not take away from the overall benefit of people using Regular Expressions more.

And regarding the PATINDEX comparison that Jeff asked for: along those same lines, even if the performance were slightly better with a combination of LIKE, PATINDEX, and SUBSTRING, for many patterns the sheer number of those functions required to do the same basic task as RegEx would render a straight T-SQL solution entirely unreadable or sometimes even impossible. Given that PATINDEX and LIKE can only do VERY simplistic patterns (maybe 3 of the 30+ RegEx functions) it would be hard to compare them fairly.





SQL# - http://www.SQLsharp.com/
Post #1296458
Posted Tuesday, May 8, 2012 8:27 AM


Old Hand

Old HandOld HandOld HandOld HandOld HandOld HandOld HandOld Hand

Group: General Forum Members
Last Login: Friday, August 29, 2014 3:16 PM
Points: 355, Visits: 1,881
Jeffrey Bijkerk (5/8/2012)
If you use regex combined with a where clause it can not use the index anymore. I think this is very important.


Hi Jeffrey. It is important to a degree. Sometimes you need to do queries that are not going to use an index, just like using a LIKE statement where the pattern starts with % or _. Also, not all columns are indexed. So, this alone is not a reason to not use Regular Expressions, but it is a reason to not use ANY function in some situations where performance is critical.

Take care,
Solomon...





SQL# - http://www.SQLsharp.com/
Post #1296496
Posted Tuesday, May 8, 2012 8:31 AM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: 2 days ago @ 9:58 AM
Points: 36,995, Visits: 31,517
I completely agree, Solomon... that's why I'm discussing the "con" side of this.

What I've found is that people will take some nice tools like you've developed and use them to make up for their lack of knowledge in areas like T-SQL and sometimes it just doesn't work out. For example, before I got there, the folks at work had 8 fairly wide tables (>100 columns). As if that weren't bad enough, they wanted to audit changes to an "EAV" style of audit table because the changes would usually only happen to a small handful of columns from any given table. Because they didn't want to write IF UPDATE() for every column in the trigger, they used a generic SQLCLR trigger on the tables... that would take nearly 4 minutes to audit changes to only 5 columns on a paltry 10,000 rows. The T-SQL replacements for those triggers work nearly instantaneously for 100,000 rows.

Now, I'm not saying that you've written "slow" code because I don't actually know and neither does the author because no one has done a performance test in this article. Yes, yes... I understand the grand utility of modular general purpose code. My general caution to people is to test any new thing they're going to add to their servers before they add it and that's my intent here because it really might not be worth the trade off in performance... or it might.

Like my ol friend Sergiy says, "A Developer must not guess. A Developer must KNOW!"

So, what say yea? How about some performance tests for those things that can also be done in T-SQL?


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1296499
Posted Tuesday, May 8, 2012 8:37 AM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: 2 days ago @ 9:58 AM
Points: 36,995, Visits: 31,517
borisk 57245 (5/8/2012)
RegEx is Thee best tool in order to find any pattern within given string[did couple of test including .Net String manipulation and limited RegEx in SQL , Like/PATIndex....], and it proves to be much faster than any other way.


Cool! Prove it. Show us the test code you used.


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1296506
Posted Tuesday, May 8, 2012 8:40 AM


Old Hand

Old HandOld HandOld HandOld HandOld HandOld HandOld HandOld Hand

Group: General Forum Members
Last Login: Friday, August 29, 2014 3:16 PM
Points: 355, Visits: 1,881
williams.frd (5/8/2012)
Am I out of line in thinking this is something that should be handled in a high-level object-oriented language?


Hello. I am going to say "yes", but I am not being flippant about it . Here are my two reasons:

1) SQL already includes similar albeit less functional equivalents in the form of: LIKE, PATINDEX, SUBSTRING, and REPLACE.

2) In terms of using Regular Expressions in WHERE clauses, this is something that SHOULD be in the DB layer as pulling back hundreds of thousands of rows to the app just to do filtering there is rather inefficient.


Take care,
Solomon...





SQL# - http://www.SQLsharp.com/
Post #1296508
Posted Tuesday, May 8, 2012 8:41 AM


SSC-Dedicated

SSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-DedicatedSSC-Dedicated

Group: General Forum Members
Last Login: 2 days ago @ 9:58 AM
Points: 36,995, Visits: 31,517
Solomon Rutzky (5/8/2012)
Jeffrey Bijkerk (5/8/2012)
If you use regex combined with a where clause it can not use the index anymore. I think this is very important.


Hi Jeffrey. It is important to a degree. Sometimes you need to do queries that are not going to use an index, just like using a LIKE statement where the pattern starts with % or _. Also, not all columns are indexed. So, this alone is not a reason to not use Regular Expressions, but it is a reason to not use ANY function in some situations where performance is critical.

Take care,
Solomon...


I do have to agree with Solomon here. It really does depend on how it's used. You can get some very powerful index seeks followed by the desired row scan even when wrapping some columns in the likes of ISNULL(). It very much depends on what the columns of the index and the WHERE clause are.


--Jeff Moden
"RBAR is pronounced "ree-bar" and is a "Modenism" for "Row-By-Agonizing-Row".

First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column."

(play on words) "Just because you CAN do something in T-SQL, doesn't mean you SHOULDN'T." --22 Aug 2013

Helpful Links:
How to post code problems
How to post performance problems
Post #1296509
Posted Tuesday, May 8, 2012 9:08 AM
Forum Newbie

Forum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum NewbieForum Newbie

Group: General Forum Members
Last Login: Thursday, July 3, 2014 7:59 AM
Points: 8, Visits: 39
Jeff Moden (5/8/2012)
borisk 57245 (5/8/2012)
RegEx is Thee best tool in order to find any pattern within given string[did couple of test including .Net String manipulation and limited RegEx in SQL , Like/PATIndex....], and it proves to be much faster than any other way.


Cool! Prove it. Show us the test code you used.

Check Attachment ...
Input: 4000 Chars, A with delimmited COMMA
Test: tokenizer split with SQLCLR[RegEx.Matches, String.Split], and TSQL split

Run time - String.Split
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 2 ms.


Run time - RegEx.Matches
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 3 ms.


Run time - TSQL Tokenizer
SQL Server Execution Times:
CPU time = 577 ms, elapsed time = 589 ms.


nothing to add basically :)
as for now, could not find any possible reason not to use SQL CLR RegEx or other CLR function other than readability, PATTINDEX = LIKE are much less efficient both in performance and versatility...


  Post Attachments 
RegEx.StringSplit.TSQL.Comparision.Tokenizer.jpg (15 views, 202.96 KB)
Post #1296545
Posted Tuesday, May 8, 2012 9:33 AM


SSCrazy

SSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazy

Group: General Forum Members
Last Login: Friday, August 29, 2014 3:17 AM
Points: 2,434, Visits: 7,513
Let's create some sample data to play with first.

SET NOCOUNT ON;
IF object_id('dbo.testEnvironment') IS NOT NULL
BEGIN
DROP TABLE dbo.testEnvironment;
END;

WITH t1(N) AS (SELECT 1 UNION ALL SELECT 1),
t2(N) AS (SELECT 1 FROM t1 x, t1 y),
t3(N) AS (SELECT 1 FROM t2 x, t2 y),
Tally(N) AS (SELECT TOP 98 ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM t3 x, t3 y),
Tally2(N) AS (SELECT TOP 5 ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM t3 x, t3 y),
Combinations(N) AS (SELECT LTRIM(RTRIM(RTRIM(SUBSTRING(poss,a.N,2)) + SUBSTRING(vowels,b.N,1)))
FROM Tally a
CROSS JOIN Tally2 b
CROSS APPLY (SELECT 'B C D F G H J K L M N P R S T V W Z SCSKKNSNSPSTBLCLFLGLPLSLBRCRDRFRGRPRTRVRSHSMGHCHPHRHWHBWCWSWTW') d(poss)
CROSS APPLY (SELECT 'AEIOU') e(vowels)),
Words (N) AS (SELECT a.N + b.N
FROM Combinations a
CROSS JOIN Combinations b)
SELECT IDENTITY(INT,1,1) AS ID, a.N + b.N AS randomName,
RAND(CHECKSUM(NEWID())) * 30000 /*(Number of days in range)*/ + CAST('1945' AS DATETIME) /*(Start date, e.g. '1945-01-01 00:00:00*/ AS randomDateTime,
DATEADD(DAY,((ABS(CHECKSUM(NEWID())) % 366 /*(Number of days in range)*/) + 1),CAST('1945' AS DATE) /*(Start date, e.g. '1945-01-01*/) AS randomDate,
ABS(CHECKSUM(NEWID())) AS randomBigInt,
(ABS(CHECKSUM(NEWID())) % 100) + 1 AS randomSmallInt,
RAND(CHECKSUM(NEWID())) * 100 AS randomSmallDec,
RAND(CHECKSUM(NEWID())) AS randomTinyDec,
RAND(CHECKSUM(NEWID())) * 100000 AS randomBigDec,
CONVERT(VARCHAR(6),CONVERT(MONEY,RAND(CHECKSUM(NEWID())) * 100),0) AS randomMoney
INTO dbo.testEnvironment
FROM (SELECT TOP 1000 N
FROM Words
ORDER BY NEWID()) a
CROSS JOIN (SELECT TOP 1000 N
FROM Words
ORDER BY NEWID()) b;


SELECT COUNT(*)
FROM dbo.testEnvironment;

OK, so now we've got 1 million rows, so let's have a look at performance.

First, with no indexes.

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT COUNT(*)
FROM dbo.testEnvironment
WHERE SQL#.RegEx_IsMatch(randomName,'[A-C][A-F][P-Z]',1,'IgnoreCase')>0

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;

The above returns: -
-----------
104165

Table 'testEnvironment'. Scan count 5, logical reads 9730, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 46233 ms, elapsed time = 12223 ms.


Let's have a look at the LIKE equivalent: -
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT COUNT(*)
FROM dbo.testEnvironment
WHERE UPPER(randomName) LIKE '%[A-C][A-F][P-Z]%'

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;

What do you think? Slower or faster?
-----------
104165

Table 'testEnvironment'. Scan count 5, logical reads 9730, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 954 ms, elapsed time = 245 ms.


Whoops, not what we were expecting!

OK, the PATINDEX equivalent?
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT COUNT(*)
FROM dbo.testEnvironment
WHERE PATINDEX('%[A-C][A-F][P-Z]%',UPPER(randomName))>0

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;

As expected, pretty much identical to LIKE.

-----------
104165

Table 'testEnvironment'. Scan count 5, logical reads 9730, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 922 ms, elapsed time = 242 ms.


OK, let's add a couple of indexes.
CREATE CLUSTERED INDEX cidx_testEnvironment ON dbo.testEnvironment (ID);
CREATE NONCLUSTERED INDEX ncidx_testEnvironment ON dbo.testEnvironment (randomName);

Try again. . .
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT COUNT(*)
FROM dbo.testEnvironment
WHERE SQL#.RegEx_IsMatch(randomName,'[A-C][A-F][P-Z]',1,'IgnoreCase')>0

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;

Returns: -
-----------
104165

Table 'testEnvironment'. Scan count 1, logical reads 3046, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 42656 ms, elapsed time = 45748 ms.


Oh dear, it's worse. OK, how about LIKE?

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT COUNT(*)
FROM dbo.testEnvironment
WHERE UPPER(randomName) LIKE '%[A-C][A-F][P-Z]%'

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;

Returns: -
-----------
104165

Table 'testEnvironment'. Scan count 1, logical reads 3046, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 1000 ms, elapsed time = 1004 ms.


Also worse. . . and PATINDEX?

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT COUNT(*)
FROM dbo.testEnvironment
WHERE PATINDEX('%[A-C][A-F][P-Z]%',UPPER(randomName))>0

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;

Returns: -
-----------
104165

Table 'testEnvironment'. Scan count 1, logical reads 3046, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 938 ms, elapsed time = 925 ms.


Same as LIKE again.

Any disputes about my test results? Because so far I'm thinking that I don't want RegEx in my database



Not a DBA, just trying to learn

For better, quicker answers on T-SQL questions, click on the following...
http://www.sqlservercentral.com/articles/Best+Practices/61537/

For better, quicker answers on SQL Server performance related questions, click on the following...
http://www.sqlservercentral.com/articles/SQLServerCentral/66909/



If you litter your database queries with nolock query hints, are you aware of the side effects?
Try reading a few of these links...

(*) Missing rows with nolock
(*) Allocation order scans with nolock
(*) Consistency issues with nolock
(*) Transient Corruption Errors in SQL Server error log caused by nolock
(*) Dirty reads, read errors, reading rows twice and missing rows with nolock


LinkedIn | Blog coming soon (for sufficiently large values of "soon" )!
Post #1296565
Posted Tuesday, May 8, 2012 10:29 AM
SSCrazy

SSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazySSCrazy

Group: General Forum Members
Last Login: Wednesday, August 27, 2014 1:07 AM
Points: 2,901, Visits: 1,805
Cadavre (5/8/2012)

Any disputes about my test results? Because so far I'm thinking that I don't want RegEx in my database


Just as the advice is not to use CLR functions to do data access best done with T-SQL so the advice has to be not to use RegEX for things that are best done with either custom CLR functions or out-of-the-box SQL.

Your test results illustrate this very nicely.

With the caveats of performance I'd like to reitterate the point that RegEx can offer functionality far beyond simple LIKE, PATINDEX, CHARINDEX, REPLACE operations. It's a trade-off between functionality and performance.


Solomon and I had several exchanges of email over this article and there were quite a few things that were left out simply because the article was getting too big and too unwieldy. There is definitely a case for an article on the performance angle for RegEx particularly with the more complex RegEx operations.

The simple examples in this article were really more of a primer for RegEx. Again, the point that it is a dangerous weapon in the hands of the inexperienced is one fairly made.


LinkedIn Profile
Newbie on www.simple-talk.com
Post #1296615
Posted Tuesday, May 8, 2012 12:10 PM


SSC-Enthusiastic

SSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-EnthusiasticSSC-Enthusiastic

Group: General Forum Members
Last Login: Thursday, August 21, 2014 12:21 PM
Points: 117, Visits: 572
For anyone interested, I found http://regexpal.com/ very useful when building and testing RegEx's. Just paste/key in your data, then build the regex and the results are instantaneously highlighted. Simple, but effective, and it does educate one while one pokes and prods away at different expressions.

As David pointed out, quite rightly, there are cases when the built-in's won't cut it, as in '\[\d+[\:\-*\d+]+\]' being a very simple pattern match argument where there may be several strings matching this pattern in a given input string. Converting this into LIKE/PATINDEX/CHARINDEX/et al, likely with WHILE loops, would make for horrendous code that a simple CLR pattern matching function eats for breakfast.

Thanks for excellent article David. I wish it had been published about 2 months ago!
Post #1296672
« Prev Topic | Next Topic »

Add to briefcase ««12345»»»

Permissions Expand / Collapse