SQL Clone
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 


RegEx for DBAs


RegEx for DBAs

Author
Message
Solomon Rutzky
Solomon Rutzky
SSCommitted
SSCommitted (1.7K reputation)SSCommitted (1.7K reputation)SSCommitted (1.7K reputation)SSCommitted (1.7K reputation)SSCommitted (1.7K reputation)SSCommitted (1.7K reputation)SSCommitted (1.7K reputation)SSCommitted (1.7K reputation)

Group: General Forum Members
Points: 1727 Visits: 2965
To david.rowland and Jeff Moden:

With regards to performance, whether compared to LIKE / PATINDEX or even a custom SQLCLR function, please keep in mind that the main "win" here for Regular Expressions is functionality (and to a lesser degree: readability). Yes, a custom .Net string parsing function will be faster than the generalized Regular Expression functions, but there are pros and cons to most decisions. Likely people have a great variety of patterns to look for and it might not be feasible to have 30 or so custom functions. But it would be quite reasonable to do general matching via RegEx and when you find a place that truly needs performance, then make that situation a custom solution, just as you (david) did. I mean, if you absolutely need the performance then by all means take the approach that you did, but that does not take away from the overall benefit of people using Regular Expressions more.

And regarding the PATINDEX comparison that Jeff asked for: along those same lines, even if the performance were slightly better with a combination of LIKE, PATINDEX, and SUBSTRING, for many patterns the sheer number of those functions required to do the same basic task as RegEx would render a straight T-SQL solution entirely unreadable or sometimes even impossible. Given that PATINDEX and LIKE can only do VERY simplistic patterns (maybe 3 of the 30+ RegEx functions) it would be hard to compare them fairly.

SQL# - http://www.SQLsharp.com/
Solomon Rutzky
Solomon Rutzky
SSCommitted
SSCommitted (1.7K reputation)SSCommitted (1.7K reputation)SSCommitted (1.7K reputation)SSCommitted (1.7K reputation)SSCommitted (1.7K reputation)SSCommitted (1.7K reputation)SSCommitted (1.7K reputation)SSCommitted (1.7K reputation)

Group: General Forum Members
Points: 1727 Visits: 2965
Jeffrey Bijkerk (5/8/2012)
If you use regex combined with a where clause it can not use the index anymore. I think this is very important.


Hi Jeffrey. It is important to a degree. Sometimes you need to do queries that are not going to use an index, just like using a LIKE statement where the pattern starts with % or _. Also, not all columns are indexed. So, this alone is not a reason to not use Regular Expressions, but it is a reason to not use ANY function in some situations where performance is critical.

Take care,
Solomon...

SQL# - http://www.SQLsharp.com/
Jeff Moden
Jeff Moden
SSC Guru
SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)

Group: General Forum Members
Points: 86510 Visits: 41098
I completely agree, Solomon... that's why I'm discussing the "con" side of this.

What I've found is that people will take some nice tools like you've developed and use them to make up for their lack of knowledge in areas like T-SQL and sometimes it just doesn't work out. For example, before I got there, the folks at work had 8 fairly wide tables (>100 columns). As if that weren't bad enough, they wanted to audit changes to an "EAV" style of audit table because the changes would usually only happen to a small handful of columns from any given table. Because they didn't want to write IF UPDATE() for every column in the trigger, they used a generic SQLCLR trigger on the tables... that would take nearly 4 minutes to audit changes to only 5 columns on a paltry 10,000 rows. The T-SQL replacements for those triggers work nearly instantaneously for 100,000 rows.

Now, I'm not saying that you've written "slow" code because I don't actually know and neither does the author because no one has done a performance test in this article. Yes, yes... I understand the grand utility of modular general purpose code. My general caution to people is to test any new thing they're going to add to their servers before they add it and that's my intent here because it really might not be worth the trade off in performance... or it might.

Like my ol friend Sergiy says, "A Developer must not guess. A Developer must KNOW!"

So, what say yea? How about some performance tests for those things that can also be done in T-SQL?

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
Jeff Moden
Jeff Moden
SSC Guru
SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)

Group: General Forum Members
Points: 86510 Visits: 41098
borisk 57245 (5/8/2012)
RegEx is Thee best tool in order to find any pattern within given string[did couple of test including .Net String manipulation and limited RegEx in SQL , Like/PATIndex....], and it proves to be much faster than any other way.


Cool! Prove it. Show us the test code you used.

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
Solomon Rutzky
Solomon Rutzky
SSCommitted
SSCommitted (1.7K reputation)SSCommitted (1.7K reputation)SSCommitted (1.7K reputation)SSCommitted (1.7K reputation)SSCommitted (1.7K reputation)SSCommitted (1.7K reputation)SSCommitted (1.7K reputation)SSCommitted (1.7K reputation)

Group: General Forum Members
Points: 1727 Visits: 2965
williams.frd (5/8/2012)
Am I out of line in thinking this is something that should be handled in a high-level object-oriented language?


Hello. I am going to say "yes", but I am not being flippant about it ;-). Here are my two reasons:

1) SQL already includes similar albeit less functional equivalents in the form of: LIKE, PATINDEX, SUBSTRING, and REPLACE.

2) In terms of using Regular Expressions in WHERE clauses, this is something that SHOULD be in the DB layer as pulling back hundreds of thousands of rows to the app just to do filtering there is rather inefficient.


Take care,
Solomon...

SQL# - http://www.SQLsharp.com/
Jeff Moden
Jeff Moden
SSC Guru
SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)SSC Guru (86K reputation)

Group: General Forum Members
Points: 86510 Visits: 41098
Solomon Rutzky (5/8/2012)
Jeffrey Bijkerk (5/8/2012)
If you use regex combined with a where clause it can not use the index anymore. I think this is very important.


Hi Jeffrey. It is important to a degree. Sometimes you need to do queries that are not going to use an index, just like using a LIKE statement where the pattern starts with % or _. Also, not all columns are indexed. So, this alone is not a reason to not use Regular Expressions, but it is a reason to not use ANY function in some situations where performance is critical.

Take care,
Solomon...


I do have to agree with Solomon here. It really does depend on how it's used. You can get some very powerful index seeks followed by the desired row scan even when wrapping some columns in the likes of ISNULL(). It very much depends on what the columns of the index and the WHERE clause are.

--Jeff Moden

RBAR is pronounced ree-bar and is a Modenism for Row-By-Agonizing-Row.
First step towards the paradigm shift of writing Set Based code:
Stop thinking about what you want to do to a row... think, instead, of what you want to do to a column.
If you think its expensive to hire a professional to do the job, wait until you hire an amateur. -- Red Adair

Helpful Links:
How to post code problems
How to post performance problems
Forum FAQs
borisk 57245
borisk 57245
Grasshopper
Grasshopper (18 reputation)Grasshopper (18 reputation)Grasshopper (18 reputation)Grasshopper (18 reputation)Grasshopper (18 reputation)Grasshopper (18 reputation)Grasshopper (18 reputation)Grasshopper (18 reputation)

Group: General Forum Members
Points: 18 Visits: 39
Jeff Moden (5/8/2012)
borisk 57245 (5/8/2012)
RegEx is Thee best tool in order to find any pattern within given string[did couple of test including .Net String manipulation and limited RegEx in SQL , Like/PATIndex....], and it proves to be much faster than any other way.


Cool! Prove it. Show us the test code you used.

Check Attachment ...
Input: 4000 Chars, A with delimmited COMMA
Test: tokenizer split with SQLCLR[RegEx.Matches, String.Split], and TSQL split

Run time - String.Split
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 2 ms.


Run time - RegEx.Matches
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 3 ms.


Run time - TSQL Tokenizer
SQL Server Execution Times:
CPU time = 577 ms, elapsed time = 589 ms.


nothing to add basically Smile
as for now, could not find any possible reason not to use SQL CLR RegEx or other CLR function other than readability, PATTINDEX = LIKE are much less efficient both in performance and versatility...
Attachments
Cadavre
Cadavre
Hall of Fame
Hall of Fame (3.9K reputation)Hall of Fame (3.9K reputation)Hall of Fame (3.9K reputation)Hall of Fame (3.9K reputation)Hall of Fame (3.9K reputation)Hall of Fame (3.9K reputation)Hall of Fame (3.9K reputation)Hall of Fame (3.9K reputation)

Group: General Forum Members
Points: 3904 Visits: 8472
Let's create some sample data to play with first.

SET NOCOUNT ON;
IF object_id('dbo.testEnvironment') IS NOT NULL
BEGIN
DROP TABLE dbo.testEnvironment;
END;

WITH t1(N) AS (SELECT 1 UNION ALL SELECT 1),
t2(N) AS (SELECT 1 FROM t1 x, t1 y),
t3(N) AS (SELECT 1 FROM t2 x, t2 y),
Tally(N) AS (SELECT TOP 98 ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM t3 x, t3 y),
Tally2(N) AS (SELECT TOP 5 ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM t3 x, t3 y),
Combinations(N) AS (SELECT LTRIM(RTRIM(RTRIM(SUBSTRING(poss,a.N,2)) + SUBSTRING(vowels,b.N,1)))
FROM Tally a
CROSS JOIN Tally2 b
CROSS APPLY (SELECT 'B C D F G H J K L M N P R S T V W Z SCSKKNSNSPSTBLCLFLGLPLSLBRCRDRFRGRPRTRVRSHSMGHCHPHRHWHBWCWSWTW') d(poss)
CROSS APPLY (SELECT 'AEIOU') e(vowels)),
Words (N) AS (SELECT a.N + b.N
FROM Combinations a
CROSS JOIN Combinations b)
SELECT IDENTITY(INT,1,1) AS ID, a.N + b.N AS randomName,
RAND(CHECKSUM(NEWID())) * 30000 /*(Number of days in range)*/ + CAST('1945' AS DATETIME) /*(Start date, e.g. '1945-01-01 00:00:00*/ AS randomDateTime,
DATEADD(DAY,((ABS(CHECKSUM(NEWID())) % 366 /*(Number of days in range)*/) + 1),CAST('1945' AS DATE) /*(Start date, e.g. '1945-01-01*/) AS randomDate,
ABS(CHECKSUM(NEWID())) AS randomBigInt,
(ABS(CHECKSUM(NEWID())) % 100) + 1 AS randomSmallInt,
RAND(CHECKSUM(NEWID())) * 100 AS randomSmallDec,
RAND(CHECKSUM(NEWID())) AS randomTinyDec,
RAND(CHECKSUM(NEWID())) * 100000 AS randomBigDec,
CONVERT(VARCHAR(6),CONVERT(MONEY,RAND(CHECKSUM(NEWID())) * 100),0) AS randomMoney
INTO dbo.testEnvironment
FROM (SELECT TOP 1000 N
FROM Words
ORDER BY NEWID()) a
CROSS JOIN (SELECT TOP 1000 N
FROM Words
ORDER BY NEWID()) b;


SELECT COUNT(*)
FROM dbo.testEnvironment;



OK, so now we've got 1 million rows, so let's have a look at performance.

First, with no indexes.

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT COUNT(*)
FROM dbo.testEnvironment
WHERE SQL#.RegEx_IsMatch(randomName,'[A-C][A-F][P-Z]',1,'IgnoreCase')>0

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;



The above returns: -

-----------
104165

Table 'testEnvironment'. Scan count 5, logical reads 9730, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 46233 ms, elapsed time = 12223 ms.


Let's have a look at the LIKE equivalent: -
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT COUNT(*)
FROM dbo.testEnvironment
WHERE UPPER(randomName) LIKE '%[A-C][A-F][P-Z]%'

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;



What do you think? Slower or faster?

-----------
104165

Table 'testEnvironment'. Scan count 5, logical reads 9730, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 954 ms, elapsed time = 245 ms.


Whoops, not what we were expecting! :-P

OK, the PATINDEX equivalent?
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT COUNT(*)
FROM dbo.testEnvironment
WHERE PATINDEX('%[A-C][A-F][P-Z]%',UPPER(randomName))>0

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;



As expected, pretty much identical to LIKE.


-----------
104165

Table 'testEnvironment'. Scan count 5, logical reads 9730, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 922 ms, elapsed time = 242 ms.


OK, let's add a couple of indexes.
CREATE CLUSTERED INDEX cidx_testEnvironment ON dbo.testEnvironment (ID);
CREATE NONCLUSTERED INDEX ncidx_testEnvironment ON dbo.testEnvironment (randomName);



Try again. . .
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT COUNT(*)
FROM dbo.testEnvironment
WHERE SQL#.RegEx_IsMatch(randomName,'[A-C][A-F][P-Z]',1,'IgnoreCase')>0

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;



Returns: -
-----------
104165

Table 'testEnvironment'. Scan count 1, logical reads 3046, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 42656 ms, elapsed time = 45748 ms.


Oh dear, it's worse. OK, how about LIKE?

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT COUNT(*)
FROM dbo.testEnvironment
WHERE UPPER(randomName) LIKE '%[A-C][A-F][P-Z]%'

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;



Returns: -

-----------
104165

Table 'testEnvironment'. Scan count 1, logical reads 3046, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 1000 ms, elapsed time = 1004 ms.


Also worse. . . and PATINDEX?

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT COUNT(*)
FROM dbo.testEnvironment
WHERE PATINDEX('%[A-C][A-F][P-Z]%',UPPER(randomName))>0

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;



Returns: -

-----------
104165

Table 'testEnvironment'. Scan count 1, logical reads 3046, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 938 ms, elapsed time = 925 ms.


Same as LIKE again.

Any disputes about my test results? Because so far I'm thinking that I don't want RegEx in my database :-P


Forever trying to learn

For better, quicker answers on T-SQL questions, click on the following...
http://www.sqlservercentral.com/articles/Best+Practices/61537/

For better, quicker answers on SQL Server performance related questions, click on the following...
http://www.sqlservercentral.com/articles/SQLServerCentral/66909/



If you litter your database queries with nolock query hints, are you aware of the side effects?
Try reading a few of these links...

(*) Missing rows with nolock
(*) Allocation order scans with nolock
(*) Consistency issues with nolock
(*) Transient Corruption Errors in SQL Server error log caused by nolock
(*) Dirty reads, read errors, reading rows twice and missing rows with nolock


Craig Wilkinson - Software Engineer
LinkedIn
David.Poole
David.Poole
SSCertifiable
SSCertifiable (7.6K reputation)SSCertifiable (7.6K reputation)SSCertifiable (7.6K reputation)SSCertifiable (7.6K reputation)SSCertifiable (7.6K reputation)SSCertifiable (7.6K reputation)SSCertifiable (7.6K reputation)SSCertifiable (7.6K reputation)

Group: General Forum Members
Points: 7637 Visits: 3286
Cadavre (5/8/2012)

Any disputes about my test results? Because so far I'm thinking that I don't want RegEx in my database :-P


Just as the advice is not to use CLR functions to do data access best done with T-SQL so the advice has to be not to use RegEX for things that are best done with either custom CLR functions or out-of-the-box SQL.

Your test results illustrate this very nicely.

With the caveats of performance I'd like to reitterate the point that RegEx can offer functionality far beyond simple LIKE, PATINDEX, CHARINDEX, REPLACE operations. It's a trade-off between functionality and performance.


Solomon and I had several exchanges of email over this article and there were quite a few things that were left out simply because the article was getting too big and too unwieldy. There is definitely a case for an article on the performance angle for RegEx particularly with the more complex RegEx operations.

The simple examples in this article were really more of a primer for RegEx. Again, the point that it is a dangerous weapon in the hands of the inexperienced is one fairly made.

LinkedIn Profile

Newbie on www.simple-talk.com
SAinCA
SAinCA
SSC-Enthusiastic
SSC-Enthusiastic (190 reputation)SSC-Enthusiastic (190 reputation)SSC-Enthusiastic (190 reputation)SSC-Enthusiastic (190 reputation)SSC-Enthusiastic (190 reputation)SSC-Enthusiastic (190 reputation)SSC-Enthusiastic (190 reputation)SSC-Enthusiastic (190 reputation)

Group: General Forum Members
Points: 190 Visits: 695
For anyone interested, I found http://regexpal.com/ very useful when building and testing RegEx's. Just paste/key in your data, then build the regex and the results are instantaneously highlighted. Simple, but effective, and it does educate one while one pokes and prods away at different expressions.

As David pointed out, quite rightly, there are cases when the built-in's won't cut it, as in '\[\d+[\:\-*\d+]+\]' being a very simple pattern match argument where there may be several strings matching this pattern in a given input string. Converting this into LIKE/PATINDEX/CHARINDEX/et al, likely with WHILE loops, would make for horrendous code that a simple CLR pattern matching function eats for breakfast.

Thanks for excellent article David. I wish it had been published about 2 months ago!
Go


Permissions

You can't post new topics.
You can't post topic replies.
You can't post new polls.
You can't post replies to polls.
You can't edit your own topics.
You can't delete your own topics.
You can't edit other topics.
You can't delete other topics.
You can't edit your own posts.
You can't edit other posts.
You can't delete your own posts.
You can't delete other posts.
You can't post events.
You can't edit your own events.
You can't edit other events.
You can't delete your own events.
You can't delete other events.
You can't send private messages.
You can't send emails.
You can read topics.
You can't vote in polls.
You can't upload attachments.
You can download attachments.
You can't post HTML code.
You can't edit HTML code.
You can't post IFCode.
You can't post JavaScript.
You can post emoticons.
You can't post or upload images.

Select a forum

































































































































































SQLServerCentral


Search