RegEx for DBAs

  • Solomon Rutzky (5/8/2012)


    Jeffrey Bijkerk (5/8/2012)


If you use regex combined with a WHERE clause, it cannot use the index anymore. I think this is very important.

Hi Jeffrey. It is important to a degree. Sometimes you need to run queries that are not going to use an index anyway, just as when a LIKE pattern starts with % or _. Also, not all columns are indexed. So this alone is not a reason to avoid Regular Expressions; rather, it is a reason to avoid ANY function in situations where performance is critical.

    Take care,

    Solomon...

    I do have to agree with Solomon here. It really does depend on how it's used. You can get some very powerful index seeks followed by the desired row scan even when wrapping some columns in the likes of ISNULL(). It very much depends on what the columns of the index and the WHERE clause are.
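To make the sargability point concrete, here is a minimal sketch (my illustration only, reusing the dbo.testEnvironment table and randomName index that get created later in this thread); compare the actual execution plans of the two queries: -

-- Leading wildcard: no seek is possible, so expect an index scan.
SELECT COUNT(*) FROM dbo.testEnvironment WHERE randomName LIKE '%AB%';

-- Literal prefix: the optimizer can seek to the 'AB' range, then check the rest.
SELECT COUNT(*) FROM dbo.testEnvironment WHERE randomName LIKE 'AB%';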

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden (5/8/2012)


    borisk 57245 (5/8/2012)


RegEx is the best tool for finding any pattern within a given string (I did a couple of tests, including .NET string manipulation and the limited regex support in SQL via LIKE/PATINDEX...), and it proved to be much faster than any other way.

Cool! Prove it. Show us the test code you used.

Check Attachment ...

Input: 4,000 characters, comma-delimited

Test: tokenizer split with SQLCLR (RegEx.Matches, String.Split) and a T-SQL split

Run time - String.Split
SQL Server Execution Times: CPU time = 0 ms, elapsed time = 2 ms.

Run time - RegEx.Matches
SQL Server Execution Times: CPU time = 0 ms, elapsed time = 3 ms.

Run time - TSQL Tokenizer
SQL Server Execution Times: CPU time = 577 ms, elapsed time = 589 ms.

nothing to add basically 🙂

As for now, I could not find any reason not to use SQLCLR RegEx or other CLR functions, other than readability; PATINDEX and LIKE are much less efficient in both performance and versatility...
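For context, a minimal harness along these lines would reproduce the comparison, assuming the CLR methods are exposed as table-valued functions named dbo.Tokenizer and dbo.MatchList, as in the code posted further down the thread: -

DECLARE @input nvarchar(4000) = REPLICATE(N'token,', 600); -- ~3,600 chars, comma-delimited

SET STATISTICS TIME ON;

SELECT COUNT(*) FROM dbo.Tokenizer(@input);                           -- SQLCLR String.Split
SELECT COUNT(*) FROM dbo.MatchList(@input, N'[^,]+');                 -- SQLCLR RegEx.Matches
SELECT COUNT(*) FROM dbo.fn_Tokenizer(CONVERT(varchar(MAX), @input)); -- T-SQL WHILE loop

SET STATISTICS TIME OFF;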

  • Let's create some sample data to play with first.

SET NOCOUNT ON;

IF OBJECT_ID('dbo.testEnvironment') IS NOT NULL
BEGIN
    DROP TABLE dbo.testEnvironment;
END;

-- Cascaded CTEs build a tally of row numbers, which is used to slice random
-- pronounceable "words" out of the consonant/vowel strings below
WITH t1(N) AS (SELECT 1 UNION ALL SELECT 1),
t2(N) AS (SELECT 1 FROM t1 x, t1 y),
t3(N) AS (SELECT 1 FROM t2 x, t2 y),
Tally(N) AS (SELECT TOP 98 ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM t3 x, t3 y),
Tally2(N) AS (SELECT TOP 5 ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM t3 x, t3 y),
Combinations(N) AS (SELECT LTRIM(RTRIM(RTRIM(SUBSTRING(poss,a.N,2)) + SUBSTRING(vowels,b.N,1)))
                    FROM Tally a
                    CROSS JOIN Tally2 b
                    CROSS APPLY (SELECT 'B C D F G H J K L M N P R S T V W Z SCSKKNSNSPSTBLCLFLGLPLSLBRCRDRFRGRPRTRVRSHSMGHCHPHRHWHBWCWSWTW') d(poss)
                    CROSS APPLY (SELECT 'AEIOU') e(vowels)),
Words(N) AS (SELECT a.N + b.N
             FROM Combinations a
             CROSS JOIN Combinations b)
-- 1,000 random words cross joined with 1,000 random words = 1,000,000 rows
SELECT IDENTITY(INT,1,1) AS ID, a.N + b.N AS randomName,
       RAND(CHECKSUM(NEWID())) * 30000 /*(Number of days in range)*/ + CAST('1945' AS DATETIME) /*(Start date, e.g. '1945-01-01 00:00:00')*/ AS randomDateTime,
       DATEADD(DAY,((ABS(CHECKSUM(NEWID())) % 366 /*(Number of days in range)*/) + 1),CAST('1945' AS DATE) /*(Start date, e.g. '1945-01-01')*/) AS randomDate,
       ABS(CHECKSUM(NEWID())) AS randomBigInt,
       (ABS(CHECKSUM(NEWID())) % 100) + 1 AS randomSmallInt,
       RAND(CHECKSUM(NEWID())) * 100 AS randomSmallDec,
       RAND(CHECKSUM(NEWID())) AS randomTinyDec,
       RAND(CHECKSUM(NEWID())) * 100000 AS randomBigDec,
       CONVERT(VARCHAR(6),CONVERT(MONEY,RAND(CHECKSUM(NEWID())) * 100),0) AS randomMoney
INTO dbo.testEnvironment
FROM (SELECT TOP 1000 N FROM Words ORDER BY NEWID()) a
CROSS JOIN (SELECT TOP 1000 N FROM Words ORDER BY NEWID()) b;

SELECT COUNT(*)
FROM dbo.testEnvironment;

OK, now we've got 1 million rows, so let's have a look at performance.

    First, with no indexes.

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT COUNT(*)
FROM dbo.testEnvironment
WHERE SQL#.RegEx_IsMatch(randomName,'[A-C][A-F][P-Z]',1,'IgnoreCase') > 0;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;

The above returns: -

-----------
104165

Table 'testEnvironment'. Scan count 5, logical reads 9730, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 46233 ms, elapsed time = 12223 ms.

    Let's have a look at the LIKE equivalent: -

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT COUNT(*)
FROM dbo.testEnvironment
WHERE UPPER(randomName) LIKE '%[A-C][A-F][P-Z]%';

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;

What do you think? Slower or faster?

-----------
104165

Table 'testEnvironment'. Scan count 5, logical reads 9730, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 954 ms, elapsed time = 245 ms.

    Whoops, not what we were expecting! 😛

    OK, the PATINDEX equivalent?

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT COUNT(*)
FROM dbo.testEnvironment
WHERE PATINDEX('%[A-C][A-F][P-Z]%',UPPER(randomName)) > 0;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;

As expected, pretty much identical to LIKE.

-----------
104165

Table 'testEnvironment'. Scan count 5, logical reads 9730, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 922 ms, elapsed time = 242 ms.

    OK, let's add a couple of indexes.

    CREATE CLUSTERED INDEX cidx_testEnvironment ON dbo.testEnvironment (ID);

    CREATE NONCLUSTERED INDEX ncidx_testEnvironment ON dbo.testEnvironment (randomName);

Try again...

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT COUNT(*)
FROM dbo.testEnvironment
WHERE SQL#.RegEx_IsMatch(randomName,'[A-C][A-F][P-Z]',1,'IgnoreCase') > 0;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;

Returns: -

-----------
104165

Table 'testEnvironment'. Scan count 1, logical reads 3046, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 42656 ms, elapsed time = 45748 ms.

    Oh dear, it's worse. OK, how about LIKE?

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT COUNT(*)
FROM dbo.testEnvironment
WHERE UPPER(randomName) LIKE '%[A-C][A-F][P-Z]%';

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;

Returns: -

-----------
104165

Table 'testEnvironment'. Scan count 1, logical reads 3046, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 1000 ms, elapsed time = 1004 ms.

Also worse... and PATINDEX?

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT COUNT(*)
FROM dbo.testEnvironment
WHERE PATINDEX('%[A-C][A-F][P-Z]%',UPPER(randomName)) > 0;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;

Returns: -

-----------
104165

Table 'testEnvironment'. Scan count 1, logical reads 3046, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 938 ms, elapsed time = 925 ms.

    Same as LIKE again.

    Any disputes about my test results? Because so far I'm thinking that I don't want RegEx in my database 😛


    Forever trying to learn
    My blog - http://www.cadavre.co.uk/
    For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
    For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • Cadavre (5/8/2012)


    Any disputes about my test results? Because so far I'm thinking that I don't want RegEx in my database 😛

Just as the advice is not to use CLR functions for data access that is best done with T-SQL, so the advice has to be not to use RegEx for things that are best done with either custom CLR functions or out-of-the-box SQL.

    Your test results illustrate this very nicely.

With the caveats of performance, I'd like to reiterate the point that RegEx can offer functionality far beyond simple LIKE, PATINDEX, CHARINDEX, REPLACE operations. It's a trade-off between functionality and performance.

    Solomon and I had several exchanges of email over this article and there were quite a few things that were left out simply because the article was getting too big and too unwieldy. There is definitely a case for an article on the performance angle for RegEx particularly with the more complex RegEx operations.

    The simple examples in this article were really more of a primer for RegEx. Again, the point that it is a dangerous weapon in the hands of the inexperienced is one fairly made.

For anyone interested, I found http://regexpal.com/ very useful when building and testing regexes. Just paste/key in your data, then build the regex and the results are instantly highlighted. Simple, but effective, and it does educate one while one pokes and prods away at different expressions.

As David pointed out, quite rightly, there are cases where the built-ins won't cut it, as in '\[\d+[\:\-*\d+]+\]' being a very simple pattern-match argument where there may be several strings matching this pattern in a given input string. Converting this into LIKE/PATINDEX/CHARINDEX/et al, likely with WHILE loops, would make for horrendous code that a simple CLR pattern-matching function eats for breakfast (see the sketch after this paragraph).
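To show just how horrendous, here is a rough sketch of pulling every bracketed match out of a single string in pure T-SQL, using a simplified version of that pattern (the CLR equivalent would be a one-line Regex.Matches call): -

DECLARE @input varchar(8000) = 'see [12:34] and [5:6-78] for details';
DECLARE @pos int, @len int, @work varchar(8000) = @input;

-- Inch along the string, carving out one bracketed token per pass
WHILE PATINDEX('%[[][0-9]%]%', @work) > 0
BEGIN
    SET @pos = PATINDEX('%[[][0-9]%]%', @work);
    SET @len = CHARINDEX(']', @work, @pos) - @pos + 1;
    PRINT SUBSTRING(@work, @pos, @len);               -- one match, e.g. [12:34]
    SET @work = STUFF(@work, 1, @pos + @len - 1, ''); -- discard through the match
END;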

Thanks for an excellent article, David. I wish it had been published about two months ago!

  • borisk 57245 (5/8/2012)


    Jeff Moden (5/8/2012)


    borisk 57245 (5/8/2012)


RegEx is the best tool for finding any pattern within a given string (I did a couple of tests, including .NET string manipulation and the limited regex support in SQL via LIKE/PATINDEX...), and it proved to be much faster than any other way.

Cool! Prove it. Show us the test code you used.

Check Attachment ...

Input: 4,000 characters, comma-delimited

Test: tokenizer split with SQLCLR (RegEx.Matches, String.Split) and a T-SQL split

Run time - String.Split
SQL Server Execution Times: CPU time = 0 ms, elapsed time = 2 ms.

Run time - RegEx.Matches
SQL Server Execution Times: CPU time = 0 ms, elapsed time = 3 ms.

Run time - TSQL Tokenizer
SQL Server Execution Times: CPU time = 577 ms, elapsed time = 589 ms.

nothing to add basically 🙂

As for now, I could not find any reason not to use SQLCLR RegEx or other CLR functions, other than readability; PATINDEX and LIKE are much less efficient in both performance and versatility...

    Results are one thing but without the code to support them, they mean absolutely nothing. Please post the code for your functions.

    --Jeff Moden



  • Cadavre (5/8/2012)


    Let's create some sample data to play with first.

    I should have known it would be you. 🙂 Very well done, Craig! Thanks for stepping up to the plate!

    --Jeff Moden



  • SAinCA (5/8/2012)


For anyone interested, I found http://regexpal.com/ very useful when building and testing regexes.

I've always used Expresso (http://www.ultrapico.com/Expresso.htm) - it generates code for a bunch of platforms, has a great tutorial, and other stuff.

As for the pissing match - meh. There are times to use stuff and times not to. I've never been a fan of the religious fanatics on either the ALWAYSUSEIT or the NEVERUSEIT side. Both sides seem to me to substitute easy-to-remember rules for actually thinking about the needs of the current task.

    -sff

  • Jeff Moden (5/8/2012)


    david.rowland (5/7/2012)


    Although RegEx is powerful, I have tried using it once and found that coding my own CLR function to manipulate strings to what I want was 100 times faster than the RegEx function was.

    If you are running the code against a small dataset this should not matter however I was running the code against millions of rows and this had a major impact on performance.

    What did your custom CLR do to beat Regex so soundly?

    It was a few years ago so cannot remember the full details.

I was building a database which was sourced from several others and required deduplication of the data to create a single list of unique names. As each database was sourced from different places, the process first had to standardise the names into a common form and use advanced string matching like Ratcliff/Obershelp to match.

The BA used a RegEx CLR function to help come up with rules to manipulate the strings on a small dataset, which worked well; however, the full dataset ran to millions of rows. When I ran the same RegEx on the large datasets the performance was terrible and in some cases could have taken days to run. By coding my own string manipulation functions in CLR I was able to cut the run time down to a couple of hours at most, some runs to less than an hour.

I never really had a choice, as the final process had a limited window to run in and, with quite a number of steps to run, I had to get the whole process running as fast as possible.

EDIT: I should add that access to the RegEx function was useful for coming up with string manipulation and matching rules. Allowing access to this function for analysts who cannot build their own CLR functions, as I could, also has benefits.

I will try to get my code working as fast as possible, and if RegEx is proven to work then I will use it; however, if I can improve performance greatly by writing my own CLR functions, I will. If RegEx were only 5% slower than my own function I would have used the RegEx one simply for ease of maintenance, but we are talking about a difference of quite a large margin.

I am lucky to be a DBA with a lot of years in both database design and systems programming, so I am able to code C# well. The approach I took might not suit all DBAs.

As I said, performance is always the issue with regular expressions. They can be very handy, but I work with large databases and RegEx is not coming in!

  • Jeff Moden (5/8/2012)


    borisk 57245 (5/8/2012)


    Jeff Moden (5/8/2012)


    borisk 57245 (5/8/2012)


RegEx is the best tool for finding any pattern within a given string (I did a couple of tests, including .NET string manipulation and the limited regex support in SQL via LIKE/PATINDEX...), and it proved to be much faster than any other way.

Cool! Prove it. Show us the test code you used.

Check Attachment ...

Input: 4,000 characters, comma-delimited

Test: tokenizer split with SQLCLR (RegEx.Matches, String.Split) and a T-SQL split

Run time - String.Split
SQL Server Execution Times: CPU time = 0 ms, elapsed time = 2 ms.

Run time - RegEx.Matches
SQL Server Execution Times: CPU time = 0 ms, elapsed time = 3 ms.

Run time - TSQL Tokenizer
SQL Server Execution Times: CPU time = 577 ms, elapsed time = 589 ms.

nothing to add basically 🙂

As for now, I could not find any reason not to use SQLCLR RegEx or other CLR functions, other than readability; PATINDEX and LIKE are much less efficient in both performance and versatility...

    Results are one thing but without the code to support them, they mean absolutely nothing. Please post the code for your functions.

OK, NP... kinda new to SQL forums 😉

    * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

Tokenizer SQLCLR, using the String object:

using System;
using System.Collections;
using System.Data.SqlTypes;
using Microsoft.SqlServer.Server;

[SqlFunction(FillRowMethodName = "FillRowTokenizer", TableDefinition = "value NVARCHAR(2000)")]
public static IEnumerable Tokenizer(SqlString input)
{
    // Split the input on commas; each array element becomes one row
    return input.Value.Split(',');
}

public static void FillRowTokenizer(Object obj, [param: SqlFacet(MaxSize = 1000)] out SqlString matchString)
{
    matchString = ((String)obj);
}

    * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

Tokenizer SQLCLR, using RegEx (also requires: using System.Text.RegularExpressions;):

[SqlFunction(FillRowMethodName = "FillRowMatchList", TableDefinition = "MatchString NVARCHAR(1000)")]
public static IEnumerable MatchList(SqlString input, [param: SqlFacet(MaxSize = 1000)] SqlString pattern)
{
    // Return every regex match; each Match becomes one row
    MatchCollection matches =
        Regex.Matches(input.ToString(), pattern.ToString(),
                      RegexOptions.IgnoreCase | RegexOptions.IgnorePatternWhitespace);
    return matches;
}

public static void FillRowMatchList(Object obj, [param: SqlFacet(MaxSize = 1000)] out SqlString matchString)
{
    matchString = ((Match)obj).Value;
}

    * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * * *

T-SQL Tokenizer:

CREATE FUNCTION [dbo].[fn_Tokenizer]
    ( @List varchar(MAX) )
RETURNS @ListTable TABLE
    ( value varchar(255) )
AS
BEGIN
    IF RTRIM(@List) = ''
        RETURN;

    DECLARE @pointer int, @previous_pointer int;

    SET @previous_pointer = 0;
    SET @pointer = CHARINDEX(',', @List, @previous_pointer);

    WHILE ( @pointer > 0 )
    BEGIN
        INSERT INTO @ListTable
        SELECT LTRIM(RTRIM(SUBSTRING( @List, @previous_pointer + 1,
                                      @pointer - @previous_pointer - 1 )));

        SET @previous_pointer = @pointer;
        SET @pointer = CHARINDEX(',', @List, @previous_pointer + 1);
    END;

    IF @previous_pointer > 0
        SET @previous_pointer = @previous_pointer + 1;

    INSERT INTO @ListTable
    SELECT LTRIM(RTRIM(SUBSTRING( @List, @previous_pointer,
                                  LEN(@List) - @previous_pointer + 1 )));

    RETURN;
END;
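For completeness, a quick usage sketch, assuming the two CLR methods above are deployed as table-valued functions named dbo.Tokenizer and dbo.MatchList: -

-- All three should return the same three rows: a, b, c
SELECT value       FROM dbo.Tokenizer(N'a,b,c');
SELECT MatchString FROM dbo.MatchList(N'a,b,c', N'[^,]+');
SELECT value       FROM dbo.fn_Tokenizer('a,b,c');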

  • borisk 57245 (5/9/2012)


OK, NP... kinda new to SQL forums

    NP. Thanks for the post and it turns out to be pretty much as I suspected. You use an "inch worm" While Loop technique to do the splitting in T-SQL. Please see the following article. It won't beat a proper SQLCLR splitter but it'll come a whole lot closer than the While loop version. It might, however, give Regex a real run for its money.

    The one to use isn't actually from the article itself. Yet another improvement was discovered after the article was published and the improved functions are included in the "Resources" link near the end of the article.

    http://www.sqlservercentral.com/articles/Tally+Table/72993/
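For anyone who doesn't want to chase the link straight away, the core idea is a set-based splitter driven by a tally of character positions instead of a loop. A minimal sketch in that spirit (the DelimitedSplit8K function in the article's Resources link is considerably more refined and faster): -

-- Minimal inline tally splitter sketch (8K varchar, single-character delimiter).
CREATE FUNCTION dbo.fn_TallySplit (@List varchar(8000), @Delim char(1))
RETURNS TABLE
AS
RETURN
WITH E1(N) AS (SELECT 1 FROM (VALUES (1),(1),(1),(1),(1),(1),(1),(1),(1),(1)) t(N)),
     E4(N) AS (SELECT 1 FROM E1 a, E1 b, E1 c, E1 d), -- up to 10,000 rows
     Tally(N) AS (SELECT TOP (DATALENGTH(ISNULL(@List,'')))
                         ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4)
-- Every position that starts an element: position 1, or the character after a delimiter
SELECT Item = SUBSTRING(@List, N,
                        ISNULL(NULLIF(CHARINDEX(@Delim, @List, N), 0) - N, 8000))
FROM Tally
WHERE N = 1 OR SUBSTRING(@List, N - 1, 1) = @Delim;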

    --Jeff Moden



  • To Cadavre:

I do not dispute your results, but I will dispute your test itself on the grounds that it is not relevant to a situation where someone should or would use Regular Expressions. As David mentioned in his reply, using Regular Expressions (or any SQLCLR functionality or XP) is a matter of accomplishing something that in straight T-SQL would be: impossible, slower, or so complex that it is unreadable / unmaintainable / error-prone. So, if you want to do a real test (something that I would never dissuade someone from doing), then at bare minimum you need to test something that is not possible with simple T-SQL. So take the example from the article of checking UK postal codes. There are 6 patterns and the article provides the Regular Expression to match them. Just do the equivalent using LIKE and/or PATINDEX and then we have something to talk about. Of course, even if that is faster in T-SQL, there is still the point again that using Regular Expressions is not a means to replace LIKE, PATINDEX, etc., as I noted above.


    To borisk 57245:

    Along those same lines, I would say that the tokenizer / split test is equally invalid due to it being a scenario that absolutely should not be done with RegEx. If the splitting is done on a simple character or string, then there is no reason at all to use a RegEx split. A RegEx split is only done when splitting something that simply cannot be done via String.Split() in SQLCLR or the XML / Tally-Table methods noted by Jeff in his reply.
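As an illustration of where a regex split is legitimate, consider splitting on commas only when they fall outside double quotes; String.Split(',') would break the quoted field, whereas a lookahead pattern handles it. A sketch using SQL#'s RegEx_Split (the argument shape here is assumed by analogy with the RegEx_IsMatch calls elsewhere in this thread; check your version's exact signature): -

DECLARE @csv nvarchar(4000) = N'red,"blue,green",yellow';

-- Split on commas followed by an even number of quote marks, i.e. commas outside quotes
SELECT *
FROM SQL#.RegEx_Split(@csv, N',(?=(?:[^"]*"[^"]*")*[^"]*$)', -1, N'None');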


    To Jeff Moden, Jeffrey Bijkerk, et al:

There is really no point in arguing the merits of Regular Expressions on the basis of performance when performance is not the reason to use Regular Expressions. Again, this is a tool to extend capabilities when standard T-SQL either cannot do something or cannot do it in a reasonable manner. It is not being said that everyone must, or even should, use this tool. And it is absolutely not being said that anything should go untested. Testing and education on proper usage should be implied, or else we need to update the MSDN documentation to include these caveats on more BOL pages than just Index Hints; cursors, table variables, sql_variant, GOTO statements, etc. are all things to generally be avoided but that do have purposes, even if few and far between. The issue at hand is that this is a tool that DBAs should at least be aware of, as it can come in handy in many situations.

    P.S. The mere fact that it will be misused by some does not imply much given places that have eight tables with 100+ columns and that SQL Server actually allows for, what, 999 non-clustered indexes on a table ;-).

    Take care,

    Solomon...

SQL# - https://SQLsharp.com/ (SQLCLR library of over 340 functions and procedures)
Sql Quantum Lift - https://SqlQuantumLift.com/ (company)
Sql Quantum Leap - https://SqlQuantumLeap.com/ (blog)
Info sites: Collations • Module Signing • SQLCLR

  • Solomon Rutzky (5/9/2012)


    To Cadavre:

I do not dispute your results, but I will dispute your test itself on the grounds that it is not relevant to a situation where someone should or would use Regular Expressions. As David mentioned in his reply, using Regular Expressions (or any SQLCLR functionality or XP) is a matter of accomplishing something that in straight T-SQL would be: impossible, slower, or so complex that it is unreadable / unmaintainable / error-prone. So, if you want to do a real test (something that I would never dissuade someone from doing), then at bare minimum you need to test something that is not possible with simple T-SQL. So take the example from the article of checking UK postal codes. There are 6 patterns and the article provides the Regular Expression to match them. Just do the equivalent using LIKE and/or PATINDEX and then we have something to talk about. Of course, even if that is faster in T-SQL, there is still the point again that using Regular Expressions is not a means to replace LIKE, PATINDEX, etc., as I noted above.

Thanks for the reply, although I think I need more convincing. I don't see it as a "win" for RegEx that the task is more complicated to do in T-SQL. I'd always rather have a fast method than a slow one; complications can be well documented, and everything goes through rigorous testing to ensure bugs are found.

    Take the post code RegEx, for example. Here's 1,000,000 rows of sample data: -

SET NOCOUNT ON;

IF OBJECT_ID('dbo.testEnvironment') IS NOT NULL
BEGIN
    DROP TABLE dbo.testEnvironment;
END;

-- Build 1,000,000 rows shaped like UK post codes, e.g. 'AB12 3CD'
WITH t1(N) AS (SELECT 1 UNION ALL SELECT 1),
t2(N) AS (SELECT 1 FROM t1 x, t1 y),
t3(N) AS (SELECT 1 FROM t2 x, t2 y),
Tally(N) AS (SELECT TOP 51 ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM t3 x, t3 y),
Tally2(N) AS (SELECT TOP 26 ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM t3 x, t3 y),
letters(N) AS (SELECT LTRIM(RTRIM(SUBSTRING(c.poss,a.N,2)))+LTRIM(RTRIM(SUBSTRING(d.poss,b.N,1)))
               FROM Tally a
               CROSS JOIN Tally b
               CROSS APPLY (SELECT 'A B C D E F G H I J K L M N O P Q R S T U V W X Y Z') c(poss)
               CROSS APPLY (SELECT 'A B C D E F G H I J K L M N O P Q R S T U V W X Y Z') d(poss)),
bignumbers(N) AS (SELECT TOP 100 CAST((ABS(CHECKSUM(NEWID())) % 99) + 1 AS VARCHAR(2)) FROM t3 x, t3 y),
smallnumbers(N) AS (SELECT TOP 100 CAST((ABS(CHECKSUM(NEWID())) % 9) + 1 AS VARCHAR(1)) FROM t3 x, t3 y)
SELECT TOP 1000000 IDENTITY(INT,1,1) AS ID, a.N+b.N+' '+c.N+d.N AS postCode
INTO dbo.testEnvironment
FROM letters a
CROSS JOIN bignumbers b
CROSS JOIN smallnumbers c
CROSS JOIN letters d;

    Now, let's take a look at the RegEx

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT COUNT(*)
FROM dbo.testEnvironment
WHERE SQL#.RegEx_IsMatch(postCode,'[A-Z]{1,2}\d([A-Z]|\d){0,1} \d[A-Z]{2}',1,'IgnoreCase') = 1;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;

Results: -

-----------
530604

Table 'testEnvironment'. Scan count 1, logical reads 2964, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 52625 ms, elapsed time = 54294 ms.

    OK, now a T-SQL alternative.

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT SUM(matches)
FROM (SELECT MAX(PATINDEX(patterns,postCode)), ID
      FROM (VALUES ('[A-Z][A-Z0-9] [0-9][A-Z][A-Z]'),
                   ('[A-Z][A-Z0-9][0-9] [0-9][A-Z][A-Z]'),
                   ('[A-Z][A-Z][0-9] [0-9][A-Z][A-Z]'),
                   ('[A-Z][A-Z][0-9][0-9] [0-9][A-Z][A-Z]'),
                   ('[A-Z][0-9][A-Z] [0-9][A-Z][A-Z]'),
                   ('[A-Z][A-Z][0-9][A-Z] [0-9][A-Z][A-Z]')) regEx(patterns)
      CROSS JOIN dbo.testEnvironment
      GROUP BY ID
      HAVING MAX(PATINDEX(patterns,postCode))>0) searchMatch(matches,ID);

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;

Is that really more difficult to read? It doesn't look like it to me; it looks pretty simple. I agree that you end up with more lines of code, but as we all know, less code doesn't mean more efficient code.

So, how about the T-SQL results?

-----------
530604

Table 'Worktable'. Scan count 2, logical reads 11576074, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'testEnvironment'. Scan count 4, logical reads 11856, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 22672 ms, elapsed time = 6245 ms.

    OK, we'll add some indexes as we did last time and try again.

    CREATE CLUSTERED INDEX cidx_testEnvironment ON dbo.testEnvironment (ID);

    CREATE NONCLUSTERED INDEX ncidx_testEnvironment ON dbo.testEnvironment (postCode);

    So, RegEx: -

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT COUNT(*)
FROM dbo.testEnvironment
WHERE SQL#.RegEx_IsMatch(postCode,'[A-Z]{1,2}\d([A-Z]|\d){0,1} \d[A-Z]{2}',1,'IgnoreCase') = 1;

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;

Results: -

-----------
530604

Table 'testEnvironment'. Scan count 1, logical reads 2606, physical reads 0, read-ahead reads 16, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 52344 ms, elapsed time = 54336 ms.

    And T-SQL: -

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT SUM(matches)
FROM (SELECT MAX(PATINDEX(patterns,postCode)), ID
      FROM (VALUES ('[A-Z][A-Z0-9] [0-9][A-Z][A-Z]'),
                   ('[A-Z][A-Z0-9][0-9] [0-9][A-Z][A-Z]'),
                   ('[A-Z][A-Z][0-9] [0-9][A-Z][A-Z]'),
                   ('[A-Z][A-Z][0-9][0-9] [0-9][A-Z][A-Z]'),
                   ('[A-Z][0-9][A-Z] [0-9][A-Z][A-Z]'),
                   ('[A-Z][A-Z][0-9][A-Z] [0-9][A-Z][A-Z]')) regEx(patterns)
      CROSS JOIN dbo.testEnvironment
      GROUP BY ID
      HAVING MAX(PATINDEX(patterns,postCode))>0) searchMatch(matches,ID);

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;

Results: -

-----------
530604

Table 'testEnvironment'. Scan count 5, logical reads 3252, physical reads 0, read-ahead reads 10, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
CPU time = 3189 ms, elapsed time = 2800 ms.

I just can't see how anyone could justify the RegEx method when T-SQL can net you something that is so much faster.


    Forever trying to learn
    My blog - http://www.cadavre.co.uk/
    For better, quicker answers on T-SQL questions, click on the following...http://www.sqlservercentral.com/articles/Best+Practices/61537/
    For better, quicker answers on SQL Server performance related questions, click on the following...http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

I'll bite, as I use regexes a lot for cases like this (again, on recordsets of over a million rows, too).

The PATINDEX is fine, but now isolate the matched substring.

The regex functions on my DBs are just these...

regextest()
regexmatchgroup()
regexreplace()

Me, personally, I'd rather modify a regex pattern in the future by just adding more cases to the pattern expression than add more and more layers of T-SQL if-then-else logic, substring functions, etc., especially as the pattern gets complicated with look-aheads or look-behinds (e.g., match only if the following is also like...). With the matchgroup function, it becomes trivial to pick out the matched values. And regexreplace can replace some pretty rococo T-SQL code doing the same thing.
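Something along these lines; the signatures are assumptions (the actual functions aren't shown here), but the shape of the calls is the point: -

-- Hypothetical signatures mirroring the function names above:
--   regextest(input, pattern)          -> 1 if the pattern matches, else 0
--   regexmatchgroup(input, pattern, n) -> capture group n of the first match
--   regexreplace(input, pattern, repl) -> input with every match replaced
SELECT dbo.regextest(note, '\[\d+:\d+\]')                  AS hasToken,
       dbo.regexmatchgroup(note, '\[(\d+):(\d+)\]', 1)     AS firstNumber,
       dbo.regexreplace(note, '\[\d+:\d+\]', '[REDACTED]') AS cleaned
FROM (VALUES ('see [12:34] for details')) v(note);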

Regexes have a sweet spot that, while it is possible to do the same work in other ways (and not just as an adjunct to T-SQL), can be very hard to beat, both in performance and in coding/upkeep. Yes, it's possible to write some stinker regexes that bring things to a halt.

    We should all try not to be too dogmatic about things. Even CURSORs and cross-joins have their uses, after all (and not everything tastes better with tabasco sauce on it).
