RegEx for DBAs

  • Comments posted to this topic are about the item RegEx for DBAs

  • Very nice Article.:-)

  • Although RegEx is powerful, I have tried using it once and found that coding my own CLR function to manipulate strings to what I want was 100 times faster than the RegEx function was.

    If you are running the code against a small dataset this should not matter however I was running the code against millions of rows and this had a major impact on performance.

  • Great and very nicely put together article, David. My hat is off to you. But what of performance? Have you done any comparisons between the Regex function and a moderately complex PatIndex?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • david.rowland (5/7/2012)


    Although RegEx is powerful, I have tried using it once and found that coding my own CLR function to manipulate strings to what I want was 100 times faster than the RegEx function was.

    If you are running the code against a small dataset this should not matter however I was running the code against millions of rows and this had a major impact on performance.

    What did your custom CLR do to beat Regex so soundly?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • I've used our own in-house implementations of some of the .Net RexEx functions on seemingly large datasets (several million records each), mostly to help clean up, validate and parse addresses, city-state-zip (or province), etc.

    Beats pure SQL functions like Hulk did to Loki in the movie, especially in understandability. I'll take teasing a complicated Rexex pattern over scads of nested SQL functions trying to do the same thing.

    one I really have done lots of is to do is keep expanding on a pattern like this: '(Company|corp|corporation|inc|llp|llc...)'

    beats continuing to add "or field like '%new_match%' ..."

  • RegEx is Thee best tool in order to find any pattern within given string[did couple of test including .Net String manipulation and limited RegEx in SQL , Like/PATIndex....], and it proves to be much faster than any other way.

    in regards to good editor, try this one: http://www.ultrapico.com/Expresso.htm

    moreover, consider using FullText indexing from time to time, as it might even ease searching for Words other than using RegEx.

  • If you use regex combined with a where clause it can not use the index anymore. I think this is very important.

  • Am I out of line in thinking this is something that should be handled in a high-level object-oriented language?

  • I'm not particularly familiar with the CLR--does processing stuff via RegEx in this way effectively mean you're using RBAR, or is it a bit cleverer than that?

  • To david.rowland and Jeff Moden:

    With regards to performance, whether compared to LIKE / PATINDEX or even a custom SQLCLR function, please keep in mind that the main "win" here for Regular Expressions is functionality (and to a lesser degree: readability). Yes, a custom .Net string parsing function will be faster than the generalized Regular Expression functions, but there are pros and cons to most decisions. Likely people have a great variety of patterns to look for and it might not be feasible to have 30 or so custom functions. But it would be quite reasonable to do general matching via RegEx and when you find a place that truly needs performance, then make that situation a custom solution, just as you (david) did. I mean, if you absolutely need the performance then by all means take the approach that you did, but that does not take away from the overall benefit of people using Regular Expressions more.

    And regarding the PATINDEX comparison that Jeff asked for: along those same lines, even if the performance were slightly better with a combination of LIKE, PATINDEX, and SUBSTRING, for many patterns the sheer number of those functions required to do the same basic task as RegEx would render a straight T-SQL solution entirely unreadable or sometimes even impossible. Given that PATINDEX and LIKE can only do VERY simplistic patterns (maybe 3 of the 30+ RegEx functions) it would be hard to compare them fairly.

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • Jeffrey Bijkerk (5/8/2012)


    If you use regex combined with a where clause it can not use the index anymore. I think this is very important.

    Hi Jeffrey. It is important to a degree. Sometimes you need to do queries that are not going to use an index, just like using a LIKE statement where the pattern starts with % or _. Also, not all columns are indexed. So, this alone is not a reason to not use Regular Expressions, but it is a reason to not use ANY function in some situations where performance is critical.

    Take care,

    Solomon...

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • I completely agree, Solomon... that's why I'm discussing the "con" side of this.

    What I've found is that people will take some nice tools like you've developed and use them to make up for their lack of knowledge in areas like T-SQL and sometimes it just doesn't work out. For example, before I got there, the folks at work had 8 fairly wide tables (>100 columns). As if that weren't bad enough, they wanted to audit changes to an "EAV" style of audit table because the changes would usually only happen to a small handful of columns from any given table. Because they didn't want to write IF UPDATE() for every column in the trigger, they used a generic SQLCLR trigger on the tables... that would take nearly 4 minutes to audit changes to only 5 columns on a paltry 10,000 rows. The T-SQL replacements for those triggers work nearly instantaneously for 100,000 rows.

    Now, I'm not saying that you've written "slow" code because I don't actually know and neither does the author because no one has done a performance test in this article. Yes, yes... I understand the grand utility of modular general purpose code. My general caution to people is to test any new thing they're going to add to their servers before they add it and that's my intent here because it really might not be worth the trade off in performance... or it might.

    Like my ol friend Sergiy says, "A Developer must not guess. A Developer must KNOW!"

    So, what say yea? How about some performance tests for those things that can also be done in T-SQL?

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • borisk 57245 (5/8/2012)


    RegEx is Thee best tool in order to find any pattern within given string[did couple of test including .Net String manipulation and limited RegEx in SQL , Like/PATIndex....], and it proves to be much faster than any other way.

    Cool! Prove it. Show us the test code you used.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • williams.frd (5/8/2012)


    Am I out of line in thinking this is something that should be handled in a high-level object-oriented language?

    Hello. I am going to say "yes", but I am not being flippant about it ;-). Here are my two reasons:

    1) SQL already includes similar albeit less functional equivalents in the form of: LIKE, PATINDEX, SUBSTRING, and REPLACE.

    2) In terms of using Regular Expressions in WHERE clauses, this is something that SHOULD be in the DB layer as pulling back hundreds of thousands of rows to the app just to do filtering there is rather inefficient.

    Take care,

    Solomon...

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

Viewing 15 posts - 1 through 15 (of 57 total)

You must be logged in to reply to this topic. Login to reply