RegEx for DBAs

  • You are half right Jeff,

    We'll leave out the vagueness of the term "Big Data" which is more marketing than useful. I think Buck Woody has the closest to a useful definition http://blogs.msdn.com/b/buckwoody/archive/2011/10/18/big-data-and-the-cloud-more-hype-or-a-real-workload.aspx

    Volume and velocity of data are the bits that will be most familiar to DBAs. Variety and variability are the bits where we start pushing the envelope.

    Microsoft are making a big play around SQOOP allowing SQL2012 to play nicely with Hadoop and I think DBAs in the DW space need to start looking at Hadoop.

    For me the most interesting aspect of "Big Data" is solving the problems of mining useful information out of the "unstructured" stuff. For example, mining bulleting boards and forums to derive useful and leveragable information by means of an automated process. To be able to do that sort of stuff you need "Big Brains" and this is where the discipline of "Data Scientist" comes in.

    Text parsing and term extraction form part of it and this is where RegEx becomes useful. You and others are absolutely right to point out that it has a big performance penalty and I can't imagine anyone sane using it in the data tier for OLTP. In a data warehouse, and in particular the staging area of a data warehouse it is a very powerful weapon.

    If you are worried about performance then you'll blow a gasket when you see what a data profiler does to a database server. Horrible it may be but entirely necessary in the DW space.

    Steve Jones posted an editorial a while back on the evolution and possible future evolution of SQL Server that used the analogy of Grandpa's axe. It's had 4 handles and 5 heads but it is still Grandpa's axe. Similarly SQL Server started out as a RDBMS but is now touching all sorts of things outside of the RDBMS space but is still SQL Server.

  • I used to work in the "E Discovery" business where millions of documents where compared and screened for similarities, key words, and other patterns. We had tools such as Equivio that would do all of that and tools such a Lucene for certain full text searches and other pattern recognition. Those tools may have used Regex but it seems to may that with the speed they actually had, they probably used something a whole lot more directed and effecient and that makes a whole lot of sense because code to do something specific is almost always a whole lot faster than generic code.

    I appreciate the thoughts you folks have written down but, except for possible ad hoc one off searches where performance doesn't really seem to matter to some folks, I'm not seeing much that I like in Regex. I'll keep an open mind about certain aspects of it but I'm not seeing it as a valuable large volume processing tool.

    Heh... and you might be able to cut yourself or the guy standing behind you when you swing a 4 handled, 5 headed axe and it'll look really cool on the shelf, but you're not going to cut much wood with it. 😛

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Gentlemen... so sorry I'm late to the party!

    Paul White recently turned me on to SQL# and I just happened to do a comparison of the SQL# RegEx string splitter against a pattern splitting function in the article I authored (fourth link in my signature line).

    You can see those performance results here: http://www.sqlservercentral.com/Forums/Topic1390297-3122-5.aspx

    Just hoping to make a very minor contribution to this most interesting article and thread. Thanks to a good friend (you know who you are) for tuning me in to this.


    My mantra: No loops! No CURSORs! No RBAR! Hoo-uh![/I]

    My thought question: Have you ever been told that your query runs too fast?

    My advice:
    INDEXing a poor-performing query is like putting sugar on cat food. Yeah, it probably tastes better but are you sure you want to eat it?
    The path of least resistance can be a slippery slope. Take care that fixing your fixes of fixes doesn't snowball and end up costing you more than fixing the root cause would have in the first place.

    Need to UNPIVOT? Why not CROSS APPLY VALUES instead?[/url]
    Since random numbers are too important to be left to chance, let's generate some![/url]
    Learn to understand recursive CTEs by example.[/url]
    [url url=http://www.sqlservercentral.com/articles/St

  • OLD School (compliments of Stack Overflow forum) for less fortunate users who do not allow to have a CLR-enabled SQL Server:

    http://stackoverflow.com/questions/15523789/sql-email-verification-function-using-regex

    Solution is based on legacy COM VBScript library, but still works.

    My $0.02 - I personally would use Regex expression for data quality check, in production just to validate a scalar values. 😎

  • Nice article and nice experiment, but unfortunately from security perspective I am not able to even give the needed SQL# a try simply because SQL# has lots highly restricted features like CLR, Assembly, ...

    But good work!

  • halifaxdal (9/5/2014)


    Nice article and nice experiment, but unfortunately from security perspective I am not able to even give the needed SQL# a try simply because SQL# has lots highly restricted features like CLR, Assembly, ...

    Hi there. I am the creator of SQL# and I was just curious as to what the security concern is? The Regular Expression functions operate in the default SAFE setting. While the SQL# library does contain functions and procedures that do require EXTERNAL_ACCESS, or in a few cases, UNSAFE, those are optional and many people choose to only use the functionality available when all of the assemblies are set to SAFE (which includes all of the RegEx and String functions). I realize that the concern is with SQLCLR in general and not specific to SQL#, but I thought I would at least clarify.

    Regarding running CLR Integrated code in general, the concern about it presenting a security risk is most often a misplaced concern. There is unfortunately a wide-spread lack of understanding of what CLR-based code can and cannot do and of the security mechanisms in place. I have written an article, also here on SSC, that addresses this issue specifically and shows that CLR-based code can be very secure, and is by default:

    Stairway to SQLCLR Level 3: Security (General and SAFE Assemblies)[/url]

    Take care,

    Solomon..

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • Your removing duplicate spaces pattern should be ' +'

    The pattern in your article only has a single space followed by a plus character. Such a pattern matches one or more consecutive space characters. What you want to match are two or more consecutive space characters. Your pattern will replace single space characters with single space characters, which is very inefficient.

  • David Poole wrote:

    I had two specific projects that involved parsing and extracting content from HTML, XML and MS Word documents then converting them into content fragments within a content management system.

    Good article. I completely agree RegEx is a highly useful tool; but when XML is involved, the extracting and transforming of name/value pairs is often the initial objective. For that, I would consider using T-SQL's implementation of XQuery.

  • corey lawson (5/9/2012)


    I'll bite, as I use regex's much for cases like this ...

    The patindex is fine, but now, isolate the matched substring ...

    Me, personally, I'd rather modify a regex pattern in the future by just adding more cases to the pattern expression rather than adding more and more layers of TSQL if-then-else logic, substring functions, etc, especially as the pattern perhaps gets complicated with look-aheads or look-behinds (e.g., match if the following is also like...). With the matchgroup function, it becomes trivial to pick out the matched values. And regexreplace can replace some pretty roccoco TSQL code doing the same thing.

    I agree. It's about the right tool for the job. There are some cases where running one regex command to pattern match and cleanse the data is much more efficient than trying to code this in TSQL. In some cases TSQL is like using a butter knife on a flathead screw instead of a drill with flathead bit.

  • I liked this article and in fact I just leveraged a CLR RegEx in one of the databases our team supports, in order to generate an XML extract.

    In a perfect world, the data in the tables would be clean and pristine. Over here in the trenches of this world, I had an extract to get out and all kinds of illegal characters in multiple, concantanated varchar(max) fields breaking the interface. The RegEx CLR was already in the database (an I now know why!). I used it. It was actually faster for me to pattern match and eliminate the characters, than it was to keep adding replace statements. I won't even get into the readability factor. Granted, these are datasets of 10K or less records, but for this purpose it was the right tool and performance isn't an issue.

    I did have some time to look at the SQL# code. I assume that the hex in create assembly statements are the actual assemblies themselves, meaning there is no high-level code to read through? Admittedly, the CLR's do work black box, unless you happen to know where to find the code used to create the assemblies. I just can't see being able to load black-box, third-party code like this in our environment.

    Nice job though with the article and the SQL#!

  • boumerlin (9/11/2014)


    I did have some time to look at the SQL# code. I assume that the hex in create assembly statements are the actual assemblies themselves, meaning there is no high-level code to read through? Admittedly, the CLR's do work black box, unless you happen to know where to find the code used to create the assemblies. I just can't see being able to load black-box, third-party code like this in our environment.

    Hi there. You are correct about the hex code in the CREATE ASSEMBLY statements: it is the actual Assembly. And correct, there is no source code that is publicly available. I am not sure why that would be an impediment for most users (outside of "sensitive" systems possibly), but I do understand that some people feel that way. If you have any specific questions, feel free to contact me via Private Message or Email from my forum profile.

    Take care,

    Solomon...

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

  • halifaxdal (9/5/2014)


    Nice article and nice experiment, but unfortunately from security perspective I am not able to even give the needed SQL# a try simply because SQL# has lots highly restricted features like CLR, Assembly, ...

    But good work!

    Like I said you could this in COM way without enabling CLR in your SQL Server;-)

    http://stackoverflow.com/questions/15523789/sql-email-verification-function-using-regex

  • fregatepllada (9/11/2014)


    halifaxdal (9/5/2014)


    Nice article and nice experiment, but unfortunately from security perspective I am not able to even give the needed SQL# a try simply because SQL# has lots highly restricted features like CLR, Assembly, ...

    Like I said you could this in COM way without enabling CLR in your SQL Server;-)

    http://stackoverflow.com/questions/15523789/sql-email-verification-function-using-regex%5B/quote%5D

    But then you still have to enable 'OLE Automation', which is obsolete, slower, and less secure than 'CLR Integration'. And the user needs to be in the sysadmin server role. I am not sure of there being any benefit to using the sp_OA* functions, nor do I suspect that any company concerned about "security implications" of CLR would be ok with enabling the OLE Automation stuff.

    SQL#https://SQLsharp.com/ ( SQLCLR library ofover 340 Functions and Procedures)
    Sql Quantum Lifthttps://SqlQuantumLift.com/ ( company )
    Sql Quantum Leaphttps://SqlQuantumLeap.com/ ( blog )
    Info sitesCollations     •     Module Signing     •     SQLCLR

Viewing 13 posts - 46 through 57 (of 57 total)

You must be logged in to reply to this topic. Login to reply