SQLServerCentral Article

RegEx for DBAs

,

I was introduced to Regular Expressions or RegEx when I was wearing my non-DBA hat in a content management consultancy.  I had two specific projects that involved parsing and extracting content from HTML, XML and MS Word documents then converting them into content fragments within a content management system.

Of course I could have achieved the same ends using  the ECMA script equivalent of LIKE, PATINDEX, CHARINDEX, SUBSTRING and REPLACE but RegEx offered a number of advantages: 

  • A rich and varied means of pattern matching strings albeit with a steep learning curve
  • Wild card replace functionality
  • The ability to split strings and handle the individual elements separately
  • A simpler, less error prone means of parsing those strings.

For much of my time as a DBA the traditional string handling functions have been more than adequate for what I needed to do so my RegEx skills have been rusting quietly for 6 years.

With the advent of a big data warehousing project RegEx suddenly leapt back to the fore as a useful skill to have. If I can liken the SQL functions mentioned earlier to a Ford Focus diesel estate then RegEx is an all-options checked Audi A7 3.0TDI. It is possible to drive the Audi and get 50+ mpg but if you make full use of its capabilities you will see it dip below 30mpg.  The same applies to RegEx.  You can produce RegEx that is fairly efficient, but it is also possible to write expressions that are extremely expensive to run.

Nevertheless I've often wished SQL Server had RegEx functions so when SQL2005 introduced CLR functions that was on top of my list of additions to my SQL toolbelt.

SQL# and AdventureWorks

Solomon Rutzky put together an incredibly useful toolkit of CLR functions known as SQL# and several RegEx functions are included in that toolkit.

The toolkit has a community edition that is free and a commercial offering that has additional functionality.  It is distributed as a SQL file that you simply run in the database in which you wish to implement the toolkit.

There is also a well written manual for the SQL# toolkit with examples that act as a good primer to help you get started with RegEx.

In this article I am going to provide some basic RegEx examples which, for simplicity's sake, will use the SQL# toolkit and AdventureWorks to demonstrate some of the uses and the power of RegEx.I am going to focus on RegEx.Replace as wild card replacement functionality is one of those wish list items that really demonstrate the power of regular expressions. 

Data profiling with RegEx.Replace

If you are going to build a data warehouse it is essential to understand both the structure and the content of the data that is going to go into that warehouse.  There are a number of techniques that can be used to inspect what the quality of your data actually is and amongst those is data profiling.

Arkady Maydanchick gives a good general description of the issues and techniques in his book "Data Quality Assessment".

For each character field we would list the following: -

  • COUNT(<string field>) - Count the non-null occurrences
  • COUNT(DISTINCT <string field> - Count the unique occurrences
  • MIN(LEN(<string field>)) - Shortest length
  • MAX(LEN(<string field>)) - Longest length
  • MIN(<string field>) - First alphabetic occurrence
  • MAX(<string field>) - Last alphabetic occurrence

What we could also do is count the separate profiles of the strings.  For example, UK postcodes have the following profiles

  • A9 9AA
  • A99 9AA
  • AA9 9AA
  • AA99 9AA
  • AA9A 9AA
  • A9A 9AA

The latter two are London specific.

If we saw profiles other than the ones listed in a list of addresses that were supposed to be UK only then this would flag that we could have a data quality issue.

So if we wanted to replace all letters with capital A using the SQL#.RegEx_Replace, we would do something similar to the following:-

USE AdventureWorks
go
SELECT
PostalCode,
PostalCodeProfile = SQL#.RegEx_Replace(
PostalCode, -- The field we want to profile
'[A-Z]', -- The RegEx comparison for the range of letters we want to replace
'A', -- What we want to replace them with
-1, -- The number of occurrences that we will replace, in this case all of them.
1, -- The start position within the string
'IgnoreCase' -- RegEx defaults to case sensitive so this overrides that behaviour
)
FROM Person.Address

There are a couple of things to notice here

  • Although we want to change all letters to capital A we still specify the range [A-Z] and not [B-Z]. That is because if we had a postcode entered as Sa99 9BF the a would not change to an A.
  • The IgnoreCase option turns the entire RegEx comparison into case insensitive.

There are occassions when we may want selective case insensitivity.  We could have written  our RegEx comparison as follows

SELECT
PostalCode,
PostalCodeProfile = SQL#.RegEx_Replace(
PostalCode, -- The field we want to profile
'(?i)[A-Z]', -- The RegEx comparison for the range of letters we want to replace
'A', -- What we want to replace them with
-1, -- The number of occurrences that we will replace, in this case all of them.
1, -- The start position within the string
'' -- Accept RegEx as case sensitive
)
FROM Person.Address

Of course we also want to replace any digits with a 9 so our full code to run a profile becomes

USE AdventureWorks
go
WITH PostCodeSet(PostalCode,PostalCodeProfile) AS
(
    SELECT
        PostalCode,
        PostalCodeProfile = SQL#.RegEx_Replace(
            SQL#.RegEx_Replace(PostalCode,'[A-Z]','A',-1,1,'IgnoreCase'),
            '\d','9',-1,1,'')
    FROM Person.Address
)
SELECT PostalCodeProfile,Occurrences=COUNT(*)
FROM PostCodeSet
GROUP BY PostalCodeProfile
ORDER BY 2 DESC

In the 2nd replace notice the \d comparison.  This instructs RegEx to match against any digit.  We could equally have written [0-9].

If you run the query above you will see some strange pattern occurrences which is precisely why we carry out such profiling activity in data quality assessments.

Stripping multiple spaces with RegEx_Replace

One of the most common problems with parsing strings is the occurrences of double spaces.  Jeff Moden wrote a neat article on how to replace multiple spaces with one.

This is something that is also possible with RegEx_Replace.

DECLARE @MyString VARCHAR(1000)
SET @MyString='This   is  a         test of my   strange   spaces'
PRINT SQL#.RegEx_Replace(
@MyString,
' +',-- The RegEx comparison for the range of letters we want to replace
        ' ',-- What we want to replace them with
-1,-- The number of occurrences that we will replace, in this case all of them.
         1,-- The start position within the string
''-- Accept RegEx as case sensitive
)

Stripping double characters with RegEx_Replace

When we do phonetic matching it is often useful to get rid of double characters, or for that matter and number of multiple characters.  Again RegEx can do this for us.

DECLARE @MyString VARCHAR(1000)
SET @MyString='Cool Cattery Commission'
PRINT SQL#.RegEx_Replace(
@MyString,
'(\w)\1+',-- Match against any word character [A-Za-z0-9_]
'$1',-- The variable to which the group match was assigned
-1,-- The number of occurrences that we will replace, in this case all of them.
1,-- The start position within the string
'IgnoreCase'-- Make RegEx as case insensitive
)

The important thing here is that the condition wrapped up in parentheses (\w) identifies a "group" which will be assigned to the $1 variabl which can then be used in the replacement.

What we are really saying is:

  • Match against any word character individually that immediately repeats one or more times.
  • Capture the first character of this pattern to be referenced in the pattern itself (via \1) and in the replacement string (via $1)
  • Replace the complete matched string, no matter how many consecutive characters with the single character held in the $1 variable.

Word boundary matches

One feature of RegEx I have found particulary useful is its ability to look at words within a string.  One particular experiment I am currently leading is to find ways of mining text strings within unstructured data.

During an early iteration of the experiment I wanted to produce a SOUNDEX encoding of every word in a phrase.

One of the rules of SOUNDEX is that all vowels are dropped except when the vowel is the first letter of the word.  RegEx can do this very simply as is illustrated by the example below

DECLARE @String VARCHAR(50)='Annastasia''s anathema befuddled us all',
    @RegEx VARCHAR(50)= '\B[AEIOUWHY]'
SELECT
    correction=SQL#.RegEx_Replace(@String,@RegEx,'',-1,1,'IgnoreCase'),@String AS OriginalString

In this case \B tells RegEx that any vowel (including the Welsh ones) not at the beginning of a word should be replaced.

  • \B means not at a word boundary
  • \b means at a word boundary

Putting the \B or \b at the beginning of an expression tells RegEx that you are interested in the beginning of the word.

Complex patterns with RegEx

If we look at the possible patterns for a UK postal code we can see that there are some matching rules that can be applied

  • A9 9AA
  • A99 9AA
  • AA9 9AA
  • AA99 9AA
  • AA9A 9AA
  • A9A 9AA
Rule RegEx Example
One or two letter [A-Z]{1,2}

A

AA

A digit [0-9] 9
Optionally either a letter or a digit ([A-Z]|\d){0,1}

<nothing>

A

9

A space followed by a digit and two letter  \d[A-Z]{2} 9AA

The following query demonstrates the RegEx pattern in operation

SELECT *
FROM Person.Address
WHERE SQL#.RegEx_IsMatch(PostalCode,'[A-Z]{1,2}\d([A-Z]|\d){0,1} \d[A-Z]{2}',1,'IgnoreCase')=1

This example introduces us to anew RegEx pattern matching rule.

Curly brackets { }, which  allow us to specify how many of the preceding group of characters:-

  • {2} means we must have two of the preceding "group" so [A-Z]{2} means we must have 2 letters.
  • {1,2} means we must have a minimum of one and a maximum of 2 so [A-Z]{1,2} means we must have 1 or two letters.
  • {0,1} means we can either have 0 or up to 1 so ineffect ([A-Z]|[0-9]){0,1} means that we don't to have either but we can have 1 of either a letter or a number.

At this point I have to point out that different implementations of RegEx across languages and operating systems have different dialects.  In some cases {0,1} could be written as {,1} or just a single question mark.  Also 3 or more could be written as {3,}.

The pipe symbol acts as an OR statement

Groups and RegEx

RegEx allows matches to be split into groups and assigned to variables.

For example, let us suppose that we have a system that writes UK Postal codes to the database as a concatenated string with no space between the first and second half of the postal code.

  • CH5 1AA is written as CH51AA
  • SK11 9AS is written as SK119AS
  • ...etc

It is possible to use RegEx_Replace to insert a space at the appropriate place.

The following example shows how to do this using "groups" of letters as identified by () parentheses.

DECLARE @PostalCodes TABLE(PostalCode VARCHAR(8) NOT NULL)
INSERT INTO @PostalCodes(PostalCode)
VALUES ('CH51AA'),('SK119AS'),('M11AA'),('WC1N4DJ'),('S119PF')
/*
SQL 2005 Equivalent
INSERT INTO @PostalCodes(PostalCode)
SELECT 'CH51AA' UNION ALL
SELECT 'SK119AS' UNION ALL
SELECT 'M11AA' UNION ALL
SELECT 'WC1N4DJ' UNION ALL
SELECT 'S119PF'
*/
DECLARE @RegEx VARCHAR(2000)='([A-Z]{1,2}\d([A-Z]|\d){0,1})(\d[A-Z]{2})'
SELECT
    CorrectedPostalCode=SQL#.RegEx_Replace(PostalCode,@RegEx,'$1 $3',-1,1,'IgnoreCase'),
    PostalCode
FROM @Postalcodes
WHERE SQL#.RegEx_IsMatch(PostalCode,@RegEx,1,'IgnoreCase')=1

I have emphasised the '$1 $3' for a reason.

Variable Description
$1 The first group encapsulates the entire 1st half of the postal code
$2 The 2nd group is nested in the 1st half of the postal code to look after the cases with the optional 4th alphanumeric.
$3 The 3rd group is the last 3 characters of the postal code

By now you are beginning to see the power of RegEx but also the fact that the expressions can become complex and hard to debug.

I could have achieved the same split of postal codes from a slightly different RegEx

DECLARE @RegEx VARCHAR(2000)='([A-Z]{1,2}\d)([A-Z]|\d){0,1}(\d[A-Z]{2})'
SELECT
    CorrectedPostalCode=SQL#.RegEx_Replace(PostalCode,@RegEx,'$1$2 $3',-1,1,'IgnoreCase'),
Variable Description
$1 The first group handles the mandatory part of the first half of the postal code
$2 The 2nd group looks after the cases with the optional 4th alphanumeric.
$3 The 3rd group is the last 3 characters of the postal code

If the two RegEx expressions are put side by side you will see the subtle differences

DECLARE @RegEx VARCHAR(2000)='([A-Z]{1,2}\d([A-Z]|\d){0,1})(\d[A-Z]{2})'
DECLARE @RegEx VARCHAR(2000)='([A-Z]{1,2}\d)([A-Z]|\d){0,1}(\d[A-Z]{2})'

This illustrates how easy it is to miss potential logic holes in RegEx expressions.

Non-capturing matches

Another useful feature is the ability to specify a pattern match where the entire string must be matched but only certain key segments are assigned to variables.  We would use this for reasons of efficiency as the non-capturing matches are only for evaluation and require no special handling.

Let us suppose that we want to capture the first half of a UK postal code.

  • Match against the full UK postal code to ensure that it actually is a postal code
  • Only capture the first half

If the bit of the RegEx pattern that identifies the 2nd half of the postal code was changed as follows then this can be achieved.

(\d[A-Z]{2})
-- Match but do not capture
(?:\d[A-Z]{2})

Just to emphasise the point consider the following text

"In the UK contractors are allowed to own their own companies and have their employers pay that company.  The contractor can then draw a salary from their company and be paid dividends taxable at a much lower rate than income tax.  This is known as IR35.  By far the worst offending postal code is EC4M 7LS."

If we wanted to extract the postal sector then if we simply ignored the 2nd half of the postal code we would extract IR35 which is completely erroneous.

Named group matches

It is also possible to capture match groups and refer to them by name.  The code below illustrates a 3rd way of splitting out our postal codes.

DECLARE @RegEx VARCHAR(2000)='(?''OutboundPostCode''([A-Z]{1,2}\d([A-Z0-9]){0,1}))(?''InboundPostCode''(\d[A-Z]{2}))'
SELECT
    CorrectedPostalCode=SQL#.RegEx_Replace(PostalCode,@RegEx,'${OutboundPostCode} ${InboundPostCode}',-1,1,'IgnoreCase'),

As RegEx patterns become more complex the ability to assign names that have a business meaning to groups can help in aiding debugging.

RegEx and SEO keyword extraction

If you work for an e-commerce site then the cost of SEM (Search Engine Marketing) can rapidly consume a large part of a marketing budget.  The more traffic you can gain from SEO (Search Engine Optimisation) the better.

SQL#.RegEx_Replace can be used to extract the search terms from the HTTP_REFERER for known search engines.

The example below shows a quick mock up of some search engine URLs and a method for extracting the search terms.

-- Build a table containing search engine urls
DECLARE @URLList TABLE (URL VARCHAR(4096) NOT NULL)
INSERT INTO @URLList(URL)
SELECT    'http://www.google.co.uk/search?q=willier+macclesfield&ie=utf-8&oe=utf-8&aq=t&d=test+search&rls=org.mozilla:en-GB:official&client=firefox-a'
UNION ALL
SELECT    'http://uk.search.yahoo.com/search;_ylt=AkV3R0MiO42H88ok8XhXbng4hJp4;_ylc=X1MDMjAyMzM5MjMxMgRfcgMyBGZyA3lmcC10LTcwMgRuX2dwcwMwBG9yaWdpbgN1ay55YWhvby5jb20EcXVlcnkDZGF3ZXMgbWFjY2xlc2ZpZWxkBHNhbwMx?p=dawes+macclesfield&toggle=1&cop=mss&ei=UTF-8&fr=yfp-t-702'
UNION ALL
-- Search engine that does not really exist to show that the url parameter "q" can be used as a default
SELECT    'http://www.phoneysearch.yahoo.com/search?q=trek+holmes+chapel&toggle=1&cop=mss&ei=UTF-8&fr=yfp-t-702'
--  Build a table containing the root url and the url parameter that contains the search terms.
DECLARE @SearchEngineParams TABLE(
    SearchDomain VARCHAR(255) NOT NULL ,
    SearchIdentifier VARCHAR(20) NOT NULL)
INSERT INTO @SearchEngineParams(SearchDomain,SearchIdentifier)
SELECT 'http://www.google.co.uk/search','q='
UNION ALL
SELECT 'http://uk.search.yahoo.com/search','p='
SELECT U.URL,
    SearchTerms=SQL#.RegEx_Replace(URL,'.+(?:[\?&]'+ISNULL(SP.SearchIdentifier,'q=') +')([^&]+)|(.+$)','$1',-1,1,'')
FROM    @URLList AS U
    LEFT JOIN @SearchEngineParams AS SP
    ON U.URL LIKE SP.SearchDomain+'%'

For those search engines for which we have no knowledge the following parts of the SQL code are of relevance.

  • LEFT JOIN ensures that all urls are investigated regardless of whether our @SearchEngineParams has the particular search engine we are interested in
  • The ISNULL statement allows us to specify a default search term identifier of "q=" for any missing search engine

It should be noted that search engines represent search terms in different ways

  • Most use the url parameter q, hence our chosen default value
  • Some use terms=, qry = , p= and a variety of others
  • Some use + as a word separator, other use %20 meaning space (character 32)

Conclusion and useful references

Firstly thanks to Solomon Rutzky for his time peer reviewing this article and feedback.  Constructive criticism is always welcome.

The examples given here are fairly basic to wet your appetite.  Regular Expressions are as much art as science; it is truly amazing what some people have achieved with RegEx patterns.

Regular Expressions have been around for a considerable length of time and over the years RegEx engines have been refined and optimised so despite the complexity of what they are asked to do they are surprisingly efficient.  That said, as with any tool, particularly a powerful one,  it is easy to get carried away and use it inappropriately.

As mentioned earlier different languages and operating systems have slightly different dialects for RegEx.

For further reading consider some of the following: -

It is also worth finding a good RegEx testing tool.  I generally use http://weitz.de/regex-coach/#install however not all RegEx syntax is supported by this particular tool.

One final point to make is that for efficiency sake it may be worth combining the use of T-SQL LIKE statements in the WHERE clause of a query to narrow down the number of records evaluated by RegEx.

This is similar to the concept of storing a CHECKSUM value for a string in an indexed field and then using this in a query such as in the example below.

CREATE PROC dbo.SearchBigString
   @BigStringParam VARCHAR(900)
AS
SET NOCOUNT ON
SELECT *
FROM dbo.BigStringTable
WHERE CheckSumField = CHECKSUM(@BigStringParam) -- Selective enough to get an INDEX SEEK
AND ContentField = @BigStringParam -- drills down to the specific item
RETURN @@ROWCOUNT
GO

Rate

4.67 (39)

You rated this post out of 5. Change rating

Share

Share

Rate

4.67 (39)

You rated this post out of 5. Change rating