SQL 2000 DBA Toolkit Part 2

SQL 2000 DBA Toolkit: Part Two - Regular Expressions

"Begin at the beginning and go on till you come to the end: then stop." - The King, Alice in Wonderland

Introduction

This is Part Two of a series of articles on the SQL 2000 DBA Toolkit. In Part One we discussed the DBA Toolkit's encryption and key management features. In this article we'll take a look at the included regular expression parsing tools.

As always, feel free to use these tools as you see fit. If you find them useful, please drop me a line and let me know which ones you use, how you use them, and what you find most useful.

Enjoy.

Click here to read the first article

Text Processing

Text processing is a big part of SQL Server database and application development. Sometimes you need to transform, find patterns in, or just search text; but text processing is the bread and butter of many SQL-based applications.

In recognition of the importance of text processing on SQL Server, I've included several text processing tools in the toolkit. Some are re-writes of previously released tools, like the Sound Matching and Phonetic Toolkit. Others, like regular expression parsing, are new to the toolkit. This article focuses on regular expression tools in the toolkit; the next article will discuss the phonetic matching tools included.

Installation

Installation of the Toolkit is covered in Part One of this series. Basically, just copy the DLL's to your MSSQL\BINN directory and run the INSTALL.SQL script. Installation troubleshooting steps are covered in the README.RTF file included in the \DOCS directory.

In addition, there is a Windows Compiled Help file (DBATOOLKIT.CHM) included in the \DOCS directory. This Help File is included as your first line for online Q&A.

Regular Expression Tools

Regular expressions (regex's) are a precise notation for approximate matching. One of the most popular styles of regular expression syntax is the Perl syntax. The regular expression tools in the toolkit use the Boost::Regex++ library, which is compliant with Perl syntax. NOTE: Microsoft's regular expression syntax differs somewhat from Perl syntax, so .NET-specific regular expressions might require tweaking to use with these tools.

The Functions

fn_regex_match (@searchexpression, @searchstring, @options)

The first regex tool in the toolkit is the fn_regex_match function. This function accepts three parameters:

@searchexpression is the Perl-syntax regular expression to use in the search,
@searchstring is the string to match against, and
@options are the user-assignable options for this match. The only option supported in this version of the toolkit is 'I+', which turns case-insensitivity on. By default regex matches are considered case-sensitive.

The fn_regex_match function will look for a regular expression match in the search string. If a match is found, fn_regex_match returns a 'Y', otherwise it returns an 'N' to indicate no match was found. The following examples illustrate fn_regex_match.

This example returns 'Y', since a pattern of one or more numbers can be found in the search string:

SELECT dbo.fn_regex_match ('\d+', '$1,234.56', NULL)

This example returns 'N' since the first character of the search string is not a quotation mark (\")

SELECT dbo.fn_regex_match ('^\"', 'He said "Hi"', NULL)

This example returns 'N' since there is a comma in the middle of the search string

SELECT dbo.fn_regex_match ('^\d+$', '123,456', NULL)

This example returns 'Y', but only because we turned case-insensitivity on (I+ option). It will not match if case-insensitivy is off (default).

SELECT dbo.fn_regex_match ('^[A-Z]+$', 'George', 'I+')

Figure 1. Sample fn_regex_match result

xp_regex_search (@searchexpression, @searchstring, @options)

This extended procedure searches for, and returns, all matches of the search expression within the search string. The parameters are:

@searchexpression is the Perl-style regular expression,
@searchstring is the string to search
@options are user options. Currently there are two options supported for xp_regex_search: 'I+' turns case-insensitivity on. 'G+' turns group captures on. When group capture is on, all sub-groups of matches in your regular expression are returned as well as the complete match. See below for an example of group captures.

By default, regex searches are case-sensitive, and group captures are turned off.

The xp_regex_search extended procedure returns a table with three columns: MatchNum, GroupNum and MatchText.

MatchNum is the number of the match. Matches are numbered from left to right as they are encountered.
GroupNum is the number of the sub-group captured in this match. The complete match is always numbered 1; subsequent sub-group captures will have the same MatchNum, but the GroupNum increases by 1 for each sub-group.
MatchText is the text that matched the regular expression, or in the case of sub-group captures, the text that matched the sub-group in the regular expression.

Sub-groups are marked in Perl regular expression syntax by wrapping them in parentheses: i.e., '(\d+)' will capture a sub-group of 1 or more digits [0-9].

Here are some examples of xp_regex_search:

This example returns the rows 1,1,'George' and 2,1,'Harold'. Notice we used the case-insensitive regex searching option:

EXEC dbo.xp_regex_search '[A-Z]+', 'George,Harold', 'I+'

This example returns the rows 1,1,'1/2/2005' and 2,1,'2/3/2006':

EXEC dbo.xp_regex_search '(\d+)/(\d+)/(\d+)', '1/2/2005,2/3/2006', NULL

This example is the same as above, but with sub-group captures turned on. This will return 1,1,'1/2/2005'; 1,2,'1'; 1,3,'3'; 1,4,'2005'; 2,1,'2/3/2006'; 2,2,'2'; 2,3,'3'; and 2,4,'2006'. Subgroups can be returned here because we specified the Perl regex grouping operator '()' in the regular expression. So each set of one or more contiguous digits is a sub-group capture.

EXEC dbo.xp_regex_search '(\d+)/(\d+)/(\d+)', '1/2/2005,2/3/2006', 'G+'

Figure 2. xp_regex_search sample with group capture on

xp_regex_split @searchexpression, @searchstring, @options

This extended procedure is similar to xp_regex_search. Instead of returning matches to the search expression, however, this function uses the search expression to split the search string into tokens. The parameters are similar to xp_regex_search:

@searchexpression is a valid Perl-syntax regular expression,
@searchstring is a string to be searched/split
@options are the user options. This extended procedure supports only the 'I+', case-insensitivity on, option.

The xp_regex_split extended procedure returns a table with two columns: TokenNum and TokenText.

TokenNum is the number of the 'token' - the text returned by the split, and
TokenText is the text returned.

Results are returned by xp_regex_split in increasing order from left to right, as with the xp_regex_search procedure. Here are a few examples:

Splits based on white space. Returns 1,'One'; 2,'Two'; and 3,'Three':

EXEC dbo.xp_regex_split '\s+', 'One Two Three', NULL

This split is a little more complex. It uses Perl regex look-aheads to split a comma-delimited (CSV) format string. The result is 1,"Make"; 2,"A"; 3,"Wish".

EXEC dbo.xp_regex_split '\s*,\s*(?=(?:[^\"]*\"[^\"]*\")*(?![^\"]*\"))\s*', '"Make","A","Wish"', 'I+'

Figure 3. fn_regex_split comma-delimited (CSV) data sample

fn_regex_replace (@searchexpression, @searchstring, @replacement, @options)

The fn_regex_replace function replaces all search expression matches in the search string with the replacement string. The parameters are:

@searchexpression is a valid Perl-syntax regular expression,
@searchstring is a string to be searched
@replacement is the replacement string
@options are the user options. This extended procedure supports only the 'I+', case-insensitivity on, option.

By default the regex replace operation is case-sensitive. Here are a couple of examples of fn_regex_replace usage.

In this first simple example, all white space in the search string is replaced with periods:

SELECT dbo.fn_regex_replace ('\s', 'Hello There How Are You?', '.', 'I+')

In this next example, a list of hexadecimal numbers is prefixed with '0x'. Notice the use of the Perl grouping parentheses and the $1 group specifier in the @replacement string:

SELECT dbo.fn_regex_replace ('([0-9A-F]{2})', '01 1a 4f ff 36 92 87', '0x$1', 'I+')

Perl Regex Syntax

Regular Expressions are as much an art as they are a science, but they are built on a very specific syntax. Regular expressions are built up by combining single characters and metacharacters known collectively as 'atoms'. Here's a quick overview of some of the most common 'atoms' in Perl regular expression syntax:

Symbol	Character
A B C	literal characters like 'A' are matched exactly. Lowercase characters are matched exactly as well, unless you are performing a case-insensitive search. Numbers (1, 2, 3...) are likewise matched exactly.
\	The backslash character is used to denote special meta-characters. Use \\ to denote an exact match of a backslash.
\d	Matches a digit (0 - 9) character.
\D	Matches a non-digit character.
\w	Matches a 'word' character, which includes letters A-Z, a-z, 0-9 and the underscore character ('_').
\W	Matches a non-word character.
\s	Matches a whitespace character.
\S	Matches a non-whitespace character.
\n	Matches a new-line character.
\xff	Matches a hexadecimal character (0xff in this instance).
.	Matches any character. Use \. to match a period in your search string.

In addition you may also use repetition characters. Repetition characters normally follow a literal character, metacharacter or group of characters:

Metacharacter	Meaning
^	Matches from the beginning of the search string.
$	Matches at the end of the search string.
*	Matches the preceding atom or group 0 or more times.
+	Matches the preceding atom or group 1 or more times.
?	Matches the preceding atom or group 1 or 0 times.
{n}	Matches the preceding atom or group exactly n times.
{n,}	Matches the preceding atom or group at least n times.
{n, m}	Matches the preceding atom or group at least n times, but no more than m times.
[AEIOU]	Character classes are placed in brackets. The example [AEIOU] matches any of the characters 'A', 'E', 'I', 'O', or 'U'.
[A-Z]	Character class ranges can be specified by using a dash between two characters.
[^AEIOU]	A caret (^) at the beginning of a bracket indicates that any character other than the ones specified are a match. In this instance, any character but 'A', 'E', 'I', 'O' or 'U' is considered a good match.

You can also group atoms together using parentheses. We did this in some of the examples above. Here are the specifics:

Symbols	Meaning
( )	Parentheses indicate that anything between them should be matched as a group.
(BEN\|JERRY)	The 'or' ('\|') operator indicates either of the items on the left or right of it is a match. In this example, 'BEN' or 'JERRY' is a good match.

This is just a brief overview of some of the options available. Many very long books have been written on specific regular expression details. An excellent resource for beginners is Jukka Korpela's Regular Expressions in Perl. On this site, Jukka gives descriptions of the basic Perl regex operations, and examples.

Another great resource is Jeffrey Friedl's book Mastering Regular Expressions or Nathan A. Good's Regular Expression Recipes: A Problem-Solution Approach.

DBA Toolkit (Regular Expressions) Quick Reference

User-Defined Functions	Description
fn_regex_match (@searchexpression, @searchstring, @options)	Returns 'Y' if the search string is a match for the search expression, 'N' otherwise.
fn_regex_replace (@searchexpression, @searchstring, @replacement, @options)	Returns the search string with all occurrences of the search expression replaced with the specified replacement string.
Extended Procedures	Description
xp_regex_match @searchexpression, @searchstring, @options, @ret OUTPUT	Returns 'Y' if the search string is a match for the search expression, 'N' otherwise.
xp_regex_search @searchexpression, @searchstring, @options	Returns all matches of the search expression from the search string.
xp_regex_split @searchexpression, @searchstring, @options	Uses the search expression to split the search string, and returns all tokens produced by the split.
xp_regex_replace @searchexpression, @searchstring, @replacement, @ret OUTPUT, @options	Replaces all search expression matches in the search string with the replacement string.

Summary

This is the end of Part Two of this series on the DBA Toolkit. In this article we talked about regular expression tools included in the toolkit. In Part Three we will discuss phonetic matching tools.

Download the DBA Toolkit Here

SQL 2000 DBA Toolkit: Part Two - Regular Expressions

Introduction

Text Processing

Installation

Regular Expression Tools

The Functions

Perl Regex Syntax

DBA Toolkit (Regular Expressions) Quick Reference

Summary

Rate

Share

Categories

Share

Rate

SQL 2000 DBA Toolkit Part 2

SQL 2000 DBA Toolkit: Part Two - Regular Expressions

Introduction

Text Processing

Installation

Regular Expression Tools

The Functions

Perl Regex Syntax

DBA Toolkit (Regular Expressions) Quick Reference

Summary

Rate

Share

Categories

Share

Rate

Related content

Tame Those Strings - Using SUBSTRING

Tame Those Strings! Part 2 - Using CHARINDEX

Tame Those Strings! Part 5 - Using STUFF

SP_MSFOREACHDB - Getting a list of databases via TSQL.

SP_MSForeachtable - Life Without Cursors.

Cookies on SQLServerCentral