Blog Post

RegEx in SQL Server for searching text

,

As a database developer or DBA, or even software developer, you must have met up with the difficulties of finding some strings/sub-strings in the text columns. In such situations, we usually make two or more queries in order to cover the various cases, or even don’t try it because it’s complex. For example, the following query is not the smartest approach for finding Users who have some kind of URL in the description column.

SELECT * 
FROM dbo.Users
WHERE [Description] LIKE '%http://%' OR [Description] LIKE '%ftp://%'

You probably know why, and if you don’t know, then one reason is the possible performance impact when your table is big. However, more importantly, the question is “does it help us in finding all the cases?”.

T-SQL offers some mechanisms for tackling with those problems. Examples are the in-built functions: LIKE, PATINDEX, CHARINDEX, SUBSTRING or REPLACE, which you can combine with others and obtain quite sophisticated and complex queries. Maintenance and development of complex queries is more time spent.

However, sometimes problems are really difficult to be resolved with the in-built functions and in such situations, it’s better to switch your approaching. The Regular expressions (RegEx) are the definite advice for such queries. They have some general advantages for searching text:

  • A rich and varied means of pattern matching strings
  • Wildcards replace functionality
  • A simpler, less error prone means of parsing strings

In this post, I’m showing how by using some Regular expressions we can easily make advanced searching in the text/string defined columns of tables.

If you’re not familiar with RegEx, then a very good reference is the following http://www.regular-expressions.info.

The SqlRegex database project

I have created and built a SQL Server Database project in Visual Studio, named SqlRegex, with some user-defined functions. The produced SqlRegex.dll file from the project is imported into my database and I’m able to create the functions that follow.


Listing 1. Functions in the SqlRegex.dll and their short description

In order to be able to import any CLR in your SQL Server database, you must enable the server-level system configuration setting “clr enable” in the sys.configurations table. You can do that, by having appropriate permissions on the instance server, with the following command.

EXEC sys.sp_configure 'clr enabled',1
RECONFIGURE

Then in the SSMS Assembly folder under the Programmability node of a database, you can import the CLR by simply adding the SqlRegex.dll file as a new assembly. You can simply use the SSMS UI to import an assembly. The C# methods in the SqlRegex project are created under the UserDefinedFunctions class, which you regularly specify in the CREATE T-SQL functions syntax. The file is built with the PERMISSION_SET = SAFE option.

How to use SqlRegex?

I can now create T-SQL functions based on the CLR methods from the SqlRegex.dll. The methods with their short descriptions are given above in Listing 1. For example, the next function ContainsUrl() will return a boolean value true in the case when a column from the table contains a valid Url string.

CREATE FUNCTION dbo.ContainsUrl(@UrlStr NVARCHAR(max))
RETURNS BIT 
WITH EXECUTE AS CALLER
EXTERNAL NAME SqlRegex.UserDefinedFunctions.ContainsUrl;
GO

The next query uses ContainsUrl() to check for URLs in the Description column.

SELECT r.RuleID,r.[Name],r.[Description]
FROM dbo.BusinessRules r
WHERE dbo.ContainsUrl(r.[Description])=1;

The output result is in the next screen-shot.

As you can see in the description column there are URLs. In the first row the only description text is the URL itself, but in the second and third row, there is some other text before or after the URLs as well.

If I want to find the rows that have only a valid Url in the Description column, then I can use another method called IsValidUrl() from my SqlRegex.dll, in the following way.

CREATE FUNCTION dbo.IsValidUrl(@UrlStr NVARCHAR(max))
RETURNS BIT 
WITH EXECUTE AS CALLER
EXTERNAL NAME SqlRegex.UserDefinedFunctions.IsValidUrl;
GO

Running the query in the next screen will now find only the rows with pure URLs, in this case only one row.

How I designed the methods with C#

Next code is how the method ContainsUrl() looks like in C#. It actually uses the .NET Regex class and makes use of the Match class with its bool Success property. Usage of Microsoft.SqlServer.Server namespace is done by specifying the [SqlFunction] attribute. Additionally, it’s required that you use the System.Data.SqlTypes namespace for the SQL Server data types, like SqlString instead of string. I put the methods in a SQL Server Database Project.

//contains Url
[SqlFunction]
    public static bool ContainsUrl(SqlString UrlStr)
    {
        string input = Convert.ToString(UrlStr);
        string pattern = @"(ht|f)tp(s?)://[0-9a-zA-Z]([-.w]*[0-9a-zA-Z])*(:(0-9)*)*(/?)([a-zA-Z0-9-.?,'/\+&%$#_]*)?";
        Match m = Regex.Match(input, pattern);
        return m.Success;
    }

Most of the C# methods are simple and similar. One method ContainsString() is different from all the others, and it works with an additional pattern parameter. It can cover most of the methods from Listing 1, and it also offers flexibility for working with regex patterns.

If for example you don’t trust the above pattern for a valid Url, or you simply want to use another, then you can use your own and by making use of the ContainsString() function. As just mentioned, it additionally requires a second parameter @pattern of type NVARCHAR(max). You need to create a T-SQL user-defined function in the following way for example.

CREATE FUNCTION dbo.ContainsString(@Text NVARCHAR(max),@pattern NVARCHAR(max))
RETURNS BIT 
WITH EXECUTE AS CALLER
EXTERNAL NAME SqlRegex.UserDefinedFunctions.ContainsString;
GO

Then by passing the pattern as a second parameter, you’re able to use your own regex. An example is the following query followed by a results screen-shot:

DECLARE @pattern NVARCHAR(MAX);
SET @pattern='(ht|f)tp(s?)://[0-9a-zA-Z]([-.w]*[0-9a-zA-Z])*(:(0-9)*)*(/?)([a-zA-Z0-9-.?,''/\+&%$#_]*)?'
SELECT r.RuleID,r.[Name],r.[Description]
FROM dbo.BusinessRules r
WHERE dbo.ContainsString(r.[Description],@pattern)=1;

Note: Take care for the single quotes () in the patterns when passing to the ContainsString() T-SQL function. In the case when you have a not well-formed pattern you’ll be given an error message like this one

There are specific Regexs

Not all is covered with the listed methods. Specific patterns would be different for the countries. Such are the ZIP/Postal codes or the Phone numbers formats.

The ZIP/Postal codes usually have different formats. Next is an example pattern for ZIP codes. This expression matches three different formats of postal codes: 5 digit US ZIP code, 5 digits US ZIP code + 4, and 6 digits alphanumeric Canadian Postal Code. Next, I’m using the ContainsString() function for finding such formats.

DECLARE @zip_pattern NVARCHAR(MAX);
SET @zip_pattern = N'd{5}-d{4}|d{5}|[A-Z]d[A-Z]d[A-Z]d'
SELECT u.UserID,u.Postcode
FROM dbo.Users u
WHERE dbo.ContainsString(u.Postcode,@zip_pattern)=1;

If I need to use another pattern for zip/postal codes, then I’ll only change the pattern and call the ContainsString() function again.

The Phone numbers use different formatting too, either the wire or the cell phone numbers. The following example uses a pattern for finding valid Macedonian cell phone numbers.

DECLARE @mk_cell NVARCHAR(MAX);
SET @mk_cell = N'((00)?+?[389]{3})?[/-s*.]?((((0))|0)?s*7d{1})[/-s*.,]?([d]{3})[/-s*.,]?([d]{3})'

Not all is Regex: In the SqlRegex project, I’ve added the IsValidDate() function which is checking if a date/datetime is a valid one. Many often I want to ensure that all the data I’m importing is valid and don’t want an error to occur during import. I use this function for example. The .NET DateTime class is used to achieve that. For example, the function is able to determine the 29th of February as a valid date in a Leap year. This kind of checking is sometimes difficult to be done with Regex.

However, sometimes I just need to find out date formatting in the text columns. For those purposes I’ve added a ContainsDate() method which enables to find the following dates-formats:

[d|dd]/[m|mm]/[yy|yyy|yyyy]; [d|dd].[m|mm].[yy|yyy|yyyy]; [d|dd]-[m|mm]-[yy|yyy|yyyy];
  [yy|yyy|yyyy]/[m|mm]/[d|dd]; [yy|yyy|yyyy].[m|mm].[d|dd]; [yy|yyy|yyyy]-[m|mm]-[d|dd];
  [yy|yyy|yyyy]/[d|dd]/[m|mm]; [yy|yyy|yyyy].[d|dd].[m|mm]; [yy|yyy|yyyy]-[d|dd]-[m|mm];
  yyyy [d|dd] [m|mm]; yyyy [m|mm] [d|dd]

For some specific dates, you can simply use the ContainsString() method with your own designed regex pattern.

Note that in two of the functions (ContainsInteger() and ContainsNumeric()) of “Create Functions Script.txt”, there is an optional parameter @isDistinguish of a boolean type with a default value of true. That parameter is for separation, i.e. if you want the strings you’re searching for, be rounded with (true)/without (false) white spaces in the text/string. The default schema for the objects is “dbo“.

By using some of the listed methods from Listing 1, I showed how I’m resolving some complex text-searching issues, and in a similar way use the other methods. I usually use the methods when I’m importing data from flat files, or other tables without data validation. It’s good to also appoint the advantage of regex over the in-built functions when working with text-searching.

Here I’m noting that the mentioned regular expressions are mine choice and I’m not guarantying that they are going to cover all your problems, but I’m sure they can be of help to you. However, you can make use of ContainsString() and use your own designed patterns as I showed with some examples.

In the attached file Create Functions Script.txt you can see how the T-SQL functions are created as well.

I’ve built the SqlRegex project for all versions of SQL Server, including the Azure SQL database version V12. You can download SqlRegex.dll attached files.

SELECT u.UserID,u.Phone FROM dbo.Users u WHERE dbo.ContainsString(u.Phone,@mk_cell)=1;

Not all is Regex: In the SqlRegex project, I’ve added the IsValidDate() function which is checking if a date/datetime is a valid one. Many often I want to ensure that all the data I’m importing is valid and don’t want an error to occur during import. I use this function for example. The .NET DateTime class is used to achieve that. For example, the function is able to determine the 29th of February as a valid date in a Leap year. This kind of checking is sometimes difficult to be done with Regex.

However, sometimes I just need to find out date formatting in the text columns. For those purposes, I’ve added a ContainsDate() method which enables to find the following dates-formats:

For some specific dates, you can simply use the ContainsString() method with your own designed regex pattern.

Note that in two of the functions (ContainsInteger() and ContainsNumeric()) of “Create Functions Script.txt”, there is an optional parameter @isDistinguish of a boolean type with a default value of true. That parameter is for separation, i.e. if you want the strings you’re searching for, be rounded with (true)/without (false) white spaces in the text/string. The default schema for the objects is “dbo“.

By using some of the listed methods from Listing 1, I showed how I’m resolving some complex text-searching issues, and in a similar way using the other methods. I usually use the methods when I’m importing data from flat files, or other tables without data validation. It’s good to also appoint the advantage of regex over the in-built functions when working with text-searching.

Here I’m noting that the mentioned regular expressions are mine choice and I’m not guarantying that they are going to cover all your problems, but I’m sure they can be of help to you. However, you can make use of ContainsString() and use your own designed patterns as I showed with some examples.

In the attached file Create Functions Script.txt you can see how the T-SQL functions are created as well.

I’ve built the SqlRegex project for all versions of SQL Server, including the Azure SQL database version V12. You can download SqlRegex.dll attached files.

Original post (opens in new tab)
View comments in original post (opens in new tab)

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating