Regular Expression Based Token Replacement in ASP.NET

Damon Armstrong presents an extremely powerful and flexible token replacement mechanism for your ASP.NET applications. Because it is based on regular expressions, it lets you search a given string for dynamic text instead of just a static token.

In my last article, I covered a technique for Token Replacement in ASP.NET, which works from anywhere in your application. Simply put, token replacement is a matter of searching through a string for a token and replacing it with a value of your choice. All of the code for making the actual string replacement is nicely packaged in the Replace method on the String and StringBuilder objects in the .NET Framework, so the real task is getting the webpage to render as a string with which you can work.

NOTE:
If you have not already done so, please read the previous Token Replacement article because the solution presented here draws on work discussed previously. In particular, details on replacements in the String and StringBuilder instances, along with the string rendering technique, were all covered in the previous article and will not be detailed again here.

This article expands on that technique and introduces Regular Expression-based token replacement into the process, giving you even more power and flexibility in your applications.

This technique is similar to normal token replacement, but instead of searching for a static token in a string, you can search for dynamic text using regular expressions.

What if you don’t know anything about Regular Expressions? No problem! I’ve got a fairly cool regular expression that should allow you to do just about anything you want to do in code, so you don’t have to understand regular expressions at all for this article. I would suggest, however, that you try to read up on them as much as possible because they are extremely useful in many circumstances.

What are regular expressions?

Regular Expressions are an extremely powerful tool for executing complex string searching routines. You work with Regular Expressions in .NET using the Regex class found in the System.Text.RegularExpressions namespace. I tend to think of Regular Expressions in three parts: the engine, the pattern, and the string:

  • The engine is the component responsible for searching through and locating matches in a string, based on a set of instructions.
  • A pattern is the set of instructions, or rules, which inform the engine what to match in the string.
  • The string is the set of character data being searched. You pass a string and a pattern into the regular expression engine and the engine returns information identifying all of the locations in the string that matched the pattern, as well as information about each match.
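
As a minimal sketch of how the three parts fit together in code, the class below builds an engine from a pattern and runs it over a string. The class and method names (EnginePatternStringDemo, FindNumbers) are made up for illustration and are not part of the article's solution; the pattern simply matches runs of digits.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the article's code.
public static class EnginePatternStringDemo
{
    // Returns "index: text" for every run of digits found in the input.
    public static List<string> FindNumbers(string input)
    {
        // The pattern: rules telling the engine what to match
        // (here, one or more digits in a row).
        string pattern = "[0-9]+";

        // The engine: a Regex object built from the pattern.
        Regex engine = new Regex(pattern);

        List<string> results = new List<string>();

        // The string goes in; a collection of Match objects comes out.
        foreach (Match match in engine.Matches(input))
        {
            // Each Match records where it was found and what text it matched.
            results.Add(match.Index + ": " + match.Value);
        }
        return results;
    }
}
```

Calling FindNumbers("There are 10 people in 4 cars") should give two entries, one for the 10 and one for the 4, each with the position where it was found.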

Patterns use their own syntax to let the engine know the details of what is being sought in the string, and that syntax can look pretty complex and intimidating to people who haven’t worked with it before. Regular Expression patterns are normally pretty easy to identify because they look like complete gibberish. Here’s the one we’ll be using later on in the article:

\[\$(?<functionName>[^\$]*?)\((?:(?<params>.*?)(?:,|(?=\))))*?\)\$\]

It looks impossibly complicated, but it’s not. It’s just long. I will be dissecting it later on in the article, so don’t worry about it too much right now.

Solution overview

We’ve got a static token replacement solution in place from the last article that renders the ASP.NET page to a StringBuilder, and then passes that StringBuilder to two different functions: RunPageReplacements and RunGlobalReplacements. So we have the string that we need to search already set up and ready to go; we just need to search through it and run our Regular Expression based replacements.

One of our design goals is to keep Regular Expression searching to a minimum. Although Regular Expressions are efficient, considering what they do, running too many of them on each page request can quickly impact performance. So, our goal is to run only one search per page request.

We should also strive to make the token replacement mechanism flexible enough to handle a virtually unlimited number of different token replacement scenarios. These may seem like conflicting goals considering we only want to run one regular expression search per page, but Regular Expressions are cool enough to handle it.

Here’s the plan: allow for the definition of Token Functions inside the string where replacements are to be made. A token function is essentially a function name and parameters surrounded by brackets and dollar signs.

NOTE:
The term “Token Function” is my own invention, by the way, just in case you go searching for it on Google or Wikipedia!

Here are a couple of examples, taken from the form letter that appears later in this article:

[$UserName()$]
[$GetContent(FormLetter, Salutation)$]

We then search the page using a single regular expression pattern to locate all of the Token Functions on the page. One of the cool things about Regular Expressions is that you can execute a single search, locate multiple matches, and return those matches along with specifics about each match. In this case, the specifics include where in the page a Token Function was located, the name of the token function, and even a list of the parameters defined in the Token Function.

Armed with this information, we can then create “handlers” in the code-behind that are responsible for generating the replacement text for the Token Function based on the Token Function name and parameters. This pushes the work of defining Token Function names and replacement text logic into the code-behind, and keeps you from having to update the Regular Expression pattern when you want to add new Token Function definitions.

Matches and captures

Terminology-wise, it’s important to understand the difference between matches and captures. When you run a regular expression against a string, you are searching for parts of the string that “match” the regular expression as a whole. Matching answers a Yes/No question: did you find a section of the string that matched the entire regular expression?

Remember, a regular expression finds matches based on rules. Thus, each match can have a different text value behind it. When the regular expression engine finds a match, it “captures” information about that matched text so you can quickly access it later without having to go back through the string.

You can also define capturing groups inside of a regular expression. A capturing group is a section of a pattern for which the regular expression engine acquires capture information. In other words, a capturing group tells the regular expression engine to capture the text for a particular sub-section of the regular expression. This allows you to easily acquire values for specific items inside the matched text. In our case, we’ll use capturing groups to expose the function name and parameters in our token function. For example, this is a generic form letter containing a series of Token Functions:

[$GetContent(FormLetter, Salutation)$]

Thank you for registering on our website. Please use the following information to login to your account:

Username: [$UserName()$]
Password: [$Password()$]

If you have any problems, please feel free to email us at: [$GetContent(ContactInfo, CustomerSupportEmail)$].

Thank You,
Customer Service

When we run our Regular Expression against this string, we get the following breakdown of match and capturing group information:

  • Match – [$GetContent(FormLetter, Salutation)$]
    • Capture Group – functionName
      • GetContent
    • Capture Group – params
      • FormLetter
      • Salutation
  • Match – [$UserName()$]
    • Capture Group – functionName
      • UserName
    • Capture Group – params
  • Match – [$Password()$]
    • Capture Group – functionName
      • Password
    • Capture Group – params
  • Match – [$GetContent(ContactInfo, CustomerSupportEmail)$]
    • Capture Group – functionName
      • GetContent
    • Capture Group – params
      • ContactInfo
      • CustomerSupportEmail

Notice that the match itself captures the entire token function, including the surrounding brackets and dollar signs, the function name, parentheses, parameters, and the commas that separate the parameters. Capturing group information contains a better breakdown of the items that make up the token function, including the function name and an individualized list of parameter values.
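
The breakdown above can be reproduced with a few lines of code. This is only a sketch: the class name MatchBreakdownDemo and its Describe method are hypothetical, and the call to Trim() is my own addition, because the raw capture keeps the space that follows each comma.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the article's code.
public static class MatchBreakdownDemo
{
    // The token function pattern discussed in this article.
    private static readonly Regex FunctionRegex = new Regex(
        @"\[\$(?<functionName>[^\$]*?)\(" +
        @"(?:(?<params>.*?)(?:,|(?=\))))*?\)\$\]",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Builds a text breakdown: one line per match, then its groups.
    public static string Describe(string text)
    {
        StringBuilder output = new StringBuilder();
        foreach (Match match in FunctionRegex.Matches(text))
        {
            // The match itself is the whole token function.
            output.AppendLine("Match: " + match.Value);

            // functionName is captured once per match.
            output.AppendLine("  functionName: " +
                match.Groups["functionName"].Value);

            // params can be captured zero or more times per match.
            foreach (Capture capture in match.Groups["params"].Captures)
            {
                // Trim() is an assumption added here to drop the
                // space that follows each comma.
                output.AppendLine("  param: " + capture.Value.Trim());
            }
        }
        return output.ToString();
    }
}
```

Passing the form letter text to Describe should list each token function followed by its function name and any parameters.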

Regular expression concepts used in the solution

You don’t need an in-depth knowledge of regular expressions for this article, but it does help to have a basic understanding of the regular expression pattern we’re going to use to identify Token Functions. In pursuit of that understanding, we’ll do two things: cover some of the regular expression syntax used in the pattern, and break the pattern down into its individual components to discuss what each component does. First, let’s touch on syntax:

Backslash (\) modifier

A backslash (\) is a modifier that informs the regular expression engine to handle the character immediately following the backslash in a manner other than its normal context. For example, a dollar sign ($) is a special character that normally represents “match the end of a string” in a regular expression pattern. But it’s certainly possible that you may want to match an actual $ character. To do so, you escape the $ by placing a backslash in front of it: \$. Similarly, the characters w, s, and d are normally interpreted as the actual characters w, s, and d. If you modify them using a backslash (\w, \s, and \d), then they represent “any word character”, “any whitespace character”, and “any numeric character,” respectively.
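
To see the effect of the backslash, the following sketch counts matches for the same characters with and without it. BackslashDemo and its methods are illustrative names of my own, not part of the solution.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the article's code.
public static class BackslashDemo
{
    // Counts how many times the given pattern matches the input.
    public static int CountMatches(string input, string pattern)
    {
        return Regex.Matches(input, pattern).Count;
    }

    // Returns the match counts for $, \$, d and \d, in that order.
    public static int[] Compare(string input)
    {
        return new int[]
        {
            CountMatches(input, "$"),    // anchor: a position, not a character
            CountMatches(input, @"\$"),  // literal dollar signs
            CountMatches(input, "d"),    // the letter d
            CountMatches(input, @"\d")   // any numeric character
        };
    }
}
```

For an input such as "paid $5 and $10 on day 3", the unescaped $ matches a single position at the end of the string, while \$ matches the two dollar signs. Likewise, d matches the three letter d characters and \d matches the four digits.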

Grouping constructs

You can create a group by surrounding any portion of an expression with parentheses. There are a number of different types of groups in regular expression patterns, but for today you only need to know about five:

  • Unnamed groups – any time you surround an expression with parentheses it becomes an unnamed group. You can also think of unnamed groups as the default group. The syntax for an unnamed group is: (expression)
  • Named groups – you can give an unnamed group a name using the following syntax: (?<name>expression)
  • Capturing groups – both unnamed and named groups are capturing groups. Unnamed groups are accessed by numbered index, whereas a named group can be accessed by numbered index or by name. A group’s numbered index is determined by the location of the group in relation to other groups in the expression. In .NET, unnamed groups are numbered first, from left to right, and named groups are numbered after them. I recommend using named groups for readability and maintenance purposes.
  • Positive Lookahead – normally, an element in a regular expression pattern “consumes” characters that it matches. In other words, the regular expression engine checks a pattern element against characters in the string, and if the engine determines the characters match the pattern element, it moves on to the next element in the pattern and the next set of characters in the string. The positive lookahead allows you to check for the existence of a character (or series of characters) without consuming those characters. By definition, a positive lookahead group is non-capturing. Its syntax looks like this: (?=expression)
  • Non-Capturing groups – when the regular expression engine processes a non-capturing group, it does not capture any additional information about matches found in that group. This helps keep processing and data exchange to a minimum if you don’t need additional capture information for the group. You can change an unnamed group into a non-capturing group by adding a ?: after the opening parenthesis. The syntax looks like this: (?:expression)
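
The difference between consuming a character and merely looking ahead for it is easier to see side by side. In this sketch (GroupingDemo and its members are hypothetical names), both patterns capture a word in a named group called word, but one consumes the trailing comma in a non-capturing group and the other only checks for it with a positive lookahead.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the article's code.
public static class GroupingDemo
{
    // The comma is consumed by a non-capturing group,
    // so it becomes part of each match.
    public const string ConsumeComma = @"(?<word>\w+)(?:,)";

    // The comma is only checked by a positive lookahead,
    // so it is left out of each match.
    public const string PeekAtComma = @"(?<word>\w+)(?=,)";

    // Returns the full text of each match.
    public static List<string> MatchValues(string input, string pattern)
    {
        List<string> values = new List<string>();
        foreach (Match match in Regex.Matches(input, pattern))
        {
            values.Add(match.Value);
        }
        return values;
    }

    // Returns what the named group "word" captured in each match.
    public static List<string> WordValues(string input, string pattern)
    {
        List<string> values = new List<string>();
        foreach (Match match in Regex.Matches(input, pattern))
        {
            values.Add(match.Groups["word"].Value);
        }
        return values;
    }
}
```

Run against "red, green, blue", the first pattern matches "red," and "green," while the second matches "red" and "green". In both cases the word group holds just the word, and "blue" is skipped because no comma follows it.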

Character sets

Character sets allow you to match against a collection of characters, and are defined by enclosing the characters you want included in the set with square brackets. You may specify the characters individually, as a character range, or as a mix of individual characters and ranges. For example, a character set containing all uppercase English vowels may look like [AEIOU]. A character set containing all lowercase alphabetic characters may look like [a-z]. And a character set containing only lowercase consonants may look like [bcdfghj-np-tvwxyz]. Placing a ^ after the opening bracket of a character set makes the set exclusionary. In other words, the character set includes all characters except the ones in the set. So, [^a-z] means any character except the letters a-z. You have to be a bit careful in your thinking with exclusionary sets. You may, for example, be tempted to define consonants as [^aeiou]. But [^aeiou] technically includes digits like 0-9 and special characters like (){}[]!@#$%^&*.
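
The pitfall with exclusionary sets shows up quickly in a small test. The helper below is a sketch with made-up names (CharacterSetDemo, Collect); it gathers every character that a given character set matches.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the article's code.
public static class CharacterSetDemo
{
    // Concatenates every character in the input that the
    // given character set matches.
    public static string Collect(string input, string characterSet)
    {
        StringBuilder found = new StringBuilder();
        foreach (Match match in Regex.Matches(input, characterSet))
        {
            found.Append(match.Value);
        }
        return found.ToString();
    }
}
```

With the input "h3llo!", the set [aeiou] collects "o" and the explicit consonant set [bcdfghj-np-tvwxyz] collects "hll", but [^aeiou] collects "h3ll!" because the digit and the exclamation mark are also "not vowels".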

Quantifiers

Quantifiers inform the regular expression engine how many times to match part of the regular expression pattern. For example, if you want to find all numbers in a document, then you could use a regular expression like:

\d+

Remember, \d means “any numeric character” and + is a “quantifier” that means “1 or more matches”. Thus, when you have a sentence like “There are 10 people in 4 cars, driving 125 miles” the regular expression will find the numbers 10, 4, and 125 even though each number has a different number of digits. Other quantifiers include:

  • * – Zero or more matches
  • + – One or more matches
  • *? – Zero or more matches, but as few matches as possible
  • +? – One or more matches, but as few matches as possible
  • {n} – Exactly n matches
  • {n,} – At least n matches (with no upper limit)
  • {n,m} – Between n and m matches

One of the most important parts of understanding regular expressions is knowing about the “as few as possible” quantifiers. Many times, your regular expression will “overshoot” the text you are trying to capture. For example, let’s say you want to capture all the text inside the parentheses in this string:

Jack (who is a boy) and Jill (who is a girl) ran up the hill.

You may initially try a regular expression like:

\(.*\)

This regular expression says look for an open parenthesis \(, then find zero or more characters .*, until you reach a closing parenthesis \). But when you run this against the string, it captures:

(who is a boy) and Jill (who is a girl)

Why? Because, at least as far as the engine is concerned, there are three sets of open and close parentheses in the string:

  1. (who is a boy) and Jill (who is a girl)
  2. (who is a boy)
  3. (who is a girl)

The * quantifier tells the engine to match the most characters possible while still meeting the criteria of having an open and close parenthesis around those characters. Hence, it captures the longer string. When you switch to the *? quantifier, like this:

\(.*?\)

You tell the engine to match the fewest characters possible while still meeting the criteria of having an open and close parenthesis around those characters. So you end up capturing the two individual parenthetical statements:

(who is a boy)
(who is a girl)

This is important for us because we’re capturing function information inside [$ $] characters and we don’t want to accidentally suck in all the page text between two functions.
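
If you want to check the Jack and Jill example for yourself, a sketch along these lines should do it. GreedyLazyDemo and Find are illustrative names only.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the article's code.
public static class GreedyLazyDemo
{
    // Matches as many characters as possible between parentheses.
    public const string Greedy = @"\(.*\)";

    // Matches as few characters as possible between parentheses.
    public const string Lazy = @"\(.*?\)";

    // Returns the text of every match for the given pattern.
    public static List<string> Find(string input, string pattern)
    {
        List<string> values = new List<string>();
        foreach (Match match in Regex.Matches(input, pattern))
        {
            values.Add(match.Value);
        }
        return values;
    }
}
```

Running Find over the Jack and Jill sentence with the Greedy pattern should return one long match, and with the Lazy pattern it should return the two separate parenthetical statements.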

Breaking down the token function regular expression pattern

Below you will find the regular expression pattern used to locate token functions that appear throughout your ASP.NET page:

\[\$(?<functionName>[^\$]*?)\((?:(?<params>.*?)(?:,|(?=\))))*?\)\$\]

To simplify discussion of this pattern, I’m going to break it down into its individual elements. In the breakdown that follows, each pattern element is listed alongside a description of that element, and general commentary appears in the paragraphs between the groups of elements.

We start by looking for the opening [$ that denotes the start of a token function.

  • \[ – Open bracket literal
  • \$ – Dollar sign literal

Capture the function name in a named group called functionName. The function name includes any characters that appear between the opening [$ of the token function and the first parenthesis, excluding the dollar sign character. It is important to exclude the dollar sign because doing so helps the regular expression avoid accidentally matching on a normal token like [$TOKEN$].

  • (?<functionName> – Named group start (functionName)
  • [^ – Exclusionary character set start
  • \$ – Dollar sign literal
  • ] – Exclusionary character set end
  • *? – Quantifier: matches zero or more times, as few times as possible
  • ) – Named group end

Look for the opening parenthesis that marks the beginning of the function parameters.

  • \( – Open parenthesis literal

Create a non-capturing group that acts as a containing group for:

  1. The parameter value
  2. The comma that separates the parameters OR the ending parenthesis

  • (?: – Non-capturing group start

Capture the parameter value in a named group called params.

  • (?<params> – Named group start (params)
  • . – Any character
  • *? – Quantifier: matches zero or more times, as few times as possible
  • ) – Named group end

Next, find the character that denotes the “end” of the parameter. This could either be a comma that separates the parameter from another parameter, or a closing parenthesis that marks the end of the parameter list altogether.

We do not, however, want to consume the ending parenthesis in this section because a function may have zero parameters. If a function has zero parameters, then this section of the pattern isn’t used at all. Since the ending parenthesis is always present, we want to consume it in a section of the pattern that is always used.

  • (?: – Non-capturing group start
  • , – Comma literal
  • | – OR operator
  • (?= – Start positive lookahead group
  • \) – Close parenthesis literal
  • ) – End positive lookahead group
  • ) – End non-capturing group

End the non-capturing group that contains the parameter. The “zero or more times, as few times as possible” quantifier after the group tells the regular expression engine to capture as many parameters as are present, and to capture them individually instead of as one giant parameter string.

  • ) – End non-capturing group
  • *? – Quantifier: matches zero or more times, as few times as possible

Consume the ending parenthesis.

  • \) – Close parenthesis literal

Finish by looking for the closing $].

  • \$ – Looks for the closing $
  • \] – Looks for the closing ]

Now you should have a pretty good idea of what the regular expression does, so it should make a lot more sense when you’re working with it in code.
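
One detail worth testing is the exclusion of the dollar sign from the function name. The sketch below compares the pattern above with a looser variant, written purely for comparison, that allows any character in the function name. The class and member names are hypothetical.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the article's code.
public static class FunctionNameExclusionDemo
{
    // The article's pattern: the function name may not contain a $.
    public const string StrictPattern =
        @"\[\$(?<functionName>[^\$]*?)\(" +
        @"(?:(?<params>.*?)(?:,|(?=\))))*?\)\$\]";

    // A looser variant for comparison only: the function name
    // may contain any character.
    public const string LoosePattern =
        @"\[\$(?<functionName>.*?)\(" +
        @"(?:(?<params>.*?)(?:,|(?=\))))*?\)\$\]";

    // Returns the function name of the first match, or null if
    // the pattern does not match at all.
    public static string FirstFunctionName(string text, string pattern)
    {
        Match match = Regex.Match(text, pattern, RegexOptions.Singleline);
        return match.Success ? match.Groups["functionName"].Value : null;
    }
}
```

For the text "[$SITEEMAIL$] and [$Add(1,2)$]", the strict pattern should report Add as the function name. The loose pattern starts at the standard token and reports "SITEEMAIL$] and [$Add", swallowing the standard token along the way.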

Adding regular expressions to the TokenReplacementPage class

We need to do a few minor things to add regular expression based token replacements to the TokenReplacementPage class. Again, if you have no idea what the TokenReplacementPage class is, then you need to read my last article on Token Replacement in ASP.NET because it contains the foundation for what we’re covering here.

Below, you will find the updated code for the TokenReplacementPage class with the regular expression additions included. I’ll discuss each update below in more detail:

    using System;
    using System.IO;
    using System.Text;
    using System.Text.RegularExpressions;
    using System.Web.UI;
    using System.Reflection;

    public abstract class TokenReplacementPage : Page
    {
        // Compiled once and reused for every page request
        private static Regex functionRegex = new Regex(
            @"\[\$(?<functionName>[^\$]*?)\(" +
            @"(?:(?<params>.*?)(?:,|(?=\))))*?\)\$\]",
            RegexOptions.IgnoreCase |
            RegexOptions.Singleline);

        protected override void Render(HtmlTextWriter writer)
        {
            StringBuilder pageSource = new StringBuilder();
            StringWriter sw = new StringWriter(pageSource);
            HtmlTextWriter htmlWriter = new HtmlTextWriter(sw);
            base.Render(htmlWriter);
            RunRegularExpressionReplacements(pageSource);
            RunPageReplacements(pageSource);
            RunGlobalReplacements(pageSource);
            writer.Write(pageSource.ToString());
        }

        protected void RunGlobalReplacements(StringBuilder pageSource)
        {
            pageSource.Replace("[$SITECONTACT$]", "John Smith");
            pageSource.Replace("[$SITEEMAIL$]", "john.smith@somecompany.com");
            pageSource.Replace("[$CURRENTDATE$]",
                DateTime.Now.ToString("MM/dd/yyyy"));
        }

        protected virtual void RunRegularExpressionReplacements(
            StringBuilder pageSource)
        {
            //Regular Expression Replacements
            MatchCollection matches =
                functionRegex.Matches(pageSource.ToString());

            //Iterate backwards through all the matches
            for (int i = matches.Count - 1; i >= 0; i--)
            {
                //Pull function name from the group
                string functionName = matches[i].Groups["functionName"].Value;
                string[] paramList = CreateParamList(matches[i]);
                string replacementValue = string.Empty;

                //Run different code based on the function name
                switch (functionName.ToUpper())
                {
                    case "ADD":
                        int sum = 0;
                        for (int i2 = 0; i2 < paramList.Length; i2++)
                        {
                            sum += int.Parse(paramList[i2]);
                        }
                        replacementValue = sum.ToString();
                        break;

                    case "CONTENT":
                        replacementValue =
                            ContentManager.GetContent(paramList[0],
                                paramList[1]);
                        break;

                    default:
                        replacementValue = String.Format(
                            "<!-- Could not find case statement for {0} -->",
                            functionName);
                        break;
                }

                //Make replacements
                pageSource.Remove(matches[i].Index, matches[i].Length);
                pageSource.Insert(matches[i].Index, replacementValue);
            }
        }

        //Create string array containing each parameter value
        protected string[] CreateParamList(Match m)
        {
            CaptureCollection captures = m.Groups["params"].Captures;
            string[] paramList = new string[captures.Count];
            for (int i = 0; i < paramList.Length; i++)
            {
                //Trim the space that follows each comma in the token function
                paramList[i] = captures[i].Value.Trim();
            }
            return paramList;
        }

        protected virtual void RunPageReplacements(StringBuilder pageSource)
        {
        }
    }

System.Text.RegularExpressions namespace

You can find all of the regular expression classes in the System.Text.RegularExpressions namespace. We’ve imported this namespace at the top of the class to avoid lots of typing.

Static Regex variable

You can interact with the .NET Framework’s regular expression engine in one of two ways: you can execute static methods off the Regex class, like this:

Regex.Matches(inputString, regularExpressionPattern)

When you take this approach, the runtime has to locate or build a Regex object for the specified pattern on every call. The static methods keep only a small internal cache of recently used patterns, so a pattern that falls out of that cache has to be parsed all over again. This is pretty wasteful if you’re going to be reusing the same regular expression over and over again.

A better option, especially for our situation, is to create a Regex object and maintain a reference to that object so we can use it over and over again. We do this with the following code that appears near the top of the class:

    private static Regex functionRegex = new Regex(
        @"\[\$(?<functionName>[^\$]*?)\((?:(?<params>.*?)" +
        @"(?:,|(?=\))))*?\)\$\]",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

This creates a new Regex object containing the regular expression discussed earlier. For formatting’s sake, I split the regular expression pattern out onto two lines. You can also see that we’re creating the Regex object with two regular expression options specified: IgnoreCase and Singleline. IgnoreCase means that the regular expression is not case sensitive. It’s a good option to know about, though it doesn’t matter in this case because we’re not looking for any alphabetic characters. The second option, Singleline, means that the regular expression engine should process the input string as a single line. Specifically, it changes the dot (.) so that it matches every character, including line breaks. Without this option, the dot stops at a line break, so a match cannot run across lines.
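
A quick way to see what the Singleline option changes is to test a dot against a line break. The following is a small sketch, with SinglelineDemo as a made-up name.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the article's code.
public static class SinglelineDemo
{
    // Reports whether the dot in "first.second" is able to match
    // the line break that separates the two words in the input.
    public static bool DotCrossesLineBreak(RegexOptions options)
    {
        string input = "first\nsecond";
        return Regex.IsMatch(input, "first.second", options);
    }
}
```

DotCrossesLineBreak(RegexOptions.None) should return false, and DotCrossesLineBreak(RegexOptions.Singleline) should return true.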

RunRegularExpressionReplacements method

RunRegularExpressionReplacements accepts a StringBuilder object named pageSource containing the raw HTML output for the page. It begins by passing the page source into the Matches method on the functionRegex variable. The Matches method executes our regular expression pattern, finds all the token functions in the page source, stores information about each match in a Match object, then packages those Match objects into a MatchCollection object which it returns as the result of the function. We then store that MatchCollection in the matches variable.

Next, the method iterates backwards through each Match object in the MatchCollection using a for loop. I’ll talk about why it goes backwards in a minute. Inside the for loop, we do a few different things. First, we parse out the name of the function in the token function using the following code:

    string functionName = matches[i].Groups["functionName"].Value;

Remember, we made a named group called functionName that captures the name of the function. Accessing the function name is as easy as passing "functionName" into the Groups property and pulling back the Value of the text captured by that named group.

Second, we create a string array containing a list of the parameters for the token function using the CreateParamList method. We’ll go over this in more detail in a second, but understand for now that it runs through the Match object, checks the params group, and places the values for any captured parameters into a string array.

Next, we create a variable named replacementValue to store the value used to overwrite the token function.

After that, we have a switch statement that allows us to implement whatever functionality we want for regular expression based tokens. All you have to do to include additional functionality is add the function name to the switch statement, implement whatever logic you want for that particular function, and set the replacementValue variable to whatever it is you want the function token to have as its replacement text.

You can see that I’ve thrown in two functions for demonstration purposes. The ADD function runs through each parameter, converts it into an integer, and adds up all the values as it moves along. The CONTENT function accepts two parameters: a group and a key. It passes those values to the ContentManager.GetContent function, which returns the appropriate content.

And lastly, the method uses index and length information from the Match object to remove the token function and insert the value from replacementValue in its place, completing the process.

I mentioned earlier that we iterate through the Match information backwards, and here’s why. Each Match object has an Index property containing the location where we need to make a replacement. It’s very likely that the replacement text and the text being replaced will not be the same length, so updating the string means all index information from that point on is offset by the difference in string lengths. We could keep track of the offset, but it’s easier to avoid the situation by starting at the back of the string and moving towards the front. When you work in this direction, the index information is still valid because you are accessing part of the string which has yet to be lengthened or shortened by replacements.
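
Here is the same backwards technique in isolation, as a minimal sketch. The BackwardReplaceDemo class is hypothetical, and instead of token functions it replaces each number in a string with double its value, which is enough to make the replacement text a different length from the original.

```csharp
using System;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the article's code.
public static class BackwardReplaceDemo
{
    // Replaces every run of digits with double its value, working
    // from the last match to the first.
    public static string DoubleNumbers(string input)
    {
        StringBuilder text = new StringBuilder(input);
        MatchCollection matches = Regex.Matches(input, "[0-9]+");

        // Going backwards means the Index of every match we have not
        // processed yet still points at the right spot in the builder.
        for (int i = matches.Count - 1; i >= 0; i--)
        {
            Match m = matches[i];
            string replacement = (int.Parse(m.Value) * 2).ToString();

            text.Remove(m.Index, m.Length);
            text.Insert(m.Index, replacement);
        }
        return text.ToString();
    }
}
```

DoubleNumbers("5 apples and 60 pears") should produce "10 apples and 120 pears". If the loop ran forwards instead, replacing "5" with "10" would shift everything after it by one character and the stored Index for "60" would no longer line up.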

CreateParamList method

The CreateParamList method is an extremely simple method. You pass in a Match object, and it looks through it and creates a string array containing any parameter values the object captured. It begins by creating a string array whose size matches the number of captures found in the params group. Then it iterates through each parameter value and assigns it to the appropriate index of the array. It then returns that array.

Notice the slight difference in acquiring values from the params group as compared to the functionName group. When you were accessing the functionName group, you could get the value using:

    MatchObject.Groups["functionName"].Value

This works because you were only looking for a single value. If you are looking for something that can be captured multiple times, then you can access a list of the captured values via the Captures property, like this:

    MatchObject.Groups["params"].Captures[i].Value

You can still call something like this:

    MatchObject.Groups["params"].Value

But you will only get the last parameter the regular expression engine located.
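
The difference between the two ways of reading the params group can be checked with a short sketch like this one. ParamsCaptureDemo and its methods are names I made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical demo class, not part of the article's code.
public static class ParamsCaptureDemo
{
    // The token function pattern discussed in this article.
    private static readonly Regex FunctionRegex = new Regex(
        @"\[\$(?<functionName>[^\$]*?)\(" +
        @"(?:(?<params>.*?)(?:,|(?=\))))*?\)\$\]",
        RegexOptions.IgnoreCase | RegexOptions.Singleline);

    // Group.Value only exposes the last capture made by the group.
    public static string LastParameter(string tokenFunction)
    {
        return FunctionRegex.Match(tokenFunction).Groups["params"].Value;
    }

    // Group.Captures exposes every capture made by the group.
    public static int ParameterCount(string tokenFunction)
    {
        return FunctionRegex.Match(tokenFunction)
            .Groups["params"].Captures.Count;
    }
}
```

For "[$Add(1,2,3)$]", ParameterCount should return 3 while LastParameter returns only "3". For a token function with no parameters, such as "[$UserName()$]", the count is 0 and the value is an empty string.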

Updating the Render method

And finally, you call RunRegularExpressionReplacements from the Render method. I have it set up to run before executing the normal token replacement call because it seems more flexible. You may want to return a “standard token” from the Content Management system and have it replaced when you run standard replacements.

Checking out the Demo Application

Download the demo application and extract it to a location of your choosing on your hard drive. Start Visual Studio, and open the web site from wherever it is that you chose to save it. There are five files (not including code-behinds) in the demo:

  • App_Code\TokenReplacementPage.cs – Contains the TokenReplacementPage class that provides pages with token replacement functionality
  • App_Code\ContentManager.cs – Contains code for a “smoke-and-mirrors” (i.e. fake) content management system for use in the demo
  • Default.aspx – Demonstrates regular expression based token replacement functionality
  • Web.config – Website configuration

Take a look at the markup in the Default.aspx page and notice the various token functions and standard tokens that appear throughout. Also take a look at the code-behind file, because you will see a token function set in code to demonstrate that you can put a token anywhere and, as long as it is output to the page source, it is replaced by the regular expression token replacement mechanism.

When you run the demo application, you will see that the token functions are replaced with their respective values when the page appears in your browser.

Conclusion

Regardless of whether you fully understand regular expressions or not, this approach should provide you with a fairly powerful regular-expression based token replacement mechanism. Plus, you can easily add new token functions without having to rework the regular expression, though I would recommend placing the logic for those functions in their own methods to avoid having a ton of logic directly in one giant case statement.
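
One possible way to move that logic into separate methods is to keep a lookup table of handlers keyed by function name. This is only a sketch of the idea, not code from the demo application: the TokenFunctionHandlers class, the Resolve method, and the UPPER function are all hypothetical.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper class, not part of the demo application.
public static class TokenFunctionHandlers
{
    // Maps an upper-cased function name to the method that builds
    // its replacement text.
    private static readonly Dictionary<string, Func<string[], string>> Handlers =
        new Dictionary<string, Func<string[], string>>
        {
            { "ADD", AddNumbers },
            { "UPPER", UpperCase }  // made-up function for illustration
        };

    // Looks up the handler for a function name and runs it.
    public static string Resolve(string functionName, string[] paramList)
    {
        Func<string[], string> handler;
        if (Handlers.TryGetValue(functionName.ToUpperInvariant(), out handler))
        {
            return handler(paramList);
        }
        return String.Format(
            "<!-- Could not find handler for {0} -->", functionName);
    }

    // Adds up all of the parameters, as the ADD case does.
    private static string AddNumbers(string[] paramList)
    {
        int sum = 0;
        foreach (string value in paramList)
        {
            sum += int.Parse(value);
        }
        return sum.ToString();
    }

    // Upper-cases the first parameter.
    private static string UpperCase(string[] paramList)
    {
        return paramList[0].ToUpperInvariant();
    }
}
```

With something like this in place, the body of the switch statement could shrink to a single call such as replacementValue = TokenFunctionHandlers.Resolve(functionName, paramList), and adding a new token function becomes a matter of writing one method and registering it in the table.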

Happy coding!