12 February 2018

53471 views

19 0

URL Matching in C#

Comparing URLs in C# code is a common task and seems simple. Camilo Reyes shows us that there are many pitfalls to avoid since people can come up with several ways to type the same URL. He then demonstrates how to solve several URL comparison problems.

Ah, URLs. The Unified Resource Locator (URL) is ubiquitous in enterprise software. It doesn’t matter whether it’s a desktop, a web application, or a backend service, URLs have the unique ability to catch you off guard when you least expect it.

One can lean on the ASP.NET framework for URL routing which provides its own way of matching URLs to action methods. But alas, as is often the case, a full feature framework might not get you where you need to be. URL routing has a powerful way to invoke action methods inside MVC controllers but doesn’t help with URL matching.

If you’ve worked with URLs before and found it hard, then you’re doing it right. If it was easy, then this write up is for you. There are many traps hidden inside these URLs. A URL appears harmless on the surface but, when you look closer, it can be perilous.

In this take, I’d like to give you a deep dive into working with URLs in plain C#. In IT, there may come a time when you have this URL from a config and must match it with another one. The URL can come from the web request that you need to intercept through middleware with a match. I’ll stick to real examples I’ve come across in my programming adventures in the enterprise.

To start, let’s define what the internals of a URL looks like:

For our purposes, we care about the scheme, authority, path, query, and fragment. You can think of the scheme as the protocol, i.e., HTTP or HTTPS. The authority is the root or domain, for example, mycompany.com. The path, query, and fragment make up the rest of the URL. The URL spec defines each segment in this specific order. For example, the scheme always comes before the authority. The path comes after the scheme and authority. The query and fragment come after the path if there is one in the URL.

With the textbook definition in place, it’s time for string matching URLs. I’ll stick to the terminology from the figure so it is crystal clear for you.

String URL Match

Given a URL, it is somewhat reasonable to do a string comparison:

const string orig = "https://example.com/abc";

const string dest = "https://example.com/abc";

var isEqual = orig.Equals(dest);

Assert.True(isEqual);

All code samples use xUnit assertions to prove out matching concepts. Note the String.Equals comparison to get a string match with a URL.

One thing to look out for is that URLs are case-insensitive in the spec. This means https://example.com matches HTTPS://example.com. A naïve string comparison with an equals method does not account for this.

To make this more robust, add case-insensitivity to the comparison:

const string orig = "HTTPS://example.com/Abc";

const string dest = "https://example.com/abc";

var isEqual = orig.Equals(dest, StringComparison.OrdinalIgnoreCase);

Assert.True(isEqual);

The StringComparison.OrdinalIgnoreCase enumerator will do a byte match for each character while ignoring casing. This works well for URLs which are made up of ASCII characters. Note that there is an extra parameter to the overloaded the String.Equals method. With this much effort necessary to add robustness, it should appear often in your code.

Another interesting aspect is that the path of the URL can end with a forward slash. For example, /abc/ also matches /abc without a trailing forward slash. The config or app providing the URL can go either way and you must account for this.

Using string manipulation, we can trim the ends then do a match:

const string orig = "https://example.com/abc/";

const string dest = "https://example.com/abc";

var trimmedOrig = orig.TrimEnd('/');

var trimmedDest = dest.TrimEnd('/');

var isEqual = trimmedOrig.Equals(trimmedDest, StringComparison.OrdinalIgnoreCase);

Assert.True(isEqual);

This accounts for many mishaps with string comparisons. You will start to notice you need a good amount of trimming around URLs. When engaging in this line of work, it is best to stay alert and practice defensive coding. It’s difficult to imagine the many radical new ways folks can type in a simple URL. Human beings are not like computers and may find innovative ways to muck up URLs.

C# uses the .NET framework behind the scenes and provides a list of methods that can aid with URL matching. The System.String type, for example, has many extension methods available. It’s like having a full array of tools at your fingertips, time to examine which methods are most useful.

Let’s say we want to match the scheme to make sure it’s HTTPS:

const string orig = "https://example.com/abc/";

const string dest = "https://";

var isEqual = orig.StartsWith(dest, StringComparison.OrdinalIgnoreCase);

Assert.True(isEqual);

Note that it is safe to assume the scheme comes first according to the spec. The String.StartsWith method has a sibling method that can match the end of the string. This is useful for doing a match on the path of the URL. This is assuming your URLs always end with the path only.

So, for example:

const string orig = "https://example.com/abc/";

const string dest = "/abc";

var trimmedOrig = orig.TrimEnd('/');

var isEqual = trimmedOrig.EndsWith(dest, StringComparison.OrdinalIgnoreCase);

Assert.True(isEqual);

One can be clever with string matching in C#. Your string comparisons have an arsenal of methods at your disposal, so you can be as effective as possible. Let’s say, for example, I want to know if a given URL even has a query. The spec defines this as ?key=value. Note that the question mark is a unique character. This question mark character is in the URL spec and does not belong elsewhere.

So, for example:

const string orig = "https://example.com/abc?key=value#fragid";

var hasQueryString = orig.Contains("?");

Assert.True(hasQueryString);

If you can make safe assumptions about your URLs, like in the example above. Feel free to exploit these assumptions to your advantage with string comparison methods. All you need is to know is which method to use and a little imagination.

LINQ URL Match

With URLs coming from a config or any data source, what you might get back is a list. With the .NET framework, you can use LINQ to iterate through URL lists and then do a match. Imagine there is a list of URLs that must match a target URL. All I want to know is whether the URL exists within the list.

Say, for example:

var mockUrls = new []

{

"https://example.com/abc",

"https://target.com/abc/"

};

const string url = "https://target.com/abc";

var hasUrl = mockUrls.Any(

u => u.TrimEnd('/').Equals(url, StringComparison.OrdinalIgnoreCase));

Assert.True(hasUrl);

The IEnumerable.Any method allows you to match a list with a URL. Note the use of a lambda expressions to further refine the match. This becomes quintessential when you need to trim and ignore case sensitivity. At the end, this lambda expression expects a true or false which comes from the equal string comparison. If any items on the list return true then the entire method returns true.

For example, let’s say you have a list of paths that belong to the URL that needs a match. What you need is to combine the paths to the whole URL, then do a match. The string type has a String.Join method you can use to do the job. This join method takes in a list you can further refine using LINQ.

So, for example:

var paths = new []

{

"/abc",

"/123/"

};

const string url = "https://example.com/";

var combinedUrl = url.TrimEnd('/') + "/" + string.Join(

"/", paths.Where(p => !string.IsNullOrWhiteSpace(p)).Select(p => p.Trim('/')));

Assert.Equal("https://example.com/abc/123", combinedUrl);

I am purposely being naughty with the list of paths. One path has a trailing slash while the other does not. The goal here is to illustrate what kind of assumptions you can and cannot make with URLs. The way you write URL matching can have a life of its own depending on the assumptions.

Note the IEnumerable.Where method to filter out empty paths. LINQ has many more methods available you can use for URL matching. What I find is that I tend to use both IEnumerable.Any and IEnumerable.Select() often. These extension methods are part of the IEnumerable interface in C#. This means it can support a wide array of list types including an array of integers.

LINQ gets enabled on a list when you add System.Linq to the using statements. Inside Visual Studio, these extension methods don’t show up in IntelliSense until you do so. Feel free to explore this namespace if you need more ideas when working with URLs.

What you will find in .NET is that each type may have methods that come with it. The string type, for example, has a list of methods through the System namespace. So far, you can see how these methods are useful to you. It is like having a toolbelt with a whole array of functionality available.

URI Match

The .NET framework has a type to encapsulate URLs if necessary. There is a System.Uri type that can parse any valid URL. The string and LINQ methods I have explained so far do not parse but only provide URL matching. The Uri type has a list of methods and properties you can use to break a URL apart for further analysis.

Let’s say you have a URL with a scheme, authority, path, query, and fragment. Attempting to match against each piece requires good Regex skills. The good news is that a Uri type can do matches in an object-oriented fashion. This OOP (object oriented programming) approach can help keep the code nice and tidy.

One gotcha is that the Query property returns a string type, not a dictionary object. This will require that you parse out the string into a key-value pair. When you are working with the query inside a URL, you often need it as a dictionary to do lookups.

So, for example:

var uri = new Uri("https://example.com/abc/123?key1=value&key2#fragid");

var scheme = uri.GetLeftPart(UriPartial.Scheme);

var path = uri.GetLeftPart(UriPartial.Path);

var fragment = uri.Fragment.TrimStart('#');

var splitQuery = uri.Query.TrimStart('?').Split('&');

var queryString = new Dictionary<string, string>();

foreach (var item in splitQuery)

{

var splitItem = item.Split('=');

var itemKey = splitItem[0];

var itemValue = splitItem.Length > 1 ? splitItem[1] : string.Empty;

if (!queryString.ContainsKey(itemKey))

{

queryString.Add(itemKey, itemValue);

}

Assert.Equal("https://", scheme);

Assert.Equal("https://example.com/abc/123", path);

Assert.Equal("fragid", fragment);

Assert.Equal("value", queryString["key1"]);

You can get the schema and path through the Uri.GetLeftPart method. Note the use of the System.UriPartial enumerable to get each segment of the URL. The Fragment property has the fragment of the URL.

For the Query, note that ?key1=value&key2 is a valid query string because the spec is lenient. The String.Split method gives me back an array I can turn into a dictionary object. For duplicate keys, I use a Dictionary.ContainsKey first then a Dictionary.Add if it’s not in the dictionary. This is a defensive way of dealing with potential typos from a bad config, for example. For those in .NET Core 2.0+, there is a shiny new Dictionary.TryAdd that has this same logic as part of the method. Each itemValue can come from the Query or get a default value of string.Empty. Empty keys in the Query are still plausible. The asserts prove out that code above works as expected.

One gotcha comes from the scheme segment. Note that it returns the colon and backslashes as part of the scheme itself. If the goal is to match it against HTTPS, for example, it might be wise to match it with a String.StartsWith and ignore casing.

This covers just about everything you will encounter when matching URLs with a Uri type. I hope you can see it is far from trivial. One nice advantage is you get the Uri type through the System namespace. This namespace often appears inside many using statements in C#.

Conclusion

The .NET framework comes with a set of namespaces useful for working with URLs. So far, you have seen the System and System.Linq namespaces at work. In C#, there are two types of primary concerns which are System.String and System.Uri. These two types have many methods which are useful to you. For the System.String type, keep an eye on String.StartsWith and String.Equal with case insensitivity. For working with a list of URLs, use any list type that implements the IEnumerable interface. The System.Linq namespace will enable a set of extensions methods for your favorite type. To parse the Query into a dictionary type use the System.Collections.Generic namespace.

All these namespaces have been available in .NET since 3.5 and are part of the .NET Standard library. This means this code is guaranteed to work with many implementations of the .NET framework which include .NET Core. Microsoft is pushing for a standards-based approach and these namespaces are part of it. It is nice to have working code that has a commitment and supports a standard.

Because we are talking about the .NET framework and not only niche features in C#, these same namespaces and object-oriented types are available in PowerShell if you have a language version that supports .NET version 3.5 at a minimum. This means you can go all the way back to PowerShell 3.0.

URL Matching in C#

String URL Match

LINQ URL Match

URI Match

Conclusion

Subscribe for more articles

Rate this article

Camilo Reyes

Related articles

Creating Templates with Liquid in ASP.NET Core

C# Cancellation Tokens in AWS

Tags