Stairway to U-SQL

Stairway to U-SQL Level 14: Local U-SQL Scalar Functions with C#

,

So far in this series, we’ve stayed away from the C# aspects of U-SQL, instead concentrating on the more T-SQL-like aspects of the language. That all changes now, as we look at how we can use C# to create local scalar functions.

Creating Functions in Script

You may be asking “Why do I need to use C# to create scalar functions?” There’s a simple answer – you can’t use standard U-SQL script to create scalar functions. You can create Table-Valued Functions (TVFs), Views, Stored Procedures and other objects we know and love from T-SQL – but not scalar functions. But the C# implementation allows us to create something as simple or as complex as we like.

Where does the Code Live?

Kick open Visual Studio and open the SSC_UkPostcodes solution. If you don’t have it (where have you been?), download it from GitHub. If script 700 Postcode Split Function.usql is not present, add it to the project. Navigate to this file in the Solution Explorer and click the + sign next to the file name. An item with the same name will appear underneath it, but with a .cs extension.

It’s this .cs file that will house our C# code. This is known as a code-behind file, and we can add C# methods to this file, which will be available to our U-SQL script.

What the Function will do

Back in article 9, we created a view, which included a couple of C# conditional statements:

p.Postcode.IndexOf(" ") == -1 ? p.Postcode : p.Postcode.Substring(0, p.Postcode.IndexOf(" ")) AS Part1,
p.Postcode.IndexOf(" ") == -1 ? p.Postcode : p.Postcode.Substring(p.Postcode.IndexOf(" ") + 1) AS Part2,

The Part1 line returns the first part of the postcode (everything to the left of the space), the Part2 line returns everything after the space. In the case of L22 1FD, for instance, we’d return:

Part1 = L22

Part2 = 1FD

We’re going to move this code into a function which will return either the first or second part of the postcode, based on a parameter.

The U-SQL Script

Type or paste the code below into 700 Postcode Split Function.usql.

USE DATABASE UkPostcodes;
USE SCHEMA Postcodes;
@records =
SELECT p.Postcode,
p.Postcode.IndexOf(" ") == -1 ? p.Postcode : p.Postcode.Substring(0, p.Postcode.IndexOf(" ")) AS Part1,
p.Postcode.IndexOf(" ") == -1 ? p.Postcode : p.Postcode.Substring(p.Postcode.IndexOf(" ") + 1) AS Part2,
c.CountyName,
d.DistrictName
FROM Postcodes AS p
LEFT JOIN Counties AS c ON p.CountyCode == c.CountyCode
LEFT JOIN Districts AS d ON p.DistrictCode == d.DistrictCode
WHERE d.DistrictName == "Fife";
OUTPUT @records TO "/outputs/functiontests.csv" USING Outputters.Csv();

Submit this job locally and in a few moments, you should have a bunch of results in your output file.

This is essentially the view we created in article 9, contained within a script. What we want to do now is extract the Part1 and Part2 lines, and move them into a C# function. Click the small arrow next to 700 Postcode Split Function.usql in the Solution Explorer, so you can see the .cs file below it. Double-click on the .cs file to open it.

As you can see, the file is pretty empty. The using statements at the top import relevant C# libraries – the libraries beginning with Microsoft.Analytics are U-SQL-specific, the others are all generally used C# system libraries. System.IO, for instance, supports input/output using files, memory streams and other mechanisms.

Adding the Class

All .NET code lives within a namespace, which helps form part of the unique name for a C# class. A namespace is useless unless it contains a class, so let’s add that first. Change the code so it looks like this:

namespace SSC_UkPostcodes
{
    public static class FunctionTests
    {
    }
}

The class is called FunctionTests – we can give it any name we wish, as long as it’s a legal C# name. The public identifier means all code items in the project have visibility of the class – we could also have declared the class as private, for example, but our scripts would not be able to access it. These access modifiers can be used to control the parts of your codebase that are accessible in various circumstances. The static keyword tells the compiler that this class doesn’t need to be instantiated. If the class had been declared as a typical C# class, without the static keyword:

public class FunctionTests

We would have to instantiate it to assign it to a variable:

var newFunction = new FunctionTests();

We can’t instantiate objects in U-SQL scripts, hence the need for the static keyword.

Adding the Method to Represent a Function

We’ve already seen an empty namespace is pretty useless, and the same goes for an empty class – there is no point in having a class unless it does something. C# classes don’t contain functions, they contain methods. Let’s add the code for the FunctionTests method to the class.

namespace SSC_UkPostcodes
{
    public static class FunctionTests
    {
        public static string PostcodePart(string postcode, bool part1)
        {
            string pcode = null;
            if (part1)
            {
                pcode = postcode.IndexOf(" ") == -1 ? postcode : postcode.Substring(0, postcode.IndexOf(" "));
            }
            else
            {
                pcode = postcode.IndexOf(" ") == -1 ? postcode : postcode.Substring(postcode.IndexOf(" ") + 1);
            }
            return pcode;
        }
    }
}

This looks a bit more involved, but you should recognise the two postcode.IndexOf lines we used in our initial U-SQL script. The first thing to explain is the method declaration:

public static string PostcodePart(string postcode, bool part1)

Again, we declare public staticpublic so our codebase can see the method, and static as we’ve already declared the class to be static. If the method wasn’t static, we wouldn’t be able to access it as we can’t instantiate the class.

The method is called PostcodePart – this is what we’ll call in the U-SQL script. It accepts two parameters – a string which contains a postcode, and a Boolean. The Boolean parameter is called part1, which indicates it will return part1 if set to true (everything before the space). You’ve no doubt gathered that setting it to false will return part 2 of the postcode, i.e. everything after the space. There isn’t much else to it. A string variable is declared and initialised:

string pcode = null;

From there, we assign the appropriate value to the string, depending upon whether part1 was true or false. This is a simple IF-ELSE statement, which most of you will probably have seen many times in stored procedures and the like.

if (part1)
{
 pcode = postcode.IndexOf(" ") == -1 ? postcode : postcode.Substring(0, postcode.IndexOf(" "));
}
else
{
 pcode = postcode.IndexOf(" ") == -1 ? postcode : postcode.Substring(postcode.IndexOf(" ") + 1);
}

The last step is to return the value assigned to pcode.

return pcode;

Modifying the U-SQL Script to use the Method

To use this method as a function in our U-SQL script, we want to replace the two C# expressions in our U-SQL script with calls to the method. Remember I mentioned the namespace forms part of the unique name? That’s the first part of a three-part name – we call the method using the form:

namespace.classname.methodname(arguments)

Here’s the updated script with the method calls.

USE DATABASE UkPostcodes;
USE SCHEMA Postcodes;
@records =
SELECT p.Postcode,
SSC_UkPostcodes.FunctionTests.PostcodePart(p.Postcode, true) AS Part1,
SSC_UkPostcodes.FunctionTests.PostcodePart(p.Postcode, false) AS Part2,
c.CountyName,
d.DistrictName
FROM Postcodes AS p
LEFT JOIN Counties AS c ON p.CountyCode == c.CountyCode
LEFT JOIN Districts AS d ON p.DistrictCode == d.DistrictCode
WHERE d.DistrictName == "Fife";
OUTPUT @records TO "/outputs/functiontests.csv" USING Outputters.Csv();

Let’s take a closer look at one of the method calls:

SSC_UkPostcodes.FunctionTests.PostcodePart(p.Postcode, true) AS Part1

SSC_UkPostcodes is the namespace, FunctionTests is the name of the class, and PostcodePart is the name of the method. The method accepts two parameters. The first is the postcode, so we pass in p.Postcode, which is the Postcode column from the Postcodes table. This particular call is for Part 1 of the postcode, so the part1 parameter is set to true. The call for Part 2 is exactly the same, except the part1 parameter is set to false.

Run this and the same results as the first script we wrote are returned.

Deploying and Running the Script in Azure

So far, we’ve been running everything locally. Let’s make sure things work as expected in Azure. In the Server Explorer, connect to your Azure account. My Data Lakes account is called sqlservercentral, and I have already deployed the data to this account (see previous articles in this series for instructions on how to do this).

Return to script 700 Postcode Split Function.usql. At the top of the editor is a toolbar, with a Submit button and a few dropdown lists.

In the first dropdown list, change (local) to the name of your Azure Data Lakes account, then hit the Submit button. The job should succeed without issue.

Now mosey on along to https://portal.azure.com and access the Data Explorer in your Data Lake storage account.

Our script should have saved a file called functiontests.csv into the outputs folder. Click on the folder, then click on the file to preview it. You should see similar results to those we saw when we ran the job locally.

Summary

We’ve created our first true C# object in U-SQL. It was only a simple method (or function, depending upon what you want to call it), but it’s opened up the door to the wonderland that is C#. We could now write a method that can do pretty much anything we need to do from a data perspective.

There is an issue with what we’ve done so far – the method is only available to a single U-SQL script – the script that owns the code-behind file. This is a bit limiting to say the least, so next time we’ll see how we can create a shared method that is available to all of our U-SQL scripts. Keep querying until then!

This article is part of the parent stairway Stairway to U-SQL

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating