In our previous article, we saw how to create a Data Lake Store and Analytics account in Azure. We even executed a U-SQL job against some of the data we uploaded. The query we used extracted data from a postcode file, ran a filter over the data, and finally outputted the required data into a new CSV file. Here's the U-SQL script we created:
```sql
@results =
    EXTRACT postcode string,
            total int,
            males int,
            females int,
            numberofhouseholds int
    FROM "/Postcode_Estimates_1_M_R.csv"
    USING Extractors.Csv();

@m12results =
    SELECT postcode, total, males, females
    FROM @results
    WHERE postcode.Substring(0, 3).ToLower() == "m12";

OUTPUT @m12results
TO "/output/M12Postcodes.csv"
ORDER BY total DESC
USING Outputters.Csv();
```
In this tutorial, we’re going to break down the U-SQL statement. The intention here is to give you a good look at how a U-SQL statement is constructed. U-SQL aims to let you retrieve data from a data source and transform it into your required format. The transformed data can then be stored, either in a table or as a file. This was alluded to in the first article, when we talked about U-SQL being a LET solution (Load, Extract, Transform), as opposed to an ETL solution (Extract, Transform, Load).
U-SQL supports what Microsoft calls structured and unstructured data. Structured data is data loaded into tables in a U-SQL database, whilst unstructured data is what we’re dealing with in this example – files, no matter what the format.
The Extract Statement
Let’s look at the EXTRACT statement first:
```sql
@results =
    EXTRACT postcode string,
            total int,
            males int,
            females int,
            numberofhouseholds int
    FROM "/Postcode_Estimates_1_M_R.csv"
    USING Extractors.Csv();
```
We assign the contents of the EXTRACT statement to a variable, @results, known as a rowset variable. Note this uses T-SQL variable syntax, but there isn’t a DECLARE statement (although DECLARE does exist in U-SQL). This statement does not actually assign the entire contents of the Postcode_Estimates_1_M_R.csv file to the variable; it’s more of a reference, working in a similar manner to a Common Table Expression (CTE) in T-SQL.
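For comparison, here’s what DECLARE looks like in U-SQL. It creates a scalar constant rather than a rowset, and can be used to hold things like file paths. This is a sketch, and the variable name is illustrative:

```sql
// DECLARE creates a scalar constant, unlike a rowset assignment.
DECLARE @inputFile string = "/Postcode_Estimates_1_M_R.csv";

// The rowset assignment then references the scalar variable.
@results =
    EXTRACT postcode string,
            total int,
            males int,
            females int,
            numberofhouseholds int
    FROM @inputFile
    USING Extractors.Csv();
```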
The first part of the EXTRACT statement declares the format of the data we want from the file (known as “schema on read”). The file contains five columns (postcode, total, males, females and numberofhouseholds), but we only want the first four. We can’t specify this in the EXTRACT statement – we must specify all the columns listed in the file, otherwise our statement will fail. We can filter out the number of households column a bit later though, as we’ll see.
U-SQL has no idea about column headers, so having a file with a header row in it could cause your U-SQL script to fail. Microsoft has a fix planned for this, and there are a few ways to work around it in the meantime. One option is to extract all columns as strings, then convert values to their true types later in the U-SQL query using TryParse methods, discarding the invalid matches. Another option is to pass the silent:true parameter to the extractor, which will reject all non-conforming rows without failing the entire job. It will be up to you to figure out how many rows are missing, though. When we executed this job in the previous article, we explicitly removed the header row, which is probably the easiest (though not the most convenient) way of handling this problem at the time of writing. (UPDATE: As of February 2017, U-SQL supports header rows)
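To illustrate the silent option mentioned above, here’s a sketch. Because the header row contains strings where the schema declares ints, silent:true causes the extractor to drop it along with any other non-conforming rows:

```sql
// silent: true drops rows that don't match the declared types,
// including a string header row in an otherwise numeric file.
@results =
    EXTRACT postcode string,
            total int,
            males int,
            females int,
            numberofhouseholds int
    FROM "/Postcode_Estimates_1_M_R.csv"
    USING Extractors.Csv(silent: true);
```

The February 2017 update mentioned above added a skipFirstNRows parameter, so a header can now be skipped explicitly with, for example, Extractors.Csv(skipFirstNRows: 1).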
Note the way we’ve declared the data types – as strings and ints. You may have been expecting to see VARCHAR or INT. U-SQL uses C# data types, not T-SQL. This is something new for T-SQL developers to learn, but it isn’t a big learning curve. All of the usual data types are supported; a full list can be found on MSDN.
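One useful consequence of using C# types: nullable variants are available too. A sketch, assuming some rows might have empty count fields:

```sql
// C# nullable value types (int?) accept empty fields as null,
// rather than failing the row.
@results =
    EXTRACT postcode string,
            total int?,
            males int?,
            females int?,
            numberofhouseholds int?
    FROM "/Postcode_Estimates_1_M_R.csv"
    USING Extractors.Csv();
```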
With the data format declared, we specify the file to query using the FROM clause. This is analogous to querying a table – in effect, the file is the table. Note the path format used, which is in Linux/Unix format. This example uses a relative path, but it is possible to specify absolute paths if necessary – you’d want to do this from a locally hosted app, for example.
The final part of the EXTRACT statement is the USING clause, which tells U-SQL which extractor to use.
U-SQL comes pre-supplied with built-in extractors for:
- CSV files (comma-separated)
- TSV files (tab-separated)
- Text files (the separator can be specified)
The Text extractor allows you to specify a delimiter parameter, such as a pipe symbol should commas or tabs be unsuitable. The CSV and TSV extractors are derivatives of the Text extractor, with the delimiter value set appropriately.
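As a sketch, reading a pipe-delimited version of the file might look like this (the file name here is hypothetical):

```sql
// Extractors.Text with an explicit delimiter handles
// pipe-separated files; '|' is a C# char literal.
@results =
    EXTRACT postcode string,
            total int,
            males int,
            females int,
            numberofhouseholds int
    FROM "/Postcode_Estimates_pipe.txt"
    USING Extractors.Text(delimiter: '|');
```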
Theoretically, U-SQL can handle any type of data. There are sample extractors available to handle other data formats such as JSON, and you can write your own using C#. No doubt many types of extractor will be available over the coming months and years.
The extraction process allows us to access the data, but next we need to transform it. We want to extract columns for M12 postcodes (Manchester in the UK), and we also want to ignore the “numberofhouseholds” column. We can write as many transformation statements as required here, but for our purposes only one is needed:
```sql
@m12results =
    SELECT postcode, total, males, females
    FROM @results
    WHERE postcode.Substring(0, 3).ToLower() == "m12";
```
This looks more or less like a T-SQL statement, but there are some subtle differences. We are assigning the results to another rowset variable, @m12results, and we’re SELECTing FROM our previous rowset variable, @results. So far, so boring. What’s really interesting is the WHERE clause, which looks nothing like T-SQL. It’s a C# expression, using the string type’s Substring method to match postcodes whose first three characters are “m12”. There are two other noticeable differences from T-SQL:
- The equality operator – we have two = symbols, not one. This is the standard C# equality operator
- The string value “m12” is wrapped in double quotes – not single quotes, which is the norm in T-SQL
Using C# to implement filters gives U-SQL a huge amount of power. This could be problematic for pure T-SQL developers, but a bit of C# studying won’t do you any harm!
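Any valid C# string expression can appear in the WHERE clause. For instance, the Substring filter above could equally be written with StartsWith, a sketch:

```sql
// StartsWith avoids hard-coding the substring length;
// ToLower() keeps the match case-insensitive, as before.
@m12results =
    SELECT postcode, total, males, females
    FROM @results
    WHERE postcode.ToLower().StartsWith("m12");
```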
With the data transformed, all that’s left to do is output our M12 rows.
Outputting Transformed Data
There are two output targets – we can output to another file, like our example below, or we can insert the transformed rows into a table. We’re popping the rows out to another file, presumably for further downstream processing.
```sql
OUTPUT @m12results
TO "/output/M12Postcodes.csv"
ORDER BY total DESC
USING Outputters.Csv();
```
The OUTPUT keyword tells U-SQL this is an output statement. We provide the rowset variable holding the data we want to output and a path to an output file. Should this file already exist, it will be overwritten. We also apply an ORDER BY here. This could equally have been applied in the previous SELECT statement, but it’s more efficient to do it as part of the OUTPUT statement.
Lastly, we have the USING clause. This tells U-SQL the format in which we want to save the data. After reading the extractors section, I’m sure it will come as no surprise to learn that U-SQL comes with three outputters, mirroring the extractors: Csv, Tsv and Text.
Again, the delimiter parameter can be passed to the Text outputter. All three outputters support a number of other parameters:
- dateTimeFormat – sets the format used for date/time columns. This uses C# date/time format strings
- encoding – The default is UTF-8, but other encodings like ASCII and UTF32 are supported
- escapeCharacter – The character used to escape itself and delimiter values in the file
- nullEscape – The string used to represent a NULL, e.g. “Unknown”. The default value is null
- quoting – Stipulates whether values are wrapped in double quotes or not
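Putting a couple of those parameters together, a sketch (the values are illustrative):

```sql
// quoting: false drops the double quotes around values;
// dateTimeFormat uses a standard C# format string.
OUTPUT @m12results
TO "/output/M12Postcodes.csv"
ORDER BY total DESC
USING Outputters.Csv(quoting: false, dateTimeFormat: "yyyy-MM-dd");
```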
As usual, MSDN has a detailed document covering the outputters. As with extractors, you can write your own outputters in Visual Studio.
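Before wrapping up, it’s worth sketching the alternative output target mentioned earlier: a U-SQL table. This is a rough sketch only – the table and index names are illustrative, and managed U-SQL tables require a clustered index and a distribution scheme:

```sql
// Managed U-SQL tables need an index and distribution spec;
// names here are illustrative, not from the article's example.
CREATE TABLE IF NOT EXISTS dbo.M12Postcodes
(
    postcode string,
    total int,
    males int,
    females int,
    INDEX idx_postcode CLUSTERED (postcode)
        DISTRIBUTED BY HASH (postcode)
);

INSERT INTO dbo.M12Postcodes
SELECT postcode, total, males, females
FROM @m12results;
```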
So, we’ve broken down a rather simple U-SQL statement and analysed how U-SQL operates. We’ve seen that C# provides U-SQL with a lot of power and flexibility, and extractors and outputters give us the ability to process and generate data.
We’ve still barely scratched the surface of what U-SQL can do. Next time, we’ll take a look at how we can convert our query to work on many files instead of just one. Until then, happy querying!