Stairway to U-SQL

Stairway to U-SQL Level 12: SQL.ARRAY and CROSS APPLY EXPLODE

,

We’ve been using simple data types in T-SQL and other languages ever since programming languages were invented. INT, FLOAT, VARCHAR, NVARCHAR, STRING, DECIMAL…the list goes on. There are some slightly more esoteric data types available in T-SQL, such as GEOGRAPHY and HIERARCHYID. U-SQL offers a couple of additional data types of its own, both of which allow us to keep key pieces of data together, whilst at the same time being able to output that data as separate rows when necessary. These data types are called SQL.ARRAY and SQL.MAP. In this article, I’ll introduce you to SQL.ARRAY, with SQL.MAP to follow in the next step of this stairway. We’ll also see how CROSS APPLY can be used with these data types, in conjunction with the EXPLODE expression.

What is SQL.ARRAY?

SQL.ARRAY is a very special data type. It allows you to store multiple values in the same column. Let’s assume we’ve been asked to run a report for a specific set of postcodes. We want a certain number of Liverpool postcodes, a certain number of Manchester postcodes, and so on. In a traditional table structure, the data may be stored like this:

Area (string)Postcode (string)
LiverpoolL33 5XA
LiverpoolL33 5XB
LiverpoolL33 5XD
LiverpoolL33 5XE
LiverpoolL33 5XF
ManchesterM1 1AE
ManchesterM1 1BA
ManchesterM1 1BE
ManchesterM1 1BY
ManchesterM1 1BZ

But if we use a SQL.ARRAY data type, we could store the data like this:

Area (string)Postcode (SQL.ARRAY)

Liverpool

L33 5XA
L33 5XB
L33 5XD
L33 5XE
L33 5XF

Manchester

M1 1AE
M1 1BA
M1 1BE
M1 1BY
M1 1BZ

SQL.ARRAY allows you to store a collection of values in the same column. This is a very powerful concept, as we’ll see.

A Script Which Doesn’t Use SQL.ARRAY

An array is a collection of similar objects – a collection of strings or integers, for example. This is a useful construct as it allows us to store multiple values in one easy to use location. Let’s assume, for example, that we’ve been asked to return the postcode estimate details for a certain number of postcodes in Liverpool and Manchester (the postcodes in the tables above, coincidentally). We can write this in typical U-SQL fashion – I’ve added this script to our project as 550 PostcodeEstimates Without Array Or Map.usql:

USE DATABASE UkPostcodes;
USE SCHEMA Postcodes;
@pcodes = SELECT * FROM (VALUES ("L33 5XA"), ("L33 5XB"), ("L33 5XD"), ("L33 5XE"), ("L33 5XF"), 
("M1 1AE"), ("M1 1BA"), ("M1 1BE"), ("M1 1BY"), ("M1 1BZ")) AS PC(Postcode);
@pcodeTotals = 
SELECT p.Postcode.Substring(0, 1).ToUpper() == "L" ? 
  "Liverpool" : "Manchester" AS City, p.Postcode, pe.Total
FROM @pcodes AS p INNER JOIN PostcodeEstimates AS pe 
ON p.Postcode == pe.Postcode;
OUTPUT @pcodeTotals TO "/outputs/pcodetotals_basic.txt"
ORDER BY Postcode ASC USING Outputters.Csv();

This returns the expected output:

This works well, although it is a bit limited – for example, if we wanted to add some postcodes from another area, this line would need to change:

p.Postcode.Substring(0, 1).ToUpper() == "L" ? "Liverpool" : "Manchester" AS City

This code works well for two areas, but it makes it difficult to add more. We’d have to restructure the code.

There are lots of other ways of achieving this output - one way is to use SQL.ARRAY. Let’s build a script that uses SQL.ARRAY to achieve the same effect. This script will give us some added flexibility. Add a new script to the project called 560 PostcodeEstimates With SqlArray.usql (you’ll already have this if you downloaded the project from GitHub). The code for this script is below.

USE DATABASE UkPostcodes;
USE SCHEMA Postcodes;
@pcodes = SELECT * FROM (VALUES
("Liverpool", "L33 5XA,L33 5XB,L33 5XD,L33 5XE,L33 5XF"),
("Manchester", "M1 1AE,M1 1BA,M1 1BE,M1 1BY,M1 1BZ")) AS pc(Area, Postcodes);
// Put postcodes into an array
@pcodesAreaArray =
SELECT Area, new SQL.ARRAY<string>(Postcodes.Split(',')) AS Postcodes
FROM @pcodes;
@pcodeTotals =
SELECT ae.Area, pc.Postcode, pe.Total
FROM @pcodesAreaArray AS ae
CROSS APPLY EXPLODE(ae.Postcodes) AS pc(Postcode)
INNER JOIN PostcodeEstimates AS pe ON pc.Postcode == pe.Postcode;
OUTPUT @pcodeTotals
TO "/outputs/pcodetotals_array.txt"
ORDER BY Postcode ASC
USING Outputters.Text();

In the non-SQL.ARRAY script, the @pcodes rowset variable stored one postcode per row. Here, we are storing just two rows of two columns – the area, and a comma-separated list of postcodes. So far, so normal (the comma-separated list is just a string). The assignment to @pcodesAreaArray is where things start to become interesting:

// Put postcodes into an array
@pcodesAreaArray =
SELECT Area, new SQL.ARRAY<string>(Postcodes.Split(',')) AS Postcodes
FROM @pcodes;

This line actually creates the SQL.ARRAY column (as part of the @pcodesAreaArray rowset). It selects the Area from @pcodes, then it returns an array of strings, using the Postcodes column from @pcodes.

new SQL.ARRAY<string>(Postcodes.Split(','))

The new keyword is a C# statement that declares a new instance of a variable – it’s telling U-SQL that we want to create a new SQL.ARRAY. There are two parts to this statement. The first part tells U-SQL to create an array of strings:

new SQL.ARRAY<string>

We can specify any valid primitive type supported by U-SQL inside the angular brackets, it all depends upon the type of data we want the array to store. To create an array using an integer, we’d write one of these two statements:

new SQL.ARRAY<int>(code to retrieve values)
new SQL.ARRAY<int>{hard-coded values}

The brackets that come after this declaration represent the constructor for the SQL.ARRAY type. A constructor is an initialiser – it allows the developer to provide the SQL.ARRAY type with a set of values. Here’s an example which creates an integer-based SQL.ARRAY with some hard-coded values:

@numbers = SELECT NumberCollection, new SQL.ARRAY<int>{10, 20, 30, 40} AS NumbersList
FROM @numbers;

IMPORTANT! Note that this statement doesn’t have any parentheses – it only uses curly brackets. When you hard-code values you just specify a single set of curly brackets, with each value in the array separated by a comma. This works equally as well for strings:

@pcodesAreaArray = SELECT Area, new SQL.ARRAY<string>{"L33 5XA", "L33 5XB", "L33 5XD", 
"L33 5XE", "L33 5XF"} AS Postcodes
FROM @pcodes WHERE Area == "Liverpool";

Of course, quite often you won’t want to use hard-coded values – you’ll want to pull the array values from a table or file. That’s what the line of code in our example does:

new SQL.ARRAY<string>(Postcodes.Split(','))

The code inside the brackets tells U-SQL where to obtain the values for the array. The string type in C# supports a number of extension methods, some of which we’ve seen in the past (Substring in the previous example, for instance). Split is another such method. Split takes the current string and slices it up using the specified separator character (a comma in this case). You start with this single string:

L33 5XA,L33 5XB,L33 5XD,L33 5XE,L33 5XF

And you end up with a list of elements:

L33 5XA

L33 5XB

L33 5XD

L33 5XE

L33 5XF

You can split on any character or combination of characters.

It is important to remember that the values in the SQL.ARRAY cannot be selected like any other column. If you try to OUTPUT a SQL.ARRAY to a file you’ll hit an error:

This is because it isn’t possible to just write a set of array values to a file as text. This is where the EXPLODE expression comes in.

Exploding SQL.Arrays with the APPLY Operator

Exploding data sounds quite destructive! Fortunately, these explosions are good things. With the array now in place, we can explode it out to return the same result set as the first query.

@pcodeTotals =
SELECT ae.Area, pc.Postcode, pe.Total
FROM @pcodesAreaArray AS ae
CROSS APPLY EXPLODE(ae.Postcodes) AS pc(Postcode)
INNER JOIN PostcodeEstimates AS pe ON pc.Postcode == pe.Postcode;

This is much simpler than our first attempt – and more flexible too. The convoluted check to determine which area our postcode belongs to has gone – that’s now pulled back from the Area column. If we want to add more areas and postcodes to this query, all we need to do is include them in the original SELECT statement, ensuring the postcodes are stored as a comma-separated list.

We are using a CROSS APPLY EXPLODE on the Postcodes column:

CROSS APPLY EXPLODE(ae.Postcodes) AS pc(Postcode)

CROSS APPLY acts in a similar fashion to an INNER JOIN – rows in the left-hand table will only be returned if there’s a matching row in the right-hand table. U-SQL also supports OUTER APPLY, which behaves like a LEFT JOIN – rows from the left-hand table are always returned, even if there isn’t a matching value in the right-hand table. The APPLY operators work either with EXPLODE or a custom piece of code called an Applier, which you would need to write yourself.

EXPLODE returns a rowset of the appropriate type, placing each item in the array into its own row. The type being used for this array is a string. ae.Postcodes (inside the brackets) is the SQL.ARRAY column we wish to explode out. Finally, we assign a table name (pc) and column names (just one in this case – Postcode) to the APPLY. The end result is a rowset, which is applied to the query. If U-SQL allowed us to output a SQL.ARRAY column using a statement like SELECT * FROM @pcodesAreaArray (it doesn’t – you’ll see the error in the screenshot displayed above if you try it), we’d have returned data in the format we saw earlier:

Area (string)Postcode (SQL.ARRAY)

Liverpool

L33 5XA
L33 5XB
L33 5XD
L33 5XE
L33 5XF

Manchester

M1 1AE
M1 1BA
M1 1BE
M1 1BY

This data set only contains two rows. With the CROSS APPLY EXPLODE, we return this:

AreaPostcode
LiverpoolL33 5XA
LiverpoolL33 5XB
LiverpoolL33 5XD
LiverpoolL33 5XE
LiverpoolL33 5XF
ManchesterM1 1AE
ManchesterM1 1BA
ManchesterM1 1BE
ManchesterM1 1BY
ManchesterM1 1BZ

Ten rows – which is much more useful to us. You can’t use EXPLODE without an associated APPLY, although CROSS APPLY and OUTER APPLY can be used by themselves in certain circumstances.

The query finishes off by joining to PostcodeEstimates, which returns the relevant total for the postcode. Try running this – the output is exactly the same as the original query. We’ve now gained a lot more flexibility and simplified the query.

Returning Individual Items From Arrays

If you wish, you can pull out single values from an array. You do this by providing the index of an individual array element. C# indexes are zero-based, meaning the first element can be indexed at position 0. To return the first postcode for each area, we could write a SELECT statement like this:

SELECT Area, Postcodes[0] AS FirstPostcode FROM @pcodesAreaArray;

You simply specify the index of the array element you required in square brackets after the SQL.ARRAY column name – Postcodes in the example above. This returns just two rows – the first postcode found for Liverpool, and the first postcode found for Manchester.

Other Ways To Use A SQL.ARRAY

C# arrays are very powerful – we’re on the tip of the iceberg here. For instance, you can tell U-SQL to return all array elements that only match a certain pattern, to group array values, or to return only substrings of array values. There are lots of other things you can do too, but I’ll leave them as an investigative exercise for the reader. I will show you how to find out what is available though.

In our 560 script, comment out this line:

@pcodeTotals =
SELECT ae.Area, pc.Postcode, pe.Total
FROM @pcodesAreaArray AS ae
CROSS APPLY EXPLODE(ae.Postcodes) AS pc(Postcode)
INNER JOIN PostcodeEstimates AS pe ON pc.Postcode == pe.Postcode;

Type this:

@pcodeTotals = SELECT Area FROM @pcodesAreaArray;

With that code in place, change the statement to:

@pcodeTotals = SELECT Area, Postcodes. FROM @pcodesAreaArray;

Yes, that’s a dot/period after the word “Postcodes”. When you type that, an Intellisense list will appear. This shows you all of the wonderful methods you can use on your array.

There are loads of these, scroll down the list and take a look. Not all of them make sense in a U-SQL context, but quite a lot of them do. For instance, instead of returning the first postcode for each area by using Postcodes[0] (like we did earlier), we could write:

@pcodeTotals = SELECT Area, Postcodes.ElementAtOrDefault(0) AS FirstPostcode FROM @pcodesAreaArray;

This will return the postcode if found, otherwise it will return null. Have a good look at these methods, lots of them are very flexible.

Summary

SQL.ARRAY is a very powerful addition to the U-SQL library. The ability to store an ordered list of data in a single column is quite compelling and extremely useful. Used well, SQL.ARRAY can simplify your code and make it more efficient.

At the beginning of this article, I mentioned SQL.MAP. SQL.MAP brings something slightly different to the table (pun intended), and we’ll explore that in our next article. Happy U-SQLing until then!

This article is part of the parent stairway Stairway to U-SQL

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating