Stairway to U-SQL

Stairway to U-SQL Level 13: SQL.MAP

,

In the previous step of the U-SQL Stairway, we took a look at the SQL.ARRAY data type. This allows us to store multiple values of the same type in a single column. For instance, we can store a collection of postcodes in a SQL.ARRAY column, with the area they belong to stored in a corresponding string column.

SQL.MAP is fairly similar to SQL.ARRAY, but it stores items as a pair of values (as opposed to the single value supported in a SQL.ARRAY item). These values together form a key-value pair, and allow items to be stored that can be used for lookups. These kinds of pairs often exist in a lookup table, such as the Counties table we created several steps ago in this series.

Say we’ve been asked to store the country each county resides in. Our dataset is only concerned with the UK, so we only have at most four countries to deal with – England, Scotland, Wales and Northern Ireland. We can create a new table which stores the name of the country in a string column, and all of the counties for that country in a SQL.MAP column.

Creating a Table with a SQL.MAP Column

Open up the SSC_UkPostcodes Visual Studio solution (you can download it from GitHub if you don’t already have it). If it doesn’t already exist, add a new script called 600 Create Countries Table.usql. Here’s the code:

USE DATABASE UkPostcodes;
USE SCHEMA Postcodes;
DROP TABLE IF EXISTS Countries;
CREATE TABLE IF NOT EXISTS Countries
(
CountryName string,
Counties SQL.MAP<string, string>,
INDEX idx_Countries CLUSTERED (CountryName)
DISTRIBUTED BY HASH (CountryName)
);

Run this when you’re ready to create the table. We’ve met most of this code before – the interesting line is:

Counties SQL.MAP<string, string>,

This doesn’t look too dissimilar to the SQL.ARRAY we declared last time around:

SQL.ARRAY<string>

However, the SQL.MAP has two strings declared, not one. And we’re not just limited to strings! For instance, we could have declared:

Counties SQL.MAP<string, int>
Counties SQL.MAP<decimal, string>
Counties SQL.MAP<string, bool>

You can create a SQL.MAP from any of the usual .NET types. I’ve used two strings to map to the format in which we store county records:

  • County Code, which is a string
  • County Name, which is also a string

Together, these form a key-value pair, which I mentioned at the top of the article. The SQL.MAP is actually a wrapper around something called a Dictionary, which is a C# type that stores a list of Key-Value Pairs. If you take a look at the new table in the Server Explorer, we can see the data types in use.

There’s a load of technology behind the SQL.MAP data type, but all we need to know as U-SQL developers is we can use it to group key-value pairs together. Here is how the Countries table could look when populated with some data (Scotland hasn’t actually had counties since 1975 – the country is divided by council area. But for the purposes of our data set, we’ll stick with the counties):

CountryName

Counties

EnglandE10000002Buckinghamshire
E10000007Derbyshire
E10000008Devon
ScotlandS10000001East Lothian
S10000004Inverness
WalesW10000001Clywd
W10000002Gwent
W10000003Powys
Northern IrelandN10000001Antrim
N10000002Armagh
N10000003Londonderry

Populating SQL.MAP

Being able to store data together gives us a bit more flexibility when it comes to, for example, lookup values – we can now store all of the lookup data directly against the thing we’re trying to look up!

Regardless of what a SQL.MAP column gives us, it’s pretty useless unless we populate it with some data. We’re going to bring the records from the Counties table into the SQL.MAP column. To make sure you have the correct data, pull the latest code from GitHub and run script 200 Populate Tables.usql. With this done, add a new script to the project (you’ll probably already have it if you’ve pulled the latest code). Call the script 610 Insert Countries Data.usql. You may spot something new in this code.

USE DATABASE UkPostcodes;
USE SCHEMA Postcodes;
TRUNCATE TABLE Countries;
@nonEnglishCounties = SELECT * FROM (VALUES ("Northern Ireland", "N10000001"),
("Northern Ireland", "N10000002"), ("Northern Ireland", "N10000003"),
("Wales", "W10000001"), ("Wales", "W10000002"), ("Wales", "W10000003"),
("Scotland", "S10000001"), ("Scotland", "S10000002"), ("Scotland", "S10000003"),
("Scotland", "S10000004"), ("Scotland", "S10000005"), ("Scotland", "S10000006")
) AS c(CountryName, CountyCode);
// Insert non-English counties
INSERT INTO Countries
(CountryName, Counties)
SELECT nec.CountryName, MAP_AGG(c.CountyCode, c.CountyName) AS Counties
FROM Counties AS c
INNER JOIN @nonEnglishCounties AS nec ON c.CountyCode == nec.CountyCode
GROUP BY nec.CountryName;
// Insert English counties
INSERT INTO Countries
(CountryName, Counties)
SELECT "England" AS CountryName, MAP_AGG(c.CountyCode, c.CountyName) AS Counties
FROM Counties AS c
LEFT JOIN @nonEnglishCounties AS nec ON c.CountyCode == nec.CountyCode
WHERE nec.CountyCode IS NULL;

This code isn’t complicated. It truncates the Countries table, inserts some Scottish, Welsh and Northern Irish counties, and then some English counties. Let’s look at one of those SELECT statements in a bit more detail.

INSERT INTO Countries (CountryName, Counties)
SELECT "England" AS CountryName, MAP_AGG(c.CountyCode, c.CountyName) AS Counties
FROM Counties AS c
LEFT JOIN @nonEnglishCounties AS nec ON c.CountyCode == nec.CountyCode
WHERE nec.CountyCode IS NULL;

This looks just like any other INSERT statement you’ve ever seen, except for the second column declaration in the SELECT statement.

MAP_AGG(c.CountyCode, c.CountyName) AS Counties

What is this strange and mystical MAP_AGG statement? It’s a built-in U-SQL function that takes a collection of key-value pairs and transforms them into a SQL.MAP. MAP_AGG actually does the exact opposite of the EXPLODE statement. EXPLODE splits a key-value pair into separate values, whilst MAP_AGG takes a pair of values and transforms them into a key-value pair.

MAP_AGG Syntax

There isn’t much to the syntax of MAP_AGG.

MAP_AGG([DISTINCT] Key, Value)

The DISTINCT keyword is optional. Key and Value can be any valid C# expression. This is one of those areas where the full power of the C# language comes into play, as you can use things like lambda expressions as well as simple column names. In our example, we’ve used column names.

INSERT INTO Countries (CountryName, Counties)
SELECT "England" AS CountryName, MAP_AGG(c.CountyCode, c.CountyName) AS Counties
FROM Counties AS c
LEFT JOIN @nonEnglishCounties AS nec ON c.CountyCode == nec.CountyCode
WHERE nec.CountyCode IS NULL;

We’ve declared an alias of “c” for the Counties table. The columns from this table are passed to the MAP_AGG function. c.CountyCode is used as the Key, c.CountyName is used as the Value. On the surface, this looks just like any other SQL statement – MAP_AGG could be any function that accepts two parameters. It’s what happens once the statement has completed execution that we’re interested in!

Viewing Data in a SQL.MAP Column

Run the script to populate the Countries table. Now that the data is in the table, how the heck do we pull it out? As a first attempt, let’s try using the Server Explorer. Open this up, navigate to (Local)/U-SQL Databases/UkPostcodes/Tables/Postcodes.Countries (assuming you’re testing this locally – otherwise, follow the relevant path within your Azure account). Right-click on this and choose the Preview option.

The preview screen appears and attempts to load the data. Unfortunately, it all goes wrong with a Fetch table data error.

This is because the preview screen can’t handle a SQL.MAP or SQL.ARRAY column “out of the box”. We’ll have to write a custom OUTPUT statement to access the data instead.

Selecting SQL.MAP Data (or not!)

If you haven’t downloaded from GitHub, add a new script called 620 Select Countries Data.usql. Here is our first attempt at writing the data in the Countries table to a file.

USE DATABASE UkPostcodes;
USE SCHEMA Postcodes;
@countries = SELECT cy.CountryName, cy.Counties FROM Countries AS cy;
OUTPUT @countries TO "/outputs/countries.tsv" USING Outputters.Tsv();

Looks good, doesn’t it? Unfortunately, it doesn’t work – this is pretty much the same statement the preview screen executed earlier. This time, we actually see a descriptive error message.

It isn’t just the preview screen that has a problem with raw SQL.MAP columns! To correctly output the data, we need to use the EXPLODE statement. MAP_AGG put the data into the SQL.MAP column, and now EXPLODE transforms it back to separate columns. Replace the code above with this:

USE DATABASE UkPostcodes;
USE SCHEMA Postcodes;
@countries =
SELECT cy.CountryName, c.CCode AS CountyCode, c.CName AS CountyName
FROM Countries AS cy
CROSS APPLY EXPLODE(cy.Counties) AS c(CCode, CName);
OUTPUT @countries TO "/outputs/countries.tsv" USING Outputters.Tsv();

If you run this, the data will be outputted in the expected format.

It’s worth breaking this SELECT statement down. The SELECT line outputs three columns – CountryName, CCode (County Code), and CName (County Name). CountryName is a standard string column, coming straight from the Countries table. CCode and CName are generated from the SQL.MAP, via the EXPLODE statement we met in the last article. The EXPLODE statement takes the Counties SQL.MAP column as an input, and returns two columns – Key (CCode) and Value (CName). We can give these columns any name we want, but the SQL.MAP key will be returned as the first column, with the value as the second column.

Populating a SQL.MAP From a File

Creating a SQL.MAP from a file is pretty much the same as creating it from a table. You just SELECT the data in the same manner.

INSERT INTO Countries (CountryName, Counties)
SELECT "England" AS CountryName, MAP_AGG(CountyCode, CountyName) AS Counties
FROM
(EXTRACT CountyCode string, CountyName string
FROM @CountyFilePath USING Extractors.Csv()) AS counties;

Populating a SQL.MAP Manually

Finally, it’s also possible to manually add items to a SQL.MAP. Here’s one way of making that happen.

// Declare the manual values here
@basePeople = SELECT * FROM (VALUES
("MMQ1", "Mike McQuillan"), ("BMQ1", "Bertie McQuillan"), ("DMQ1", "Dolly McQuillan"))
AS p(PersonId, PersonName);
// Convert them into a SQL.MAP
@persons = SELECT new SQL.MAP<string, string>{{PersonId, PersonName}} AS PersonList
FROM @basePeople;
// Create an output set
@outputData = SELECT pData.PersonId, pData.PersonName
FROM @persons AS p CROSS APPLY EXPLODE(PersonList) AS pData(PersonId, PersonName);
// Write out the data to file
OUTPUT @outputData TO "/outputs/people.tsv" USING Outputters.Tsv();

There isn’t much to it – the manual values are declared in the first statement, and from then on it’s pretty similar to what we’ve already seen.

SQL.MAP Methods

To finish up, we’ll take a quick look at some of the extra methods you can call on a SQL.MAP column. You may recall we did something similar for SQL.ARRAY.

It’s pretty easy to obtain a COUNT of the number of counties for each country – SQL.MAP provides us with a Count() method.

SELECT cy.CountryName, cy.Counties.Count() AS CountyCount FROM Countries AS cy;

Here’s the output:

We can also use the ContainsKey method, to determine which country a particular county belongs to (just in case your geography isn’t top of the class). Here’s an example.

// E10000003 = Cambridgeshire
@countries = SELECT cy.CountryName, cy.Counties.ContainsKey("E10000003") AS ContainsCounty FROM Countries AS cy;

This returns true for England, false for all of the others.

There are lots of available methods – take a look at the documentation on the .NET Dictionary class to gain an idea of what you can do. Be aware that not all of these methods are available on SQL.MAP (ContainsValue, for instance).

Summary

U-SQL provides developers with two complex types – SQL.ARRAY, and SQL.MAP. We’ve now covered both of these, and have seen the extra abilities they provide us with. We’ll delve into another aspect of U-SQL soon – see you next time!

This article is part of the parent stairway Stairway to U-SQL

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating