SQL Clone
SQLServerCentral is supported by Redgate
 
Log in  ::  Register  ::  Not logged in
 
 
 

Stairway to U-SQL Level 13: SQL.MAP

By Mike McQuillan,

The Series

This article is part of the Stairway Series: Stairway to U-SQL

As Microsoft continues to expand the Azure platform, they have enhanced its ability in ways that are quite different from what we've come to expect from SQL Server. Learn about the new language from Microsoft, U-SQL, designed to work with Data Lakes and Big Data in Azure.

In the previous step of the U-SQL Stairway, we took a look at the SQL.ARRAY data type. This allows us to store multiple values of the same type in a single column. For instance, we can store a collection of postcodes in a SQL.ARRAY column, with the area they belong to stored in a corresponding string column.

SQL.MAP is fairly similar to SQL.ARRAY, but it stores items as a pair of values (as opposed to the single value supported in a SQL.ARRAY item). These values together form a key-value pair, and allow items to be stored that can be used for lookups. These kinds of pairs often exist in a lookup table, such as the Counties table we created several steps ago in this series.

Say we’ve been asked to store the country each county resides in. Our dataset is only concerned with the UK, so we only have at most four countries to deal with – England, Scotland, Wales and Northern Ireland. We can create a new table which stores the name of the country in a string column, and all of the counties for that country in a SQL.MAP column.

Creating a Table with a SQL.MAP Column

Open up the SSC_UkPostcodes Visual Studio solution (you can download it from GitHub if you don’t already have it). If it doesn’t already exist, add a new script called 600 Create Countries Table.usql. Here’s the code:

USE DATABASE UkPostcodes;
USE SCHEMA Postcodes;

DROP TABLE IF EXISTS Countries;

CREATE TABLE IF NOT EXISTS Countries
(
CountryName string,
Counties SQL.MAP<string, string>,
INDEX idx_Countries CLUSTERED (CountryName)
DISTRIBUTED BY HASH (CountryName)
);

Run this when you’re ready to create the table. We’ve met most of this code before – the interesting line is:

Counties SQL.MAP<string, string>,

This doesn’t look too dissimilar to the SQL.ARRAY we declared last time around:

SQL.ARRAY<string>

However, the SQL.MAP has two strings declared, not one. And we’re not just limited to strings! For instance, we could have declared:

Counties SQL.MAP<string, int>
Counties SQL.MAP<decimal, string>
Counties SQL.MAP<string, bool>

You can create a SQL.MAP from any of the usual .NET types. I’ve used two strings to map to the format in which we store county records:

  • County Code, which is a string
  • County Name, which is also a string

Together, these form a key-value pair, which I mentioned at the top of the article. The SQL.MAP is actually a wrapper around something called a Dictionary, which is a C# type that stores a list of Key-Value Pairs. If you take a look at the new table in the Server Explorer, we can see the data types in use.

There’s a load of technology behind the SQL.MAP data type, but all we need to know as U-SQL developers is we can use it to group key-value pairs together. Here is how the Countries table could look when populated with some data (Scotland hasn’t actually had counties since 1975 – the country is divided by council area. But for the purposes of our data set, we’ll stick with the counties):

CountryName

Counties

England

E10000002

Buckinghamshire

E10000007

Derbyshire

E10000008

Devon

Scotland

S10000001

East Lothian

S10000004

Inverness

Wales

W10000001

Clywd

W10000002

Gwent

W10000003

Powys

Northern Ireland

N10000001

Antrim

N10000002

Armagh

N10000003

Londonderry

Populating SQL.MAP

Being able to store data together gives us a bit more flexibility when it comes to, for example, lookup values – we can now store all of the lookup data directly against the thing we’re trying to look up!

Regardless of what a SQL.MAP column gives us, it’s pretty useless unless we populate it with some data. We’re going to bring the records from the Counties table into the SQL.MAP column. To make sure you have the correct data, pull the latest code from GitHub and run script 200 Populate Tables.usql. With this done, add a new script to the project (you’ll probably already have it if you’ve pulled the latest code). Call the script 610 Insert Countries Data.usql. You may spot something new in this code.

USE DATABASE UkPostcodes;
USE SCHEMA Postcodes;

TRUNCATE TABLE Countries;

@nonEnglishCounties = SELECT * FROM (VALUES ("Northern Ireland", "N10000001"),
("Northern Ireland", "N10000002"), ("Northern Ireland", "N10000003"),
("Wales", "W10000001"), ("Wales", "W10000002"), ("Wales", "W10000003"),
("Scotland", "S10000001"), ("Scotland", "S10000002"), ("Scotland", "S10000003"),
("Scotland", "S10000004"), ("Scotland", "S10000005"), ("Scotland", "S10000006")
) AS c(CountryName, CountyCode);

// Insert non-English counties
INSERT INTO Countries
(CountryName, Counties)
SELECT nec.CountryName, MAP_AGG(c.CountyCode, c.CountyName) AS Counties
FROM Counties AS c
INNER JOIN @nonEnglishCounties AS nec ON c.CountyCode == nec.CountyCode
GROUP BY nec.CountryName;

// Insert English counties
INSERT INTO Countries
(CountryName, Counties)
SELECT "England" AS CountryName, MAP_AGG(c.CountyCode, c.CountyName) AS Counties
FROM Counties AS c
LEFT JOIN @nonEnglishCounties AS nec ON c.CountyCode == nec.CountyCode
WHERE nec.CountyCode IS NULL;

This code isn’t complicated. It truncates the Countries table, inserts some Scottish, Welsh and Northern Irish counties, and then some English counties. Let’s look at one of those SELECT statements in a bit more detail.

INSERT INTO Countries (CountryName, Counties)
SELECT "England" AS CountryName, MAP_AGG(c.CountyCode, c.CountyName) AS Counties
FROM Counties AS c
LEFT JOIN @nonEnglishCounties AS nec ON c.CountyCode == nec.CountyCode
WHERE nec.CountyCode IS NULL;

This looks just like any other INSERT statement you’ve ever seen, except for the second column declaration in the SELECT statement.

MAP_AGG(c.CountyCode, c.CountyName) AS Counties

What is this strange and mystical MAP_AGG statement? It’s a built-in U-SQL function that takes a collection of key-value pairs and transforms them into a SQL.MAP. MAP_AGG actually does the exact opposite of the EXPLODE statement. EXPLODE splits a key-value pair into separate values, whilst MAP_AGG takes a pair of values and transforms them into a key-value pair.

MAP_AGG Syntax

There isn’t much to the syntax of MAP_AGG.

MAP_AGG([DISTINCT] Key, Value)

The DISTINCT keyword is optional. Key and Value can be any valid C# expression. This is one of those areas where the full power of the C# language comes into play, as you can use things like lambda expressions as well as simple column names. In our example, we’ve used column names.

INSERT INTO Countries (CountryName, Counties)
SELECT "England" AS CountryName, MAP_AGG(c.CountyCode, c.CountyName) AS Counties
FROM Counties AS c
LEFT JOIN @nonEnglishCounties AS nec ON c.CountyCode == nec.CountyCode
WHERE nec.CountyCode IS NULL;

We’ve declared an alias of “c” for the Counties table. The columns from this table are passed to the MAP_AGG function. c.CountyCode is used as the Key, c.CountyName is used as the Value. On the surface, this looks just like any other SQL statement – MAP_AGG could be any function that accepts two parameters. It’s what happens once the statement has completed execution that we’re interested in!

Viewing Data in a SQL.MAP Column

Run the script to populate the Countries table. Now that the data is in the table, how the heck do we pull it out? As a first attempt, let’s try using the Server Explorer. Open this up, navigate to (Local)/U-SQL Databases/UkPostcodes/Tables/Postcodes.Countries (assuming you’re testing this locally – otherwise, follow the relevant path within your Azure account). Right-click on this and choose the Preview option.


The preview screen appears and attempts to load the data. Unfortunately, it all goes wrong with a Fetch table data error.

This is because the preview screen can’t handle a SQL.MAP or SQL.ARRAY column “out of the box”. We’ll have to write a custom OUTPUT statement to access the data instead.

Selecting SQL.MAP Data (or not!)

If you haven’t downloaded from GitHub, add a new script called 620 Select Countries Data.usql. Here is our first attempt at writing the data in the Countries table to a file.

USE DATABASE UkPostcodes;
USE SCHEMA Postcodes;

@countries = SELECT cy.CountryName, cy.Counties FROM Countries AS cy;

OUTPUT @countries TO "/outputs/countries.tsv" USING Outputters.Tsv();

Looks good, doesn’t it? Unfortunately, it doesn’t work – this is pretty much the same statement the preview screen executed earlier. This time, we actually see a descriptive error message.

It isn’t just the preview screen that has a problem with raw SQL.MAP columns! To correctly output the data, we need to use the EXPLODE statement. MAP_AGG put the data into the SQL.MAP column, and now EXPLODE transforms it back to separate columns. Replace the code above with this:

USE DATABASE UkPostcodes;
USE SCHEMA Postcodes;

@countries =
SELECT cy.CountryName, c.CCode AS CountyCode, c.CName AS CountyName
FROM Countries AS cy
CROSS APPLY EXPLODE(cy.Counties) AS c(CCode, CName);

OUTPUT @countries TO "/outputs/countries.tsv" USING Outputters.Tsv();

If you run this, the data will be outputted in the expected format.

It’s worth breaking this SELECT statement down. The SELECT line outputs three columns – CountryName, CCode (County Code), and CName (County Name). CountryName is a standard string column, coming straight from the Countries table. CCode and CName are generated from the SQL.MAP, via the EXPLODE statement we met in the last article. The EXPLODE statement takes the Counties SQL.MAP column as an input, and returns two columns – Key (CCode) and Value (CName). We can give these columns any name we want, but the SQL.MAP key will be returned as the first column, with the value as the second column.

Populating a SQL.MAP From a File

Creating a SQL.MAP from a file is pretty much the same as creating it from a table. You just SELECT the data in the same manner.

INSERT INTO Countries (CountryName, Counties)
SELECT "England" AS CountryName, MAP_AGG(CountyCode, CountyName) AS Counties
FROM
(EXTRACT CountyCode string, CountyName string
FROM @CountyFilePath USING Extractors.Csv()) AS counties;

Populating a SQL.MAP Manually

Finally, it’s also possible to manually add items to a SQL.MAP. Here’s one way of making that happen.

// Declare the manual values here
@basePeople = SELECT * FROM (VALUES
("MMQ1", "Mike McQuillan"), ("BMQ1", "Bertie McQuillan"), ("DMQ1", "Dolly McQuillan"))
AS p(PersonId, PersonName);

// Convert them into a SQL.MAP
@persons = SELECT new SQL.MAP<string, string>{{PersonId, PersonName}} AS PersonList
FROM @basePeople;

// Create an output set
@outputData = SELECT pData.PersonId, pData.PersonName
FROM @persons AS p CROSS APPLY EXPLODE(PersonList) AS pData(PersonId, PersonName);

// Write out the data to file
OUTPUT @outputData TO "/outputs/people.tsv" USING Outputters.Tsv();

There isn’t much to it – the manual values are declared in the first statement, and from then on it’s pretty similar to what we’ve already seen.

SQL.MAP Methods

To finish up, we’ll take a quick look at some of the extra methods you can call on a SQL.MAP column. You may recall we did something similar for SQL.ARRAY.
It’s pretty easy to obtain a COUNT of the number of counties for each country – SQL.MAP provides us with a Count() method.

SELECT cy.CountryName, cy.Counties.Count() AS CountyCount FROM Countries AS cy;

Here’s the output:

We can also use the ContainsKey method, to determine which country a particular county belongs to (just in case your geography isn’t top of the class). Here’s an example.

// E10000003 = Cambridgeshire
@countries = SELECT cy.CountryName, cy.Counties.ContainsKey("E10000003") AS ContainsCounty FROM Countries AS cy;

This returns true for England, false for all of the others.

There are lots of available methods – take a look at the documentation on the .NET Dictionary class to gain an idea of what you can do. Be aware that not all of these methods are available on SQL.MAP (ContainsValue, for instance).

Summary

U-SQL provides developers with two complex types – SQL.ARRAY, and SQL.MAP. We’ve now covered both of these, and have seen the extra abilities they provide us with. We’ll delve into another aspect of U-SQL soon – see you next time!

 

This article is part of the Stairway to U-SQL Stairway

Sign up to our RSS feed and get notified as soon as we publish a new level in the Stairway! Rss

Total article views: 1009 | Views in the last 30 days: 3
 
Related Articles
FORUM

conversion column string to date in ssis

conversion column string to date in ssis

FORUM

combine column values into a string...

combine column values into a string...

SCRIPT

Find a text string in any column in a database

Find a string in any "string" (char, varchar, nchar etc) column in the database.

FORUM

how to Find And remove string in the column

how to Find And remove string in the column

FORUM

Split String

Split comma delimitted String Into Columns

Tags
azure    
cloud    
data lakes    
sql.map    
u-sql    
 
Contribute