Stairway to U-SQL

Stairway to U-SQL Level 6: U-SQL Indexes and Partition Schemes

,

Last time out, we looked at creating databases, tables and schemas in U-SQL. We learnt that it isn’t possible to create a table without also creating an index for the table and a partition scheme for the index. In this latest article, we’ll take a look at:

  • The types of index U-SQL makes available to us
  • What a partition scheme is
  • The types of partition scheme available to us
  • How a partition scheme works
  • How to insert data into a partitioned table
  • Why an index is useful in U-SQL

As usual, there’s a lot for us to work through, so let’s go!

Indexes

As you’re reading this on SQLServerCentral, there’s a pretty good chance you know what an index is. But for the uninitiated, I’ll quickly say that an index is something that makes your searches faster. That’s a very (very!) simple explanation of what an index is, but for our purposes it will suffice. SQL Server supports many types of indexes – clustered, nonclustered, unique, columnstore, XML…the list goes on. U-SQL supports one type of index – the clustered index (other indexes are planned for future releases). These clustered indexes work in a similar manner to SQL Server clustered indexes, by ensuring the data is stored on disk in the specified order.

IMPORTANT! If you want to drop the clustered index at some later point, you won’t be able to do it. Instead, you’ll have to drop the entire table. This sounds bad to us data-centric folks, but as you’ll probably be using U-SQL for reporting purposes, it might not be a huge problem. There’s a possibility you’ll be creating tables, loading data, querying, then dropping the tables to save money in Azure. Of course, if you are planning on keeping the data around, you need to make sure you correctly cluster the table when you create it.

There are two ways to create an index, both of which we saw in Level 5. You can either create the index as part of the CREATE TABLE statement, or use the CREATE INDEX statement. Here’s the code we used last time to create the Counties table.

CREATE TABLE IF NOT EXISTS Counties
(
CountyCode string,
CountyName string,
INDEX idx_Counties CLUSTERED (CountyCode)
DISTRIBUTED BY HASH (CountyCode)
);

Note the INDEX and DISTRIBUTED BY lines. This is where the index is specified. The index lines can be broken down as:

INDEX index_name CLUSTERED (field list)

DISTRIBUTED BY partition_specification

It is possible to cluster and distribute (partition) on more than one column. Now, here’s the same index, but specified as a separate CREATE INDEX statement.

CREATE CLUSTERED INDEX idx_Counties ON Postcodes.Counties (CountyCode)
DISTRIBUTED BY HASH (CountyCode);

Exactly the same, but separated as an individual statement. Use whichever one you’re more comfortable with. So, we have an index. A partition scheme is specified in our statement, but an index without a partition scheme is useless. Why is it useless? Let’s find out!

What is a Partition Scheme?

When you upload a file to the Azure Data Lake Store, it is split into a number of extents. This is similar to the process used by Hadoop – the file is split into smaller chunks and replicated three times (by default). This is done to enable parallel reads and parallel writes to the file, speeding up file processing times. So the index dictates how the data in the table is stored, and the partition scheme dictates how the data is partitioned and distributed across the system.

There are two types of partitioning supported by the Data Lake Store:

  • Horizontal partitioning - Also known as fine grained, this ensures similar data values are kept close to each other, reducing the need to hit multiple machines to return a set of specific values.
  • Vertical partitioning - In this scenario, the developer controls the partitions. You specify the partition keys, and the partition can be directly specified in your U-SQL statements. This is also known as coarse grained partitioning.

We’ll take a look at both of these schemes in this article.

Horizontal Partitioning

U-SQL provides four different distribution schemes to support horizontal partitioning.

  • HASH (keys) - The key values supplied (CountyCode in the script we saw earlier, for instance) are hashed to provide fast lookups when queried. Multiple keys can be provided if necessary, forming a composite key.
  • DIRECT HASH (ID) - This gives the developer (i.e. you or me) more control over how the data is hashed. It requires you to provide the hashing key as part of the table. Instead of partitioning on the clustered columns like a normal HASH, this requires a numeric column to provide the partition IDs. This is often used with functions like RANK() and DENSE_RANK() (yes, these exist in U-SQL).
  • RANGE (keys) - This will ensure ranges of values are kept together. It works in a similar way to HASH – you provide the key (or keys), and the data will be partitioned across the Data Lake as specified.
  • ROUND ROBIN - The preceding distribution schemes may lead to something called data skew. This happens when your data doesn’t line up with the keys you’ve specified in the distribution scheme. Should this happen, you have two options – replace the keys in your chosen distribution scheme, or use the ROUND ROBIN scheme. ROUND ROBIN doesn’t use any keys, and instead partitions the data equally across all nodes. This is a good “catch-all” model to use when none of the other schemes can meet your requirements.

We’ll take a look at HASH in this article, and will investigate the other distribution schemes in the next article.

Vertical Partitioning

With vertical partitioning, it is up to the script’s developer to explicitly define the distribution scheme. You specify the columns used to define buckets, the columns to be used for hashing, and finally the number of buckets required. By explicitly specifying the columns, you dictate into which bucket certain data are placed. For example, creating a vertical partition on the County Code column would ensure the records for each county exist within a specific bucket. When a filtered query is executed on County Code U-SQL can go straight to the required buckets.

It is also possible to define an “unknown” partition, and to direct records that don’t match the criteria to that partition. You need to specify this in your INSERT statements – we’ll look at this a bit later.

Why is Partitioning Important?

U-SQL jobs are split out into vertices (singular: vertex). You can think of a vertex as a parallel unit of execution. Your partition scheme helps to determine how much work an individual vertex has to do. Say you’ve partitioned on the District Code column, and 70% of your records have the same District Code. 70% of your records could be processed by the same vertex. Not necessarily a bad thing, until you learn that vertices time out after five hours. It’s very important that no element of your jobs can take this long. If you’re concerned, consider the ROUND ROBIN scheme, which will evenly split the data.

How a File is Searched

Say we want to search for all postcodes from my home district of Knowsley (this is next to Liverpool for those who are geographically interested). I happen to know that the District Code for Knowsley is E08000011. Open up Visual Studio and open up the SSC_UkPostcodes Visual Studio project we started in article 5. The project should have scripts from 010 to 060. In the Solution Explorer, right-click on the project name and select Add > New Item. Create a new U-SQL Script called 070 Extract Knowsley Data.usql. Here’s the query to return the records, which you need to type or paste into the script:

USE DATABASE UkPostcodes;
DECLARE @DataFilePath string = "/ssc_uk_postcodes/postcodes/Postcodes.csv";
@postcodes =
EXTRACT Postcode string,
CountyCode string,
DistrictCode string,
CountryCode string,
Latitude decimal?,
Longitude decimal?
FROM @DataFilePath
USING Extractors.Csv();
@knowsleyPostcodes =
SELECT Postcode,
CountyCode,
DistrictCode,
CountryCode,
Latitude,
Longitude
FROM @postcodes
WHERE DistrictCode == "E08000011";
OUTPUT @knowsleyPostcodes
TO "/outputs/knowsleyPostcodes.tsv"
USING Outputters.Tsv();

IMPORTANT NOTE: File names specified in U-SQL scripts are case-sensitive. Make sure the name in the script exactly matches the name of the file.

If this doesn’t look familiar, please refer back to the start of the series to see how U-SQL statements are formed.

Before running this, decide whether you want to execute the scripts on your local machine or against Azure. You will see the same execution plans no matter the environment, but you’ll receive more information back from Azure.  If you do go with Azure, make sure you select the correct Data Lake Analytics account in the drop-down on the U-SQL toolbar, then run the query (more details in Level 4 if you need to refer back).

The postcodes file this query uses has not been executed in parallel – it’s all been processed in one big lump. Here’s the execution plan to prove it.

Just two vertices used – one to extract the data from the CSV file, and another to load it into the TSV file. Clearly, reading directly from files isn’t the most efficient way to use Azure Data Lakes. The file is acting in the same way a heap table does in SQL Server. Fortunately, we introduced tables in our last article. By loading our data into a table we’ll be able to improve our query’s performance.

Creating, Populating and Querying a Horizontally Partitioned Table

Why don’t we create a postcodes table, partitioned by District Code. Here’s the code to create such a table. Add a new script to your project and call it 080 Create Partitioned Postcodes_DistrictCode Table.usql before adding the code to it.

  USE DATABASE UkPostcodes;
  USE SCHEMA Postcodes;
  
  CREATE TABLE IF NOT EXISTS Postcodes_DistrictCode
  (
  Postcode string,
  CountyCode string,
  DistrictCode string,
  CountryCode string,
  Latitude decimal?,
  Longitude decimal?,
  INDEX idx_Postcodes
  CLUSTERED(Postcode)
  DISTRIBUTED BY HASH (DistrictCode)
  );

We’ve partitioned this by HASH (DistrictCode). Run this against your Azure Data Lake account to create the table in Azure. Next, we need to insert some data into the table. This is the first time we’ve inserted data into a table in the series. We pull out the data from a file, then use a very familiar looking INSERT statement to put the data into the table we just created. This is a very simple example; if you so desired, you could easily filter out any data/columns you didn’t want before inserting into the table. You don’t need to exactly match the columns in the file with the columns in the table. Create a new script called 090 Insert Postcode_DistrictCode Data.usql and add this code.

  USE DATABASE UkPostcodes;
  USE SCHEMA Postcodes;
  
  DECLARE @DataFilePath string = "/ssc_uk_postcodes/postcodes/Postcodes.csv";
  
  @postcodes =
  EXTRACT Postcode string,
  CountyCode string,
  DistrictCode string,
  CountryCode string,
  Latitude decimal?,
  Longitude decimal?
  FROM @DataFilePath
  USING Extractors.Csv();
  
  INSERT INTO Postcodes_DistrictCode
  (
  Postcode,
  CountyCode,
  DistrictCode,
  CountryCode,
  Latitude,
  Longitude
  )
  SELECT Postcode,
  CountyCode,
  DistrictCode,
  CountryCode,
  Latitude,
  Longitude
  FROM @postcodes;

Run this against Azure to load the data into the table. The execution plan for this data load looks a bit more interesting than our previous effort:

27 vertices used here, 25 of which were for the data load. This was the Azure Data Lake’s parallelism capability in action, running tasks in parallel. It took a little longer to load the data, but this will greatly assist us when running queries against this table in the future. The multiple vertices have been used to split the postcodes file into extents. Each extent stores a certain amount of data, and the extents are spread and duplicated across servers, giving speed (via parallelism) and reliability (via the duplication). With the correct partitioning scheme, it’s possible to return data using only one extent (as we’ll see imminently).

Now, here’s the last part of the puzzle – the query that obtains the Knowsley postcodes from the table. When we ran this last time against the file, U-SQL would have run the query something like this:

As the top quality diagram shows, U-SQL would have executed something that essentially boiled down to a table scan. Because it didn’t know which extents held the Knowsley data, it had to query the entire file. Now that we have a table, we can improve this situation. Add another new U-SQL script to your Visual Studio project. This one is called 100 Extract Knowsley Postcode Data.usql. Surprise, surprise – it’s going to pull the Knowsley records from the table we’ve just populated.

  USE DATABASE UkPostcodes;
  USE SCHEMA Postcodes;
  
  @knowsleyPostcodes =
  SELECT Postcode,
  CountyCode,
  DistrictCode,
  CountryCode,
  Latitude,
  Longitude
  FROM Postcodes_DistrictCode
  WHERE DistrictCode == "E08000011"; // Knowsley district code
  
  OUTPUT @knowsleyPostcodes
  TO "/outputs/knowsleyPostcodes.tsv"
  USING Outputters.Tsv();

Here’s another cool diagram (drawn in Microsoft Paint, no expense spared here), which shows what should happen this time.

If we presume the Knowsley data can be found in Extent B, U-SQL will use the index and partition scheme to go straight to the extent containing the data. If you execute this script against your Azure account, we should see an execution plan that looks reasonably similar to the diagram.

Lo and behold, one vertex! It’s powerful stuff. Just to prove the difference, here’s the plan for the same query, but with the table partitioned on CountyCode. Note three vertices were used, as the data had to be extracted first.

Have a play with the other horizontal partition types to see how they affect the execution plan – we’ll investigate them further in a future article. To finish this introduction to partition schemes and indexes, we’re going to take a look at vertical partitioning.

Creating, Populating and Querying a Vertically Partitioned Table

The data set we are using contains 27 different county codes (I counted ‘em). Remember that vertical partitions are split into buckets. As a result, we need two statements to create this table – one to create the table, and another to create the partitions. Add the new U-SQL script 110 Create Vertical Partitioned Postcodes Table.usql to the Visual Studio project. Here’s the CREATE TABLE statement you should firstly add to that script:

  USE DATABASE UkPostcodes;
  USE SCHEMA Postcodes;
  
  CREATE TABLE IF NOT EXISTS Postcodes_VerticalPartitioning
  (
  Postcode string,
  CountyCode string,
  DistrictCode string,
  CountryCode string,
  Latitude decimal?,
  Longitude decimal?,
  INDEX idx_Postcodes
  CLUSTERED (Postcode)
  PARTITIONED BY (CountyCode)
  DISTRIBUTED BY HASH (DistrictCode) INTO 3
  );

Pretty much the same as always, except for the new PARTITIONED BY and INTO 3 statements. The index is clustered on postcode, so the data will be sorted on disk in postcode order. The partition scheme uses County Code to split the data into buckets, with the data within those buckets using a hashed distribution scheme based on the District Code. We’ve specified INTO 3, which explicitly tells U-SQL to split the file across three extents. What you choose here depends upon the size of your data.

There are a couple of things to highlight about buckets. Firstly, you aren’t limited to partitioning on a single column – if you need to partition using two or three columns, you can go for it! You can also use any of the horizontal partitioning schemes we saw earlier as the distribution scheme – the example uses HASH, but ROUND ROBIN, RANGE or DIRECT HASH could be used. With that said, let’s add the ALTER TABLE statement that defines the buckets to the script we’ve just created:

  ALTER TABLE Postcodes_VerticalPartitioning
  ADD PARTITION("E10000002"),
  PARTITION("E10000003"),
  PARTITION("E10000006"),
  PARTITION("E10000007"),
  PARTITION("E10000008"),
  PARTITION("E10000009"),
  PARTITION("E10000011"),
  PARTITION("E10000012"),
  PARTITION("E10000013"),
  PARTITION("E10000014"),
  PARTITION("E10000015"),
  PARTITION("E10000016"),
  PARTITION("E10000017"),
  PARTITION("E10000018"),
  PARTITION("E10000019"),
  PARTITION("E10000020"),
  PARTITION("E10000021"),
  PARTITION("E10000023"),
  PARTITION("E10000024"),
  PARTITION("E10000025"),
  PARTITION("E10000027"),
  PARTITION("E10000028"),
  PARTITION("E10000029"),
  PARTITION("E10000030"),
  PARTITION("E10000031"),
  PARTITION("E10000032"),
  PARTITION("E10000034"),
  PARTITION("UNKNOWN");

Now, you may be thinking I’m a liar – after all, I said we’d create 27 buckets, and this statement creates 28! The final bucket is a catch-all, for values that don’t match any of the County Code values we’ve specified for the individual buckets. In theory, you are not limited to one value per bucket – it’s perfectly acceptable to specify multiple values, e.g.:

PARTITION("E10000031", "E10000032", "E10000034")

BUT! There seems to be a bug in U-SQL at present – if you specify multiple values for a partition you may or may not be able to insert your data, and if you do manage to insert it you won’t be able to extract it out! As a result, it’s probably a good idea to limit your partitions to single values for the time being.

You can specify the same value for multiple buckets, but this will cause issues when you try to retrieve your data – so don’t do it! One other thing worth mentioning – you will not be able to insert data into a table that has buckets specified until you’ve added the partitions. Now, save the script and run it against your Azure account. The table will be created. We need a script to populate it. Add a new U-SQL script called 120 Insert Vertical Partitioned Postcodes Data.usql and add the code below to it. See if you can spot the difference from our previous INSERT statement.

  USE DATABASE UkPostcodes;
  USE SCHEMA Postcodes;
  
  DECLARE @DataFilePath string = "/ssc_uk_postcodes/postcodes/Postcodes.csv";
  
  @postcodes =
  EXTRACT Postcode string,
  CountyCode string,
  DistrictCode string,
  CountryCode string,
  Latitude decimal?,
  Longitude decimal?
  FROM @DataFilePath
  USING Extractors.Csv();
  
  INSERT INTO Postcodes_VerticalPartitioning
  ON INTEGRITY VIOLATION MOVE TO PARTITION ("UNKNOWN")
  SELECT Postcode,
  CountyCode,
  DistrictCode,
  CountryCode,
  Latitude,
  Longitude
  FROM @postcodes;

It’s pretty similar to the INSERT statement we used for the horizontal partitioning, except for this line:

  ON INTEGRITY VIOLATION MOVE TO PARTITION ("UNKNOWN")

This tells U-SQL to put any rows it cannot match to a partition into the unknown partition. A nice feature. If you genuinely aren’t interested in these rows, you can tell U-SQL to ignore them, by instead specifying this line:

  ON INTEGRITY VIOLATION IGNORE

If you don’t specify one of these statements and the file does contain values that don’t match the partition values, the statement will throw an error:

If everything succeeds, you’ll see lots of vertices in the execution plan. You should also note there are 28 partitions shown in the image, although I’ve had to make them so small you might find it difficult to see them (the first row with lots of boxes show the partitions; the rows underneath these are the splits and aggregates executed for each partition).

Once the job completes (this does take a while – 36 minutes when I attempted it – the two orange boxes represent retries, so maybe they had something to do with the long run-time) and the data has been loaded, we can easily query individual partitions. Add script 130 Extract Vertical Partitioned Postcodes Data.usql to your Visual Studio project and add the code below, which pulls back data for Lancashire (CountyCode E10000017).

  USE DATABASE UkPostcodes;
  USE SCHEMA Postcodes;
  
  @lancsPostcodes =
  SELECT Postcode,
  DistrictCode,
  CountryCode,
  Latitude,
  Longitude
  FROM Postcodes_VerticalPartitioning
  WHERE CountyCode == "E10000017";
  
  OUTPUT @lancsPostcodes
  TO "/outputs/lancsPostcodes.tsv"
  USING Outputters.Tsv();

Run this and the Lancashire postcode file will be generated pretty quickly, as U-SQL only needs to pull data from a single bucket. There are no holes in this bucket!

Dropping Partitions

You can drop a partition if you wish, using the ALTER TABLE command (USE commands excluded for brevity):

ALTER TABLE Postcodes_VerticalPartitioning
DROP PARTITION("E10000017");

However, if you do run this the data that sits in that partition is also deleted. If you were to run the above, then run the EXTRACT script above, nothing would be returned. But if you were to change the CountyCode in the query to, say, E10000018, the data would be returned as you didn’t drop that particular partition. Be careful when dropping partitions!

Summary

Optimally storing your data in your Azure Data Lake Store can really help your U-SQL queries. This article has demonstrated a number of ways in which you can store your data, and we’ve also seen how the storage can directly affect the performance of your U-SQL queries.

We’ve looked at both vertical and horizontal partitioning, but only the HASH distribution scheme used by horizontal partitioning. We’ll take a look at the other three schemes in the next article.

This article is part of the parent stairway Stairway to U-SQL

Rate

You rated this post out of 5. Change rating

Share

Share

Rate

You rated this post out of 5. Change rating