SQLServerCentral Article

Azure DWH Part 14: PolyBase to Access Non-Relational Data

Introduction

PolyBase lets you access non-relational data using T-SQL, as if it were a SQL table in Azure SQL Database or Azure SQL Data Warehouse (ASDW). In this tutorial, we will learn how to access a CSV file stored in Azure Storage using ASDW.

Requirements

  1. SQL Server Management Studio (SSMS) 17 or Visual Studio.
  2. An Azure account.
  3. A CSV file (you can get it at the bottom of the article in the resources section).

Getting Started

If you do not have experience with Azure, it can be a little hard to work with PolyBase. Here we present a brief description of the steps to follow.

  1. Create a storage account. The storage account will be used to upload the CSV file that will be accessed as a table at the end of the article.
  2. Create a container in the storage account. It is like a folder.
  3. Upload the CSV file to the container.
  4. Create a master key to encrypt the information with a password.
  5. Create a credential (in this step we provide an access key to the storage account).
  6. Create an external data source (here we associate the credential with the container URL).
  7. Create an external file format (this step defines the type of file, which is a CSV file in this example).
  8. Create an external table (this is the table that extracts the information from the CSV file and can be queried like any SQL table).
  9. Finally, we will check the functionality of the external table by running T-SQL SELECT statements with WHERE and JOIN clauses.

Each of these steps is detailed below.

1. In the Azure Portal, go to More services > Storage accounts

Add a new storage account:

Add a name for the storage account. In the Deployment model section, use Resource Manager; Classic is the old model and is not recommended. In Account kind, we will use Blob storage because we want to store information in Azure Blob storage. In Performance, the cheapest option is Standard; this scenario is a test, so we do not require Premium. In Replication, you have options to keep copies locally, in a different data center, or in the same data center with multiple copies. You can have 3 to 6 copies, and the most expensive option includes a secondary read-only copy:

The Access tier allows you to select between Cool storage (infrequently accessed data) and Hot (frequently accessed data). Storage encryption and Secure transfer are security options. You can also select your subscription, resource group, and location. Once the options are selected, press the Create button:

Once the storage account is created, click on it:

Click the +Container option to create a new container in the new storage account. Specify the name and access type, and press OK:

Click on the container just created:

Press the Upload button:

Upload the CSV file mentioned in the requirements:

In the storage account, go to Access keys and copy key1 or key2:

With the Query Editor in the Azure Portal, you can run most queries, but not the ones that create an external table. This may change in the future; when this article was written, the Query Editor was in preview. That is why it is recommended to use SSMS from a local machine:

Connect to the ASDW using SSMS and add a master key:

CREATE MASTER KEY ENCRYPTION BY PASSWORD='MySqlCentralPwd@$w0rd';
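
If you are not sure whether a database master key already exists, you can check the catalog first; creating a second one raises an error. A quick check (##MS_DatabaseMasterKey## is the fixed internal name of the key):

-- Check whether a database master key already exists
SELECT name, create_date
FROM sys.symmetric_keys
WHERE name = '##MS_DatabaseMasterKey##';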

We will next create a credential, named mysqlservercredential (or any name of your preference). The SECRET value is one of the access keys of the Azure Storage Account; it can be key1 or key2:

CREATE DATABASE SCOPED CREDENTIAL mysqlservercredential
WITH IDENTITY = 'credential', SECRET = 'guwBPL6+k2obC0cTcStpxu8ussey';
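
To confirm that the credential was created, you can query the catalog view sys.database_scoped_credentials; the secret itself is write-only and is never displayed:

-- List database scoped credentials (the secret is not exposed)
SELECT name, credential_identity, create_date
FROM sys.database_scoped_credentials;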

To create the data source, you will need the URL of the Blob service endpoint. In the Azure Portal, in the storage account, go to Overview and check the URL address:

We will create an external data source named prices (it can be any name). In LOCATION, we use the following format:

wasbs://container_name@Blob_Service_Endpoint/

Where container_name is container1, the container created previously inside the storage account. The Blob_Service_Endpoint can be taken from the previous image:

CREATE EXTERNAL DATA SOURCE prices
WITH (
    TYPE = HADOOP,
    LOCATION = 'wasbs://container1@sqlservercentraldwh.blob.core.windows.net/',
    CREDENTIAL = mysqlservercredential
);

Creating an external file format defines the format of the file. In this example, it is a comma-separated values (CSV) file, so the field terminator will be a comma:

CREATE EXTERNAL FILE FORMAT sqlservercentralformat 
WITH ( 
    FORMAT_TYPE = DELIMITEDTEXT, 
    FORMAT_OPTIONS ( 
        FIELD_TERMINATOR = ','
    ) 
);
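
If your CSV wraps text values in double quotes, FORMAT_OPTIONS can also declare a string delimiter. A minimal variant, assuming quoted fields (the format name sqlservercentralquoted is just an example):

CREATE EXTERNAL FILE FORMAT sqlservercentralquoted
WITH (
    FORMAT_TYPE = DELIMITEDTEXT,
    FORMAT_OPTIONS (
        FIELD_TERMINATOR = ',',
        STRING_DELIMITER = '"',  -- strip the quotes around text values
        USE_TYPE_DEFAULT = TRUE  -- replace missing values with the column type default
    )
);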

Finally, we will create an external table based on the data in the CSV file mentioned in the requirements section, specifying the file format and the data source created earlier:

CREATE EXTERNAL TABLE prices
( 
     Id int,
     product VARCHAR(64),
     price int
) 
WITH 
( 
    LOCATION = '/', 
    DATA_SOURCE = prices, 
    FILE_FORMAT = sqlservercentralformat
);

If everything is OK, you will be able to run a query to show the data in the external table:

select * from dbo.prices

You will now be able to see the information from the CSV file in the Azure Storage Account using T-SQL:
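
Because every query against the external table reads the file in Azure Storage, a common next step is to materialize the data into a regular distributed table with CTAS. A sketch, where the table name prices_local and the ROUND_ROBIN distribution are illustrative choices:

-- Load the external data once into a physical ASDW table
CREATE TABLE dbo.prices_local
WITH (DISTRIBUTION = ROUND_ROBIN)
AS
SELECT Id, product, price
FROM dbo.prices;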

There are two common error messages that you may see when you run queries on external tables:

Error 1:

Msg 107090, Level 16, State 1, Line 15
Query aborted-- the maximum reject threshold (0 rows) was reached while reading from an external source: 1 rows rejected out of total 1 rows processed.
(/prices.csv)Column ordinal: 2, Expected data type: VARCHAR(4) collate SQL_Latin1_General_CP1_CI_AS, Offending value: price  (Column Conversion Error), Error: String or binary data would be truncated.

This error occurs when the data is truncated: a value in the file is longer than the column's data type allows. For example, if the value has 4 characters and your column is varchar(2), you will need to change it to varchar(4).
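
External table columns cannot be altered in place, so the fix is to drop and recreate the table with a wider type. A sketch, assuming the product column was the one that truncated:

DROP EXTERNAL TABLE prices;

CREATE EXTERNAL TABLE prices
(
     Id int,
     product VARCHAR(64),  -- widened from the VARCHAR(4) that truncated values
     price int
)
WITH
(
    LOCATION = '/',
    DATA_SOURCE = prices,
    FILE_FORMAT = sqlservercentralformat
);
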
Error 2:
Msg 107090, Level 16, State 1, Line 16
Query aborted-- the maximum reject threshold (0 rows) was reached while reading from an external source: 1 rows rejected out of total 1 rows processed.
(/prices.csv)Column ordinal: 0, Expected data type: INT, Offending value: Id  (Column Conversion Error), Error: Error converting data type NVARCHAR to INT.

This error occurs when the data type of an external table column is not compatible with the CSV file; for example, the column is an integer but the CSV contains text. To fix the problem, you may need to modify the data in the CSV file or change the data type in the external table.
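
A frequent cause is a header row in the CSV: the text 'Id' in the first row cannot be converted to INT. If you cannot edit the file, one option is to let PolyBase skip a limited number of unconvertible rows with the REJECT options. A sketch that tolerates a single bad row:

DROP EXTERNAL TABLE prices;

CREATE EXTERNAL TABLE prices
(
     Id int,
     product VARCHAR(64),
     price int
)
WITH
(
    LOCATION = '/',
    DATA_SOURCE = prices,
    FILE_FORMAT = sqlservercentralformat,
    REJECT_TYPE = VALUE,  -- count rejected rows rather than a percentage
    REJECT_VALUE = 1      -- allow up to one rejected row (e.g. the header)
);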

Some Tests On the External Table

Now that we have the external table, we can check that it works. Let's run a query with a WHERE clause:

select product
 from dbo.prices
 where price = 550

As you can see, we are running queries against a CSV file using T-SQL as if it were a SQL table:
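
Aggregations work against the external table as well. For example, counting products per price:

-- GROUP BY over the external table, exactly as over a normal table
SELECT price, COUNT(*) AS product_count
FROM dbo.prices
GROUP BY price
ORDER BY price;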

We will now check if we can run inner joins with this external table. To do that, we will create a SQL table:

CREATE TABLE productdesc
(
  id int,
  [product details] varchar(60)
)

We will then add some data:

insert into productdesc values (1, 'iphone 7')
insert into productdesc values (2, 'Samsung Galaxy S7')

We will now run a query with an inner join to get the common information from the external table and the SQL table:

select
  p.id,
  p.product,
  pd.[product details]
 from prices p
 inner join productdesc pd
  on p.id=pd.id

As you can see, you can run queries with inner joins, or other joins, to the external table:

Conclusion

In this article, we learned how to create an external table over a CSV file stored in an Azure Storage Account, and how to query that file with T-SQL.

Resources
