
Part 2: The Modern Azure Data Warehouse – ADF and ADLS


In today’s data-driven world, having the right tools to manage and process large datasets is crucial. That’s where Azure Data Lake Storage (ADLS) and Azure Data Factory (ADF) come in handy, making it easier than ever to store and transform your data. In this post, I’ll show you how to set up ADLS to store your Parquet files and configure ADF to manage your data flows efficiently.

Check out Part 1 to find out how we chose the tech to use.

Azure SQL DB

You will need an Azure SQL DB for this process. If you don’t have one, you can set up the sample db, which is easy to do from the Create SQL Database setup in the Azure portal: configure your db as you like and choose Sample on the Additional settings page. I will be using an existing db and an existing table to show you the basic process, so you can do the same and don’t need to create the sample db if you don’t want to.
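If you do create the sample db (Azure’s sample is the AdventureWorksLT schema), a quick query like the one below is an easy way to confirm you have data to pull before wiring up ADF. The table and columns here assume that sample schema; swap in your own table if you’re using an existing db.

-- Sanity check: confirm there is data to copy
SELECT TOP (10) CustomerID, FirstName, LastName, CompanyName
FROM SalesLT.Customer;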

Azure Data Lake Storage

You will need a place to store the Parquet files that ADF will create. This is a standard storage account, but make sure you check the “Enable hierarchical namespace” box, which is what turns it into Data Lake Storage Gen2.

Also, make sure to create a container in your storage account. ADF will need this to store the Parquet files.

Creating an Azure Data Factory

Microsoft’s documentation walks you through the basic setup of your Data Factory. Once it’s created, click the Launch Studio button on the Overview panel of your data factory.

Set Up Link to Storage Account

We need to create a link to our storage account and our Azure SQL DB. In ADF, these connections are called linked services.

Sourced from https://www.tech-findings.com/2021/09/Loading-data-SQL-Tables-multiple-parquet-files.html, which, by the way, is a great tutorial for going in the opposite direction: from Parquet files into SQL tables.

Set up your connection like the following.

And now you have a link to your storage account from ADF.

Set Up Link to Azure SQL DB

You can use a few different authentication types to connect to Azure SQL from ADF.

I will show you how to connect with a System Assigned Managed Identity here, but you could use any of these as long as it has read access to your db.

To facilitate this, you must add your ADF as a user in your database with the appropriate permissions. For a system-assigned managed identity, the user name is the name of the data factory itself.

-- Create a contained user for the data factory's system-assigned managed identity
-- (the user name must match the name of the data factory)
CREATE USER [josephineadf]
FROM EXTERNAL PROVIDER WITH DEFAULT_SCHEMA=[dbo];
-- Grant read access so ADF can query the source tables
EXEC sp_addrolemember 'db_datareader', 'josephineadf';
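If you want to double-check the user and its role membership before testing the connection from ADF, a query like the one below against the same database will show it. This is just an optional sanity check, using the same user name as above.

-- Confirm the managed identity user exists and is a member of db_datareader
SELECT dp.name AS principal_name,
       dp.type_desc,
       r.name AS role_name
FROM sys.database_principals AS dp
JOIN sys.database_role_members AS drm ON drm.member_principal_id = dp.principal_id
JOIN sys.database_principals AS r ON r.principal_id = drm.role_principal_id
WHERE dp.name = N'josephineadf';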

Then, it’s a similar process: add a new linked service and search for Azure SQL.

Set it up like so, using whatever db you want to connect to.

If you’ve just added the managed identity to your Azure SQL DB, you might get an error like so.

But when I tested the connection again right away, it worked, so it seems the permissions just needed a moment to take effect.

Now, you have two linked services.

Validate and Publish Linked Services

Before you can move on to the next steps, you need to validate and publish your changes so far. Click Validate all to make sure everything is ready to publish.

This should produce output like the following.

Click Publish All.

You will see that your changes are published, as there is no longer a grey circle in front of your Azure SQL DB linked service.

Pulling Data from Azure SQL DB into Parquet Files

Click on the Home button in the ADF Studio and then click on Ingest.

Then, you’ll need to create a Built-in copy task, which I’m going to put on a daily schedule.

Choose the tables you want to ingest. I only have Ola set up in my db, so it’s a super tiny table that I would never normally import into a Parquet file; I’m using it only as a test to show you how it’s done.

The next page is where you can take the entire table or filter it down with a custom query or a datetime column. In this case, I’m testing with just the last day of values on each ADF run. This is also where you could, say, grab only the last hour to do incremental loads with ADF.
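For reference, a custom source query for that kind of time-window filter might look something like the sketch below. The table and column names are placeholders; point it at whichever datetime column your table has.

-- Placeholder example: pull only the last day of rows on each run
SELECT *
FROM dbo.CommandLog
WHERE StartTime >= DATEADD(DAY, -1, GETUTCDATE());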

Then, you need to choose a destination.

Then, you need to set the file format settings.

Name your copy data task and set anything else you need.

Then, you’ll get a summary page for review.

After that, you’ll get a final page where you can click Finish.

Running Your Pipeline

Once your pipeline is created, click on the Author button to expand your pipelines.

I’m going to trigger it right now to ensure it will create a parquet file in my data lake.

You will need to enter the run’s parameter values here.

If you want to see your pipeline runs, click on the Monitor button and then Pipeline runs.

We can see it was successful, so I’ll go check my storage account. I see a randomly named file, so I may want to change that to something more descriptive, like tablename_something. I may also want to account for incremental changes, but that’s for another day. I’d also like a way to easily loop through tables without having to change my ADF pipeline. For now, I’m happy to have a file in my storage account and to know that this basic ADF setup is working.
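As a rough sketch of where that table loop could start, a metadata query like the one below could feed an ADF Lookup activity, with a ForEach iterating over the results. This isn’t part of the pipeline built above; it’s just one possible approach.

-- List user tables (schema + name) that could drive a table-by-table copy loop
SELECT s.name AS schema_name,
       t.name AS table_name
FROM sys.tables AS t
JOIN sys.schemas AS s ON s.schema_id = t.schema_id
WHERE t.is_ms_shipped = 0
ORDER BY s.name, t.name;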

