The Azure Data Factory is a means of moving data around in the cloud. In a similar way that SQL Server Integration Services (SSIS) can be used to transform and load data in on premise versions of SQL Server, Azure Data Factory is a tool that you could use to perform a similar function in Azure. In the next series of posts, I'm going to take a look at some of the different aspect of the Azure Data Factory.
I am assuming that most people have used SSIS at least to some extent. For this reason, I'm going to start with an SSIS package and show how we can create a similar transformation using the Azure Data Factory. This article is not meant to be a comparison of these technologies (if you want to see a great comparison of SSIS vs. Azure Data Factory, watch Reza Rad's presentation - http://www.radacad.com/our-session-at-sql-rally-azure-data-factory-vs-ssis from the SQL PASS Rally Nordic 2015). What I will be doing is using some of the concepts that exist in SSIS as a means to introduce similar concepts in the Azure Data Factory.
The SSIS Package
We are going to start with a very simple SSIS package that takes a flat file and loads the contents of the file into a table. Later in the series, we will update the SSIS package to transform the data before it reaches its destination. The SSIS package probably looks quite familiar as can be seen below.
The Azure Data Factory
In Azure, we are going to create a Data Factory that populates some Customers based on a CSV input file. In my last post, I used AzCopy to upload the file to Azure Blob Storage. The Data Factory we are creating will load that file into a table in Azure Database. The completed Data Factory will appear as follows:
In the diagram above, we are going to create 1) a Service and Dataset that will represent the flat file that we have uploaded to SQL Blob Storage, 3) a Service and Dataset that will represent the table that is going to accept the data in Azure SQL Database and 2) a Pipeline that will move the data from the flat file in Blob Storage to the Azure SQL Table. When we compare the Azure Data Factory to the SSIS Package, we can see that the Service and Dataset 1 are very similar to the Flat File Data Source (both numbered 1 above), the Service and Dataset 3 are very similar to the OLEDB DB Destination (both numbered 3) while the Pipeline is very similar to the Data Flow Path from the SSIS Package (both numbered 2). In the next section, I'll go through how to create the Service and Dataset to connect to the flat file that is stored in Azure Blob Storage.
Creating the Service and Customer Input Dataset
The first thing that needs to be done is to create an Azure Data Factory. In order to do that, we will go to the Azure Preview Portal and click All Resources->Data + Analytics->Data Factory and then fill in the necessary fields (as shown below).
Then, we will click the Data Factory tile on the Start Page and then click "Author and Deploy" in order to launch the Data Factory Editor.
Next we need to create a Service to link to the Azure Blob Storage. In order to create the Serivce, go to "New Data Store" and select "Azure Storage". The first thing that you will notice is that all of the coding that you do for the Dataset is done in JSON. Another thing that you will notice is that there is no funky User Interface that lets you create the JSON - you just edit the JSON by hand. Personally, I'm not that big of a fan of this approach. At the end of the day, SSIS packages are XML, and I never have to edit XML when I create an SSIS package - I have an editor for each component. I'm digressing a bit here, but a User Interface would be nice. In the JSON, you will need to add the name of the account and the Account Key. These can be found in the properties tab of the Azure Blob Storage Account. Once you've completed the changes to the JSON, click "Deploy".
Now that we have created a Service to attach to Azure Blob Storage, we need to create a Dataset in order to represent the structure of the file that we are going to import. The file that we are going to upload into the Azure can be seen below. The file contains the following information: First Name, Last Name, Credit Card Number, Gender and the Data Lineage as to who inserted the record.
In order to create the Dataset, we click "New Dataset". Again, we get supplied with some JSON as a template. We change the JSON so it appears as below. The following things are changed in the JSON:
- We change the structure so that it matches the structure of the CustomerLoad.csv file.
- We add in the name of the file, which in our case is CustomerLoad.csv.
- We add the name of the container that contains the input file.
Up to this point we've started the creation of an Azure Data Factory. We've created a Service and Dataset in order to connect to a flat file that we've stored in Azure Blob Storage. These concepts are very similar to a Flat File Connection in SSIS. In the next post, we will create the connection to the Azure SQL Server Database and a Pipeline to connect the source and destination.