Orchestrating Azure Databricks Notebooks with Azure Data Factory



Problem

While Azure Data Factory Data Flows offer robust GUI-based Spark transformations, there are certain complex transformations that they do not yet support. Additionally, your organization might already have Spark or Databricks jobs implemented, but need a more reliable way to trigger and orchestrate them alongside other processes in your data ingestion platform that exist outside of Databricks.

By orchestrating your Databricks notebooks through Azure Data Factory, you get the best of both tools: the native connectivity, workflow management, and trigger functionality built into Azure Data Factory, and the nearly limitless flexibility to code whatever you need within Databricks.

Solution

If you don't already have a free Azure account, follow this link to create a trial account.

In order to use Databricks with this free trial, go to your profile and change your subscription to pay-as-you-go. For more information, see Azure free account.

Also, if you have never used Azure Databricks, I recommend reading this tip which covers the basics.

Create an Azure Databricks Workspace

Please follow this link to another tip where we go over the steps of creating a Databricks workspace.

Create an Azure Data Factory Resource

Next, we need to create the Data Factory pipeline which will execute the Databricks notebook. Navigate back to the Azure Portal and search for 'data factories'.

Shows the Data Factories link in the Azure portal search tab.

Click on 'Data factories' and on the next screen click 'Add'.

Shows the Add button for creating a new Data Factory

On the following screen, pick the same resource group you had created earlier, choose a name for your Data Factory, and click 'Next: Git configuration'.

Shows the fields required to be filled out to create a new Data Factory, which are the resource group and the Name.

Normally, you would link your Data Factory to source control so you can save incomplete code and keep a general backup of your work. For now, check the box 'Configure Git later' and click 'Review and Create'.

Shows the Git configuration page for Data Factory, and where to click to Configure Git later.

Once your configurations are validated, click 'Create' and your Data Factory should be created!

Create a Databricks Linked Service

The next step is to create a linked service. A linked service within Data Factory is essentially a connection definition, much like a connection string, used to authenticate to different data sources or compute. In a production setting, all secrets, keys, and passwords should be stored in Azure Key Vault and then referenced within Data Factory.

To add the linked service, we first need to open Data Factory. Navigate to the Data Factory you just created and click on 'Author and Monitor'.

Shows the author and monitor button to launch Data Factory.

Once you are in the Data Factory, use the navigation bar on the left-hand side to move between the main components of Data Factory. Click the toolbox icon to open the settings. This is where we will add the linked service.

Shows the Data Factory left-side nav bar, and the toolbox icon to click to launch the settings menu.

The screen will automatically open on the Linked Services screen. Click 'New'.

Shows the setting menu and how to create a new Linked Service.

Switch from the 'Data store' tab to the 'Compute' tab, and select Azure Databricks. Click 'Continue'.

Shows the new Linked Service menu, and where to click to select Compute and Azure Databricks.

Before we complete this form, we need to go into Databricks to generate a user token. This token will allow Data Factory to authenticate to Databricks. Be careful with this token, as it gives whoever has it full access to your Databricks workspace.

Open Databricks, and in the top right-hand corner, click your workspace name. Then click 'User Settings'.

Shows the Databricks workspace menu and where to click User Settings to generate a new token.

This will bring you to an Access Tokens screen. Click 'Generate New Token', then add a comment and a duration for the token; the duration determines how long the token will remain active. Click 'Generate'.

Shows the generate new token button in the User Settings menu.

The token will then appear on your screen. Once you click 'Ok', it will never be shown again, so make sure you copy it now. With the token copied, go back to Data Factory.

There are a few things to fill out in the linked service. Add a name using some form of naming convention. Under 'Account selection method', select 'From Azure subscription'. This will allow you to select your subscription and your Databricks workspace.

For the cluster, we are going to use a new 'Job' cluster. This is a dynamic Databricks cluster that spins up just for the duration of the job and is then terminated. This is a great cost-saving option, though it does add roughly five minutes of processing time to the pipeline while the cluster starts up.

Paste the access token into the appropriate field and then select the Cluster options as I have done in the below screenshot. Once you are done, click 'Test Connection' to make sure everything has been entered properly.

Shows the configurations required in Data Factory for filling out the Databricks linked service.
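The UI steps above are all you need, but if you prefer to script this setup, roughly the same linked service can be created with the azure-mgmt-datafactory Python SDK. The sketch below is illustrative only: the subscription, resource group, factory name, workspace URL, node type, and runtime version are placeholders, and the model field names should be checked against the SDK version you install.

# Sketch: create a Databricks linked service with a new job cluster via the
# azure-mgmt-datafactory SDK (placeholder values throughout).
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    AzureDatabricksLinkedService,
    LinkedServiceResource,
    SecureString,
)

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

databricks_ls = LinkedServiceResource(
    properties=AzureDatabricksLinkedService(
        domain="https://<your-workspace>.azuredatabricks.net",
        access_token=SecureString(value="<databricks-access-token>"),
        # New job cluster settings, mirroring the choices made in the UI above.
        new_cluster_version="7.3.x-scala2.12",
        new_cluster_node_type="Standard_DS3_v2",
        new_cluster_num_of_worker="1",
    )
)

adf_client.linked_services.create_or_update(
    "<resource-group>", "<data-factory-name>", "AzureDatabricksLinkedService", databricks_ls
)

In a production setting you would swap the SecureString for a Key Vault secret reference rather than embedding the token in code, in line with the Key Vault guidance earlier in this tip.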

Import Databricks Notebook to Execute via Data Factory

The next step is to create a basic Databricks notebook to call. I have created a sample notebook that takes in a parameter, builds a DataFrame using the parameter as the column name, and then writes that DataFrame out to a Delta table.

To get this notebook, download the file 'demo-etl-notebook.dbc' that is attached to this tip.

To import the notebook, navigate to the Databricks home screen. Click 'Workspace' in the navigation bar on the left, and click 'Shared'. Click the caret next to 'Shared', and select 'Import'.

Shows the Databricks Workspace menu to import a new notebook. Click Workspace, Shared, the caret next to Shared, and then Import.

Select 'File', and browse to the 'demo-etl-notebook.dbc' file you just downloaded.

Click Import, and you should now have the notebook in your workspace. Open the notebook to look through the code and the comments to see what each step does.
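If you want a feel for the pattern before digging into the attached file, here is a minimal sketch of that kind of notebook. The widget name ('column_name') and the table name are illustrative; the attached notebook may use different names, so check the imported code for the actual widget name before wiring it up to Data Factory.

# Databricks notebook sketch: read a parameter, build a DataFrame,
# and write it out to a Delta table.

# 1. Declare a widget so the notebook can accept a parameter from Data Factory.
dbutils.widgets.text("column_name", "age")
column_name = dbutils.widgets.get("column_name")

# 2. Build a small DataFrame, using the parameter value as the column name.
df = spark.createDataFrame([(21,), (34,), (45,)], [column_name])

# 3. Write the DataFrame out as a Delta table.
df.write.format("delta").mode("overwrite").saveAsTable("demo_etl_output")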

Create a Data Factory Pipeline

Now we are ready to create a Data Factory pipeline to call the Databricks notebook. Open Data Factory again and click the pencil on the navigation bar to author pipelines.

Shows the Pencil in the nav bar of Data Factory to take you to the author screen where we can build a Data Factory pipeline.

Click the ellipsis next to the Pipelines category and click 'New Pipeline'.

Shows the create new pipeline button for creating a new pipeline.

Name the pipeline according to a standard naming convention.

Shows the name field to fill out when creating a new pipeline.

Next, add a Databricks activity to the pipeline. Under 'Activities', expand 'Databricks', then click and drag 'Notebook' onto the pipeline canvas. Name the activity.

Shows how to drag a Databricks Notebook activity into your pipeline.

Navigate to the 'Azure Databricks' tab, and select the Databricks linked service you created earlier.

Shows how to select the Databricks linked service on the Execute Databricks Notebook task.

Move to the settings tab. Click 'Browse' next to the 'Notebook path' field and navigate to the notebook you added to Databricks earlier. Select it.

Shows how to select the proper Databricks notebook in the Databricks Notebook task.

In order to pass parameters to the Databricks notebook, we will add a new 'Base parameter'. Make sure the 'NAME' exactly matches the name of the widget in the Databricks notebook, which you can see below. Here, we are passing in a hardcoded value of 'age' to name the column in the notebook 'age'. However, you can also pass dynamic content to the Databricks notebook, such as Data Factory variables, parameters, iterators, etc.

Shows how to add parameters to pass into the Databricks notebook from Data Factory.

Now, we are ready to test the pipeline. Click 'Debug' in Data Factory, and the notebook should be executed.

Shows the Debug button to kick off the Data Factory job.

If the job succeeds, your screen will look like this!

Shows the successful completion of the execution of the Data Factory pipeline.
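Debug runs are ideal while developing; for production you would publish the pipeline and attach a trigger. If you also want to start and monitor runs from code, for example from another orchestration process, the following sketch uses the azure-mgmt-datafactory SDK. The resource names are placeholders, and the parameters argument only applies if you promote the hardcoded 'age' value to a pipeline parameter.

# Sketch: trigger a published pipeline run and poll it until completion
# (placeholder resource names; assumes the pipeline has been published).
import time

from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

run = adf_client.pipelines.create_run(
    "<resource-group>", "<data-factory-name>", "<pipeline-name>",
    # Only used if the notebook's base parameter is bound to a pipeline
    # parameter instead of the hardcoded value shown above.
    parameters={"column_name": "age"},
)

status = "InProgress"
while status in ("Queued", "InProgress"):
    time.sleep(15)  # poll every 15 seconds
    status = adf_client.pipeline_runs.get(
        "<resource-group>", "<data-factory-name>", run.run_id
    ).status

print(f"Pipeline run finished with status: {status}")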

Debugging

Data Factory offers a number of different ways to debug your notebook in case of failure, or to see how variables and logic were computed at runtime.

One option is to pass information back to Data Factory from the Databricks notebook. For example, if you want to keep track of row counts inserted during an ETL job, you can pass that row count value back from Databricks into Data Factory and then have Data Factory log it. That is what we do in this sample notebook.
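As a sketch of how that hand-off works (the attached notebook may name things differently), the final cell of the notebook returns a string with dbutils.notebook.exit(), and Data Factory surfaces that string as the activity's run output:

# Final notebook cell (illustrative): return the row count to Data Factory.
import json

row_count = df.count()  # df is the DataFrame written earlier in the notebook

# The string passed to exit() becomes the notebook activity's runOutput
# in Data Factory.
dbutils.notebook.exit(json.dumps({"row_count": row_count}))

Downstream activities or logging steps in the pipeline can then reference the returned value with an expression along the lines of @activity('<notebook activity name>').output.runOutput.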

By clicking the highlighted button in the output section of the page, you can see that we dynamically passed the row count from the ETL job back to the Data Factory.

Shows how to click the output button to see the outputs that came from the Databricks job.
Shows the output of the Databricks notebook.

Even more critical is the ability to see an ephemeral version of the notebook for each execution of the pipeline. By clicking the eyeglasses icon in the output section of the page, you can open a link that takes you to the actual execution of the Databricks notebook. There, you can see exactly where the notebook failed, or simply review all of the cell outputs from that job run. Adding print statements to your notebook can be extremely valuable for debugging during the development phase.

Shows the button to click to view the ephemeral version of the notebook.

Click the Run Page URL to open up the ephemeral version of the notebook.

Shows the Databricks ephemeral notebook link in Data Factory.

There you have it! You have successfully executed a Databricks notebook through Data Factory.

Next Steps





About the author
MSSQLTips author Ryan Kennedy Ryan Kennedy is a Solutions Architect for Databricks, specializing in helping clients build modern data platforms in the cloud that drive business results.

This author pledges the content of this article is based on professional experience and not AI generated.




Comments For This Article




Friday, February 2, 2024 - 5:41:38 AM - Vhelical Back To Top (91916)
Discover the seamless synergy between Azure Databricks services Notebooks and Azure Data Factory in this insightful blog. Nice blog

Thursday, December 22, 2022 - 3:40:34 AM - karthik Back To Top (90773)
Hi,
I am trying to execute an Azure Databricks notebook via Azure Data Factory, but my ADF pipeline does not succeed. If I run the Azure Databricks notebook on its own, I can see the result without any issues; if I run it via the ADF pipeline, I get the error below.

** i am using prophet lib in my notebook

ModuleNotFoundError: No module named 'prophet'

ModuleNotFoundError                       Traceback (most recent call last)
in
      6 import pandas as pd
      7 import pyspark.pandas as ps
----> 8 from prophet import Prophet
      9 from pyspark.sql.types import StructType, StructField, StringType, FloatType, TimestampType, DateType, IntegerType
     10

Kindly help me

Tuesday, October 20, 2020 - 9:34:26 PM - Roopa Shastri Back To Top (86666)
Great Post ! Very helpful !

Wednesday, October 7, 2020 - 6:07:55 AM - infosectrain Back To Top (86614)
Nice post!













