
Optimize Azure Fabric Pipelines with This Key Spark Setting


Are your Azure Fabric pipelines with multiple notebooks running slower than you’d like? Are you paying for more Spark compute time than you should be? The culprit might be a simple setting that’s easy to miss. In this blog post, we’ll dive into the “For pipeline running multiple notebooks” setting in Azure Fabric and explain why enabling it can significantly improve your pipeline’s performance and reduce your costs.

The Challenge: Multiple Notebooks, Multiple Spark Sessions

Imagine a common scenario: a data pipeline that needs to get data from multiple sources, process it, and then use that data for a final task, such as generating an email.

A typical pipeline for this might have three notebooks running in parallel to fetch different datasets, plus a fourth notebook that uses this data to generate an email.

Without a specific setting enabled, each time a new notebook activity starts, it spins up a brand-new Spark application. This process of starting a new Spark session takes time. For a pipeline with four notebook activities like the one above, that’s four separate Spark application startup times. These redundant startups add up, creating a performance bottleneck and extending the overall runtime of your pipeline.

Even worse, each new Spark session consumes its own compute resources, meaning you are being charged for the time it takes for each session to initialize and for the resources it uses, even if the other sessions are idle.
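To make that overhead concrete, here is a rough, back-of-the-envelope illustration. The startup time used below is a hypothetical figure, not a measured value; actual startup times depend on your Spark pool configuration.

Assumed session startup time: ~3 minutes per notebook activity (hypothetical)
Notebook activities in the pipeline: 4
Billable startup overhead without session sharing: 4 × 3 = 12 minutes per run
Billable startup overhead with a shared session: 1 × 3 = 3 minutes per run

Under these assumptions, roughly 9 minutes of Spark compute per pipeline run is spent doing nothing but initializing sessions.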

The Solution: Sharing Spark Sessions

This is where the “For pipeline running multiple notebooks” setting comes in. This powerful feature allows multiple notebooks within the same pipeline to share a single Spark application. Instead of each notebook starting its own session, they all connect to and use the same underlying Spark cluster.

To access this setting, you need to go to your workspace’s settings.

  1. Click the Workspace settings icon in the top right of your Azure Fabric workspace.
  2. In the menu on the left, navigate to Data Engineering/Science and then Spark settings.
  3. Under the High concurrency tab, you will find a toggle labeled For pipeline running multiple notebooks. Make sure to switch this toggle On.
  4. In your pipeline, give each notebook activity the same session tag so that they all attach to the shared session.

When this setting is enabled, all notebooks in your pipeline will share a single, long-running Spark application. The Spark application starts once at the beginning of the pipeline run and remains active until the entire pipeline is complete.
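A quick way to sanity-check that the notebooks really are sharing one Spark application is to print the application ID from each notebook and compare the values across a single pipeline run. This is just a diagnostic sketch; the message text is arbitrary, and spark refers to the SparkSession that Fabric notebooks provide by default.

# Add a cell like this to each notebook in the pipeline.
# With session sharing enabled, every notebook in the same run
# should report the same application ID.
app_id = spark.sparkContext.applicationId
print(f"Spark application ID: {app_id}")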

The Benefits: Performance and Cost Gains

Enabling this setting provides two major advantages:

  • Improved Performance: The most immediate benefit is a significant reduction in pipeline runtime. By eliminating the overhead of multiple Spark session startups, your notebooks can execute much more quickly. In a parallel pipeline like our example, the three Get Data notebooks can all begin their work immediately without waiting for their own dedicated Spark cluster to spin up.
  • Reduced Costs: Since your pipeline is only using one Spark application for its entire duration, you’ll be paying for fewer resources and less time. The cost savings can be substantial, especially for complex pipelines that run frequently or have many notebook activities.

In a world where data pipelines are the backbone of modern data platforms, optimizing their performance is critical. By taking a moment to enable the “For pipeline running multiple notebooks” setting, you can ensure your Azure Fabric pipelines are not only faster but also more cost-effective. Don’t let unnecessary Spark session startups slow you down: turn on this setting and watch your pipelines fly!

Comparing This Fabric Setting to AWS Glue

In a previous post, I mentioned comparing Fabric with other cloud platforms. Let’s kick that off by discussing how this may look in AWS.

AWS Glue is a serverless ETL service where a “job” is the primary unit of execution. Each Glue job, whether it’s a Spark ETL job or a Python Shell job, is designed to run in its own, isolated environment. A single Glue job can contain complex logic, including the ability to run multiple scripts or stages, but it’s all within the context of that one job’s execution environment and Spark session.

While you don’t have a toggle for sharing sessions across multiple, distinct Glue jobs in the same way as Azure Fabric, there are ways to achieve similar goals:

  • Job Chaining/Workflows: You can use AWS Glue Workflows or other orchestration tools like AWS Step Functions to chain together multiple Glue jobs. This allows for a sequence of tasks, where the output of one job becomes the input of the next. However, each job in the workflow will still spin up its own, separate Spark environment.
  • Single, Monolithic Job: A more common and efficient approach in Glue is to consolidate your data processing logic into a single Glue job script. Instead of having separate jobs for GetProductData, GetSalesData, and GetCustomerData, you would write a single Python script within one Glue job that performs all three tasks and any subsequent processing. This job would start one Spark session and use it for the entire duration of the script, thus eliminating the overhead of multiple Spark session startups. This approach is the closest conceptual equivalent to what the Fabric setting accomplishes (see the sketch after this list).
  • Glue Streaming Jobs: For streaming data, Glue has continuous streaming jobs. These jobs run continuously and maintain a persistent Spark environment, similar to a long-running session. However, this is for streaming use cases, not batch processing with multiple discrete jobs.
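To illustrate the single, monolithic job approach described above, here is a minimal sketch of what a consolidated Glue job script might look like. The database, table, column, and S3 path names are placeholders for illustration only, and the transformations are deliberately trivial.

import sys
from awsglue.utils import getResolvedOptions
from awsglue.context import GlueContext
from awsglue.job import Job
from pyspark.context import SparkContext

# One Spark session is started here and reused for every step below.
sc = SparkContext.getOrCreate()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
job.init(args["JOB_NAME"], args)

# The three "get data" steps that might otherwise be separate jobs
# (or separate notebooks in Fabric) all run inside the same session.
products = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="products").toDF()
sales = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="sales").toDF()
customers = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="customers").toDF()

# Downstream processing (the equivalent of the "generate email" notebook)
# reuses the same session, so no additional startup cost is incurred.
summary = (sales.join(products, "product_id")
                .join(customers, "customer_id")
                .groupBy("customer_name")
                .count())
summary.write.mode("overwrite").parquet("s3://example-bucket/email-summary/")

job.commit()

The trade-off is that a single large job is harder to monitor and retry at a fine-grained level than several small jobs, which is worth weighing before consolidating everything into one script.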

In summary, while there isn’t a simple toggle, the recommended practice in AWS Glue to achieve similar performance and cost benefits is to structure your code into a single, comprehensive job rather than using multiple, small jobs that would each incur Spark startup overhead. This is a key difference in how these two platforms are designed and how you would optimize your data processing workflows on each.
