Blog Post

How you can save up to 80% on Azure Data Factory pricing

,

… with this one weird little trick 🙂 Sorry for the clickbait, but you can actually save a lot on ADF pricing if you design your pipelines carefully. And as we will see, don’t do everything in ADF. Before we proceed further, I advise to watch this short YouTube video by Cathrine Wilhelmsen on ADF cost. It explains the basics very well and demonstrates the concepts of the pricing with a few examples:

As you might’ve noticed, pricing in ADF is not the same as it was in SSIS for example. In SSIS, you pay your SQL Server license and you’re done (well, and you buy a server to run it on). It doesn’t matter what you do with SSIS, the cost is the same. If you run 1 package or 1000 packages, there’s no difference except in your electricity bill. However, in ADF you pay more if you use it more. You pay for each action you do, you pay for each activity you use and for how long things are running. There are a couple of guidelines you can follow to try to minimize costs:

  • Using a VNET for your integration runtimes is expensive. Really expensive. Cathrine has a good overview in the video where it shows costs are a thousand times higher.
  • Try to use ADF as an ELT tool. Or even just as an orchestrator. Use it to ingest data into your sink (data lake, database, whatever) and then use a cheaper type of compute to do the actual calculations. For example, dump the data in a staging table and then use SQL statements to do the rest. External activities (such as executing SQL statements in a database) are the cheapest type of activity in ADF.
  • I may be biased, but I try to avoid dataflows as well in ADF. I can do pretty much the same things in SQL much easier and especially cheaper. I don’t need to startup a Spark cluster just to add a few columns to my dataset or to do a lookup. If you hate writing code though, dataflows might be just your thing.
  • Minimize the number of activities. You pay for at least one minute for each activity. If you have many pipelines or frequently executed pipelines, having many activities starts adding up.
  • If you really really want to save on costs in ADF, you can run everything on a self-hosted integration runtime. This is the cheapest runtime, but on the other hand means that you need to provision your own server again.

You might’ve noticed the most expensive part in ADF is the Copy Activity. Its cost depends on the following formula:

If you have a metadata-driven Copy Activity inside a For Each loop, things can get expensive fast. For example, I have a pipeline that extracts data from a REST API. I need to extract 100 endpoints, which means I get 100 minutes of copy duration. Using the formula and price points from the video, this means an execution cost of approximately $0.83 per pipeline run. This doesn’t seem much, but this is $25 for a whole month (one execution per day). For a single pipeline. If you have multiple pipelines like this one, you can see why it can get expensive. Or if you increase the number of executions. Running the pipeline every hour is about $600 a month. For a single pipeline. Executing it every 5 minutes is about $7200. Whoops. Do not use ADF as a streaming tool. Luckily there’s still a part of the formula we can influence: the number of used DIUs (Data Integration Units). Cathrine also mentions this in the video. The default setting is Auto, which scales automatically but starts at 4.

However, you can set it manually to 2.

This means you can already save 50% on your Copy Activity cost! Obviously, this affects performance and you should only do this when working with smaller data sets. But if you just have to load some smaller files or some REST API data, definitely set it to 2.

How can we cut costs even more? By avoiding the Copy Activity altogether. In my use case of the REST API, I wrote an Azure Function that extracts data from the API and dumps it into a staging table using SqlBulkCopy. This means an external activity is used, which is much cheaper. By using the Azure Function, I reduced the price of my pipeline to $0.05 per month (for a daily run). Even if I would run the pipeline hourly, it would cost me a mere $1.09 in ADF. You have to pay for the Azure Function as well of course, but if you use a consumption plan the first million (!) executions and 400,000 GB/s are free (per subscription). So it’s possible you’re staying in the free part of the offering, which means $0 cost.

Even if you go over the free offering, the costs are cheaper than ADF. Keep in mind though, the development effort is higher when using Azure Functions because you’ll have to write the code yourself, while the Copy Activity is configured in a couple of minutes.

The post How you can save up to 80% on Azure Data Factory pricing first appeared on Under the kover of business intelligence.

Original post (opens in new tab)
View comments in original post (opens in new tab)

Rate

5 (1)

Share

Share

Rate

5 (1)