Last month, Microsoft introduced a new Delete activity in Azure Data Factory (ADF) that supports those and many other use-cases: https://docs.microsoft.com/en-us/azure/data-factory/delete-activity.
So, I wanted to explore more and test the following use-cases with this new ADF Delete activity:
a) Remove sourcing files after copying them to a staging zone,
b) Delete files based on their timestamp (historical files management),
c) Delete files from an on-premise system.
Delete file use-case:
(A) Remove sourcing files after copying them to a staging zone
For this use-case, I will reuse the ADF pipeline that I created for my previous blog post, "Developing pipelines in Azure Data Factory using Template gallery".
After the file copy activity, I will add a step that removes my sourcing data files, since I will already have them in my staging blob container.
My expectation is to see my staging container "storesales-staging" with the copied files
and my sourcing blob container "storesales" empty.
To make this happen, I only need to set a blob storage dataset for my Delete activity, with a folder name and an indication that files should be deleted recursively,
and then on that same blob storage dataset specify a file mask (*) so that all the files are removed.
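For readers who prefer to see this as JSON rather than in the designer, here is a minimal sketch in Python that builds and prints what the dataset and the Delete activity might look like. The names `StoreSalesSourceFiles`, `MyBlobStorage`, `DeleteSourceFiles` and `CopyFilesToStaging` are hypothetical, and the property names (`folderPath`, `fileName`, `recursive`) reflect my reading of the Delete activity documentation linked above, so please compare them with the JSON view of your own factory.

```python
import json

# Dataset on the sourcing container; "*" is the file mask for all files.
# The dataset and linked service names are hypothetical placeholders.
source_dataset = {
    "name": "StoreSalesSourceFiles",
    "properties": {
        "type": "AzureBlob",
        "linkedServiceName": {
            "referenceName": "MyBlobStorage",  # hypothetical
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "folderPath": "storesales",
            "fileName": "*",
        },
    },
}

# Delete activity that runs only after the copy step has succeeded.
delete_activity = {
    "name": "DeleteSourceFiles",  # hypothetical
    "type": "Delete",
    "dependsOn": [
        {
            "activity": "CopyFilesToStaging",  # hypothetical copy activity name
            "dependencyConditions": ["Succeeded"],
        }
    ],
    "typeProperties": {
        "dataset": {
            "referenceName": source_dataset["name"],
            "type": "DatasetReference",
        },
        "recursive": True,  # also delete files in sub-folders
    },
}

print(json.dumps(source_dataset, indent=2))
print(json.dumps(delete_activity, indent=2))
```

Chaining the Delete activity on the "Succeeded" condition of the copy step is a deliberate choice in this sketch: the sourcing files should only disappear once the staging copy is safely in place.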
After running my ADF pipeline with the new Delete activity task, all sourcing files get successfully copied to the staging container and they are gone from the sourcing side as well.
(B) Historical files management
This is a very interesting case: you want to manage your files based on their time-related attributes, i.e. when they were last modified or loaded.
Microsoft introduced time-based filters that you can apply to your Delete activity. They are currently set at the dataset level, on the dataset that references the file repository you need to clean. You can say, for example, that you want to delete files that are 1 month old, 1 year old, or even just 1 hour or 1 minute old. I also find it very helpful that you can use the date functions that exist in Azure Data Factory.
All it takes is adding this time-related condition to my Delete activity dataset, and voilà, my files that are more than 10 seconds old right after being copied are deleted as well! You can adjust this filtering condition to your specific data scenario.
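To make the 10-second condition more concrete, here is a small Python sketch. The first part builds a dataset fragment where the filter is an ADF expression on `modifiedDatetimeEnd`; that property name and the `@addseconds(utcnow(), -10)` expression are my assumptions based on the documentation, not copied from my pipeline. The second part is a hypothetical helper, `older_than`, that only imitates locally what I understand the filter to do (select files last modified before the cutoff); the file names and timestamps are made-up sample data.

```python
import json
from datetime import datetime, timedelta, timezone

# Dataset fragment: files last modified before "now minus 10 seconds" qualify.
time_filter = {
    "folderPath": "storesales",
    "fileName": "*",
    "modifiedDatetimeEnd": {
        "value": "@addseconds(utcnow(), -10)",  # assumed expression
        "type": "Expression",
    },
}
print(json.dumps(time_filter, indent=2))


def older_than(files, now, max_age):
    """Return names of files whose last-modified time is before now - max_age."""
    cutoff = now - max_age
    return [name for name, modified in files.items() if modified < cutoff]


# Made-up sample data: one file is 45 seconds old, the other 3 seconds old.
now = datetime(2019, 3, 1, 12, 0, 0, tzinfo=timezone.utc)
files = {
    "sales_01.csv": now - timedelta(seconds=45),
    "sales_02.csv": now - timedelta(seconds=3),
}
print(older_than(files, now, timedelta(seconds=10)))  # ['sales_01.csv']
```

Switching to a monthly cleanup would only mean changing the expression, for example to `@adddays(utcnow(), -30)`.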
(C) Delete files from an on-premise system
This is one of my favorite test cases: how cool is it to remove files from my local computer's C:\ drive using a cloud-based workflow in Azure Data Factory :-).
To enable this scenario, you will need to install a self-hosted integration runtime in your on-premise environment. Then you need to create a linked service to your on-premise file system in your data factory, which will require:
- Host (the root path of the folder for your Delete activity)
- User name and Password to access your on-premise files (I would recommend saving your password in Azure Key Vault and then referencing that secret by name in your data factory).
The rest is super easy: you just need to reference a dataset built on your newly created file system linked service and specify additional criteria for the files to be deleted. In my case, I'm deleting all .txt files from the C:\Temp\test_delete folder on my local computer.
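The pieces described above could be sketched as follows, again in Python that simply builds and prints the JSON. Everything named here (`OnPremFileSystem`, `MyKeyVault`, `MySelfHostedIR`, `OnPremTxtFiles`) is hypothetical, the user name and secret name are left as placeholders, and the way I split the path between `host` and `folderPath` is only one possible assumption. The `FileServer` and `FileShare` type names and the Key Vault reference shape follow my understanding of the file system connector documentation, so treat this as an illustration and not as the exact definitions from my repository.

```python
import json

# Linked service to the on-premise file system, reached through the
# integration runtime; the password is a reference to a Key Vault secret.
file_system_linked_service = {
    "name": "OnPremFileSystem",  # hypothetical
    "properties": {
        "type": "FileServer",
        "typeProperties": {
            "host": "C:\\Temp",  # assumed root path
            "userId": "<user name>",
            "password": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "MyKeyVault",  # hypothetical
                    "type": "LinkedServiceReference",
                },
                "secretName": "<secret name>",
            },
        },
        "connectVia": {
            "referenceName": "MySelfHostedIR",  # hypothetical
            "type": "IntegrationRuntimeReference",
        },
    },
}

# Dataset limiting the Delete activity to .txt files in the test folder.
txt_files_dataset = {
    "name": "OnPremTxtFiles",  # hypothetical
    "properties": {
        "type": "FileShare",
        "linkedServiceName": {
            "referenceName": file_system_linked_service["name"],
            "type": "LinkedServiceReference",
        },
        "typeProperties": {
            "folderPath": "test_delete",
            "fileName": "*.txt",
        },
    },
}

print(json.dumps(file_system_linked_service, indent=2))
print(json.dumps(txt_files_dataset, indent=2))
```

Keeping the password as a Key Vault reference means the secret itself never appears in the data factory JSON or in source control.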
Now I can have peace of mind knowing that a file cleanup job can be done in Azure Data Factory as well! 🙂
You can find all the code for these 3 data factory pipeline use-cases in my personal GitHub repository: https://github.com/NrgFly/Azure-DataFactory