Initially, I was confused by its name, thinking that it would help me validate my incoming sourcing data files against the expected data types of the columns within those files. However, the actual purpose of the Validation activity is to give you better control over your ADF pipeline execution; in a way, it works as traffic control for your ADF course of actions.
To test out this Validation activity, I created a simple ADF pipeline that copies all the CSV files from my Azure Data Lake Storage (ADLS) folder (storesales), which has 3 sub-folders, to another ADLS folder (storesales-staging).
Along with copying files from the “storesales” folder to the “storesales-staging” folder, I included an Azure Databricks job to run some calculations, another task to send an email notification when my pipeline is completed, and an additional step to clean (delete) files from the sourcing ADLS folder after all the files are copied to the staging destination.
All seems to be nice and stable. However, if no sourcing files are available, then nothing will be copied to the staging ADLS folder when I execute my pipeline. File triggers could help here by kicking off the pipeline only when new files arrive in the sourcing ADLS folder.
But what if I have multiple data connectors that can’t share a single ADF pipeline file trigger? To add more complexity, I may require that my Databricks notebook starts only when a certain file is available in a specific location. And to be super paranoid, I want to make sure that after my Delete files activity, the sourcing folder is really empty (and nobody secretly populated the sourcing folder with files after this).
The Validation activity task has the following list of attributes:
dataset – Activity will block execution until it has validated this dataset reference exists and that it meets the specified criteria or timeout has been reached.
timeout – Specifies the timeout for the activity to run.
sleep – A delay in seconds between validation attempts.
childItems – Checks if the folder has child items.
minimumSize – Minimum size of a file in bytes.
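To make these attributes more concrete, here is a minimal Python sketch that assembles the JSON definition of a Validation activity, similar to what you would see in the pipeline's code view. The helper function, the activity name “ValidateSourceFolder” and the dataset name “StoreSalesFolder” are hypothetical names I made up for illustration; the property layout and the d.hh:mm:ss timeout format are my assumptions about the activity's JSON, so compare them with the code view of your own pipeline.

```python
import json


def validation_activity(name, dataset_name, child_items=None,
                        minimum_size=None, timeout="7.00:00:00", sleep=10):
    """Build a dict that mirrors the JSON of a Validation activity (hypothetical helper)."""
    type_properties = {
        # Reference to the dataset (folder or file) that is being validated
        "dataset": {"referenceName": dataset_name, "type": "DatasetReference"},
        # Assumed format: d.hh:mm:ss
        "timeout": timeout,
        # Delay in seconds between validation attempts
        "sleep": sleep,
    }
    # childItems is left out entirely when only the existence check is needed
    if child_items is not None:
        type_properties["childItems"] = child_items
    # minimumSize (in bytes) only makes sense for a file dataset
    if minimum_size is not None:
        type_properties["minimumSize"] = minimum_size
    return {"name": name, "type": "Validation", "typeProperties": type_properties}


# Hypothetical example: wait up to 10 minutes, checking every 30 seconds,
# until the sourcing folder exists and has files in it
activity = validation_activity("ValidateSourceFolder", "StoreSalesFolder",
                               child_items=True, timeout="0.00:10:00", sleep=30)
print(json.dumps(activity, indent=2))
```

The printed JSON is only meant for comparison with an activity created in the ADF authoring UI; I would not treat it as a complete pipeline definition.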
You can forget about the timeout and sleep attributes for now; you can adjust them later. I’m just interested in the dataset and childItems attributes. My dataset in this Validation case is a reference to the sourcing folder in ADLS, and childItems… oh, you’d better see them yourself! 🙂
(1) Check if a folder exists only; to cover this use-case, I just need to set my dataset reference to the ADLS data connection and set childItems to “Ignore”. If this condition is not fulfilled, then my ADF pipeline execution is held until the folder appears or the timeout is reached.
(2) Check if a folder exists and has files in it; this use-case is covered by simply setting the childItems attribute to “True”. Again, if files are not present, then the Validation activity holds the execution of my pipeline.
(3) Check if a folder exists and is empty; you can assume that this use-case is configured by the “False” value of the childItems attribute. And you will be right!
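The three use-cases above, together with the timeout and sleep attributes, can be imitated locally with a small polling loop. This is only my own emulation against a local directory, not how ADF implements the activity: the function names are hypothetical, None stands for the “Ignore” setting, and returning False stands in for the activity giving up once the timeout is reached.

```python
import tempfile
import time
from pathlib import Path


def condition_met(folder, child_items=None):
    """Evaluate one validation attempt against a local folder."""
    folder = Path(folder)
    if not folder.is_dir():
        return False                      # the folder must exist in all three use-cases
    if child_items is None:               # (1) "Ignore": existence is enough
        return True
    has_items = any(folder.iterdir())
    # (2) True: folder must have items; (3) False: folder must be empty
    return has_items if child_items else not has_items


def validate(folder, child_items=None, timeout=5.0, sleep=1.0):
    """Hold until the condition is met (True) or the timeout is reached (False)."""
    deadline = time.monotonic() + timeout
    while True:
        if condition_met(folder, child_items):
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(sleep)


with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "storesales"
    source.mkdir()
    results = {
        "exists": validate(source, None, timeout=0),
        "empty": validate(source, False, timeout=0),
        "has_files_before": validate(source, True, timeout=0.2, sleep=0.1),
    }
    (source / "sales.csv").write_text("id,amount\n1,10\n")
    results["has_files_after"] = validate(source, True, timeout=0)

print(results)
```

In this run, the “has files” check gives up on the empty folder and passes only after a file is written, which is the same hold-then-proceed behavior I rely on in the pipeline.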
I’ve also numbered the corresponding Validation activity use-cases in my ADF pipeline. Besides testing it with ADLS folders, I also got it to work with a reference to my Blob storage containers, except for the 3rd use-case: I couldn’t check if a Blob storage based folder was empty, since there are no such things as folders in Azure Blob storage accounts.
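Here is a small Python sketch of why the 3rd use-case can't be satisfied there. It models a container as a flat set of blob names in which a “folder” is nothing more than a shared name prefix; the blob names and helper functions are hypothetical, and this is my simplified model rather than the actual logic of the ADF connector.

```python
def folder_exists(blob_names, folder):
    """A virtual folder 'exists' only while at least one blob name starts with its prefix."""
    prefix = folder.rstrip("/") + "/"
    return any(name.startswith(prefix) for name in blob_names)


def folder_exists_and_is_empty(blob_names, folder):
    """Use-case (3): the folder must exist and contain nothing."""
    prefix = folder.rstrip("/") + "/"
    is_empty = not any(name.startswith(prefix) for name in blob_names)
    return folder_exists(blob_names, folder) and is_empty


# Hypothetical container content before and after the Delete files activity
before_delete = {"storesales/2019/sales.csv", "storesales/2020/sales.csv"}
after_delete = set()

print(folder_exists(before_delete, "storesales"))               # the prefix is in use
print(folder_exists(after_delete, "storesales"))                # the "folder" is gone with its blobs
print(folder_exists_and_is_empty(before_delete, "storesales"))  # not empty
print(folder_exists_and_is_empty(after_delete, "storesales"))   # does not exist
```

In this model, deleting the last blob removes the folder together with it, so “exists and is empty” can never be true at the same time.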
Overall, I really liked this new addition to the set of activity tasks in Azure Data Factory. The Validation activity is a good mechanism for specifying conditions that either pass or hold the execution control flow of my data transformation pipelines. And it really does look like a traffic light system: you just need to wait for the green light to allow traffic to proceed in a specified direction.
Let me know what other use-cases you can come up with for this Validation activity.
You can also find the new ADF pipeline illustrated in this blog post in my GitHub repository: