Azure Data Factory sequential pipeline triggering

Eriks Dombrovskis
3 min read · May 20, 2021

Azure Data Factory is a great platform for solving data-related scenarios, whether you are migrating data from on-premises to the cloud, fetching data from an API, or aggregating logs. You can even apply transformations while the data is on the move.

I have been using Azure Data Factory mostly for recurring (daily, weekly) batch jobs to load different datasets into the DataLake. It was easy: all you need to do is set up a pipeline by specifying the Source and Sink, add the necessary triggers, and you are done:

Simple pipeline for copying data to DataLake

As simple as that was, the requirements for copying datasets to the DataLake changed. Instead of being able to copy all the data for a particular dataset in one copy activity, a limit was set on the amount of data per job. So it was necessary to split the single pipeline run into multiple runs for each dataset.

To achieve this, we needed a trigger that starts another pipeline run sequentially, based on the previous run's success, until all the data for each dataset has been moved. To my disappointment, the only triggers I found were these:

Available trigger types for Azure Data Factory

To overcome this hurdle, we needed two things:

  • a place to store the state of which datasets have finished copying;
  • a wrapper pipeline that executes the main pipeline.

To store the state of which datasets still need to be processed, we used a simple table that tracks whether each dataset has finished processing:

State table keeps track of whether a dataset needs to be processed
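
For illustration, the state table can be as simple as a dataset identifier plus a process_flag column. Here is a minimal sketch using an in-memory SQLite database purely to show the idea; everything except process_flag is an assumption, not our actual schema:

```python
import sqlite3

# Illustrative stand-in for the real state table; column names other than
# process_flag are assumptions made for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dataset_state (
        dataset_name TEXT PRIMARY KEY,
        process_flag INTEGER NOT NULL DEFAULT 1  -- 1 = still needs processing, 0 = done
    )
""")
conn.executemany(
    "INSERT INTO dataset_state (dataset_name, process_flag) VALUES (?, ?)",
    [("sales", 1), ("customers", 1), ("orders", 0)],
)
conn.commit()
```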

The next thing is a wrapper pipeline that tracks the state of this table and executes the dataset copy pipeline until no datasets are left to process. Additionally, we need to add some logic to our copy pipeline so that it updates the state table once processing has finished:

Putting it all together
  • Get DatasetCount is a Lookup activity that queries our state table and counts the number of datasets to process.
  • Set DatasetCount sets a global variable to the number of datasets that have process_flag equal to “1”.
  • Until DatasetCount is 0 loops the copy pipeline until the DatasetCount variable reaches zero.
  • Execute CopyDatasets triggers a run of the copy pipeline.
  • Get DatasetList now takes only datasets that have process_flag set to “1”.
  • If Dataset Finished Processing is your own logic that checks whether the data has finished copying. If it has, an UpdateState activity sets the process_flag in our state table to “0”.
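
Sketched in plain Python, the copy pipeline's part of this boils down to the following. This is only an analogue of the ADF activities, not Data Factory code; copy_one_chunk is a placeholder for the actual Copy activity and its per-job limit:

```python
import sqlite3

# Illustrative state table, as sketched above; the real state lives in a
# database that the pipeline's Lookup activities can query.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dataset_state (dataset_name TEXT PRIMARY KEY, process_flag INTEGER NOT NULL)"
)
conn.executemany("INSERT INTO dataset_state VALUES (?, ?)", [("sales", 1), ("customers", 1)])

def copy_one_chunk(dataset_name: str) -> bool:
    """Placeholder for the Copy activity and its per-job limit; returns True
    once the dataset has been copied completely."""
    return True

def copy_datasets() -> None:
    # Get DatasetList: only datasets that still have process_flag = 1
    rows = conn.execute(
        "SELECT dataset_name FROM dataset_state WHERE process_flag = 1"
    ).fetchall()
    for (name,) in rows:
        # If Dataset Finished Processing -> UpdateState sets process_flag to 0
        if copy_one_chunk(name):
            conn.execute(
                "UPDATE dataset_state SET process_flag = 0 WHERE dataset_name = ?",
                (name,),
            )
    conn.commit()

copy_datasets()
```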

After the copy pipeline finishes, we have to update the global variable:

  • Get DatasetCount gets the count of datasets that need to be processed by querying the state table.
  • Update DatasetCount sets the global variable DatasetCount to the remaining count of datasets to be processed.

The CopyDatasets pipeline is continuously triggered by CopyDatasetsWrapper until the state table has no more datasets left to process.
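
Put as plain Python, the wrapper amounts to a count-then-loop pattern. Again, this is only a sketch of the control flow; the Execute Pipeline call is reduced to a stub that completes one dataset per iteration:

```python
import sqlite3

# Same illustrative state table as in the sketches above.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dataset_state (dataset_name TEXT PRIMARY KEY, process_flag INTEGER NOT NULL)"
)
conn.executemany(
    "INSERT INTO dataset_state VALUES (?, ?)",
    [("sales", 1), ("customers", 1), ("orders", 1)],
)

def get_dataset_count() -> int:
    """Get DatasetCount: how many datasets are still flagged for processing."""
    return conn.execute(
        "SELECT COUNT(*) FROM dataset_state WHERE process_flag = 1"
    ).fetchone()[0]

def execute_copy_datasets() -> None:
    """Stub for the Execute Pipeline activity: pretend one dataset finishes per run."""
    conn.execute(
        "UPDATE dataset_state SET process_flag = 0 WHERE dataset_name = "
        "(SELECT dataset_name FROM dataset_state WHERE process_flag = 1 LIMIT 1)"
    )
    conn.commit()

dataset_count = get_dataset_count()      # Get DatasetCount + Set DatasetCount
while dataset_count > 0:                 # Until DatasetCount is 0
    execute_copy_datasets()              # Execute CopyDatasets
    dataset_count = get_dataset_count()  # Get DatasetCount + Update DatasetCount
```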

Thank you and good luck in building your own data processing solutions!
