Hi Sucharita, many thanks for this extensive and well explained article.
I had a question regarding the strategy with multiple source tables and one target table. What would be you preferred option?
I have this scenario:
- Two delta tables A and B in an Az Data Lake.
- One target table in Synapse Analytics.
- LastModified timestamp in tables A and B.
At the moment, using a data flow since source data is in delta format, I retrieve the MAX LastModified timestamp for table A and table B, and then take the MIN of these two. This is the new watermark column value. I could also get the MAX of the two instead of the MIN, but we may want to reload a failed pipeline and add LastModified timestamps prior to the MAX of the two tables.
The caveat from this design is, you will always grab some data which was already process, and if the update frequency of the two tables (A and B) is completely different, let's say table A is updated daily but table B monthly, then I will reload every day the current month of table A until Table B has a new Last Modified Date.
I look forwards to getting your thoughs on that 🙂