Incremental Data Loading using Azure Data Factory

  • Comments posted to this topic are about the item Incremental Data Loading using Azure Data Factory

  • Hi Sucharita,

    Thank you for this article!

    I have some comments though:

    • This solution will work if the source data contains a WaterMark Column
    • Step 11: Add Parameters: These parameter values can be modified to load data from a different source table to a different sink table.

      • But won't the two stored procedures fail then?

    • Step 21: Check Data in Azure SQL Database

      • What if the source data has been changed during the Extraction process?

    • What next?

      • Having source data available in the Landing Zone is the first step
      • Then the source data is copied into another type of Corporate / Enterprise (transaction) data model depending on the implemented Data Warehouse methodology
      • And then the data in the Corporate / Enterprise (transaction) data model is copied into a Corporate / Enterprise information data model (the BUS Matrix of Dimensional Modeling) consisting of Facts and Dimensions

    Best regards,

    René

  • Thank you for your feedback.

    My responses to your questions/remarks:

    1. This solution will work if the source data contains a WaterMark Column -- yes.
    2. Step 11: Add Parameters: These parameter values can be modified to load data from a different source table to a different sink table. But then the two stored procedures will fail? -- dbo.usp_upsert_Student should be replaced with the relevant procedure for the new sink table; otherwise, most of the code is parameterized, including the second stored procedure (see the sketch at the end of this reply).
    3. Step 21: Check Data in Azure SQL Database. What if the source data has been changed during the Extraction process? -- For any data movement operation, we assume that the source data remains unchanged until the validation is done. If the data changes in the meantime, this validation is not possible for that particular iteration.
    4. What next? Having source data available in the Landing Zone is the first step

      Then the source data is copied into another type of Corporate / Enterprise (transaction) data model depending on the implemented Data Warehouse methodology

      And then the data in the Corporate / Enterprise (transaction) data model is copied into a Corporate / Enterprise information data model (the BUS Matrix of Dimensional Modeling) consisting of Facts and Dimensions -- yes. Once the data has been transferred to the destination, many further activities can be performed on it.
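
    To illustrate point 2, below is a minimal sketch of what such a parameterized upsert procedure can look like. The procedure, table type, table, and column names (dbo.usp_upsert_Student, dbo.StudentType, dbo.Student, StudentID, Name, LastModifytime) are placeholders in the spirit of the article's example, not its exact code:

        -- Hypothetical sketch: MERGE-based upsert called from the Copy activity sink.
        -- Replace the table, table type, and column names with the relevant ones.
        CREATE OR ALTER PROCEDURE dbo.usp_upsert_Student
            @Student dbo.StudentType READONLY  -- table-valued parameter supplied by ADF
        AS
        BEGIN
            SET NOCOUNT ON;

            MERGE dbo.Student AS target
            USING @Student AS source
                ON target.StudentID = source.StudentID
            WHEN MATCHED THEN
                UPDATE SET target.Name           = source.Name,
                           target.LastModifytime = source.LastModifytime
            WHEN NOT MATCHED THEN
                INSERT (StudentID, Name, LastModifytime)
                VALUES (source.StudentID, source.Name, source.LastModifytime);
        END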

  • Hi Sucharita Das, thanks for the article, I really appreciate it. I have a few questions, listed below:

    1. What if we also want delete operations from the source to be replicated in the destination? What needs to be changed in the existing stored procedure to achieve that?
    2. Given that this pipeline works for a single table, what do we need to do to build a ForEach loop that loops through a set of tables and runs the same pipeline for each one?

    I appreciate your response here, thanks again!

  • Hi Sucharita Das,

    I have a question: how can we do the same thing using an incrementing key instead of a timestamp column? Also, is it possible to handle delete operations?

    Thanks.

  • Thank you for your feedback.

    Please refer to the article https://www.sqlservercentral.com/articles/incremental-data-loading-through-adf-using-change-tracking.

    Let me know if you have any more questions on this.
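
    For the incrementing-key part of your question, the watermark pattern from this article stays the same; only the comparison column changes. A rough illustration follows (all table, column, and watermark-table names are placeholders, not taken from either article):

        -- Lookup activity: read the last key value that was loaded.
        SELECT WatermarkValue
        FROM   dbo.WatermarkTable
        WHERE  TableName = 'Student';

        -- Copy activity source query: pull only the rows above the stored key.
        -- In ADF the old watermark is typically injected as dynamic content;
        -- it is shown here as a T-SQL variable for readability.
        SELECT *
        FROM   dbo.Student
        WHERE  StudentID > @OldWatermarkValue;

        -- After the copy succeeds, advance the watermark to the new maximum key.
        UPDATE dbo.WatermarkTable
        SET    WatermarkValue = (SELECT MAX(StudentID) FROM dbo.Student)
        WHERE  TableName = 'Student';

    A watermark comparison alone cannot detect deleted rows, which is why the change tracking approach in the linked article is the better fit for that part of your question.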

  • Hi Sucharita, many thanks for this extensive and well-explained article.

    I had a question regarding the strategy with multiple source tables and one target table. What would be your preferred option?

    I have this scenario:

    - Two delta tables A and B in an Az Data Lake.

    - One target table in Synapse Analytics.

    - LastModified timestamp in tables A and B.

    At the moment, using a data flow since the source data is in delta format, I retrieve the MAX LastModified timestamp for table A and for table B, and then take the MIN of these two. This becomes the new watermark value (roughly as sketched below). I could also take the MAX of the two instead of the MIN, but we may want to rerun a failed pipeline and still pick up rows with LastModified timestamps prior to the MAX of the two tables.

    The caveat of this design is that you will always grab some data which was already processed, and if the update frequencies of the two tables (A and B) are completely different, say table A is updated daily but table B monthly, then I will reload the current month of table A every day until table B has a new LastModified date.
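
    Expressed in SQL terms, the watermark calculation I described above is roughly the following (TableA and TableB are placeholders for the two delta tables):

        -- Take MAX(LastModified) from each source table,
        -- then keep the smaller of the two as the new watermark.
        WITH per_table AS (
            SELECT MAX(LastModified) AS MaxModified FROM TableA
            UNION ALL
            SELECT MAX(LastModified) AS MaxModified FROM TableB
        )
        SELECT MIN(MaxModified) AS NewWatermarkValue
        FROM   per_table;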

    I look forward to getting your thoughts on that 🙂

    Best regards,

    Paul

    Paul Hernández

  • So, how do you scale this solution?

    Thanks,

    Chris

    Learning something new on every visit to SSC. Hoping to pass it on to someone else.
