Incremental Data Loading using Azure Data Factory

  • Comments posted to this topic are about the item Incremental Data Loading using Azure Data Factory

  • Hi Sucharita,

    Thank you for this article!

    I have some comments though:

    • This solution will work if the source data contains a WaterMark Column
    • Step 11: Add Parameters: These parameter values can be modified to load data from a different source table to a different sink table.

      • But won't the two stored procedures fail then?

    • Step 21: Check Data in Azure SQL Database

      • What if the source data has been changed during the Extraction process?

    • What next?

      • Having source data available in the Landing Zone is the first step
      • Then the source data is copied into another type of Corporate / Enterprise (transaction) data model depending on the implemented Data Warehouse methodology
      • And then the data in the Corporate / Enterprise (transaction) data model is copied into a Corporate / Enterprise information data model (the BUS Matrix of Dimensional Modeling) consisting of Facts and Dimensions

    Best regards,

    René

  • Thank you for your feedback.

    My responses to your questions/remarks:

    1. This solution will work if the source data contains a WaterMark Column -- yes.
    2. Step 11: Add Parameters: These parameter values can be modified to load data from a different source table to a different sink table. But then the two stored procedures will fail? -- dbo.usp_upsert_Student should be replaced with the relevant procedure for the new sink table; otherwise, most of the code is parameterized, including the second stored procedure (see the sketch at the end of this reply).
    3. Step 21: Check Data in Azure SQL Database. What if the source data has been changed during the Extraction process? -- For any data movement operation, we assume that the source data remains unchanged until the validation is done. If the data changes in the meantime, this validation is not possible for that particular iteration.
    4. What next? Having source data available in the Landing Zone is the first step

      Then the source data is copied into another type of Corporate / Enterprise (transaction) data model depending on the implemented Data Warehouse methodology

      And then the data in the Corporate / Enterprise (transaction) data model is copied into a Corporate / Enterprise information data model (the BUS Matrix of Dimensional Modeling) consisting of Facts and Dimensions -- yes. Once the data has been transferred to the destination, many further activities can be performed on it.
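
    To illustrate point 2, below is a minimal sketch of what such a parameterized upsert procedure can look like. The procedure, table type, table, and column names (dbo.usp_upsert_Student, dbo.StudentType, dbo.Student, StudentID, Name, LastModifytime) are placeholders in the spirit of the article's example, not its exact code:

        -- Hypothetical sketch: MERGE-based upsert called from the Copy activity sink.
        -- Replace the table, table type, and column names with the relevant ones.
        CREATE OR ALTER PROCEDURE dbo.usp_upsert_Student
            @Student dbo.StudentType READONLY  -- table-valued parameter supplied by ADF
        AS
        BEGIN
            SET NOCOUNT ON;

            MERGE dbo.Student AS target
            USING @Student AS source
                ON target.StudentID = source.StudentID
            WHEN MATCHED THEN
                UPDATE SET target.Name           = source.Name,
                           target.LastModifytime = source.LastModifytime
            WHEN NOT MATCHED THEN
                INSERT (StudentID, Name, LastModifytime)
                VALUES (source.StudentID, source.Name, source.LastModifytime);
        END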

  • Hi Sucharita Das, thanks for the article, I really appreciate it. I have a few questions, listed below:

    1. What if we also want delete operations from the source to be replicated in the destination? What needs to be changed in the existing stored procedure to achieve that?
    2. Given that this pipeline works for a single table, what do we need to do to build a ForEach loop that loops through a set of tables and runs the same pipeline for each one?

    I appreciate your response here, thanks again!

  • Hi Sucharita Das,

    I have a question: how can we do the same thing using an incrementing key instead of a timestamp column? Also, is it possible to handle delete operations?

    Thanks.

  • Thank you for your feedback.

    Please refer to the article https://www.sqlservercentral.com/articles/incremental-data-loading-through-adf-using-change-tracking.

    Let me know if you have any more questions on this.
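
    For the incrementing-key part of your question, the watermark pattern from this article stays the same; only the comparison column changes. A rough illustration follows (all table, column, and watermark-table names are placeholders, not taken from either article):

        -- Lookup activity: read the last key value that was loaded.
        SELECT WatermarkValue
        FROM   dbo.WatermarkTable
        WHERE  TableName = 'Student';

        -- Copy activity source query: pull only the rows above the stored key.
        -- In ADF the old watermark is typically injected as dynamic content;
        -- it is shown here as a T-SQL variable for readability.
        SELECT *
        FROM   dbo.Student
        WHERE  StudentID > @OldWatermarkValue;

        -- After the copy succeeds, advance the watermark to the new maximum key.
        UPDATE dbo.WatermarkTable
        SET    WatermarkValue = (SELECT MAX(StudentID) FROM dbo.Student)
        WHERE  TableName = 'Student';

    A watermark comparison alone cannot detect deleted rows, which is why the change tracking approach in the linked article is the better fit for that part of your question.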

  • Hi Sucharita, many thanks for this extensive and well-explained article.

    I had a question regarding the strategy with multiple source tables and one target table. What would be your preferred option?

    I have this scenario:

    - Two delta tables A and B in an Az Data Lake.

    - One target table in Synapse Analytics.

    - LastModified timestamp in tables A and B.

    At the moment, using a data flow since the source data is in delta format, I retrieve the MAX LastModified timestamp for table A and for table B, and then take the MIN of these two. This becomes the new watermark value (roughly as sketched below). I could also take the MAX of the two instead of the MIN, but we may want to rerun a failed pipeline and still pick up rows with LastModified timestamps prior to the MAX of the two tables.

    The caveat of this design is that you will always grab some data which was already processed, and if the update frequencies of the two tables (A and B) are completely different, say table A is updated daily but table B monthly, then I will reload the current month of table A every day until table B has a new LastModified date.
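
    Expressed in SQL terms, the watermark calculation I described above is roughly the following (TableA and TableB are placeholders for the two delta tables):

        -- Take MAX(LastModified) from each source table,
        -- then keep the smaller of the two as the new watermark.
        WITH per_table AS (
            SELECT MAX(LastModified) AS MaxModified FROM TableA
            UNION ALL
            SELECT MAX(LastModified) AS MaxModified FROM TableB
        )
        SELECT MIN(MaxModified) AS NewWatermarkValue
        FROM   per_table;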

    I look forward to getting your thoughts on that 🙂

    Best regards,

    Paul

    Paul Hernández

  • So, how do you scale this solution?

    Thanks,

    Chris

    Learning something new on every visit to SSC. Hoping to pass it on to someone else.
