Incremental load issue

  • Hey guys,

    I'm in the process of trying to figure out if there is any way I can reliably perform an incremental ETL where the source is an Ingres database.
    Unfortunately I have the following issues when trying to do this:

    1. There is no reliable time stamp for modifications etc. on source tables
    2. Row identifiers are subject to change by Ingres maintenance processes etc.
    3. If there is a PK for a given table, uniqueness is not enforced and we've experienced data issues in the past that don't provide us with much confidence that we can use these to uniquely identify rows

    These three issues eliminate the approaches I can think of for an incremental load (modification timestamps, HASHBYTES comparisons, etc.), so I'm wondering if anyone has run into these kinds of limitations and how (if at all) they were able to get around them.
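
    For illustration, this is roughly the kind of HASHBYTES comparison I mean (the table and column names below are made up); it only works if some key reliably identifies the same row in both the staged Ingres extract and the target, which is exactly the guarantee we don't have:

        -- Hypothetical hash comparison between a staged Ingres extract and the
        -- warehouse copy. It depends entirely on OrderId identifying the same
        -- row on both sides.
        SELECT s.OrderId
        FROM   staging.Orders AS s
        JOIN   dw.Orders      AS d
               ON d.OrderId = s.OrderId
        WHERE  HASHBYTES('SHA2_256',
                   CONCAT(s.CustomerId, '|', s.OrderDate, '|', s.Amount, '|', s.Status))
            <> HASHBYTES('SHA2_256',
                   CONCAT(d.CustomerId, '|', d.OrderDate, '|', d.Amount, '|', d.Status));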

    Chris

  • So, setting aside row-level uniqueness, which you can fix later once the data is loaded into the target system, is there any other way to identify the records that have been modified? That's really what the answer hinges on, as I see it.

    When I face a set of data sources that change but give no indication of that change, there is one approach I normally take. In my mind, the best approach will always be to pull not only the last 24 hours of data, but a rolling window of X days along with those 24 hours. If, for example, you know that changes only ever happen within 6 days of the last 24 hours, then every day you always pull the last 7 days of data, regardless of whether anything changed or not. Then you apply the changes accordingly and discard the rest.
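
    As a rough sketch of what that daily load could look like (the table names and the 7-day window below are just placeholders):

        -- Re-pull the last 7 days from the staged Ingres extract every run and
        -- upsert it into the target, so late changes inside the window are
        -- picked up even though the source gives no change indicator.
        DECLARE @WindowStart date = DATEADD(DAY, -7, CAST(GETDATE() AS date));

        MERGE dw.Orders AS tgt
        USING (
            SELECT OrderId, CustomerId, OrderDate, Amount, Status
            FROM   staging.Orders
            WHERE  OrderDate >= @WindowStart        -- the rolling X-day window
        ) AS src
            ON tgt.OrderId = src.OrderId
        WHEN MATCHED AND (tgt.Amount <> src.Amount OR tgt.Status <> src.Status) THEN
            UPDATE SET tgt.Amount = src.Amount,
                       tgt.Status = src.Status
        WHEN NOT MATCHED BY TARGET THEN
            INSERT (OrderId, CustomerId, OrderDate, Amount, Status)
            VALUES (src.OrderId, src.CustomerId, src.OrderDate, src.Amount, src.Status);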

    Thus, maybe start by trying to find out when change happens. Does it happen to records that are only 24 hours old, 3 days old, 7, 14, 30, 90, etc.? If you find that most change happens within the last 14 days, then see whether it's possible to pull the last 14 days of data every day. If that volume is too large, then you may have to find a way to summarize the source data to detect change, and/or start talking to the DBAs of that source system about recording change in their system correctly so you can do your job correctly. Modification metadata needs to exist regardless.
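
    One rough way to profile how far back changes reach, assuming you keep snapshots of two consecutive extracts and have some key to match rows on (every name below is a placeholder):

        -- Diff today's extract against yesterday's copy and bucket the differing
        -- rows by record age, to see how wide the rolling window needs to be.
        SELECT DATEDIFF(DAY, t.OrderDate, CAST(GETDATE() AS date)) AS AgeInDays,
               COUNT(*)                                            AS ChangedRows
        FROM   staging.Orders_Today     AS t
        JOIN   staging.Orders_Yesterday AS y
               ON y.OrderId = t.OrderId
        WHERE  t.Amount <> y.Amount
           OR  t.Status <> y.Status
        GROUP BY DATEDIFF(DAY, t.OrderDate, CAST(GETDATE() AS date))
        ORDER BY AgeInDays;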

    When it comes to primary key enforcement, don't sweat that piece too much. Assume the source data is dirty and full of duplicates anyway. You may have to pull more than you need, but more than you need is better than less than you need. You can deduplicate, or even apply changes to existing records yourself, and enforce those row-level uniqueness constraints as needed when you clean, stage and load the data into a primary table.
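
    Something along these lines for the dedupe step (the names and the tie-break ordering are assumptions; use whatever marks the "best" copy in your data):

        -- Keep one row per business key from the dirty pull before it reaches
        -- the primary table; the clean table carries the unique constraint the
        -- source never enforced.
        WITH ranked AS (
            SELECT OrderId, CustomerId, OrderDate, Amount, Status,
                   ROW_NUMBER() OVER (
                       PARTITION BY OrderId           -- business key, even if unenforced at source
                       ORDER BY     OrderDate DESC    -- arbitrary tie-break: prefer the latest-looking copy
                   ) AS rn
            FROM staging.Orders
        )
        INSERT INTO clean.Orders (OrderId, CustomerId, OrderDate, Amount, Status)
        SELECT OrderId, CustomerId, OrderDate, Amount, Status
        FROM   ranked
        WHERE  rn = 1;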

  • Thank you for your response.
    I'm not sure that would work for us but I'll take some time to digest it and possibly come back with some questions.

  • Chris-475469 - Wednesday, June 7, 2017 4:52 PM

    Thank you for your response.
    I'm not sure that would work for us but I'll take some time to digest it and possibly come back with some questions.

    Why do you say that? Too much data to ingest? Can't do it that way? I mean, I don't see any other option outside of getting the source system to change.

  • The main problem I see is the back-dated changes, which in my experience can go back a couple of years.

    We'd need to know about those, so we'd potentially have to keep processing a lot of data. Fortunately the vendor is currently working on an updated product that uses SQL Server instead, so I'm hoping we can push them to enforce uniqueness through proper PKs etc.

    Just a shame it'll be a while until it's available.

  • For back-dated changes we have an additional system that performs statistical analysis on both sides to detect missed items (count by created date, count by update date, count by status, ...). Those missed items get added to a queue system that is checked by the incremental load.
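
    Roughly like this, as a minimal sketch (all table and column names are illustrative):

        -- Compare per-day counts between the staged source extract and the
        -- warehouse, and queue any day whose counts disagree so the incremental
        -- load re-pulls it.
        INSERT INTO etl.ReloadQueue (BucketDate, SourceCount, TargetCount, QueuedAt)
        SELECT COALESCE(s.OrderDate, d.OrderDate) AS BucketDate,
               ISNULL(s.Cnt, 0)                   AS SourceCount,
               ISNULL(d.Cnt, 0)                   AS TargetCount,
               SYSDATETIME()                      AS QueuedAt
        FROM  (SELECT OrderDate, COUNT(*) AS Cnt FROM staging.Orders GROUP BY OrderDate) AS s
        FULL JOIN
              (SELECT OrderDate, COUNT(*) AS Cnt FROM dw.Orders      GROUP BY OrderDate) AS d
              ON d.OrderDate = s.OrderDate
        WHERE ISNULL(s.Cnt, -1) <> ISNULL(d.Cnt, -1);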
