DWH Transactional/Incremental Extracts

  • Hi,

    My DWH ETL has a Transactional/Incremental Extraction logic, where all new records from OLTP source tables are extracted beyond the last extract's maximum DATETIME column. We do not have upper boundary constraint.

    For instance, my extract condition is of logic:    "Select * from OLTP Source_Table Where TableColumn > max (currently held TableColumn in DWH)"

    I have noticed a few scenarios where certain source records to have been missed because of OLTP's Loading and DW's Extraction timings getting overlapped. Is this "records getting missed" scenario possible, even though I take DATETIME datatype column for comparison?

    Could it be better if the upper bound is used to avoid such a possibility of missing records, such as below...

    "Select * from OLTP Source_Table Where TableColumn > max (currently held TableColumn in DWH) AND TableColumn < TODAY"

    With above extra boundary constraint, DWH will be 1 day behind that of OLTP, but, it helps avoid OLTP Loading and DWH Extraction times not overlap.

    thanks for you thoughts.

  • Sounds like you have either late arriving records in the source system that you're not catching or your extract is not fetching all the records. When it comes to late arriving records, you can fix that by actually pulling more than you need. For example, if you're doing only the last 24 hours, then pulling the last 3 days may address the late arriving records. 

    The only issue with that is you're redoing work. You're pulling more than what you need that you have already processed. However, if there are changes, you can use that to your benefit too. I do this for a number of data sources where changes happen after 24 hour loads.

    Outside of time. What else can you use to ensure you're not missing the records or will allow you to do multiple extracts if there is truly late arriving records in the source system? Can you store the source primary keys and use them as a filter too? Is that scalable if you do so? Does the trade off of storing that metadata, copying it to staging and doing INSERT WHERE NOT EXIST a good tradeoff for data quality and so forth? Those are the things to think about after you figure out if these are truly late arriving records or is your extract just not on par with catching them?

Viewing 2 posts - 1 through 1 (of 1 total)

You must be logged in to reply to this topic. Login to reply