Data Model for a framework to keep track of ETL processes

  • I'm attempting to come up with a small framework or set of SQL tables that can be used to keep track of counts, amounts, and dates between source systems (could be a SQL database, Excel files on an SFTP server, or files in ADLS) and the destination (a SQL database).   Has anyone ever come up with one, or would you have any input?

    To me it seems like some kind of orchestration/control table is needed to house the ETL process name (this seems too general; I feel like I need more exact information), date run, run by, source information, and destination information (I could see source and destination info living in a separate table, like a connections table, with a FK to the orchestration table), plus a log table to house the process name, source count, source amount, destination count, and destination amount (essentially the data we want to ensure balances between source and destination).
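    Something roughly like this is what I'm picturing so far (just a sketch in T-SQL; table and column names are placeholders, nothing is settled):

        CREATE TABLE dbo.Connection (
            ConnectionId      int IDENTITY(1,1) PRIMARY KEY,
            ConnectionName    varchar(100) NOT NULL,
            ConnectionType    varchar(50)  NOT NULL,  -- e.g. SQL, Excel/SFTP, ADLS
            ConnectionDetails varchar(500) NULL       -- server, path, container, etc.
        );

        CREATE TABLE dbo.EtlProcessRun (
            RunId                   int IDENTITY(1,1) PRIMARY KEY,
            ProcessName             varchar(200) NOT NULL,
            SourceConnectionId      int NOT NULL REFERENCES dbo.Connection (ConnectionId),
            DestinationConnectionId int NOT NULL REFERENCES dbo.Connection (ConnectionId),
            RunDate                 datetime2(0) NOT NULL DEFAULT SYSDATETIME(),
            RunBy                   varchar(100) NOT NULL DEFAULT SUSER_SNAME()
        );

        CREATE TABLE dbo.EtlBalanceLog (
            LogId             int IDENTITY(1,1) PRIMARY KEY,
            RunId             int NOT NULL REFERENCES dbo.EtlProcessRun (RunId),
            SourceCount       bigint NULL,
            SourceAmount      decimal(19,4) NULL,
            DestinationCount  bigint NULL,
            DestinationAmount decimal(19,4) NULL
        );

        -- The balance check would then just be comparing the two sides per run:
        SELECT r.ProcessName, r.RunDate, l.SourceCount, l.DestinationCount,
               l.SourceAmount, l.DestinationAmount
        FROM   dbo.EtlBalanceLog AS l
        JOIN   dbo.EtlProcessRun AS r ON r.RunId = l.RunId
        WHERE  l.SourceCount  <> l.DestinationCount
           OR  l.SourceAmount <> l.DestinationAmount;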

    Any input or links to sites with good info will be greatly appreciated.  I have been searching myself, but not finding much useful information.

     

    Thank you

  • It sounds like a pretty big ask for a free forum to me, but my advice is to track whatever information you feel is relevant. There is data captured by SSIS with each execution, and that may be sufficient for your needs. Or you may want a full audit trace. My opinion: tracking source and destination counts and amounts may not be helpful. The reason being, in my environment, if the package isn't writing all of the data, it is failing. If it fails, I immediately know the data won't match, and I don't let it continue if there are errors. If you allow errors where one or more rows MAY be missed, then those numbers may be interesting to you. Or if you have no easy way to repeat an import (i.e. you are only appending data, not merging or doing a truncate/reload), then capturing that may be useful if you design your system for resumability on failure, but I find that trying to design around resumability often comes with a high performance hit.

    For what I do at my company, the SQL Agent job history is good enough for the date/time of a data import/export, and the SSIS package execution history is good enough for row counts and such. I have never had a need or a business case for capturing more than that.
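    For example, something along these lines is usually all I need to pull that back out (assuming the packages are deployed to the SSIS catalog; if I remember right, catalog.execution_data_statistics only gets populated at the Verbose logging level):

        -- When each job step ran and whether it succeeded (SQL Agent history)
        SELECT  j.name      AS job_name,
                h.step_name,
                msdb.dbo.agent_datetime(h.run_date, h.run_time) AS run_datetime,
                h.run_status                        -- 0 = failed, 1 = succeeded
        FROM    msdb.dbo.sysjobhistory AS h
        JOIN    msdb.dbo.sysjobs       AS j ON j.job_id = h.job_id
        ORDER BY run_datetime DESC;

        -- Rows sent down each data flow path (SSIS catalog execution history)
        SELECT  e.package_name,
                e.start_time,
                d.source_component_name,
                d.destination_component_name,
                SUM(d.rows_sent) AS rows_sent
        FROM    SSISDB.catalog.executions                AS e
        JOIN    SSISDB.catalog.execution_data_statistics AS d
                ON d.execution_id = e.execution_id
        GROUP BY e.package_name, e.start_time,
                 d.source_component_name, d.destination_component_name
        ORDER BY e.start_time DESC;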

    Now if you need to capture all of that data, then it is really up to you what you capture. I feel like it is a lot of overhead for little benefit in my environment, but maybe it is useful for yours. As for normalizing the table: if it is all inserted via the ETL load (insert as the first step, update or insert as the last step, or however you want to do it), there is little risk of bad data going in. It can happen, but I'd want my SSIS packages to be minimally impacted by this, so I'd not want to do lookups to make sure the FK values match, and if a new system comes into play, I wouldn't want my SSIS package failing because nobody had added it to the secondary table yet. I'd want it all in a single table so I can do a quick and easy INSERT that should rarely have issues. Issues may arise if two jobs kick off at the same time, as inserts and updates are blocking operations.
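    To sketch what I mean by the single-table approach (everything here is made up, adjust to taste):

        CREATE TABLE dbo.EtlRunLog (
            RunLogId          int IDENTITY(1,1) PRIMARY KEY,
            ProcessName       varchar(200) NOT NULL,
            SourceSystem      varchar(100) NULL,      -- plain text, no FK lookup to fail on
            DestinationSystem varchar(100) NULL,
            StartTime         datetime2(0) NOT NULL DEFAULT SYSDATETIME(),
            EndTime           datetime2(0) NULL,
            SourceCount       bigint NULL,
            DestinationCount  bigint NULL
        );

        -- First step of the package: log that the run started
        INSERT INTO dbo.EtlRunLog (ProcessName, SourceSystem, DestinationSystem)
        VALUES ('LoadSales', 'ADLS', 'WarehouseDB');

        SELECT SCOPE_IDENTITY() AS RunLogId;          -- map this to a package variable

        -- Last step of the package: close the row out with the counts
        UPDATE dbo.EtlRunLog
        SET    EndTime          = SYSDATETIME(),
               SourceCount      = 12345,              -- values captured by the package
               DestinationCount = 12345
        WHERE  RunLogId = 42;                         -- the id returned by the first step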

    But I would seriously look at whether this is needed and whether anyone will ever look at this data. If the data will never be looked at, is it worth the time and effort to build AND modify all of the SSIS packages (or SQL jobs, depending on the approach you want to take to capture the data) to handle it? The data is not likely to be used for any business decisions, so to me it feels more like debugging data, which SHOULD be visible from the run history of an SSIS package OR from Visual Studio during package development.

    Not trying to shoot down the idea; I am just trying to understand the purpose of capturing this data.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

