Loading ODS strategies

  • Hi everyone,

    Just curious about others' approaches to this.  I'm developing a data warehouse in pieces.  The approach I was thinking of taking is to pull data from my numerous source systems unaltered into the Operational Data Store (ODS), then have my scrubbing and data quality processes put the good, summarized data in the DW.

    My question is from a best practice standpoint, does it make sense to load the bad unaltered data into the ODS then have scrubbing and kickbacks for user corrections there?  Or is it a bad practice to put the dirty data in the ODS?

    My thinking is that being able to query and analyze the consolidated data from the different source systems in the ODS, and to write the data quality processes against it, is a plus, but I'm not sure whether that's the "best practice" approach from a data warehousing standpoint.
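    To make the pattern concrete, here is a minimal sketch of the "load unaltered, then scrub and kick back" idea. All table and column names are hypothetical, and an in-memory SQLite database stands in for a real ODS:

    ```python
    import sqlite3

    # In-memory SQLite as a stand-in for the ODS; all names are made up.
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # 1. Land the source data in the ODS unaltered (everything as TEXT).
    cur.execute("CREATE TABLE ods_orders (order_id TEXT, amount TEXT, source TEXT)")
    raw = [("A1", "100.50", "erp"), ("A2", "oops", "crm"), (None, "7.25", "erp")]
    cur.executemany("INSERT INTO ods_orders VALUES (?, ?, ?)", raw)

    # 2. Scrub: rows that pass validation go to the DW table;
    #    failures go to a kickback table for user correction.
    cur.execute("CREATE TABLE dw_orders (order_id TEXT, amount REAL)")
    cur.execute("CREATE TABLE kickback_orders (order_id TEXT, amount TEXT, reason TEXT)")

    for order_id, amount, source in cur.execute("SELECT * FROM ods_orders").fetchall():
        if order_id is None:
            cur.execute("INSERT INTO kickback_orders VALUES (?, ?, ?)",
                        (order_id, amount, "missing order_id"))
            continue
        try:
            cur.execute("INSERT INTO dw_orders VALUES (?, ?)", (order_id, float(amount)))
        except ValueError:
            cur.execute("INSERT INTO kickback_orders VALUES (?, ?, ?)",
                        (order_id, amount, "non-numeric amount"))

    good = cur.execute("SELECT COUNT(*) FROM dw_orders").fetchone()[0]
    bad = cur.execute("SELECT COUNT(*) FROM kickback_orders").fetchone()[0]
    print(good, bad)  # 1 clean row loaded, 2 rows kicked back
    ```

    The key property is that the raw rows stay in `ods_orders` untouched, so the kickback table can always point a user back at exactly what the source system sent.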

    How would you all normally handle that?

    Thanks for your input!


  • My opinion on this is that it falls into the "it depends" bucket that most database tasks do.

    I personally like having the point-in-time source of truth available in case an end user tries to argue that the data is wrong.  You can go back to your staging tables (what you call "bad unaltered data") and prove that the calculations are doing what they should be doing.

    On top of this, if table StagingA is used to populate FinalA, FinalB, and FinalC, having that all in the same database makes the process nicer if you decide to use stored procedures or TSQL to populate some of that.

    It can also be nice if your SSIS package is becoming too complex and you realize that TSQL can handle it easier.  Or if you are more comfortable writing TSQL.

    Now on the other hand, if StagingA is used to populate ONLY FinalA, it may make more sense to pull the data into memory, manipulate it into the "good" data, and populate the final table all in one package.

    That is why I think it falls under "it depends".  The process we use at work is to copy the data into staging tables first, then do the transforms on the staging tables (SSIS, TSQL, or a stored procedure, depending on the developer and which approach made the most sense at the time), and then populate the final tables.  The reason we populate staging first every time is that it might be only one package that uses that staging table today, but will that trend continue in 6 months?  1 year?  5 years?  That way we have a copy of the data sitting in the warehouse as a staging table, and future packages can pull from the staging table rather than the main table.
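    The staging-first flow above can be sketched in a few lines. This is only an illustration with hypothetical table names (StagingA feeding two final tables), again using SQLite in Python as a stand-in for set-based TSQL transforms:

    ```python
    import sqlite3

    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Step 1: copy the source data into a staging table untouched.
    cur.execute("CREATE TABLE StagingA (customer TEXT, region TEXT, amount REAL)")
    cur.executemany("INSERT INTO StagingA VALUES (?, ?, ?)",
                    [("acme", "east", 10.0), ("acme", "west", 5.0), ("globex", "east", 2.5)])

    # Step 2: set-based transforms from the one staging table into
    # multiple final tables, the way TSQL would do it.
    cur.execute("""CREATE TABLE FinalA AS
                   SELECT customer, SUM(amount) AS total
                   FROM StagingA GROUP BY customer""")
    cur.execute("""CREATE TABLE FinalB AS
                   SELECT region, SUM(amount) AS total
                   FROM StagingA GROUP BY region""")

    rows = cur.execute("SELECT customer, total FROM FinalA ORDER BY customer").fetchall()
    print(rows)  # [('acme', 15.0), ('globex', 2.5)]
    ```

    Because StagingA persists as its own table, a future package that needs a FinalC can read from it directly instead of going back to the source system.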

    I imagine some of the data warehousing experts out there may have different opinions on this, but this is the approach we took and are happy with it.  We like having a "point in time source of truth" in case end users try to argue that a report is incorrect because they changed live data and the report didn't reflect this.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.
