Design Input for ETL of OLTP Database

  • Hello All - 

    I wasn't sure exactly where my question fit, so I thought I'd put it here first. I just started a new contract position and am inheriting an existing design. Here's the project:

    My company wants to ETL/stage an entire OLTP database of 600 tables. The database is currently owned by a vendor, and they have an application which sits on top of it. The database resides on the vendor's server, and they have given us access to do whatever we'd like with it. We do currently have a data warehouse, but it is going to be rebuilt; for the time being, the project manager wants to replicate the database on our side so that we can generate reports.

    If I had been here sooner, I might have suggested a different approach, but since I wasn't, this is what I have to work with.

    I'd like to keep the key constraints because reports will be built on top of this database. However, designing an ETL process for 600 tables is proving to be rather interesting 🙂 I'm wondering whether dropping all constraints, loading the tables, and then rebuilding the constraints would be the best way to go. We are simply storing the data; we will not be altering it once we receive it.
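
    For what it's worth, here is a minimal sketch of the disable/load/re-enable pattern, assuming SQL Server (the thread mentions SSIS); `sp_MSforeachtable` is undocumented but widely used, so treat it as illustrative rather than the only way:

    ```sql
    -- Disable every foreign key and check constraint before the bulk load,
    -- so the 600 tables can be loaded in any order.
    EXEC sp_MSforeachtable 'ALTER TABLE ? NOCHECK CONSTRAINT ALL';

    -- ... bulk load all tables here ...

    -- Re-enable and re-validate. WITH CHECK forces existing rows to be checked,
    -- which both verifies the load and marks the constraints as trusted so the
    -- optimizer can rely on them.
    EXEC sp_MSforeachtable 'ALTER TABLE ? WITH CHECK CHECK CONSTRAINT ALL';
    ```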

    I've also asked whether we can get a file to restore the entire database on our end, but the PM wasn't too keen on that. I'm still trying to make a case.

    Thoughts?

    So, essentially, you are just replicating their database, and that's it. That's not really a data warehouse at that point. At least, not by my standards.

    The objective of a data warehouse is to take many data sources and tables from source systems and conform/consolidate/process them into one unified model for some purpose, like reporting. That means not only absorbing the source keys, but generating your own keys that your data warehouse fully issues and manages across all data sources in your system. Uniformity is key here; when you get into the habit of treating every data source as if it's completely separate or siloed, you end up with nothing but a container of replicas that have no relation to one another.
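
    A hedged sketch of what warehouse-issued keys look like; the `DimCustomer` table and its columns are hypothetical, purely to illustrate the idea:

    ```sql
    -- The warehouse issues and manages its own key (CustomerKey); the source
    -- system's key is kept only as lineage, so multiple sources can conform
    -- into the same dimension.
    CREATE TABLE dbo.DimCustomer (
        CustomerKey      INT IDENTITY(1,1) NOT NULL PRIMARY KEY, -- warehouse-issued surrogate key
        SourceSystem     VARCHAR(20)   NOT NULL,                 -- which source system the row came from
        SourceCustomerId VARCHAR(50)   NOT NULL,                 -- natural key from that source
        CustomerName     NVARCHAR(200) NOT NULL,
        CONSTRAINT UQ_DimCustomer_Source UNIQUE (SourceSystem, SourceCustomerId)
    );
    ```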

    That's just me, though. I would try hard not to carry 600 tables, but instead consolidate down to a model that supports the business needs while keeping the ability to add more data sets and scale as the business grows.

    That's correct, just simply replicating the vendor's data. I'm thinking that a restore would be the best, most direct way, but I'm getting resistance. I would love to transform some of this data into more of a DW structure; however, I wasn't here for the beginning of this project, and the PM isn't willing to go backward and re-examine. Further, it will take time for me to get up to speed on the application and what it does before I can make recommendations, and the PM just wants to get moving.

    Why are you trying to keep the constraints? If you know they're enforced in the source data and you aren't modifying the data in the DW, there is no reason to keep them.

    If you know they're enforced in the source data and you aren't modifying the data in the DW, there is no reason to keep them.

    There are lots of reasons to keep them, or at least to re-establish them on completion of the ETL process. They act as a check to ensure the process has run correctly. They also allow those tables to be queried more efficiently, as the optimizer knows the relationships between the tables. All too many DBAs forgo the relational integrity constraints because "the ETL takes care of it," only to have it bite them at some point in the future.
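
    As a concrete check, after re-enabling the constraints you can confirm that none were left disabled or untrusted; a sketch against SQL Server's catalog views (an untrusted foreign key is one the optimizer cannot rely on):

    ```sql
    -- Any rows returned here mean a constraint was re-enabled without WITH CHECK
    -- (untrusted) or was never re-enabled at all (disabled).
    SELECT OBJECT_SCHEMA_NAME(parent_object_id) AS schema_name,
           OBJECT_NAME(parent_object_id)        AS table_name,
           name                                 AS constraint_name
    FROM   sys.foreign_keys
    WHERE  is_not_trusted = 1
       OR  is_disabled = 1;
    ```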

  • RonKyle - Tuesday, December 12, 2017 9:35 AM

    There are lots of reasons to keep them, or at least to re-establish them on completion of the ETL process. [...]

    The reasons I suggested that are:

    A) this is a straight copy of the data with no real transformation involved, so we know any constraints are already enforced
    B) since this is going to be used for reporting, many of those constraints may not provide any value; optimization in general would need to be looked at relative to what's in the application DB, and at that point you add only what actually helps
    C) in general it makes the ETL process more complicated, because any constraints now have to stay synced between both databases; you've either added a drop-and-recreate step or have to factor in the load order imposed by the foreign keys (see the sketch after this list), once again for potentially no gain if the reports can't use them
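
    On point C, the load order can at least be derived rather than hand-maintained. A sketch that assigns each table a "level" from the foreign key graph, assuming SQL Server catalog views; tables involved in FK cycles would need manual handling, since the recursion cannot terminate for them:

    ```sql
    -- Level 0 tables reference nothing; level 1 tables reference only level 0,
    -- and so on. Loading in ascending level order satisfies the foreign keys.
    WITH fk_edges AS (
        SELECT DISTINCT parent_object_id     AS child_id,  -- referencing table
                        referenced_object_id AS parent_id  -- referenced table
        FROM   sys.foreign_keys
        WHERE  parent_object_id <> referenced_object_id    -- ignore self-references
    ),
    levels AS (
        SELECT t.object_id, 0 AS lvl
        FROM   sys.tables t
        WHERE  NOT EXISTS (SELECT 1 FROM fk_edges e WHERE e.child_id = t.object_id)
        UNION ALL
        SELECT e.child_id, l.lvl + 1
        FROM   fk_edges e
        JOIN   levels l ON l.object_id = e.parent_id
    )
    SELECT   OBJECT_SCHEMA_NAME(object_id) AS schema_name,
             OBJECT_NAME(object_id)        AS table_name,
             MAX(lvl)                      AS load_level
    FROM     levels
    GROUP BY object_id
    ORDER BY load_level, table_name;
    ```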

    The reasons I suggested that are:

    I gathered your reasons from your brief statement. That doesn't change the fact that it is not a good idea. A lot of people forgo the RI constraints, but it inevitably causes issues, and it's not advice I would give under any circumstance.

  • RonKyle - Tuesday, December 12, 2017 4:02 PM

    I gathered your reasons from your brief statement. That doesn't change the fact that it is not a good idea. [...]

    I don't think he is saying you shouldn't use constraints, just that you shouldn't blindly carry over the constraints of the data source, which in this case is a database rather than something else. Having those checks and balances regardless is not a bad idea if you're just duplicating the data source. But then again, no conforming is really happening here to make that data the data warehouse's own. It's technically not the data warehouse's data at this point, and it is going to be difficult to grow and scale now, because it has 600 OLTP-style tables being used for reporting purposes.

    If you favour the backup/restore approach, try pushing the cost angle: setting up a backup/restore is less than a day's work, while setting up an SSIS job to transfer 600 tables in the right order is many days of work.
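
    For reference, the backup/restore itself is only a couple of statements. The database name, paths, and logical file names below are placeholders; check the real logical names with RESTORE FILELISTONLY:

    ```sql
    -- On the vendor side: COPY_ONLY so their existing backup chain is untouched.
    BACKUP DATABASE VendorDb
        TO DISK = N'\\share\backups\VendorDb.bak'
        WITH COPY_ONLY, COMPRESSION, CHECKSUM;

    -- On the reporting side: restore under a new name, relocating the files.
    RESTORE DATABASE VendorDb_Reporting
        FROM DISK = N'\\share\backups\VendorDb.bak'
        WITH MOVE 'VendorDb'     TO N'D:\Data\VendorDb_Reporting.mdf',
             MOVE 'VendorDb_log' TO N'L:\Log\VendorDb_Reporting.ldf',
             STATS = 10;
    ```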

    I would go with an Availability Group, snapshot replication, or log shipping before I'd do something as heavy-handed as backup and restore. Log shipping is low-tech, but there's virtually no maintenance, your source data can be closer to real time if you want (ship the log once a day or once an hour), and you won't have to deal with schema changes like you will with an ETL process. Then you can spend all that time you saved developing ETL for the data warehouse that you really need 😉
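
    In production you'd configure log shipping through the built-in wizard and agent jobs, but one cycle of the idea is just the following (names and paths are placeholders, and it assumes the reporting copy was initialized from a full restore WITH NORECOVERY or STANDBY):

    ```sql
    -- Primary side: back up the transaction log.
    BACKUP LOG VendorDb TO DISK = N'\\share\logs\VendorDb.trn';

    -- Reporting side: restore WITH STANDBY so the copy stays readable
    -- between restores (WITH NORECOVERY would leave it inaccessible).
    RESTORE LOG VendorDb_Reporting
        FROM DISK = N'\\share\logs\VendorDb.trn'
        WITH STANDBY = N'D:\Data\VendorDb_undo.bak';
    ```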
