Separate Staging Tables and Data Warehouse into different Databases.

  • Until today I never saw a Database that contained both the Staging Tables and Data Warehouse Tables in the same Database.

    The Staging Tables are in a different schema than the Data Warehouse Tables.

    Maybe I'm missing something but I have to make a solid case that this is a good practice.

    Any information would be greatly appreciated.

    http://www.bidn.com/blogs/TomLannen/bidn-blog/1277/separating-data-warehouse-tables-from-your-relational-and-staging-tables

    For better, quicker answers on T-SQL questions, click on the following...

    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...

    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

    For better, quicker answers on T-SQL questions, click on the following...
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...
    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • Can't really say if it is good or bad, just different. When we started a data warehouse project at a previous employer we set up the Staging and ODS databases separately. We could have just as easily put them in the same the database using schemas to keep things separate.

    If this is being done, one thing I would also make sure of is that the staging and data warehouse tables were in different file groups. This would allow for the placing of these groups on different spindles on the server (or SAN) so that they weren't using the same disks.

  • What I often see are separate databases on the same instance, where the staging database is in simple recovery mode and the data warehouse is in full recovery mode. Procs parked in the staging database use cross-database calls to load the data into the data warehouse database after it has been massaged and prepared in tables in the staging database. Different schemas in the same database do not give you the separation necessary to run different recovery modes.

    The big barrier to moving the staging database onto a different instance altogether is the additional overhead of getting that data from the staging database into the data warehouse. Consider a simple cross-database INSERT...SELECT versus a secondary push job using SSIS or some other means. It's a tradeoff when having your staging database on a separate instance between flexibility in terms of scaling your environment and ease of loading your data warehouse.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Thanks you very much for the information!:-)

    For better, quicker answers on T-SQL questions, click on the following...
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...
    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • opc.three (3/23/2012)


    What I often see are separate databases on the same instance, where the staging database is in simple recovery mode and the data warehouse is in full recovery mode. Procs parked in the staging database use cross-database calls to load the data into the data warehouse database after it has been massaged and prepared in tables in the staging database. Different schemas in the same database do not give you the separation necessary to run different recovery modes.

    The big barrier to moving the staging database onto a different instance altogether is the additional overhead of getting that data from the staging database into the data warehouse. Consider a simple cross-database INSERT...SELECT versus a secondary push job using SSIS or some other means. It's a tradeoff when having your staging database on a separate instance between flexibility in terms of scaling your environment and ease of loading your data warehouse.

    I understand the Staging database using the simple recovery model, but curious why the data warehouse is using the full recovery model? Is it due to size and amount of data loaded each time that t-log backups are considered better?

    Previous company I worked at that was building a dat warehouse had theirs using the simple recovery model. A full backup was run after the nightly load process was completed.

  • Lynn Pettis (3/23/2012)


    opc.three (3/23/2012)


    What I often see are separate databases on the same instance, where the staging database is in simple recovery mode and the data warehouse is in full recovery mode. Procs parked in the staging database use cross-database calls to load the data into the data warehouse database after it has been massaged and prepared in tables in the staging database. Different schemas in the same database do not give you the separation necessary to run different recovery modes.

    The big barrier to moving the staging database onto a different instance altogether is the additional overhead of getting that data from the staging database into the data warehouse. Consider a simple cross-database INSERT...SELECT versus a secondary push job using SSIS or some other means. It's a tradeoff when having your staging database on a separate instance between flexibility in terms of scaling your environment and ease of loading your data warehouse.

    I understand the Staging database using the simple recovery model, but curious why the data warehouse is using the full recovery model? Is it due to size and amount of data loaded each time that t-log backups are considered better?

    Previous company I worked at that was building a dat warehouse had theirs using the simple recovery model. A full backup was run after the nightly load process was completed.

    Generically, a staging database where tables are wiped and reloaded with each new incoming feed makes it volatile, i.e. there is little to no value in recovering a previous DB. Whereas to recover a DW to minimize downtime for data clients could be important.

    In the first case that comes to mind running the DW in FULL was purely a DR position. Standard operating procedure was for all mission-critical databases, for which the DW database qualified, to be mirrored. As you know mirroring requires the database to be in FULL recovery mode 100% of the time. Staging was classified as volatile so the DR plan for that DB was to restore the last full backup to the secondary server and resume the job schedule. With mirroring, the DW could be brought online by simply failing over to the secondary mirror. Overkill? Maybe, but the shop had a very "one size fits all" mentality.

    I could see value in log shipping a DW though, or in being able to restore to a PIT for purposes of isolating a production issue. Are those worth the overhead of running a DW in FULL? Not sure, depends on the shop and their SLAs I guess.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • opc.three (3/23/2012)


    Lynn Pettis (3/23/2012)


    opc.three (3/23/2012)


    What I often see are separate databases on the same instance, where the staging database is in simple recovery mode and the data warehouse is in full recovery mode. Procs parked in the staging database use cross-database calls to load the data into the data warehouse database after it has been massaged and prepared in tables in the staging database. Different schemas in the same database do not give you the separation necessary to run different recovery modes.

    The big barrier to moving the staging database onto a different instance altogether is the additional overhead of getting that data from the staging database into the data warehouse. Consider a simple cross-database INSERT...SELECT versus a secondary push job using SSIS or some other means. It's a tradeoff when having your staging database on a separate instance between flexibility in terms of scaling your environment and ease of loading your data warehouse.

    I understand the Staging database using the simple recovery model, but curious why the data warehouse is using the full recovery model? Is it due to size and amount of data loaded each time that t-log backups are considered better?

    Previous company I worked at that was building a dat warehouse had theirs using the simple recovery model. A full backup was run after the nightly load process was completed.

    Generically, a staging database where tables are wiped and reloaded with each new incoming feed makes it volatile, i.e. there is little to no value in recovering a previous DB. Whereas to recover a DW to minimize downtime for data clients could be important.

    In the first case that comes to mind running the DW in FULL was purely a DR position. Standard operating procedure was for all mission-critical databases, for which the DW database qualified, to be mirrored. As you know mirroring requires the database to be in FULL recovery mode 100% of the time. Staging was classified as volatile so the DR plan for that DB was to restore the last full backup to the secondary server and resume the job schedule. With mirroring, the DW could be brought online by simply failing over to the secondary mirror. Overkill? Maybe, but the shop had a very "one size fits all" mentality.

    I could see value in log shipping a DW though, or in being able to restore to a PIT for purposes of isolating a production issue. Are those worth the overhead of running a DW in FULL? Not sure, depends on the shop and their SLAs I guess.

    As soon as I saw DB Mirroring, it makes sense. With that I would also assume that t-log backups are run periodically or after each load to keep the t-log from filling the disk.

  • Lynn Pettis (3/23/2012)


    As soon as I saw DB Mirroring, it makes sense. With that I would also assume that t-log backups are run periodically or after each load to keep the t-log from filling the disk.

    I cannot recall specifically, but it was something like every 15 minutes during heavy load times, and every few hours during the times when the DB was being used for reporting. Some lighter data feeds were loaded throughout the day as well.

    There are no special teachers of virtue, because virtue is taught by the whole community.
    --Plato

  • Welsh Corgi (3/23/2012)


    Until today I never saw a Database that contained both the Staging Tables and Data Warehouse Tables in the same Database.

    This is common practice in "some" RDBMS other than SQL Server, in most cases even in the same schema relying on naming covention to tell appart each kind of table like; stg_, dim_, fact_, ods_, etc.

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.
  • PaulB-TheOneAndOnly (3/23/2012)


    Welsh Corgi (3/23/2012)


    Until today I never saw a Database that contained both the Staging Tables and Data Warehouse Tables in the same Database.

    This is common practice in "some" RDBMS other than SQL Server, in most cases even in the same schema relying on naming covention to tell appart each kind of table like; stg_, dim_, fact_, ods_, etc.

    I'm only concerned about SQL Server at this point but I remeber that in Oracle when you create a new Database you specify whether you want to create an OLTP or an OLAP Database.

    Oracle configures certain system options differently depending upon whether it is an OLTP or an OLAP Database.

    For better, quicker answers on T-SQL questions, click on the following...
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...
    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • Welsh Corgi (3/23/2012)


    PaulB-TheOneAndOnly (3/23/2012)


    Welsh Corgi (3/23/2012)


    Until today I never saw a Database that contained both the Staging Tables and Data Warehouse Tables in the same Database.

    This is common practice in "some" RDBMS other than SQL Server, in most cases even in the same schema relying on naming covention to tell appart each kind of table like; stg_, dim_, fact_, ods_, etc.

    I'm only concerned about SQL Server at this point but I remeber that in Oracle when you create a new Database you specify whether you want to create an OLTP or an OLAP Database.

    Oracle configures certain system options differently depending upon whether it is an OLTP or an OLAP Database.

    That's correct, the GUI based Oracle assistant would set some generic parameters to favor either OLTP or OLAP as needed. These parameters can change at any time manually - which usually happens.

    Either way, a set of Staging tables are not precisely and OLTP application - large scans and inserts are expected on them therefore an OLAP configured instance applies perfectly.

    My point is, it is not necesarily a bad thing to have data warehouse staging and core tables sitting on the same database; just a way to do it.

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.
  • PaulB-TheOneAndOnly (3/23/2012)


    Welsh Corgi (3/23/2012)


    PaulB-TheOneAndOnly (3/23/2012)


    Welsh Corgi (3/23/2012)


    Until today I never saw a Database that contained both the Staging Tables and Data Warehouse Tables in the same Database.

    This is common practice in "some" RDBMS other than SQL Server, in most cases even in the same schema relying on naming convention to tell appart each kind of table like; stg_, dim_, fact_, ods_, etc.

    I'm only concerned about SQL Server at this point but I remember that in Oracle when you create a new Database you specify whether you want to create an OLTP or an OLAP Database.

    Oracle configures certain system options differently depending upon whether it is an OLTP or an OLAP Database.

    That's correct, the GUI based Oracle assistant would set some generic parameter to favor either OLTP or OLAP as needed. These parameter can change at any time manually - which usually happens.

    Either way, a set of Staging tables are not precisely and OLTP application - large scans and inserts are expected on them therefore an OLAP configured instance applies perfectly.

    My point is, it is not necessarily a bad thing to have data warehouse staging and core tables sitting on the same database; just a way to do it.

    I'm concerned about the time it takes to back up the Database and recover it. Yes you can do file groups but this dude used the same filegroup for everything and I would prefer to not be restricted to a filegroup backup and restore.

    I changed the database set to simple recovery model when he had only published the Staging Tables because I did not want every transaction to be logged in Staging.

    The design does not even come close to a Star Schema or Star Snowflake Schema.

    Nothing that indicates a Fact Table or Dimensional table, just a repeat of the relational model in the same Database where the staging schema resides.

    For better, quicker answers on T-SQL questions, click on the following...
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...
    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • My staging tables were nothing more than a copy of the source tables. Their sole purpose was a landing place for the data when it was pulled each night before being moved to our ODS database that was going to serve as the source for populating our data warehouse that ended up on the back burner.

  • Thanks for all of the information.:-)

    If you think of any of considerations please do not hesitate to share with me for I may need more ammo to make my case.

    Thanks again!

    For better, quicker answers on T-SQL questions, click on the following...
    http://www.sqlservercentral.com/articles/Best+Practices/61537/

    For better answers on performance questions, click on the following...
    http://www.sqlservercentral.com/articles/SQLServerCentral/66909/

  • Welsh Corgi (3/23/2012)The design does not even come close to a Star Schema or Star Snowflake Schema.

    Nothing that indicates a Fact Table or Dimensional table, just a repeat of the relational model in the same Database where the staging schema resides.

    May be (just may be) you are looking at a Bill Inmon style Data Warehouse 🙂

    _____________________________________
    Pablo (Paul) Berzukov

    Author of Understanding Database Administration available at Amazon and other bookstores.

    Disclaimer: Advice is provided to the best of my knowledge but no implicit or explicit warranties are provided. Since the advisor explicitly encourages testing any and all suggestions on a test non-production environment advisor should not held liable or responsible for any actions taken based on the given advice.

Viewing 15 posts - 1 through 15 (of 40 total)

You must be logged in to reply to this topic. Login to reply