Staging Database vs. Staging Schema within the DW

  • We are building a data warehouse on SQL Server 2014 EE SP1. Which is the better place to store the OLTP source data: a staging database, or a staging schema within the data warehouse? What are the pros and cons of each? I know that with a separate staging database I would have another database to set permissions on and back up. Are there other things I should consider? If I used a schema within the data warehouse, that schema would be backed up along with the other schemas (e.g. dbo) in the DW database. Can a backup be done on just a schema, or is the only way to back up a schema to use the export wizard, select the schema, and export all tables and data for that schema?

    Thanks in advance.

  • I would do a separate database because:

    1. Separation of processes

    2. Faster recovery time for the data warehouse, because the staging data is not in the data warehouse database.

    3. You typically will not have DRI (declared referential integrity) between staging and final tables, so cross-database constraints are not an issue.

    You could use file or filegroup backups to back up the staging schema separately from the other schemas if you put all of your staging schema objects on a separate file or filegroup.
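
    A minimal sketch of that approach (the database, filegroup, file, and table names below are made up for illustration):

        -- Dedicated filegroup and file for staging objects
        ALTER DATABASE MyDW ADD FILEGROUP STAGING;
        ALTER DATABASE MyDW ADD FILE
            (NAME = N'MyDW_Staging', FILENAME = N'S:\Data\MyDW_Staging.ndf', SIZE = 1GB)
            TO FILEGROUP STAGING;
        GO

        -- A staging table placed on that filegroup (assumes the stg schema already exists)
        CREATE TABLE stg.FactSalesLoad
        (
            SaleID   int           NOT NULL,
            SaleDate datetime      NOT NULL,
            Amount   decimal(18,2) NOT NULL
        ) ON STAGING;
        GO

        -- Back up just the staging filegroup (or leave it out and back up PRIMARY only)
        BACKUP DATABASE MyDW
            FILEGROUP = 'STAGING'
            TO DISK = N'B:\Backup\MyDW_STAGING.bak';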

  • To me a full answer depends on how many sources you are pulling from to populate your data warehouse.

    I have four principal sources. Each source has its own staging database divided into two schemas: Landing and Staging. The Landing schema has the same table and column names as the source database and, to the extent possible, the same data types. Each landing table also has a bit field that defaults to 0 and a comments field that defaults to an empty string. The bit is set to 1 once all the data in the row has been checked; this prevents one bad row from bringing the process to a halt. If a row is bad, the reason is noted in the comments field, and you will need a way to identify those rows if you take this approach. The Staging schema is used to start any necessary conversions or data break-outs (splitting time into its hour and minute components is common). No table stores information permanently; all the tables are truncated.
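
    For what it's worth, a minimal sketch of one such landing table (the table and column names are invented for illustration; the flag and comments columns follow the pattern described above):

        CREATE SCHEMA Landing;
        GO
        -- Mirrors the source table's names and data types as closely as possible,
        -- plus a validation flag and a comments column for rejected rows.
        CREATE TABLE Landing.CustomerOrder
        (
            OrderID      int           NOT NULL,
            CustomerID   int           NOT NULL,
            OrderDate    datetime      NOT NULL,
            OrderAmount  decimal(18,2) NULL,
            IsValidated  bit           NOT NULL DEFAULT (0),   -- set to 1 once the row passes all checks
            LoadComments varchar(500)  NOT NULL DEFAULT ('')   -- reason a bad row was rejected
        );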

    From the Staging schema (or the Landing schema, if staging is not necessary for a particular process), all imports lead to a common staging database, broken out with a different schema name for each source. The data is put into 3NF if that's necessary, which it often is. This is also where combining data that lives in multiple sources starts.

    From there it goes to the data warehouse database, which is also broken out into the source schemas. Lately I've been converting my fact and dimension tables into views; Kimball suggested this in his most recent book, and after some experimentation I think it works well overall. These views are assigned to a schema that matches the cube name. The warehouse contains only data needed for the warehouse.
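
    For example, exposing a fact table through a view in a schema named after the cube might look like this (the schema, view, and column names are hypothetical):

        CREATE SCHEMA SalesCube;
        GO
        -- The cube reads from the view rather than from the base table.
        CREATE VIEW SalesCube.FactSales
        AS
        SELECT s.SaleKey,
               s.DateKey,
               s.CustomerKey,
               s.SalesAmount
        FROM   dbo.FactSales AS s;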

    Finally, there is an ETL database that is broken down by schema to match the four sources. It stores the row counts and exception information used to create the feedback reports.

    If you primarily draw from one source, this might be too much. Some would say it's too much anyway, but the ETL is the most crucial part of the DW, and you want the design to be as self-explanatory as possible. This isn't the design I started with; I moved to it as the need for greater clarity asserted itself.

  • I personally adopted the staging schema for my data sources. The filegroup is still on its own LUN, and all my data sources come in at different times, ranging from 200 KB to 30 GB flat files.

    Since I have to push these into dozens of data marts, adding even more databases for the large number of data sources would only add to the confusion. So I keep it all in one database in a simple two-step process.

    It's easy to understand, easy to manage, and still separate, because it's in a separate schema.

  • Staging data is transient in nature and behaves like a temp table. That means a lot of inserts and updates against these tables, which in turn causes a lot of I/O as well as auto-growth and shrink behavior. It's better to keep this kind of activity out of the final database, since the final database might be a 24/7 data source for other BI applications such as SSAS or SSRS, and in those cases you don't want it stuck doing things that are not part of its core business.

    Schemas are logical entities used for better organization and security and don't really provide any performance gain other than object name resolution during query parsing. Staging tables usually don't require complex data management or security, so I think a staging schema might not add much in a DW context.

    Besides, what's the point of backing up staging tables? Also, staging tables take a lot of bulk inserts, so wouldn't it be better to have them in a database with the bulk-logged recovery model instead of compromising the recovery integrity of the DW database?

    Also, Resource Governor and other policies can be applied to a staging database more easily than to a staging schema.

    Jayanth Kurup

  • Jayanth_Kurup (10/24/2015)


    Staging data is transient in nature and behaves like a temp table. That means a lot of inserts and updates against these tables, which in turn causes a lot of I/O as well as auto-growth and shrink behavior. It's better to keep this kind of activity out of the final database, since the final database might be a 24/7 data source for other BI applications such as SSAS or SSRS, and in those cases you don't want it stuck doing things that are not part of its core business.

    Schemas are logical entities used for better organization and security and don't really provide any performance gain other than object name resolution during query parsing. Staging tables usually don't require complex data management or security, so I think a staging schema might not add much in a DW context.

    Besides, what's the point of backing up staging tables? Also, staging tables take a lot of bulk inserts, so wouldn't it be better to have them in a database with the bulk-logged recovery model instead of compromising the recovery integrity of the DW database?

    Also, Resource Governor and other policies can be applied to a staging database more easily than to a staging schema.

    What's the point of backing up something that is essential to your DW? Well, I don't know, so that it functions after a restore? That is, unless you're rebuilding every staging table or database with each ingestion of new data if they don't already exist (which is a good idea too). If you don't, then your DW does not function without them, right?

    Besides that, if your staging data is located on separate disks, possibly even with a separate RAID configuration, how is this increasing I/O? If I'm not mistaken, and correct me here if I'm wrong Jayanth_Kurup, your instance balances resources across all databases. Therefore, a separate schema versus a separate database will make no difference, but separate disks will. I think you just made this argument in the other thread on table partitioning, did you not? 😛

    I won't touch what you just said about security, as if data in a staging table or temp table carried no security considerations. I've always been taught that no matter where your data is, security is always a priority, whether it's in a table for just 5 seconds or in its primary resting place.

    That's like me telling my boss that it's not a big deal that all of your data is accessible to the entire world for only 5 seconds. It's only 5 SECONDS! So why do we care? :w00t:

    As for bulk logging, that depends on the use case and shouldn't be applied to all of them. For example, I personally use the simple recovery model because I have all the flat files for my data on disk before they're loaded into the database. I have no need for point-in-time recovery, nor do I make any major updates to the data that can't be recovered from the last good full or differential backup, so there is no need for the full recovery model and the switch to bulk-logged when ingesting bulk data.

  • xsevensinzx (10/24/2015)


    What's the point of backing up something that is essential to your DW? Well, I don't know, so that it functions after a restore? That is, unless you're rebuilding every staging table or database with each ingestion of new data if they don't already exist (which is a good idea too). If you don't, then your DW does not function without them, right?

    A staging table is where the TRANSFORM in the ETL process happens; it is never meant to be backed up. Staging tables are often used to perform calculations, enrich the data, or prepare it for the final load into the destination table. They're even used to handle late-arriving dimensions and so on; in the event of a failure, a staging table is simply repopulated from the original source. I think what you are referring to is the destination table where the final transformed data is LOADED. You should back up the destination table, not the staging table. Typically this would be the endpoint for your SCD.

    Besides that, if your staging data is located on separate disks, possibly even with a separate RAID configuration, how is this increasing I/O? If I'm not mistaken, and correct me here if I'm wrong Jayanth_Kurup, your instance balances resources across all databases. Therefore, a separate schema versus a separate database will make no difference, but separate disks will. I think you just made this argument in the other thread on table partitioning, did you not? 😛

    You cannot split files and filegroups by schema; you can, however, do that at the table level, in which case you don't even need a schema to control file and filegroup placement anyway. Why bother creating a schema called stg? Simply start naming staging tables with a stg_ prefix instead. The schema has nothing to do with the files and filegroups the data ends up residing in.

    Staging tables are usually write intensive while the destination tables (mentioned above) are read intensive, so placing the files on different RAID configurations does help. About the other post: the video, script, and blog are there, and I haven't seen anybody come back yet with actual data that contradicts the findings :-p

    I won't touch what you just said about security, as if data in a staging table or temp table carried no security considerations. I've always been taught that no matter where your data is, security is always a priority, whether it's in a table for just 5 seconds or in its primary resting place.

    On almost every BI project I work on, we ensure the data is already masked when it is pulled from the source, mostly because the format of the data could be CSV, XML, or some other plain-text format, which can be compromised in transit. I have yet to find a BI project where analysis is done on SSNs, email IDs, credit card numbers, etc. Essentially, if you don't need these fields you shouldn't be fetching them in the first place, and if I don't fetch them then I don't have to worry about securing them.

    That's like me telling my boss that it's not a big deal that all of your data is accessible to the entire world for only 5 seconds. It's only 5 SECONDS! So why do we care? :w00t:

    As for bulk logging, that depends on the use case and shouldn't be applied to all of them. For example, I personally use the simple recovery model because I have all the flat files for my data on disk before they're loaded into the database. I have no need for point-in-time recovery, nor do I make any major updates to the data that can't be recovered from the last good full or differential backup, so there is no need for the full recovery model and the switch to bulk-logged when ingesting bulk data.

    You do realize that bulk insert operations are possible even in the simple recovery model? The process is exactly the same; you just need TABLOCK on the table.

    https://msdn.microsoft.com/en-IN/library/ms190422.aspx

    Here is a post on quirky behavior where bulk-like operations happen even in the full recovery model.

    http://enabledbusinesssolutions.com/blog/is-it-possible-bulk-operations-in-full-recovery-model/
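
    For example, a minimally logged flat-file load under the simple recovery model might look roughly like this (the file path and table are hypothetical; TABLOCK is the relevant part, and minimal logging also assumes the target is a heap or is empty):

        BULK INSERT dbo.stg_FactSales
        FROM 'S:\Import\FactSales_20151024.csv'
        WITH
        (
            FIELDTERMINATOR = ',',
            ROWTERMINATOR   = '\n',
            FIRSTROW        = 2,   -- skip the header row
            TABLOCK                -- table lock enables minimal logging in simple/bulk-logged recovery
        );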

    Jayanth Kurup

  • Jayanth_Kurup (10/25/2015)

    A staging table is where the TRANSFORM in the ETL process happens; it is never meant to be backed up. Staging tables are often used to perform calculations, enrich the data, or prepare it for the final load into the destination table. They're even used to handle late-arriving dimensions and so on; in the event of a failure, a staging table is simply repopulated from the original source. I think what you are referring to is the destination table where the final transformed data is LOADED. You should back up the destination table, not the staging table. Typically this would be the endpoint for your SCD.

    I'm referring to a permanent table that handles staging data in a database: a table that always handles the same transformation for an underlying data source, where data is first loaded to be cleaned, transformed, and eventually pushed off to a destination (or primary) table. That is a permanent table in my mind. Is this not the same for you?

    In my understanding, if you are not recreating that permanent table with each load, or at least ensuring it exists, then when it does not exist the ETL cannot function and you cannot load data to be transformed. Hence why I'm saying it would need to be backed up: regardless of whether it's a primary (or destination) table or not, it's critical to the functioning of the ETL system.

    Jayanth_Kurup (10/25/2015)

    You cannot split files and filegroups by schema; you can, however, do that at the table level, in which case you don't even need a schema to control file and filegroup placement anyway. Why bother creating a schema called stg? Simply start naming staging tables with a stg_ prefix instead. The schema has nothing to do with the files and filegroups the data ends up residing in.

    Staging tables are usually write intensive while the destination tables (mentioned above) are read intensive, so placing the files on different RAID configurations does help. About the other post: the video, script, and blog are there, and I haven't seen anybody come back yet with actual data that contradicts the findings :-p

    Sorry, that was my miscommunication. It was late when I responded.

    What I was referring to was schema and filegroup alignment, meaning a staging schema and staging filegroups. That way you can keep your permanent staging tables completely segregated from your primary tables and schema, with both a separate filegroup on separate disks and a separate schema.

    Example: staging.FactSales (filegroup: STAGING, location: LUN: S) -> dbo.FactSales (filegroup: PRIMARY, location: LUN: D)
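
    A rough sketch of that arrangement, assuming the staging schema and a STAGING filegroup on the S: LUN already exist (names are examples):

        -- Staging copy lives on the STAGING filegroup; the destination stays on PRIMARY.
        CREATE TABLE staging.FactSales
        (
            SaleKey     bigint        NOT NULL,
            DateKey     int           NOT NULL,
            SalesAmount decimal(18,2) NOT NULL
        ) ON STAGING;

        -- Final hop: staging.FactSales -> dbo.FactSales
        INSERT INTO dbo.FactSales WITH (TABLOCK) (SaleKey, DateKey, SalesAmount)
        SELECT SaleKey, DateKey, SalesAmount
        FROM   staging.FactSales;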

    Jayanth_Kurup (10/25/2015)

    On almost every BI project I work on, we ensure the data is already masked when it is pulled from the source, mostly because the format of the data could be CSV, XML, or some other plain-text format, which can be compromised in transit. I have yet to find a BI project where analysis is done on SSNs, email IDs, credit card numbers, etc. Essentially, if you don't need these fields you shouldn't be fetching them in the first place, and if I don't fetch them then I don't have to worry about securing them.

    Indeed, you can do that too. But once the data is in the database, you still have to consider other users, and even yourself sometimes, when dealing with data mid-transformation. Even if there is no performance difference, having that separation of staging from primary via a separate schema is still good in my opinion, because, as above, I'm referring to permanent staging tables that always live in the database under the staging schema for each data source and each destination (or primary) table.

    It's clear, separated, and easy to manage, with the option for additional security considerations.

  • Great, thank you for clarifying that the staging table is permanent. This is exactly what I follow as well. In my case I always truncate and load the staging table for each ETL run, so the staging table never has any data before or after a successful ETL.

    Now, backups of the staging tables will be taken with every backup in your case, but in my case only one backup is required, or better yet just version control. Backups are traditionally employed to preserve data, not schema, but having said that, I have used them both ways. If the staging tables are in their own DB, then a single backup is all that's needed.

    No need to apologize, I understood what you meant; my point was that

    staging.FactSales (filegroup: STAGING, location: LUN: S) -> dbo.FactSales (filegroup: PRIMARY, location: LUN: D)

    is the same as

    dbo.stg_FactSales (filegroup: STAGING, location: LUN: S) -> dbo.FactSales (filegroup: PRIMARY, location: LUN: D)

    The schema in itself has nothing to do with the placement of files on disk; it's just CREATE TABLE dbo.stg_FactSales ON STAGING.

    Now, probably the most important points in favor of performing ETL in its own DB versus a schema in the target DB are:

    1) The allocation pages (GAM, SGAM, PFS, IAM, DCM, BCM) are put under a lot of stress during the few hours that the ETL actually runs, and these same pages are also required for regular database workloads. Splitting the data across staging databases instead of schemas lets SQL Server work with different allocation pages until the final load process, thereby reducing the impact of the ETL on the users querying the target database.

    2) In some companies the ETL process is not as important as the workload executed against the target database, so Resource Governor is used to throttle the resources allocated to the ETL databases (a minimal sketch follows this list). The best example of this is creating indexes on staging tables to allow faster joins and filters against the target table during SCD processing; in these cases the index is a temporary one that has no benefit for the end user, and you don't want it hogging all the resources.

    3) At some later point in time, the amount of data moving in and out of the system will grow, often requiring a dedicated staging environment (server). That environment then often takes on the role (like a distributor in replication) of pushing cleansed data to a number of subsystems, each with their own copy of the data. A simple example would be data from HR staging tables being pushed into Employees tables maintained by Finance, Transport, Sales, HR, and Legal. In such cases, creating a staging schema in each of the downstream databases is redundant.
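
    A minimal Resource Governor sketch along the lines of point 2 (the pool limits, group name, and ETL login are examples; the classifier function lives in master):

        -- Pool and workload group that cap what the ETL connections can consume
        CREATE RESOURCE POOL ETLPool WITH (MAX_CPU_PERCENT = 30, MAX_MEMORY_PERCENT = 30);
        CREATE WORKLOAD GROUP ETLGroup USING ETLPool;
        GO

        -- Classifier: route the ETL service account into the throttled group
        CREATE FUNCTION dbo.fnClassifyETL()
        RETURNS sysname
        WITH SCHEMABINDING
        AS
        BEGIN
            IF SUSER_SNAME() = N'DOMAIN\etl_service'
                RETURN N'ETLGroup';
            RETURN N'default';
        END;
        GO

        ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.fnClassifyETL);
        ALTER RESOURCE GOVERNOR RECONFIGURE;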

    Jayanth Kurup

  • It's my understanding that the pages are still going to be under pressure once the data moves to the destination tables from the source tables outside of the DW database. This is similar to auto-growth of the database due to staging in the same database: the data is still new, and the database is still going to grow regardless.

    That being said, I do agree that eventually the data is going to outgrow the server regardless of the database layout. That's ideally why I personally am using Hadoop (NoSQL) to augment that ETL outside of SQL Server. That way, if the ETL grows beyond that server, you can still scale out with distributed processing using a solution that's cheaper per terabyte.

  • My recommendation is to keep just about anything that even smells like a DW on a totally separate server. Yes, it takes a bit of effort to move the data and, yes, it's an extra expense, but it's been absolutely worth it to me from a resource usage, performance, and DR point of view, and in several other respects.

    Of course, if the databases have a good design and no one ever writes more than the occasional and fixable performance challenged code on either the OLTP or DW systems, they can peacefully live well on the same server... (yeah... wait for that 😉 )

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.



  • OLTP and DW never go hand in hand (personal experience talking here). A fairly routine aggregation summarizing data for the last 5 years could easily flood the buffer pool with so much data that the OLTP system might face contention. The natures of the two systems are so diametrically opposite that it would almost be like a server having dissociative identity disorder.

    In one case you need a few rows and serial plans; in the other you want thousands of rows and as much parallelism as the server can offer.

    In one case the operations are write intensive, and in the other, read intensive.

    Point-in-time recovery is a must for OLTP, while simple weekly backups are more than enough for the DW.

    Thousands of connections for OLTP, a few hundred for the DW.

    Normalization for one, dimensional modeling for the other.

    HA and DR behave differently for customer-facing vs. internal applications.

    And many more. It just doesn't make sense to use the design practices interchangeably or to use the same hardware configuration both ways; it's almost impossible to configure a server so that it works well for both.

    EDIT: and then MS went and introduced columnstore indexes 🙂

    xsevensinzx: in a normal ETL process a row can undergo many transformations before it's ready for the final load. It makes sense to do them all in the staging database and only push the final row into the destination, thereby reducing contention on the allocation structures of the target.

    Jayanth Kurup

  • Jayanth_Kurup (10/25/2015)




    A staging table is where the TRANSFORM in the ETL process happens; it is never meant to be backed up.

    In our case we only transfer new-since-last-time data. This dramatically reduces the export, transfer, load and transform times. Our staging DB will auto-resume from "most recent date in each table", so restoring from an old, or even empty-table, backup is fine - but it will take longer to re-import the data the following night, quite possibly overrunning and then colliding with working-hours as a result, so we prefer to have recent backups.
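
    A minimal sketch of that "resume from the most recent date" pattern (the table, column, and linked-server names are hypothetical):

        -- High-water mark already present in the staging table
        DECLARE @LastLoaded datetime =
            (SELECT ISNULL(MAX(ModifiedDate), '19000101') FROM stg.Orders);

        -- Pull only the new-since-last-time rows from the source
        INSERT INTO stg.Orders (OrderID, CustomerID, OrderDate, ModifiedDate)
        SELECT OrderID, CustomerID, OrderDate, ModifiedDate
        FROM   SourceServer.SourceDB.dbo.Orders
        WHERE  ModifiedDate > @LastLoaded;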

    On almost every BI project I work on, we ensure the data is already masked when it is pulled from the source

    We usually receive all or most columns and then only process the columns of interest. The reason for this is that when the business says "Oh, we also/now need XXX" (or the supplier of the data "didn't realise that column was relevant"), we already have that column imported and only have to modify the downstream transform process. Otherwise we would either have to write a one-time single-column XFER, or modify the export and import to allow for the new column (suppliers of data always seem to charge a fortune for that ...), and re-import the whole table (which would mark all rows as "changed" and thus cause a lengthy transform process compared to normal). We can either populate the downstream tables manually to freshen that one column, or force just the post-import processes to run on all rows.

  • I think most, if not all, strive for a separate server for just the DW. That's a given. The question is where to keep staging: in the same database or in a separate database. Mixing in OLTP is often not a great idea. 😛

    To be fair, I load some pretty massive data in batch overnight. I have everything in one database with separate filegroups for staging and a separate schema. Thus far, everything sings pretty well and within good time.

    The only issue I ever face is the limitation of not having distributed processing when the data is being summarized into the respective data marts. You can only scale up so much before things start getting really hairy.

    Kirsten: I'm not sure the flexibility afforded by bringing in columns that might one day be useful offsets the storage and performance overhead of all the rows and columns that won't be. Either way, it might be worth looking into how many of those columns contain sensitive data and whether there are any meaningful reports based on them.

    Also, I just wanted to check whether you're using CDC for your ETL process, since you mentioned you need to do net change.
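
    For reference, enabling CDC on a source table looks roughly like this (the schema, table, and capture options are examples):

        -- Run in the source database
        EXEC sys.sp_cdc_enable_db;

        EXEC sys.sp_cdc_enable_table
            @source_schema        = N'dbo',
            @source_name          = N'Orders',
            @role_name            = NULL,   -- or a gating database role
            @supports_net_changes = 1;      -- needs a primary key or unique index

        -- Net changes since the last load can then be read through the generated
        -- cdc.fn_cdc_get_net_changes_dbo_Orders function.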

    Jayanth Kurup
