Importance of a Staging Table/Database

  • We've been importing tons of data into a staging database through T-SQL scripts, performing transformations on the data, and then importing it into our test or live database as appropriate.

    I'm looking at things and this seems pretty inefficient, so I was wondering if someone can correct my logic.

    Say I have a table of 100k records.

    Currently I am importing 100k records into a table, possibly performing a transformation (which is just a value lookup 90% of the time), and then moving them over to the destination database, where I'll insert them if they're new, update them if they exist and have changed, or just let them die otherwise.
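
    For illustration, that insert-if-new / update-if-changed / ignore-otherwise step looks roughly like this in T-SQL (a sketch only; the table and column names are made up, and NULL handling is ignored for brevity):

    -- stg.Customer is the transformed staging table, dbo.Customer the destination.
    -- Update rows that already exist but have changed:
    UPDATE d
    SET    d.Name = s.Name,
           d.City = s.City
    FROM   dbo.Customer AS d
    JOIN   stg.Customer AS s
           ON s.CustomerID = d.CustomerID
    WHERE  d.Name <> s.Name
       OR  d.City <> s.City;

    -- Insert rows that are new:
    INSERT INTO dbo.Customer (CustomerID, Name, City)
    SELECT s.CustomerID, s.Name, s.City
    FROM   stg.Customer AS s
    WHERE  NOT EXISTS (SELECT 1
                       FROM dbo.Customer AS d
                       WHERE d.CustomerID = s.CustomerID);
    -- Rows that match and haven't changed are simply left alone.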

    It seems more efficient to move the 100k records from the origin database to the destination database, determine if they're new and only apply transformations to the new ones. Of the remaining, find out if they've changed and only apply transformations to those.

    It just seems like you're updating smaller sets of data only as needed, and you avoid a read/write into another table as well as the extra code (or steps in SSIS) it takes.

    Our staging database is on the same machine, so you cut out that transfer time from staging to destination, but you're still taxing the CPU/Memory of that database.

    So what is the concept of the staging database that I'm missing here? Is this done to offload transformation resources to another machine? To avoid keeping so many records in memory? Something else I'm not noticing?

    I guess the only reason we're using staging for everything, as far as I know, is that a DBA with 40 years' experience did our first import this way. I'm just at the stage of wondering why that is.

  • You're actually asking two different questions here, Shawn:

    1) What's the best method for updating/maintaining DW tables?

    and

    2) Why do I need a staging database?

    Since (2) is simpler, I will answer that first:

    First, when implementing a DW (data warehouse) you are going from one set of functions, OLTP (on-line transaction processing), to three: OLTP, OLAP (on-line analytical processing), and ETL (Extract, Transform and Load). The first two are obvious, but the third is not, because the services and functions it provides go not directly to the users, but rather to the OLAP side, to the DBA, and to DM (data management).

    And although these three service areas are interdependent, they are not really compatible with each other, so it's a good idea to keep them as separate as possible and to give each its own resources to do its work. Now obviously, the OLTP resources are the original databases, and the OLAP resources are the DW database. So where are the ETL resources? That's your staging database. This is where you can do your ETL work without interfering with the production databases or the DW environment.

    Secondly, you need a staging database to hold many of the things that you may be keeping or using in the OLTP DBs or the DW now. To minimize how much the ETL processes interfere with the DBs and the DW, it is expedient to keep much of the "housekeeping" in the ETL area. The staging database also serves as a holding zone to buffer differences in the timing windows of the source DBs and the target DW: the time of day when it is most convenient to extract data from one source database may not be convenient for the others, and may not be convenient for the DW either. With a staging database, you have a place to hold the data until you're ready (there's a rough sketch of this below).

    And thirdly, you need a staging database to give you options. For instance, there are actually three or four standard approaches to doing the ETL for a DW table or dimension, and each has different storage and processing needs. A staging database means that the ETL demands do not have to be a barrier to which method you use for which table or dimension. Likewise, if something goes wrong with the ETL, the staging database gives you a place to work on recovery without necessarily having to interfere with the production source DBs (or the DW, until you are ready).
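
    To make the second point concrete, here is a minimal sketch of an Extract step that simply parks rows in staging during the source's quiet window (all database, schema and table names are invented):

    -- Runs when SourceDB is quiet; Transform and Load run hours later, on the
    -- DW's own schedule, against this staged copy without touching SourceDB again.
    TRUNCATE TABLE StagingDB.stg.OrdersRaw;

    INSERT INTO StagingDB.stg.OrdersRaw
           (OrderID, CustomerID, OrderDate, Amount, ExtractedAt)
    SELECT o.OrderID, o.CustomerID, o.OrderDate, o.Amount, GETDATE()
    FROM   SourceDB.dbo.Orders AS o;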

    -- RBarryYoung, (302)375-0451
    blog: MovingSQL.com, Twitter: @RBarryYoung
    Proactive Performance Solutions, Inc.
    "Performance is our middle name."

  • Shawn Therrien (5/15/2009)


    We've been importing tons of data into a staging database through T-SQL scripts, performing transformations on the data, and then importing it into our test or live database as appropriate.

    I'm looking at things and this seems pretty inefficient, so I was wondering if someone can correct my logic.

    Say I have a table of 100k records.

    Currently I am importing 100k records into a table, possibly performing a transformation (which is just a value lookup 90% of the time), and then moving them over to the destination database, where I'll insert them if they're new, update them if they exist and have changed, or just let them die otherwise.

    It seems more efficient to move the 100k records from the origin database to the destination database, determine if they're new and only apply transformations to the new ones. Of the remaining, find out if they've changed and only apply transformations to those.

    Yes, that probably is more efficient. However, efficiency is normally not the first priority for an ETL process. The first priority is usually to minimize disruption to the production windows of the OLTP and OLAP services. The second priority is usually to ensure the reliability and integrity of the ETL process. Then the third priority would be performance and efficiency.

    It just seems like you're updating smaller sets of data only as needed, and you avoid a read/write into another table as well as the extra code (or steps in SSIS) it takes.

    Our staging database is on the same machine, so you cut out that transfer time from staging to destination, but you're still taxing the CPU/Memory of that database.

    So what is the concept of the staging database that I'm missing here? Is this done to offload transformation resources to another machine? To avoid keeping so many records in memory? Something else I'm not noticing?

    Yes, those are good things, but they aren't the primary thing. The primary thing is that if you update the DW straight from the production OLTP, then you have made them directly dependent on each other. If something goes wrong on one (or in the process itself), both sides are affected.

    Consider what happens if one of the source tables gets locked during the transfer: both the source and target tables are locked for the duration. With a staging database, the DW is still available for use. What if the target table gets locked? Again, both source and target are locked. However, with a staging DB the source is completely unaffected; the Load stage can be killed, and once the problem is resolved it can be restarted from the extracted data in the staging database. Even if that doesn't happen until the middle of the next day, the production OLTP databases are unaffected.
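
    For example, a Load step written so that it can be killed and simply rerun can be as simple as this sketch (names are illustrative only):

    -- Only rows not already in the DW get moved, so rerunning after a failure
    -- just picks up whatever is still waiting in staging.
    INSERT INTO DW.dbo.FactOrders (OrderID, CustomerID, Amount)
    SELECT s.OrderID, s.CustomerID, s.Amount
    FROM   Staging.stg.Orders AS s
    WHERE  NOT EXISTS (SELECT 1
                       FROM DW.dbo.FactOrders AS f
                       WHERE f.OrderID = s.OrderID);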

    I could give many other examples. The point is that with a staging database the overall process might take longer, but the individual stages are shorter, which minimizes disruption to the source and target; and in the event of problems, you have many more options and much more control.

    -- RBarryYoung, (302)375-0451
    blog: MovingSQL.com, Twitter: @RBarryYoung
    Proactive Performance Solutions, Inc.
    "Performance is our middle name."

  • Some people use separate tables, some use separate databases.

    This is where table partitioning can be of great benefit. In SQL Server 2005 and later, you can use the ALTER TABLE ... SWITCH command to move data into or between partitions as a metadata-only operation. If you're not already using it, have a look; it may help you out.
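
    For example, something along these lines (the table names and partition number are invented; the staging table must be in the same database and filegroup as the target partition, match its structure and indexes, and carry a check constraint covering that partition's range):

    -- Load stg.FactSales_200905 first, then switch it into the partitioned
    -- target as a metadata-only operation:
    ALTER TABLE stg.FactSales_200905
    SWITCH TO dbo.FactSales PARTITION 5;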

    --
    Andrew Hatfield

  • Okay, those are a lot of good reasons. Now most of them seem to imply that staging is on another instance of SQL Server, preferably on another machine. Ours is on the same instance/machine as the destination database.

    We're going from an OLTP with OLAP tendencies to our new version, an OLTP that is very similar for 80% of the data, just transforming PKs/FKs for the most part. For the other 20% we're doing actual transformations of the data (like that subset sum).

    So I wasn't seeing much of a point, given the configuration of my staging DB and the really basic work I'm doing.

    Really great info on the DW side. That's something I've only touched on, but I'll be working with it in greater depth soon.

  • Here's another take on the staging database. In order to process the data from the transaction system and prepare it for the DW structure, I have many stored procedures and tables that perform different functions, and I need to store them somewhere; that place is the staging database. For example, on one ETL that runs frequently I use a timestamp from the OLTP data to determine what has changed since the last run, so I only grab the rows needed. If I have an issue with the ETL and some of those rows transfer to the OLAP system but others don't, I need to retain them so they get reprocessed on the next run; otherwise I risk losing them. So for each table from the OLTP system I keep an unprocessed-rows table in my staging database.
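
    A stripped-down sketch of that (the control table and all object names are just placeholders):

    DECLARE @LastRun datetime;

    SELECT @LastRun = LastSuccessfulRun
    FROM   stg.EtlControl
    WHERE  SourceTable = 'Orders';

    -- Anything changed since the last successful run lands in the
    -- unprocessed-rows table and stays there until the OLAP load confirms it.
    INSERT INTO stg.Orders_Unprocessed (OrderID, CustomerID, Amount, SourceTimestamp)
    SELECT o.OrderID, o.CustomerID, o.Amount, o.ModifiedDate
    FROM   SourceDB.dbo.Orders AS o
    WHERE  o.ModifiedDate > @LastRun;

    -- After a successful load, the confirmed rows are cleared out:
    DELETE u
    FROM   stg.Orders_Unprocessed AS u
    WHERE  EXISTS (SELECT 1
                   FROM OLAPDB.dbo.FactOrders AS f
                   WHERE f.OrderID = u.OrderID);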

    In another example, I have about 50 stored procedures that are only used to perform different transformations and the like. I would not want all those extra objects to live in my production DW. I also don't need to give my developers as much access to the DW, since most of the work is done in staging, and it keeps the DW design cleaner. My staging database also contains copies of some OLTP tables that act as "helper tables" for the ETL process, and I would not want to have to restore those in the event of a DW restore.

    Another example is an ETL that converts data keyed by INT (identity) to a database keyed by GUIDs. Since I can't change the structure of either database, I use the staging database as the place that holds both the identity and the GUID for each row; that is how I map data from source to destination.
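
    The key map itself is nothing fancy; a rough sketch (all names are invented):

    CREATE TABLE stg.CustomerKeyMap
    (
        SourceCustomerID   int              NOT NULL PRIMARY KEY,      -- identity key in the source
        TargetCustomerGuid uniqueidentifier NOT NULL DEFAULT NEWID()   -- guid key in the destination
    );

    -- New source keys get a guid generated once; every later run reuses the same mapping.
    INSERT INTO stg.CustomerKeyMap (SourceCustomerID)
    SELECT s.CustomerID
    FROM   SourceDB.dbo.Customer AS s
    WHERE  NOT EXISTS (SELECT 1
                       FROM stg.CustomerKeyMap AS m
                       WHERE m.SourceCustomerID = s.CustomerID);

    -- During the load, the map supplies the destination's guid key:
    SELECT m.TargetCustomerGuid, s.Name, s.City
    FROM   SourceDB.dbo.Customer AS s
    JOIN   stg.CustomerKeyMap AS m
           ON m.SourceCustomerID = s.CustomerID;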

    So to answer your initial question: there probably was a DBA a long time ago who used a staging database, and that's because it made sense.

  • Shawn Therrien (5/15/2009)


    Okay, those are a lot of good reasons. Now most of them seem to imply that staging is on another instance of SQL Server, preferably on another machine. Ours is on the same instance/machine as the destination database.

    Actually, that's fairly typical. That's because it is possible to share resources with another process without complete disruption. However, frequently once the Load starts, part or all of the DW is unusable, because the table(s) being loaded are either locked or invalid (or both) during this process. Because of this you want to minimize this particular window, and the easiest way to do that is to have the ETL process on the same box (and to only do Loading at this point: no Extracting and minimal Transforms). This is one of the best ways to minimize the DW's outage window.

    -- RBarryYoung, (302)375-0451
    blog: MovingSQL.com, Twitter: @RBarryYoung
    Proactive Performance Solutions, Inc.
    "Performance is our middle name."

  • As well as all the things that Barry has stated, I simply won't put unknown outside data directly into its final resting spot. I always do the validation of the data in staging tables, even if someone tells me the data is guaranteed to be good. The idea of doing a 100k-row rollback because something went wrong just makes me cringe. It's much better to "certify" the data before it goes into a DB where it's gotta be right. The idea of on-the-fly exception processing on a permanent table reminds me of spaghetti, if you catch my drift.
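
    The kind of "certification" pass I mean happens entirely in the staging table before anything touches the real one. A sketch, with invented names and rules:

    -- Flag the rows that fail the checks instead of finding out mid-load:
    UPDATE s
    SET    s.RejectReason = 'Unknown customer'
    FROM   stg.Orders AS s
    WHERE  NOT EXISTS (SELECT 1
                       FROM dbo.Customer AS c
                       WHERE c.CustomerID = s.CustomerID);

    UPDATE s
    SET    s.RejectReason = 'Negative amount'
    FROM   stg.Orders AS s
    WHERE  s.Amount < 0;

    -- Only rows that pass every check ever get near the permanent table:
    INSERT INTO dbo.Orders (OrderID, CustomerID, Amount)
    SELECT OrderID, CustomerID, Amount
    FROM   stg.Orders
    WHERE  RejectReason IS NULL;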

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing set-based code: stop thinking about what you want to do to a ROW; think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.



  • Hi Shawn,

    One simple question: how do you perform the incremental load changes? Here we are using a lookup and conditional split to check whether it's a new row or an updated record. Is that a good approach? Currently we are using Informatica and we are planning to migrate to SSIS 2005.

    Any help would be much appreciated.

    Thanks,

    Balaji L

  • Balaji (5/18/2009)


    Hi Shawn,

    One simple question: how do you perform the incremental load changes? Here we are using a lookup and conditional split to check whether it's a new row or an updated record. Is that a good approach? Currently we are using Informatica and we are planning to migrate to SSIS 2005.

    Any help would be much appreciated.

    Thanks,

    Balaji L

    Basically I have two paths: first, each record is compared against a lookup to see if it already exists. If it does, it continues on; if not, it goes straight to the destination.

    If a lookup match is found, a conditional split checks whether the relevant data is still correct. If it is, you can just let the row drop out gracefully. Right now I have those rows diverted to a Data Flow component so I can see what's happening. If the data is different, then I update it.

    Hope that helps a little.

    -- Edit

    Added an image of what my logic looks like and the controls being used. I still need to finish the actual update so that it actually updates rather than just giving a count of the records it would update.

    Odd that this forum doesn't display uploaded pictures, so I guess you have to click on the link for the picture.

  • Shawn, how are you doing the update in the data flow task?

    I am taking a similar approach but using a merge join instead of a lookup.

    I am using an OLE DB Command to perform the update. It runs pretty quickly since I am not processing many rows on each run.

    I would like to replace the OLE DB Command with something faster, though, since some runs do process a higher number of rows.

    I've attached an example.

    In my first conditional split I am checking for a NULL in the right-hand table; in the second I am comparing checksums (generated by T-SQL) between the two tables.

    Any thoughts on this approach would be appreciated. Also, would a lookup be faster than the merge join? In my OLE DB sources I use only T-SQL and retrieve just the columns I need, similar to how I would set up my lookups.
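
    For reference, the checksum comparison in straight T-SQL looks something like this (column names are invented):

    -- Rows whose checksums differ are the ones that need an update.
    SELECT s.OrderID
    FROM   stg.Orders AS s
    JOIN   dbo.Orders AS d
           ON d.OrderID = s.OrderID
    WHERE  CHECKSUM(s.CustomerID, s.Amount, s.Status)
        <> CHECKSUM(d.CustomerID, d.Amount, d.Status);
    -- CHECKSUM can collide; HASHBYTES('SHA1', ...) is safer if that is a concern.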

  • I'm using the OLE DB Command. I hear it's not terribly efficient, but I'll eat my hat if I'm importing more than 300 records per table. We're looking at this mostly as a learning experience and a way to use SSIS as a tool to verify or validate the data.

    If we were looking for performance on this update, I suppose we would use a component to execute a single set-based SQL statement instead.
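
    If we did, one common pattern (names here are placeholders) is to land the "needs update" rows from the data flow into a small work table and then have an Execute SQL Task run a single set-based statement:

    -- One joined UPDATE replaces thousands of per-row OLE DB Command calls.
    UPDATE d
    SET    d.Amount = u.Amount,
           d.Status = u.Status
    FROM   dbo.Orders AS d
    JOIN   stg.OrdersToUpdate AS u
           ON u.OrderID = d.OrderID;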
