Should I create a Staging Instance or Staging Database on the same Production Server? (Can only do SAME PRODUCTION SERVER)

  • Hello All,
    I currently manage a data processing department for a Marketing firm and we deal with TB's of data.

    We manage multiple lists (Databases) per client based on consumer address etc etc , so I do a bunch of updates and ADDRESS changes weekly/Monthly. Currently pushing maybe 50 million records per month if not more?

    Should I create a staging environment on another INSTANCE (same server as the production but different instance) specifically for Data Processing (ETL) and then push it to a production table (final master table of good addresses)?

    In short should i :

    1. Create an instance for staging? 
    or
    2. Create a DATABASE for Staging (on different drives apart from production)

  • If this staging database would only be for ETL type purposes, I think it would be easier to work with on the same instance, as you could use 3 part object names [dbname].[schemaname].[objectname] to reference items from the staging database, and security setup would also be easier for whatever user(s) are actually doing the final processing from the staging database into the live database.

    What did you think would be the advantage of having a separate instance?

  • Hey thanks for that! 
    I was hoping to optimize the speed and basically understand the "CORRECT" way of doing data processing.

    We data process so millions ON THE SAME PRODUCTION DATABASES, creating tables, indexes etc etc. Trying to find the best practice in creating the infrastructure.

  • Chris Harshman - Thursday, August 30, 2018 12:02 PM

    If this staging database would only be for ETL type purposes, I think it would be easier to work with on the same instance, as you could use 3 part object names [dbname].[schemaname].[objectname] to reference items from the staging database, and security setup would also be easier for whatever user(s) are actually doing the final processing from the staging database into the live database.

    What did you think would be the advantage of having a separate instance?

    Just a recommendation... don't use 3 part naming in the actual code.  Only use synonyms that use the 3 part naming and the code would point to the synonyms.  That way, if you ever have to "do things differently or someplace else", all you have to change is where they synonyms point and you won't have to find and change any code.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Personal recommendation based on experience... I would keep the databases on the same instance. Some of the reasons, in no particular order being:

    1. If you are doing any cross instance queries (other than SSIS) you need to use a fully qualified name to access objects, via a Linked Server. This can result in really poor query plans. OPEN_QUERY is a lot better, but takes more work.
      If the databases are on the same named instance, SQL does a much better job of optimizing the queries.
    2. It may seem trivial, but adding the 2nd instance does add some CPU and RAM overhead.
    3. If at a later stage you want to migrate/upgrade to other servers, or even move code between dev/test and prod, you either need the same Linked Server name on both original and destination servers, or you need to modify the code during migration.
    4. You also have to think about management, an extra instance to patch, backup master, MSDB, Monitor, etc.
    On the other hand, I'm not aware of any compelling reason to have a separate named instance.

    Cheers
    Leo
    Nothing in life is ever so complicated that with a little work it can't be made more complicated.

    Leo
    Nothing in life is ever so complicated that with a little work it can't be made more complicated.

  • dtran1127 - Thursday, August 30, 2018 11:31 AM

    Hello All,
    I currently manage a data processing department for a Marketing firm and we deal with TB's of data.

    We manage multiple lists (Databases) per client based on consumer address etc etc , so I do a bunch of updates and ADDRESS changes weekly/Monthly. Currently pushing maybe 50 million records per month if not more?

    Should I create a staging environment on another INSTANCE (same server as the production but different instance) specifically for Data Processing (ETL) and then push it to a production table (final master table of good addresses)?

    In short should i :

    1. Create an instance for staging? 
    or
    2. Create a DATABASE for Staging (on different drives apart from production)

    As (almost) always, it "depends". If for example, you are running a SQL Server Standard Edition on a beefy hardware with more memory than is usable by a single instance, a second instance makes sense as it allows you to utilize more of the hardware resources. If you are running an Enterprise Edition, then this hardly makes sense as you can then balance the resources using the Resource Governor. 
    😎

    Whatever you do, do as Jeff recommends, do NOT use three part naming in the code! That will only cause headaches down the line.

  • Heh... if the code is written properly, you won't need to use the likes of "Resource Governor" especially if there are only 50 million rows per month that are of concern.  That's only 1.6 million rows per day, if you want to measure things that way.  Even a single batch of 50 million rows shouldn't cause issues on today's machines.  Performance of such things is in the code... or not. 😉

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Getting back to the original question, it's a extremely rare thing when I'll load external data directly to a final table.  With those extremely rare exceptions, I almost always load data into staging tables on a separate database, usually on the same machine.  There are many reasons for this.  Here are two of the most important to me...

    1. The data doesn't actually need to ever be backed up because, not only is the data transitory, but it already exists in the original files, which I compress and archive after I've had my way with them.  If I make no changes to the code that is used in the database, there is no need for a backup and the backups are relatively tiny because I'll simply truncate or drop any of the "import" tables, which are frequently built at runtime because it's convenient to do so.  They are, in all actuality, a form of temporary table that's totally expendable.

    2.  There are huge advantages to having ETL processes in a database that is set to the SIMPLE Recovery Model, not the least of which is minimal logging, which usually doubles the performance of the code because it's only doing about half the writes to disk.  Having the processes in a separate database allows for that without interfering at all with the point-in-time recovery of the other databases.

    Having said "ETL" database on the same instance as where the final resting place of the data will be is a huge advantage in many ways, as well.
    1.  Using synonyms, it's trivial to read from those final databases if data is needed from them for validation of the new data.
    2.  Using synonyms, it's also trivial to insert, update, or delete row in those final databases based on the prevalidated data that has been imported, verified, and cleaned in the ETL database.

    Some will say "Well!  You can't do DRI across databases unless you use a trigger!".  True enough about not doing implied DRI but you don't use triggers for this type of stuff.  You use stored procedures as part of the validation process.  Also, you'd have similar but more complex problems if the ETL data lived on a separate box including but not limited to buying another set of licenses for Windows and SQL Server (to name two), doing system level backups, an extra box to manage security for, and establishing a trust between the two systems that's still secure.

    I'll also add that if the box appears to not be able to handle both ETL and the normal traffic of the other databases, the problem is usually with the code and not the fact that you're running both ETL and OLTP on the same box.  And if the box IS underpowered for such a task, my humble opinion is that it would be better to beef up the one box with future scalability in mind than to ignore the weakness of the current box and standing up another box.

    And, no... I have no documentation on any of that... just years of experience where people have stood up another box instead of dealing with the real problem and that problem is in the code.  I've been involved in a whole lot of projects where people thought otherwise and the standing up of an additional box did nothing for performance (the code still sucked) plus they had the extra overhead of going across boxes and their code sucked there, as well.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Eirikur Eiriksson - Sunday, September 2, 2018 11:42 PM

    As (almost) always, it "depends". If for example, you are running a SQL Server Standard Edition on a beefy hardware with more memory than is usable by a single instance, a second instance makes sense as it allows you to utilize more of the hardware resources. If you are running an Enterprise Edition, then this hardly makes sense as you can then balance the resources using the Resource Governor. 
    😎

    Whatever you do, do as Jeff recommends, do NOT use three part naming in the code! That will only cause headaches down the line.

    Yes, sorry, I should have been more clear in my post.  Whenever I or the developers I work with do something that uses multiple databases, I have them use synonyms to simplify the name references in code.  The synonyms would then point to the 3 part named object in the other database.

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply