Database Architecture Question

  • Hi,

    I know there's always 100 different ways to skin a cat, so I thought I'd see what you recommend...

My organization is building a data warehouse, and for the ETL process, we want the source database not to be our production database, but rather a copy of our production database. We're using Ola's scripts to take backups, so it would seem that one option is simply to use the backup from server A and automate its restore to server B, where it becomes the source database. Is there a good reason not to go this route? If it makes sense, what do you think is a good way to automate this? PowerShell? Specifically dbatools.io? T-SQL?

OK, so automated restore aside, I'm starting to wonder if I could kill two birds with one stone here. Would this situation be a good candidate for an AlwaysOn AG? That way, we'd have a second database, otherwise sitting idle, for the ETL process to run off of, plus the assurance of knowing that if anything were to happen to our production database, this second database would kick in.

Is there yet another option that makes more sense than either of these? My experience is much more in database development than in architecture or administration, so I'm trying to get a handle on the benefits and disadvantages of the various options.

    Thanks in advance,

    Mike

    Mike Scalise, PMP
    https://www.michaelscalise.com

I don't know what your hardware is like, but we use SAN snapshots for things like this.  It's nearly instantaneous.  Each SAN manufacturer seems to have its own way of doing it.  YMMV, and it might be an "extra" that you'd have to purchase, but, speaking from experience, it's well worth it.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

Thanks for weighing in, Jeff. So if I'm understanding you correctly, you end up with an exact clone of the box, and the ETL tools would sit on a separate box and just use this snapshotted instance for their ETL work?

    Mike

    Mike Scalise, PMP
    https://www.michaelscalise.com

  • Mike Scalise - Sunday, March 26, 2017 7:11 AM

Thanks for weighing in, Jeff. So if I'm understanding you correctly, you end up with an exact clone of the box, and the ETL tools would sit on a separate box and just use this snapshotted instance for their ETL work?

    Mike

    It's actually better than that.  For us, we end up with an exact clone of the databases from prod and it doesn't touch the databases that aren't on prod.  That means that all of the ETL and DBA and other tools sit safely in their own database at the ready.

Again, depending on the software associated with a given SAN, some of the tools will allow you to pick and choose just the databases you want to clone.  That's the best.

For us, the SAN tools weren't so kind.  In our case, the SAN creates "disk snapshots".  All of our data files (MDF/NDF) are on the M and N drives (a mix of MDF and NDF on both) and our log files are on the L drive.  Basically, it's an all-or-nothing situation, and we couldn't actually have that because, of course, we didn't want the system databases to be overwritten, nor did we want the like-named DBA or UTIL databases to be overwritten on the target machine, which has its own M and L drives (no N drive, because the things that must be preserved during a snapshot are small).  This also complicated things a bit because we need to be able to auto-magically attach any new databases that may appear in prod.

So we have all of these MDF/NDF/LDF files sitting out there on the snapshotted V, W, and N drives, and we couldn't attach the files because they're on differently lettered drives.  Further, it would be silly to have to manually change things during attaches because there are a whole lot of databases that need to be attached.  With that, I wrote a proc that automatically detaches all databases that aren't on an "exclusion list" (system databases and other databases for ETL procs, DBA stuff, and utilities).  Then they pop the snapshot (we're working on a command-line interface (CLI) for that so it's no longer manual, but it takes only minutes for them to do).  Once the new snapshot is in place, another proc I've written rips through the snapshot drives, finds all of the MDFs, interrogates them for the names of the NDF and LDF files, reassigns the correct drive letters for V, W, and N as part of the attach code it generates, and then attaches all of the databases on the snapshot drives that haven't been excluded.

The detach proc takes just seconds to run and forces people out of databases they may be using by setting the databases to single user with an immediate rollback.  The attach proc doesn't take much longer than the detach proc does.
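
If it helps to picture it, the heart of the detach step looks roughly like this (a minimal sketch, not my actual proc; the exclusion-list names are made up):

-- Kick users out, then detach every database that isn't on the exclusion list.
DECLARE @sql nvarchar(max) = N'';

SELECT @sql = @sql
     + N'ALTER DATABASE ' + QUOTENAME(name)
     + N' SET SINGLE_USER WITH ROLLBACK IMMEDIATE; '
     + N'EXEC sp_detach_db @dbname = ' + QUOTENAME(name, '''') + N'; '
FROM sys.databases
WHERE database_id > 4                              -- skip the system databases
  AND name NOT IN (N'DBA', N'Util', N'ETLWork');   -- hypothetical exclusion list

EXEC sys.sp_executesql @sql;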

That sounds like a fair bit of work, and it was, but that's only because of the SAN software for the SAN we bought.  It's well worth it, though, because instead of restores taking over an hour for our 2TB system, which also drives the "pipe" a bit nuts for that hour, it's all done in something less than 10 minutes, including the current manual interaction required to pop the snapshot.  Once we figure out the CLI for the snapshot, we'll be able to automate it; it'll take less than a minute and can be scheduled through the SQL Server Agent to run automatically every morning or on demand when needed.

Again, don't let that scare you.  Check with your SAN vendor.  I've previously worked on SANs (EMC, if I recall correctly) where all of that was automatic, without having to write a lick of code.  The key in all cases is to have the ability to exclude certain databases, like the system databases and any special dedicated databases that support DBA, ETL, and Util functionality.  If you do run into the same nuances that I did, and I haven't gotten around to writing an article on it yet, send me an email and I'll share my code with you.  Of course, you'll need to modify it a bit for your particular case, but it's not difficult to do so.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

If you can set up and use AlwaysOn Availability Groups with a read-only secondary, you get several advantages.  The read-only secondary can be used for ad hoc reporting, real-time reporting, etc., as well as to support your ETL processes.

That read-only secondary could also be set up as an HA solution, giving you a 'clustered' environment for failover.  Because AlwaysOn does not require shared storage, you can also place that read-only secondary (or more secondaries) in another location to ensure business continuity in the event of a disaster.

Another advantage is the ability to offload your backups to the read-only secondary so your production environment is not impacted.  You still have to perform other maintenance on the primary, though, such as integrity checks, index rebuilds/reorganizes, and statistics updates.
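
As a rough sketch of how a backup job can decide whether it should run on the current replica (the database name and backup path are placeholders):

-- Run the backup only if this replica is the preferred backup replica
-- (assumes the AG's AUTOMATED_BACKUP_PREFERENCE favors secondaries).
IF sys.fn_hadr_backup_is_preferred_replica(N'ProdDB') = 1
BEGIN
    BACKUP DATABASE ProdDB
        TO DISK = N'\\BackupShare\ProdDB_FULL.bak'   -- hypothetical UNC path
        WITH COPY_ONLY, COMPRESSION;                 -- full backups on a secondary must be COPY_ONLY
END

If I remember right, Ola's backup procedure performs a similar preferred-replica check for AG members.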

The downside to all of this is cost: to have a read-only secondary, it has to be fully licensed based on the server configuration.  Note: depending on the usage, you don't have to have like for like... if the secondary is truly read-only and only utilized for reporting, you can reduce the number of CPUs and the memory for that system.  If you want this as a failover target, it should have the same resources as your production system.

Any SAN solution is going to have specific requirements for storage layout.  These utilities generally work at the volume level, and if you do not have your system databases separated out to their own volume, and the database(s) you want to clone are not on separate volumes, you end up cloning more than you really need, and you may not even be able to clone the existing volumes as is.

Automating a daily restore process is doable and could be a solution for you, but it depends on how large your databases are and how long it takes to perform the restore.  The advantage of this method is that you also test and validate your backups, and you can include your transaction log backups to ensure you are able to restore to a point in time.  You can also offload integrity checks (if timing permits) to reduce the workload on your production environment, and you could save on licensing by using Standard Edition for this instance.  Note: unless you have a private backup network that is incredibly fast, you would want to restore from a local drive, so factor in how long it takes to copy the backup file across the network.  You could also mirror the backup drive, split the mirror, and present the mirrored copy to your reporting server, then restore from the backups on that mirrored copy... again, it depends on how long the copy and restore take.
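
A minimal sketch of what that nightly restore might look like (the backup path and logical file names are placeholders; Ola's jobs would produce the actual .bak):

-- Sanity-check the backup before trusting it.
RESTORE VERIFYONLY FROM DISK = N'D:\Backup\ProdDB_FULL.bak';

-- Overwrite yesterday's copy on the reporting/ETL source server.
RESTORE DATABASE ProdDB_Source
FROM DISK = N'D:\Backup\ProdDB_FULL.bak'
WITH MOVE N'ProdDB'     TO N'M:\Data\ProdDB_Source.mdf',   -- assumed logical data file name
     MOVE N'ProdDB_log' TO N'L:\Log\ProdDB_Source.ldf',    -- assumed logical log file name
     REPLACE, STATS = 10;

If you would rather drive it from PowerShell, dbatools' Restore-DbaDatabase can generate the same kind of restore from a backup folder.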

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

Good stuff, Jeffrey.  Hadn't thought about using an AG for this because we don't use AGs where I work.  They have some hardware solution going on using "Nimble".  I learned something new.  Thanks!

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeffrey Williams 3188 - Sunday, March 26, 2017 10:16 AM

    Jeffrey and, well, Jeff:

Thank you both for your input. I have a few follow-up questions:

1) Regarding AlwaysOn AG -- that feature is available in the Standard Edition of SQL Server 2016, correct? I believe it's the first version to include it in Standard Edition.
    2) I don't think this will ultimately be the case, but could the primary be 2016 Enterprise and the secondary be 2016 Standard? I assume 2016 Std to 2016 Std is a valid scenario.
    3) If I'm understanding correctly, when data is written to the primary, it's also written (or copied?) to the secondary in near real time? If so, am I to expect any kind of performance impact on the primary for having to write twice?

    Thanks again,

    Mike

    Mike Scalise, PMP
    https://www.michaelscalise.com

  • Does anyone know the answer to any or all of my follow-up questions? I've googled most of this, but I would just like confirmation if you know the answer(s).

    Thanks,

    Mike

    Mike Scalise, PMP
    https://www.michaelscalise.com

  • Mike Scalise - Thursday, March 30, 2017 12:23 PM

    Jeffrey and, well, Jeff:

Thank you both for your input. I have a few follow-up questions:

1) Regarding AlwaysOn AG -- that feature is available in the Standard Edition of SQL Server 2016, correct? I believe it's the first version to include it in Standard Edition.
    2) I don't think this will ultimately be the case, but could the primary be 2016 Enterprise and the secondary be 2016 Standard? I assume 2016 Std to 2016 Std is a valid scenario.
    3) If I'm understanding correctly, when data is written to the primary, it's also written (or copied?) to the secondary in near real time? If so, am I to expect any kind of performance impact on the primary for having to write twice?

    Thanks again,

    Mike

1) I have not had a chance to verify the restrictions, but I have seen that it is available.  The limitations I am aware of do not allow the secondary to be used as a read-only replica; the secondary is only available for DR.  However, you can always set up a database snapshot of the 'mirrored' database and use that snapshot for reporting.
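
Creating such a snapshot is nearly a one-liner; a sketch with placeholder names and paths:

-- A database snapshot over the secondary copy, usable for read-only reporting.
-- List one NAME/FILENAME pair per data file of the source database.
CREATE DATABASE ProdDB_Reporting
ON ( NAME = N'ProdDB',                             -- logical name of the source data file
     FILENAME = N'M:\Snap\ProdDB_Reporting.ss' )   -- sparse snapshot file
AS SNAPSHOT OF ProdDB;

To refresh it for the next reporting cycle, you drop the snapshot and create it again.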

2) No.  You have to have the same edition for all replica instances.

3) Performance will be impacted if you select synchronous mode, which means the system has to commit the transaction on the secondary prior to completing the transaction on the primary.  Asynchronous mode commits the transaction on the primary and sends it to the secondary in the background, so the primary does not wait on the secondary.
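
Switching a replica between the two modes is a simple ALTER (the AG and replica names are placeholders):

-- Asynchronous commit: the primary hardens its own log and moves on,
-- streaming log blocks to the secondary in the background.
ALTER AVAILABILITY GROUP [MyAG]
MODIFY REPLICA ON N'SQLNODE2'
WITH (AVAILABILITY_MODE = ASYNCHRONOUS_COMMIT);
-- Note: an asynchronous replica also requires FAILOVER_MODE = MANUAL.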

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • Thank you!

    Mike Scalise, PMP
    https://www.michaelscalise.com

  • Mike Scalise - Thursday, March 30, 2017 12:23 PM

1) Regarding AlwaysOn AG -- that feature is available in the Standard Edition of SQL Server 2016, correct? I believe it's the first version to include it in Standard Edition.

No. What's in Standard Edition is Basic Availability Groups: one primary replica, one synchronous replica for failover, not readable. Readable secondaries (and having more than one secondary) are features of the 'full' availability groups, and those are Enterprise only.
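
For reference, a Basic AG is declared with the BASIC option; a rough sketch with placeholder server and database names:

-- One database, one secondary, no read access on the secondary (Standard Edition).
CREATE AVAILABILITY GROUP [BasicAG]
WITH (BASIC)
FOR DATABASE [ProdDB]
REPLICA ON
    N'SQLNODE1' WITH (ENDPOINT_URL = N'TCP://sqlnode1.example.com:5022',
                      AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
                      FAILOVER_MODE = AUTOMATIC),
    N'SQLNODE2' WITH (ENDPOINT_URL = N'TCP://sqlnode2.example.com:5022',
                      AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
                      FAILOVER_MODE = AUTOMATIC);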

    2) I don't think this will ultimately be the case, but could the primary be 2016 Enterprise and the secondary be 2016 Standard? I assume 2016 Std to 2016 Std is a valid scenario.

    No.

    3) If I'm understanding correctly, when data is written to the primary, it's also written (or copied?) to the secondary in near real time? If so, am I to expect any kind of performance impact on the primary for having to write twice?

Yes, but not a large one if you have adequate hardware and a decent network link.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster - Tuesday, April 4, 2017 3:31 AM

    Thank you, Gail!

I didn't realize there were two versions of AG -- a basic and a full. One more question--does the secondary need to be licensed like the primary? For example, if my primary is SQL Server 2016 Ent and my secondary is Developer, is that allowed, and would it even work, since Developer is effectively Enterprise?

    Mike

    Mike Scalise, PMP
    https://www.michaelscalise.com

  • Mike Scalise - Tuesday, April 4, 2017 7:31 AM

    Thank you, Gail!

    I didn't realize there were two versions of AG -- a basic and a full. One more question--does the secondary need to be licensed like the primary?

Depends. Read the licensing guide for details. If it's purely for failover, has no activity other than that, and you have SA (Software Assurance; there may be other considerations too), then no. Otherwise, yes.

    For example, if my primary is SQL Server 2016 Ent and my secondary is Developer, is that allowed

No. Developer edition is not permitted for production usage (and failover of a production DB is production usage).

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Why do you want to use a copy of your Production data as the source for your DW?
If the primary reason is to get a copy of the data at a fixed point in time, then you can do this using standard SQL Server database snapshots.
If the primary reason is to minimise IO on the production server, the impact can be less than you might think.  A typical DW implementation extracts the changes to be applied to staging tables and does any data cleansing, etc., that might be needed in the staging environment.  The DW then gets updated from staging.  These staging tables can be on a separate server, with the result that there is relatively little IO on the main production server.
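
A sketch of that kind of incremental extract, with made-up table and column names:

-- Pull only the rows changed since the last load into a staging table
-- (the watermark table, schemas, and columns are all illustrative).
DECLARE @LastLoad datetime2 =
    (SELECT LastLoadTime FROM etl.LoadControl WHERE TableName = N'Orders');

INSERT INTO staging.Orders (OrderID, CustomerID, OrderDate, Amount, ModifiedDate)
SELECT OrderID, CustomerID, OrderDate, Amount, ModifiedDate
FROM ProdServer.ProdDB.dbo.Orders          -- e.g. via a linked server from the staging box
WHERE ModifiedDate > @LastLoad;            -- only rows changed since the last run

UPDATE etl.LoadControl
SET LastLoadTime = SYSUTCDATETIME()
WHERE TableName = N'Orders';
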
    If you have a different reason why you feel you need to have a complete copy of your production data to be the source of your DW, can you share it with us?

Original author: https://github.com/SQL-FineBuild/Common/wiki/ 1-click install and best practice configuration of SQL Server 2019, 2017, 2016, 2014, 2012, 2008 R2, 2008 and 2005.

    When I give food to the poor they call me a saint. When I ask why they are poor they call me a communist - Archbishop Hélder Câmara

  • EdVassie - Wednesday, April 5, 2017 3:44 AM

    Hi Ed,

You're correct--we want to extract the changes and apply them to the staging tables for data cleansing and manipulation... and those staging tables reside on a separate server from our production database. However, the thought was: since we're already taking nightly backups of our production database, why wouldn't we just restore that backup on the server with the staging tables, so that the ETL process can extract the information from that database and have no negative impact on our actual live database?

    So perhaps I should have said that the copy of the production database would be the source of the staging tables and not the DW itself. Regardless, given my explanation, what would you suggest?

And thanks, all, for bearing with me. Some of these concepts are a bit new to me, so I'm trying to get a handle on them.

    Mike

    Mike Scalise, PMP
    https://www.michaelscalise.com
