Backup server

  • Admingod

    SSCertifiable

    Points: 5892

    We will have backup server maintenance with an outage of between 15 minutes and 1 hour. There won't be log backups during that time, and the business is aware of that. I don't see any issues with doing this during business hours instead of after hours. Do you agree, or do you anticipate any other issues? Please advise.

  • Mr. Brian Gale

    SSC-Insane

    Points: 23080

    My preference for any production-level server is that maintenance is done in a controlled maintenance window outside of business hours where possible.

    You are giving yourself a 1 hour window to do the maintenance. During this time the company is running, database changes are happening, and backups of the database are not. This means your RTO and RPO are at risk during this time too.

    Let's say your maintenance takes exactly an hour (60 minutes). At the 59 minute mark, your database server shuts down and won't come back up. You need to restore from backup. Are you OK with that hour of data being gone? If you have a good "downtime" maintenance window, database changes shouldn't be happening, and thus there is no data loss in this worst-case scenario.

    And the other issue - if you do it during company uptime, you have a LOT more chance of interruptions. Support calls, for example. My maintenance downtime windows are specifically for that - maintenance. I focus 100% on the maintenance and don't change tasks until that maintenance is done. Therefore, I want minimal interruptions and I do not work on support calls during that window.

    I like to hope for the best, but plan for the worst.

  • andreas.kreuzberg

    SSCertifiable

    Points: 6055

    Hi,

    If you can't take log backups, your transaction log will grow. Whether you can handle that growth for one hour depends on your application.

    Kind regards,

    Andreas

  • Grant Fritchey

    SSC Guru

    Points: 396622

    Mr. Brian Gale wrote:

    I like to hope for the best, but plan for the worst.

    This sums it up. You're taking a risk. It'll probably be OK. But it might not be.

  • Jeff Moden

    SSC Guru

    Points: 996863

    Admingod wrote:

    We will have backup server maintenance with an outage of between 15 minutes and 1 hour. There won't be log backups during that time, and the business is aware of that. I don't see any issues with doing this during business hours instead of after hours. Do you agree, or do you anticipate any other issues? Please advise.

    "It Depends".  If the log files aren't being backed up during that time frame, are any of them going to outgrow the disks they reside on?
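
    A minimal sketch of that check, assuming you have already measured how fast each log grows when no log backups run. The function name, the 50 MB/min rate, the 10 GB of free space, and the 2x safety factor are all hypothetical values for illustration, not figures from this thread.

```python
def log_fits_during_outage(log_mb_per_min: float,
                           outage_min: float,
                           free_mb: float,
                           safety_factor: float = 2.0) -> bool:
    """Return True if projected log growth, with headroom, fits on the drive.

    log_mb_per_min: measured log generation rate (assumed input)
    outage_min:     how long log backups will be unavailable
    free_mb:        free space on the volume holding the log file
    safety_factor:  multiplier in case the outage overruns (assumption)
    """
    projected_mb = log_mb_per_min * outage_min * safety_factor
    return projected_mb <= free_mb


# Hypothetical example: 50 MB/min, 60-minute outage, 10 GB free.
print(log_fits_during_outage(50, 60, 10 * 1024))
```

    Running this per database with real measurements shows which servers, if any, need extra log space or a temporary local log backup before the outage.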

    --Jeff Moden

  • Admingod

    SSCertifiable

    Points: 5892

    Thanks, I agree with all of you. What if you have lots of servers across the board? It would be hard to find a good maintenance window, right? So there would be a risk no matter what. Do you agree?

  • homebrew01

    SSC Guru

    Points: 55188

    Could you set up a local maintenance plan to take transaction log backups during the maintenance window?

  • Mr. Brian Gale

    SSC-Insane

    Points: 23080

    Where I work, I have 60+ SQL instances with roughly 5 databases per instance. Not an ideal setup with 3 physical servers, but it works for what we need.

    We have our backup drive as a network share on a virtual host that has 2 failover servers. So if I need to do maintenance on the backup drive server, I fail it over to a secondary server while I do maintenance and I fail it back when I am done. The downtime for that is roughly 1 second. So as long as no backups are running at the time, the failover is fast enough that nobody notices.

    Now, if the physical disk hosting the backups were to require maintenance, our SAN team (likely the SAN itself) would handle failing that over to a secondary disk and it would be completely transparent to me.

    SAN maintenance, on the other hand, requires full downtime at my company. If you power off or reboot the SAN, all the VMs will reboot too, as their disks just went poof! Thankfully, getting full downtime like that isn't too difficult - we don't operate on weekends or evenings (most of the time), so downtime from Saturday at 10:00 AM until Sunday at 10:00 PM is not an unrealistic request.

  • Admingod

    SSCertifiable

    Points: 5892

    homebrew01 wrote:

    Could you set up a local maintenance plan to take transaction log backups during the maintenance window?

    We have multiple servers. So are you suggesting that we change all backups (you mean log backups?) to local until the maintenance is complete?

  • Admingod

    SSCertifiable

    Points: 5892

    The storage guys found a window where there is no activity on the backup server. Would you agree that would be the best window?

  • Mr. Brian Gale

    SSC-Insane

    Points: 23080

    "no activity on the backup server" to me means "no backup jobs are running at this time from SQL", correct?  Are you 100% confident that your work will fit in that 1 hour window?  This is assuming the maintenance goes completely sideways and you need to undo the work (not a "rebuild the server from scratch" worst case scenario, but a more optimistic worst case scenario such as if your maintenance is to power off the server and replace the batteries in the UPS due to their age and when you go to power up the UPS, the new battery doesn't work so you need to put the old battery back in).

    My rule of thumb for a downtime window is to pick the most realistic worst case scenario (not the absolute worst of the server room exploding in the middle of maintenance for example), estimate the time to do that process, double the time so I have some time to debug any problems that came up, and add another 25% onto that so I have a LOT of time.
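
    The arithmetic in that rule of thumb can be sketched as below. This reads "add another 25% onto that" as 25% of the doubled figure, which is one interpretation of the wording; the function name and the 60-minute input are hypothetical.

```python
def maintenance_window_minutes(realistic_worst_case_min: float) -> float:
    """Size a downtime window from a realistic worst-case estimate.

    Step 1: double the estimate to leave time for debugging.
    Step 2: add 25% on top of the doubled figure (assumed reading
            of the rule of thumb).
    """
    doubled = realistic_worst_case_min * 2
    return doubled * 1.25


# Hypothetical example: the realistic worst case is estimated at 60 minutes.
print(maintenance_window_minutes(60))
```

    Under this reading, a job whose realistic worst case is one hour calls for a 2.5 hour window, well beyond the 15 minutes to 1 hour discussed above.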

    Depending on the maintenance being performed, your realistic worst case scenario may be that the server doesn't come back online after you reboot it.  I do not know your scenario.

    If it was me, I would be requesting to move the backup disk off of the backup server while the maintenance is being performed and move the disk back afterwards.

    In your particular case, it may be acceptable to use that window or might not.  At my workplace, we have our backups running on a server with failover to 2 other servers.  So unless all 3 servers are offline, our backups continue to run without interruption (unless a failover happens in the middle of a backup).  I do not like having a window where backups might not run and having a single backup server is a very scary thought to me.

  • Admingod

    SSCertifiable

    Points: 5892

    Thanks, Brian. I like the idea of requesting to move the backup disk off the backup server while the maintenance is performed. Can you give more details on this?

  • Sue_H

    SSC Guru

    Points: 90707

    Mr. Brian Gale wrote:

    At my workplace, we have our backups running on a server with failover to 2 other servers.  So unless all 3 servers are offline, our backups continue to run without interruption (unless a failover happens in the middle of a backup).  I do not like having a window where backups might not run and having a single backup server is a very scary thought to me.

    I think that's a better strategy. With a single backup server and no redundancy, your backup server is a single point of failure. For backups, of all things. It makes no sense to me to have things set up with a single point of failure for backups.

    Sue

  • Admingod

    SSCertifiable

    Points: 5892

    Just to clarify about the maintenance, let me recap. The maintenance is performed on the backup server by the storage team, so the backup volumes will be unavailable during the maintenance and available again once it is complete. Basically, it will appear as a reboot of the backup server while the virtual server is moved from old to new storage.

  • Mr. Brian Gale

    SSC-Insane

    Points: 23080

    In that case, my approach would be completely different. If you are changing the back-end disks, I would not do it the way you are planning. My way might be more work (VERY VERY likely to be more work), but it is much safer (my opinion).

    What I'd do - make a secondary backup server VM now.  Make that secondary backup server VM have the NEW disk from the new SAN (presuming when you say "new storage" you mean new SAN).  At this point, you have 2 backup servers and 2 locations where you can store your backups.  The new server has no backup data there, but the old one has all of your backups.  Next, find your downtime window and copy the data over (unless your storage admins already moved the data, then skip this step).  Next, set up rsync on both servers to copy over new files at some point after the backup has completed.  This way backups between the two servers are in sync AND you have a secondary backup server.

    Now, go into each and every SQL backup job that you have. My next bit of advice changes depending on how you do backups. If you have your own in-house scripts, hopefully they have a parameter for the backup location and backup file names. This is the easiest (my opinion) way to handle it. You run the backup against the primary backup server. If that stored procedure call fails (i.e. the backup fails), assume the VM is offline and fire your backup to the secondary backup server. If you have maintenance plans in place instead of an in-house backup script, then set the on-failure action of your backup step to run a backup to the secondary server. If you use Ola Hallengren's scripts, I don't know much about those, but I imagine it is a similar process to the "in-house scripts". If you have some other method for backups, then follow their steps. Either way, get at LEAST 1 secondary server for your backups and set up some form of replication of the backup data (that is what rsync will do for you).
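
    The "try the primary, fall back to the secondary" logic can be sketched as follows. This is only an illustration of the control flow, not anyone's actual backup script: `backup_with_fallback`, `BackupError`, the share paths, and the `fake_backup` stand-in are all hypothetical, and a real job would call your own backup procedure where `run_backup` is invoked.

```python
from typing import Callable, Sequence


class BackupError(Exception):
    """Raised by the (hypothetical) backup callable when a target fails."""


def backup_with_fallback(database: str,
                         targets: Sequence[str],
                         run_backup: Callable[[str, str], None]) -> str:
    """Try each backup target in order and return the one that worked."""
    errors = []
    for target in targets:
        try:
            run_backup(database, target)
            return target
        except BackupError as exc:
            # Assume this target is offline and move on to the next one.
            errors.append(f"{target}: {exc}")
    raise BackupError("all backup targets failed: " + "; ".join(errors))


def fake_backup(database: str, target: str) -> None:
    """Stand-in for a real backup call; pretends the primary is down."""
    if "primary" in target:
        raise BackupError("share unreachable")


used = backup_with_fallback(
    "SalesDB",
    [r"\\backup-primary\sql", r"\\backup-secondary\sql"],
    fake_backup,
)
print(used)
```

    Raising an error only after every target has failed keeps the job's failure alert meaningful: a single offline backup server does not page anyone, but losing both does.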

    This might sound like overkill, but if you ever hit a point where your primary backup server VM crashes during a disaster recovery moment, you can just flip on over to the secondary and let the server admin team deal with the down server while you continue restoring your databases!  Plus, you can happily do maintenance on the primary backup server and know that the secondary will handle the workload and your backups will run happily.

    The above assumes that the OLD storage is still available for use.

    If you absolutely cannot leave the backup server on the old storage for some reason (old storage being retired), I would request enough disk to do the above with a minimum of 2 VMs, even if they are using the same physical disk. This way, when you need to do maintenance on a single backup server, you can fail the disk over and have the backup jobs continue to run happily.

    My preferred backup hardware strategy is to have a minimum of 2 servers (physical or VM) with isolated disks (so one disk corrupting doesn't mean backups are toast) and at least 1 offsite backup. My setup isn't perfect, as we have 3 physical boxes in the same blade array (a single point of failure if the blade array fails, but if it fails and the SAN is still good, we still have our backups and just need to replace the blades). The disk is a shared floating SAN disk (floating in that it is hosted on server A unless server A is offline, then it is hosted on B unless B is down, then C), and we back that up to tape which is shipped offsite nightly. So our "worst case scenario" is 1 day of data loss in the event the server room explodes 1 second before we grab the tapes to ship offsite. Since the backups are pushed to tape hourly, if the SAN dies, the tape backups are good to get our data back.

    What I would recommend is working with your server team to determine what your RPO and RTO are. Right now, with a single server handling your SQL backups, if that server dies and needs to be recovered, how much time are you allowed to be down? Plan for an absolute worst case - your server room and all equipment in it is destroyed. Not likely to happen, but it could. Determine how long each of the small parts takes to get back up (your RTO) and how much data will be lost (RPO). The worst case scenario is not likely to happen, but all parties impacted should be aware of how long it'll take to fix it. AND if it is all documented, you have your butt covered if the problem ever does happen. And it covers all the small bits in between.

    For example, if the SAN were to have a power surge that fried the controller and caused all the disks to blow their boards, your SQL stuff is down (as is everything else). The big boss comes to your desk and asks "how long until the SQL is back online?" You can grab your document and say "approximately 2 and a half weeks. We need to get the SAN back online. Since it is a hardware failure, the SAN will need to be replaced. The downtime for that is 2 weeks to order a new one in. Once that gets in, we will need to configure it and restore from our tape backups. Here, we are looking at about 72 hours once the SAN is online. Once the restore to the SAN is complete, the databases will come online on their own."

    The big boss should have seen this document too, and they may decide that saving 2 weeks of downtime in the event of a catastrophic failure is worth the cost of a cold spare SAN sitting unused in a storage room somewhere. On the other hand, maybe your company is OK with being down for 2 weeks while a new SAN is ordered in. I don't know your business.
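
    Adding up a documented recovery plan like the SAN example is simple arithmetic. The sketch below uses only the two durations quoted in that example (2 weeks to replace the SAN, about 72 hours to configure it and restore from tape); the step names and the `recovery_steps` structure are hypothetical.

```python
from datetime import timedelta

# Durations taken from the SAN-failure example; step names are illustrative.
recovery_steps = {
    "order and receive replacement SAN": timedelta(weeks=2),
    "configure SAN and restore from tape": timedelta(hours=72),
}

# The steps run one after another, so the estimated RTO is their sum.
total_rto = sum(recovery_steps.values(), timedelta())

for step, duration in recovery_steps.items():
    print(f"{step}: {duration.days} days")
print(f"estimated RTO: {total_rto.days} days")
```

    Keeping each step as its own entry makes it easy to show which one dominates the total. Here the hardware lead time does, which is the case for a cold spare.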
