Failover, Patching and Hallengren Backups

  • I am using a replication tool that has been having issues over the last couple of patch cycles.  I am doing some troubleshooting and see a few issues.

     

    1. When patching the AG servers, a failover occurred.  To patch AG servers, should we reboot the secondary first and then the primary?  We never thought of a failover happening when we rebooted the servers for patching.
    2. When the failover occurred, the Hallengren backup ran and put the log backup on the file share, but under the secondary replica's log folder.  Is there a way to use the Hallengren scripts and have both replicas back up to the same folders?
    3. Is there a known issue with Microsoft SQL Server and broken chains?  When patching was over and we restarted the replication, we noticed broken log chains.

    Just trying to do some troubleshooting on certain issues that we have noticed lately.

    The servers are SQL Server 2017.  The AGs were set to manual failover.  A failover shouldn't have happened, nor should a broken log chain.  So I am trying to determine what happened before the next patching cycle.

     

    Thanks for any and all input.

    Things will work out.  Get back up, change some parameters and recode.

    1. Yes
    2. Depends on your configuration
    3. Define "issues". Are you aware that replication needs some time to catch up after one node goes down? What do you mean by broken chains? An LSN mismatch is not a broken transaction log chain.
  • Great information.

    During patching, both servers are in the same group to be rebooted.  We never thought about separating the servers that are involved in an AG and then ensuring the secondaries are rebooted first.

    The odd thing about this issue was that the AG was set to manual.  There shouldn't have been a failover when the servers are set to manual.

    When the failover occurred, the backup ran (since the replica was now the primary).  A backup is now in the folder for the secondary replica (now primary).

    Then another failover must have occurred although I am not seeing that in the logs.  But the original primary is now back to being primary.  We start the replication tool and bam.  There is a replication error due to a missing log file.

     

     

     


  • Issues like these are one reason I do not recommend auto-patching and restarting clusters of any kind.  Clusters should always be patched and restarted manually, so that quorum is maintained throughout the process and each node is patched, restarted and ready before it is used to support any services.

    One of the problems with maintenance in an AG is the non-shared storage between nodes.  Ideally, in this type of environment you would have network storage available and set up your maintenance to back up to that network storage using a UNC path.  This would alleviate any concerns about which node is performing the backups.

    If possible, when patching a cluster I recommend failing over only a single time. That way you run for a month on node 1, the next month on node 2, and the following month you are back on node 1.  The order of operations would be:

    1. Identify the current node hosting the services
    2. Patch and restart the other node, then validate that it is back up and available.
    3. Manually fail over services to the newly patched and restarted node
    4. Patch and restart the first node

    This reduces overall downtime to a single failover event and allows patching and restarting to occur before the scheduled downtime.  Patching and restarting the inactive node does not impact the active system, and therefore does not impact the users or the application. The only impact is the failover itself.
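The order of operations above can be sketched as a small planning function. This is only an illustration of the sequencing, not a patching tool: `rolling_patch_plan`, the action names, and the node names are all hypothetical, and the actual patching, validation, and failover would still be done by hand or with your own tooling.

```python
def rolling_patch_plan(replicas):
    """Build an ordered patch plan with a single manual failover.

    replicas: dict mapping node name -> current role,
              either "PRIMARY" or "SECONDARY" (hypothetical input shape).
    Returns a list of (action, node) tuples in the order to perform them.
    """
    # Step 1: identify the node currently hosting the services.
    primaries = [name for name, role in replicas.items() if role == "PRIMARY"]
    if len(primaries) != 1:
        raise ValueError("expected exactly one primary replica")
    primary = primaries[0]

    secondaries = sorted(
        name for name, role in replicas.items() if role == "SECONDARY"
    )
    if not secondaries:
        raise ValueError("need at least one secondary replica to fail over to")

    plan = []
    # Step 2: patch and restart every other node, validating each one.
    for node in secondaries:
        plan.append(("patch_and_restart", node))
        plan.append(("validate_online", node))

    # Step 3: one manual failover to an already patched node.
    plan.append(("manual_failover", secondaries[0]))

    # Step 4: patch and restart the node that was hosting the services.
    plan.append(("patch_and_restart", primary))
    return plan


# Example with two hypothetical nodes.
for action, node in rolling_patch_plan({"NODE1": "PRIMARY", "NODE2": "SECONDARY"}):
    print(action, node)
```

The point of encoding the order this way is that the current primary is never restarted while it still hosts the services, so the only interruption is the one planned failover.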

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll


  • For Always On AGs, there is an Extended Events health session (AlwaysOn_health) that can be examined to find information about failovers.  I would also be curious to know what the SQL Server Error Log shows happened during the time you're describing.

    I would agree with you that if both nodes in the AG are set to manual failover, then an unexpected failover should not occur. Be sure to check the AG settings to ensure that both nodes are set to manual failover.

    The behavior you're describing with the Ola Hallengren scripts sounds like what happens when you take a backup of a database that isn't actually part of an AG.  If a database is part of an AG, then Ola's scripts place all the backups in a single directory, regardless of which replica runs the backup.
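To make the folder behavior concrete, here is a rough sketch of how the top-level backup folder is chosen. It assumes the script's default directory templates as I understand them (server name for standalone databases, cluster name plus `$` plus AG name for AG databases) and a default instance. The function name, share path, and server names are hypothetical, and the real scripts add more to the path than is shown here, so check the directory structure parameters in your own job.

```python
def backup_directory(root, database, backup_type, server_name,
                     cluster_name=None, ag_name=None):
    """Approximate the folder a backup lands in (simplified, hypothetical).

    For a database in an AG, the top-level folder is built from the cluster
    and AG names, so it is the same on every replica. Otherwise it is built
    from the server name, so each server writes to its own folder.
    """
    if cluster_name and ag_name:
        top_level = f"{cluster_name}${ag_name}"
    else:
        top_level = server_name
    return "\\".join([root, top_level, database, backup_type])


share = r"\\fileshare\backups"  # hypothetical UNC path

# Recognized as an AG database: both replicas resolve to the same folder.
print(backup_directory(share, "Sales", "LOG", "NODE1", "CLUSTER1", "AG1"))
print(backup_directory(share, "Sales", "LOG", "NODE2", "CLUSTER1", "AG1"))

# Not recognized as an AG database: each node gets its own folder.
print(backup_directory(share, "Sales", "LOG", "NODE1"))
print(backup_directory(share, "Sales", "LOG", "NODE2"))
```

If the log backups ended up under a per-server folder, that points at the database not being treated as an AG database at the time the backup ran.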
