Restart SQL Agent on AG fail over

  • I'm trying to figure out a way to automatically restart the SQL Agent service when an AG fails over to a secondary node. I've created a job on each AG node that runs every 15 seconds and detects when a fail over has happened. It then issues a SQL Agent stop command and then a SQL Agent start command. The problem is that once the stop command happens the start command never fires because the SQL Agent is now stopped. How can I get around this?

  • Out of curiosity, why do you have to restart the agent?

  • We've discovered a bug with AGs. The first time it occurred after a fail over the SQL Agent jobs on the secondary kept executing even though they were disabled. A restart of the Agent service corrected it. The second occurrence of the bug happened when we did a quick fail over test. Failed the primary from the A node to the B node and then back. Didn't notice for a while that the Agent jobs on A weren't running, even though they were enabled. Again, a restart of the Agent service fixed it. We've opened a case with MS to get to the bottom of it, but in the mean time we want to automatically restart the Agent service on both the primary and secondary whenever a fail over occurs.

  • I subscribed to the thread so post again if you find out more.

    Instead of restarting the agent, you could try have a job to trigger on failover on both nodes (likely through a WMI alert). If it fails over and is now a secondary then it can iterate and sp_stop_job all running jobs (and disable them so they don't trigger again). If it's now a primary then it can iterate all the jobs to disable then re-enable them; the reason being that this will reset the next run time for all of them.

    This is assuming the jobs execute frequently, and or the failover takes too long, and the next time to run them gets missed (this can happen on normal instances, but can also happens when the time service fixes the time and it jumps ahead and misses a schedule in the same way and then they just drop off).

    We have a morning report that runs and checks for it on our servers; apparently it goes all the way back to SQL 2000.

    Of course it could be something totally different.

  • Sorry folks in advance, but I've tried to find a thread with something more close to the question I have, yet I've not.

    So, here it goes...

    I created a SQL Alert with the Error number: 1480 (DB changing its role due to AG Failover event)

    I set an email notification and I set Execute job, e.g. AGFailOverResponse, within the Response section of that alert.

    My question is:

    How can I determine, via SQL script, which AG has actually failed over (since we can have more than one on the cluster)?

    I want to perform some specific actions related to that particular AG, e.g. Enable/Disable CDC, within the step(s) of that SQL job (read within some SPROC).

    We have a solution which works fine in case of one AG, but what if there is another AG added to the cluster later? Then what?

    I appreciate you advice.

    Thank you!

  • How are you controlling the restart of the SQL Agent?

    I've done somethig simular in the past and used a powershell script that was called from the Windows Scehduled task to control the resarting of the agent. I guess it could be done from the SLQ Agent though by forking a powershell script to run outside the SQL Agent process...

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply