Cluster failover problems

  • We have a SQL Server 2000 cluster on Windows Server 2003. We have had repeated problems with it not failing over after becoming unresponsive. The other night it was unresponsive for 3 hours. There was some kind of memory leak, and we had to cut the power. Node 2 was up, but when we opened Cluster Administrator it said it could not connect to the cluster.

    All the settings should be at their defaults: "Affect the group" is checked for the SQL Server resource, with a threshold of 3 failures in 900 seconds.

    We've had other times when it has been overburdened by a runaway query and became unresponsive, requiring a manual failover.

    From what I've read, when node 2 cannot get @@servername from node 1, SQL Server will be restarted. If it restarts 3 times in 900 seconds, it will fail over.

    Any tips to get this going?
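
    The restart-then-failover behaviour described above can be sketched roughly as follows. This only illustrates the threshold/period idea (3 failures in 900 seconds); it is not the cluster service's actual code. `decide_actions` and its parameters are hypothetical names, and the sketch assumes that a failure beyond the threshold inside the period moves the group when "Affect the group" is checked.

```python
from collections import deque


def decide_actions(failure_times, threshold=3, period=900, affect_group=True):
    """Return one action per failure: 'restart', 'failover', or 'leave failed'.

    failure_times: ascending timestamps (seconds) at which the health check failed.
    Assumption: up to `threshold` restarts are attempted within `period` seconds;
    the next failure inside that window moves the group if affect_group is True.
    """
    recent = deque()  # restart timestamps still inside the window
    actions = []
    for t in failure_times:
        # Drop restarts that are older than the period.
        while recent and t - recent[0] >= period:
            recent.popleft()
        if len(recent) < threshold:
            recent.append(t)
            actions.append("restart")
        else:
            actions.append("failover" if affect_group else "leave failed")
            recent.clear()  # counting starts again after the group moves
    return actions


if __name__ == "__main__":
    print(decide_actions([0, 100, 200, 300]))
    print(decide_actions([0, 1000, 2000, 3000]))
```

    In this model nothing happens unless a failure is actually recorded, so a node that is very slow but still answers the health check never triggers a restart or a failover.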

  • What does the Event log show you?

    The cluster service logs rather specifically what it's trying to do (in the System log, not the Application log). Did it even detect an issue, or was it just running REALLY slowly? If it was REALLY taxed, it may very well become unresponsive to SOME requests while the server is still processing others.

    Beware of failback. I've had clustered servers fail in the middle of something big and then try to recover, except that the rollback/roll-forward process takes longer than 900 seconds. So you get "cluster ping pong": fail over to node A, try to recover, but recovery takes too long, so fail over to B, and do it all over again....

    ----------------------------------------------------------------------------------
    Your lack of planning does not constitute an emergency on my part...unless you're my manager...or a director and above...or a really loud-spoken end-user..All right - what was my emergency again?

  • Did you check all IP addresses and see whether or not they are correct as expected?

  • Hi Matt - I appreciate your help.

    I have replaced the server names with NODE1, NODE2, and CLUSTERNAME.

    Here are some logs.

    First, it looks like we got this error at 7:04AM, 6:05PM, 6:19PM, and from 12:30AM to 1:06AM:

    Event Type: Error
    Event Source: Srv
    Event Category: None
    Event ID: 2019
    Date: 2/11/2008
    Time: 12:44:47 AM
    User: N/A
    Computer: NODE1
    Description:
    The server was unable to allocate from the system nonpaged pool because the pool was empty.
    For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
    Data:
    0000: 00 00 04 00 01 00 54 00 ......T.
    ETC...

    ----------------------------------------------------------------------

    Then we got a few variations on this error, with NODE1 replaced by the virtual SQL network name and by some other aliases that seem to have been set up before my time here. Looking at the cluster, we have one network name for what we normally connect to SQL as, plus two more in the SQL group.

    Event Type: Warning
    Event Source: ClusSvc
    Event Category: Network Name Resource
    Event ID: 1119
    Date: 2/11/2008
    Time: 1:07:26 AM
    User: N/A
    Computer: NODE1
    Description:
    The registration of DNS name CLUSTERNAME.DOMAINNAME.CITY for resource 'Cluster Name' over adapter 'City' failed for the following reason:
    An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full.
    For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
    Data:
    0000: 47 27 00 00 G'..

    --------------------------------------------------------

    Then we got more of the first error, then this:

    Event Type: Error
    Event Source: BROWSER
    Event Category: None
    Event ID: 8032
    Date: 2/11/2008
    Time: 2:44:38 AM
    User: N/A
    Computer: NODE1
    Description:
    The browser service has failed to retrieve the backup list too many times on transport \Device\NetBT_Tcpip_{F860749C-EB51-49AD-AE96-283B068D1CEF}. The backup browser is stopping.
    For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.
    Data:
    0000: aa 05 00 00 ª...

    --------

    Then more of the first until someone rebooted the server.

  • SQL ORACLE (2/20/2008)


    Did you check all IP addresses and see whether or not they are correct as expected?

    In Cluster Administrator I looked at the network interfaces listed and then pinged the nodes. The addresses match. I don't know where else to check.

    They fail over manually just fine.

  • The first error you got is usually a symptom of a rather massive memory leak (I kind of recall seeing it associated with an out of control virus scanner). Interestingly enough - that's an OS message - not so much a SQL Server message.

    If you have something like NAI's anti-virus product running - check for a patch.

    You might care to check out PerfMon and/or the Memory Pool Monitor (Poolmon, from the Win2K resource kit, but it should still work on 2003).

    http://support.microsoft.com/?id=177415

    Of course, if it does turn out to be SQL Server having "taken" all of the memory, and not something else, you need to spend some time "tuning" the min and max memory settings. It may be that you're "stealing" too much memory away from the OS, leaving it unable to do its thing (it's going to need some resources to make the failover happen).
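
    As a rough sketch of the max memory tuning, the snippet below works out a cap from the machine's total RAM and prints the `sp_configure` statements that would apply it. The 2048 MB reserve for the OS and the helper name `max_memory_script` are illustrative assumptions, not recommendations; the right reserve depends on what PerfMon shows the OS actually needs.

```python
def max_memory_script(total_ram_mb, os_reserve_mb=2048):
    """Build T-SQL that caps SQL Server memory at total RAM minus an OS reserve.

    os_reserve_mb is an assumed placeholder; tune it for the actual server.
    """
    if os_reserve_mb <= 0 or os_reserve_mb >= total_ram_mb:
        raise ValueError("reserve must be positive and smaller than total RAM")
    cap_mb = total_ram_mb - os_reserve_mb
    return "\n".join([
        # 'max server memory (MB)' is an advanced option, so expose it first.
        "EXEC sp_configure 'show advanced options', 1",
        "RECONFIGURE",
        f"EXEC sp_configure 'max server memory (MB)', {cap_mb}",
        "RECONFIGURE",
    ])


if __name__ == "__main__":
    print(max_memory_script(8192))
```

    Review the generated statements before running them; on a cluster the setting applies to the instance, so it follows SQL Server to whichever node owns the group.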


  • We have SQL Server's memory capped at 2GB less than the total server memory. This is a dedicated server with just SQL on it. I don't believe it has antivirus running either.

    I think our strategy may have to be to monitor for the memory errors and then fire up the monitoring. Let it run for a bit and then do failovers and reboots and analyze the monitors afterwards.

    The server team should be paging us when this happens again.

  • sam (2/21/2008)


    We have SQL Server's memory capped at 2GB less than the total server memory. This is a dedicated server with just SQL on it. I don't believe it has antivirus running either.

    I think our strategy may have to be to monitor for the memory errors and then fire up the monitoring. Let it run for a bit and then do failovers and reboots and analyze the monitors afterwards.

    The server team should be paging us when this happens again.

    If you have a developer who's any good at Windows development, perhaps set up a trigger (using WMI queries, for example) on the available pages dropping below a certain threshold, before they get to "depleted". Perhaps fire up SQL Profiler when that threshold is crossed, just so you know what was happening before the nasty error.
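
    A minimal sketch of that kind of trigger, assuming some `read_available_mb` callable supplies the reading (from a WMI query, a PerfMon counter, or anything else; that part is left out here, and all names are hypothetical). It calls `on_low` once when the value drops below the threshold and re-arms only after the value recovers, so a server hovering near the limit does not start Profiler over and over.

```python
import time


def watch_memory(read_available_mb, on_low, low_mb, rearm_mb,
                 interval_s=30, max_samples=None, sleep=time.sleep):
    """Poll a memory reading and call on_low(value) when it drops below low_mb.

    read_available_mb: hypothetical callable returning the current reading in MB.
    rearm_mb: the value must climb back to this level before on_low can fire again.
    max_samples: stop after this many readings (None means run until interrupted).
    """
    if rearm_mb < low_mb:
        raise ValueError("rearm_mb must be at least low_mb")
    armed = True
    taken = 0
    while max_samples is None or taken < max_samples:
        value = read_available_mb()
        taken += 1
        if armed and value < low_mb:
            on_low(value)  # e.g. start a trace and page the DBA
            armed = False
        elif not armed and value >= rearm_mb:
            armed = True
        # Wait between readings, but not after the final one.
        if max_samples is None or taken < max_samples:
            sleep(interval_s)


if __name__ == "__main__":
    readings = iter([500, 180, 150, 400, 90])
    alerts = []
    watch_memory(lambda: next(readings), alerts.append,
                 low_mb=200, rearm_mb=300, max_samples=5, sleep=lambda s: None)
    print(alerts)
```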


  • We've opted to have the server guys monitor the memory usage of all processes and set up an alert for when the error comes up again. I reviewed the past logs, and the error correlates pretty well with our problems. We're moving all the databases on this box to a 64-bit SQL Server 2005 machine over the next 6 months. Hopefully it's smooth sailing until then.
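
    For reviewing past logs, a sketch like the one below could pick the pool-exhaustion events out of an exported System log. It assumes the log is saved as text in the `Field:value` layout of the Event Viewer output quoted earlier in this thread, and it looks for source `Srv` with event ID `2019`; the function name and the sample text are illustrative.

```python
def find_events(log_text, source="Srv", event_id="2019"):
    """Return (date, time) pairs for events matching source and event_id.

    Assumes each event starts with an 'Event Type' line followed by
    'Field:value' lines, as in Event Viewer's copied text.
    """
    events, current = [], None
    for line in log_text.splitlines():
        name, sep, value = line.strip().partition(":")
        if not sep:
            continue  # not a Field:value line
        name, value = name.strip(), value.strip()
        if name == "Event Type":
            current = {}  # a new event record begins here
            events.append(current)
        if current is not None and name not in current:
            current[name] = value  # keep the first occurrence of each field
    return [(e.get("Date"), e.get("Time")) for e in events
            if e.get("Event Source") == source and e.get("Event ID") == event_id]


if __name__ == "__main__":
    sample = """Event Type:Error
Event Source:Srv
Event ID:2019
Date:2/11/2008
Time:12:44:47 AM
Event Type:Warning
Event Source:ClusSvc
Event ID:1119
Date:2/11/2008
Time:1:07:26 AM
"""
    print(find_events(sample))
```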
