Node failed to come up

  • We have a two-node cluster (Node and Disk Majority quorum) running Windows Server 2008 with SQL Server 2008 Enterprise. The cluster is configured to restart a failed resource on the current node (15 min) and to fail over all resources if the restart is unsuccessful. One of the instances became unavailable: we could not connect to it remotely and could not fail it over from the second node, and the cluster manager froze up. Here is part of the cluster log from the victim server, from right around the time this happened:

    00000988.000027f0::2012/01/19-23:32:23.350 ERR [API] s_ApiCloseNetwork: ERROR_INVALID_HANDLE(6)' because of 'Cannot unregister handle 11'

    00000988.000027f0::2012/01/19-23:32:23.350 ERR [API] s_ApiCloseNetwork: ERROR_INVALID_HANDLE(6)' because of 'Cannot unregister handle 10'

    00000988.000027f0::2012/01/19-23:32:23.350 ERR [API] s_ApiCloseNetwork: ERROR_INVALID_HANDLE(6)' because of 'Cannot unregister handle 8'

    00000988.000027f0::2012/01/19-23:32:23.350 ERR [API] s_ApiCloseNetwork: ERROR_INVALID_HANDLE(6)' because of 'Cannot unregister handle 7

    What could this mean?

    More errors after that:

    [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed

    [sqsrvres] printODBCError: sqlstate = 08S01; native error = 2746; message = [Microsoft][SQL Server Native Client 10.0]TCP Provider: An existing connection was forcibly closed by the remote host.

    printODBCError: sqlstate = 08S01; native error = 2746; message = [Microsoft][SQL Server Native Client 10.0]Communication link failure

    OnlineThread: QP is not online.

    What is puzzling is why the resources did not fail over for about 45 minutes, until we hard-rebooted the unresponsive server.

    Thanks.

  • Any troubleshooting tips?

  • If you can, I would bring both nodes down. Then power on the one with the problem first and see if the clusters go online, then bring the second node up.

    This has worked for me on several occasions.

  • Are you by any chance using NOD32 Antivirus and firewall?

    Leo

    Nothing in life is ever so complicated that with a little work it can't be made more complicated.

  • _Beetlejuice (1/24/2012)


    If you can, I would bring both nodes down. Then power on the one with the problem first and see if the clusters go online, then bring the second node up.

    This has worked for me on several occasions.

    Had to do that eventually. Once I restarted the victim server, it failed over properly to the other server, but both servers stayed online for only several minutes before going offline again, so I had to reboot the second server as well. :angry:

  • Leo.Miller (1/24/2012)


    Are you by any chance using NOD32 Antivirus and firewall?

    Leo

    No antivirus.

  • Have you checked for any errors in the Windows event logs?

  • Check the cluster error events within Failover Cluster Manager for more info.

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • Did check the logs, but nothing stands out...
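
When nothing stands out by eye, it can help to tally the cluster log mechanically. The following is a minimal Python sketch, not an official tool: it assumes every entry follows the layout visible in the excerpt from the first post (`<process id>.<thread id>::<date>-<time> <LEVEL> [<component>] <message>`), and the names `parse_line` and `summarize_errors` are hypothetical helpers made up for this illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed layout, inferred only from the excerpt in the first post:
#   <pid>.<tid>::<YYYY/MM/DD-HH:MM:SS.mmm> <LEVEL> [<component>] <message>
LINE_RE = re.compile(
    r"^(?P<pid>[0-9a-fA-F]{8})\.(?P<tid>[0-9a-fA-F]{8})::"
    r"(?P<ts>\d{4}/\d{2}/\d{2}-\d{2}:\d{2}:\d{2}\.\d{3})\s+"
    r"(?P<level>[A-Z]+)\s+"
    r"\[(?P<component>[^\]]+)\]\s+"
    r"(?P<message>.*)$"
)


def parse_line(line):
    """Return a dict for one log entry, or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    entry = match.groupdict()
    entry["ts"] = datetime.strptime(entry["ts"], "%Y/%m/%d-%H:%M:%S.%f")
    return entry


def summarize_errors(lines):
    """Count ERR entries per component; lines that do not parse are skipped."""
    counts = Counter()
    for line in lines:
        entry = parse_line(line)
        if entry is not None and entry["level"] == "ERR":
            counts[entry["component"]] += 1
    return counts
```

Feeding the four `s_ApiCloseNetwork` lines from the first post through `summarize_errors` would count them under the `API` component. The `[sqsrvres]` lines as pasted carry no process/timestamp prefix, so this sketch skips them; in a full log they would presumably have one.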
