Regarding Cluster Failover

  • Hi All,

    I was troubleshooting a connectivity issue for a cluster server.

    SQL 2k5 on Windows 2k3.

    I found that an instance was down and manually brought it back online. However, I am wondering why it didn't fail over to another node instead of ending up in a failed state.

    Does failover only happen when the node is down?

    Can someone advise me how to check why it didn't fail over?

    Thanks,

    SueTons.

    Regards,
    SQLisAwe5oMe.

  • Have you checked the Windows logs? I'm sure something would be in there if a failover was attempted but couldn't succeed for some reason.

    ______________________________________________________________________________________________
    Forum posting etiquette. Get your answers faster.

  • Check c:\windows\system32\cluster\cluster.log
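    A quick way to narrow that log down is to filter it for warning and error entries. The sketch below assumes each entry carries a severity token such as ERR or WARN as a separate whitespace-delimited word, which you should verify against your own file. The function names are made up for illustration, and the path is passed in rather than assumed.

```python
def filter_cluster_log(lines, levels=("ERR", "WARN")):
    """Return the log lines whose whitespace-separated tokens include a wanted severity.

    Assumption: the severity appears as its own token (e.g. 'ERR'), not embedded in a word.
    """
    wanted = set(levels)
    return [line.rstrip("\n") for line in lines if wanted & set(line.split())]


def scan_file(path, levels=("ERR", "WARN")):
    """Read the log at `path` and return only the matching entries."""
    # errors="replace" keeps the scan going if the file contains odd bytes.
    with open(path, errors="replace") as handle:
        return filter_cluster_log(handle, levels)
```

    Calling `scan_file` with the path to your cluster.log and looking at the entries around the time of the outage is usually enough to see which resource failed first.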

  • Within the configuration, are both servers listed as possible owners of the resources in the SQL resource group? Also, are there any errors in the Windows event logs indicating a problem when trying to fail over to the secondary node, and was the Cluster service running on the passive node?

    The failover mechanism uses a "LooksAlive" check and an "IsAlive" check. The LooksAlive check takes place every 5 seconds on the host node within the failover cluster, whereas the more in-depth IsAlive check takes place every 60 seconds and uses SELECT @@SERVERNAME.
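    To make the timing concrete, here is a toy model of the two polling intervals described above. It only illustrates when each check would fall due, using the 5 and 60 second figures from this post. It is not how the cluster service is implemented, and the function names are invented for the example.

```python
LOOKS_ALIVE_INTERVAL = 5   # seconds, the figure quoted in the post above
IS_ALIVE_INTERVAL = 60     # seconds, the figure quoted in the post above


def checks_due(elapsed_seconds):
    """Return the names of the health checks that fall due at this second."""
    due = []
    if elapsed_seconds % LOOKS_ALIVE_INTERVAL == 0:
        due.append("LooksAlive")
    if elapsed_seconds % IS_ALIVE_INTERVAL == 0:
        due.append("IsAlive")
    return due


def schedule(duration_seconds):
    """Map each second in 1..duration_seconds that has a check due to the checks due then."""
    return {
        second: checks_due(second)
        for second in range(1, duration_seconds + 1)
        if checks_due(second)
    }
```

    Over one minute this gives twelve cheap LooksAlive polls and a single IsAlive poll, so a hung instance that still "looks alive" may go unnoticed for up to a minute.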

  • SQLCrazyCertified (11/20/2012)


    Hi All,

    I was troubleshooting a connectivity issue for a cluster server.

    SQL 2k5 on Windows 2k3.

    I found that an instance was down and manually brought it back online. However, I am wondering why it didn't fail over to another node instead of ending up in a failed state.

    Does failover only happen when the node is down?

    Can someone advise me how to check why it didn't fail over?

    Thanks,

    SueTons.

    The default is generally to try to restart locally; if this is unsuccessful, it tries 3 times on the partner. If that is also unsuccessful, the group goes offline.
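    The sequence just described can be written down as a small decision function. This is only a model of the behaviour as stated in this post (local restart first, then up to 3 attempts on the partner, then offline). The real thresholds are configurable per resource and group, and every name in the sketch is hypothetical.

```python
def recovery_outcome(local_restart_ok, partner_attempt_results, max_partner_attempts=3):
    """Return where the group ends up, given how each recovery attempt went.

    local_restart_ok: True if the restart on the current node succeeded.
    partner_attempt_results: booleans, one per attempt on the partner node.
    """
    if local_restart_ok:
        return "online on original node"
    # Only the first `max_partner_attempts` attempts count before giving up.
    for succeeded in partner_attempt_results[:max_partner_attempts]:
        if succeeded:
            return "online on partner node"
    return "group offline"
```

    A group that is found offline, as in the original question, corresponds to the last branch: every attempt failed, or the partner could not host the group at all.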

    Check the event logs, but also, with the group offline, move it to the partner node if it's not already there.

    Now try to bring the resources online one at a time, in this order:

    • Network IP
    • Network name
    • Disk resources
    • SQL Server service
    • SQL Server Agent service

    Look out for any resources that fail; the most common problem is disk resources not failing over properly from one node to another.
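    If you would rather script the one-at-a-time test than click through Cluster Administrator, something along these lines may help. It is a sketch under assumptions: it relies on the `cluster resource "name" /online` form of cluster.exe, it treats a non-zero exit code as failure (check this on your system), and the resource names are placeholders that you must replace with the names in your own group.

```python
import subprocess


def cluster_online(resource_name):
    """Ask cluster.exe to bring one resource online; True if the command exits with 0."""
    # Assumption: a non-zero exit code means the resource did not come online.
    result = subprocess.run(["cluster", "resource", resource_name, "/online"])
    return result.returncode == 0


def bring_online_in_order(resources, bring_online=cluster_online):
    """Bring resources online one by one; return the first that fails, or None."""
    for name in resources:
        print(f"Bringing online: {name}")
        if not bring_online(name):
            return name  # stop here so the failing resource is easy to spot
    return None


# Hypothetical resource names, listed in the order given above.
RESOURCES = [
    "SQL IP Address",
    "SQL Network Name",
    "Disk S:",
    "SQL Server",
    "SQL Server Agent",
]
```

    Run `bring_online_in_order(RESOURCES)` on the node that currently owns the group. Whatever name it returns is the resource to investigate in the event logs and the cluster log.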

    Btw, in future it would help if you posted in the correct forum.

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • From eventvwr, I see this....

    "A fatal error occurred while reading the input stream from the network. The session will be terminated."

    I also see another error every 4 minutes or so, each time with a different SPID.

    "The client was unable to reuse the session with SPID 243, which had been reset for connection pooling. This error may have been caused by an earlier operation failing. Check the error logs for failed operations immediately before this error message."

    Another error

    "Error -2147023545 - Configuration information could not be read from the domain controller, either because the machine is unavailable, or access has been denied."

    A couple of warnings like this:

    "The configuration information of the performance library "C:\WINNT\system32\sqlctr90.dll" for the "MSSQL$InstanceName" service does not match the trusted performance library information stored in the registry. The functions in this library will not be treated as trusted."

    These are some of the repetitive errors/warnings that I see, and I cannot make much sense of them.

    Anyway, the instance was manually brought online and is working properly; however, I am trying to find the root cause.

    If you guys can make more sense of the errors above, let me know.

    Thanks,

    SueTons.

    Regards,
    SQLisAwe5oMe.

  • "A fatal error occurred while reading the input stream from the network. The session will be terminated" - This is a network adapter error, most probably a driver issue.

    To determine the current status of TCP Chimney Offload, open a command prompt with administrative credentials and run the following command:

    netsh int tcp show global

  • Are there any messages suggesting a problem with the "Heartbeat"? Or network connectivity issues on the passive node in general? The heartbeat is a special network connection between the two nodes that they use to monitor each other. If the passive node was having network issues when the active node crashed, that would explain what occurred.
