SQL Cluster stops working

  • Two-node Active/Active MS SQL Server cluster

    Win2k3 enterprise edition, SP1

    SQL2k SP4

    Once or twice a month, one of the cluster nodes unexpectedly tries to restart and hangs. Sometimes it is the first node, sometimes the second. This time node B hung, and its SQL error log doesn't show anything useful. The node A SQL error log:

    2007-04-04 13:32:17.83 spid2     LogWriter: Operating system error 21(error not found) encountered.

    2007-04-04 13:32:17.85 spid2     Write error during log flush. Shutting down server

    2007-04-04 13:32:17.85 spid18    Error: 9001, Severity: 21, State: 4

    2007-04-04 13:32:17.85 spid18    The log for database 'ABC' is not available..

    2007-04-04 13:32:17.86 spid18    Error: 9001, Severity: 21, State: 1

    2007-04-04 13:32:17.86 spid18    The log for database 'ABC' is not available..

    2007-04-04 13:32:17.88 spid18    Database 'ABC' cannot be opened. It has been marked SUSPECT by recovery. See the SQL Server errorlog for more information.

    2007-04-04 13:32:17.88 spid18    Database 'ABC' cannot be opened. It has been marked SUSPECT by recovery. See the SQL Server errorlog for more information.

    2007-04-04 13:32:17.97 spid18    Error: 3314, Severity: 21, State: 4

    2007-04-04 13:32:17.97 spid18    Error while undoing logged operation in database 'ABC'. Error at log record ID (956831:19771:58)..

    2007-04-04 13:32:18.24 spid205   BackLinkLogBlockReadAheadAsync: Operating system error 21(error not found) encountered.

    2007-04-04 13:32:23.68 spid2     LogWriter: Operating system error 21(error not found) encountered.

    2007-04-04 13:32:23.68 spid2     Write error during log flush. Shutting down server

    The thing is, when one node hangs, the cluster resource groups located on that node don't fail over to the other node, and the Cluster service becomes unavailable too.

    I found one interesting MS article, but I'm not sure whether it applies here:

    http://support.microsoft.com/default.aspx?scid=kb;EN-US;838765

    We have 8 GB of RAM on each server and use the following switches in the boot.ini file:

    "/fastdetect /3GB /PAE /NoExecute=OptOut"

    Does anyone know what is happening here?

  • System error 21 means that a device was not ready when the LogWriter tried to write to the log. Here is a link to the list of system error codes. To find more information on your problem, try searching for the key phrase "the device is not ready".

    http://msdn2.microsoft.com/en-us/library/ms681382.aspx
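
    To see which components hit the device error and when, the error log can be scanned for "Operating system error" lines. This is only a minimal sketch: the function name `find_os_errors` and the small `WIN32_ERRORS` table are made up for illustration, the table is a hand-picked subset of Win32 codes and not a complete list, and the regular expression assumes the line layout shown in the log above.

```python
import re

# Hand-picked subset of Win32 system error codes; NOT a complete table.
WIN32_ERRORS = {
    21: ("ERROR_NOT_READY", "The device is not ready."),
    23: ("ERROR_CRC", "Data error (cyclic redundancy check)."),
    1117: ("ERROR_IO_DEVICE",
           "The request could not be performed because of an I/O device error."),
}

# Assumed layout: "<timestamp> <spid> <source>: Operating system error <n>(...".
OS_ERROR = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{2})\s+"
    r"(?P<spid>spid\d+)\s+"
    r"(?P<source>\w+): Operating system error (?P<code>\d+)\("
)


def find_os_errors(log_text):
    """Return (timestamp, spid, source, code, name, message) for each OS error line."""
    hits = []
    for line in log_text.splitlines():
        m = OS_ERROR.match(line.strip())
        if not m:
            continue  # not an "Operating system error" line
        code = int(m.group("code"))
        name, message = WIN32_ERRORS.get(code, ("UNKNOWN", "not in this table"))
        hits.append((m.group("ts"), m.group("spid"), m.group("source"),
                     code, name, message))
    return hits


# Three lines copied from the node A log in the first post.
SAMPLE = """\
2007-04-04 13:32:17.83 spid2     LogWriter: Operating system error 21(error not found) encountered.
2007-04-04 13:32:17.85 spid2     Write error during log flush. Shutting down server
2007-04-04 13:32:18.24 spid205   BackLinkLogBlockReadAheadAsync: Operating system error 21(error not found) encountered.
"""

if __name__ == "__main__":
    for hit in find_os_errors(SAMPLE):
        print(hit)
```

    Comparing the timestamps of these hits with the System event log on the same node may help show whether the disk disappeared before SQL Server reported the error.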

  • Hi,

    Looks to me like there is a problem with the relationship between the SQL cluster group and the physical drives.

    Is each physical disk a member of only one cluster group?

    I once had a case where tempdb and the error logs were on a disk that somehow later became a member of the Backup Exec cluster group instead of the SQL Server cluster group. When the Backup Exec group went offline or switched nodes, SQL Server lost the disk, and so lost tempdb.

    That configuration was created by a Backup Exec cluster specialist...
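
    The check I mean can be sketched like this. It is only an illustration: the group and disk names are invented, the function `check_disk_ownership` is hypothetical, and the group-to-disk mapping has to be filled in by hand from what Cluster Administrator shows, since the script does not query the cluster itself.

```python
from collections import defaultdict


def check_disk_ownership(group_disks, required_disks, sql_group):
    """Report every disk SQL Server needs that is not owned solely by sql_group.

    group_disks maps a cluster group name to the set of disks in that group.
    """
    # Invert the mapping: disk -> set of groups that contain it.
    owners = defaultdict(set)
    for group, disks in group_disks.items():
        for disk in disks:
            owners[disk].add(group)

    problems = []
    for disk in sorted(required_disks):
        groups = owners.get(disk, set())
        if groups != {sql_group}:
            found = ", ".join(sorted(groups)) or "no group"
            problems.append(f"{disk}: expected only '{sql_group}', found {found}")
    return problems


if __name__ == "__main__":
    # Hypothetical layout resembling the case described above:
    # the tempdb disk ended up in the Backup Exec group.
    layout = {
        "SQL Group": {"Disk E", "Disk F"},
        "Backup Exec Group": {"Disk T"},
    }
    needed_by_sql = {"Disk E", "Disk F", "Disk T"}
    for problem in check_disk_ownership(layout, needed_by_sql, "SQL Group"):
        print(problem)
```

    An empty result means every disk SQL Server depends on sits only in its own group, so another group going offline cannot take a disk away from it.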

     

    regards

    karl


  • Hi,

    Yes, they are. Each disk belongs to only one cluster group. As far as I remember, you cannot add the same disk to different cluster resource groups.

    Over the weekend we updated the firmware on these servers and on our SAN. I hope this will help. I'll come back here if the problem happens again.

    Thanks

  • Did this error go away? If yes, how?

    Thanks in advance.

  • I'm interested in this error too. We have it in our SAP cluster.

    In our case, I suspect the disk fails because of a chain of failures starting with a SAP failure.

  • Well, a lot of time has passed. We solved the problem in the end. The cause was firmware. After updating it a few times, the problem went away.

  • Thank you for the clue.

    But we have recently updated all the server firmware...

    Was it the firmware of the FC adapter?
