AlwaysOn fails every hour

  • Hi All,

    I got an AlwaysOn alert from our monitoring system saying that some databases are not synchronized. I then checked the SQL Server log and the Event Viewer and found a pattern: the issue occurs every hour.

    The error indicates a connection timeout. After googling the message, I double-checked the NT AUTHORITY\SYSTEM permissions, and they look good. Then I checked port 5022, which the endpoint uses, and found that svchost is also listening on this port. I am not sure whether that is correct (our other cluster has the same configuration, and there I can also see svchost listening on the endpoint port).

    I copied all the messages below. Has anyone experienced this issue before? (I just noticed that while I was observing SQL Server around 7:33, there was no AlwaysOn error in the SQL Server log, but the Event Viewer still shows errors, which is weird.)

    SQL Server log, occurs at 6:33, 5:33, 4:33 every hour, except at 7:33 when I was observing...

    AlwaysOn Availability Groups connection with secondary database terminated for primary database 'XXXYYY' on the availability replica 'XXX3' with Replica ID: {9b3bd423-f2fc-44c5-9831-41c94c4d6de2}. This is an informational message only. No user action is required.

    A connection timeout has occurred while attempting to establish a connection to availability replica 'XXX3' with id [41F87192-C846-4355-A8DC-C788EC56E93E]. Either a networking or firewall issue exists, or the endpoint address provided for the replica is not the database mirroring endpoint of the host server instance.

    AlwaysOn Availability Groups connection with secondary database established for primary database 'XXXYYY' on the availability replica 'XXX3' with Replica ID: {41f87192-c846-4355-a8dc-c788ec56e93e}. This is an informational message only. No user action is required.

    Event Viewer, occurs at 7:33, 6:33, 5:33, 4:33 ... every hour

    Cluster resource 'XXXRes' of type 'SQL Server Availability Group' in clustered role 'XXXRole' failed.

    Clustered role 'XXXRole' has exceeded its failover threshold. It has exhausted the configured number of failover attempts within the failover period of time allotted to it and will be left in a failed state. No additional attempts will be made to bring the role online or fail it over to another node in the cluster. Please check the events associated with the failure. After the issues causing the failure are resolved the role can be brought online manually or the cluster may attempt to bring it online again after the restart delay period.

    Cluster resource 'XXXRes' of type 'SQL Server Availability Group' in clustered role 'XXXRole' failed.

    Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

    The Cluster service failed to bring clustered role 'XXXRole' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.

    Since the endpoint port is 5022, I ran netstat and found the following port information:

    netstat -anb

    [svchost.exe]

    TCP 0.0.0.0:5022 0.0.0.0:0 LISTENING

    [svchost.exe]

    TCP 10.1.16.181:5022 10.1.16.182:55377 ESTABLISHED

    [sqlservr.exe]

    TCP 10.1.16.181:5022 10.1.16.183:63670 ESTABLISHED

    [sqlservr.exe]

    TCP 10.1.16.181:5022 10.1.16.183:63670 ESTABLISHED

    [sqlservr.exe]

    TCP 10.1.16.181:51465 10.1.138.161:49159 TIME_WAIT

    TCP 10.1.16.181:51496 10.1.16.182:5022 ESTABLISHED

    [SQLAGENT.EXE]

    TCP 10.1.16.181:62876 10.1.16.183:5022 ESTABLISHED

    [svchost.exe]

    TCP [::]:5022 [::]:0 LISTENING
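Since the log reports a connection timeout to the replica's endpoint, one way to collect evidence is to time a plain TCP connect to port 5022 around the minute the errors appear. The sketch below is only an illustration: the function name `probe_endpoint` is made up here, and the target address is an assumption taken from the peers shown in the netstat output. A successful connect only shows that something accepts connections on that port; it does not prove that the listener is the database mirroring endpoint.

```python
import socket
import time


def probe_endpoint(host, port=5022, timeout=5.0):
    """Attempt one TCP connect; return (ok, seconds_elapsed, error_text)."""
    start = time.monotonic()
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:
        return False, time.monotonic() - start, str(exc)


if __name__ == "__main__":
    # Assumed target: one of the peers seen on port 5022 in the netstat output.
    ok, elapsed, err = probe_endpoint("10.1.16.183", 5022, timeout=3.0)
    print(f"ok={ok} elapsed={elapsed:.3f}s {err}")
```

Running this from each node toward the others shortly before and after the failing minute could show whether the port stops answering at that time or whether the timeout is raised at a higher layer.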

  • i1888 (6/10/2015)


    Cluster resource 'XXXRes' of type 'SQL Server Availability Group' in clustered role 'XXXRole' failed.

    Clustered role 'XXXRole' has exceeded its failover threshold. It has exhausted the configured number of failover attempts within the failover period of time allotted to it and will be left in a failed state. No additional attempts will be made to bring the role online or fail it over to another node in the cluster. Please check the events associated with the failure. After the issues causing the failure are resolved the role can be brought online manually or the cluster may attempt to bring it online again after the restart delay period.

    Cluster resource 'XXXRes' of type 'SQL Server Availability Group' in clustered role 'XXXRole' failed.

    Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

    The Cluster service failed to bring clustered role 'XXXRole' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.

    It looks like the AlwaysOn availability group clustered resource is offline. Have you investigated this?

    -----------------------------------------------------------------------------------------------------------

    "Ya can't make an omelette without breaking just a few eggs" 😉

  • I looked through the errors in Cluster Events. Three types of errors keep being generated for many cluster resources every hour; there are no other errors.

    Message 1:

    Cluster resource 'XXXA' of type 'SQL Server Availability Group' in clustered role 'XXXA' failed.

    Based on the failure policies for the resource and role, the cluster service may try to bring the resource online on this node or move the group to another node of the cluster and then restart it. Check the resource and group state using Failover Cluster Manager or the Get-ClusterResource Windows PowerShell cmdlet.

    Message 2:

    The Cluster service failed to bring clustered role 'XXXA' completely online or offline. One or more resources may be in a failed state. This may impact the availability of the clustered role.

    Message 3:

    Clustered role 'XXXA' has exceeded its failover threshold. It has exhausted the configured number of failover attempts within the failover period of time allotted to it and will be left in a failed state. No additional attempts will be made to bring the role online or fail it over to another node in the cluster. Please check the events associated with the failure. After the issues causing the failure are resolved the role can be brought online manually or the cluster may attempt to bring it online again after the restart delay period.

    The situation: we have 3 nodes, A, B and C. A and B are in synchronous mode with manual failover, and C is asynchronous. Since the group is in manual failover mode, I wonder how the clustered role can exceed the failover threshold. Maybe my understanding is wrong...

  • The above errors occur every hour, but SQL Server AlwaysOn itself doesn't seem to be affected every hour...
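To confirm that the errors really recur at the same minute of each hour, the event timestamps can be bucketed by minute. This is a small sketch: the helper name `dominant_minute` is hypothetical, the times are the ones quoted in the thread (4:33, 5:33, 6:33, 7:33), and the date is an assumption added only so the values parse.

```python
from collections import Counter
from datetime import datetime


def dominant_minute(timestamps):
    """Return (minute, share) for the most common minute-of-hour."""
    minutes = Counter(ts.minute for ts in timestamps)
    minute, count = minutes.most_common(1)[0]
    return minute, count / len(timestamps)


# Times quoted in the thread; the calendar date is assumed.
events = [datetime(2015, 6, 10, hour, 33) for hour in (4, 5, 6, 7)]
minute, share = dominant_minute(events)
print(minute, share)  # 33 1.0
```

A share close to 1.0 on a single minute may point toward something scheduled hourly (a job, a task, a policy refresh) rather than random network trouble, so it could be worth comparing that minute against the schedules on all three nodes.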

  • Lots of connection timeout and waitforharden events were found in Extended Events.

  • I cannot share the screenshot, though; the photo link is http://s30.This image host is not supported, please use another/tvnvvvsxd/alwayson_001.png

  • I don't understand the SQL Diag log. It says the error may have been caused by client or server login timeout expiration, but according to the log itself, the time spent in every stage is 0 ms. I am very confused by SQL Server AlwaysOn.

    This may have been caused by client or server login timeout expiration. Time spent during login: total 0 ms, enqueued 0 ms, network writes 0 ms, network reads 0 ms, establishing SSL 0 ms, network reads during SSL 0 ms, network writes during SSL 0 ms, secure calls during SSL 0 ms, enqueued during SSL 0 ms, negotiating SSPI 0 ms, network reads during SSPI 0 ms, network writes during SSPI 0 ms, secure calls during SSPI 0 ms, enqueued during SSPI 0 ms, validating login 0 ms, including user-defined login processing 0 ms.

    Here is the complete error message:

    [hadrag] SQL Server component 'query_processing' health state has been changed from 'warning' to 'clean' at 2015-06-22 12:27:39.833

    [Diagnostic payload; its XML markup was lost in pasting, so only the readable fields are kept.]

    Memory brokers: MEMORYBROKER_FOR_COMMITTED (GROW), MEMORYBROKER_FOR_HASHED_DATA_PAGES (GROW), MEMORYBROKER_FOR_XTP (GROW)

    Memory state records: MEM_STEADY, MEMPHYSICAL_HIGH, MEMPHYSICAL_LOW

    Connection error records (LOGON, 0x00000010, BUFFER), recorded three times for the same client, with a total login time of 0 ms in the first record and 1 ms in the other two:

    Network error code 0x2746 occurred while establishing a connection; the connection has been closed. This may have been caused by client or server login timeout expiration. Time spent during login: total 0 ms, enqueued 0 ms, network writes 0 ms, network reads 0 ms, establishing SSL 0 ms, network reads during SSL 0 ms, network writes during SSL 0 ms, secure calls during SSL 0 ms, enqueued during SSL 0 ms, negotiating SSPI 0 ms, network reads during SSPI 0 ms, network writes during SSPI 0 ms, secure calls during SSPI 0 ms, enqueued during SSPI 0 ms, validating login 0 ms, including user-defined login processing 0 ms. [CLIENT: 10.1.16.142]

    Login timer records (LognTimers, TDS), recorded three times: 0x00000026 DisconnectDueToReadError, NetworkErrorFoundInInputStream, NormalDisconnect

    Repeated records (many occurrences): ImpersonateSecurityContext, NLShim, Impersonate
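To check the per-stage login timings across many such messages instead of reading them by eye, the "Time spent during login" sentence can be split into a stage-to-milliseconds mapping. This is a rough sketch: `parse_login_timers` is a hypothetical helper, and the sample string is an abbreviated version of the message quoted above, not a full log line.

```python
import re


def parse_login_timers(message):
    """Return {stage: milliseconds} from a 'Time spent during login' message."""
    match = re.search(r"Time spent during login: (.*?ms)\.", message)
    if match is None:
        return {}
    timers = {}
    for part in match.group(1).split(", "):
        stage = re.fullmatch(r"(.+) (\d+) ms", part)
        if stage:  # skip fragments that do not look like "<name> <n> ms"
            timers[stage.group(1)] = int(stage.group(2))
    return timers


# Abbreviated sample based on the message in this thread.
sample = (
    "This may have been caused by client or server login timeout expiration. "
    "Time spent during login: total 1 ms, enqueued 0 ms, network writes 0 ms, "
    "validating login 0 ms. [CLIENT: 10.1.16.142]"
)
timers = parse_login_timers(sample)
nonzero = {stage: ms for stage, ms in timers.items() if ms > 0}
print(nonzero)  # {'total': 1}
```

If every stage stays at or near 0 ms, as in the records above, the connection was closed almost immediately after it was opened, which does not look like a login that waited for a timeout to expire.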
