DB mirroring: unexpected failover

Question

Dhruva_51

SSChampion

Points: 10521
April 19, 2013 at 7:30 am

#276677

Hi All,
I had an unexpected automatic failover from the principal to the mirror server.
Our reports show network traffic spiking from 32MB to 1117MB during that period, but a spike like that is normal during business hours.
Mirroring is configured in high-safety (synchronous) mode with automatic failover and a witness server.
The one task running at the time was a copy of a 1.8GB compressed backup file to the principal server.
Could the copy have caused the network spike? We do this all the time, so I don't expect it to be the issue.
I could not find any specific errors in the log. The errors we did find are listed below.
I would like to know exactly why the failover happened. Can someone please help me analyse the root cause?
Error 1:
The command failed because the database mirror is busy. Reissue the command later.
Error 2:
SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [E:\templogfile\templog.ldf] in database [tempdb] (2). The OS file handle is 0x0000000000000514. The offset of the latest long I/O is: 0x000000000b5200
Error 3:
The mirroring connection to "TCP://XXXXXXX:5022" has timed out for database "dbname" after 10 seconds without a response. Check the service and network connections.

Lynn Pettis SSC Guru Points: 442467 · Answer 1

The spike may have delayed communication between the principal and witness servers. You may want to increase the failover timeout from 10 seconds to 30 seconds. We had to do this at a previous employer where I had set up database mirroring, because our network was not the most stable and we had periodic glitches during high-volume times.
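
As a rough sketch of that change, the current partner timeout can be checked and then raised with the statements below. `dbname` is only the placeholder from the error message above, and 30 seconds is the value that worked for us, not a general recommendation.

```sql
-- Run on the principal server.
-- Show role, state and current partner timeout for every mirrored database.
SELECT DB_NAME(database_id)          AS database_name,
       mirroring_role_desc,
       mirroring_state_desc,
       mirroring_safety_level_desc,
       mirroring_witness_state_desc,
       mirroring_connection_timeout  -- in seconds; the default is 10
FROM sys.database_mirroring
WHERE mirroring_guid IS NOT NULL;    -- only databases that are mirrored

-- Raise the partner timeout from 10 to 30 seconds.
-- [dbname] is a placeholder; substitute the mirrored database name.
ALTER DATABASE [dbname] SET PARTNER TIMEOUT 30;
```

A longer timeout also means a real outage takes longer to detect, so the value is a trade-off between false failovers and failover speed.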

Dhruva_51 SSChampion Points: 10521 · Answer 2

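But increasing the timeout might not tell us the actual root cause of why it happened.

I am looking more closely at the I/O error we received; it looks like a disk I/O issue. I ran Perfmon and saw that Avg. Disk sec/Transfer is >0.015 seconds during the file copy.

I also noticed that during a copy of a file of around 4GB to one of the disk drives, SQL Server hung and everything was frozen for a couple of minutes. The databases on the mirror server showed as (Disconnected / In Recovery) and returned to a normal state after a few minutes. Can you direct me on this? Thanks.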
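
One way to cross-check the Perfmon reading from inside SQL Server is the file-level I/O statistics DMV. This is only a sketch: the numbers are cumulative since the instance last started, so they show which files are slow on average rather than what happened at the moment of the failover.

```sql
-- Average I/O latency per database file since the last instance restart.
SELECT DB_NAME(vfs.database_id) AS database_name,
       mf.physical_name,
       vfs.num_of_reads,
       vfs.num_of_writes,
       -- NULLIF avoids a divide-by-zero for files with no reads or writes
       vfs.io_stall_read_ms  / NULLIF(vfs.num_of_reads, 0)  AS avg_read_ms,
       vfs.io_stall_write_ms / NULLIF(vfs.num_of_writes, 0) AS avg_write_ms
FROM sys.dm_io_virtual_file_stats(NULL, NULL) AS vfs
JOIN sys.master_files AS mf
  ON mf.database_id = vfs.database_id
 AND mf.file_id     = vfs.file_id
ORDER BY avg_write_ms DESC;
```

Taking a snapshot before and after a test file copy and comparing the two would give a picture of latency during the copy itself.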

Dhruva_51 SSChampion Points: 10521 · Answer 3

One more thing to add: the servers (principal, mirror and witness) are all virtual.

Lynn Pettis SSC Guru Points: 442467 · Answer 4

muthyala_51 (4/19/2013)
But increasing the timeout might not tell us the actual root cause of why it happened.
I am looking more closely at the I/O error we received; it looks like a disk I/O issue. I ran Perfmon and saw that Avg. Disk sec/Transfer is >0.015 seconds during the file copy.
I also noticed that during a copy of a file of around 4GB to one of the disk drives, SQL Server hung and everything was frozen for a couple of minutes. The databases on the mirror server showed as (Disconnected / In Recovery) and returned to a normal state after a few minutes. Can you direct me on this? Thanks.

Root cause? Your principal and witness servers were unable to communicate during the timeout period, which resulted in the witness determining that the principal server was down and initiating a failover to the mirror.

Why? Not enough network bandwidth to communicate, due to the large data transfer(s) occurring.

Once again, I had this issue at a previous employer, and the resolution was to increase the timeout period before a failover occurred. This solved the issue of our somewhat unstable network causing a failover when there really wasn't a problem. Our automatic failover worked fine when there were real problems with our servers.

Neeraj Dwivedi SSCertifiable Points: 6768 · Answer 5

Lynn Pettis is right. But one variable here is that SQL Server is virtualized. If vMotion is enabled and the principal or mirror VM is moved because of memory/CPU ballooning, this can happen.

I have seen this in our environment, and we have now disabled DRS for our SQL VMs for that reason alone.