Today I got a database mirror case, I was told the all databases were in disconnected\in recovery status on both primary and mirror server, it is a dev environment.
Here just share my troubleshooting steps:
1. First, I login primary and mirror server. I got the same database status on both 2 servers
so it showed to me:
a) since all databases had problem, it should be a system level error, not single database issue
b) database in "disconnected/In Recovery" status, it indicts the server not only lost communication with the partner server(primary or mirror), but also lost communication with witness. so sql server can not bring database online on either servers, or else it will cause split-brain problem.
the question now is : why the 3 servers lost communication with each other?
2. Check connection among 3 servers with telnet command
telnet [sqlservername] [port]
all 3 servers can connect each other, so the network should be fine, and endpoing port is opened, firewall should be fine.
3. Check SQL Server Errorlog
I checked the sql server errorlog on all 3 servers(primary, mirror, and witness), I found same error message below:
2013-11-06 23:07:47.70 Logon Database Mirroring login attempt by user 'xxx\xxxxx-db.' failed with error: 'Connection handshake failed. The login 'xxx\xxxxx-db' does not have CONNECT permission on the endpoint. State 84.'. [CLIENT: xx.xx.xx.xx]
so it should be permission account of account 'xxx\xxxxx-db'
4. Check Account 'xxx\xxxxx-db' in sql server
By checking account permission, it shows the account is sys admin on every sql servers. so there should be no permission issue.
in your case, if you find account is not sys admin, make sure you grant connect endpoint permission to that account.
so what's wrong here?
Although the account is sys admin in sql server, but it doesn't mean the account is 100% ok, because it is window domain account, when using window authentication mode, sql server will ask window to authenticate the user account first(by NTLM or Kerberos), if windows authentication is failed, sql server will reject that login request.
so I tried to login sql server with the account 'xxx\xxxxx-db', this time I got error message below:
Good. we are closed to the answer. normally, you can find the detail information of login failed error in errorlog just like below:
2006-02-27 00:02:00.34 Logon Error: 18456, Severity: 14, State: 8.
2006-02-27 00:02:00.34 Logon Login failed for user '<user name>'. [CLIENT: <ip address>]
the most important information is "State" code:
Attempt to use a Windows login name with SQL Authentication
Login disabled and password mismatch
Valid login but server access failure
SQL Server service paused
But unfortunately, this time I didn't find anything from errorlog! then how to capture the detail information? you are right, Profiler trace!
I run the Profiler trace on sql server, and enabled the "Audit login failed" event only. finally I got the detail error message below
by searching the error message, I found the link below
It prove my guess that the window authentication has problem, in other words, the Account should have problem.
what is the next step?
5. Check if there was any changes made before the error occurred
I am not a AD expert, troubleshooting the window account authentication is little bit hard for me, I can do it, but it might take several hours or days to troubleshooting it. so the easiest way is checking with the guy who report this issue to me, or ask AD expert to look into it.
I asked the engineer who report this issue : is there any change they made on the account 'xxx\xxxxx-db' before the error occurred? suddenly they seems to remember something, and told me wait a minute, then they come back to me and said the issue has been solved, the mirror is recovered now. I went back to the server, it is true, all database are in mirror synchronized status.
I asked them what happen, they told they made some changes on that account(like pull the account out of some domain groups), if I didn't mention the account to them, they would never think about the problem was caused by their change. so problem is solved.
Despite no root cause here, we have some steps to narrow down the problem, and find a fastest way to solve it.