This is the last part of my confession series, since it is showing me only in a negative light. I wish there were positives that I could blog about, but I guess there are not many. I must be doing OK, since I am still employed here and have been promoted to manager of our department.
The last confession from me is that I do NOT trust clustering. Everyone says learn from your mistakes, and that is exactly what has happened in this case. I know that there are some high-profile SAN advocates, but this is how I feel.
We used to have our database clustered some three years back. The SAN was from a very reputable company. It was set up pretty well. We did not have any performance issues with the SAN, and we did not notice many waits due to IO.
The first symptom appeared when the cluster started failing over to the other node and then failing back. We never figured out what exactly was wrong. We did talk to support, and they gave us so many possible reasons for why it could happen that it became impossible to pinpoint the cause.
One day while I was at church with my family, I got a call from our support team saying that the database was down. I rushed to the datacenter where we were hosting our servers and found out that we had lost our SAN. Support was called, and they could not help us bring the SAN back up. The admins fought with the SAN for nearly three hours before they finally gave up on it. The worst part of the whole thing was that our replicated DBs were also using the SAN for their storage. Therefore we did not just lose the production server; we lost our replication environment as well.
Thankfully we had a warm standby (Log Shipping) with a RAID 10 configuration that was not sharing the SAN. We had to bring that online. We were shipping the log once every five minutes, so we lost around three to four minutes of data. But at least we had something. We had to set up replication from scratch.
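To put rough numbers on that data loss, here is a minimal sketch in Python. It is not anything we actually ran; the function names and the timestamps are made up for illustration, and it assumes the only thing that matters is the time between the last log that reached the standby and the moment the primary died (it ignores the time it takes to back up, copy, and restore each log).

```python
from datetime import datetime, timedelta


def data_loss_window(last_shipped_at: datetime, failed_at: datetime) -> timedelta:
    """Time span of changes that never reached the standby.

    Hypothetical helper: everything written after the last shipped log
    and before the failure is lost when the standby is brought online.
    """
    if failed_at < last_shipped_at:
        raise ValueError("failure time is earlier than the last shipped log")
    return failed_at - last_shipped_at


def worst_case_loss(ship_interval: timedelta) -> timedelta:
    """Simplified upper bound: a failure just before the next log ships."""
    return ship_interval


# Illustrative values only; these are not the real times of our outage.
interval = timedelta(minutes=5)
last_shipped = datetime(2000, 1, 1, 10, 0, 0)
failure = datetime(2000, 1, 1, 10, 3, 30)

loss = data_loss_window(last_shipped, failure)
print("lost:", loss, "worst case:", worst_case_loss(interval))
```

With a five minute shipping interval, losing three to four minutes is close to the worst you would expect under this simplified model. Shipping more often shrinks that window, at the cost of more frequent log backups.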
The top engineers from the SAN provider came down to our island to figure out why it went down. They were able to bring the SAN back online after two days. That was too late for us. And the worst part of it was that they could not figure out why the SAN went down. When your own product goes down and you can't figure out why, then something is just not right. They were saying that it could be the IO controller, but they never gave a definite answer. From that day on we decided that we would not be using shared disk anymore. I am sure lots of you will disagree, but for me the last straw was the fact that no one was able to tell us why exactly it went down.