SQLServerCentral is supported by Redgate

Confessions of a DBA – Part 4

This is the last part of my confession series, since it shows me only in a negative light. I wish there were positives that I could blog about, but I guess there are not many. I must be doing OK, since I am still employed here and have been promoted to manager of our department.

The last confession from me is that I do NOT trust clustering. Everyone says to learn from your mistakes, and that is exactly what has happened in this case. I know that there are some high-profile SAN advocates, but this is how I feel.

We used to have our database clustered some three years back. The SAN was from a very reputable company. It was set up pretty well. We did not have any performance issues with the SAN, and we did not notice many waits due to IO.

The first symptom appeared when the cluster started failing over to the other node and then failing back. We could not figure out what exactly was wrong. We did talk to support, and they gave us so many possible reasons why it could happen that it became impossible to pinpoint the cause.

One day while I was at church with my family, I got a call from our support team saying that the database was down. I rushed to the datacenter where we were hosting our servers and found out that we had lost our SAN. Support was called, and they could not help us bring the SAN back up. The admins fought with the SAN for nearly three hours before they finally gave up on it. The worst part of the whole thing was that our replicated DBs were also using the SAN for their storage. Therefore we did not just lose the production server; we lost our replication environment as well.

Thankfully we had a warm standby (Log Shipping) with a RAID 10 configuration that was not sharing the SAN. We had to bring that online. We were shipping the log once every five minutes, so we lost around three to four minutes of data. But at least we had something. We had to set up replication from scratch.

The top engineers from the SAN provider came down to our island to figure out why it went down. They were able to bring the SAN back online after two days. That was too late for us. And the worst part was that they could not figure out why the SAN went down. When your own product goes down and you can't figure out why, then something is just not right. They said that it could be the IO controller, but they never gave a definite answer. From that day on we decided that we would not be using shared disks anymore. I am sure lots of you will disagree, but for me the last straw was the fact that no one was able to tell us why exactly it went down.


Posted by John Sansom on 19 April 2011

Roy, this post is certainly not showing you in a negative light. Far from it sir!

The most successful people got to where they are today by making more mistakes than everyone else! The difference is that they were not afraid to make them.

Making mistakes can often be one of the first steps on the road to learning great things. After all, learning more is what being a DBA is all about.

Keep making mistakes ;-)

Posted by Roy Ernest on 19 April 2011

Thanks John for the words of encouragement... One thing I know is that if I make a mistake, I do not repeat it again. :-)

Posted by Jason Brimhall on 19 April 2011

If you can recognize that you made a mistake and then learn from that mistake - there is nothing negative about that.

Posted by Roy Ernest on 19 April 2011

True, true... But not everyone thinks like that.
