SQL 2005 Cluster FUBARed - I am being asked to repair it tomorrow.

  • Hello All,

    I have been reading SQL Server Central for years, but never posted before. However, I am being asked to fix something that someone else screwed up and I wanted any advice you may have to offer. (I have removed this guy's admin rights so he can't screw anything up in the future; yes, it was my mistake that he had them to begin with.)

    I have a 2-node cluster (Active/Passive) running Windows 2003 Enterprise SP2 along with SQL 2005 STD with SP2. Each cluster machine has local disks for the OS, but all of the SQL data drives are on an iSCSI SAN. This cluster is NOT in production yet, but they want it ready ASAP, so I need to get cracking on fixing the issue.

    On Friday another guy in the company was replacing the NICs used for the iSCSI connection due to performance issues we found with the NICs themselves. Everything went fine on the passive node, so he failed the cluster over to it and started working on the other machine. All went well until the reboot of the 2nd machine: cluster services wouldn't start, the error being that the Cluster DB was corrupt.

    He decided that he would evict the bad node and re-add it to the cluster so it would resync the database, since the active node was running fine. So he evicted the passive node and attempted to re-add it. This failed with an error that it wasn't able to update the Cluster DB on the new node. So he ran "cluster node /force" on the evicted node. This completed and he attempted to add the node again; it failed again, unable to sync the DB on the new node.

    This is where he royally messed things up: he ran "cluster node /force" on the ACTIVE node. This of course cleaned up the cluster database on the active node, effectively killing the cluster. So now SQL won't run on either server and the cluster needs to be set up again.

    So I have 2 machines where the OS is just fine and SQL 2005 is installed in clustered mode on both nodes, but neither is a member of a cluster any longer. I need to recreate the cluster and get SQL back into clustered mode.

    My plan of action was to attempt to uninstall SQL on both servers, recreate the cluster, reinstall SQL in clustered mode, then apply SQL SP2. Does that sound like a good course of action?

    I would appreciate ANY tips anyone would have.

  • Clustering can be complicated and I'm not sure how many people would have encountered this. If this isn't critical data and you have backups, or can afford delays or losses, I might tackle it as you mentioned.

    If not, call MS.

  • Thanks

    The data isn't critical.

    I just need to get it repaired.

    I guess I will attempt what I laid out.

    If anyone has suggestions, feel free to lay them out.

  • I am with Steve here: call Microsoft Support. You have an incredibly esoteric situation that even they may not have encountered. Uninstalling/reinstalling SQL Server MAY work, but I wouldn't be surprised if you are advised to restart from formatted drives. 🙁 If nothing else, it would be guaranteed to work. Just redoing SQL Server may appear to work but have some issue that screws you 7 months from now when you really need the clustering to be bullet-proof.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Thanks for the advice everyone.

    I was able to successfully repair the cluster.

    It was pretty easy.

    These are the steps that I followed to recover the cluster:

    1) I first removed the cluster database from both nodes following the steps on http://support.microsoft.com/kb/282227

    2) I then ran "cluster node /forcecleanup" on each node to ensure that it really was cleaned up.

    3) I then followed the instructions to manually uninstall SQL. Basically you just need to uninstall SQL on each node. http://msdn.microsoft.com/en-us/library/ms180973(SQL.90).aspx

    4) I set up the cluster following the normal best practices from MS.

    5) I checked MSDTC to make sure it was installed and configured it on the cluster.

    6) I installed clustered SQL 2005 RTM, then applied SQL SP2.

    I checked all the logs and they were error free. I restored the SQL databases we had from a backup and everything is back online. All the apps that were using the cluster are back online and I am going to monitor it for issues.
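    For anyone repeating this, the order of the steps above can be written down as a small dry-run checklist. This is only a sketch under assumptions: the node names NODE1 and NODE2 are placeholders, passing the node name as `cluster node <name> /forcecleanup` is assumed for step 2, and nothing is executed. The manual steps are carried as plain text so the ordering is explicit.

```python
from dataclasses import dataclass
from typing import List

# Placeholder node names (hypothetical); substitute the real ones.
NODES = ["NODE1", "NODE2"]


@dataclass(frozen=True)
class Step:
    number: int   # step number from the list in the post
    target: str   # a node name, or "cluster" for cluster-wide steps
    action: str   # a command line, or a description of a manual task


def build_plan(nodes: List[str]) -> List[Step]:
    """Return the recovery steps in order; per-node steps are repeated for each node."""
    plan: List[Step] = []
    for node in nodes:
        plan.append(Step(1, node, "manual: remove the cluster database per KB 282227"))
    for node in nodes:
        # Assumed command form; the post only quotes "cluster node /forcecleanup".
        plan.append(Step(2, node, f"cluster node {node} /forcecleanup"))
    for node in nodes:
        plan.append(Step(3, node, "manual: uninstall SQL Server 2005"))
    plan.append(Step(4, "cluster", "manual: create the cluster and add both nodes"))
    plan.append(Step(5, "cluster", "manual: check MSDTC is installed and configure it on the cluster"))
    plan.append(Step(6, "cluster", "manual: install clustered SQL 2005 RTM, then apply SP2"))
    return plan


if __name__ == "__main__":
    # Dry run: print the checklist, execute nothing.
    for step in build_plan(NODES):
        print(f"{step.number}) [{step.target}] {step.action}")
```

    Finishing each per-node step on both nodes before moving on mirrors the order in the post: both nodes are cleaned before the cluster is recreated.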

  • Thanks for this valuable feedback (with KB refs).

    If you still have the occasion, apply the latest tested cumulative update.

    Did you find a reason for your cluster DB to crash?

    Johan

    Learn to play, play to learn !

    Don't drive faster than your guardian angel can fly ...
    but keeping both feet on the ground won't get you anywhere

    - How to post Performance Problems
    - How to post data/code to get the best help

    - How to prevent a sore throat after hours of presenting ppt

    press F1 for solution, press shift+F1 for urgent solution 😀

    Need a bit of Powershell? How about this

    Who am I ? Sometimes this is me but most of the time this is me

  • I have no problem coming back and updating a request for help. All too often I find people asking questions and it ends with them posting "Thanks, I got it fixed" without any more details.

    I do want to apply the latest update, but I wanted to run it for a few days at the same patch level it was at prior to the rebuild. If everything goes well, I will update it then. However, if I patch it now and need to troubleshoot an issue today or tomorrow, I won't be able to tell whether the rebuild or the patching caused the issue.
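    One way to confirm that a rebuilt instance really is at the same patch level as before is to compare the build number reported by `SELECT SERVERPROPERTY('ProductVersion')` against the known SQL 2005 baselines. The sketch below is an illustration, not part of the original procedure: the function name `patch_level` is hypothetical, and the table deliberately covers only RTM through SP2, so later builds are reported simply as being above SP2.

```python
# SQL Server 2005 baseline builds (third field of the 9.00.xxxx version string).
BASELINES = [
    (1399, "RTM"),
    (2047, "SP1"),
    (3042, "SP2"),
]


def patch_level(product_version: str) -> str:
    """Map a SQL 2005 ProductVersion string such as '9.00.3042.00' to a patch level.

    Returns the baseline name on an exact match, or '<name>+ (build N)' when the
    build is above a baseline (for example after a cumulative update or hotfix).
    """
    parts = product_version.strip().split(".")
    if len(parts) < 3 or parts[0] != "9":
        raise ValueError(f"not a SQL Server 2005 version string: {product_version!r}")
    build = int(parts[2])

    level = None
    for base, name in BASELINES:
        if build >= base:
            # Keep the highest baseline that the build has reached.
            level = name if build == base else f"{name}+ (build {build})"
    if level is None:
        raise ValueError(f"build {build} is below the RTM build")
    return level


if __name__ == "__main__":
    print(patch_level("9.00.3042.00"))
```

    Recording this value on both nodes before and after any later update makes it easier to tell a rebuild problem from a patching problem.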

    The corrupt Cluster DB was caused by NICs being moved around. We were having issues with Broadcom NICs with our SAN (it is a known issue with the model of Broadcom NIC we had). The cluster DB didn't like the swap. No hardware was actually changed; we just changed which NICs we were using to talk to the SAN. One of the nodes went just fine, but the 2nd one wasn't as smooth.

  • I recommend (if not done yet) that you do a forced-failover (and fail back) under load to ensure everything is truly functional. Glad you had a (relatively) simple resolution!

    Best,
    Kevin G. Boles

  • Yep, that was part of my testing...everything so far seems to be stable.

  • Just another flash in the brain ..... or what's left of it;)

    When you switched the NICs, did you rename them?

    We tend to use "public network connection" for our NIC that is connected to the network.

    I wouldn't be surprised if the cluster software flipped out when your card's logical name was no longer present.

    Johan


  • ALZDBA (10/22/2008)


    Just another flash in the brain ..... or what's left of it;)

    When you switched the NICs, did you rename them?

    We tend to use "public network connection" for our NIC that is connected to the network.

    I wouldn't be surprised if the cluster software flipped out when your card's logical name was no longer present.

    I sure as heck hope the cluster service isn't dependent on something as transitory as a logical name. Probably uses deviceid or MAC address (I would hope).

    Best,
    Kevin G. Boles

  • I did rename the NICs, but not for the reason you mentioned. We have 6 NICs in each machine, so I named them to keep them straight.

    Either way, I rebuilt the cluster from scratch, so I don't think there would be any confusion based on the old names.
