Mirrored Replication and a failed failover

  • I've run into an interesting situation that I can't see the best and safest way out of.

    We run a mirrored DB that is also replicated (a separate clustered distributor and about 10 load-balanced subscribers to scale out reads, etc.).

    On Monday evening our sysadmins pushed through a firewall change that turned out to be problematic. I asked for a port to be opened and they unknowingly forced a desktop group policy onto our DB servers. They know about the mess they caused, but that's outside of my immediate problem here. The end result was that basically all of the ports we needed (such as 1433) were closed. Because the policy was deliberately pushed out at staggered, random times, we lost things at slightly different moments. First to go was the principal. When that went, mirroring did what it should have done and started a failover (we are on SQL Server 2008 Standard Edition, by the way). During the failover we lost what was the original mirror (in the same way). Not long after, the rest toppled (including the distributor, which was acting as the witness but will soon be a separate instance). We had a good 20k connections on the front end at the time.

    We identified the problem quickly but were left with a dead site. BOTH the original principal and the mirror showed as disconnected and in the principal role. Replication Monitor showed BOTH servers/instances as having the publications (although it no longer does). Having to make a decision quickly to get things working, I binned the mirroring (using ALTER DATABASE XX SET PARTNER OFF) and got us a working site. There had been no connections to the mirror, so I have a reliable set of data on the principal. The hours of downtime probably cost the company £30k, so it had to be a quick decision. The sacrifice, as I saw it, was to lose HA and gain a working site. I ran the same command on the mirror as well to cover my bases, and then took a backup of both just in case.
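
    For anyone needing the same escape hatch, it was essentially this (XX stands in for the database name and the backup path is made up):

        -- Run on each partner that is still reachable. This removes the
        -- mirroring session and leaves the local copy as a standalone database.
        ALTER DATABASE XX SET PARTNER OFF;

        -- The safety-net backup taken straight afterwards on both servers:
        BACKUP DATABASE XX TO DISK = N'D:\Backups\XX_PostIncident.bak';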

    From where I am now, that looks like the right decision, as we are working well. What we don't have is the mirroring.

    Looking at the mirror server, what I need to do is bin its copy of the DB and redo the mirror. However, I can't drop the DB because it is marked as being replicated (not surprising).
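
    (For context, the drop is blocked along these lines; XX is the database name again:)

        DROP DATABASE XX;
        -- Msg 3724: Cannot drop the database 'XX'
        -- because it is being used for replication.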

    I can stop that server/instance without any detrimental consequences for the site. However, when I removed a non-essential publication on the mirror, it was removed from the original principal as well. Therein lies my dilemma. I've seen this once before in a test environment but can't for the life of me remember what I did to sort it.

    Replication on the original principal is working fine.

    What I need to do is remove the DB from the original mirror without affecting the existing replication. I'm cautious, as you would expect. Redoing replication is not really an option, as we have roughly a billion rows in the DB and it would take a couple of hours' downtime to push it elsewhere. Rebuilding the mirror once that's done is a doddle, but how can I do this as safely as possible? If I have to bin the instance or the OS, that's doable, but there must be an easier way out that I've not found yet.

    Any help would be appreciated

    Thanks

    Ryan

  • Sussed it. I set the (bad) copy offline and then dropped the DB. Everything is as it should be, and I can start to put the mirroring back on again soon.
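
    In case anyone hits the same thing, it was essentially this, run on the old mirror only (XX is the database name; the WITH ROLLBACK IMMEDIATE is just my habit for kicking out any stray connections):

        -- Taking the database offline first lets the DROP go through even
        -- though the database is still marked for replication.
        ALTER DATABASE XX SET OFFLINE WITH ROLLBACK IMMEDIATE;
        DROP DATABASE XX;

    One thing to note: dropping an offline database removes it from the instance but leaves the data and log files on disk, so they need deleting by hand afterwards.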

    For extra paranoia, we also altered the switch to completely isolate that server while we did it. It might not have been needed, but it was an easy tweak to add to the mix.

  • Ryan,

    Slightly off topic, but it sounds to me like you are using mirroring for HA and replicating your mirrored DB. Do your subscribers also need to be highly available?

    We looked at doing something similar but needed both our OLTP DB (the publisher) and our Reporting DB (the subscriber) to be highly available. Mirroring both DBs with a witness works fine, but the distribution DB is still a single point of failure.

    We eventually went with a failover cluster for the lot for HA, but I would be interested to hear whether you have a solution for the distribution DB.

    Cheers

    Andrew

  • Andrew,

    Our distributor is clustered (active/passive), so that answers some of the issues we needed to solve on that side. That leaves the shared disks as the single point of failure, but with multiple spindles in a RAID 10 set we're covered as well as I could sensibly manage with last year's budget. For next year, I'm looking into GDC (Geographically Dispersed Clustering) specifically for the distributor so that we can fail over to a different data centre (possibly long distance). I'm expecting to start that work in 6 months or so, which means it's early days working everything out. The basic thought I've got there is to allow for two SANs and split the writes for replication. I need to do more reading up, in all fairness, but I need to end up able to move the lot if needs be with minimum downtime. We use an EMC CLARiiON, so I've been reading up on RecoverPoint, which initially looks very interesting.

    For the main DB, it could well mean the principal and mirror at different data centres and the witness at a third site. That does concern me somewhat, so I'm also considering GDC for that too. We're looking at how we do this with 60 front-end servers, 10+ DBs, and everything else.

    The subscribers are all identical VMs (except the one for reporting, which has more indexes). We load balance at the network level, so the connection string points to a 'group' IP address. With them being identical, I can effectively move servers in and out of production as I need to. I ended up with two sets of servers for two different needs, plus one for reporting and another for general queries. During the initial setup we did need to point everything at the main DB and run it VERY hot for a short period, so we can take the load on the main server if we absolutely have to. It's scary stuff when you see the workload going on, and I've done a huge amount of work to allow it to handle significant load.
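
    (Nothing clever on the connection side; it's roughly a standard connection string pointed at the load balancer's virtual IP rather than at any individual subscriber. The IP and names here are made up:)

        Server=10.10.0.50,1433;Database=XX;Integrated Security=SSPI;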

    But the basics are:

    Mirror -> Clustered Distributor -> Load Balanced Subscribers

    I also split replication into two sets, so if I mess up one I don't affect the other. I've gone a very long way into the setup we have, which is way beyond what I would normally have done. I'm part way through writing that up and may submit it to this site once finished.

    Ryan
