I found that the brittleness people see in replication comes from the complexity and flexibility of the facility, not from an inherent problem with it: the failures were almost always down to inadvertent misconfiguration. There are thousands of options, many with profound implications:
- Will deletes be replicated?
- Will the entire publication be reset and therefore a complete refresh on every subscriber?
- How will updates be handled?
- Will a published article cause a full drop and recreate on the subscribers?
- Will the replication stored procs be auto-generated on the target?
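Each of those questions maps to a concrete article-level setting. As a rough sketch (the publication, article, and table names here are invented for illustration; the parameter choices are the point), the `sp_addarticle` options that answer them look like this:

```sql
-- Hypothetical names throughout; what matters is which parameter controls which behaviour.
EXEC sp_addarticle
    @publication      = N'ProdToReporting',          -- invented publication name
    @article          = N'Orders',
    @source_owner     = N'dbo',
    @source_object    = N'Orders',
    @del_cmd          = N'NONE',                     -- deletes are NOT replicated
    @upd_cmd          = N'SCALL sp_MSupd_dboOrders', -- updates sent as changed columns only
    @pre_creation_cmd = N'drop',                     -- adding the article drops and
                                                     -- recreates the table on subscribers
    @schema_option    = 0x0000000000000003;          -- 0x01 object creation script +
                                                     -- 0x02 auto-generate the replication
                                                     -- stored procs on the target
```

The "full reinitialize" question is governed at the publication level rather than the article level; in particular, with `@immediate_sync` enabled on `sp_addpublication`, adding an article can trigger a snapshot of the whole publication, which is exactly the kind of surprise described above.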
Get any of these wrong and replication will appear brittle. For me, the knowledge to make it robust and reliable was hard and painfully won; the real problem is the pain of learning the system's ins and outs. The team I worked with gained such a detailed understanding of replication, and of the implications of its settings, that when a data analyst needed access to a table previously available only in the production database, we could have it published within a few minutes.
One SQL Server 2000 pain point has since been addressed: a replication failure would warn you three times and then cease to warn. For that reason we had to switch the Distribution Agent from continuous mode to 5-minute batch runs, so that each run would raise its own three warnings.
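The workaround amounts to replacing the agent's continuous schedule with a short repeating one. A sketch of the idea using SQL Server Agent's scheduling procedures (the job name is invented; on a real Distributor you would target the actual Distribution Agent job, and also remove the `-Continuous` parameter from its run step):

```sql
-- Attach a 5-minute schedule to the (hypothetically named) Distribution Agent job,
-- so each run is a discrete batch that raises its own failure alerts.
USE msdb;
EXEC dbo.sp_add_schedule
    @schedule_name        = N'Every5Minutes',
    @freq_type            = 4,   -- daily
    @freq_interval        = 1,
    @freq_subday_type     = 4,   -- repeat in units of minutes
    @freq_subday_interval = 5;   -- every 5 minutes
EXEC dbo.sp_attach_schedule
    @job_name      = N'MyDistributionAgentJob',  -- invented job name
    @schedule_name = N'Every5Minutes';
```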
I would not want to lose the Publisher/Distributor/Subscriber model, as it is far more flexible than Master/Slave.
The performance of replication post-2008 is astonishing. It used to be that a problem communicating with the Distributor or the Subscribers had to be fixed within a very short window, or the production log files would bloat and eventually cause an outage. At one point we were worried that one of our busiest databases would not survive the downtime needed to service pack a Distributor, and that transactions would take weeks to catch up on the Subscribers; in the event they caught up in about 30 minutes while also coping with peak load.
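Both the catch-up progress and the log-bloat risk can be watched directly. Two quick checks (both standard SQL Server commands, shown here as a sketch of the kind of monitoring involved rather than a full runbook):

```sql
-- On the Publisher: one row per published database showing pending
-- transaction counts, delivery rate, and latency in seconds.
EXEC sp_replcounters;

-- Log file size and percent used for every database,
-- to watch for transaction log bloat while the Distributor is unreachable.
DBCC SQLPERF(LOGSPACE);
```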