The Brittleness of Replication

  • Another one I thought of: having to use the hosts file so you can see Replication Monitor, as you can't see it when your instances have multi-part names, sqlinstance.live.local, etc. (a sample hosts entry is sketched after this post).

    qh

    Who looks outside, dreams; who looks inside, awakes. – Carl Jung.
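
    A minimal sketch of the hosts-file workaround mentioned above. The IP address and names are hypothetical placeholders; the entry simply maps the short server name Replication Monitor looks for to the address behind the multi-part name:

    # C:\Windows\System32\drivers\etc\hosts  (edit as Administrator)
    # Hypothetical address and names, for illustration only.
    10.0.0.25    SQLINSTANCE01    sqlinstance01.live.local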
  • Steve Jones - SSC Editor (8/14/2014)


    Brandon J Williams (8/13/2014)


    My point is that replication novices are quick to point the finger at replication when it's not replication that is the problem.

    We actually haven't done much in the way of getting a stable replication environment; it just works. Sure, we've read BOL, but as far as getting it stable goes, that's it.

    You're conflating your experience with the capability or robustness of replication. I've had it be stable in places, and I've had it inexplicably fail. Certainly some of the failures have to do with a lack of bandwidth, or maybe disk space, or networking, or data, or something else.

    However, that's where replication is brittle. It will fail with things that shouldn't cause it to fail.

    Replication is not brittle. When it fails, due to something like lack of bandwidth, disk space, networking, or data, that is not replication's fault. Any other piece of software could fail under those circumstances. Replication is no different.

    Just because it hasn't for you, doesn't mean it can't or won't.

    That is true. The learning curve with replication can be steep, and while there is usually one way of setting up replication correctly, there are many ways to set it up incorrectly.

    The fact you reinitialize subscriptions leads to my point that it has plenty of room for improvement (along with other features).

    I think you misunderstand. There are publication and article properties that, if changed, require the snapshot to be regenerated and/or subscriptions to be reinitialized. This is no secret; it is spelled out in BOL. So what we have in that case is the need for a maintenance window. The ability to reinitialize subscriptions quickly is about shrinking that maintenance window down as small as possible (see the sketch after this post).

    The impression that replication is brittle is just inaccurate. Just like any other technology, its concepts must be understood to avoid common pitfalls.
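
    A minimal sketch of that reinitialization step, assuming a hypothetical transactional publication named SalesPub in a database named SalesDB; run at the publisher:

    -- Hypothetical publication/database names, for illustration only.
    USE SalesDB;

    -- Mark all subscriptions to the publication for reinitialization.
    EXEC sp_reinitsubscription
        @publication = N'SalesPub',
        @article     = N'all',
        @subscriber  = N'all';

    -- Regenerate the snapshot that the reinitialized subscribers will apply.
    EXEC sp_startpublication_snapshot
        @publication = N'SalesPub';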

  • I'll still disagree with you, Brandon. If something doesn't tolerate imperfections in the environment, it's a bit brittle. There are plenty of ways that replication could auto recover from these events. The fact that it doesn't means it's not as robust as it could be. Or perhaps, should be.

  • Back in the SQL2000 days, replication would retry 3 times if it failed, and on each attempt it would send the DBA an alert. Then it would give up, so those 3 emails would be lost amongst the morass of "important" emails. To get around this we simply switched the jobs to run every minute or so; each time, it would retry 3 times and give 3 alerts (a job-settings sketch follows this post).

    I'd forgotten about the need for a hosts file hack, particularly with multi-part server names. That needs fixing.

    Given the more distributed nature of data these days, and that cloud-based systems can and do fail, replication does need to be improved. It was easy to live with when distributed data was less mission-critical.
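
    A sketch of that every-minute-with-retries setup using the standard SQL Agent procedures in msdb. The job name is a hypothetical placeholder:

    USE msdb;

    -- Hypothetical agent job name, for illustration only.
    -- Retry the agent step 3 times, one minute apart, before the job fails.
    EXEC dbo.sp_update_jobstep
        @job_name       = N'MyDistributionAgentJob',
        @step_id        = 1,
        @retry_attempts = 3,
        @retry_interval = 1;   -- minutes between retries

    -- Run the job every minute so a failed agent is picked up again quickly.
    EXEC dbo.sp_add_jobschedule
        @job_name             = N'MyDistributionAgentJob',
        @name                 = N'Every minute',
        @freq_type            = 4,   -- daily
        @freq_interval        = 1,
        @freq_subday_type     = 4,   -- units of minutes
        @freq_subday_interval = 1;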

  • If the network goes down for whatever reason and comes back up within the retention period, replication will pick up where it left off. It auto recovers from that, no problems there (the retention settings are sketched after this post).

    What specifically would you like it to auto recover from?
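
    For reference, that retention period lives on the distribution database. A minimal sketch of checking and extending it, run at the distributor; the 72-hour value is just an example, and the default database name distribution is assumed:

    -- Show the current settings of the distribution database, including retention.
    EXEC sp_helpdistributiondb;

    -- Extend the maximum distribution retention to 72 hours (example value),
    -- giving agents a longer window to catch up after an outage.
    EXEC sp_changedistributiondb
        @database = N'distribution',
        @property = N'max_distretention',
        @value    = N'72';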

  • OK, I think we have really beat this one to death. Let's move on.

    Rick
    Disaster Recovery = Backup ( Backup ( Your Backup ) )

  • I'd like to be able to turn it off without turning it on again when I do a restore into test.
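
    Until that exists, a common manual workaround is stripping the replication metadata out of the restored copy. A minimal sketch, with a hypothetical restored database name:

    -- Hypothetical database name, for illustration only.
    -- Remove replication objects and settings from a database restored into test.
    EXEC sp_removedbreplication @dbname = N'SalesDB_Test';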

  • I have really un-fond memories of doing replication with an Oracle publisher (with SQL 2005). That was a nightmare: random data conversion errors that were incredibly hard to track down (the errors didn't indicate in any way which article was the problem), subscribers being invalidated for no apparent reason, and at one point the snapshot agent hanging after the 255th article any time it ran. That required a complete tear-down and reconfigure to get past.

    Keep things simple and replication works. Try to get fancy and it's a recipe for pain.

    At the very least it (and many other components of SQL) needs better monitoring built in. Way better. (A roll-your-own check against the distribution database is sketched after this post.)

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
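
    A minimal roll-your-own check in the meantime, assuming the default distribution database name; it simply lists the most recent agent errors the distributor has recorded:

    -- Most recent replication agent errors recorded in the distribution database.
    SELECT TOP (20)
           time,
           source_name,
           error_text
    FROM   distribution.dbo.MSrepl_errors
    ORDER BY time DESC;
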
  • Steve Jones - SSC Editor (8/14/2014)


    I'll still disagree with you, Brandon. If something doesn't tolerate imperfections in the environment, it's a bit brittle. There are plenty of ways that replication could auto recover from these events. The fact that it doesn't means it's not as robust as it could be. Or perhaps, should be.

    It certainly didn't handle problems well back in SQL2000. It was fairly easy, though, to detect some of its failures to handle problems and handle them for it without any human intervention (that's why I say "fragile" rather than "brittle"), and in my opinion that is already enough to make it clear that the monitoring and the error management were of a standard far less than acceptable.

    Far more important were the inexplicable errors: inserting a row with an already existing primary key, updating a row whose primary key didn't exist, deleting a row that didn't exist. The only thing updating the database at the subscriber was replication; there was no software that wrote to it and no-one was updating it, but these things happened.

    They were a big problem, because we were using replication to create (cold) standby copies of critical databases on customer sites, and we aimed to restore service very rapidly (minutes, not hours) if a server went kaput, and servers did go kaput now and again. If a main server had gone phut while a subscription was being reinitialised and we were left with recovery from backups, we wouldn't have met that target, which would have damaged our reputation even though our contracts allowed a much longer time to recover.

    Servers broke simply because hardware breaks sometimes, especially in a country where the mains voltage sometimes fluctuates wildly enough to damage even kit that is certified for use in that country, or where the climate and computer room air conditioning are such that the equipment runs quite a bit above its proper operating temperature, and most of our customers were in such countries and ran such computer room cooling.

    Tom
