The Brittleness of Replication


Steve Jones

Brandon J Williams (8/13/2014) wrote:

"My point is that replication novices are quick to point the finger at replication when it's not replication that is the problem.

We actually haven't done much in the way of getting a stable replication environment, it just works. Sure, we've read BOL, but as far as getting it stable, that's it."

You're conflating your experience with the capability or robustness of replication. I've had it be stable in places, and I've had it inexplicably fail. Certainly some of the failures have to do with a lack of bandwidth, or maybe disk space, or networking, or data, or something else.

However, that's where replication is brittle. It will fail with things that shouldn't cause it to fail. Just because it hasn't for you doesn't mean it can't or won't. The fact that you reinitialize subscriptions leads to my point that it has plenty of room for improvement (along with other features).

Follow me on Twitter: @way0utwest
Forum Etiquette: How to post data/code on a forum to get the best help
My Blog: www.voiceofthedba.com
quackhandle1975

Another one I thought of: having to use the hosts file so that you can see Replication Monitor, as you can't see it when your instances have multi-part names (sqlinstance.live.local, etc.).

qh

Who looks outside, dreams; who looks inside, awakes. – Carl Jung.
Brandon J Williams

Steve Jones - SSC Editor (8/14/2014) wrote:

"You're conflating your experience with the capability or robustness of replication. I've had it be stable in places, and I've had it inexplicably fail. Certainly some of the failures have to do with a lack of bandwidth, or maybe disk space, or networking, or data, or something else.

However, that's where replication is brittle. It will fail with things that shouldn't cause it to fail."

Replication is not brittle. When it fails due to something like a lack of bandwidth, disk space, networking, or data, that is not replication's fault. Any other piece of software could fail under those circumstances. Replication is no different.

"Just because it hasn't for you, doesn't mean it can't or won't."

That is true. The learning curve with replication can be steep, and while there is usually one way of setting up replication correctly, there are many ways to set it up incorrectly.

"The fact you reinitialize subscriptions leads to my point that it has plenty of room for improvement (along with other features)."

I think you misunderstand. There are publication and article properties that, if changed, require that the snapshot be regenerated and/or subscriptions be reinitialized. This is no secret; it is spelled out in BOL. So what we have in that case is the need for a maintenance window. The ability to reinitialize subscriptions quickly has to do with shrinking that maintenance window down as small as possible.

The impression that replication is brittle is just inaccurate. Just like any other technology, its concepts must be understood to avoid common pitfalls.
Steve Jones

I'll still disagree with you, Brandon. If something doesn't tolerate imperfections in the environment, it's a bit brittle. There are plenty of ways that replication could auto recover from these events. The fact that it doesn't means it's not as robust as it could be. Or perhaps, should be.

David.Poole

Back in the SQL2000 days, replication would retry 3 times if it failed, and on each attempt it would provide the DBA with an alert. Then it would give up, so those 3 emails would be lost amongst the morass of "important" emails. To get around this we simply switched the jobs to run every minute or so; each time, the job would retry 3 times and give 3 alerts.
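
The retry behaviour described above can be sketched roughly as follows. This is only a Python illustration of the pattern, not how SQL Server Agent implements it: `run_with_retries`, `run_on_schedule`, and the `alert` callback are hypothetical names, and the limit of 3 retries is simply the figure quoted in the post.

```python
def run_with_retries(attempt, max_retries=3, alert=print):
    """Call attempt() up to max_retries times, alerting on each failure.

    Returns (True, result) on success, or (False, None) once it gives up.
    """
    for n in range(1, max_retries + 1):
        try:
            return True, attempt()
        except Exception as exc:
            alert(f"attempt {n} of {max_retries} failed: {exc}")
    return False, None


def run_on_schedule(attempt, ticks, max_retries=3, alert=print):
    """Restart the retrying job on every scheduled tick (e.g. each minute).

    Returns the tick number on which the job succeeded, or None if it
    never succeeded within the given number of ticks.
    """
    for tick in range(1, ticks + 1):
        ok, _ = run_with_retries(attempt, max_retries, alert)
        if ok:
            return tick
    return None
```

Run once, an outage that outlasts the three attempts leaves the job stopped until someone notices. Rescheduling keeps it trying, at the cost of up to three more alerts per run.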

I'd forgotten about the need for a hosts file hack, particularly in multi-part server names. That needs fixing.

Given the more distributed nature of data these days, and that cloud-based systems can and do fail, replication does need to be improved. It was easy to live with when distributed data was less mission critical.

Newbie on www.simple-talk.com
Brandon J Williams

If the network goes down for whatever reason and comes back up within the retention period, replication will pick up where it left off. It auto-recovers from that; no problems there.
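
As a rough illustration of that rule, the check below compares the length of an outage with the retention period. It is a sketch only: `can_resume` is a hypothetical helper, not a SQL Server API, and the 72-hour retention is an assumed value chosen for the example.

```python
from datetime import datetime, timedelta


def can_resume(last_sync, reconnected, retention):
    """Return True if the subscriber reconnected within the retention period.

    When this returns False, the sketch assumes the subscription has to be
    reinitialized instead of resuming where it left off.
    """
    return reconnected - last_sync <= retention


# Assumed retention period, for illustration only.
retention = timedelta(hours=72)
last_sync = datetime(2014, 8, 13, 9, 0)

# A 6-hour outage is inside the window; a 5-day outage is not.
print(can_resume(last_sync, last_sync + timedelta(hours=6), retention))  # True
print(can_resume(last_sync, last_sync + timedelta(days=5), retention))   # False
```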

What specifically would you like it to auto recover from?
skeleton567

OK, I think we have really beat this one to death. Let's move on.
Robert.Sterbal

I'd like to be able to turn it off without turning it on again when I do a restore into test.
GilaMonster

I have really un-fond memories of doing replication with an Oracle publisher (with SQL 2005). That was a nightmare: random data conversion errors that were incredibly hard to track (the errors didn't indicate in any way which article was the problem), subscribers being invalidated for no apparent reason, and at one point the snapshot agent hung after the 255th article any time it ran. That required a complete tear-down and reconfigure to get past.

Keep things simple and replication works. Try to get fancy and it's a recipe for pain.
At the very least it (and many other components of SQL) needs better monitoring built in. Way better.

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass


TomThomson

Steve Jones - SSC Editor (8/14/2014) wrote:

"I'll still disagree with you, Brandon. If something doesn't tolerate imperfections in the environment, it's a bit brittle. There are plenty of ways that replication could auto recover from these events. The fact that it doesn't means it's not as robust as it could be. Or perhaps, should be."

It certainly didn't handle problems well back in SQL2000. It was fairly easy, though, to detect some of its failures to handle problems and handle them for it, without any human intervention (that's why I say "fragile" rather than "brittle"), and in my opinion that is already enough to make it clear that the monitoring and the error management were of a standard far less than acceptable.

Far more important were the inexplicable errors: inserting a row with an already existing primary key, updating a row with a primary key that didn't exist, deleting a row that didn't exist. The only thing updating the database at the subscriber was replication; there was no software that wrote to it and no-one was updating it, but these things happened. They were a big problem, because we were using replication to create (cold) standby copies of critical databases on customer sites, and we aimed to restore service very rapidly (minutes, not hours) if a server went kaput, and servers did go kaput now and again. If a main server had gone phut while a subscription was being reinitialised and we were left with recovery from backups, we wouldn't have met that target, which would have damaged our reputation even though our contracts allowed a much longer time to recover. Servers broke simply because hardware breaks sometimes, especially in a country where mains electricity voltage sometimes fluctuates wildly enough to damage even kit which is certified for use in that country, or where the climate and computer room air conditioning are such that the equipment is being run at something quite a bit higher than its proper operating temperature. Most of our customers were in such countries and ran such computer room cooling.

Tom
