Conflict resolution

  • I've been bitten twice now by what I consider a problem or bug in conflict resolution. I'm hoping someone has seen this or knows how to prevent it, because I haven't been able to find any information on it.

    Here's the situation. I have merge replication set up between a SQL 2000 publisher and a couple dozen Desktop Engines (MSDE 2000). On some of the replicated tables I have triggers that are NOT marked "NOT FOR REPLICATION"; in other words, the triggers fire when the merge agent modifies data. This is what I intended. The problem that occurred is that an errant maintenance script "broke" one of these triggers - one time by incorrectly altering a table that the trigger wrote to, and another time by removing a function that the trigger used. These were errors on my part, but the resulting havoc wreaked on the data by conflict resolution is what really irks me. The scenario goes like this: a record is inserted (it doesn't matter where - subscriber or publisher). When that record is replicated to a subscriber that has a "broken" trigger, the conflict resolver winds up DELETING the record from the publisher, and subsequently from all other subscribers that may have already received it. The conflict description (reason_text) says "A record was inserted at <some server> but could not be inserted at <broken server> due to an unspecified error."
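
    (For context, here is a simplified sketch of the kind of trigger I mean - all table and column names are made up, not my actual code. Because there is no NOT FOR REPLICATION clause before AS, it also fires when the merge agent applies rows:)

        CREATE TRIGGER trg_Orders_Audit ON dbo.Orders
        AFTER INSERT
        -- adding NOT FOR REPLICATION here, before AS, would stop the trigger
        -- firing for merge agent activity; I deliberately left it off
        AS
        BEGIN
            -- downstream write of the sort the errant maintenance script broke
            INSERT INTO dbo.OrderAudit (OrderID, AuditDate)
            SELECT OrderID, GETDATE()
            FROM inserted
        END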

    Why would it decide to remove the record just because one subscriber had a problem with it? I thought about priority options, but I don't think that would have made a difference. I currently have all subscriptions set to use the priority of the publisher, since all subscribers and all users at the publisher are peers. No one has more "rank" than the others.

    Anyone seen this behavior? Know of a fix?

    Jay


    Jay Madren

  • Because the transaction cannot be considered committed until it's committed on all subscribers in a merge replication topology, if any of them rolls back the transaction (i.e., fails it), that rollback is merge replicated back to all participants. This is an inherent part of merge replication and how it should work by design. When you're working with multiple databases updating a common set of data, you must view it from the larger vantage point. You wouldn't want a transaction partially applied across your databases. That would wreak all kinds of havoc.

    I would suggest updating all of your code in all of your databases to a common base which does not fail the transactions, and then reinitializing your subscriptions to synchronize your data.
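
    If you do reinitialize, running something along these lines at the publisher should mark a subscription for reinitialization (the publication and subscriber names below are placeholders; @upload_first = 'true' uploads any pending subscriber changes before the snapshot is reapplied):

        -- run in the published database at the publisher
        EXEC sp_reinitmergesubscription
            @publication   = 'MyPublication',
            @subscriber    = 'SUBSCRIBER01',
            @subscriber_db = 'MySubscriberDB',
            @upload_first  = 'true'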

  • Hmm. I understand that rationale in the context of enforcing constraints, but this was caused by an error. Either way, though, the process should notify the original record creator that there was a problem (or violation) so he can take appropriate action. What happened here is that several records were deleted over the course of a couple of days and no one knew about it until they ran some summary reports.

    As for a common code base, that's what I have. The problem came when I was pushing out some updates to that base. These are all on-demand dial-up or VPN connections, so I cannot directly access the subscribers to perform the updates and verify their success. I have to send scripts that may not get picked up for days. Overall, this whole setup is very complicated and difficult to keep cohesive, so this little glitch is yet another aggravation (though a little more significant).

    So, a follow-up question: since I can never guarantee that my code is infallible, how do I build in protection against losing records in this manner? And in case you're thinking of interactive conflict resolution, that's not a workable solution.

    Thanks,

    Jay


    Jay Madren

  • My first thought is that you might need to implement some form of versioning and version-checking scheme to prevent procs or triggers from running if the code has not been updated properly. In other words, if the version in the proc doesn't match the version in a local database table, do something to sync them before going any further.
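
    Roughly what I have in mind, just as a sketch with made-up names and numbers:

        -- one-row table in every database holding the deployed code version
        CREATE TABLE dbo.CodeVersion (
            VersionNumber int NOT NULL
        )

        -- at the top of a proc or trigger, compare the version the code was
        -- written for against what the local database says it is running
        DECLARE @expected int, @actual int
        SET @expected = 42   -- hard-coded into each release of the code
        SET @actual = -1     -- default if the version table is empty
        SELECT @actual = VersionNumber FROM dbo.CodeVersion
        IF @actual <> @expected
        BEGIN
            RAISERROR('Code expects version %d but database is at version %d.', 16, 1, @expected, @actual)
            RETURN
        END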

    You might also be able to do it with a more robust set of error-logging routines in your triggers or code, which would merge replicate the logged errors back to your central repository as well, where you could send out new updates to correct the logged problems.
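
    For example, a simple error-log table that is itself an article in the merge publication, so logged rows flow back to the publisher on the next sync. Again, just a sketch with placeholder names, and note that some severe errors abort the batch outright and can't be trapped this way in SQL 2000:

        CREATE TABLE dbo.ErrorLog (
            ErrorLogID uniqueidentifier ROWGUIDCOL NOT NULL DEFAULT NEWID() PRIMARY KEY,  -- GUID key merges cleanly
            LoggedAt   datetime      NOT NULL DEFAULT GETDATE(),
            ServerName sysname       NOT NULL DEFAULT @@SERVERNAME,
            Source     varchar(128)  NOT NULL,
            ErrorCode  int           NULL,
            Message    varchar(1000) NULL
        )

        -- SQL 2000 style: check @@ERROR after a risky statement and record it
        -- instead of letting the whole merge transaction fail
        DECLARE @err int
        UPDATE dbo.SomeMasterTable SET LastUpdated = GETDATE() WHERE MasterID = 1  -- placeholder statement
        SET @err = @@ERROR
        IF @err <> 0
            INSERT INTO dbo.ErrorLog (Source, ErrorCode, Message)
            VALUES ('trg_SomeTable', @err, 'Update of SomeMasterTable failed')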

    In a situation like that, I think I would have a master set of tables which were not replicated, which held the definitive data, and a set of merge-replicated tables holding the transactions going into those master tables. In other words, separate the transactions from the repository. If this were done in combination with the robust error logging and the code versioning, then the scenario would go more like this.

    A transaction is replicated to or from a subscriber; the trigger checks its own version against a table containing the current version info, and if it finds the correct version, it processes the transaction record into the master tables. This allows for logging, auditing, etc. On an incorrect version, it could generate an error which is replicated back to your central area, so you would be aware of it and could correct it. Either way, write your code so that an error does not prevent the insert from occurring somewhere.
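
    Pulling the version check and the error log together, the trigger on the merge-replicated transaction table might look roughly like this. All table and column names here are invented, and it assumes the dbo.CodeVersion and dbo.ErrorLog tables sketched above:

        CREATE TRIGGER trg_OrderQueue_Process ON dbo.OrderQueue   -- the replicated transaction table
        AFTER INSERT
        AS
        BEGIN
            DECLARE @expected int, @actual int, @err int
            SET @expected = 42
            SET @actual = -1
            SELECT @actual = VersionNumber FROM dbo.CodeVersion

            IF @actual <> @expected
            BEGIN
                -- wrong code version: leave the queued rows where they are and log it;
                -- the log rows merge back to the publisher on the next sync
                INSERT INTO dbo.ErrorLog (Source, ErrorCode, Message)
                SELECT 'trg_OrderQueue_Process', NULL,
                       'Version mismatch - queued row ' + CAST(OrderQueueID AS varchar(20)) + ' not processed'
                FROM inserted
                RETURN
            END

            -- correct version: move the queued rows into the local, non-replicated master table
            INSERT INTO dbo.OrdersMaster (OrderQueueID, CustomerID, Amount, ProcessedAt)
            SELECT OrderQueueID, CustomerID, Amount, GETDATE()
            FROM inserted

            SET @err = @@ERROR
            IF @err <> 0
                INSERT INTO dbo.ErrorLog (Source, ErrorCode, Message)
                VALUES ('trg_OrderQueue_Process', @err, 'Insert into OrdersMaster failed')
        END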

    Placing a trigger on a replicated table is really one of those things that's not good to do. That's not to say it isn't done all over; it's just not good practice. The replicated record cannot commit until the code from the trigger executes, and this holds up replication, ties up network resources, and on failures rolls back the transaction the trigger failed on. Try to redesign the process to separate the two.
