Trans replication errors 2627 and 20598

Question

Trans replication errors 2627 and 20598

CHIRAYU SHAH

SSC Veteran

Points: 245
More actions
March 29, 2009 at 11:23 pm

#201666

We have SQL 2005 SP2 + Aug08 security patch (3068) publisher, distributor and six subscribers. We have set up transactional replication where publisher,distributor and subscribers are on separate servers.
Also, publisher database is being used as principal for database mirroring.
This set up had been working fine for almost a year. But, two out of six subscribers are failing with the following errors now.
- Can not insert duplicate key. Primary key violation on 'xxxxx_pk'. Error: 2627
- The row was not found at the Subscriber when applying the replicated commands. Error: 20598
We have made sure that there is no user interference at subscribers. So, no user processes are deleting or inserting rows.
We have also tried to reinitialize snapshot on them, but it fails after sometime with any of these errors.
We would really appreciate any help on this issue.
Thanks.................

Viewing 11 posts - 1 through 10 (of 10 total)

You must be logged in to reply to this topic. Login to reply

Erik Bitemo SSCrazy Points: 2062 More actions · Answer 1

Hi,

You may want to turn on logging for the problematic agents and find the exact rows failing. Edit the job step by adding the option "-Output E:\ReplDistrAgent.log " to the end of the parameter list of the distribution agent. !! do not leave it logging longer than you really need !!

Obvious question: what is the behavior in case the target object exists? if it's set to do nothing, than it will cause you problems. Also, you may try to create the subscription without keys and check the violations later.

-- Erik http://blog.rollback.hu

chirayu547 SSC Enthusiast Points: 179 More actions · Answer 2

Hi Erik,

Thanks for your reply.

I have added -outputverboselevel 3 with -output H:\DistAgentVerbose.txt on Distribution agent job.

We have set up Pull subscriptions with non-updateable subscribers.

We get the Transaction seq no. and command Id from verbose logs, but it really doesn't explain why it fails on 2 out of 6 subscribers.

We are replicating Unique constraints and clustered indexes.

Any idea about what's happening?

Erik Bitemo SSCrazy Points: 2062 More actions · Answer 3

If this is an already initialized subscription, I suggest you adding -CommitBatchsize and -CommitBatchThreshold with a low value (I had to set them to 1 in a similar situation), this way you can see the specific commands in the logfile when they fail and then you can check the data. Again, remove the extra parameters when you don't need them anymore.

-- Erik http://blog.rollback.hu

noeld SSC Guru Points: 96590 More actions · Answer 4

You can usually see tran_seq_number and command_id on replication monitor.

Use those on sp_browsereplcmds and you will see which rows are the troublemakers.

Now to me it could be an indication that someone modified the subscribed databases manually!

* Noel

chirayu547 SSC Enthusiast Points: 179 More actions · Answer 5

Hi Erik and Noel,

We already tried both -CommitBatchSize and -CommitBatchThreshold with lower values, but it still failed.

We have set up NOT FOR REPLICATION triggers on the particular table where the primary key violation or deletion occurs, but we couldn't find anything logged in History table.

Yes, I found the exact insert command (MS_InsXXXXXX ) where it was failing from distribution database. I manually deleted that row on subscriber to see if it picks up from there.

I am still monitoring it for now as it's trying to catch up on 331,000 commands.

What do I wonder is if the row already exists on subscriber through Snapshot or log reader agent, then why it gets replicated again?

Could there be a goof up in Tran seq no? or is the same command coming twice because of some latency issue?

noeld SSC Guru Points: 96590 More actions · Answer 6

If someone ( or some program) inserted the row on the subscriber this error can happen.

If you manually delete it replication picks up where it left off.

There is no such a thing as replication sending the stuff twice all operations are transactional so either they succeed or not.

There are definitely bugs in the replication code but in my experience 99.99% of the time that these things happen is because there was someone that executed insert/delete/update commands on the subscriber thinking that they were updating the primary!

* Noel

chirayu547 SSC Enthusiast Points: 179 More actions · Answer 7

I understand where are you coming from, but we haven't found anything on history tables for triggers (NOT FOR REPLICATION) on subscribers yet.

All subscribers are read-only.

It doesn't really make sense that replication work on 4 subscribers fine. Only 2 subscribers are failing.

Can database mirroring interfer with replication? Could there be a bad snapshot delivered?

noeld SSC Guru Points: 96590 More actions · Answer 8

I have seen occasions in which the rows do NOT reach the subscribers when Reindexing is run on the publisher but never seen that a row is sent twice.

By the way having triggers DO NOT fully protect you. For example if someone uses bcp without firing the triggers you will never notice.

True that it is an extreme case but this is just an example so that you understand that there are way around these stuff.

* Noel

Erik Bitemo SSCrazy Points: 2062 More actions · Answer 9

Hi,

If I were you, I'd consider removing the constraints and then checking if I can figure out what happened. If it turns out to be something "expected", you can instruct replication to ignore these kinds of errors, but it's better to dig deeper and see what's going on.

The -commitbatchsize parameter won't fix the problem, only makes easier to identify the failing row(s) in the agent log.

-- Erik http://blog.rollback.hu

BrerSQL SSCommitted Points: 1672 More actions · Answer 10

I know this is an old topic but I ran into the same situation and was able to determine that my filters were causing me problems.

I figured I would post my findings so that someone else can benefit from my pain.

I have a situation where I only replicate rows from certain tables that have a StartDate of getdate() minus 20 days or greater.

Logically as days go by rows will fall out of my replications sliding window.

Example...

If I have a row that has a startDate of January 1, 2011 and I re-initialize on January 20, 2011

The January 1 row will replicate but on January 21st the row will no longer receive updates because it has fallen out of the sliding window....lets call this a stale row. I am not sure if this is how it is supposed to work but this is the behavior that I have observed.

Now...If later (say on January 30th) I go and update the StartDate for that stale row to a day in the future then I will receivethe 2627 error (Violation of PRIMARY Key Constraint...).

My assumption is that replication does not keep track of every row that it initially replicated. If the row falls out of the filter range in my case then replication no longer cares about it and assumes that it no longer exists on the replication servers.

This can be bad if other updates are occuring to the stale rows on the publishing DB which essentially puts your replication database out of synch for those stale rows.

I have not determined my ultimate solution yet but I was able to resolve the error in two ways...PLEASE UNDERSTAND THAT I AM NEW TO REPLICATION and these solutions may not be the best way to resolve this issue, I am still investigating and my solution may come down to better design. Anyway here is what I found to work.

1 - On your distribution Agent profile use 'Continue on data consistency errors'. The transaction will be skipped when consistency errors are logged. This is probably not a great solution but it does keep your replication chugging along and you can address the issue at your convenience depending on the critical nature of the data.

OR

2 - Modify the sp_MSins_dboTABLENAME stored procedure so that it checks to see if the row exists prior to doing the insert. In my case, if the row exists I call the SPMSupd_dboTABLENAME stored procedure forcing it to update all columns which essentially re-synchronizes the row. From what I read, you should avoid modifying these procedures.

I have reproduced and tested both of these solutions.

Hope this helps someone.