Transactional Replication Errors

  • We have transactional replication set up between an OLTP and reporting server in a production environment. We are getting a '20598 Agent Custom Shutdown' error during times of high transaction activity. Any thoughts on what could be causing this error? Looked at the MSDN site but it wasn't very helpful.

  • My mistake....the correct error # is 20578.

    We are also getting a 20598 error that states "The row was not found at the Subscriber when applying the replicated command."

  • First let me get an idea of a few things. 1 how much RAM and HD space do you have (what type of HD setup)? What version of SQL and what SP is installed? Have you tried a more current server pack if one is available? What does CPU and Memory utilization look like during these times when these errors appear? Have you check the SQL Server Logs and Event Viewer to see if any more details?

    Not really sure about cause.

    "Don't roll your eyes at me. I will tape them in place." (Teacher on Boston Public)

  • These are pretty good sized machines; a Compaq cluster, each server with 8GB RAM. They are connected to a RAID box with 36 drives in RAID 10. It's SQL Server 2000 with no service packs, but they were going to install SP2 this weekend. I don't have access to the machine to be able to see performance statistics, but will look into this, as well as the SQL Server logs and Event Viewer.

  • Sounds beefy but those statistics may show a bottleneck somewhere causing it to do this. Let us know what you find. You may also want to setup Profiler to do a trace and see if you can find specifically when these failures begin or if something specific triggers it.

    "Don't roll your eyes at me. I will tape them in place." (Teacher on Boston Public)

  • Do you really think it's a bottleck? I don't see how a bottleneck could cause the 'row not found at the subscriber' error (unless the error message is misleading). When we do large deletes, it looks like it deletes the rows from the replicated database as well. Then, when it tries to replicate the delete, the rows are not found. Can you explain how a bottleneck could cause this? Thanks,

  • Are you replicating the actions, or the stored proc call? You're saying the delete works fine on the publisher but fails on the subscriber? How many rows in the delete transaction? The worst that should happen would be a timeout even with heavy load.

    Andy

  • 1. We are replicating the stored proc call.

    2. Correct. The delete works at the publisher, and when it tries the replicated command at the subscriber, it finds that the rows have already been deleted(!).

    3. The latest of these errors occurred yesterday and according to ms_repl_errors, 28 rows were affected.

  • The row has already been deleted, or the delete failed, say due to a foreign key or something? Do users have delete access to the subscriber? Just looked at how we handle deletes, using standard setup:

    create procedure "sp_MSdel_ib" @pkc1 varchar(12)

    as

    delete "ib"

    where "primarykey" = @pkc1

    if @@rowcount = 0

    if @@microsoftversion>0x07320000

    exec sp_MSreplraiserror 20598

    You "could" remove the code that raises the error.

    Andy

  • Just to be sure I have it clear, you're saying that the rows are deleted due to a prior command, such as one that invokes a cascade delete with a foreign key constraint?

    So the subscriber deletes the row with the cascade delete, and then receives the stored proc call which is trying to delete the rows again?

    It makes sense, but wouldn't the replication mechanism in this case just replicate the original command and not the stored proc call?

  • Thats a possibility.

    How the transactions get sent across depends on how you configured the pub, the default is using the proc methodology I posted earlier. Search for "Using Custom Stored Procedures in Articles" in BOL.

    Andy

  • Also consider triggers as the source of the 'problem'. Remember the NOT FOR REPLICATION clause.

  • I am also getting the 20598 error ('The row was not found at the Subscriber when applying the replicated command.'). The problem that I have is that when I look at the subscriber, that row is still there. When I reinitialize the article for that table, things get fixed, but the problem is recurring frequently.

    I set up this replication recently, and it is my first time working with replication, so I may have missed something simple. I am using transactional replication, with one publisher and one subscriber. Both are SQL Server 2000 with SP2.

    Is there any information anywhere on how to troubleshoot replication? BOL doesn't seem to have enough information on this.

    James C Loesch


    James C Loesch

  • One thing to do is make absolutely certain no one is making changes to the subscriber (unless you intend to have updating subscribers) while you troubleshoot. Then I'd recommend profiling both servers while you try to duplicate the problem based on any theory you have or has been presented here. Overall I've found transactional replication to be highly reliable, so dont give up hope!

    Andy

    http://www.sqlservercentral.com/columnists/awarren/

  • No one is updating the subscriber. No one even knows that it exists.

    I have no theories that I haven't already proven false. That's why I posted. One thing that makes it difficult is that so far its been a different table each time.

    James C Loesch


    James C Loesch

Viewing 15 posts - 1 through 15 (of 16 total)

You must be logged in to reply to this topic. Login to reply