SQLServerCentral Article

Super Fast Transactional Replication Repair

,

Problem

Your production SQL Server transactional replication just failed and the business impact is critical. How do you get replication restored quickly? In my previous SQL Central article, The Trouble With Replication and Large Articles, I discussed how to build replication templates to resolve numerous hung replication commands. That article addressed all publications. In this article we look at how you can restore SQL Server transactional replication for a single publication in a matter of minutes. Impossible?  You're just 5 minutes from fixing replication, so read on.

Solution

A replication failure can be terrifying to a new DBA. Even an experienced DBA can become frustrated with its complexities. Earlier in my career, I had replication outages for days. Over time I have developed a technique to restore replication in minutes. If you have the ‘row not found on subscriber’ error there can be 1 or 1000s of stuck commands. There are several methods available to restore your system, but my favorite technique for fast restore is to alter the system generated stored procedures to unblock hung commands.  I follow this with Red Gate’s Data Compare to resolve any data integrity issues.  Let’s examine how this works. 

Background

In this example the publisher, Pubr, contains two publications: Pub1 and Pub2. Sub1 and Sub2 are the subscribers as shown in figure 1. Both Pub1 and Pub2 have three articles. Pub1 is still running but Pub2 has failed with ‘row not found on subscriber’.  A failure of any article will block all commands on all articles in the containing publication, since article 4 has an error; articles 5 and 6 are blocked.  I can hear my boss yelling already! 

Figure 2 shows the 36,000 commands blocked by this error.  This number will continue to grow, causing critical systems to become out of synch.  One technique is to fix the errors one by one, but this approach quickly become tedious and can take a long time. You need a quick fix.  I prefer a quick four step technique to first unblock the stuck commands then insert the missing row which caused the failure.

Four Easy  Steps

Whenever you add a subscriber SQL Server generates three replication stored procedures for each article, an insert, update and delete. They are found on the subscriber database’s stored procedures folder.   Figure 3 shows the delete stored procedures for articles 4, 5 and 6. 

When a row is deleted on a replicated table the distributor sends a command to the subscriber to execute the delete stored procedure.  It is this store procedure that throws the ‘row not found’ 20598 error.  Let’s examine the quick fix.

Step One

Locate the offending article.  This is the same initial step Microsoft Support recommends to find stuck replication commands.  Connect to SQL via management using the SQL host name, IP addresses cannot be used with replication.  Right click the Replication folder to start the replication monitor and expand the publisher into publications.  Next, expand the failing publication in subscribers and right click to view details. The broken subscriber will appear with the x-ed out red circle as in figure 4. 

Use CTRL+C to copy the transaction sequence number into the query below in figure 5 and execute on server with the distributor database.   

This will result in xact_seqno of the error which in this case is a truncated version of the same string as above, notice the zeros in figure 5.  I have removed some columns in figure 6 for clarity.

Copy and paste the figure 6’s publisher_database_id, xact_seqno and command_id values into the query in figure 7 and execute.

This will display the command that failed, notice it refers to the system generated stored procedure  sp_MSdel_dboArticle_4.  Some columns in figure 8 removed for clarity.

Locate the stored procedure on the subscriber database, right click and modify it into a query window. Comment out the error trap then execute to save the modification.

USE [ReplTestReceive]
GO
/****** Object:  StoredProcedure [dbo].[sp_MSdel_dboArticle_4]    Script Date: 03/31/2013 21:05:47 ******/SET ANSI_NULLS ON
SET QUOTED_IDENTIFIER ON
GO
ALTER procedure [dbo].[sp_MSdel_dboArticle_4] @pkc1 bigint
AS
BEGIN
   delete [dbo].[Article_4] where [Col1] = @pkc1
/*if @@rowcount = 0
    if @@microsoftversion>0x07320000
        exec sp_MSreplraiserror 20598  */
END  

Step Two

Next, return to the replication monitor – the blocked commands will start clearing in seconds!  The top line on the monitor will turn to green and have a status of running as in Figure 9. 

 Figure 9

Take some time and go tell your boss you repaired the replication issue in minutes, but don’t mention how easy it was.  Then come back to resolve the missing row.

Step Three

Now that replication is fixed and the pressure is off you still need to resolved the missing data on the subscriber.  There are several methods to synchronize data but I prefer Red Gate’s Data Compare utility.  It is a table level tool.  Figure 10 below shows how to find missing rows in a table.  I use the IP of the SQL server and a sysadmin level userid.  If the database drop down is populated then you’ve entered a valid useid/password.

Figure 10

Click Tables and Views tab and select Article_4 compare check box.  A good practice is to click ‘None’ first to remove any previously selected tables.  Notice the primary key is automatically selected as the comparison key. Another reason for having primary keys.

Figure 11

There is 1 row difference under the source only heading of figure 12; this is the missing row which caused the replication error.  You may find that subscriber rows can be deleted directly by programs, or worse yet, programmers with production write access.  If so, this is needs to be resolved otherwise more replication errors will occur. 

Figure 12

Click Compare Now at bottom of panel, (not shown)

Figure 13

Click Next to Deploy. 

Figure 14

You can review the script but not modify it.  The missing row insert is shown in Figure 14.  Click Deploy Now.

Step Four

Finally, go back and uncomment the sp_MSdel_dboArticle_4 changed above in step one.

Congratulations, you can now repair SQL replication in minutes and if your boss is like mine he will be impressed  - for about 1 minute.

Other options

Of course you can also code –skip error at the distribution agent but this is for all articles.   I prefer this article level approach, why fix what isn’t broken?

Acknowledgement:

Reviewers are critical to writers, thanks to Paul Bush at Reply.com for his editing prowess. 

Resources

Rate

4.6 (30)

You rated this post out of 5. Change rating

Share

Share

Rate

4.6 (30)

You rated this post out of 5. Change rating