Publisher to Distributor Replication latency high when database is in FULL recovery model

  • Hello everyone,

    We have a SQL Server 2005 SP3 (9.0.4053) instance configured for transactional replication (the Log Reader Agent runs continuously). The Publisher and Distributor are on the same server.

    We have a job that runs once per hour and drops and recreates a table holding approximately 160 million rows. This table is NOT a published table. With the database in the FULL recovery model, executing this procedure causes the Log Reader Agent to halt: no new commands are generated, and the "Publisher to Distributor History" tab shows several messages of the form "The Log Reader Agent is scanning the transaction log for commands to be replicated. Approximately X log records have been scanned in pass #n, m of which were marked for replication, elapsed time T (ms)", where X ranges from 1,500,000 to 5,000,000 and the elapsed time T is well over 1,000,000,000 ms.
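    To quantify how far behind the Log Reader falls while the job runs, one option (a sketch, assuming you can query the Publisher directly) is sp_replcounters, which reports the replicated-transaction backlog and latency per published database:

    ```sql
    -- Run at the Publisher; returns one row per published database.
    -- Key columns: "replicated transactions" (backlog waiting on the
    -- Log Reader) and "replication latency" (in seconds).
    EXEC sp_replcounters;
    ```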

    This job runs for approximately 6 minutes. During its execution, and for 3-5 minutes afterward, no transactions are processed by the Publisher-to-Distributor process. A tracer token posted at the start of execution shows 6-17 minutes of latency under Publisher to Distributor, and no transactions on published tables are committed at the subscribers for the same 6-17 minutes.
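    For anyone wanting to reproduce the measurement, tracer tokens can also be posted from T-SQL rather than Replication Monitor; a minimal sketch (the publication name is a placeholder):

    ```sql
    -- Run in the publication database at the Publisher.
    DECLARE @token int;
    EXEC sp_posttracertoken
        @publication = N'MyPublication',   -- placeholder name
        @tracer_token_id = @token OUTPUT;

    -- Later, inspect Publisher-to-Distributor and
    -- Distributor-to-Subscriber latency for that token:
    EXEC sp_helptracertokenhistory
        @publication = N'MyPublication',
        @tracer_id = @token;
    ```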

    When we change the recovery model to SIMPLE, the Log Reader Agent does not halt. It continues delivering other transactions to the distributor (and subsequently to the subscribers), and tracer tokens posted at the start of execution show little to no latency (maximum total latency of 4 seconds).

    Why would the recovery model have this effect? From everything I've read online this is not the expected behavior, and the recovery model should not have this type of impact. I know that in the FULL recovery model the transaction log should be backed up frequently to avoid delays in the Log Reader, but this issue can be seen within 60 seconds of switching the recovery model from SIMPLE to FULL, and it presents even when this job is the only activity on the server.
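    For reference, the switch between models described above is just an ALTER DATABASE (names here are placeholders); in FULL recovery the follow-up log backups are what allow the log to be reused:

    ```sql
    -- Placeholder database and path; adjust for your environment.
    ALTER DATABASE PubDB SET RECOVERY FULL;

    -- In FULL recovery, schedule frequent log backups so log space
    -- can be marked for re-use:
    BACKUP LOG PubDB TO DISK = N'D:\Backups\PubDB_log.trn';

    -- Switching back for comparison:
    ALTER DATABASE PubDB SET RECOVERY SIMPLE;
    ```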

    Any advice?

    I appreciate any guidance,

    Jason

    ----------------------------------------------

    Also worth noting: this can be reproduced on this server with any table having a large number of records. During diagnostics we created a test table and filled it with rows from sys.objects until it had over 150M rows, then copied the data from this test table into another test table in the same manner as the problematic procedure:

    SELECT [columnlist] INTO TestTable2 FROM TestTable1

    We also tried creating the table in advance and executing an INSERT INTO instead of a SELECT INTO:

    INSERT INTO TestTable2 ([columnlist]) SELECT [columnlist] FROM TestTable1

    During execution the same replication halt was seen in FULL recovery, and it disappeared in SIMPLE recovery.


  • I can't point you at anything that confirms what you are witnessing, but I can understand it (sort of).

    When a transaction is written to the transaction log, it is either marked for replication or not. Parts of the log cannot be marked for re-use until all transactions marked for replication have been processed. Adding the recovery model into the mix...

    In FULL recovery, although your large transaction is not marked for replication, that part of the transaction log cannot be marked for re-use until it has been backed up, so the Log Reader has to grind its way through that part of the log.

    In SIMPLE recovery, SQL Server can mark the part of the log with your large transaction for re-use, and the Log Reader can skip it entirely.
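    One way to check which of these factors is holding the log at any given moment is the log_reuse_wait_desc column in sys.databases (the database name below is a placeholder):

    ```sql
    -- REPLICATION means log truncation is waiting on the Log Reader;
    -- LOG_BACKUP means it is waiting on a transaction log backup.
    SELECT name, recovery_model_desc, log_reuse_wait_desc
    FROM sys.databases
    WHERE name = N'PubDB';  -- placeholder database name
    ```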

    As I said, I can't point you at anything to confirm this, so it is just an educated guess as to how SQL Server goes about its business.

    If you really want people to look at other threads, either here on SSC or on other forums, it would help if you would use the [ url ] [ /url ] IFCode shortcuts around your link. That makes your URL clickable instead of requiring a cut and paste into the address bar.

    http://social.msdn.microsoft.com/Forums/en-US/sqlreplication/thread/1828bfc9-b9d9-412d-97f6-e8a0568cbbdf

  • Hi Jason,

    Were you able to find the answer? We are having a very similar problem. The difference is that we TRUNCATE the tables that are not published, but we see the same replication behavior you describe. Have you considered moving the tables to tempdb?

    Thank you,

    Lana

  • Hi sfeldman,

    I never found out what the resolution was because I left the company shortly after this post. As of my departure we hadn't attempted moving to tempdb; it wouldn't have been an option anyway, because the table being populated is a permanent table.

    Maybe it will help if I tell you about where I left off.

    When I left, the only work-around I had was to leave the system in the SIMPLE recovery model. This didn't close the issue, because the product we were supporting had a requirement to support the FULL recovery model.

    The impact for our client was that the replication delay caused an application error, "Failed to validate database replication", while deleting and/or saving a new or modified [product name omitted] Query.

    To address the issue we changed the recovery model to SIMPLE (with close monitoring) and pursued two paths:

    1. A case was opened with Microsoft

    2. Our developers (in case Microsoft wouldn't fix it) began modifying the product so it would not require immediate replication between "Sites", eliminating the error condition. They also investigated reducing the number of reads and writes in the offending procedure.

    Work was underway (not completed) when I left.

    Make sure you have the latest fixes from Microsoft; if they fixed this, it would have been released as part of their generally available hotfixes (the company I worked for would only accept official, supported fixes).


    Thanks a lot for the reply. We are working on the issue right now. We were able to narrow the problem down to locks between the SQL agent and the distribution cleanup job or the history cleanup job. We've disabled those jobs for now and are researching optimal parameters and schedules for them.

    The best suggestion I can give is to batch your inserts into the table instead of doing them as a single insert. Be sure to put a slight delay between batches to give SQL Server some breathing room for other work.
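    A batched version of the copy from earlier in the thread might look like this. It is only a sketch: the table and column names are the illustrative ones from the repro, and it assumes TestTable1 has a unique integer key column [id] to drive the batches (SQL Server 2005 has no OFFSET/FETCH):

    ```sql
    DECLARE @BatchSize int, @LastId int, @Rows int;
    SET @BatchSize = 100000;  -- tune to taste
    SET @LastId = 0;
    SET @Rows = 1;

    WHILE @Rows > 0
    BEGIN
        INSERT INTO TestTable2 (id, col1, col2)
        SELECT TOP (@BatchSize) id, col1, col2
        FROM TestTable1
        WHERE id > @LastId
        ORDER BY id;

        SET @Rows = @@ROWCOUNT;

        IF @Rows > 0
            SELECT @LastId = MAX(id) FROM TestTable2;

        -- Pause between batches so the Log Reader (and everything
        -- else) gets some breathing room.
        WAITFOR DELAY '00:00:02';
    END
    ```

    Smaller log records per batch also mean the Log Reader never has one enormous contiguous region of unreplicated log to scan past.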
