Obscure Replication Error

  • I've had multi-terabyte transactional replication running for a couple of years now. For the most part things flow quite smoothly, with the occasional slowdown when massive updates are performed. However, since March 22nd, 2013, I've had the log reader agent blow up and stop responding.

    There's not much in the way of a descriptive error, nor much to read in the log/dump files, other than that they appear to point to an issue with the EXE itself... perhaps a memory leak or a bug somewhere in the distribution agent executable?

    In the dump file, this is basically the only useful information I can see - hoping experts around here have experienced something similar and can offer suggestions:

    Process Name: DISTRIB.exe : C:\Program Files\Microsoft SQL Server\100\COM\DISTRIB.exe

    Process Architecture: x64

    Exception Code: 0xC0000005

    Exception Information: The thread tried to read from or write to a virtual address for which it does not have the appropriate access.

    Here's the output from the Replication Agent job:

    Message

    The replication agent encountered a failure. See the previous job step history message or Replication Monitor for more information. The step failed.

    Date: 3/23/2013 1:34:26 AM

    Log: Job History (MYSERVER-CA_TABLES-SUBSCRIBER-22)

    Step ID: 2

    Server: DISTRIBUTOR

    Job Name: MYSERVER-CA_TABLES-SUBSCRIBER-22

    Step Name: Run agent.

    Duration: 17.06:24:48

    Sql Severity: 0

    Sql Message ID: 0

    Operator Emailed:

    Operator Net sent:

    Operator Paged:

    Retries Attempted: 0

    Message

    2013-04-09 12:26:55.798 88 transaction(s) with 232 command(s) were delivered.

    2013-04-09 12:26:55.798 Delivering replicated transactions

    2013-04-09 12:27:11.251 Delivering replicated transactions

    2013-04-09 12:27:21.751 114 transaction(s) with 321 command(s) were delivered.

    2013-04-09 12:27:32.673 101 transaction(s) with 206 command(s) were delivered.

    2013-04-09 12:27:37.938 100 transaction(s) with 168 command(s) were delivered.

    2013-04-09 12:27:50.814 Delivering replicated transactions

    2013-04-09 12:27:58.907 Delivering replicated transactions

    2013-04-09 12:28:00.142 107 transaction(s) with 288 command(s) were delivered.

    2013-04-09 12:28:19.376 Delivering replicated transactions

    2013-04-09 12:28:24.564 101 transaction(s) with 256 command(s) were delivered.

    2013-04-09 12:28:35.783 100 transaction(s) with 216 command(s) were delivered.

    2013-04-09 12:28:54.658 Delivering replicated transactions

    2013-04-09 12:29:01.751 Delivering replicated transactions

    2013-04-09 12:29:01.783 101 transaction(s) with 253 command(s) were delivered.

    2013-04-09 12:29:14.377 100 transaction(s) with 162 command(s) were delivered.

    2013-04-09 12:59:14.404

    HYT00 Query timeout expired 0

    2013-04-09 12:59:14.404 <stats state="2" fetch="1543" wait="71597" cmds="1252" callstogetreplcmds="519922"><sincelaststats elapsedtime="1838" fetch="0" wait="1838" cmds="1252" cmdspersec="0.000000"/></stats>

    ************************ STATISTICS SINCE AGENT STARTED ***********************

    04-09-2013 07:59:14

    Total Run Time (ms) : 1491874890 Total Work Time : 70358184

    Total Num Trans : 1282424 Num Trans/Sec : 18.23

    Total Num Cmds : 3599854 Num Cmds/Sec : 51.16

    Total Idle Time : 1375300094

    Writer Thread Stats

    Total Number of Retries : 4

    Time Spent on Exec : 4520346

    Time Spent on Commits (ms): 149885 Commits/Sec : 1.26

    Time to Apply Cmds (ms) : 8228658 Cmds/Sec : 437.48

    Time Cmd Queue Empty (ms) : -1297083607 Empty Q Waits > 10ms: 24054

    Total Time Request Blk(ms): 78216487

    P2P Work Time (ms) : 0 P2P Cmds Skipped : 0

    Reader Thread Stats

    Calls to Retrieve Cmds : 519922

    Time to Retrieve Cmds (ms): 70358184 Cmds/Sec : 51.16

    Time Cmd Queue Full (ms) : 2784738 Full Q Waits > 10ms : 5585

    #1Num Cmds : 774239 Exec (ms) : 4520346 Commit (ms) : 145004

    Process (ms): 7831768 Last xact : 0x00068c0e0001f56e000c

    #2Num Cmds : 802248 Exec (ms) : 4539826 Commit (ms) : 149885

    Process (ms): 7910496 Last xact : 0x00068c0e0001f5fb000b

    #3Num Cmds : 779073 Exec (ms) : 4451619 Commit (ms) : 131556

    Process (ms): 7929972 Last xact : 0x00068c0e0001f62a000b

    #4Num Cmds : 1244294 Exec (ms) : 6389841 Commit (ms) : 146076

    Process (ms): 8228658 Last xact : 0x00068c0e0001f59d000d

    Last global update to sub xact : 0x00068c0e0001f62a000b

    *******************************************************************************

    2013-04-09 12:59:14.404 Delivering replicated transactions

    2013-04-09 12:59:14.404 Delivering replicated transactions

    2013-04-09 12:59:14.404 Delivering replicated transactions

    2013-04-09 12:59:14.404 <stats state="1" work="70358" idle="1375300"><reader fetch="1543" wait="71597"/><writer write="8228" wait="2997883"/><sincelaststats elapsedtime="1947" work="109" cmds="2146" cmdspersec="19.000000"><reader fetch="0" wait="1947"/><writer write="169" wait="1863"/></sincelaststats></stats>

    ********************************************************************************

    Microsoft (R) SQL Server Replication Agent

    A replication agent encountered a fatal error and was shut down. A mini-dump has been generated at the following location:

    C:\Program Files\Microsoft SQL Server\100\Shared\ErrorDumps\ReplAgent20130409075914_0.mdmp

    ______________________________________________________________________________Never argue with an idiot; They'll drag you down to their level and beat you with experience

  • Is this the log reader agent or the distributor agent? You mention both.

    What happens when the agent job restarts?

    Do you have the job set to auto restart on failure (common recommendation)?

  • arnipetursson (4/9/2013)


    Is this the log reader agent or the distributor agent? You mention both.

    What happens when the agent job restarts?

    Do you have the job set to auto restart on failure (common recommendation)?

    My mistake, this is the Replication Distributor. When restarting the job it runs just fine. I do not have it set up to auto restart on failure but do send a page/email to the DBA operator to alert of the error.

    I s'pose there's no harm in setting it to retry at least once...

  • Actually, by default this agent is set up to retry at 1-minute intervals...

  • You say this occurs when massive updates are performed. Notice the "Query timeout expired" error in your log. Try increasing the Distribution Agent's -QueryTimeout parameter. The default is 1800 seconds (30 minutes), which matches the gap between the last delivered batch and the timeout in your log:

    2013-04-09 12:29:14.377 100 transaction(s) with 162 command(s) were delivered.

    2013-04-09 12:59:14.404

    HYT00 Query timeout expired 0

    Try increasing -QueryTimeout to 5400 (90 minutes).
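
    The 30-minute match between the two log lines quoted above can be checked with a quick calculation. This is a minimal Python sketch that uses only the two timestamps from the log; the variable names are illustrative, not part of any SQL Server tooling.

```python
from datetime import datetime

# Timestamp format used in the agent output quoted above.
FMT = "%Y-%m-%d %H:%M:%S.%f"

# Last successful delivery before the stall, and the moment the timeout was logged.
last_delivery = datetime.strptime("2013-04-09 12:29:14.377", FMT)
timeout_logged = datetime.strptime("2013-04-09 12:59:14.404", FMT)

# Gap between the two events, in seconds.
gap_seconds = (timeout_logged - last_delivery).total_seconds()

# Default -QueryTimeout as stated in the post above, in seconds.
default_query_timeout = 1800

print(f"gap: {gap_seconds:.3f} s")
print("within 1 s of the default timeout:", abs(gap_seconds - default_query_timeout) < 1)
```

    A gap this close to 1800 seconds suggests the agent sat waiting on a single query until the default timeout fired, rather than failing on delivery itself.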

  • Also increase the default values of -ReadBatchSize in the Log Reader Agent and -CommitBatchSize in the Distribution Agent.
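
    One common place to set these parameters is the command text of the agent's "Run agent." job step (an agent profile is another). The Python sketch below only illustrates the string edit. The helper `set_agent_param`, the sample command, and the value 200 for -CommitBatchSize are assumptions for illustration; none of them comes from this thread or from SQL Server itself.

```python
import re

def set_agent_param(command: str, name: str, value: int) -> str:
    """Return `command` with `-name value` set, replacing an existing value if present."""
    # Match "-Name <value>" only when the dash starts a token.
    pattern = re.compile(rf"(?<!\S)-{re.escape(name)}\s+\S+")
    replacement = f"-{name} {value}"
    if pattern.search(command):
        return pattern.sub(lambda _match: replacement, command, count=1)
    return f"{command.rstrip()} {replacement}"

# Hypothetical Distribution Agent job-step command; every name here is a placeholder.
command = (
    "-Subscriber [SUBSCRIBER] -SubscriberDB [SubDb] -Publisher [MYSERVER] "
    "-Distributor [DISTRIBUTOR] -Publication [CA_TABLES] -PublisherDB [PubDb] "
    "-DistributorSecurityMode 1"
)

command = set_agent_param(command, "QueryTimeout", 5400)    # value suggested above
command = set_agent_param(command, "CommitBatchSize", 200)  # illustrative value only
print(command)
```

    -ReadBatchSize would be added the same way, but to the Log Reader Agent's job step rather than the Distribution Agent's. The agent job needs a restart before a changed command takes effect.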

  • Thanks for the advice. I will try these and monitor over the upcoming weeks.
