Distrib agent stalls and can't catch up

  • Hi,

    I have a SQL Server 2000 instance publishing to a secondary instance running SQL Server 2005. This SQL 2005 server hosts both the distribution database and the replicated (subscriber) database.

    The subscription is in push mode. Both the publisher and the distributor/subscriber are strong configurations (HP ProLiant 8x3GHz, local 15K RPM SAS, etc.). The distribution agent runs in continuous mode, with a profile that has all default values except the query timeout, which I set to 36000 seconds instead of 1800, because the agent fails with a timeout error when running against a large MSrepl_commands table.
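
    For reference, this is roughly how I bumped the timeout in the custom agent profile on the distributor (a sketch only; the @profile_id value is a placeholder, the real one comes from MSagent_profiles in msdb):

    -- run on the distributor; @profile_id = 1 is a placeholder
    EXEC sp_change_agent_parameter
        @profile_id = 1,
        @parameter_name = N'QueryTimeout',
        @parameter_value = N'36000';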

    On the inbound side, I have 80 clients inserting massively into the publisher in OLTP fashion.

    Now the MSrepl_commands and MSrepl_transactions queues are filling far too fast for the distribution agent to catch up. I set up two counters last night, on log reader cmds/sec and distribution agent cmds/sec. The log reader reports 300 cmds/sec and the distribution agent 100 cmds/sec, and sometimes it even reports nothing and seems completely inactive. MSrepl_commands is now 12 GB. I would like to understand why the distribution agent is so far behind the log reader.
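
    If it helps, the same figures can also be pulled from the agent history tables; this is roughly what I sampled (a sketch, assuming the distribution database is simply named 'distribution'):

    -- recent throughput of the distribution agent(s)
    SELECT TOP 10 agent_id, [time], delivery_rate, delivered_commands, delivery_latency
    FROM distribution.dbo.MSdistribution_history
    ORDER BY [time] DESC;

    -- recent throughput of the log reader agent(s)
    SELECT TOP 10 agent_id, [time], delivery_rate, delivered_commands
    FROM distribution.dbo.MSlogreader_history
    ORDER BY [time] DESC;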

    There are three SQL connections on the subscriber related to distrib.exe (i.e. under the same host process):

    - one running sp_MSget_repl_commands, with very few IOs/sec reported in the sys.sysprocesses physical_io column;

    - one tracked under a Profiler session and supposed to apply the transactions at the subscriber, but stuck at the begin tran, doing no IO and not waiting for any resource to be freed (last_wait_type is MISC and waittime is always 0);

    - one running sp_MSadd_distribution_history, sometimes blocked for quite a long time (more than 200 seconds on average) by the distribution cleanup procedure running:

    EXEC dbo.sp_MSdistribution_cleanup @min_distretention = 0, @max_distretention = 120

    Note that I set @min_distretention to 0 because I don't want transactions that have already been applied to remain in the distribution database.

    I don't know what to do to speed up the distribution process. Is there a way to avoid blocking between the distribution cleanup and distrib.exe?

    I would rather avoid modifying parameters in the distribution agent profile, because a restart of the agent is costly (it must restart its sp_MSget_repl_commands from the beginning, and the table is getting bigger every minute, so I fear the situation would get worse every time I restarted distrib.exe).

    Thanks for sharing,

    David B.

  • The first thing to do is to change the schedule of the Distribution Cleanup job. By default it runs every 10 minutes, which will kill a busy replication system; we run ours once every hour. It also sounds like you should consider a three-server architecture, so that the publisher, distributor and subscriber each has its own server.
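
    Something along these lines is what I mean (a sketch only; the job and schedule names below are the defaults on our distributor, so check yours in msdb before running anything):

    -- run on the distributor: move the cleanup job from every 10 minutes to hourly
    EXEC msdb.dbo.sp_update_jobschedule
        @job_name = N'Distribution clean up: distribution',   -- default job name
        @name = N'Replication agent schedule.',               -- default schedule name
        @freq_type = 4,               -- daily
        @freq_interval = 1,
        @freq_subday_type = 8,        -- 8 = hours (4 = minutes)
        @freq_subday_interval = 1;    -- every 1 hour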

    I don't know how many tables you are publishing or how many publications you have, but creating more publications, with a dedicated distribution agent per publication, could boost throughput.
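
    By way of a sketch, each extra publication would be created with its own (independent) distribution agent rather than the shared one; the publication name and options below are placeholders, not your actual setup:

    -- on the publisher, in the published database
    EXEC sp_addpublication
        @publication = N'Pub_Orders',       -- placeholder name
        @repl_freq = N'continuous',
        @sync_method = N'native',
        @independent_agent = N'true',       -- dedicated distribution agent
        @immediate_sync = N'false',
        @allow_anonymous = N'false';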

    Hope some of that helps,

    Mike

  • Thanks Mike,

    You are right, I did consider splitting the single publication into multiple ones at one point, but for some reason it never happened.

    I've disabled the distribution cleanup temporarily to let the distribution agent continue. I can see a little more activity on the IO side for sp_MSget_repl_commands.

    I'm wondering whether it has to do with the continuous mode. We originally designed the replication this way because the DSS apps working on the subscriber need the data as soon as it has been validated on the publisher. But maybe I could set the distribution agent to run at short, regular intervals to avoid bumping into the cleanup process. What do you think?

    For your information, there are 20 tables and about 40 GB published in this publication.

    David B.

  • Given our experience, I would say stick with continuous running. Running on a schedule will build up latency, and the distribution agents can slow down as the distribution database gets bigger. We replicate in the region of 800 million transactions a day around our entire infrastructure, which is roughly 200 million per day to each of 4 subscribers. This represents about 30-40 GB per day per server. We run all our agents in continuous mode, but that's not to say you shouldn't experiment. We also run the three-server architecture I mentioned earlier.

    Only getting 100 cmds/sec out of the distribution agent is not good, really. I'd expect to see something more like 500-800 cmds/sec, or even higher if inserts are the only activity. Hopefully everything is indexed properly and there is nothing lying around which shouldn't be there! Apologies if that's an insult.

    I really would recommend creating more publications to get better parallelism into your design. We only replicate 9 tables off our main publishers, but we have 3 publications, each with a dedicated distribution agent.

    Mike

  • Thanks Mike,

    I realized I didn't mention two things in the design:

    - There are two subscribers on this publication (so two distrib.exe processes): one is the subject of concern, and it is local to the publisher, in the same room (both in London).

    - The second one is based in Germany, subscribes to the same amount of data, and runs perfectly, with no latency and a 600 cmds/sec rate.

    Both distribution agents are running under the same profile, and the only difference in their parameters is that the agent for the remote subscriber uses SQL authentication (-SubscriberLogin + -SubscriberPassword), whereas the local one uses integrated authentication on behalf of the SQL Agent account.

    This is nonsense I know.

    The other factor is that the publication was created with @immediate_sync = 0, to avoid the snapshot agent creating an entire snapshot each time it resynchronizes the subscribers (when I add an article or change a column in an article; remember the publisher is SQL 2000).
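
    If it's of any use, this is roughly how I check those publication options on the publisher (the publication name is a placeholder):

    -- on the SQL 2000 publisher, in the published database
    -- the immediate_sync, allow_anonymous and independent_agent columns show the current settings
    EXEC sp_helppublication @publication = N'MyPublication';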

    Thanks

    David B.

  • Both distribution agents are running under the same profile, and the only difference in their parameters is that the agent for the remote subscriber uses SQL authentication (-SubscriberLogin + -SubscriberPassword), whereas the local one uses integrated authentication on behalf of the SQL Agent account.

    This is nonsense I know.

    It may not be nonsense. I have witnessed an extreme slowdown of replication (albeit on SQL 2000) where the DC was a bit flaky and didn't respond quickly, so the server elected to authenticate against a different (remote) DC. Forcing it back to its local DC made a HUGE difference.

    If that's the only difference between the two subscribers, it's worth trying SQL authentication, even if it just rules that out as the cause of the problem.
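
    For what it's worth, the switch would come down to something like the following parameters on the local distribution agent's job step (server, database and login names here are placeholders, not yours):

    -Subscriber [LONDONSUB] -SubscriberDB [ReplicatedDB]
    -SubscriberSecurityMode 0 -SubscriberLogin repl_login -SubscriberPassword **********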

  • Thanks Ian,

    I'd nevertheless like to let the distribution agent complete its reading of MSrepl_commands and start applying before I consider stopping it to try a different configuration. FYI, the replication pair is not in a domain, and the authentication mechanism is just local NTLM.

    The sp_MSget_repl_commands call makes very few IOPS and is waiting on IO_COMPLETION most of the time. Profiler shows no activity (even with SP:StmtStarting and SP:StmtCompleted turned on), so it may be in the middle of the long select xact_seqno...from MSrepl_commands (64 million rows now).

    Then I sampled fn_virtualfilestats and the perfmon counter (PhysicalDisk: Avg. Disk sec/Read) for an hour, and it showed very few IOs from SQL Server on the distribution data file, along with very poor service times on the disks, around 65 ms on average (it's a RAID 5 group, another nonsense when you see the log reader writing like hell on top of all the reading activity).
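
    For the record, this is roughly the sampling query I used (a sketch, assuming the distribution database is simply named 'distribution'):

    -- IO volume and stalls per file of the distribution database
    DECLARE @db int;
    SET @db = DB_ID(N'distribution');
    SELECT DbId, FileId,
           NumberReads, BytesRead, IoStallReadMS,
           NumberWrites, BytesWritten, IoStallWriteMS
    FROM ::fn_virtualfilestats(@db, NULL);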

    The stats on MSrepl_commands were updated by the last distribution cleanup, and the index is forced anyway inside sp_MSget_repl_commands, so I don't think it's a bad plan.

    I'm stuck on this one.

    Thanks, David B.

  • Hang on, we have event ID 129 errors from HpCISSs2 on the box; I assume it refers to the RAID controller. It could indicate a problem with the on-board cache.

    I'll let you know.

    Thanks,

    David B.

  • Seems more like an IO issue than anything else. Are the user DB and the distribution DB on the same controller? The log files too?
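
    To check quickly, something like this shows where each data and log file lives (a sketch; the subscriber database name is a placeholder):

    -- file placement per database on the SQL 2005 instance
    SELECT DB_NAME(database_id) AS db, type_desc, physical_name
    FROM sys.master_files
    WHERE DB_NAME(database_id) IN (N'distribution', N'ReplicatedDB');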

    The events that are being thrown right now could mean that the cache might be disabled, but since this only started appearing now, that should not be the only issue. RAID 5 is fine depending on what you are using it for.

    For write-intensive applications, RAID 1 or RAID 1+0 are probably better choices (albeit higher in terms of hardware cost), as the performance of RAID 5 will begin to decrease substantially in a write-heavy environment.

    Also keep in mind that the user DB and the distribution DB use random writes; only the log is sequential writes. RAID 5 is better for sequential writes.

    Just my 2 cents

    -Roy

  • Hi Roy,

    I agree on the RAID stuff, but you know how it works: DBAs are barely involved from the project's kickoff. The server was already set up with this RAID 5 disk group even before I had a chance to give my opinion. There are still battles to win before we eventually win the war (cf. http://miracleas.com/BAARF/) 😉

    One hour ago the distribution agent failed, exceeding the -QueryTimeout. It hasn't been able to read most of the MSrepl_commands table in 10 hours.

    And the IO subsystem lead is quite confusing: the RAID 5 is able to deliver some 80 MB/s whenever a checkpoint occurs, yet no more than 4.3 MB/s when distrib.exe tries to read MSrepl_commands.

    This server looks like an idle system.

    I still have no news from the system engineer about this cache failure (FYI, the event has been visible since last week, not just right now, which is around the time the latency started to grow).

    I'll try to post an answer when I have news from the SE. Thanks for your interest.

    David B.

  • Sounds like you have a fundamental problem somewhere, because you shouldn't be timing out. In the odd bad moment I've had distribution databases with a backlog of 100 GB and never had my distribution agent time out. I assume there aren't any huge transactions being created by your publisher.

    Can I just confirm one thing: you have a publisher and a subscriber located in the same room, and a subscriber in Germany. Where does the distributor sit? On the subscriber which is in the same room as the publisher?

    Mike

  • Yes, exactly.

    There are no huge transactions opened at the publisher, and the other subscriber (in Germany) is replicating with no latency at 600 cmds/sec.

    I've also noticed that when it reinitialized just after the timeout error, the distribution agent used parallel threads for a while and the IO jumped to a fair rate, then it dropped back to one thread, bringing the IOs down again.

    David B.

  • This is just a shot in the dark, but is there any blocking in the subscriber DB? It seems like it starts to catch up and then gets blocked by another process.
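
    A quick way to check, as a rough sketch on the SQL 2005 subscriber, is to look for blocked sessions:

    -- sessions currently blocked by another spid
    SELECT spid, blocked, lastwaittype, waitresource, cmd, hostname, program_name
    FROM master.dbo.sysprocesses
    WHERE blocked <> 0;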

    -Roy

  • Hi all,

    Sorry to get back to you this late; I was travelling last week and didn't get a chance to look at the forums.

    Anyway, I followed Mike's advice and delayed the distribution cleanup: I scheduled the job to run once a day, at night, when the system is quiet. It has been MUCH better since then, and last Friday the queues were emptied when the activity slowed down in the late afternoon, so a complete blockage was avoided. Better, though not perfect, because the latency is still just beyond acceptable.

    We'll have a chance to upgrade the publisher to SQL 2005 next weekend and to rebuild the replication at the same time. I'll probably take that opportunity to split my publication into multiple ones.

    Roy, there was indeed blocking on the subscriber, but not involving replication. I asked the BI teams reporting on the subscribed databases to use WITH (NOLOCK) in their queries, and I haven't seen any blocking between the distribution agent and the reporting since then. But they still have some skewed parallel threads blocking and waiting on CXPACKET, so I suggested adding MAXDOP 1 too.
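
    For the record, the pattern I suggested to them looks like this (table and column names are made up, just to show the hints):

    SELECT c.CustomerId, SUM(o.Amount) AS Total
    FROM dbo.Orders AS o WITH (NOLOCK)
    JOIN dbo.Customers AS c WITH (NOLOCK)
        ON c.CustomerId = o.CustomerId
    GROUP BY c.CustomerId
    OPTION (MAXDOP 1);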

    Thanks,

    David B.
