
Distrib agent stalls and can't catch up
Posted Friday, October 9, 2009 1:50 AM
Hi,

I have a SQL Server 2000 instance publishing to a secondary SQL Server 2005 instance. The SQL 2005 instance hosts both the distribution database and the replicated database.
The subscription is in push mode. Both the publisher and the distributor/subscriber are strong configurations (HP ProLiant, 8 x 3 GHz, local 15K RPM SAS, etc.). The distribution agent runs in continuous mode, with a profile that has all values at their defaults except the query timeout, which is set to 36000 seconds instead of 1800, because when running against a large MSrepl_commands it fails with a timeout error.

On the inbound side, I have 80 clients inserting massively into the publisher in OLTP fashion.

Now the MSrepl_commands and MSrepl_transactions queues are filling far too fast for the distribution agent to catch up. Last night I set up two counters: log reader cmds/sec and distributor cmds/sec. The log reader reports 300 cmds/sec and the distribution agent 100 cmds/sec, and sometimes it even reports nothing and seems completely inactive. MSrepl_commands is now 12 GB. I would like to understand why the distributor is so far behind the log reader.
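
As a reference point, here is the kind of query I run on the distributor to watch the backlog and the agent's recent delivery rate (just a sketch; it assumes the distribution database is named distribution):

-- Sketch: backlog size and recent distribution agent throughput
USE distribution;
SELECT COUNT(*) AS pending_commands FROM dbo.MSrepl_commands WITH (NOLOCK);
SELECT TOP 20 agent_id, [time], delivered_commands, delivery_rate, delivery_latency, comments
FROM dbo.MSdistribution_history WITH (NOLOCK)
ORDER BY [time] DESC;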

There are 3 SQL connections on the subscriber related to distrib.exe (i.e. under the same host process):
- one running sp_MSget_repl_commands, with very few I/Os per second reported in the sys.sysprocesses physical_io column;
- one tracked under a Profiler session and supposed to apply the transactions at the subscriber, but stuck at the BEGIN TRAN, doing no I/O and not waiting for any resource to be freed (last_wait_type is MISC and waittime stays at 0);
- one running sp_MSadd_distribution_history, sometimes blocked for quite a long time (more than 200 seconds on average) by the distribution cleanup procedure, which runs:
EXEC dbo.sp_MSdistribution_cleanup @min_distretention = 0, @max_distretention = 120

Note that I set @min_distretention to 0 because I don't want transactions that have already been applied to remain in the distribution database.

I don't know what to do to speed up the distribution process. Is there a way to avoid blocking between the distribution cleanup and distrib.exe?
I would rather avoid modifying parameters in the distribution agent profile, because restarting the distribution agent is costly: it must restart its sp_MSget_repl_commands scan from the beginning, and the table gets bigger every minute, so I fear the situation would get worse each time I restarted distrib.exe.
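
For what it's worth, this is roughly how I spot the blocking between the cleanup procedure and the agent sessions (a minimal sketch against sysprocesses; adjust to taste):

-- Sketch: list blocked sessions, who blocks them, and how long they have waited
SELECT spid, blocked, waittime, lastwaittype, DB_NAME(dbid) AS dbname, program_name
FROM master.dbo.sysprocesses WITH (NOLOCK)
WHERE blocked <> 0
ORDER BY waittime DESC;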

Thanks for sharing,

David B.
Post #800554
Posted Friday, October 9, 2009 2:10 AM
The first thing to do is to change the schedule of the Distribution Cleanup job. By default it runs every 10 minutes, which will kill a busy replication system; we run ours once every hour. It also sounds like you should consider a 3-server architecture so that the publisher, distributor and subscriber each have their own server.
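
If it helps, the cleanup schedule can be changed in plain T-SQL rather than through the GUI. A rough sketch (the job name below is the default one; verify the actual job and schedule on your distributor first):

-- find the cleanup job and its current schedule
SELECT j.name AS job_name, s.schedule_id, s.name AS schedule_name,
       s.freq_subday_type, s.freq_subday_interval
FROM msdb.dbo.sysjobs AS j
JOIN msdb.dbo.sysjobschedules AS js ON js.job_id = j.job_id
JOIN msdb.dbo.sysschedules AS s ON s.schedule_id = js.schedule_id
WHERE j.name LIKE N'Distribution clean up%';

-- then move it from every 10 minutes to hourly (plug in the schedule_id found above)
EXEC msdb.dbo.sp_update_schedule
     @schedule_id = 99,            -- placeholder: use the real schedule_id
     @freq_subday_type = 8,        -- 8 = hours
     @freq_subday_interval = 1;    -- every 1 hour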

I don't know how many tables you are publishing or how many publications you have, but creating more publications, each with a dedicated distribution agent, could boost throughput.
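
Something along these lines on the publisher gives a new publication its own distribution agent (the publication name is illustrative only):

EXEC sp_addpublication
     @publication = N'Pub_Orders',      -- example name, not yours
     @repl_freq = N'continuous',
     @status = N'active',
     @independent_agent = N'true',      -- dedicated distribution agent for this publication
     @immediate_sync = N'false';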

Hope some of that helps,

Mike
Post #800562
Posted Friday, October 9, 2009 2:20 AM
Thanks Mike,

You are right; I did consider splitting the single publication into multiple ones once, but for some reason it never happened.
I've disabled the distribution cleanup temporarily to let the distribution agent continue. I can see a little more I/O activity for sp_MSget_repl_commands.
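
(For the record, I only disabled the job rather than deleting it, roughly like this, assuming the default job name:)

EXEC msdb.dbo.sp_update_job
     @job_name = N'Distribution clean up: distribution',  -- verify the exact name on your distributor
     @enabled = 0;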

I'm wondering whether it has to do with the continuous mode. We originally designed the replication this way because the DSS applications working on the subscriber need the data as soon as it has been validated on the publisher. But maybe I could set the distribution agent to run at close, regular intervals to avoid bumping into the cleanup process. What do you think?

For your information, there are 20 tables and about 40 GB of data published in this publication.

David B.
Post #800568
Posted Friday, October 9, 2009 2:58 AM
Given our experience, I would say stick with continuous running. Running on a schedule will build up latency, and the distribution agents can slow down as the distribution database gets bigger. We replicate in the region of 800 million transactions a day around our entire infrastructure, which is roughly 200 million per day to each of 4 subscribers; this represents about 30-40 GB per day per server. We run all our agents in continuous mode, but that's not to say you shouldn't experiment. We also run the 3-server architecture I mentioned earlier.

Only getting 100 cmds/sec out of the distribution agent is not good, really. I'd expect to see something more like 500-800 cmds/sec, or even higher if inserts are the only activity. Hopefully everything is indexed properly and there is nothing lying around which shouldn't be there! Apologies if that sounds like an insult.
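
If you want a quick sanity check on the distributor, something like this (SQL 2005 DMV, assuming the default distribution database name) shows whether the MSrepl_commands indexes are badly fragmented:

SELECT i.name AS index_name, ps.index_type_desc,
       ps.avg_fragmentation_in_percent, ps.page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(N'distribution'),
         OBJECT_ID(N'distribution.dbo.MSrepl_commands'),
         NULL, NULL, 'LIMITED') AS ps
JOIN distribution.sys.indexes AS i
  ON i.object_id = ps.object_id AND i.index_id = ps.index_id;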

I really would recommend creating more publications to get better parallelism into your design. We only replicate 9 tables off our main publishers, but we have 3 publications, each with a dedicated distribution agent.

Mike
Post #800589
Posted Friday, October 9, 2009 3:32 AM
Thanks Mike,

I realized I didn't mention two things in the design:
- There are two subscribers on this publication (so two distrib.exe processes): the one causing concern is local to the publisher, in the same room (both in London).
- The second one is based in Germany, subscribes to the same amount of data, and runs perfectly with no latency at a rate of 600 cmds/sec.

Both distribution agents run under the same profile, and the only difference in their parameters is that the agent for the remote subscriber uses SQL authentication (-SubscriberLogin + -SubscriberPassword), whereas the local one uses integrated authentication on behalf of the SQL Agent account.

I know that makes no sense.

The other factor is that the publication was created with @immediate_sync = 0, to prevent the snapshot agent from generating an entire snapshot each time it resynchronizes the subscribers (when I add an article or change a column in an article; remember the publisher is SQL 2000).
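
(For reference, the setting can also be changed on an existing publication, roughly like this; the publication name is illustrative, and allow_anonymous has to be switched off first:)

EXEC sp_changepublication @publication = N'MyPublication',
     @property = N'allow_anonymous', @value = N'false';
EXEC sp_changepublication @publication = N'MyPublication',
     @property = N'immediate_sync', @value = N'false';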


Thanks
David B.
Post #800606
Posted Friday, October 9, 2009 4:05 AM
Both distribution agents run under the same profile, and the only difference in their parameters is that the agent for the remote subscriber uses SQL authentication (-SubscriberLogin + -SubscriberPassword), whereas the local one uses integrated authentication on behalf of the SQL Agent account.

I know that makes no sense.


It may not be nonsense. I have witnessed extreme slowdowns of replication (albeit on SQL 2000) where the DC was a bit flaky and didn't respond quickly, so the server elected to authenticate against a different (remote) DC. Forcing it back to its local DC made a HUGE difference.

If that's the only difference between the 2 subscribers, it's worth trying SQL authentication, even if it just rules that out as the cause of the problem.
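
In practice that just means editing the push distribution agent job step so the subscriber connection parameters look something like this (server, database and login names are placeholders):

-Subscriber [SUBSCRIBERSRV] -SubscriberDB [ReplicatedDB]
-SubscriberSecurityMode 0 -SubscriberLogin repl_reader -SubscriberPassword ********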



Post #800622
Posted Friday, October 9, 2009 5:23 AM
Thanks Ian,

I'd nevertheless like to let the distribution agent finish its read of MSrepl_commands and start applying before I consider stopping it to try a different configuration. FYI, the replication pair is not in a domain, and the authentication mechanism is just local NTLM.

The sp_MSget_repl_commands session performs very few IOPS and is waiting on IO_COMPLETION most of the time. Profiler shows no activity (even with SP:StmtStarting and SP:StmtCompleted turned on), so it may be in the middle of the long SELECT xact_seqno ... FROM MSrepl_commands (64 million rows now).

I then sampled fn_virtualfilestats and Perfmon (PhysicalDisk Avg. Disk sec/Read) for an hour, and they showed very few I/Os from SQL Server on the distribution data file, along with very poor service times on the disks, around 65 ms on average (it's a RAID 5 group, which is another oddity when you see the log reader writing like hell on top of all the read activity).
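
The file-level figures come from something along these lines (SQL 2005 syntax; -1 returns all files of the given database):

SELECT FileId, NumberReads, BytesRead, IoStallReadMS,
       NumberWrites, BytesWritten, IoStallWriteMS
FROM ::fn_virtualfilestats(DB_ID(N'distribution'), -1);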

The statistics on MSrepl_commands were updated by the last distribution cleanup, and the index is forced via a hint inside sp_MSget_repl_commands anyway, so I don't suspect a bad plan.

I'm stuck on this one.

Thanks, David B.
Post #800648
Posted Friday, October 9, 2009 5:45 AM
Hang on, we have event ID 129 entries from HpCISSs2 on the box, which I assume refers to the RAID controller. It could indicate a problem with the on-board cache.
I'll let you know.

Thanks,

David B.
Post #800660
Posted Friday, October 9, 2009 7:43 AM
This seems more like an I/O issue than anything else. Are the user DB and the distribution DB on the same controller? The log files too?
The events being thrown right now could mean the cache is disabled, but since they have only just started appearing, that should not be the only issue. RAID 5 is fine depending on what you are using it for.
For write-intensive applications, RAID 1 or RAID 1+0 are probably better choices (albeit higher in terms of hardware cost), as RAID 5 performance will begin to decrease substantially in a write-heavy environment.
Also keep in mind that the user DB and the distribution DB see random writes; only the log is sequential, and RAID 5 copes much better with sequential writes.
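
To put rough numbers on the write penalty (a simplified illustration, ignoring controller cache; the per-spindle figure is only a rule of thumb):

RAID 5 random write    = 4 physical I/Os (read data, read parity, write data, write parity)
RAID 1/1+0 random write = 2 physical I/Os (one write per mirror copy)
e.g. 8 x 15K spindles at roughly 180 IOPS each = ~1440 raw IOPS
     -> ~360 random-write IOPS on RAID 5 vs ~720 on RAID 1+0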
Just my 2 cents


-Roy
Post #800743
Posted Friday, October 9, 2009 9:36 AM
Hi Roy,

I agree on the RAID point, but you know how it works: DBAs are barely involved from the project's kickoff. The server was already set up with this RAID 5 disk group even before I had a chance to give my opinion. There are still battles to win before we eventually win the war (cf. http://miracleas.com/BAARF/).

One hour ago the distribution agent failed again, exceeding the -QueryTimeout. It hasn't been able to read most of the MSrepl_commands table in 10 hours.
And the I/O subsystem lead is quite confusing: the RAID 5 group can deliver some 80 MB/s whenever a checkpoint occurs, yet no more than 4.3 MB/s when distrib.exe tries to read MSrepl_commands.
This server looks like an idle system.

I still have no news from the system engineer about this cache failure (FYI, the event has been appearing since last week, not just now, which is around the time the latency started to grow).

I'll try to post an answer when I have news from the SE. Thanks for your interest.

David B.
Post #800882