the distribution agent has very high latency (transactional replication)

  • Hi forum Members,

    I need help to nail down certain replication latency issue(I am not much familiar with replication)

    I am experiencing high latency issue on distributor to subscriber in transactional replication setup. sometimes for particular artcile "A"(single object) the latency can be as high as 3 hours where as for other article "B" (multiple objects) it can vary between 0 to 5 minutes. Though i do know that sometimes we have more than 1 million row updates happening on article "A" object within 30-45 minutes and at that time replication kind of gets bogged down and takes almost 3-4 hours to recover from this high transactional volume.

    In my environment publisher and subscriber are physical boxes with 24 CPU (2GHz) and 96 Gb memory, where as distributor is running on VMware having 4 CPU (2GHz) and 8 GB memory. All 3 servers are window 2003 and sql 2005 boxes.

    First of all my question is, does having this distributor running on VMware is impacting performance? will increasing horsepower and memory on distributor will help me (though i have never seen CPU consumption more than 30%)? will moving this to physical box will help it (My operations guy do not prefer to switch to physical box until i prove the benefit of doing this, if necessary)?

    Also, I have provided couple samples from the setup scripts that I used to setup publication as well as article. Please provide feed back if something wrong with this.

    Finally a very vague question, is it possible to keep DR site almost close to real time to OLTP ( 1 to 5 minutes of dataloss is acceptable in case of failure on publisher) using transactional replication (considering that we might have 2-3 million commands generated within minutes)? I am just trying to be optimistic here:).

    Thanks in advance.

    MJ

    --------------------------------------------------------------

    Below is the SQL I used to setup publication

    EXEC sp_addpublication

    @publication = N'table A',

    @description = @description,

    @sync_method = N'concurrent',

    @retention = 0,

    @allow_push = N'true',

    @allow_anonymous = N'false',

    @enabled_for_internet = N'false',

    @snapshot_in_defaultfolder = N'true',

    @compress_snapshot = N'false',

    @ftp_port = 21,

    @ftp_login = N'anonymous',

    @allow_subscription_copy = N'false',

    @add_to_active_directory = N'false',

    @repl_freq = N'continuous',

    @status = N'active',

    @independent_agent = N'true',

    @immediate_sync = N'true',

    @allow_sync_tran = N'false',

    @autogen_sync_procs = N'false',

    @allow_queued_tran = N'false',

    @allow_dts = N'false',

    @replicate_ddl = 1,

    @allow_initialize_from_backup = N'true'

    following for article setup

    EXEC sp_addarticle

    @publication = 'Table A', @article = table_A,

    @source_owner = 'dbo', @source_object = 'table_A',

    @type = 'logbased', @description = '', @creation_script = '', @pre_creation_cmd = 'drop',

    @schema_option = 0x000000000803509F, @identityrangemanagementoption = 'manual',

    @destination_table = 'table_A', @destination_owner = 'dbo', @status = 24,

    @vertical_partition = 'false',

    @ins_cmd = 'CALL [dbo].[sp_MSins_dbotable_A]',

    @del_cmd = 'CALL [dbo].[sp_MSdel_dbotable_A]',

    @upd_cmd = 'SCALL [dbo].[sp_MSupd_dbotable_A]

  • How many rows are there in your msrepl_commands table on the distribution server (do a SELECT sum(rows) FROM sys.partitions WHERE object_id = OBJECT_ID('msrepl_commands') AND index_id IN (1,0) - if your table is really large a count(*) on the table might take a very long time)?

    Are there any network waits on the subscriber applying the commands, any IO waits on the subscriber applying the commands etc (you can check with perfmon/profiler/sp_whoisactive etc)?

    Does the update which affects 1 million rows happen within a transaction? Or are these individual updates totalling up to 1 million rows affected?

    First of all my question is, does having this distributor running on VMware is impacting performance? will increasing horsepower and memory on distributor will help me (though i have never seen CPU consumption more than 30%)? will moving this to physical box will help it (My operations guy do not prefer to switch to physical box until i prove the benefit of doing this, if necessary)?

    The recommendation is to have the distributor on a separate server (which is already the case in your setup) - I remember reading some article (which I cannot locate now) that mentions that SQL Server performs around 10%-15% better on a physical box v/s a VM.

    Finally a very vague question, is it possible to keep DR site almost close to real time to OLTP ( 1 to 5 minutes of dataloss is acceptable in case of failure on publisher) using transactional replication (considering that we might have 2-3 million commands generated within minutes)? I am just trying to be optimistic here:).

    It depends 🙂 Do you know what your average commands/sec transfer rate is? i.e assume your average throughput to the subscriber is 3000 commands/second and if you generate 3 million commands in 3 minutes on the publisher it would take 15 minutes for all of this to propogate to the subscriber - your DR plan should consider what happens if you have an outage on the publisher while these commands are being replicated. In my opinion transactional replication isn't the best DR strategy.

    And regarding the setting up of publication and articles:

    a) Your publication is set with immediate_sync set to true. Check the articles below for performance impacts of this setting:

    http://www.replicationanswers.com/TransactionalOptimisation.asp

    http://blogs.msdn.com/b/chrissk/archive/2009/07/27/how-replication-setting-immediate-sync-may-cause-transactional-replication-distribution-database-growth.aspx

    b) Check this -> http://msdn.microsoft.com/en-us/library/ms151762(v=sql.90).aspx for some switches which might help the distribution agent performance (for e.g. -SubscriptionStreams, -ReadBatchSize).

  • Mandip

    I am experiencing high latency issue on distributor to subscriber in transactional replication setup. sometimes for particular artcile "A"(single object) the latency can be as high as 3 hours where as for other article "B" (multiple objects) it can vary between 0 to 5 minutes. Though i do know that sometimes we have more than 1 million row updates happening on article "A" object within 30-45 minutes and at that time replication kind of gets bogged down and takes almost 3-4 hours to recover from this high transactional volume.

    What is the line speed between the distibutor and the subscriber? I have a similiar configuration, but my distributor in on it's own hardware. We pump millions of trasnactions an hour through without issues, BUT...

    I have several tables that whenever I have to reinitialize take minutes to load locally over a 1GB line but hours over a T1.

    As for VM vs dedicated server, we are experimenting currently with VM. We had a server setup to run our SQL Sentry server and SQL Server on a VM server. It was a dog. We could not open consistently any of the servers being monitored using SQL Sentry. CPU was running at 50 - 60% constantly. That was with 4 CPU's. We moved the database from the VM to another hardware server and we are not having any issues.

    Steve Jimmo
    Sr DBA
    “If we ever forget that we are One Nation Under God, then we will be a Nation gone under." - Ronald Reagan

  • How many rows are there in your msrepl_commands table on the distribution server = 112078447

    Yes, I do see ASYNC_NETWORK_IO (most signifiacant Waiting Tasks rate 750 tasks/s) and WRITELOG.

    No, its not single update. it can be many updates which amount to 1 Million commands in 10-15 minutes. maximum single transactionis touching 5000 updates.

    Average command rate veries between 100 - 500 commands/sec over the time period.

    I will review the immediate_sync setting too, thanks for pointing it out.

  • What is the line speed between the distibutor and the subscriber? i believe here we are talking about LAN speed. if yes, it's 1 GB.

  • How many rows are there in your msrepl_commands table on the distribution server = 112078447

    That's a large number of rows - this could cause performance issues reading commands to be applied to the subscriber.

    Yes, I do see ASYNC_NETWORK_IO (most signifiacant Waiting Tasks rate 750 tasks/s) and WRITELOG.

    The pdf attached at this site - http://blogs.msdn.com/b/chrissk/archive/2010/03/11/slides-from-sql-saturday-33-troubleshooting-transactional-replication.aspx - has some details on checking the data in MSLogreader_history and MsDistribution_history to get information on why the agents are slow (i.e. is there a problem reading or writing etc).

    Average command rate veries between 100 - 500 commands/sec over the time period.

    This depends on a lot of factors - but the number you have is pretty low. If you have a test environment you can try various distribution agent/log reader switches to see if any of them improve the performance.

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply