Always On Availability Group - Log Send Rate Horrible

  • Hi -

    We've been using AlwaysOn Availability Groups for a few months. Prior to Always On, we leveraged SQL 2008 R2 Async mirroring.

    We have a dedicated 10 Gb pipe between the two servers. The servers sit physically next to one another on a 10 Gb network.

    Database mirroring in SQL 2008 R2 was able to keep up with this same workload.

    Well - the performance of AlwaysOn is NOT good. The log_send_rate is absolutely horrible. The best rate we see is 11 MB/sec, but normally the rate is around 2 MB/sec.

    Has anyone experienced the rate becoming so slow that the status goes unhealthy and the database has to be removed and reconfigured within the availability group?

  • Has anyone experienced this issue with large log generation?

    I've been working with Microsoft since January. Long story short, we haven't received a fix yet...

    Directly from Microsoft:

    Issue 1 – Allocating global PMO Object Becomes hotspot for logged activity with fast local storage (Impacts primary workload)

    The current timeline for Issue 1 (latency issue identified during production in late 2012) is that there may be a check-in of a fix which would be slated for SQL Server 2012 SP1 CU3, available in mid-March.

    IMPORTANT: Please know that you have the option to request an ‘On Demand’ fix, in which the product group, once the fix is fully regression tested, can make the fix available earlier than the CU3 timeline. The earliest this fix would be available is mid-February. Please let us know if you want to pursue this fix.

    Issue 2 – Replica throughput limited due to expensive CRC (Impacts network throughput)

    We anticipate that your production environment will experience this issue (once Issue 1 is resolved). We have started investigating a fix that will increase throughput, but it is possible we will find another bottleneck. As with any optimization problem, this will take time to investigate and productize (roughly 6-8 months).

  • There are a number of potential issues. I have been working through this exact issue in our own production environment.

    1. There is a hotfix in SP2 CU1 (http://support.microsoft.com/kb/2982843 "FIX: Longer latency for SQL Server 2012 database when you use Service Broker, database mirroring, and Availability Groups"). There is a less important, but still relevant, hotfix in SP2 CU3 as well (http://support.microsoft.com/kb/3012182 "FIX: Log_Send_Rate column in sys.dm_hadr_database_replica_states cannot reflect the rate accurately in SQL Server 2012"). We have had the issue described by KB2982843 and seen it resolved by the fix. It's sporadic, and we found it impossible to reproduce manually. (There is a quick version check after this list if you want to confirm which build you're on.)

    2. Long running transactions. Understand that in my opinion, the log_send_rate is NOT what it's described as. It's not measuring what is sent; it's actually measuring what is being removed from the send queue. And log records are only removed from the send queue when the *complete transaction* (all log records) has been received by every secondary, hardened to disk by every secondary, and an ack received from every secondary telling the primary "you can remove all log records for this transaction now, it's done". That is a very different measurement from "Rate at which log records are being sent to the secondary databases, in kilobytes (KB)/second" on http://msdn.microsoft.com/en-us/library/ff877972.aspx.

    Long-running transactions will affect your send rate (if it's measured by how fast entire transactions are removed from the send queue, rather than by how fast individual log records are sent in KB/sec), because the entire transaction has to complete on the primary, reach all secondaries, get hardened to disk, and have acks sent back to the primary BEFORE it can be removed from the send queue. This is why you may occasionally see impossible speeds. I see send rates of 61+ MB/sec on a transatlantic link between the primary and secondary when the maximum link speed for individual connections is 2-3 MB/sec. This is because 61 MB has just been cleared from the local Log Pool (send queue), not because 61 MB has been squeezed down a 1-2 MB pipe. And it's the wait for long-running transactions to harden on all secondaries that will cause the speed to look slower than it actually is (assuming you have already installed the patches in point 1). The DMV sketch after this list is how I watch the send queue against the reported rate.
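
    If you want to confirm whether those two builds are actually installed before chasing anything else, a quick version check on each replica is enough. This is just a sketch; compare the build it returns against the build numbers listed in the KB articles above:

    SELECT  SERVERPROPERTY('ProductVersion') AS product_version,  -- build number, compare against the KBs
            SERVERPROPERTY('ProductLevel')   AS product_level,    -- e.g. SP2
            @@VERSION                        AS version_string;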
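
    And here is roughly how I watch the send queue against the reported rate, run on the primary. The columns come straight from sys.dm_hadr_database_replica_states; the join and the is_local filter are just how I happen to slice it, so treat it as a sketch and adjust to taste:

    SELECT  ar.replica_server_name,
            DB_NAME(drs.database_id)       AS database_name,
            drs.synchronization_state_desc,
            drs.log_send_queue_size,       -- KB still sitting in the send queue
            drs.log_send_rate,             -- KB/sec being cleared from the send queue (see the caveat above)
            drs.redo_queue_size,           -- KB waiting to be redone on the secondary
            drs.redo_rate                  -- KB/sec redo throughput on the secondary
    FROM    sys.dm_hadr_database_replica_states AS drs
    JOIN    sys.availability_replicas           AS ar
            ON ar.replica_id = drs.replica_id
    WHERE   drs.is_local = 0               -- the secondary replicas' rows, as seen from the primary
    ORDER BY ar.replica_server_name, database_name;

    If the rate looks tiny while the queue keeps growing during an index rebuild or a bulk load, that is the long-running-transaction effect described above rather than the network suddenly getting slower.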
