Massive performance drop when using Synchronous commit mode on an Availability group.

  • This is a test environment for a future production environment.

    I've got two physical systems, each running one VMware guest.

    I'm using the HammerDB TPC-C tests for this.

    My process is to use HammerDB to build the TPC-C database, resize all the files so there is no file growth during the test, allow no checkpoints until the build is complete, and then run a timed 10-minute test with 8 warehouses and 8 users.

    This test results in roughly 140k TPM.

    Then I spin up the availability group between the two nodes and rerun the test with the AG in async mode. I get roughly the same 140k TPM results.

    Turning on synchronous commit for both replicas, I get 1386 TPM.

    That's less than 1% of the performance in a normal or an async configuration.

    This can't be normal. Someone please tell me what I'm doing wrong.

    More information:

    When I look at wait types after a typical run, they look like this, in order of most to least:

    CXPACKET (CTFP=5, MAXDOP=0)

    LATCH_EX

    HADR_SYNC_COMMIT (this is the evidence of it waiting on something I guess)

    WRITELOG

  • This was removed by the editor as SPAM

  • mortalic (5/17/2016)


    Why do you think it can't be right? What monitoring did you do under load (on both tests) so you could compare things to see what made it so slow? Differential file I/O stall and wait stats analyses at a minimum. You gave the highest waits, but no actual numbers and no comparison between the two.

    You also didn't tell us the single most important thing for sync commit: network capabilities (latency chief among those, with throughput a distant but important second). Nor the I/O capabilities on the secondary (way behind network in importance).

    Let's take it to the extreme and assume you have a network that can only transfer 1 BYTE of data per second. How much throughput do you think you will get?

    Also note that the CTFP default of 5 is universally too low on modern hardware, and the MAXDOP default of 0 is almost universally bad if you are on NUMA hardware or a NUMA-configured VM.
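
    If you want to adjust those, something along these lines works (just a sketch; the numbers here are placeholders only, so pick values appropriate to your hardware and workload):

    -- Check / adjust the instance defaults; the values below are only placeholders
    EXEC sp_configure 'show advanced options', 1;
    RECONFIGURE;
    EXEC sp_configure 'cost threshold for parallelism', 50;  -- placeholder; tune for the workload
    EXEC sp_configure 'max degree of parallelism', 4;        -- placeholder; often cores per NUMA node
    RECONFIGURE;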

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • Thank you for realizing I'm not a spam bot. Not sure what's up with that guy.

    I am hoping it isn't right because the drop in performance is so massive.

    I have been doing load testing on the individual system with various configurations (cost threshold for parallelism, MAXDOP, different VMware tweaks, disk configurations, etc.) for the last month or so. I have pretty good data on what I would expect the results to be, including the worst wait types for these runs.

    For a typical 10-minute run with MAXDOP=0 and CTFP=5, with 8 CPUs and two vNUMA nodes, I could expect up to 140,688 transactions per minute.

    With MAXDOP=1 (the slowest individual setting) I could expect to see at least 23,363 transactions per minute.

    Regarding raw disk performance, using diskspd (the SQLIO replacement), I could see up to 1.7 gigabytes per second in either reads or a mix of reads and writes in the fastest disk configuration (all SSD).

    Regarding the network backbone, these are on a 40Gb backplane, shared with one other system that is pretty much idle.

    Configuring a second physical system on the same storage, adding a VMware guest of identical configuration, and adding the same TPC-C workload (again with HammerDB), I could see transactions per minute in the 140k range. Adding them to an availability group together, with async commit and no readable secondary, the test was repeatable at around 140k TPM. You could see around 10 MB/s of disk I/O and about the same network bandwidth on the secondary replica while the TPC-C workload was running.

    Changing the availability group to synchronous on both, the transactions per minute dropped down to 1386. During the test I switched the AG between async and synchronous, and the performance would rise or fall as you might expect based on the numbers above.

    Additionally, when running the tests without the synchronous AG, CPU usage would be a pretty uniform 75% across all cores; with synchronous commit, CPU usage averaged under 8% and maxed out around 24%. Disk I/O was nonexistent and network I/O never peaked over 1 megabyte per second.

    I feel like I have to be missing something important, if not obvious, because Google has led me to various blog posts, some with people posting an average of 12,000 transactions per second with synchronous AGs.

    Hence I'm here, trying to understand what I must be missing.

    Thanks in advance for trying to help.

    EDIT:

    Sorry, just realized I didn't post the wait type numbers. I am at home and don't have that spreadsheet easily available. I'll collect that tomorrow when I'm back at work.

  • OK, that's a LOT more information to go on. In no particular order:

    1) Observe how far behind the syncing gets during the async run. If it is way behind, it indicates slowness in actually migrating and replaying tlog records. Once you finish your run, see how long it takes to completely catch up. That full time divided by total transactions performed will better indicate what your actual "committed on the secondary" throughput is. You should also run DBCC SQLPERF(LOGSPACE) regularly throughout the run to see how much space the tlog actually consumes.
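
    As a rough sketch of that monitoring (run on the primary; the send/redo queue columns come from sys.dm_hadr_database_replica_states, and DBCC SQLPERF(LOGSPACE) is the log space check mentioned above):

    -- Run on the primary: how far behind each replica is, in KB
    SELECT ar.replica_server_name,
           DB_NAME(drs.database_id)   AS database_name,
           drs.log_send_queue_size,   -- log not yet sent to the secondary (KB)
           drs.log_send_rate,         -- send rate (KB/sec)
           drs.redo_queue_size,       -- log received but not yet redone (KB)
           drs.redo_rate              -- redo rate (KB/sec)
    FROM sys.dm_hadr_database_replica_states AS drs
    JOIN sys.availability_replicas AS ar
        ON ar.replica_id = drs.replica_id;

    -- Log space actually consumed; sample this every minute or so during the run
    DBCC SQLPERF(LOGSPACE);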

    2) Differential wait stats analysis should point to the bottleneck, as you know. I look forward to seeing your numbers. And please do it as differential: take a snapshot, wait a few minutes, snapshot again, then diff the numbers and divide by time. A 3-5 minute window in the middle of each test should suffice here. I doubt file I/O stalls will be bad, but I want proof. Even with SSDs in play I have seen clients with such grotesquely misconfigured/overprovisioned VMs that I/O still sucks.
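
    Something along these lines is what I mean by differential (just a sketch against sys.dm_os_wait_stats; adjust the WAITFOR window to the 3-5 minutes mentioned above):

    -- Snapshot the cumulative wait stats
    SELECT wait_type, waiting_tasks_count, wait_time_ms
    INTO #waits_start
    FROM sys.dm_os_wait_stats;

    WAITFOR DELAY '00:05:00';   -- sample window in the middle of the test

    -- Diff against the snapshot: top waits for just this window
    SELECT TOP (15)
           w.wait_type,
           w.wait_time_ms - s.wait_time_ms               AS wait_ms_in_window,
           w.waiting_tasks_count - s.waiting_tasks_count AS waits_in_window
    FROM sys.dm_os_wait_stats AS w
    JOIN #waits_start AS s ON s.wait_type = w.wait_type
    WHERE w.wait_time_ms - s.wait_time_ms > 0
    ORDER BY wait_ms_in_window DESC;

    DROP TABLE #waits_start;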

    3) Speaking of misconfigured, what is the network configuration of both VMs? It is REALLY easy to saturate a single 1Gb NIC, especially a poorly configured virtual one. Is paravirtualization in play? My guess is this is where you are getting killed.

    4) What is the exact build of SQL Server? There have been a LOT of bug fixes and improvements to AlwaysOn stuff over the years.

    5) Page splits: if you have default fill factors on all your indexes then, depending on the use pattern, you could be experiencing a large number of "bad" page splits, which bloats the tlog content that has to be synchronously committed on the secondary (as opposed to just queued up in the tlog to be sent later).
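
    A quick way to get a feel for both (a sketch only; the page-split counter is cumulative, so sample it before and after a run and diff, and a fill_factor of 0 means the server default):

    -- Cumulative page split counter; sample before and after a run and diff
    SELECT object_name, counter_name, cntr_value
    FROM sys.dm_os_performance_counters
    WHERE counter_name = 'Page Splits/sec'
      AND object_name LIKE '%Access Methods%';

    -- Fill factor currently set on each index (0 = server default, i.e. 100)
    SELECT OBJECT_NAME(i.object_id) AS table_name, i.name AS index_name, i.fill_factor
    FROM sys.indexes AS i
    JOIN sys.tables  AS t ON t.object_id = i.object_id;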

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

    Read this if you are on an older version of SQL Server:

    https://support.microsoft.com/en-us/kb/2723814

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

    SQL Server version is 12.0.4449.0 (2014 SP1 CU6).

    The asynchronous run:

    Here is a summary of waits after the TPC-C database build, followed by a 10-minute performance run.

    I restarted the SQL Server service so that the waits are indicative of just this build and run.

    Results:

    CPU AVG: 75.824%

    CPU MAX: 95.903%

    Transactions Per Minute: 142,410

    My wait stats for the async run appear to have a copy/paste failure; I'll have to re-run them and get back to you.

    The synchronous run:

    Here is a summary of waits after the TPC-C database build, followed by a 10-minute performance run.

    I restarted the SQL Server service so that the waits are indicative of just this build and run.

    Results:

    CPU AVG: 7.995%

    CPU MAX: 24.132%

    Transactions Per Minute: 1386

    Primary replica after TPC-C DB build:

    HADR_SYNC_COMMIT 41240.90 seconds

    WRITELOG 507.17

    HADR_GROUP_COMMIT 423.45

    LATCH_EX 79.15

    PAGELATCH_EX 60.04

    PREEMPTIVE_OS_WRITEFILEGATHER 15.43

    PAGELATCH_UP 10.22

    LCK_M_X 5.55

    HADR_SYNCHRONIZING_THROTTLE 4.82

    Secondary replica after TPC-C DB build:

    REDO_THREAD_PENDING_WORK 5111.13 seconds

    HADR_DB_COMMAND 36.10

    PWAIT_HADR_WORKITEM_COMPLETED 24.29

    BACKUPTHREAD 19.27

    LCK_M_X 19.16

    ASYNC_IO_COMPLETION 19.14

    PREEMPTIVE_OS_WRITEFILEGATHER 18.64

    WRITELOG 16.13

    PAGEIOLATCH_EX 4.05

    PAGEIOLATCH_SH 1.57

    LCK_M_S 1.54

    IO_COMPLETION 1.12

    Primary replica after 10-minute performance run

    Secondary replica after 10-minute performance run

    Regarding the VMware config, I'll need to wait until the systems guy gets in to learn the details, so I'll get back to you. I tend to agree that this has to be the bottleneck.

    I sent him an email earlier about putting the guests on the same host to see if it changes the performance characteristics of these systems. From the guest OS side, it shows a pair of 10 Gbps Ethernet adapters: one for the cluster and one, called testLan99-1, for the rest of the connectivity. This testLan adapter appears to be carrying the majority of the traffic. I did change the testLan to be the higher-priority NIC in the list; that didn't seem to have any effect.

    Regarding page splits, it appears there are some; the fill factor values are the defaults.

    Side note, I really do appreciate your help, thank you.

    Patched up, so unless something is messed up with 2014 SP1 CU6 itself, that build should be fine.

    41K seconds of HADR_SYNC_COMMIT waits? Like I said: network bottleneck. That is the wait for the log block to be hardened on the secondary. Clearly you cannot be successful with 11.5 HOURS of waits in 10 minutes!! :hehe:

    https://blogs.msdn.microsoft.com/psssql/2013/04/26/alwayson-hadron-learning-series-hadr_sync_commit-vs-writelog-wait/
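
    If it helps to quantify it, the Database Replica perf counters can put a number on the per-commit delay. This is just a sketch; both counters are cumulative, so sample twice, diff, and divide the Transaction Delay delta by the Mirrored Write Transactions delta to get average ms per commit spent waiting on the sync secondary:

    -- Raw counter values; sample at two points in the run and diff
    SELECT instance_name AS database_name, counter_name, cntr_value
    FROM sys.dm_os_performance_counters
    WHERE object_name LIKE '%Database Replica%'
      AND counter_name IN ('Transaction Delay', 'Mirrored Write Transactions/sec');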

    Also, run sp_whoisactive now and again throughout the load too. You'll probably see SPIDs waiting on that or other related waits.
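
    For example (sp_WhoIsActive is Adam Machanic's stored procedure and needs to be installed separately; these parameters are just a reasonable starting point):

    -- Snapshot what every active session is doing and waiting on right now
    EXEC sp_WhoIsActive
         @get_task_info = 2,        -- include detailed wait info per session
         @find_block_leaders = 1;   -- flag the head of any blocking chain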

    After a run, please get a detailed set of index physical stats and check for massive fragmentation.
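
    Something like this, run in the context of the TPCC database, is usually enough for a fragmentation check (LIMITED mode keeps the scan cheap; the page_count filter is just to skip tiny indexes):

    -- Fragmentation for every index in the current database
    SELECT OBJECT_NAME(ips.object_id)      AS table_name,
           i.name                          AS index_name,
           ips.avg_fragmentation_in_percent,
           ips.page_count
    FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
    JOIN sys.indexes AS i
      ON i.object_id = ips.object_id AND i.index_id = ips.index_id
    WHERE ips.page_count > 100
    ORDER BY ips.avg_fragmentation_in_percent DESC;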

    Can't recall if I asked: is the secondary set up in ANY way to be readable? If so, that is a big no-no in my book due to the 14-byte version store pointer issue.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • If it looks like a duck.....

    Since I posted this, I had the systems guy move them onto the same physical host and the TPM shot up to 75k.

    It's for sure the network.

    Thank you for helping me walk through this. 75k is half the performance of the async commit, but it's acceptable and I can work with it. It still far exceeds the system it will be replacing.

    EDIT:

    Index fragmentation is not an issue; fresh indexes and stats are created before each run.

    Secondary is not readable.

  • mortalic (5/18/2016)


    Glad you found the (obvious) smoking gun! The systems folks have some work to do to improve their VM configuration. And don't let them slide on this just by putting the VMs on the same machine. I bet they use a similar or identical config on their other VMs and have systems that, while maybe not getting crushed quite as badly as this one, still aren't performing up to what they should be.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

    Indeed, we will be going over the VM config in depth shortly.
