AG Redo Log Performance

  • Friends and fellow DBAs,

    I'd appreciate any help with this perplexing issue.

    I have a cluster of servers in a distributed AG. The app that uses this database is extremely transactional. We have a problem with redo on our secondary replica: the redo logs do not get processed fast enough. We're constantly looking at 200-300 GB of redo (the redo_queue_size value) waiting to be processed on the secondary.


    log_send_queue_size is around 60 KB and redo_rate hovers around 9,000 KB/sec. The issue is that I want to speed up the processing of these redo logs on the secondary replica. Any information would be appreciated.
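
    For reference, these figures come from the AG DMVs. A minimal query against sys.dm_hadr_database_replica_states (run on the primary, which reports all replicas) looks something like the sketch below - the columns are the standard DMV columns, nothing exotic:

        -- Redo/send queue figures per database replica (queue sizes in KB, rates in KB/sec)
        SELECT ar.replica_server_name,
               DB_NAME(drs.database_id)   AS database_name,
               drs.is_primary_replica,
               drs.log_send_queue_size,   -- log not yet sent to the secondary (KB)
               drs.log_send_rate,         -- send throughput (KB/sec)
               drs.redo_queue_size,       -- log received but not yet redone (KB)
               drs.redo_rate,             -- redo throughput (KB/sec)
               drs.last_redone_time
        FROM sys.dm_hadr_database_replica_states AS drs
        JOIN sys.availability_replicas AS ar
             ON ar.replica_id = drs.replica_id;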

  • When do you see the redo queue size growing - and does it ever catch up?  When you have large transactions - such as index rebuilds - the amount of data sent to the secondary will be quite large.  Since it takes a lot less time to send the data than it does to update the data - the system will fall behind on the secondary.

    If that is the case - it is normal and the system will catch up in due time.

    If you are constantly seeing the redo queue behind by several hundred GBs - then you have some type of performance issue on that secondary.  This could be IO, memory or CPU problems - and you need to figure out which before you can address the issue.  If this is the case - verify the secondary has the correct IO setup and configuration for the SAN it is attached to - and has enough CPU and memory to support the workload.
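
    One way to narrow down whether the bottleneck is IO, CPU, or the redo machinery itself is to look at the wait statistics on the secondary.  A rough sketch - the wait-type filter below is only a starting point, not an exhaustive list:

        -- Run on the secondary: top waits since the wait stats were last cleared
        SELECT TOP (20)
               wait_type,
               waiting_tasks_count,
               wait_time_ms,
               signal_wait_time_ms
        FROM sys.dm_os_wait_stats
        WHERE wait_type LIKE 'PAGEIOLATCH%'     -- data file IO
           OR wait_type LIKE 'PARALLEL_REDO%'   -- parallel redo coordination
           OR wait_type LIKE 'HADR%'            -- AG-related waits
           OR wait_type IN ('REDO_THREAD_PENDING_WORK', 'WRITELOG')
        ORDER BY wait_time_ms DESC;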

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • Hi Jeff,

    Thanks for the reply. It's almost always behind. Apart from async auto stats updates, I don't do any other index or statistics maintenance. I constantly see multiple hundreds of GB in the redo queue on the secondary; however, disk latency on the secondary is in the single digits, the server has a high core count (64), it's a 64-bit build, and we have 1 TB of memory on the box. Physical resources are not lacking, I am confident of that.

    It's something else that's preventing the redo logs from being processed quickly, something software related. We're on a 2019 OS and the database engine is SQL Server 2017 with the latest approved patches. It's part of a DAG setup as well.

    lemme know what ya think!

    As I stated before - there is something preventing redo on the secondary from keeping up as expected.  It could be blocked by other processes running, or there is some type of performance issue.

    The first thing I would check is how tempdb is set up and configured - and whether the IO subsystem is actually performing as expected.  The version store used by the readable secondary lives in tempdb, and if tempdb is having issues it will cause a redo backlog.

    I would also review the amount of data changes occurring on the primary.  If you have several hundred GBs of data changes every day - it is going to take time to send that data and redo it on the secondary.  It may be an issue where the primary is changing too fast for the secondary to keep up.
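
    If you want to see whether the version store generated for the readable secondary is ballooning in tempdb, something along these lines works on SQL Server 2016 SP2 / 2017 and later (the ordering and aliases are just for readability):

        -- Run on the secondary: version store space consumed in tempdb, per source database
        SELECT DB_NAME(database_id)          AS source_database,
               reserved_page_count,
               reserved_space_kb / 1024.0    AS reserved_space_mb
        FROM sys.dm_tran_version_store_space_usage
        ORDER BY reserved_space_kb DESC;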

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • Right on, and therein lies the problem: some process is preventing quick processing of the redo logs. TempDB is set up very generously and is nicely divided into multiple files on the server. The primary and secondary replicas are exact mirrors of each other.


    One thing that caught my eye was how large the t-log file is on the primary (roughly 188 GB) and that the VLF count is a bit high (in the low 400s), but I'm not sure whether this has a negative impact on redo processing on the secondary.
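
    For reference, the VLF count can be confirmed with something like the following on SQL Server 2016 SP2 / 2017 and later (sys.dm_db_log_info replaces DBCC LOGINFO here):

        -- VLF count and total log size per database
        SELECT DB_NAME(d.database_id)   AS database_name,
               COUNT(*)                 AS vlf_count,
               SUM(li.vlf_size_mb)      AS log_size_mb
        FROM sys.databases AS d
        CROSS APPLY sys.dm_db_log_info(d.database_id) AS li
        GROUP BY d.database_id
        ORDER BY vlf_count DESC;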

    I doubt it has anything to do with the user database transaction logs - and 400 VLFs on a 188GB transaction log isn't too many.  Unless my math is wrong - that sizes each VLF at around 480MB.

    If you are always behind on the secondary - that is, it never catches up - but the lag stays consistent, then it seems to me that some process caused that lag in the first place, and the system is now only able to keep up with the current amount of data being sent.

    In simple numbers, assume the daily send amount is 100GB - and the secondary is only able to apply 100GB per day.  Once you have a lag of 100GB (or more), the system will never be able to catch up until the amount of data sent from the primary is much less than the amount of data that can be applied on the secondary.

    So - either the secondary does not have the necessary resources available to process the expected amount of data, or there are other processes causing blocking at times.  You need to monitor the secondary and identify the cause.
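
    A quick way to see whether the redo threads are being blocked by read queries on the secondary is something along these lines - the 'DB STARTUP' filter is how the redo threads typically show up in sys.dm_exec_requests, so treat it as a starting point rather than gospel:

        -- Run on the secondary: blocked sessions and the redo threads themselves
        SELECT r.session_id,
               r.blocking_session_id,
               r.command,
               r.wait_type,
               r.wait_time,
               DB_NAME(r.database_id)   AS database_name
        FROM sys.dm_exec_requests AS r
        WHERE r.blocking_session_id <> 0
           OR r.command LIKE 'DB STARTUP%'
        ORDER BY r.wait_time DESC;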

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs


  • I am seeing this exact problem on a brand new, not-yet-production system.  Two identical nodes, no issues on the hardware or storage, because the problem happens regardless of which node is primary.  The storage is all NVMe and is knocking it out of the park as far as performance on the primary node, but the redo queue starts growing out of control almost immediately and the redo rate stinks. One thing I have noticed is that the redo rate starts OK - sometimes 220 MB/s initially - but it slowly death-spirals down, getting gradually slower to as little as 20 MB/s eventually, causing the t-log on my primary to eventually fill because it can't be backed up.  Interestingly enough, if I just recycle the second server, it will catch up much faster by going through recovery than if I let it sit there and process the redo queue, but then of course it's not available as a readable secondary.  We are writing hundreds of millions of rows, sometimes a billion, simultaneously from an SSIS package that has parallel steps.

    I have found a lot of hits online on what to look at: disabling parallel redo, making sure there's no blocking (there are no users, it's brand new), no contention of any type on the secondary. It's a monster system: 1.5 TB of RAM, 80 cores, CPU hovers around 10%, memory not even 40% used. TempDB is eight 200 GB files on a standalone (non-RAID) NVMe drive, very fast.  The system is under no duress but the redo just is not fast enough. If anyone has any ideas on how to tune this I'm all ears; presently this situation is not workable. Thanks
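
    For reference, the "disabling parallel redo" option mentioned above is controlled by trace flag 3459 on SQL Server 2016/2017. A rough sketch of trying it and then checking what the redo threads look like afterwards:

        -- Disable parallel redo instance-wide (trace flag 3459); -1 makes it global.
        -- Add -T3459 as a startup parameter if it needs to survive a restart.
        DBCC TRACEON (3459, -1);

        -- Run on the secondary: see which redo-related threads exist and what they wait on
        SELECT session_id, command, wait_type, wait_time
        FROM sys.dm_exec_requests
        WHERE command LIKE '%REDO%'          -- parallel redo worker tasks
           OR command LIKE 'DB STARTUP%';    -- the main (serial) redo thread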

  • Ronald,

    One thing I found out through painstaking trial and error is that I can control the flow if I turn off read access to the secondary from the primary replica. When I do that, the redo log quickly catches up. Can you try this on your end and verify whether it's the same for you?
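
    A minimal sketch of the change, assuming the AG is called [MyAG] and the secondary replica is 'SECONDARYSERVER' (both placeholders) - run it on the primary:

        -- Turn off read connections to the secondary replica
        ALTER AVAILABILITY GROUP [MyAG]
        MODIFY REPLICA ON 'SECONDARYSERVER'
        WITH (SECONDARY_ROLE (ALLOW_CONNECTIONS = NO));

        -- And to make it readable again later
        ALTER AVAILABILITY GROUP [MyAG]
        MODIFY REPLICA ON 'SECONDARYSERVER'
        WITH (SECONDARY_ROLE (ALLOW_CONNECTIONS = ALL));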

  • Hi Shawn, thanks for your tip and your time.  I took your advice, and the downward spiral in my redo rate does appear to have reversed itself.  My redo rate at the time of the change was 37,141 KB/sec; after the change it dipped under 37,000, and it has now climbed to 37,946 KB/sec. Not a huge swing, but this is the first time I have observed it increasing by any amount; it typically only goes down. In my case I'm not sure this will be enough - right now my redo queue is approaching 1 TB in size, so unless the redo rate really starts going up drastically I'm still writing data a lot faster than I can redo it on the other side.  I attached a look at my dashboard if you are interested.

    When you made this change, did you actually have users reading off the secondary, or do you believe it is purely the setting itself that causes this slowness?  This is a totally unused system I just landed and am running loaders on, so I don't in fact have anyone using the secondary replica yet, but it is a model I want to use when we go live.

  • It didn't like my BMP, trying a JPG instead.

  • We had users who were reading from the secondary. Before, we were just trying to kill the blocking sessions but that wasn't always successful in reducing the queue size. However, turning off the secondary read always produced good results.

    The challenge for me now is to write a job that continuously monitors the queue size and, if it is growing, turns off secondary reads. It will also have to detect when the queue is shrinking and turn secondary reads back on.
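
    A rough sketch of what that job step might look like - the AG name, replica name, and the 50 GB / 5 GB thresholds are placeholders, and the dynamic SQL is just to keep the conditional DDL in one batch:

        -- Run on the primary as a scheduled job step
        DECLARE @redo_queue_kb bigint;

        SELECT @redo_queue_kb = MAX(drs.redo_queue_size)
        FROM sys.dm_hadr_database_replica_states AS drs
        JOIN sys.availability_replicas AS ar
             ON ar.replica_id = drs.replica_id
        WHERE ar.replica_server_name = 'SECONDARYSERVER';

        IF @redo_queue_kb > 50 * 1024 * 1024            -- redo queue above ~50 GB: stop reads
            EXEC ('ALTER AVAILABILITY GROUP [MyAG]
                   MODIFY REPLICA ON ''SECONDARYSERVER''
                   WITH (SECONDARY_ROLE (ALLOW_CONNECTIONS = NO));');
        ELSE IF @redo_queue_kb < 5 * 1024 * 1024        -- back under ~5 GB: allow reads again
            EXEC ('ALTER AVAILABILITY GROUP [MyAG]
                   MODIFY REPLICA ON ''SECONDARYSERVER''
                   WITH (SECONDARY_ROLE (ALLOW_CONNECTIONS = ALL));');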
