Strange redo rate / redo queue issue

  • I have had this issue for quite some time, and I have not been able to identify what is causing it. We have a 20 TB database configured for an AG with 1 synchronous and 2 async copies. We have noticed a very large redo queue size with a very low redo rate. Everything I have read so far does not seem to apply to my case. Below are my observations, followed by a few questions.

    i) On the secondary, CPU usage is very low, between 20 and 30% on a 100+ core box.

    ii) I have confirmed there is no I/O issue; read/write latencies are well within 5 ms.

    iii) There is some activity on the secondary that uses this one database; I am not sure whether it is causing any issue.

    iv) No issue with the network, because I can see in the dashboard that the log hardened time is within a few seconds of the source. That tells me the secondary is receiving log changes on time; the issue is applying those log changes. I am not sure why folks keep pointing to this as a network problem. Maybe I am missing something here?

    v) We never, or only very rarely, see this issue on the async copies, and there is no activity on them.

    vi) No blocking on the secondary node, on that database or any other database.

    vii) Ran an extended events session for about 5 minutes on just the SPID (command = 'DB Startup') for that database on the secondary; SOS_SCHEDULER_YIELD shows at the top of the list.

    I strongly believe the issue is on the receiving side, i.e. the secondary node. I believe some query or process is holding up the snapshot; I just do not know how to identify it. I do not think the issue is on the primary, because my async copies are all up to date. I would like to know what others have done to identify the issue.

     

  • Did you ever find a cause for this?  We currently have a situation where a replica just won't keep up.  For some reason the redo rate is extremely low (4779 KB/s), whereas other servers (with different databases, but similar IO performance) have redo rates around 68,000 KB/s.  There's absolutely nothing that looks like it's causing this.  IO, CPU and network are all quiet, and there's no blocking - the server just doesn't seem to be trying to keep up.  And this is while the primary is relatively quiet.
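
To put those two redo rates in perspective, here is a minimal sketch that estimates how long a secondary would need to drain its redo queue, assuming no new log arrives in the meantime. It relies on redo_queue_size (KB) and redo_rate (KB/s) as reported by sys.dm_hadr_database_replica_states. The function name and the 50 GB backlog are made up for illustration; only the two rates come from the post above.

```python
def catch_up_seconds(redo_queue_kb: float, redo_rate_kb_per_s: float) -> float:
    """Estimate seconds to drain the redo queue, assuming no new log arrives."""
    if redo_rate_kb_per_s <= 0:
        raise ValueError("redo rate must be positive")
    return redo_queue_kb / redo_rate_kb_per_s


# Hypothetical backlog of 50 GB, expressed in KB (not a figure from the thread).
queue_kb = 50 * 1024 * 1024

slow = catch_up_seconds(queue_kb, 4779)    # redo rate quoted for the lagging replica
fast = catch_up_seconds(queue_kb, 68000)   # redo rate quoted for the healthy servers

print(f"at 4779 KB/s:  {slow / 3600:.1f} h")
print(f"at 68000 KB/s: {fast / 60:.1f} min")
```

Under these assumptions the same backlog takes roughly three hours at the low rate versus about 13 minutes at the high rate, which is why a slow redo rate matters even when the log is hardened on time.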

  • In my case, we typically got a redo backlog after a big ETL, especially with transactions that do not execute in batches, for example an ETL process that loads the data and rebuilds a 200 GB index in one transaction. SQL Server can only provide so much throughput for each redo thread, regardless of the size of the server.

    Have you tried flipping the replica to non-readable to see if it gets better?

  • Our replicas are non-readable.  After much head-scratching I found that trace flag 3459 was enabled, forcing redo to run in single-threaded mode.  After disabling it, the problem went away in about 15 minutes.  It had been enabled in the past when a similar issue had appeared - presumably something in our recent SP3 upgrade changed the redo operation - or it's just SQL 2016 being a pain...

  • Good to know.
