Checkpoint issues in SQL Server 2022

  • Hello All

    Recently we upgraded our production MSSQL Server from 2019 to 2022 and faced an issue with truncation of log file.

    Our environment is AAG with several replicas and we noticed that truncation_lsn randomly stucks on Secondaries and sometime on Primary. Currently we have just one workaround - it is server restart (for secondary) or failover (if happens on Primary), but we'd like to find a permanent solution.

    We noticed that truncation_lsn stucks right after below warnings

    Controller thread will stall to allow the close thread to catchup

    Close thread is falling behind: 4 checkpoints outstanding.

    We think that the issue is somehow relates to checkpoint and In-Memory, as we have XTP_Checkpoint wait if above warnings happen on Primary.

    If we run manual Checkpoint it never ends.

    I'll appreciate if someone can provide a clue or share some experience how to find a solution here.

    Thank you in advance

  • Thanks for posting your issue and hopefully someone will answer soon.

    This is an automated bump to increase visibility of your question.

  • If you are seeing XTP_Checkpoint wait when these warnings occur on the primary, it may indicate that there is a problem with the checkpoint process.

    Here are a few suggestions you could try to resolve the issue:

    1. Check the size of the transaction log file and make sure it is set to a sufficient size to handle your workload. If the log file is too small, it may be causing the checkpoint process to stall.
    2. Check the size of the In-Memory database and adjust it as needed. If the In-Memory database is too large, it may be causing the checkpoint process to stall.
    3. Check the disk space on the server where the transaction log file is located. If the disk is full or nearly full, it could be causing the checkpoint process to stall.
    4. Try running DBCC CHECKDB to see if there are any database corruption issues that may be causing the truncation issue.
  • check your vlfs as well

    DBCC LOGINFO

    I generally aim to have less than or equal to 40 or 50 vlfs, I wouldn't of my way to do anything about it unless there were several hundred or I had a window where I could easily deal with it.

    Given that it is happening after reboot, am curious if long recovery time is interfering with checkpoints.

    have you configured indirect checkpoints? If the database originally came from before whenever it was made default (2016? 2014?) it may have never been done.

  • Did you find a solution to this?

    We are seeing the same behavior on one of our databases, and I also believe that it may be related to in-memory tables.

    Did you, by any chance, move the database to another server recently?

     

  • Hi. We migrated our production databases from a SQL 2017 instance to a new SQL 2022 instance. We just started getting this exact  issue this past weekend. It is only happening with our database that does have in-memory enabled and is part of an availability group.

    We ran out of disk space for the transaction log. Performing log backups and restarting SQL service on the replicas did not resolve the issue. We had to remove the affected database from the AG and change the database recovery model from Full to Simple in order to successfully truncate our log. We didn't want to do this, but that was our only option to get things back up. The database was then changed back to Full recovery model and re-added to the AG.

    We are on SQL 2022 CU1. I'm thinking about installing CU4 soon, but none of the release notes for the CU's mention fixing this so I don't think it'll help.

  • That isn't the same issue. The logs wouldn't truncate because the memory optimized tables needed log space to commit, and with a full log file there was neither space to commit those changes or checkpoint. Expanding the log file a little bit then backing it up would have resolved the full log.

  • I should have mentioned that we have no in-memory tables in the database. So technically there's nothing in-memory that should need log space. We did have one initially but deleted it years ago.

    If you're wondering why we still have in-memory enabled for that database, it's because you can't easily delete an in-memory filegroup.

  • We're having the same issue with one of our environments. The only workaround until now is to reboot. We just installed the latest CU, but there is nothing in the documentation that may make us think that this will solve the problem.

  • There is a bug related to in-memory tables and Lock pages in memory.

    A workaround is to remove lock pages in memory for the instance (if possible) until it is fixed.

    It is unfortunately not fixed in CU5.

  • Thanks for the reply. Do you have documentation in regards to the Bug you're mentioning?

  • an update on this issue

    We set up a load testing environment and easily reproduced the issue after some hours. We halted the test and recover the database to start with lock pages in memory setting disabled. The test has been running since Saturday and we haven't had the issue until now.

  • This was removed by the editor as SPAM

  • We also have temporarily disabled "Lock pages in memory" and have not seen the issue crop up yet.

  • This was fixed with the CU7

Viewing 15 posts - 1 through 15 (of 17 total)

You must be logged in to reply to this topic. Login to reply