Transaction log full error 9002 on Always On availability groups

  • Hi All,

    I am looking for inputs on how to avoid or prevent a scenario that happened yesterday and caused around 2 hours of downtime.

    Below is the scenario

    We have a multi-subnet Always On AG with 10 databases and 3 nodes, all Azure VMs.

    Two nodes are in one region and one node is in a different region. We are using SQL Server 2017 Enterprise Edition, CU23.

    Issue description:

    On the primary replica, the transaction log file of one user database that is part of the AG filled up (9002 log full error).

    The mdf is 20GB, but the log file had grown to roughly 400GB. The data and log files are on separate drives, and we take log backups every hour.

    The capacity of the log drive is 500GB.

    So, to minimize downtime, we tried adding space on the synchronous secondary replica so we could fail over.

    After adding the space we rebooted the server, but it didn't come back online for about 20 minutes.

    Worse, the cluster service would not start, so we could not fail over to the secondary node.

    After the systems escalation team was engaged, it took another 30-45 minutes of analyzing logs before they finally brought the cluster service online. We eventually failed over to the secondary and made it the primary, then followed the same steps to add space on the remaining nodes.

    We suspect all of this was caused by a long-running open transaction: log_reuse_wait_desc for the database was showing AVAILABILITY_REPLICA. We asked the application team whether they had left a transaction open; they said they hadn't.
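    For reference, the truncation blocker and the oldest open transaction can be confirmed with a quick check (the database name below is a placeholder):

    ```sql
    -- Why can't the log be truncated? Typical values: LOG_BACKUP,
    -- ACTIVE_TRANSACTION, AVAILABILITY_REPLICA, etc.
    SELECT name, log_reuse_wait_desc
    FROM sys.databases
    WHERE name = N'YourDatabase';   -- hypothetical database name

    -- Oldest active transaction in the current database
    DBCC OPENTRAN;
    ```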

    Observations:

    1. We have disk space monitoring in place, but the incident went to the systems team and they did not notify us.

    2. We received a SQL Server alert only for the 9002 log full error.

    I am looking for inputs on how we can avoid disk space issues like this in the future.

    My thoughts:

    - Configure another SQL alert to send an email when the log file reaches 50% used.

    - Restrict auto-growth of the log file.
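    The 50% alert could be implemented as a SQL Agent performance-condition alert, something along these lines (alert, database, and operator names are placeholders; assumes Database Mail and an operator already exist):

    ```sql
    -- Fire when Percent Log Used for the database exceeds 50
    EXEC msdb.dbo.sp_add_alert
        @name = N'Log 50 percent full - YourDatabase',       -- hypothetical name
        @performance_condition = N'Databases|Percent Log Used|YourDatabase|>|50';

    -- Notify the DBA operator by email (method 1 = email)
    EXEC msdb.dbo.sp_add_notification
        @alert_name = N'Log 50 percent full - YourDatabase',
        @operator_name = N'DBA Team',                        -- hypothetical operator
        @notification_method = 1;
    ```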

    Please share more thoughts on how to minimize or handle such situations more efficiently.

    Also, is there a way to identify the spids/transactions that caused the log file to fill up?
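    While the problem is happening, something like the following sketch (database name is a placeholder) lists sessions with open transactions and how much log each has generated:

    ```sql
    -- Open transactions in the database, largest log consumers first
    SELECT s.session_id,
           s.login_name,
           s.host_name,
           t.name AS tran_name,
           t.transaction_begin_time,
           dt.database_transaction_log_bytes_used
    FROM sys.dm_tran_active_transactions AS t
    JOIN sys.dm_tran_session_transactions AS st
           ON st.transaction_id = t.transaction_id
    JOIN sys.dm_exec_sessions AS s
           ON s.session_id = st.session_id
    JOIN sys.dm_tran_database_transactions AS dt
           ON dt.transaction_id = t.transaction_id
    WHERE dt.database_id = DB_ID(N'YourDatabase')   -- hypothetical name
    ORDER BY dt.database_transaction_log_bytes_used DESC;
    ```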

    Thank you.

    Sam


  • What type of log backups are you taking? Normal log backups, or do you specify COPY_ONLY on your log backups?

    Alerting sooner is always a great idea.

    Restricting auto-growth, probably not so much.

    Ideally you need to find the cause of the large log and resolve that.

    I have seen people take copy-only log backups on their AGs; a copy-only backup doesn't mark the log as reusable, and they get into exactly the issue you have described.

    Verify the backups are actually still running and are not failing.
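    One way to verify both points at once is to look at the backup history in msdb (database name is a placeholder); is_copy_only = 1 on a log backup means that backup did not make the log reusable:

    ```sql
    -- Most recent log backups for the database, newest first
    SELECT TOP (20)
           bs.backup_finish_date,
           bs.is_copy_only,
           bs.backup_size / 1048576.0 AS backup_size_mb
    FROM msdb.dbo.backupset AS bs
    WHERE bs.database_name = N'YourDatabase'   -- hypothetical name
      AND bs.type = 'L'                        -- 'L' = transaction log backup
    ORDER BY bs.backup_finish_date DESC;
    ```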

    You won't be able to see what caused the log to bloat any more, but if you do have your log backups you could use something like fn_dump_dblog to see what is in the log and work out who was doing what and when they were doing it.
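    A sketch of the fn_dump_dblog approach is below. It is an undocumented, unsupported function, so run it on a test server; the backup path is a placeholder, and after the first five arguments the remaining 63 file-name parameters must all be DEFAULT:

    ```sql
    -- Who began transactions, according to a log backup file
    SELECT [Current LSN], [Transaction ID], [Transaction Name],
           [Begin Time], SUSER_SNAME([Transaction SID]) AS login_name
    FROM fn_dump_dblog(
        NULL, NULL, N'DISK', 1,
        N'X:\Backups\YourDatabase_log.trn',   -- hypothetical backup path
        DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT,
        DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT,
        DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT,
        DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT,
        DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT,
        DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT,
        DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT,
        DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT,
        DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT, DEFAULT)
    WHERE Operation = N'LOP_BEGIN_XACT';
    ```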

  • A couple more points:

    1. SQL Server cannot truncate the log on the primary until the transaction has been hardened on the secondary. If either the send queue or the redo queue is backed up, the transaction log on the primary will continue growing until it fills the drive.

      1. I have seen this situation occur when someone started a query that generated billions of rows in tempdb - filling tempdb.  The session was open in SSMS for 5 days before the log filled the drive.

    2. If you are using Ola's utility to backup the log - and have that configured to run on the secondary - then you also need to make sure the copy only flag is not set for your log backups.  For your database backups - the copy only flag must be set to allow backups to be run on the secondary.
    3. The drives on all secondary nodes must be at least as large as the drive on the primary. For example, in your scenario the drive on the primary was 500GB, and it seems the drive on the secondary may have been smaller. In that case SQL Server continues growing the log file on the primary until it fills the drive, but the REDO thread on the secondary stalls as soon as the smaller drive fills up.
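    The send and redo queues mentioned in point 1 can be monitored per database with a query like this (a sketch):

    ```sql
    -- Send/redo queue sizes (KB) per database replica; large, growing
    -- values mean the primary's log cannot be truncated.
    SELECT ar.replica_server_name,
           DB_NAME(drs.database_id) AS database_name,
           drs.log_send_queue_size,
           drs.redo_queue_size,
           drs.synchronization_state_desc
    FROM sys.dm_hadr_database_replica_states AS drs
    JOIN sys.availability_replicas AS ar
           ON ar.replica_id = drs.replica_id;
    ```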

    The choice to try to solve this problem by failing over only increased the issues. In this type of situation, the better option would have been to remove that database from the AG, clean up the transaction log and resize it back to normal, and then add the database back to the AG (using automatic seeding, a 20GB database will take just a couple of minutes).

    The recovery time for the system was extended because of the large transaction log.
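    The remove/fix/re-add sequence might look like this sketch (AG and database names are placeholders; run on the primary):

    ```sql
    -- Take the database out of the AG (copies on secondaries go to RESTORING)
    ALTER AVAILABILITY GROUP [YourAG] REMOVE DATABASE [YourDatabase];

    -- ...fix and resize the transaction log while the database is out of the AG...

    -- Add it back; with SEEDING_MODE = AUTOMATIC configured on the replicas,
    -- no manual backup/restore is needed.
    ALTER AVAILABILITY GROUP [YourAG] ADD DATABASE [YourDatabase];
    ```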

     

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • Jeffrey Williams wrote:

    The choice to try and solve this problem by failing over just increased the issues.  In this type of situation - the better option would have been to remove that database from the AG.  Cleanup the transaction log and resize back to normal and then add the database back to the AG (using automatic seeding - a 20GB database will take just a couple of minutes).

    The recovery time for the system was extended because of the large transaction log.

    What is meant by "clean up the transaction log" after removing the db from the AG?

  • vsamantha35 wrote:

    Jeffrey Williams wrote:

    The choice to try and solve this problem by failing over just increased the issues.  In this type of situation - the better option would have been to remove that database from the AG.  Cleanup the transaction log and resize back to normal and then add the database back to the AG (using automatic seeding - a 20GB database will take just a couple of minutes).

    The recovery time for the system was extended because of the large transaction log.

    What is meant by Cleanup the transaction log after removing the db from AG?

    This means removing the database from the AG and shrinking the log by whatever means necessary: backup and shrink, set to simple recovery and shrink, etc. Then set it back to full recovery, take another full and log backup, and add it back to the AG.
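    The backup-and-shrink loop is just these two statements repeated until the log reaches its target size (database name, backup path, and logical log file name are placeholders):

    ```sql
    -- Back up the log so the inactive portion can be released...
    BACKUP LOG [YourDatabase]
        TO DISK = N'X:\Backups\YourDatabase_log.trn';   -- hypothetical path

    -- ...then shrink the log file toward its target size (here 8192 MB)
    USE [YourDatabase];
    DBCC SHRINKFILE (N'YourDatabase_log', 8192);        -- hypothetical logical name
    ```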

     

     

    But you need to find the source of the log growth otherwise it will happen again and again.

  • Got it. Thanks everyone for all your inputs.

  • Ant-Green wrote:

    What is meant by Cleanup the transaction log after removing the db from AG?

    This means removing the database from the AG, and shrinking the log by whatever means necessary, backup and shrink, set simple and shrink etc.  Then set back to full so another full and log backup and add it back to the AG.

    Correct - but... There is no reason to change the recovery model to shrink a log file: shrink the log, back up the log, shrink again, and repeat until it is down to the size needed for normal operations.

    With automatic seeding enabled, there is no need to back up the database and log. Automatic seeding performs the backup/restore for you automagically.

    Even if the database were a multi-terabyte database, removing it, fixing the transaction log, and adding it back would still have been the better option. Using automatic seeding and enabling the trace flag that compresses the data stream, the process could have been completed in even less time (depending on network connectivity, of course).
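    Enabling automatic seeding might look like this sketch (AG and replica names are placeholders; I am assuming the compression trace flag referred to is 9567, which compresses the seeding data stream):

    ```sql
    -- On the primary: switch the replica to automatic seeding
    ALTER AVAILABILITY GROUP [YourAG]
        MODIFY REPLICA ON N'SecondaryNode'          -- hypothetical replica name
        WITH (SEEDING_MODE = AUTOMATIC);

    -- On each secondary: allow the AG to create the seeded database
    ALTER AVAILABILITY GROUP [YourAG] GRANT CREATE ANY DATABASE;

    -- Optionally, on the primary: compress the seeding stream
    -- (assumed to be the trace flag mentioned above)
    DBCC TRACEON (9567, -1);
    ```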

     

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs
