I am looking for input on how to avoid or prevent a scenario that happened yesterday and caused around 2 hours of downtime.
Here is the scenario:
We have a multi-subnet Always On AG with 10 databases and 3 nodes, all Azure VMs: 2 nodes in one region and 1 node in a different region. We are running SQL Server 2017 Enterprise Edition CU23.
On the primary replica, the transaction log of one of the user databases in the AG filled up (error 9002, log full).
The mdf is 20 GB, but the log file had grown to ~400 GB. The data and log files are on separate drives; the log drive is 500 GB. We take log backups every hour.
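As background for the discussion below, log size and usage can also be checked from inside SQL Server rather than only via OS-level disk monitoring. This is a sketch, run in the context of the affected database:

```sql
-- Current log size and how much of it is actually in use
-- (sys.dm_db_log_space_usage is per-database, SQL Server 2012+)
SELECT  total_log_size_in_bytes / 1048576.0   AS log_size_mb,
        used_log_space_in_bytes / 1048576.0   AS used_log_mb,
        used_log_space_in_percent
FROM    sys.dm_db_log_space_usage;
```

A high used_log_space_in_percent with a large log file is the early warning sign that truncation is being held up.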
So, to minimize downtime, we tried adding disk space on the synchronous secondary replica so we could fail over to it. After adding the space we rebooted the server, but it did not come back online right away; it took about 20 minutes to come up.
On top of that, the cluster service would not come online, so we could not fail over to the secondary node. After the systems escalation team got involved, they spent another 30-45 minutes analyzing the logs and eventually brought the cluster service online. We then failed over to the secondary and made it the primary, and followed the same steps to add space on the remaining nodes.
We suspect all of this happened because of a long-running open transaction: log_reuse_wait_desc for the database showed AVAILABILITY_REPLICA. We asked the application team if they had left a transaction open, and they said they hadn't.
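To confirm what is blocking log truncation, and (when the wait is AVAILABILITY_REPLICA) which replica is behind, a check along these lines could help. YourDatabase is a placeholder for the affected database name:

```sql
-- What is preventing log truncation?
SELECT  name, log_reuse_wait_desc
FROM    sys.databases
WHERE   name = N'YourDatabase';   -- placeholder

-- If AVAILABILITY_REPLICA: how far behind is each replica?
SELECT  ar.replica_server_name,
        drs.log_send_queue_size,   -- KB of log not yet sent to the replica
        drs.redo_queue_size        -- KB of log not yet redone on the replica
FROM    sys.dm_hadr_database_replica_states AS drs
JOIN    sys.availability_replicas AS ar
        ON ar.replica_id = drs.replica_id
WHERE   drs.database_id = DB_ID(N'YourDatabase');
```

A large log_send_queue_size or redo_queue_size on one replica would point at that replica (rather than an open transaction) as the reason the primary could not truncate its log.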
1. We have disk space monitoring in place, but the incident went to the systems team and they did not notify us.
2. The only SQL Server alert we received was for the log full error (9002).
I am looking for input on how we can avoid disk space issues like this in the future. So far I am considering:
- > Configuring another SQL alert to send an email when the log file reaches 50% usage.
- > Restricting autogrowth of the log file.
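For the 50% idea, a SQL Agent alert on the Percent Log Used performance counter is one way to do it. This is a sketch with placeholder names (YourDatabase, and an operator 'DBA Team' that is assumed to already exist in msdb); on a named instance the counter object is MSSQL$InstanceName:Databases rather than SQLServer:Databases:

```sql
-- Hypothetical Agent alert: fires when YourDatabase's log is more than 50% used
EXEC msdb.dbo.sp_add_alert
     @name                  = N'Log 50 percent full - YourDatabase',
     @performance_condition = N'SQLServer:Databases|Percent Log Used|YourDatabase|>|50';

-- Email the (assumed pre-existing) operator when the alert fires
EXEC msdb.dbo.sp_add_notification
     @alert_name          = N'Log 50 percent full - YourDatabase',
     @operator_name       = N'DBA Team',
     @notification_method = 1;   -- 1 = email
```

Note that 50% of a 400 GB log is still a lot of runway; alerting on percent used of the drive, or on log_reuse_wait_desc changing to something other than NOTHING/LOG_BACKUP, may catch the problem earlier than a percent-of-log-file threshold.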
Please share more thoughts on how to minimize or handle such situations more efficiently.
Also, is there a way to identify and report the list of SPIDs/transactions that caused the log file to fill up?
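For that last question, a starting point is the transaction DMVs, which show how much log each open transaction has generated. A sketch (YourDatabase is a placeholder):

```sql
-- Oldest active transaction in the current database
DBCC OPENTRAN;

-- Sessions with open transactions, ordered by log bytes generated
SELECT  s.session_id,
        s.login_name,
        s.host_name,
        dt.database_transaction_begin_time,
        dt.database_transaction_log_bytes_used
FROM    sys.dm_tran_database_transactions AS dt
JOIN    sys.dm_tran_session_transactions  AS st
        ON st.transaction_id = dt.transaction_id
JOIN    sys.dm_exec_sessions              AS s
        ON s.session_id = st.session_id
WHERE   dt.database_id = DB_ID(N'YourDatabase')
ORDER BY dt.database_transaction_log_bytes_used DESC;
```

These DMVs only show transactions that are still open, so to capture the culprit after the fact you would need to log this output periodically (e.g. from an Agent job) or alert on it when the log crosses a threshold.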