It’s an error that’s near-certain to ruin any DBA’s day:
Error: 9002, Severity: 17, State: 2.
The transaction log for database 'VeryImportant' is full. To find out why space in the log cannot be reused, see the log_reuse_wait_desc column in sys.databases
“But, but, but...” mutters the DBA, “The log backups are running, so why is the log full?”
Well, failed or missing log backups are just one of the reasons for a transaction log to be full. There are several other possible causes.
Where to start?
Start by querying sys.databases and to see what the value of the column log_reuse_wait_desc is for the database mentioned in the error message.
DECLARE @DatabaseName VARCHAR(50); SET @DatabaseName = 'VeryImportant' SELECT name, recovery_model_desc, log_reuse_wait_desc FROM sys.databases WHERE name = @DatabaseName
The value of the log_reuse_wait_desc column will show the current reason why log space cannot be reused. It is possible more than one thing is preventing log reuse. Sys.databases will only show one. Hence it is possible to resolve one problem, query sys.databases again and see a different log reuse wait reason.
The possible values for log_reuse_wait_desc are listed in Books Online. But first, before we get to an explanation of the various reasons, a short piece of theory.
Transaction log architecture
The transaction log is a circular file that's divided internally into a number of virtual log files (VLFs). SQL writes log entries sequentially into the VLFs. When one VLF fills up, SQL moves to the next one. When it reaches the end of the file, it will (ideally) go back to the beginning and start using the first VLF again (hence circular).
A VLF can be in one of two states:
An active VLF is one that contains one or more log records that are needed by the database engine. The log records may be part of active transactions, they may be log records that are needed for replication or mirroring, they may be needed for a backup, they may be associated with changes that haven’t yet been written to the data file, etc.
An inactive VLF is one that does not contain any log records needed by the database engine for any reason. There are no active transactions, the log records are not needed for replication or mirroring and all the changes associated with these transaction log records have been written to the data file. If the database is in full or bulk-logged recovery the log records in the VLF has been backed up.
For a VLF to be reused by SQL, it needs to be inactive. If all the VLFs in the log are active and SQL has filled the one it is writing to, the transaction log will grow or, if it cannot grow, data modifications will fail with error 9002. (The transaction log for database '<Database name>' is full. To find out why space in the log cannot be reused, see the log_reuse_wait_desc column in sys.databases)
The act of marking one or more VLFs as inactive is generally called log truncation. While this is a commonly used term, it’s a not a very accurate term. Truncate, according to the Oxford English dictionary means “shorten (something) by cutting off the top or the end”, however log truncation doesn’t shorten anything; it just makes the space available for reuse.
In addition, people often seem to think that truncate means to discard log records and break the log backup chain (probably from BACKUP LOG ... WITH TRUNCATE ONLY). It does not mean that.
In the above diagram, the transaction log has 4 VLFs (labelled A, B, C and D). This is for simplicity; most log files will have more. The first (A) and second (B) VLFs are inactive, and the third (C) and fourth (D) are active with SQL currently writing log entries into the "D" VLF. In that diagram, the (1) marks the log record of the last log backup, (2) marks the log record associated with the last checkpoint and (3) is the last log record written into the database.
After a number of data modifications, the log has wrapped around and SQL has now reused the first VLF and is about to start reusing the second. The third and fourth VLFs are still active. In this situation, unless something happens to mark the third and fourth VLFs as inactive the log will grow as soon as SQL has mostly used the second VLF. In this case, what is preventing the reuse is that the log backup has not run again.
If a log backup now runs, the third and fourth VLFs (C and D) would be marked inactive and would be available for SQL to reuse them.
Causes of delayed log truncation
These are the log reuse wait reasons from sys.databases
The active transaction log reuse wait reason means that there is an open transaction that is keeping the VLFs active.
When encountering this log reuse wait, two commands will help in identifying the source of the open transaction, DBCC OPENTRAN and sys.dm_exec_sessions
DBCC OPENTRAN lists information about the oldest open transaction in the database. The main piece of information needed from this, when investigating log growth caused by active transactions, is the SPID (Server Process ID) that is running the transaction.
Transaction information for database 'WebForums'.
Oldest active transaction:
SPID (server process ID): 51
UID (user ID) : -1
Name : user_transaction
LSN : (10861:3200:1)
Start time : Jan 14 2011 1:04:26:017AM
SID : 0x0105000000000005150000002e86f8cbc457a001b905c7e95e040000
DBCC execution completed. If DBCC printed error messages, contact your system administrator.
In this example, the connection with SPID 51 has had a transaction open since 1AM on the 14th Jan (a late worker, clearly). While there’s nothing in the output of DBCC OPENTRAN that can identify who is running that transaction, the SPID can be used along with some of the DMVs to get that information.
SELECT host_name,program_name, original_login_name, st.text FROM sys.dm_exec_sessions es INNERJOIN sys.dm_exec_connections ec ON es.session_id = ec.session_id CROSSAPPLY sys.dm_exec_sql_text(ec.most_recent_sql_handle) st WHERE ec.session_id = 51
I’m using original login name, rather than login name in case there’s any impersonation occurring.
That should be enough information to make an educated decision whether to kill the connection, or to speak to the person running the query and ask them to stop, or to wait for it to finish if it’s too important to be rolled back.
The checkpoint log reuse wait should be a transient condition in most circumstances. It will be most commonly seen in simple recovery model.
SQL Server automatically runs checkpoints on a regular basis to keep the number of log records that it would need to process during a database recovery. It runs checkpoints for a number of other reasons as well. For a full list, see http://msdn.microsoft.com/en-us/library/ms188748.aspx
If the checkpoint log reuse wait appears frequently or for prolonged periods occur it may indicate that the IO subsystem performance is inadequate for the volume of data that needs to be written during the checkpoint or that there are very large numbers of changed pages that the checkpoint needs to process.
The active portions of the log are necessary for database backups, as the backup has to include at least that much of the log to ensure a consistent restore. As such, long-running database backups may result in log growth as the portions of the log necessary for the backup must be retained for the duration of the backup.
There is little that can be done if this wait type is regularly encountered and is causing problems, other than optimising the backups themselves (optimising the IO subsystem or striping the backups over multiple devices), converting to file/filegroup backups (with appropriate consideration of backup/restore requirements) or using compressed backups either natively (SQL 2008 Enterprise, SQL 2008 R2 Enterprise and Standard editions) or via a 3rd party tool, or other methods that reduce the time taken to back the database up.
Transactional replication uses the transaction log to identify changes that need to be replicated to subscribers.
Whenever a change is made to a replicated table, the log records associated with that change are marked as ‘pending replication’. The Log Reader Agent job, which is one of the components of transactional replication, reads along the log looking for such transactions and, when it finds one, it places the details of the change into the distribution database and marks the log entry as ‘replicated’.
If transactions are being marked ‘pending replication’ and the Log Reader is not running, then those log records will never be marked as ‘replicated’, and the VLFs that contain those log records will never become inactive, and the log will grow.
In addition to this, on SQL 2008 Enterprise edition, the Change Data Capture feature (CDC) uses the transaction log and the Log Reader Agent job in much the same way as transactional replication. Hence, if the CDC jobs are not running, the log space cannot be reused, much as if there was transactional replication with the Log Reader Agent job not running. NB, the log reuse wait description still shows Replication, not CDC.
The only kind of replication that uses the transaction log is transactional replication, and thus it is the only type of replication that can prevent log reuse. However, there was a bug in some versions of SQL 2005 that could result in a log reuse wait of Replication when the only replication present was snapshot replication. See http://blogs.msdn.com/b/sqlserverfaq/archive/2009/06/01/size-of-the-transaction-log-increasing-and-cannot-be-truncated-or-shrinked-due-to-snapshot-replication.aspx.
In database mirroring, the principal database sends log records to the mirror database. In synchronous mirroring, this occurs before the transaction is committed on the principal. In asynchronous mirroring, the log records are placed into a log send queue when the transaction is committed on the principal. The log records will then be sent to the mirror at some point afterwards.
If the mirroring is running asynchronous and the rate of transactions exceeds that which the link between the principal and mirror can handle, then the log on the principal will grow because the log records cannot be marked inactive until they have been sent to the mirror.
The same thing will happen in synchronous mirroring if the link between the principal and mirror drops and the mirroring becomes disconnected or if the mirroring session is suspended. The log records on the principal remain active until the mirror reconnects or the session is resumed and the log records are copied over to it, or until the mirroring is dropped.
This should also be a short-lived transient type. During the creation of a database snapshot, SQL has to read the source database’s log so that it can run crash-recovery on the snapshot in order to ensure that it is transactionally consistent. While it is doing this, the portions of the log necessary cannot be marked as inactive.
DBCC CHECKDB uses hidden database snapshots for its checks and so this wait may appear when running CheckDB even if there are no user-created snapshots.
Prolonged snapshot creation waits can be a result of very long-running transactions that need to all be rolled back for the snapshot creation, or an inefficient reading of the log due to a very large number of VLFs.
Log backups are the one that most people know can prevent log reuse. In full and bulk-logged recovery models, VLFs cannot be marked as inactive until all the log records in that VLF have been backed up. The log backup wait reason will never appear in Simple Recovery.
If this log reuse wait reason appears, check that the log backups are correctly scheduled and are succeeding. Also check that the NO_TRUNCATE option has not been specified, as that option is only for backing up the log of a damaged database and will not ‘truncate’ any VLFs (as the option name implies).
This should cover the main reasons why a database’s transaction log may grow. Fixes for more complex problems are out of scope of this article, but this should at least help in knowing where to start looking.
Firstly a massive thank you to Paul Randal (blog|twitter) who kindly agreed to tech-edit this article and corrected a variety of errors small and large. Without his assistance this article would not have been worth reading.
Finally thanks to Jason Brimhall (CirquedeSQLeil) (blog | twitter), Gus Gwynne (GSquared), Brandie Tarvin (blog|twitter), Wayne Sheffield and Stefan Krzywicki for correcting grammar errors, illogical sentences and an appalling dearth of punctuation.
As they say in published books, any errors that remain are solely mine.