RE: Production issue related to a rollback

SSCommitted

Points: 1829

April 21, 2014 at 4:54 am

Thank God it worked! I had serious doubts about the procedure but Microsoft reviewed it and insisted it was the only solution. Down time was just 20 minutes. Developers were insisting that it would be seconds. After all, they don't need DBAs except to relieve them from tedious work and get blamed. 🙂

This is Microsoft's documentation edited to conceal system information:

Issue Definition: You killed two sessions (90,97) in the SQL Server ProdSQL through the KILL command; The rollback operation has been going on since two days and still completed 7% on one sessions and 18% on the other.

Analysis:

>> First we tried to see what these sessions are actually doing. From sp_who2 output we see that the CPU and I/O counts are increasing

SPID Status Login HostName BlkBy DBName Command CPUTime DiskIO LastBatch ProgramName SPID REQUESTID

90 ROLLBACK AppLogin ClientSrvr . ProcDB AWAITING COMMAND 89886746 87201 04/16 18:15:32 .Net SqlClient Data Provider 90 0

97 ROLLBACK AppLogin ClientSrvr . ProcDB AWAITING COMMAND 63360540 5639 04/16 18:15:33 .Net SqlClient Data Provider 97 0

And also you said that the percentage of work done is getting increased when we see it through the “kill with statusonly” command but it was slow.

>> When we see the dm_exe_requests, we don’t see these sessions and corresponding threads

>> When we see the dm_exec_connections output, we see that the client side TCP ports used are,

For 90 - xxx.xxx.x.xxx 56972

For 97 - xxx.xxx.x.xxx 56978

>> From dm_exec_sessions, we figured out the PIDs on the client side which established these session on the database

For 90 - 18332

For 97 - 18332

>> We logged into the middleware server but don’t find any sessions with the corresponding PIDs and the netstat output didn’t show us these ports in use.

>> That makes me think if these sessions 90 and 97 has become orphaned; but still not sure how the CPU and I/O counts are increasing without corresponding thread entries in dm_exec_requests.

Action Plan:

>> As Rollback can't be interrupted or bypassed as it's important to comply to the ACID properties, we have to looks for any alternatives/workarounds to sweep these sessions out of the SQL server

>> So we would recommend you to restart the SQL service to clean these sessions.

I know there few questions around whether SQL server stuck in the recovery phase forever making these databases inaccessible.

>> When SQL server starts, it performs the recovery in two phases

· REDO

· UNDO

>> Redo is nothing but rolling forward the changes that are not yet applied while undo is rolling back the uncommitted changes

>> In our case, the kill operation is doing something which is almost equivalent to UNDO

>> In SQL, we introduced a concept of background UNDO which means that, when you start the instance, databases will be accessible immediately after the redo and there’s a background thread which do the UNDO in the back ground.

>> This makes the databases available quickly instead of sticking on the recovery forever.

>> One thing we can’t avoid even after restart is, the locks: Even the background UNDO thread need to hold locks on the pages/rows/keys until it finishes UNDO operation. This will make the other queries wait till the undo thread releases the locks.

>> Also, to speed up REDO operation, I would recommend you to issue CHECKPOINT on all the databases before going for a restart. This helps SQL Server in avoiding the work it has to do to roll forward the changes.

>> So the action item would be

Ø Issue CHECKPOINT on all the databases (one-by-one)

Ø Restart the SQL Service

>> You can motor the locks through sp_lock and see how they’re moving. But as I said earlier, they’ll definitely block any other queries issues against some of the objects in the databases IDs 18 and 19