• Thank God it worked! I had serious doubts about the procedure but Microsoft reviewed it and insisted it was the only solution. Down time was just 20 minutes. Developers were insisting that it would be seconds. After all, they don't need DBAs except to relieve them from tedious work and get blamed. 🙂

    This is Microsoft's documentation edited to conceal system information:

    Issue Definition: You killed two sessions (90,97) in the SQL Server ProdSQL through the KILL command; The rollback operation has been going on since two days and still completed 7% on one sessions and 18% on the other.

    Analysis:

    >> First we tried to see what these sessions are actually doing. From sp_who2 output we see that the CPU and I/O counts are increasing

    SPID       Status                                   Login                     HostName          BlkBy     DBName              Command                           CPUTime             DiskIO   LastBatch             ProgramName                                                  SPID       REQUESTID

    90           ROLLBACK                           AppLogin ClientSrvr       .             ProcDB    AWAITING COMMAND                89886746             87201    04/16 18:15:32   .Net SqlClient Data Provider                        90           0   

    97           ROLLBACK                           AppLogin ClientSrvr       .             ProcDB    AWAITING COMMAND                63360540             5639       04/16 18:15:33   .Net SqlClient Data Provider                        97           0   

    And also you said that the percentage of work done is getting increased when we see it through the “kill with statusonly” command but it was slow.

    >> When we see the dm_exe_requests, we don’t see these sessions and corresponding threads

    >> When we see the dm_exec_connections output, we see that the client side TCP ports used are,

    For 90 - xxx.xxx.x.xxx     56972

    For 97 - xxx.xxx.x.xxx     56978

    >> From dm_exec_sessions, we figured out the PIDs on the client side which established these session on the database

    For 90 - 18332

    For 97 - 18332

    >> We logged into the middleware server but don’t find any sessions with the corresponding PIDs and the netstat output didn’t show us these ports in use.

    >> That makes me think if these sessions 90 and 97 has become orphaned; but still not sure how the CPU and I/O counts are increasing without corresponding thread entries in dm_exec_requests.

    Action Plan:

    >> As Rollback can't be interrupted or bypassed as it's important to comply to the ACID properties, we have to looks for any alternatives/workarounds to sweep these sessions out of the SQL server

    >> So we would recommend you to restart the SQL service to clean these sessions.

    I know there few questions around whether SQL server stuck in the recovery phase forever making these databases inaccessible.

    >> When SQL server starts, it performs the recovery in two phases

    ·         REDO

    ·         UNDO

    >> Redo is nothing but rolling forward the changes that are not yet applied while undo is rolling back the uncommitted changes

    >> In our case, the kill operation is doing something which is almost equivalent to UNDO

    >> In SQL, we introduced a concept of background UNDO which means that, when you start the instance, databases will be accessible immediately after the redo and there’s a background thread which do the UNDO in the back ground.

    >> This makes the databases available quickly instead of sticking on the recovery forever.

    >> One thing we can’t avoid even after restart is, the locks: Even the background UNDO thread need to hold locks on the pages/rows/keys until it finishes UNDO operation. This will make the other queries wait till the undo thread releases the locks.

    >> Also, to speed up REDO operation, I would recommend you to issue  CHECKPOINT on all the databases before going for a restart. This helps SQL Server in avoiding the work it has to do to roll forward the changes.

    >> So the action item would be

    Ø  Issue CHECKPOINT on all the databases (one-by-one)

    Ø  Restart the SQL Service

    >> You can motor the locks through sp_lock and see how they’re moving. But as I said earlier, they’ll definitely block any other queries issues against some of the objects in the databases IDs 18 and 19