Production issue related to a rollback

Question

Post reply

Production issue related to a rollback

Eduardo Olivera

SSCommitted

Points: 1829
More actions
April 17, 2014 at 4:20 am

#290403

Hi,
I have a problem with a production server that is causing considerable locking and performance degradation. It all started the night of Tuesday into Wednesday with a procedure that inserts data into a table for mailings that are later transferred to Jet Vision. The table has 652 MB of data, most of it useless because developers never delete anything. Now they want to do so but the table is locked by their process. This process caused problems blocking a couple of nightly jobs, including index rebuilds (we use Olla Hallengren’s) and another job which was not affected this night. I stopped those first yesterday. But the session would not let go even after the stopped their job queue located in one of the middle tier servers. Rebooting that server did not help either. The two sessions connected to that process are still present. Then I was requested to kill those sessions and did it, probably my worst mistake. That was about 11:30 yesterday. The sessions are still in a rollback state although one thing I have noticed is that the running time, which should be over 24 hours by now, is posted on sp_WhoIsActive as 11 hours as if it had restarted or started a new process, the rollback maybe.
We use SQL diagnostic manager from Idera on another server to monitor this production server. Yesterday it was working but today it is not able to connect to the production server and there are a large number of sessions from the server that hosts the monitor software with a wait of MSQL_XACT_MGR_MUTEX which I understand it is issued when a task is waiting to obtain ownership of the session transaction manager to perform a session level transaction operation. So it is having problems synchronizing its threads. It may or may not be related, my gut feeling tells me it is, but it is part of the overall sluggishness of the server.
Regularly scheduled tasks have been running and except for index rebuilds on the table locked all have had no problems including backups. Most transactions I saw yesterday where running with some deadlocks on procedures that are not alien to the concept and have been reported in the past. Most of the times are separate instances of the same stored procedures deadlocking each other. I don’t doubt there is a relationship between sluggishness and the increase in deadlocks (no more than 4 an hour) but I have only noticed one process that is directly affected by this locking and it has been reported to developers.
Yesterday the KILL # WITH statusonly command did not report any percentage completed or estimated time remaining. Today it is but it is very low percentage (4% and 11%) and a long remaining time (818845 and 344520 seconds) but I can say that since I logged in first the percentage has gone up 1% and 3% respectively. So it is moving just painfully slow. The procedure that is at the heart of this problem is only doing an insert or an update but the source is either the same table or the procedure’s parameters. The table has no triggers or foreign keys. It appears to be called in a loop.
I would appreciate any recommendation on how to bring this rollback to a prompt completion.
Best regards,
Eduardo

Viewing 15 posts - 1 through 14 (of 14 total)

You must be logged in to reply to this topic. Login to reply

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 1

Unfortunately the only thing you can do with a rollback is wait. It has to finish.

Don't restart SQL or the rollback will just start over from the beginning with the DB unavailable.

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Eduardo Olivera SSCommitted Points: 1829 More actions · Answer 2

That's what I was afraid. Developers and their managers will panic but... it is what it is and it is what I expected but, is there a way to speed up this rollback?

Thanks Gail

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 3

Kick all other users off the server will help. :hehe: 🙂

Check what the rollback is waiting for (wait type in sys.dm_exec_requests) first.

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Eduardo Olivera SSCommitted Points: 1829 More actions · Answer 4

I'll try the second one first. The first one would get me more grief. 🙂

Thanks

Eduardo Olivera SSCommitted Points: 1829 More actions · Answer 5

Running sp_WhoIsActive does not show waits for either of the rollback sessions. Is there anything else I can check?

Thanks

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 6

Then it's probably just chugging along. Rollbacks usually take longer than the initial changes did.

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Eduardo Olivera SSCommitted Points: 1829 More actions · Answer 7

Yes but this one has been running since yesterday. I just thought of something that I don't know if it would work or not but I am sure you would know. I am restoring all of last night's backups to another server. We believe there is only one database involved in this double rollback but I am restoring all just in case. According to KILL 90 WITH statusonly it has 831342 and 329608 seconds respectively to completion. My thought was that resources might be the reason for the delay and that a restore, since the rollback would part of the restore, would give us an idea of how long would it take to complete if we were to reboot the server. We've been experiencing lots of paging. There is a lot of pressure from management to get this server in "top shape". Would that work? Would it really give me a good measure, or even an approximation of how long it would take to recover from a reboot? Would the database not be able to complete recovery until the rollbacks are complete?

Just some thoughs. I appreciate your input.

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 8

You can try that. You may have the entire DB offline for hours or days as a result.

The approximation is just that. An approximation.

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Eduardo Olivera SSCommitted Points: 1829 More actions · Answer 9

That's what I was afraid you were going to say. 🙂 And what I suspected. It only took about 4 minutes more to restore but then there has been some road traveled on the rollbacks since the last backup.

Thanks

Eduardo Olivera SSCommitted Points: 1829 More actions · Answer 10

I just had another wild thought. I know it's been a bad day probably not the best for ideas. :hehe: What if I change the databases to simple recovery mode and then reboot? Whould that prevent the rollback when it's backup? I can run a full backup when up.

Thanks

Eduardo Olivera SSCommitted Points: 1829 More actions · Answer 11

I ended up calling Microsoft and their DBA made a very interesting discovery: the transactions were not really rolling back. They were orphans but had references to the calling server, port number and were returning a rollback progression count. It took a long time to find out but the following queries where the ones that put the doubt in his mind:

select * from sys.dm_exec_requests where session_id IN (90,97)

select 8from sys.dm_exec_sessions where session_id IN (90,97)

select * from sys.dm_exec_connections where session_id IN (90,97)

The first one had no returns while the other two returned both sessions.

Gail Shaw SSC Guru Points: 1004504 More actions · Answer 12

Eduardo Olivera (4/17/2014)
What if I change the databases to simple recovery mode and then reboot?

You break your log chain and sit with potentially a many hour recovery after the restart. Recovery model has nothing whatsoever to do with how a transaction rolls back.

I wasn't joking when I said there's little you can do other than wait.

Gail Shaw
Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

We walk in the dark places no others will enter
We stand on the bridge and no one may pass

Eduardo Olivera SSCommitted Points: 1829 More actions · Answer 13

I understand that and that is why I resisted pressure from the application manager to reboot the server but Microsoft stepped in and, after long hours yesterday, has recommended that we do that. Originally they were backing me up. They thouroughly checked the server and found those two processes are orphans with no connection to anything and no exec requests. They say it is an event they have never encountered and suggested we turn off the application, check point all databases, reboot and then let the undo take care of the tons of locks these processes are holding in the background. We shouldn't be worst off than we are today.

Eduardo Olivera SSCommitted Points: 1829 More actions · Answer 14

Thank God it worked! I had serious doubts about the procedure but Microsoft reviewed it and insisted it was the only solution. Down time was just 20 minutes. Developers were insisting that it would be seconds. After all, they don't need DBAs except to relieve them from tedious work and get blamed. 🙂

This is Microsoft's documentation edited to conceal system information:

Issue Definition: You killed two sessions (90,97) in the SQL Server ProdSQL through the KILL command; The rollback operation has been going on since two days and still completed 7% on one sessions and 18% on the other.

Analysis:

>> First we tried to see what these sessions are actually doing. From sp_who2 output we see that the CPU and I/O counts are increasing

SPID Status Login HostName BlkBy DBName Command CPUTime DiskIO LastBatch ProgramName SPID REQUESTID

90 ROLLBACK AppLogin ClientSrvr . ProcDB AWAITING COMMAND 89886746 87201 04/16 18:15:32 .Net SqlClient Data Provider 90 0

97 ROLLBACK AppLogin ClientSrvr . ProcDB AWAITING COMMAND 63360540 5639 04/16 18:15:33 .Net SqlClient Data Provider 97 0

And also you said that the percentage of work done is getting increased when we see it through the “kill with statusonly” command but it was slow.

>> When we see the dm_exe_requests, we don’t see these sessions and corresponding threads

>> When we see the dm_exec_connections output, we see that the client side TCP ports used are,

For 90 - xxx.xxx.x.xxx 56972

For 97 - xxx.xxx.x.xxx 56978

>> From dm_exec_sessions, we figured out the PIDs on the client side which established these session on the database

For 90 - 18332

For 97 - 18332

>> We logged into the middleware server but don’t find any sessions with the corresponding PIDs and the netstat output didn’t show us these ports in use.

>> That makes me think if these sessions 90 and 97 has become orphaned; but still not sure how the CPU and I/O counts are increasing without corresponding thread entries in dm_exec_requests.

Action Plan:

>> As Rollback can't be interrupted or bypassed as it's important to comply to the ACID properties, we have to looks for any alternatives/workarounds to sweep these sessions out of the SQL server

>> So we would recommend you to restart the SQL service to clean these sessions.

I know there few questions around whether SQL server stuck in the recovery phase forever making these databases inaccessible.

>> When SQL server starts, it performs the recovery in two phases

· REDO

· UNDO

>> Redo is nothing but rolling forward the changes that are not yet applied while undo is rolling back the uncommitted changes

>> In our case, the kill operation is doing something which is almost equivalent to UNDO

>> In SQL, we introduced a concept of background UNDO which means that, when you start the instance, databases will be accessible immediately after the redo and there’s a background thread which do the UNDO in the back ground.

>> This makes the databases available quickly instead of sticking on the recovery forever.

>> One thing we can’t avoid even after restart is, the locks: Even the background UNDO thread need to hold locks on the pages/rows/keys until it finishes UNDO operation. This will make the other queries wait till the undo thread releases the locks.

>> Also, to speed up REDO operation, I would recommend you to issue CHECKPOINT on all the databases before going for a restart. This helps SQL Server in avoiding the work it has to do to roll forward the changes.

>> So the action item would be

Ø Issue CHECKPOINT on all the databases (one-by-one)

Ø Restart the SQL Service

>> You can motor the locks through sp_lock and see how they’re moving. But as I said earlier, they’ll definitely block any other queries issues against some of the objects in the databases IDs 18 and 19