Mirrored database becomes SUSPENDED during nightly maintenance

Question

Mirrored database becomes SUSPENDED during nightly maintenance

GordonLiddy

Ten Centuries

Points: 1047
More actions
April 18, 2011 at 9:35 am

#237859

Hi there!
I will try to keep this post as short as possible, without leaving out any crucial information.
Server Setup: SQL Server 2008 (x64) - Enterprise Edition, SP2 CU2.
Database Setup: Three principal databases which are mirrored. Each of these mirrored db's then has a database snapshot created on them, every 30 minutes.
We're running Database Mirroring in High Performance (asynchronous) mode, without a witness server. Both SQL Server instances, which we here call Instance1 (where the principal databases resides) and Instance2 (where the mirrored databases and the database snapshot resides) are part of the same two node failover cluster.
On two occasions now we've encountered the Error: 9003, Severity: 20, State: 15 (happens at 2011-04-17 00:37:35.770)
Which is followed by this information:
The log scan number (82666:12760:1) passed to log scan in database 'Principal_Database' is not valid. This error may indicate data corruption or that the log file (.ldf) does not match the data file (.mdf). If this error occurred during replication, re-create the publication.
Otherwise, restore from backup if the problem results in a failure during startup.
Which is then followed by Error: 9001, Severity: 21, State: 8.
2011-04-17 00:37:35.800spid60The log for database 'Principal_Database_Stats' is not available. Check the event log for related error messages. Resolve any errors and restart the database.
The last error is related to the database snapshot which is taken on the mirrored database.
When this has happend we can see that there are a lot of transactions going on, and the transaction log backups are quite large (1-4GB) during this period. The reason that the number and the size of transactions increase is because of two SQL Agent jobs that run at the same time.
One of the jobs does an INDEX REBUILD, and the other deletes data from one table which exists in two of the databases.
Look at the table below for the actual numbers gathered during this period for the mirroring session.
DatabaseNameMirrorState[log_generation_rate (MB/sec)][unsent_log (MB)] [send_rate (MB/sec)][unrestored_log (MB)][transactions_per_sec][time_recorded]
Principal_DatabaseSUSPENDED6.699122.060.000.001132011-04-17 00:49:00.713
Principal_DatabaseSUSPENDED5.398720.670.000.001902011-04-17 00:48:00.690
Principal_DatabaseSUSPENDED23.628397.290.000.002232011-04-17 00:47:00.667
Principal_DatabaseSUSPENDED26.146980.280.000.002422011-04-17 00:46:00.627
Principal_DatabaseSUSPENDED16.895412.130.000.001562011-04-17 00:45:00.620
Principal_DatabaseSUSPENDED13.334398.530.000.001232011-04-17 00:44:00.617
Principal_DatabaseSUSPENDED15.153598.760.000.001402011-04-17 00:43:00.603
Principal_DatabaseSUSPENDED10.712689.810.000.001002011-04-17 00:42:00.593
Principal_DatabaseSUSPENDED10.632047.130.000.00992011-04-17 00:41:00.580
Principal_DatabaseSUSPENDED6.631419.980.000.00622011-04-17 00:40:01.617
Principal_DatabaseSUSPENDED11.291015.650.000.001052011-04-17 00:39:00.847
Principal_DatabaseSUSPENDED9.13338.373.510.00852011-04-17 00:38:00.810
Principal_DatabaseSYNCHRONIZED6.510.126.512754.57592011-04-17 00:37:00.770
Principal_DatabaseSYNCHRONIZED5.710.005.712438.10942011-04-17 00:36:00.770
Principal_DatabaseSYNCHRONIZED2.550.002.552168.795842011-04-17 00:35:00.707
Principal_DatabaseSYNCHRONIZED1.910.001.912125.684362011-04-17 00:34:00.697
Principal_DatabaseSYNCHRONIZED0.980.000.982103.466222011-04-17 00:33:00.650
Principal_DatabaseSYNCHRONIZED3.020.003.022191.673272011-04-17 00:32:00.650
Principal_DatabaseSYNCHRONIZED5.560.115.552125.711112011-04-17 00:31:00.637
Principal_DatabaseSYNCHRONIZED6.920.007.341877.801222011-04-17 00:30:00.623
Principal_DatabaseSYNCHRONIZED4.4224.674.011516.42412011-04-17 00:29:00.623
Principal_DatabaseSYNCHRONIZED6.000.006.001328.79552011-04-17 00:28:00.547
Principal_DatabaseSYNCHRONIZED6.660.006.661072.62622011-04-17 00:27:00.520
(It now became clear that the table wasn't really easy to read. Hmm, maybe copy paste into another document? Well, not much to do ... :-))
Questions:
* Is it so that the number of transactions and amount of log during the minutes after 00:30:00 finally "breaks" the mirroring and turn it from SYNCHRONIZED to SUSPENDED?
* Would you consider 500-600 transactions/sec a high number, quite normal or maybe even very normal?
(I ask since I don't know. Probably depends and is related to the load you have, but this system averages 5-10 transactions/sec during normal production hours)
* Are there other questions to be asked around this?
* Is it to much to mirror three databases and then create snapshots on them, with this kind of load during maintenance?
* Should we change anything in our environment, maybe the maintenance routine?
Thank you very much for taking your time with this.
I hope anyone has any good ideas and/or answers to our problem!
Sincerely,
Gord

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply

Jeffrey Williams SSC Guru Points: 90245 More actions · Answer 1

I can't really answer all of your questions, but just to clarify about your mirroring. The mirroring state will not go to suspended unless someone is issuing a command to suspend mirroring. My bet would be the index rebuild process is suspending the mirroring and then resuming it after it is done.

As for having 3 databases mirrorring - shouldn't be an issue at all. I mirror 14 databases from CA to TX with no real issues. After an index rebuild or a purge operation, the transactions for the mirror queue and can take several hours (sometimes days, depending on the changes) to catch up.

My guess here is that your snapshots are causing the majority of the problems. Do you keep multiple snapshots? If so, how long are you keeping them?

Jeffrey Williams
“We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

― Charles R. Swindoll

How to post questions to get better answers faster
Managing Transaction Logs

Marios Philippopoulos SSC Guru Points: 57190 More actions · Answer 2

You may need to upgrade to Cumulative Update 3 for SQL Server 2008 Service Pack 2:

http://support.microsoft.com/kb/2403218#appliesto

__________________________________________________________________________________
SQL Server 2016 Columnstore Index Enhancements - System Views for Disk-Based Tables[/url]
Persisting SQL Server Index-Usage Statistics with MERGE[/url]
Turbocharge Your Database Maintenance With Service Broker: Part 2[/url]

GordonLiddy Ten Centuries Points: 1047 More actions · Answer 3

Thank you very much for your answers!

Jeffrey: That sounds logical, I agree. But what about the part "The log scan number (82666:12760:1) passed to log scan in database.." in the error message?

It's great to hear that three mirrored databases shouldn't be an issue. As you say it might be the index rebuild that is suspending the mirror and then takes time to enable again. When it comes to the database snapshots, they are created ever 30 minutes in a SQL Agent job. The job first drops the existing snapshot (for each database) and then creates a new one. So there are no duplicates. Here's how it looked in the error log a few minutes before the database got suspended:

..

Recovery of database 'Principal_Database_Stats' (9) is 0% complete (approximately 208254 seconds remain). Phase 1 of 3.

Recovery of database 'Principal_Database_Stats' (9) is 0% complete (approximately 208239 seconds remain). Phase 1 of 3.

Recovery of database 'Principal_Database_Stats' (9) is 0% complete (approximately 208682 seconds remain). Phase 2 of 3.

Recovery of database 'Principal_Database_Stats' (9) is 2% complete (approximately 6120 seconds remain). Phase 2 of 3.

Recovery of database 'Principal_Database_Stats' (9) is 14% complete (approximately 869 seconds remain). Phase 2 of 3.

Recovery of database 'Principal_Database_Stats' (9) is 15% complete (approximately 802 seconds remain). Phase 2 of 3.

..

It then came to 35% with approx 800 seconds left, just before error 9003 occured.

At daytime and normal production load, the drop and create of the three snapshots take between 5 and 10 seconds. At the time of the index rebuild and the other "delete job" running at the same time, the snapshot job takes about 15 - 25 minutes.

Marios: I'll definitely look into upgrading to the latest CU. It had some interesting info, and the title of the fix is spot on to our problem!

Thanks once again for your answers, I appreciate it!

Sincerely,

Gord

SQL Rob Say Hey Kid Points: 698 More actions · Answer 4

Hi,

This is just a guess based on what you've said, but how long is your optimization taking to complete? As you are using enterprise edition I'm assuming the optimization is online. It's possible that your snapshots are conflicting with the snapshot that is taken to rebuild the index. A snapshot every 30 minutes seems a little extreme anyway(are you reporting off these?), but I'd definitely recommend suspending that activity while your maintenance is being performed if possible.