Mirrored database becomes SUSPENDED during nightly maintenance

  • Hi there!

    I will try to keep this post as short as possible, without leaving out any crucial information.

    Server Setup: SQL Server 2008 (x64) - Enterprise Edition, SP2 CU2.

    Database Setup: Three principal databases which are mirrored. Each of these mirrored db's then has a database snapshot created on them, every 30 minutes.

    We're running Database Mirroring in High Performance (asynchronous) mode, without a witness server. Both SQL Server instances, which we here call Instance1 (where the principal databases resides) and Instance2 (where the mirrored databases and the database snapshot resides) are part of the same two node failover cluster.

    On two occasions now we've encountered the Error: 9003, Severity: 20, State: 15 (happens at 2011-04-17 00:37:35.770)

    Which is followed by this information:

    The log scan number (82666:12760:1) passed to log scan in database 'Principal_Database' is not valid. This error may indicate data corruption or that the log file (.ldf) does not match the data file (.mdf). If this error occurred during replication, re-create the publication.

    Otherwise, restore from backup if the problem results in a failure during startup.

    Which is then followed by Error: 9001, Severity: 21, State: 8.

    2011-04-17 00:37:35.800spid60The log for database 'Principal_Database_Stats' is not available. Check the event log for related error messages. Resolve any errors and restart the database.

    The last error is related to the database snapshot which is taken on the mirrored database.

    When this has happend we can see that there are a lot of transactions going on, and the transaction log backups are quite large (1-4GB) during this period. The reason that the number and the size of transactions increase is because of two SQL Agent jobs that run at the same time.

    One of the jobs does an INDEX REBUILD, and the other deletes data from one table which exists in two of the databases.

    Look at the table below for the actual numbers gathered during this period for the mirroring session.

    DatabaseNameMirrorState[log_generation_rate (MB/sec)][unsent_log (MB)] [send_rate (MB/sec)][unrestored_log (MB)][transactions_per_sec][time_recorded]

    Principal_DatabaseSUSPENDED6.699122.060.000.001132011-04-17 00:49:00.713

    Principal_DatabaseSUSPENDED5.398720.670.000.001902011-04-17 00:48:00.690

    Principal_DatabaseSUSPENDED23.628397.290.000.002232011-04-17 00:47:00.667

    Principal_DatabaseSUSPENDED26.146980.280.000.002422011-04-17 00:46:00.627

    Principal_DatabaseSUSPENDED16.895412.130.000.001562011-04-17 00:45:00.620

    Principal_DatabaseSUSPENDED13.334398.530.000.001232011-04-17 00:44:00.617

    Principal_DatabaseSUSPENDED15.153598.760.000.001402011-04-17 00:43:00.603

    Principal_DatabaseSUSPENDED10.712689.810.000.001002011-04-17 00:42:00.593

    Principal_DatabaseSUSPENDED10.632047.130.000.00992011-04-17 00:41:00.580

    Principal_DatabaseSUSPENDED6.631419.980.000.00622011-04-17 00:40:01.617

    Principal_DatabaseSUSPENDED11.291015.650.000.001052011-04-17 00:39:00.847

    Principal_DatabaseSUSPENDED9.13338.373.510.00852011-04-17 00:38:00.810

    Principal_DatabaseSYNCHRONIZED6.510.126.512754.57592011-04-17 00:37:00.770

    Principal_DatabaseSYNCHRONIZED5.710.005.712438.10942011-04-17 00:36:00.770

    Principal_DatabaseSYNCHRONIZED2.550.002.552168.795842011-04-17 00:35:00.707

    Principal_DatabaseSYNCHRONIZED1.910.001.912125.684362011-04-17 00:34:00.697

    Principal_DatabaseSYNCHRONIZED0.980.000.982103.466222011-04-17 00:33:00.650

    Principal_DatabaseSYNCHRONIZED3.020.003.022191.673272011-04-17 00:32:00.650

    Principal_DatabaseSYNCHRONIZED5.560.115.552125.711112011-04-17 00:31:00.637

    Principal_DatabaseSYNCHRONIZED6.920.007.341877.801222011-04-17 00:30:00.623

    Principal_DatabaseSYNCHRONIZED4.4224.674.011516.42412011-04-17 00:29:00.623

    Principal_DatabaseSYNCHRONIZED6.000.006.001328.79552011-04-17 00:28:00.547

    Principal_DatabaseSYNCHRONIZED6.660.006.661072.62622011-04-17 00:27:00.520

    (It now became clear that the table wasn't really easy to read. Hmm, maybe copy paste into another document? Well, not much to do ... :-))

    Questions:

    * Is it so that the number of transactions and amount of log during the minutes after 00:30:00 finally "breaks" the mirroring and turn it from SYNCHRONIZED to SUSPENDED?

    * Would you consider 500-600 transactions/sec a high number, quite normal or maybe even very normal?

    (I ask since I don't know. Probably depends and is related to the load you have, but this system averages 5-10 transactions/sec during normal production hours)

    * Are there other questions to be asked around this?

    * Is it to much to mirror three databases and then create snapshots on them, with this kind of load during maintenance?

    * Should we change anything in our environment, maybe the maintenance routine?

    Thank you very much for taking your time with this.

    I hope anyone has any good ideas and/or answers to our problem!

    Sincerely,

    Gord

  • I can't really answer all of your questions, but just to clarify about your mirroring. The mirroring state will not go to suspended unless someone is issuing a command to suspend mirroring. My bet would be the index rebuild process is suspending the mirroring and then resuming it after it is done.

    As for having 3 databases mirrorring - shouldn't be an issue at all. I mirror 14 databases from CA to TX with no real issues. After an index rebuild or a purge operation, the transactions for the mirror queue and can take several hours (sometimes days, depending on the changes) to catch up.

    My guess here is that your snapshots are causing the majority of the problems. Do you keep multiple snapshots? If so, how long are you keeping them?

    Jeffrey Williams
    “We are all faced with a series of great opportunities brilliantly disguised as impossible situations.”

    ― Charles R. Swindoll

    How to post questions to get better answers faster
    Managing Transaction Logs

  • You may need to upgrade to Cumulative Update 3 for SQL Server 2008 Service Pack 2:

    http://support.microsoft.com/kb/2403218#appliesto

    __________________________________________________________________________________
    SQL Server 2016 Columnstore Index Enhancements - System Views for Disk-Based Tables[/url]
    Persisting SQL Server Index-Usage Statistics with MERGE[/url]
    Turbocharge Your Database Maintenance With Service Broker: Part 2[/url]

  • Thank you very much for your answers!

    Jeffrey: That sounds logical, I agree. But what about the part "The log scan number (82666:12760:1) passed to log scan in database.." in the error message?

    It's great to hear that three mirrored databases shouldn't be an issue. As you say it might be the index rebuild that is suspending the mirror and then takes time to enable again. When it comes to the database snapshots, they are created ever 30 minutes in a SQL Agent job. The job first drops the existing snapshot (for each database) and then creates a new one. So there are no duplicates. Here's how it looked in the error log a few minutes before the database got suspended:

    ..

    Recovery of database 'Principal_Database_Stats' (9) is 0% complete (approximately 208254 seconds remain). Phase 1 of 3.

    Recovery of database 'Principal_Database_Stats' (9) is 0% complete (approximately 208239 seconds remain). Phase 1 of 3.

    Recovery of database 'Principal_Database_Stats' (9) is 0% complete (approximately 208682 seconds remain). Phase 2 of 3.

    Recovery of database 'Principal_Database_Stats' (9) is 2% complete (approximately 6120 seconds remain). Phase 2 of 3.

    Recovery of database 'Principal_Database_Stats' (9) is 14% complete (approximately 869 seconds remain). Phase 2 of 3.

    Recovery of database 'Principal_Database_Stats' (9) is 15% complete (approximately 802 seconds remain). Phase 2 of 3.

    ..

    It then came to 35% with approx 800 seconds left, just before error 9003 occured.

    At daytime and normal production load, the drop and create of the three snapshots take between 5 and 10 seconds. At the time of the index rebuild and the other "delete job" running at the same time, the snapshot job takes about 15 - 25 minutes.

    Marios: I'll definitely look into upgrading to the latest CU. It had some interesting info, and the title of the fix is spot on to our problem!

    Thanks once again for your answers, I appreciate it!

    Sincerely,

    Gord

  • Hi,

    This is just a guess based on what you've said, but how long is your optimization taking to complete? As you are using enterprise edition I'm assuming the optimization is online. It's possible that your snapshots are conflicting with the snapshot that is taken to rebuild the index. A snapshot every 30 minutes seems a little extreme anyway(are you reporting off these?), but I'd definitely recommend suspending that activity while your maintenance is being performed if possible.

Viewing 5 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply