Log shipping no longer working...

  • I've got a weird problem here that I'm hoping someone here has some insight on.  Here's the scenario:

     

    server1, sql 2000 enterprise edition

    servers2 and server3 also SQL2k Enterprise, and are in a cluster, cluster1 on comparable hardware.  They are quad processor systems with 2 gigs of RAM.

    I am using log shipping to move data from server1 to cluster1.  I have about 20 databases set up to ship logs to cluster1.  This seemed to be going quite well for a couple weeks.  Then I needed to make a massive update to one of the databases on server1.  This created some rather large transaction log backups.  I have the backups running every 15 minutes, so none of the files were bigger than say 50 MB.

    Well, the log shipping got way out of sync, and the monitor started complaining.  The cluster1 SQL server was receiving logs faster than it could apply them.  At first I suspected that shared disks used in the cluster were to blame.  But that doesn't seem to be the case, as the connection is over a fiber channel.  I was able to copy files over this link while it was trying to restore, and they were quite fast.

    So I start monitoring counters. The performance counters seemed to indicate that data was trickling in.  In order to rule out any disk related problems, I copied the transaction log backup to the local hard drive and tried to do the restore manually.  I am aware that this will likely screw up my log shipping, and I'll have to start over.  But I just want to know why the restore is so slow.

    Keep in mind this should be no more than 15 minutes worth of changes.  My manual restore of the transaction log took 40 minutes!  I turned on statistics time and statistics IO.  The output message I get is:

    SQL Server parse and compile time:

       CPU time 0ms, elapsed time = 0 ms.

    Processed 4260 pages for database 'database1', file 'database1_Log' on file 1.

    SQL Server Execution Times:

       CPU time = 16 ms, elapsed time = 2542 ms.

    RESTORE LOG successfully processed 4260 pages in 2.002 seconds (17.429 MB/sec).

    SQL Server Execution Times:

       CPU time = 28235 ms, elapsed time = 2382899 ms.

    What they hey???  If the restore took about 2 seconds, what the heck was it doing for the rest of the 40 minutes???

    I was watching the sysprocesses tables, and the waittime was never high.  So I'm not sure what was going on.

    Any help or suggestions are appreciated!

    Devin

     

  • I raised the same issue with Microsoft some months ago, because restore time were highly variable for similarly sized log backup files.

    The support people looked through the SQL Server code, and discovered that the time quoted is just the time to physically restore the data from the Log backup file into the transaction log.

    The restore process then has to roll forward/roll back these changes to the database itself. It is this part of the process that is taking the remaining 40 minutes of your restore time.

  • Hmm, so I guess my next question is:

    Why would it take almost 40 minutes to roll forward (or back) 15 minutes worth of transactions?  This is really weird stuff.

     

  • Possibly, the rolling occurs on the primary databases due to some error or poor query execution that SQL is rolling back. Bundled in the transaction log and shipped to the secondary. Sure, I could be wrong.

  • I just had a thought.  Maybe the reason the restores are so slow has something to do with the nature of the transaction log being restored.  While the transaction log backups are 15 minutes apart, they span one very large open transaction that will update tens of millions of rows.  That is:

    server 1, 12:00pm

    update bigtable set col1 = b.col1

    from bigtable join table2 on (bigtable.x = table2.x)

    where bigtable.col2 = 'xyz'

    go

     

    server1, 12:15pm

    (note transaction is still open)

    backup tran database1 to disk='c:\mssql\backup\logshipping\database1.trn'

     

    server2, 12:16pm

    (note, backup has completed)

    copy \\server1\logshipping\database1.trn c:\mssql\backup\logshipping\database1.trn

     

    server2, 12:17pm

    restore log database1 from disk='c:\mssql\backup\logshipping\database1.trn'

    (This is the part that takes 40 mins)

     

    Now the big question: What is in that transaction log backup?  It has part of one large transaction, but it has NOT committed yet.  It must have something, because together all the log backups will complete the transaction (assuming it finally commits).  Is it possible it takes this long just to build the transaction log up?  I can't think it would be rolling anything forward or back at this point, as the transaction is still open.

  • Are you using standby files?  Because SQL server will get to the end of the transaction log restore, find there is no commit and then write the whole uncommitted transaction out to the standby file so the database remains in a readable state.

    This may take awhile, but not that long.

  • Actually, I am using the UNDO files for standby mode.  I guess I don't fully understand the purpose of the UNDO file in this scenario because of the very large transaction.  Would all this data have to reside in the UNDO file?

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply