RE: Mirroring monitor:" "current restore rate" low (100 KB/s) while "unrestored log" enourmous (100's GB)

SSCarpal Tunnel

Points: 4592

December 30, 2013 at 9:23 am

Hi Kevin,

Good point, i was somewhat sparse with the amount of details... here goes:

As you quite rightly pointed out it's not the size of the DB that is important here, it is the size of the log that is important in this case. Well.... the are *huge*.... I have no control over what the BI guy does every night to the DWH, but it is not nice (log wise that is)

And Yes, i'm sure that these 2 DWH databases are the ones that have problems, since they are the only 2 DB's (out of 40+) that show "Unrestored log" in mirror monitor, as well as enourmouse log growth, and there are 2 CPU's running up the electricity bill....

But for good measure, here is how i came to my conclusion:

Profiler has been runnign for over an hour now, and showed nothing of any significant (> 50 ms) CPU usage, hence; it is a LOOOOONG running query, and i'm missing start and end of that SPID.

When i use Adam's sp_WhoIsActive with the switch for system SPID's on, i see

SpID: 67: DBMIRROR_DBM_MUTEX, Wait: 638.823 ms, CPU: 57M, read: 45M, write: 2M, phy read: 2.6M

DBCC inputbuffer(67) comes back empty... no luck there either.

Process Explorer: shows 2 thread id's under SQLSERVR.exe comsuming huge amounts of CPU clycles for ntoskrnl.exe!KeWaitForMultipleObjects, confirming the view in task manager of 2 CPU's maxing out. And i also think that these must be a mirror threads described in kb2001270, since *nothing* else is running on that server besides the maintenance jobs.

When i look at the distribution of this query:

SELECT a.scheduler_id ,current_tasks_count, runnable_tasks_count, work_queue_count, pending_disk_io_count, parent_node_id as Cpu_Numa, is_idle, b.session_id

FROM sys.dm_os_schedulers a

left JOIN sys.dm_os_tasks b on a.active_worker_address = b.worker_address

WHERE a.scheduler_id < 255

order by b.session_id

I see an average 13 threads on each of the 64 CPU, and an average of 6-9 scheduler_id's without an SPID assigned, so there are enough scheduler_id's to server any SPID that needs attention.

Memory: 256 GB, buffer cache hitratio is close to 100%, with an 23 hour PageLifeExpectancy.

DISK: Nn the drive that holds the database files is see an occasional spike in disk queues to 0.7, the database files do report IO-Stalls but that is to be expected for a DWH on a traditional raid array (even it is 25 disks wide)

But what is really buggin me is this: SQL mirror monitor claims 64 - 100 KB/sec for the "current restore rate", but windows TASKMANAGER disk monitor clamed 5 MB / sec Write IO for one of the DWH databases.

In summary: I think that applying this much mirrored log entries overwhelming SQL for some reason i cannot (yet) see.....

I do not see an obvious HW bottleneck except for the 2 maxed out CPU cores which (most likely) are servicing the LOG threads.

Could it be that i stumbled upon an inherent "maximum log/time vs underlying hardware" ratio that mirroring is able to cope with??

p.s.

I'm moving one of my problem DB's from mirror to logshipping, but why would that work better? Both mirror and logshipping are essentially in the business of duplicating log entries on one DB and restoring them on another?