The consistency found surrounding log backup has a few more points found:
I looking over all captured "what's running" processes I am consistently seeing
DBName Command CPUTime DiskIO LastBatch ProgramName
master EXECUTE 0 0 9/12/2012 10:00 SQLAgent - TSQL JobStep (Job 0xADACD36076D9A64FBB8DFF56CE4E2D9D : Step 1)
Although I was not really looking at this as a big player, I kept it in the theory bin. I knew log shipping was kicking off every 15 minutes, but the freezing never really started when log shipping backups started. Not only that, but it started into the backup a few minutes, and the freeze always stopped while log shipping was still going. The step above actual backup process completes in less than a second but was there each time as CPU time 0 and Disk time 0.
The log shipping has been setup like so:
There is a master job. This master job is on a 15 minute schedule. The steps of the job are to kick off another job that does the backup. This way, if needed I could manually kick off just one job very easily, create new log shipping entries very easily and other reasons. All jobs are disabled except MASTER. The first step should complete in 7 seconds, no less, really no more. Because starting a job from another job really just fires off the job and does not wait for it to finish I have a waitfor script that the master job step 1 calls a proc, waits 2 seconds, starts the job, loops on a 5 second intervals checked to see if the job is still going, then once it is complete will exit the proc, moving the master job to the second step.
So, I have a sequential log backup process where no two tail logs are being backed up at the same time. I have all log files on a single volume separate from data files. If these backups were started at the same time without the ordering, there was slowing, not blocking, of the queries when the backups kicked off all at the same time or on staggered times but clashed with on another.
Now, when the master job fires off, the first step that starts the first call to kick off a tail log backup job kicks off after 2 seconds... and sits there for about a minute and a half to 3 minutes. Well, this should never happen to take more than about 7 seconds. 2 seconds for a wait before starting the job, probably a few milliseconds to see that there is nothing to backup on the log, and 5 seconds for a wait on checking before making sure the job kicked off was completed and now it can move on.
So, after a minute and thirty seconds the job finally completes BUT the freeze did not start until 45 seconds into this to a minute and a half into this. On top of that, now that the freeze started, maybe another 40 seconds goes by where this job will not seem to do anything and then after that job is done and the next job is executing - the freeze is still happening and then subsides after a couple minutes of freezing. Sometimes other things are executing along, CPU is having no abnormalities, reads and writes nothing off, disk queue no abnormality. Sometimes, cpu goes to 2% on all cores, sometimes CPU spikes up on all cores.
I have tested to see if it was the database first in queue by removing that step and starting at the next step. I did get another freeze at about 12:15 today where it showed now this other log shipping job took a minute and a half, the freeze was going for two minutes, and the freeze started 50 seconds into the job being run for log shipping.
I have created a new "dummy" job that does a waitfor for 2 and a half minutes, created a new step in the master job that uses the same waitfor script that the other log shipping calls use, put it onto the front of the step chain and am waiting to see if I get another bite on this for freezing and when - during the 2:30 waitfor first step, or not until the actual backup on the log file itself is attempted to be started. Looking to see if it is the waitfor script, the chain on starting a job from another job, or that the log backup process is killing that volume for some reason and perhaps I cannot read from disk for 45 seconds, and then cannot write to disk for 2 minutes... still all new to me why there was no real line up of this. Resource contention would just be a great pointer for this if there was some consistency there. This mostly looks to rule out Network involvement need. Still not sure why I could query DMVs and system tables where the applications were blocking. I did not query off the main tables used in the blocking app database backing, I will also put that in my bucket of testing. If this were the script that does a waitfor, I would have thought of more consistency to be seen unless it took on resource contention... but since there is no steady showing in resource contention every time this is still a point of confusion.