Communications to SQL server "freeze", then resume.

  • matt.newman

    Hall of Fame

    Points: 3910

    ocoj (9/14/2012)


    We started getting "freezes" due to disk I/O being frozen by Avamar backup solution - but weirdly, sometimes at other times than the regularly scheduled backup when no Avamar process was running.

    Sometimes we get the freeze when the backup is not running, as well. I am using RedGate SQL backup tool for the backup but I am also using a master job to kick off the children jobs in sequential ordering so not one of the backups goes against the volume at the same time as another.

  • Adam Haines

    SSC-Insane

    Points: 23197

    matt.newman (9/14/2012)


    One of the log shipping jobs that is a very small and nearly static database seems to be at the root of this, somehow.

    I kicked off log shipping and a lock was seen. Log shipping had not started for all databases when the lock was seen, it was stuck on this very small (3 mb data file 6 mb log file). This job step to backup the database is taking between 1:30 and 5:00 to complete.

    It's after 10 so it's unlikely I will be able to continue this until next week. I am still not sure if this is the root cause because every time I manually kicked off to cause a block there was a very apparent CPU spike that had the server pegged. The stage connected app did not freeze.

    Hmm weird... lets get at it next week then...

  • matt.newman

    Hall of Fame

    Points: 3910

    I had tested the MySQL link and when I run that job manually there was no problem. I also turned it back on and has not given me a freeze yet. Also, with Log shipping off, I am seeing a lot more CPU spikes where all cores jump - but no freeze is seen.

    I have seen a lot of MSQL_XP wait but still unable to link this. Possibly the log shipping backup is somehow taking over all cores. Seems like I got somewhere though, even though some things contradict others.

    'Til next week then. Same Bat time, same Bat channel.

  • Adam Haines

    SSC-Insane

    Points: 23197

    Yes sir. Next week it is....

    One thing to ponder before we going into the weekend is perhaps you have resource contention at multiple levels that is causing issues. Backups really shouldn't drive up CPU that much, as it is more of an IO bound process; however, compression is one thing that can drive CPU up during backups.

    Next week, We should check what majority of your waits, IO Stalls/ IO Latency, and worker threads. Maybe the "freeze" is just you running out of worker threads and queuing requests, disks are maxed, or CPU is flat lined.

    Anyway CYA next week.

  • matt.newman

    Hall of Fame

    Points: 3910

    I have been monitoring reads/writes/cores.

    It's intermittent of being maxed out and not being maxed out on CPU. Disk never really went high but did, once, dip to near nothing. When the freeze occurs, sometimes CPU is pegged, sometimes CPU is unchanged from what it normally looks like, and sometimes just near flatlines around 1-2% use on all cores.

    I do think there is resource contention, but why would all my cores sometimes be reporting 1-2% use when application connections "freeze", and I can run all kinds of queries through SSMS like the server is ready for action, connected in via VNC to view the processes. Other times, when the freeze occurs, SQL Agent Jobs that are running already just continue running while existing connections to the application all lock up.

    It's just quite confusing and I think I am looking in the wrong place with the real issue starring right at me, unnoticed. Sometimes, MSQL_XP wait is very minimal at the time of the freeze.

  • matt.newman

    Hall of Fame

    Points: 3910

    The consistency found surrounding log backup has a few more points found:

    I looking over all captured "what's running" processes I am consistently seeing

    DBName Command CPUTime DiskIO LastBatch ProgramName

    master EXECUTE 0 0 9/12/2012 10:00 SQLAgent - TSQL JobStep (Job 0xADACD36076D9A64FBB8DFF56CE4E2D9D : Step 1)

    Although I was not really looking at this as a big player, I kept it in the theory bin. I knew log shipping was kicking off every 15 minutes, but the freezing never really started when log shipping backups started. Not only that, but it started into the backup a few minutes, and the freeze always stopped while log shipping was still going. The step above actual backup process completes in less than a second but was there each time as CPU time 0 and Disk time 0.

    The log shipping has been setup like so:

    There is a master job. This master job is on a 15 minute schedule. The steps of the job are to kick off another job that does the backup. This way, if needed I could manually kick off just one job very easily, create new log shipping entries very easily and other reasons. All jobs are disabled except MASTER. The first step should complete in 7 seconds, no less, really no more. Because starting a job from another job really just fires off the job and does not wait for it to finish I have a waitfor script that the master job step 1 calls a proc, waits 2 seconds, starts the job, loops on a 5 second intervals checked to see if the job is still going, then once it is complete will exit the proc, moving the master job to the second step.

    So, I have a sequential log backup process where no two tail logs are being backed up at the same time. I have all log files on a single volume separate from data files. If these backups were started at the same time without the ordering, there was slowing, not blocking, of the queries when the backups kicked off all at the same time or on staggered times but clashed with on another.

    Now, when the master job fires off, the first step that starts the first call to kick off a tail log backup job kicks off after 2 seconds... and sits there for about a minute and a half to 3 minutes. Well, this should never happen to take more than about 7 seconds. 2 seconds for a wait before starting the job, probably a few milliseconds to see that there is nothing to backup on the log, and 5 seconds for a wait on checking before making sure the job kicked off was completed and now it can move on.

    So, after a minute and thirty seconds the job finally completes BUT the freeze did not start until 45 seconds into this to a minute and a half into this. On top of that, now that the freeze started, maybe another 40 seconds goes by where this job will not seem to do anything and then after that job is done and the next job is executing - the freeze is still happening and then subsides after a couple minutes of freezing. Sometimes other things are executing along, CPU is having no abnormalities, reads and writes nothing off, disk queue no abnormality. Sometimes, cpu goes to 2% on all cores, sometimes CPU spikes up on all cores.

    I have tested to see if it was the database first in queue by removing that step and starting at the next step. I did get another freeze at about 12:15 today where it showed now this other log shipping job took a minute and a half, the freeze was going for two minutes, and the freeze started 50 seconds into the job being run for log shipping.

    I have created a new "dummy" job that does a waitfor for 2 and a half minutes, created a new step in the master job that uses the same waitfor script that the other log shipping calls use, put it onto the front of the step chain and am waiting to see if I get another bite on this for freezing and when - during the 2:30 waitfor first step, or not until the actual backup on the log file itself is attempted to be started. Looking to see if it is the waitfor script, the chain on starting a job from another job, or that the log backup process is killing that volume for some reason and perhaps I cannot read from disk for 45 seconds, and then cannot write to disk for 2 minutes... still all new to me why there was no real line up of this. Resource contention would just be a great pointer for this if there was some consistency there. This mostly looks to rule out Network involvement need. Still not sure why I could query DMVs and system tables where the applications were blocking. I did not query off the main tables used in the blocking app database backing, I will also put that in my bucket of testing. If this were the script that does a waitfor, I would have thought of more consistency to be seen unless it took on resource contention... but since there is no steady showing in resource contention every time this is still a point of confusion.

  • matt.newman

    Hall of Fame

    Points: 3910

    Nothing new that takes away from confusion. If anyone sees anything familiar it would be appreciated for any references.

    Attached is screen shot of Confio showing Log shipping backups started a minute and 1-2 seconds before SQL server recorded Log shipping backups started. I did have the Net Admin pull up both Confio Machine and SQL machine and pull the clocks side by side to verify. They are off by a second, if that (screen capture software might push difference into play).

    I am re-adjusting things again to see if I can get a better gauge since my time stamps are a minute off. From what I have been pushing to verification, it does look like the RedGate backup tool sits there, amps up for a minute, locks the volume for a couple minutes, the Disk Queue length counter never goes above 2 on the volume with the log files. May take a little more time to verify.

    Anyone heard of this for the RedGate Backup tool or Off time stamps from Confio?

  • Adam Haines

    SSC-Insane

    Points: 23197

    What version of SQL Backup are you using? I found a forum post that said that SQL Backup would hang with a msql_xp wait type. I believe this may be problem may be with the way the backup software is trying to call an extended procedure.

    http://www.red-gate.com/messageboard/viewtopic.php?t=9889

  • matt.newman

    Hall of Fame

    Points: 3910

    Thanks, that problem sounds about right with what I have been seeing. I have an open ticket in to RedGate about this since last week Monday/Tuesday. My search terms had picked up several similar posts but could not find that one.

    I am on an older version still from that post - 5.4.0.55

    Confio shows me those messed up timestamps and I was missing the potential problem, and now that I am honing in on RedGate backups, there are still a lot of "sometimes" and "not always" and "that's different".

    I broke out all my log backup jobs and tested in a variety of ways. Took the exact command to do the backup and executed in SSMS. Took 4 minutes 16 seconds to complete, but the output said the total execution time was 4 milliseconds. I was running sp_who2 ACTIVE when the execute command was running and CPUTime and DiskIO were still hung at 0 (Runnable). I had no Queue length increase, no noticeable errors from the Raid card or volume. The volume itself did not hand this time, but did the next time I let it run from the job itself.

    If I run one log backup and wait a few minutes, the next log backup takes a few minutes. If I execute multiple in a row, the first one takes a long time and those afterwards are fast to execute.

    My theories are still on Disk IO subsystem issue, Raid controller card, or now product that looked as reported to be executing fine in earlier research. I will likely be taking down the RedGate backup to use native log backups and see if I am still getting these hangups.

  • matt.newman

    Hall of Fame

    Points: 3910

    Have a ticket in with RedGate. Have a ticket in with Confio, but need to update support and version before moving forward with Confio. Confio could not work with the logs I sent them. I have converted one backup to a native backup process and it completes in less than a second (3 milliseconds - 0 pages). With RedGate backup, it will stall for minutes, sometimes logged where it starts sometimes not (via Confio), with MSQ_XP wait, and then kicks off the actual backup and comes back with results of completing in milliseconds. The command was run with and without the jobs that normally run them, and same long wait for running the backups after not having run for many minutes between tests. Confio has the timestamp matter where it is off by a minute and a second for something and I have not seen out of sync time stamps on the machines hosting this. Also, the big chunks of missing data when outages occur, but at the same time we had someone run a bad query that pegged CPU and DISK and Confio picked that up just fine.

    Are there any known PerfMon counters I could dump on that may indicate what the root cause is? Anything seen as missing in diagnostics? Sadly, I look and feel like a headless Aardvark running around on this one. I have no indicators saying "there is a problem" except missing data chunks in Confio data, and user reports of locking. It would be nice to be able to catch these things. CPU going low does not happen all the time so I can rig up something for that but is still not a catch all for a problem like this.

  • VinicioAizpurua

    SSC Veteran

    Points: 272

    I had kind of the same problem. Have you tried to change the network cable, plug into another port in the Switch? If you have more that one ethernet port, plug into the other one. By the way what is the brand/model is your server? How old is it?

    I hope this help you

    Regards

    Vinicio Aizpurua

    Miami-FL

Viewing 11 posts - 31 through 41 (of 41 total)

You must be logged in to reply to this topic. Login to reply