January 12, 2015 at 11:12 am
Hello,
I am hoping to get some direction on how to best troubleshoot a recent ongoing issue:
Issue: Native backups to NAS do not complete.
We have been experiencing an issue whereby our native backups are hanging with statuses SUSPENDED/RUNNABLE.
I ran SELECT * FROM sys.sysprocesses. All of the backup SPIDs show BACKUPTHREAD/PREEMPTIVE_OS_FILEOPS.
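The same information can also be pulled from the newer DMV, sys.dm_exec_requests (a sketch; sys.sysprocesses is deprecated as of SQL Server 2012):

```sql
-- Show active backup/restore requests and what they are waiting on
SELECT r.session_id,
       r.command,             -- e.g. BACKUP DATABASE
       r.status,              -- SUSPENDED / RUNNABLE
       r.wait_type,           -- e.g. PREEMPTIVE_OS_FILEOPS
       r.wait_time,           -- ms spent in the current wait
       r.percent_complete,
       r.total_elapsed_time
FROM sys.dm_exec_requests AS r
WHERE r.command LIKE 'BACKUP%'
   OR r.command LIKE 'RESTORE%';
```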
This first occurred last Wednesday evening. When I discovered this on Thursday, I attempted to kill the backup jobs. This also hung with 0% completed/0% time remaining. Backups hung on more than one instance.
That evening, I attempted to restart the instance which also failed with something along the lines of: could not start MASTER file in use.
I then restarted the server--which I really did not want to do--and this cleared it. I was also able to manually kick off maintenance plans (DBCC CHECKDB and full backup) without issue.
I was off Friday and the weekend. I came in this morning and found the maintenance plans (diff/tlog backups) did not complete on some of the instances--in one case, the instance affected now was not affected before. They appeared to have hung at their next scheduled kickoff, which was later that night after I went home.
Remembering the "file in use" error, I have run process monitor to see if anything unusual had a lock on any files. I saw only SQL Server and Double-Take processes accessing log files.
Being a relatively new DBA, I am unsure where to go next in trying to track down the cause of this issue. This is fairly urgent, as one of the instances that has had this problem both times is our production SharePoint environment.
I'd appreciate any suggestions on what to look at next.
ENVIRONMENT:
SQL version:
Microsoft SQL Server 2012 (SP1) - 11.0.3368.0 (X64)
May 22 2013 17:10:44
Copyright (c) Microsoft Corporation
Enterprise Edition: Core-based Licensing (64-bit) on Windows NT 6.2 <X64> (Build 9200: )
OS:
Windows Server 2012 Standard
Installed Software:
Double-Take
Commvault filesystem/SQL agents
MS System Center End Point Protection
January 12, 2015 at 12:04 pm
That wait type means that SQL Server is waiting on the operating system. It sounds like you may have a problem with your NAS. Is it sharing the network with everything else? I'm not a NAS expert, but the indications you've provided suggest that's where the problem is. I'd get with your systems people or storage people and walk them through the issue. See if there are logs on the NAS OS that suggest an issue.
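If it helps, you can confirm that the OS-level waits are the ones piling up by checking the cumulative wait stats (a sketch against sys.dm_os_wait_stats):

```sql
-- Cumulative waits since the instance started; with a NAS problem,
-- the PREEMPTIVE_OS_* file waits should dominate
SELECT wait_type, waiting_tasks_count, wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE 'PREEMPTIVE_OS_%'
ORDER BY wait_time_ms DESC;
```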
"The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
- Theodore Roosevelt
Author of:
SQL Server Execution Plans
SQL Server Query Performance Tuning
January 12, 2015 at 12:13 pm
Thank you for your response. 🙂 The NAS is where I am heading, too. I assume the backup is running into an issue creating the file on the NAS, which is causing it to hang. It is strange that our other SQL Servers, which have similar jobs backing up to the NAS, are not experiencing the issue.
What troubles me is that when I restarted the instance service last Thursday, it failed with "file [master db] in use." I wouldn't expect to see this if it were simply a network/NAS issue. However, I freely admit that I am not an expert and may not be aware of other factors that would make sense of it.
Also, in rereading my post, I should note that the database files are local on the server. Only the backups are going to the NAS.
Are there any suggestions for resolving these hung jobs without restarting the service/server?
EDITED: for clarification and grammar.
January 12, 2015 at 12:30 pm
Restarting the server because of a NAS problem, even if it's just backups, doesn't give me a warm & fuzzy, at all. Combining the fact of your NAS problems with the file in use error for one of your databases, I'm actually a little concerned about the system in general. Rebooting it over & over is unlikely to do it, or your databases, any good. However, if you have processes that are waiting for a resource that is effectively offline, killing that process may not clear appropriately. So restarting the service is probably what you'll have to do to get them clear. And again, that makes me very uncomfortable to say. Best thing to do is find out what the heck is going on and leave the running processes in place until you can resolve it, assuming they're not causing blocking.
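For what it's worth, after issuing a KILL you can at least watch whether the rollback is making any progress (the SPID below is hypothetical):

```sql
KILL 53;                  -- hypothetical SPID of the hung backup
KILL 53 WITH STATUSONLY;  -- reports estimated rollback completion;
                          -- a session stuck in an external OS call
                          -- may sit at 0% indefinitely
```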
January 12, 2015 at 12:40 pm
Thanks again, Grant.
I agree about not having a warm feeling about the current state.
Are there any suggestions on where to look or what tools to use to try to figure this out? Using Procmon, I don't see anything locking the files at the OS level, and all the database files appear to pass integrity checks.
I do see some SQL blocking preventing SharePoint tasks from completing, but no reports of issues from staff.
At this point, I have asked that we kick off Commvault backups of everything on this server, as we haven't had log backups on the affected instances since Thursday night.
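For the record, this is roughly how I'm confirming when each database was last backed up, from the msdb history tables (a sketch):

```sql
-- Most recent backup of each type per database, from msdb history
SELECT d.name,
       bs.type,                       -- D = full, I = diff, L = log
       MAX(bs.backup_finish_date) AS last_backup
FROM sys.databases AS d
LEFT JOIN msdb.dbo.backupset AS bs
       ON bs.database_name = d.name
GROUP BY d.name, bs.type
ORDER BY d.name, bs.type;
```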
January 12, 2015 at 4:05 pm
Honestly, I don't know where to go on the NAS, but that's got to be where some of the source information is.
January 13, 2015 at 5:40 pm
Thanks again, Grant.
I have switched our backups to Commvault, and they have been running without issue. I did have to restart the SQL server to get the backups running, as any attempt to back up a database where the previous backup hung (predictably) failed.
There was something disconcerting about the restart of the server--namely, it took 45 minutes. This is a 32-core, 256GB RAM box.
Could the hung (SQL native) backup jobs have caused this 45-minute delay in restarting (i.e., rolling back changes)?
It was interesting to watch each SQL instance service stop one by one during those 45 minutes. The server finally rebooted 2-3 minutes after all the SQL services had stopped (the last one was Integration Services).
I am hoping that the hung jobs were the result of some communication issue between the local box, where the database files reside, and the NAS, where the maintenance plans store the backup files; and that the trouble rebooting last night was caused by those backup jobs having been hung for 3.5-4 days. However, I am not sure how likely this is.
I have opened tickets with our hardware vendor and Microsoft to assist with finding the cause--there isn't much in the logs (SQL or event).
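In case it's useful to anyone, the recovery/rollback timing messages from the restart can be pulled straight from the SQL Server error log (xp_readerrorlog is undocumented, so treat this as a sketch):

```sql
-- Search the current error log (0) of type SQL Server (1)
-- for database recovery messages logged during the restart
EXEC sys.xp_readerrorlog 0, 1, N'Recovery of database';
```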
BTW, I have your 2012 SQL query performance tuning book in my library. Every time I pick it up, I realize I need to continue to hone my SQL skills :laugh:
January 13, 2015 at 7:48 pm
Yeah, probably rolling back those transactions. Glad the recovery on the other end wasn't as long or longer.
Good luck with the tickets. If you do get a more complete answer, post it back here. I'm very curious to know what's up.
I sure hope the book proves useful.
January 14, 2015 at 12:28 am
Can you check on your NAS that the target folder isn't compressed?
I've seen SQL Server backup commands have issues with Windows compressed folders.
Johan