Requesting help with troublesome SQL Server Lockup

  • Problem Description: Every 2 to 3 days I come in and find that I can't connect to my SQL instance with time out errors. The services (SQL Server and SQL Server Agent) seem to be running fine.

    I am able to connect using the DAC (Dedicated Administrator Connection) to the activity monitor and also a single query window which allows me to run some diagnostics but so far I am unable to track the source of the lockup. Probably because I'm not experienced with understanding the output of the activity monitor and the resultant output files from my queries.

    I have at least 3 snapshots of the following query results which I can share to assist in troubleshooting along with some screen shots of the activity monitor. Currently available are:

    sys.dm_exec_connections.rpt

    sys.dm_exec_requests.rpt

    sys.dm_exec_sessions.rpt

    sys.dm_os_wait_stats.rpt

    sys.dm_os_wait_tasks.rpt

    sys.dm_trans_locks.rpt

    I also have a dump of sysjobsteps.rpt to look up relevant job information from the above files.

    My big problem is that I'm not experienced or knowledgeable enough to look at these result files and pinpoint what is actually causing the server to hang/refuse connections.

    Some other information that might be relevant:

    - We reboot our server every night at 8PM.

    - Several of our jobs run against Linked Servers (foxpro, access and proprietary) and it may be that the first Linked Server job to run after the reboot is starting the hang process but I'm unsure and can't back up this idea.

    I'm really hoping someone can assist with this issue as it's been going on for over a month and I'm completely out of ideas. I am willing to provide whatever information is required to assist with the troubleshoot. I have attached a rar file that contains the latest dump of the above mentioned rpt files.

    Thank you very much for even reading all of this message.

    Best regards,

    - D

    Operating System: Windows Server 2003 Standard

    SQL Version: Microsoft SQL Server 2005 - 9.00.4035.00 (Intel X86) Nov 24 2008 13:01:59 Copyright (c) 1988-2005 Microsoft Corporation Standard Edition on Windows NT 5.2 (Build 3790: Service Pack 2)

  • Well none of the files show the blocked processes. how about giving the output of a sp_who2 when the problem is occurring.

    From looking at the other files a couple of notes:

    - you have some i/o issues as far as performance , looks like you are getting a ton of i/o waits especially if these are the stats since your nightly reboot. .. Look at optimizing performance for disks. May not specifically be the problem but could be contributing. look at your windows perf mon and see what you i/o wait times are.

    The other file shows several of your running processes with Exclusive table locks. Whatever those objects are I am sure they are locking or blocking other processes. This could possibly be fixed by having covering indexes or then again it could be the Sql itself(not committing at the proper times,poorly formed sql,etc). Several ways of looking into this, sorry to be vague on that but there are mutiple ways of going about troubleshooting the exclusive locks.

    MY guess is that you have cascading locks/block occurring meaning one thing starts gettting blocked, blocks the next one and so on until everything is locked up.

    Thanks,

    Mike McNeer

  • Mike,

    Thank you very, very much for even looking at this issue. I would be more than happy to run sp_who2 and provide the results. It will probably be another day or two before the lockup occurs again but I will certainly update this thread.

    Thanks again,

    - D

  • Mike,

    Server hung up again today. Attached are the latest file dumps including the sp_who2 dump you requested. I sincerely appreciate your efforts.

    Thanks much,

    - D

  • Looking at your sp_who2 you will notice this SPid

    62RUNNABLECDE\SVC-SQLFASDSQL1.CALSTARSINSERT139648426226614-Mar6:18:02SQLAgent - TSQL JobStep (Job 0x79F3492C9CE468409F08A2574351D7AA : Step 1)620

    This job has been running since 15:00 the previous day and still shows running at 6:18 am the next day..

    Notice the CPU time and the DIsk i/o time.. It is chewing all of your resources up so to speak. Then you havea backup kicking off, more jobs waiting on this job etc. , cpu appears to be constrained, i/o is constrained but that could all be a symptom of the job.

    However , From the wait types it appears it is waiting on the OLEDB source to finish.

    what is your oledb source? is it Oracle, a file , myssql or another sql server? Depending on that kinda depends on what i would look at first.

    I would find a way to possibly speed up the source you are pulling from and looking at the queries within that job/proc to make sure

    it is indexed properly.

    Also post machine specs and Sql server version please.

    Thanks,

    Mike

    That job equates to this job and step:

    2C49F379-E49C-4068-9F08-A2574351D7AA 1 Import Data Tables TSQL EXECUTE dbo.ap_ImportDataTables DEFAULT 4 NULL 0 1 2 4 2 NULL CALSTARS NULL 0 5 0 NULL 1 10729 0 20100302 50101 NULL 92988440-B2D2-4C7F-89C8-71147CFC7224

  • That job run time seems weird as we restart the physical server every night. Can the jobs persist over a server restart? We use several linked servers to bring in data from access, foxpro and a proprietary mainframe source. The CALSTARS job you used as an example runs against this (the mainframe) linked server.

    From above:

    Operating System: Windows Server 2003 Standard

    SQL Version: Microsoft SQL Server 2005 - 9.00.4035.00 (Intel X86) Nov 24 2008 13:01:59 Copyright (c) 1988-2005 Microsoft Corporation Standard Edition on Windows NT 5.2 (Build 3790: Service Pack 2)

    You looking for physical machine specs as well?

  • I wouldnt think it would have persisted over the reboot, maybe the reboot did not happen?

    On that note, Why are you rebooting every night? Shouldn't need to if you do, you have some other underlying issue but that's another topic 🙂

    The one job still showed as running at the time you took the snapshot though.

    Can you verify it starts at 15:00?

    I would do testing of this job manually running the sql or procs associated outside of the agent and see which part is taking so long. Could be a problem on the mainframe side not getting your data out quickly enough or a problem with the link betweeen the two not performing but it definitely showed to be waiting on the source to respond. At some point it begins waiting on cpu cycles, have someone verify this when the problem happens using perf mon and monitor cpu usage. I would bet you are maxed out at the point you can not get into the box.

    Good luck,

    Mike McNeer

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply