The semaphore timeout period has expired

  • Hi to all,

    Having just moved to a new firm, I was asked to find out why most of the clustered servers were reporting "SEMAPHORE TIMEOUT" every now and then, so I dug in.

    This message occurs during all kinds of I/O to the subsystem (which resides on the SAN).

    The exact error message is:

    Executed as user: . TCP Provider: The semaphore timeout period has expired. [SQLSTATE 08S01]

    (Error 121) Communication link failure [SQLSTATE 08S01] (Error 121). The step failed.

    Hardly any info is to be found on the Internet, so I need your help, guys!

    As far as I have found out, this is an OS error that is written to the SQL Server 2005 logs just before contact is lost with one part of the subsystem. It could be a hardware response timeout or an autodetect setting on the cluster network interface card. Others say it could be a memory issue on a Windows 2003 SP1 cluster node with SQL 2005 Ent. 64-bit SP1, or a bug/failure to properly release cached memory...

    Who has dealt with this issue before and/or can help me out?

    Regards,

    GKramer

    The Netherlands

  • We had exactly the same problem with a cluster using SAN disks. However, only one node showed this behaviour, and it happened under high I/O, especially around full backups. We updated the firmware etc. and installed the latest drivers but still had the problem on that one node, so in the end we replaced the node.

  • DNA,

    Thanks for the comforting thought... we have several hundred nodes showing this problem (randomly)...

    GKramer

  • Hope you find the root cause and don't have to start swapping out servers! We don't have hundreds of clusters, only 20 or so, and only one node on one cluster had the problem.

  • Hi!

    I encountered the same problem too.

    ODBCQuery: SQLSTATE: 08S01. Error: (121). Msg: [Microsoft][SQL Native Client]TCP Provider: The semaphore timeout period has expired.

    ODBCQuery: SQLSTATE: 08S01. Error: (121). Msg: [Microsoft][SQL Native Client]Communication link failure.

    ODBCQuery: SQLSTATE: 08S01. Error: (10054). Msg: [Microsoft][SQL Native Client]Communication link failure.

    If anybody can share what causes these errors, please do. I also hope to hear any solutions you might suggest.

    Thanks.
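For anyone triaging logs like the three lines quoted above, here is a rough Python sketch (not from the original poster) that pulls the SQLSTATE and OS error code out of each line and counts them. The regular expression only assumes the "SQLSTATE: ... Error: (...). Msg: ..." shape shown in this post; the helper names `parse_odbc_line` and `summarize` are made up for illustration. The two code meanings are the standard Windows ones: 121 is the Win32 error ERROR_SEM_TIMEOUT and 10054 is the Winsock error WSAECONNRESET, so both point at the network path rather than at SQL Server itself.

```python
import re
from collections import Counter

# Standard Windows meanings of the OS error codes seen in this thread.
OS_ERRORS = {
    121: "ERROR_SEM_TIMEOUT (the semaphore timeout period has expired)",
    10054: "WSAECONNRESET (connection reset by peer)",
}

# Assumes the line shape quoted in the post:
# "ODBCQuery: SQLSTATE: 08S01. Error: (121). Msg: ..."
LINE_RE = re.compile(
    r"SQLSTATE:\s*(?P<state>\w+)\.\s*Error:\s*\((?P<code>\d+)\)\.\s*Msg:\s*(?P<msg>.*)"
)


def parse_odbc_line(line):
    """Return a dict describing one log line, or None if it does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    code = int(match.group("code"))
    return {
        "sqlstate": match.group("state"),
        "code": code,
        "meaning": OS_ERRORS.get(code, "unknown"),
        "message": match.group("msg").strip(),
    }


def summarize(lines):
    """Count how often each OS error code appears in the given log lines."""
    parsed = (parse_odbc_line(line) for line in lines)
    return Counter(entry["code"] for entry in parsed if entry is not None)
```

Running `summarize` over a day's worth of job history and comparing the timestamps with backup or reindex windows is one cheap way to check whether the drops cluster around heavy I/O, as several posters here suspect.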

  • Sorry to bump an old post, but was this ever resolved? We have the same issue on a cluster and can't pinpoint where the connection is being dropped.

  • Foxxo,

    This problem still exists and no cause can be found.

    I raised this with the Microsoft helpdesk (we've got platinum support), but they cannot solve it either.

    Recently a Microsoft engineer was supporting us on a project, and I asked him if he was familiar with this problem, but he was left scratching his head as well.

    As mentioned in the opening post, this is still a random issue that occurs every now and then. The timeout could be related to several parts of the hardware and/or probably to some software as well...

    No solution yet!

    I hope this will be solved when we move over to MS Windows 2008 (running SQL Server 2008) within several months.

    I can give you one small hint though:

    Examine your HBA's queue depth and set it to at least 128.

    We're hosting our SAN and they once set this queue depth to 4...

    By raising this setting we got an 8x speed-up on our most heavily used systems and will probably lose the semaphore issue... (I hope)

    Regards,

    Guus Kramer
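To see why a queue depth of 4 can hurt so much, here is a back-of-the-envelope Python sketch based on Little's Law (outstanding requests = throughput x latency). It is not something the poster ran; the function name `max_iops` and the 5 ms per-I/O latency are assumptions picked purely for illustration.

```python
def max_iops(queue_depth, latency_ms):
    """Upper bound on I/Os per second for a given number of outstanding
    requests and an average per-request latency (Little's Law)."""
    return queue_depth * 1000.0 / latency_ms


# Assumed latency of 5 ms per I/O; measure your own SAN before trusting this.
ASSUMED_LATENCY_MS = 5.0

ceiling_at_4 = max_iops(4, ASSUMED_LATENCY_MS)      # 800.0
ceiling_at_128 = max_iops(128, ASSUMED_LATENCY_MS)  # 25600.0
```

This is only a ceiling, not a prediction: real gains depend on how much concurrent I/O the workload actually issues and on what the array can sustain, so a measured speed-up well below the theoretical ratio is not surprising.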

  • Hello Guus,

    Did changing that setting ever resolve the problem?

  • Grasshopper,

    Unfortunately this didn't solve the semaphore issue...

    It is just increasing on some of our servers, causing a lot of network timeouts that result in processes ending / data loss (and a lot of extra work). This is driving us to the edge :-)

    Still no permanent solution, even though we raised this again with Microsoft's support team.

    Guus

  • We just encountered the same problem yesterday during reindexing on a SQL 2005 SP2 EE 64bit cluster running on Windows Server 2003 EE SP2. We have three reindexing jobs running at the same time on three drives. One of the three jobs failed with the error below.

    Message

    Executed as user: BLAIRNET\sqlsaservice. TCP Provider: The semaphore timeout period has expired. [SQLSTATE 08S01] (Error 121) Communication link failure [SQLSTATE 08S01] (Error 121). The step failed.

    We replaced the motherboard on this server a few days ago to address a memory issue. The following KB article indicates this can be a firmware/motherboard issue, but the firmware is not that old and the KB references 32bit. We are 64bit.

    Dave

  • Hi Guys

    I have had this problem for a long time, but found the solution in this thread here on SQLServerCentral:

    http://www.sqlservercentral.com/Forums/Topic399630-149-1.aspx

    Try it out, it worked for me.

    //SUN

  • Hi folks. I've been trying to solve this for many months on our systems. Here's my situation:

    * Two identical servers (Dell PE 840, Quad core, 4GB RAM, 300GB 10,000 rpm RAID 1) running Windows 2003 R2 SE SP2 (x32), each running SQL Server 2005 SE SP3 on two distinct GB subnets (using Netgear GS605 GB Switches).

    * I have maintenance plans (MP) on each server to do Full DB backups each night and Log backups every 30 minutes to a device on another computer on that subnet (WinXP Pro SP2, GB ethernet). I specify the device using a UNC name to the share on the WinXP box (e.g., \\winxpbox\sharename).

    * In the MP, I specify creating separate folders on the backup device (e.g., \\winxpbox\master, \\winxpbox\msdb, etc.)

    * I have been keeping 4 weeks of backup files, which became a sizeable number of files (e.g., 1,300) in each folder where logs were actually taken. My full backups are about 350MB (not very large), and my logs are typically around 1.5kb, so I'm not moving a lot of data. BTW, I can copy tens of GBs of data across the LAN with windows explorer and have never had a failure -- that I know of.

    * I have been receiving the Semaphore timeout error randomly since setting up the MPs (previously I used xcopy in a DOS script to do backups, which never had a problem across the LAN). I can immediately run the backup again once I receive the error, and it will work fine. Also, the error occurs on both the .bak files and the .log files, with no apparent preference as to which database, small or large (e.g., msdb, mydb, etc.).

    I've tried the following things to troubleshoot the problem to no avail:

    * I've noticed that when I reset the Netgear switches, I seem to have better luck with no errors for a while, then they come back with more regularity over time. Thinking about replacing the switches, or playing with the NIC settings on both machines. I have verified that both NICs are autosensing, and both show good quality links at GB speed. It could be switch related or other NIC settings (QoS is not enabled on my NICs, so I may try this too)...

    * I have tried moving my backup device to other machines with similar configurations, but different HW, and even NW speeds, and they all exhibit the same general failure behavior. So I don't believe it is HW/disk related...

    * I have noticed that when I access the shared device via UNC across the network, the window opens and shows the flashlight rolling around before finally populating the window with the directory contents. I began wondering if SQL Server gave up while waiting to get this information back from the device across the LAN while doing backups. With so many files, it could also be disk fragmentation causing the issue. I changed my MP to keep only 2 weeks of backups, and tightened my log schedule to run only when the DB is being updated. This reduced the file count to around 500. I'll see if this makes a difference. I also mapped a drive to the backup device share on both servers, but did not change the MPs to use them, just to see if that keeps the OS at attention on those remote shares. Nope.

    At any rate, this is where I am now, and I'll let you know if the latest changes do any good. I really believe that this issue is a symptom of SQL Server backup timing logic and use of UNC devices. I saw a post where Red Gate SQL Svr backup SW exhibited the same problem with NAS, which was supposedly solved by lowering the block size written by the backup program. Don't know if that is even possible in SQL Server. May end up having to write backups to local disk and use COPY to move them. How quaint.
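The "back up locally, then copy" fallback mentioned above could look roughly like the Python sketch below. It is only an illustration under assumptions: the function name `copy_with_retry`, the three attempts, and the 30-second pause are all made up, and it relies on the observation in this post that a rerun right after a failure usually succeeds. The backup itself would still be written to local disk by SQL Server first; this only covers the move to the share.

```python
import shutil
import time
from pathlib import Path


def copy_with_retry(src, dest_dir, attempts=3, delay_seconds=30.0, sleep=time.sleep):
    """Copy src into dest_dir, retrying when the OS reports an I/O error.

    dest_dir may be a UNC path such as the share used for backups.
    Returns the path of the copied file, or raises RuntimeError once
    every attempt has failed.
    """
    src = Path(src)
    target = Path(dest_dir) / src.name
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            shutil.copy2(src, target)
            # Cheap sanity check: the copy must be the same size as the source.
            if target.stat().st_size != src.stat().st_size:
                raise OSError(f"size mismatch after copying {src.name}")
            return target
        except OSError as exc:
            last_error = exc
            if attempt < attempts:
                sleep(delay_seconds)  # give the network path time to recover
    raise RuntimeError(
        f"giving up on {src} after {attempts} attempts"
    ) from last_error
```

Retrying hides the symptom rather than fixing the cause, so it is worth logging every failed attempt to keep track of how often the link actually drops.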

  • Hi,

    Try the reg keys mentioned in my previous post. It has now been more than a week since the change and I have had no timeouts or comm link failures.

    //SUN

  • Thanks Soren, I'll give it a try and report back. I may give it a couple of days yet to see if the reduction in number of files in the folder has any effect. If so, it probably won't fix the problem for good, but will point to the fix you reposted as a likely long term fix. It would be nice if the fix for everyone was really this simple!! 😉

  • Well, I had a backup fail tonight on one of my two servers, so reducing the number of files in the device share directory didn't seem to help. So I've added the registry key entries Soren suggested to one of my servers to see if that helps. Question for Soren: does this change need to be made on both ends of the connection (i.e., also on my WinXP Pro box where the backups are written, if those keys are even used there)? It seems doing it on one box might not be enough. What did you do?

    I'll let you guys know what happens. Got my fingers crossed and I'm feeling the love 😉

    Shane

