Hi folks. I've been trying to solve this for many months on our systems. Here's my situation:
* Two identical servers (Dell PE 840, quad core, 4GB RAM, 300GB 10,000 rpm RAID 1) running Windows 2003 R2 SE SP2 (32-bit), each running SQL Server 2005 SE SP3. Each server is on its own gigabit subnet (using Netgear GS605 gigabit switches).
* I have maintenance plans (MP) on each server that do full DB backups each night and log backups every 30 minutes to a device on another computer on that subnet (WinXP Pro SP2, gigabit Ethernet). I specify the device using the UNC name of the share on the WinXP box (e.g., \\winxpbox\sharename).
* In the MP, I specify creating a separate folder for each database on the backup device (e.g., \\winxpbox\sharename\master, \\winxpbox\sharename\msdb, etc.).
* I have been keeping 4 weeks of backup files, which added up to a sizeable number of files (e.g., 1,300) in each folder where log backups were actually taken. My full backups are about 350MB (not very large), and my log backups are typically around 1.5KB, so I'm not moving a lot of data. BTW, I can copy tens of GBs of data across the LAN with Windows Explorer and have never had a failure that I know of.
* I have been receiving the semaphore timeout error at random ever since setting up the MPs (previously I used xcopy in a DOS script to do backups, which never had a problem across the LAN). When I get the error, I can immediately run the backup again and it works fine. The error occurs on both the .bak files and the .log files, with no apparent preference as to which database, small or large (e.g., msdb, mydb, etc.).
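Since an immediate rerun succeeds, one stopgap is to retry the failing step automatically. Below is a minimal Python sketch of that idea. `run_with_retry` and its parameters are hypothetical names made up for illustration, and it assumes the failure surfaces to the script as an `OSError` (on Windows, the semaphore timeout is system error 121, ERROR_SEM_TIMEOUT). It only masks the symptom; it does not explain it.

```python
import time


def run_with_retry(action, attempts=3, delay_seconds=5.0, sleep=time.sleep):
    """Call action() and return its result, retrying when it raises OSError.

    Makes at most `attempts` calls in total and pauses `delay_seconds`
    between them. If every call fails, the last OSError is re-raised.
    `sleep` is injectable so the pause can be skipped in tests.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except OSError:
            if attempt == attempts:
                raise  # out of retries: let the caller see the error
            sleep(delay_seconds)
```

Any step that touches the share could be wrapped this way, for example a file copy: `run_with_retry(lambda: shutil.copy2(src, dst))`, where `src` and `dst` are placeholder paths.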
I've tried the following things to troubleshoot the problem to no avail:
* I've noticed that when I reset the Netgear switches, I seem to go without errors for a while, then they come back with increasing regularity over time. I'm thinking about replacing the switches, or playing with the NIC settings on both machines. I have verified that both NICs are autosensing, and both show good quality links at gigabit speed. It could be related to the switches or to other NIC settings (QoS is not enabled on my NICs, so I may try enabling it too)...
* I have tried moving my backup device to other machines with similar configurations but different HW, and even different network speeds, and they all exhibit the same general failure behavior. So I don't believe it is HW/disk related...
* I have noticed that when I access the shared device via UNC across the network, the window opens and shows the flashlight rolling around for a while before finally populating with the directory contents. I began wondering whether SQL Server gives up while waiting to get this information back from the device across the LAN during backups. With so many files, disk fragmentation could also be causing the issue. I changed my MP to keep only 2 weeks of backups, and tightened my log backup schedule to cover only the times when the DB is being updated. This reduced the file count to around 500. I'll see if this makes a difference. I also mapped a drive to the backup device share on both servers, without changing the MPs to use it, just to see if that would keep the OS at attention on those remote shares. Nope.
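To put a number on the slow directory listing rather than eyeballing the flashlight, the listing can be timed from one of the servers. This is a rough Python sketch under the assumption that Python is available on that machine; `time_listing` is a hypothetical helper name, and it measures only how long one full enumeration of the folder takes, not anything SQL Server does internally.

```python
import os
import time


def time_listing(path):
    """Return (entry_count, elapsed_seconds) for one full listing of `path`."""
    start = time.perf_counter()
    with os.scandir(path) as entries:
        # Walk every entry so the whole directory is actually enumerated.
        count = sum(1 for _ in entries)
    elapsed = time.perf_counter() - start
    return count, elapsed
```

Running `time_listing(r"\\winxpbox\sharename")` before and after trimming the retention period would show whether the file count really changes how long the share takes to answer.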
At any rate, this is where I am now, and I'll let you know if the latest changes do any good. I really believe that this issue is a symptom of SQL Server's backup timing logic and its use of UNC devices. I saw a post where Red Gate's SQL Server backup software exhibited the same problem with a NAS, which was supposedly solved by lowering the block size written by the backup program. I don't know if that is even possible in SQL Server. I may end up having to write backups to local disk and use COPY to move them. How quaint.
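On the block-size question: SQL Server's native BACKUP statement does accept MAXTRANSFERSIZE, BUFFERCOUNT and BLOCKSIZE options in its WITH clause. The Python sketch below only builds such a statement as a string so the options are visible; it does not connect to a server. `build_backup_sql` is a hypothetical helper, the 64 KB default is simply the smallest value MAXTRANSFERSIZE allows and not a recommendation, and whether a smaller transfer size actually avoids the semaphore timeout is untested here.

```python
def build_backup_sql(database, unc_file, max_transfer_size=65536, buffer_count=None):
    """Return a T-SQL BACKUP DATABASE statement with explicit transfer options."""
    # MAXTRANSFERSIZE must be a multiple of 64 KB, from 64 KB up to 4 MB.
    if max_transfer_size % 65536 != 0 or not 65536 <= max_transfer_size <= 4194304:
        raise ValueError("max_transfer_size must be a multiple of 65536 up to 4194304")
    options = ["MAXTRANSFERSIZE = %d" % max_transfer_size]
    if buffer_count is not None:
        options.append("BUFFERCOUNT = %d" % buffer_count)
    # Escape the identifier and the string literal for T-SQL.
    db = database.replace("]", "]]")
    target = unc_file.replace("'", "''")
    return "BACKUP DATABASE [%s] TO DISK = N'%s' WITH %s;" % (db, target, ", ".join(options))
```

For example, `build_backup_sql("mydb", r"\\winxpbox\sharename\mydb\mydb.bak")` produces a statement that could be pasted into a query window or a job step for a one-off test; the file name is illustrative.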