I am hoping someone has ideas about where to start looking for the problem, or can spot something I missed.
Problem: a few times per day, internal traffic to our SQL Server "freezes", and our web interface and VB 6 Windows application "freeze" with it. External web surfing and IP phone communications are not affected. Each "freeze" lasts about 2 to 5 minutes. Every workstation is affected at the same time, for the same duration. The problem occurs AROUND 15-minute intervals, but typically does not start exactly on the mark; instead it starts a minute or two away. For example: 9:00:50, 9:17:10, 9:47:05, 10:00:30. These times have been consistent for the past 3 days. When the "freeze" is over, everything simply resumes as though nothing happened.
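To make the "around the 15-minute mark" pattern concrete, here is a minimal Python sketch that computes how far each example time sits from the nearest quarter-hour. The function name `offset_from_quarter_hour` is made up for illustration; the times are the four listed above.

```python
from datetime import datetime

# Freeze start times taken from the examples above.
FREEZE_STARTS = ["9:00:50", "9:17:10", "9:47:05", "10:00:30"]

def offset_from_quarter_hour(clock: str) -> int:
    """Signed seconds from the nearest 15-minute mark (positive = after the mark)."""
    t = datetime.strptime(clock, "%H:%M:%S")
    seconds_into_hour = t.minute * 60 + t.second
    remainder = seconds_into_hour % 900  # 900 s = 15 minutes
    # Past the halfway point, the next mark is the nearer one.
    return remainder if remainder <= 450 else remainder - 900

for clock in FREEZE_STARTS:
    print(clock, offset_from_quarter_hour(clock))
```

For these four times the offsets are +50, +130, +125 and +30 seconds, so each listed freeze starts shortly after a quarter-hour mark rather than before it.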
If anyone has suggestions for things I should look at, it would be much appreciated. Below are the things I have looked into so far, none of which has shown what is causing the problem.
I have Confio Ignite 8 and am working with their support. We don't really see anything, other than that during these "freezing" times the gathering of hardware data, and sometimes of SQL statement information, stops. Nothing is recorded and Ignite shows chunks of missing data. When it can pick up some data, nothing shows strain on the server.
I have perfmon counters set for CPU, Memory, and Disk. During the "freezing", I see no anomalies or large strain. I see nothing to indicate the server is physically struggling.
I have captured sp_who2, sp_who2 active, and custom "running SPIDS" query results during these "freezes" and see nothing hitting the server hard. I see a normal number of open connections, nothing being dropped, and a normal number of running connections with normal activity. There is no blocking. There is a very small number of transactions in queue.
I have checked the SQL Server error log and found nothing of value for this case.
I even went to the pain of running a Profiler trace against production during the times I knew there would be a "freeze", to pull ANY errors from SQL Server. Nothing but some log completion events, no errors.
I have moved some SQL Agent jobs to run at different times.
I have worked through a few things with RedGate support for log shipping, since we ship logs at 15-minute intervals (though exactly on the 15-minute marks).
Only once has a user reported an error. As reported by the user:
9:00 am 2 mins
9:32 am 2 mins
9:46 am 5 mins
This error came up once it unfroze:
Error at: 9/11/2012 9:49:19 AM
Error: [DBNETLIB][ConnectionWrite (send()).]General network error. Check your network documentation.
I have had the Network Admin check the event logs on the SQL server, and the switch logs. Nothing came up during the problematic times.
We have an audit mechanism that shows us how long individual queries take. During these freezes, the recorded execution time SOMETIMES spikes. It seems the app keeps the connection but nothing happens on it; after 2 to 5 minutes it resumes, and the audit logs that the execution took longer. The mechanism is built into the application: it takes a timestamp, runs the query, takes another timestamp, then logs what was executed and how long it took. There are spikes, but this doesn't really give detail as to WHY. Our main app has a built-in timeout of 30 minutes; the web application, where no errors are seen, has between 2 and 5 depending on what is being used. Whatever is being used, no one is reporting timeouts or thrown/shown errors.
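For anyone unfamiliar with this kind of timing audit, here is a minimal sketch of the timestamp / run / timestamp / log pattern. It is written in Python rather than the app's VB 6 and is not the actual application code; `run_audited`, `audit_log` and `fake_query` are hypothetical names used only for illustration.

```python
import time
from datetime import datetime

audit_log = []  # hypothetical in-memory stand-in for the real audit store

def run_audited(label, query_fn):
    """Take a timestamp, run the query, take another, then log what ran and how long."""
    started = datetime.now()      # wall-clock time, to line up with freeze reports
    t0 = time.monotonic()         # monotonic clock for the duration itself
    try:
        return query_fn()
    finally:
        elapsed = time.monotonic() - t0
        audit_log.append({"label": label, "started": started, "seconds": elapsed})

def fake_query():
    # Placeholder for a real database call.
    time.sleep(0.05)
    return 1

result = run_audited("SELECT 1", fake_query)
print(audit_log[-1])
```

An audit like this only measures total elapsed time as seen by the client, so a query that is slow on the server and a query whose packets are stalled somewhere in between produce the same kind of spike. That is consistent with the spikes showing up without any indication of the cause.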