Sporadic query timeouts / query not showing in trace

  • We have been having sporadic query timeouts in our production environment (SQL 2005 Enterprise SP1 A/P cluster), they been a constant thorn in my side! In almost all of these cases the query its self does not seem to be the problem, the tables are well tuned and the query will run in management studio in a few seconds at most, its only when this is run from an app via ADO.

    To replicate the timeout problem I have created an App (C#, ADO) to run a simple select in a loop until the timeout occurs. I am queering a newly created test table which is not related to any other apps to rule out blocking. This table has one column “number” with the digit “1”. I also append a comment to the end of the query with the loop counter so that I can uniquely distinguish each instance in the trace (Select * from TestTable --1, Select * from TestTable --2, etc..) Usually within 20k iterations I see a timeout. The bizarre part is that the instance which times out is no where to be found in the trace (no filter), there is no SQL:BatchStarting event class. I can see the batch start and stops for all successful instances. I expected to see a query instance start with a beyond 30 second duration, the obvious timeout scenario, but I am getting nothing in the trace. We have run a packet sniffer to monitor traffic between the test app and the server. In the timout case we see the outbound query packet going to SQL and then a sequence of keep alive requests until the timeout. Since it appears the query is at least getting to the server I do not know why it’s not showing up in the trace, its as if its never making it to SQL.

    Any suggestions of tools, other trace event classes, or settings changes I should make would be great. I have been researching this on and off for the past few days and am stumped!

    Thanks,

    Aaron

  • Your packet sniffer seems to be saying you see the outbound packet to the SQL server, per what you wrote. Have you checked to make sure it arrives at the SQL server?

    My first thought on this is a router or switch between you and the server that's dropping packets. I'd test that very, very thoroughly, before I bothered checking anything else.

    The fact that it's not in the trace means the request never made it to the SQL engine. Simple as that.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

  • The sniffer is actually running on the SQL server. We could see the accepted packet at the server, at least that’s what my network admin is telling me. I am not too experienced with network traffic monitoring / packet analysis, we usually leave this to our network team. But now we are in the situation where the network team says the packet makes it, and we on the database side say it’s not in the trace so SQL doesn’t see it. We are going to run another test this afternoon to verify the sniffers’ previous findings.

    We were seeing some network errors in our apps a few months ago and replaced the NICs in both cluster nodes, so those are virtually new. I too have been questioning the router/switches. Another interesting fact is that this test app ran 200k transactions against our development DB without a timeout. I am not sure if these severs share a switch/router, that would be interesting to know, I will have to find out.

  • We were having a periodic timeout issue recently, replaced routers and switches, replaced NICs, all to no joy. Then found one of the cables had a flaw in it. Replaced that, and the problem went away.

    - Gus "GSquared", RSVP, OODA, MAP, NMVP, FAQ, SAT, SQL, DNA, RNA, UOI, IOU, AM, PM, AD, BC, BCE, USA, UN, CF, ROFL, LOL, ETC
    Property of The Thread

    "Nobody knows the age of the human race, but everyone agrees it's old enough to know better." - Anon

Viewing 4 posts - 1 through 4 (of 4 total)

You must be logged in to reply to this topic. Login to reply