Intermittent Login Timeout

  • We have an application that experiences a SQL login timeout at the same time every night.  The app establishes thousands of SQL connections per day without any problem, but consistently experiences login timeouts at the same times each night. 

    I've monitored the SQL Server with profiler and perfmon, there is no heavy resource use on the box when these login timeouts occur.  There is the normal constant SQL activity, but the box is basically idling (CPU < 5%, queues very short).  No SQLAgent jobs running at this time.  I'm pretty adept with perfmon, have captured tons of different counters in counter logs, but there's just nothing helpful.

    To help diagnose this I wrote a CMD script that uses START to simultaneously

    • PING boxname
    • DIR \\boxname\c$
    • OSQL -Sboxname  -Q"select @@servername" -E (trusted connection)
    • OSQL -Sboxname  -Q"select @@servername" -Uxx -Pxxx (SQL connection)

    I run this script every 2sec, connecting from my box to the SQL server.  I run the tests with name and with IP address as an argument, to see if DNS is a problem.  So actually there are 8 tests run simultaneously.  I get exactly the same results either way, so I'm pretty confident DNS is not an issue.

    What I do see is all the tests succeed, except the OSQL -E tests get a login timeout three times per night, all spaced about a minute apart, and consistently at the same time each night.  In between the timeouts are several successful connection attempts.  Here is example output showing the timeout:

     START       FINISH     TEST                 RESULT

     4:01:04.67  4:01:05.08 PING BOX_IPADDR      time=44ms

     4:01:04.68  4:01:05.10 PING BOX_NAME        time=44ms 

     4:01:04.68  4:01:05.47 DIR \\BOX_IPADDR\c$  boot.ini

     4:01:04.70  4:01:05.47 DIR \\BOX_NAME\c$    boot.ini

     4:01:04.71  4:01:14.14 OSQL -E -SBOX_IPADDR Timeout

     4:01:04.68  4:01:14.05 OSQL -E -SBOX_NAME   Timeout

     4:01:04.72  4:01:05.60 OSQL -U -SBOX_IPADDR BOX_NAME

     4:01:04.76  4:01:05.59 OSQL -U -SBOX_NAME   BOX_NAME

    Except for the timeouts, the duration for these tests is same as it is when we run the tests any other time of day. 

    My tests experience timeouts at exactly the same time as the application gets them, so I'm sort of able to reproduce the timeout problem on a different box...I don't think the problem is specific to the application server.  What's perplexing, though, is that the problem application connects to SQL with a SQL connection, not trusted!  I'm unable to reproduce that.

    What I am seeking suggestions on is "what could be causing this?"  What resources are involved in processing either a SQL login or trusted login to SQL Server?  Since I've monitored the SQL box so closely, and seen nothing that would cause this symptom, I have to wonder if it's some kind of network or domain problem?  This SQL server is 1500 miles away from both my workstation and the problem application's server, and is in a separate domain.  But what kind of problem would be so specific, allow 99.999% of logins to succeed, yet consistently cause this same failure at same time every night?

    Thanks in advance if you have any ideas or suggestions.

  • Could very well be a latency/network issue - particularly with 1500 miles between the client and server.  Run a ping -t at the same time as your other test and see if you start seeing more dropped packets &/or latency.  One possible cause of the middle of the night behavior could very well be that the network is "working" a lot harder at night than during the day as batch processes, data replication, backups, etc. are run across the network?

    Joe

     

  • Thanks for reply.  I thought same thing, that's why part of my test was a PING run simultaneously with the OSQL command that failed to login.  The results are in my first post, 44ms.  This is same or better latency than it is during the day when I don't see the timeouts. 

    I run 2 side-by-side simultaneous OSQL -Q"select @@servername" cmds, one using trusted login, other using SQL login, and both succeed all day long.  I run this every 2seconds for 23hours in a row, there will never be a single login timeout for either cmd.  But at 4:01am and 4:02am and 4:03am, the OSQL cmd that uses trusted login will experience a login timeout.  The other OSQL cmd, run at exactly same time, succeeds.  I believe this test rules out problems of network connectivity between my workstation and the server? 

    Wondering if maybe I'm looking in wrong place & should look at connectivity between the SQL server and the PDC (or whatever this is called nowadays in AD environment)?

  • Try checking with the Domain Admins and ask them if there are any type of AD errors in the System Event logs of the PDC(s) and BDC(s) ... additionally ask them if a regulalry scheduled AD replication takes place at the same times as your connection failures.

    RegardsRudy KomacsarSenior Database Administrator"Ave Caesar! - Morituri te salutamus."

  • I had *STRUGLED* with that problem for a looooonnnngggg time. The issue in my case was Windows 2003 Server! It can't handle under certain conditions many connection "open" requests at once!


    * Noel

  • Out of curiosity noeld just how many is too many ?

     

    RegardsRudy KomacsarSenior Database Administrator"Ave Caesar! - Morituri te salutamus."

  • Rudy & Noel, just want to say "thanks" for your replies.  I've stopped trying to treat this as a SQL Server problem, involved our domain admins.  No news yet, might be a while.  If we ever solve this I'll post an update here saying what we determined the problem was.

  • In my case 300 *AT ONCE* --- was too many


    * Noel

  • noeld what were the 'other' conditions that you allude to. I ask because the shop I just came from experienceds a mass logon ever M-F of roughly 500-700users between 7:30 AM and 8:00 AM. Then big crunches just after returning to lunch, probably 400-500 in 15 minutes. Our environment was Win2K3 SP 2 for AD(4), DNS(2),  DHCP(2) and WINs(2). One primary location and 19 remote locations.

    RegardsRudy KomacsarSenior Database Administrator"Ave Caesar! - Morituri te salutamus."

  • the 'other' conditions are:

    What is "login on" are scripts not people.

    These scripts are highly synchronized triggered by cron (Unix)

    All happens at once withinn a maximum of 3 MILLISECONDS difference.

    We traced the issued all the way down to discover that Windows tcpip.sys was sending RST packets back to the clients ( randomly )The moment we put some sleep time among our jobs all went back to normal.

    Note: We didn't had that issue before SP1 Win2003 but M$ introduced "new" protection "features" in the tcpip stack and there we went ... to the toilet


    * Noel

Viewing 10 posts - 1 through 9 (of 9 total)

You must be logged in to reply to this topic. Login to reply