Intermittent connectivity issue - how to diagnose and fix

  • Greetings,

    I have been trying to track down the cause of an intermittent connectivity problem I have been seeing from a .NET 3.5 application to SQL Server 2005 (Standard Edition).

    The application uses the Microsoft Data Access Enterprise Library for connecting and running queries and stored procs. I have used the Activity Monitor to verify that the connections being created are being closed properly, and the number of connections from the application is staying pretty constant (between 1 and 3, depending on where we are in the application and how many I expect to have open).

    The application is a data import & processing application that reads data from one database (we'll call A) in batches, pulling 100 rows at a time into a disconnected dataset. It then opens a connection to database B (which may or may not be on the same server as A), starts a transaction, performs a number of queries (mostly inserts), commits the transaction, and closes the connection to B for each record in the disconnected dataset.

    Periodically throughout this process at seemingly random intervals (i.e. the number of records that will process successfully before this happens is unpredictable) I will see the following error when I try to open a connection (using the same connection string that worked for the previous rows):

    A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections.

    I've researched this error but everything I find talks about looking at your connection string, is your db listening on the proper port etc, but seems to assume you either can connect or you can't -- I've seen no mention of this as an intermittent problem or how to troubleshoot it in that way.

    I should also note when this error occurs, NO ONE and NO PROCESS can connect to the database server at all. (Not through Management Studio, not through other .NET applications). So something (probably this application) is "bogging down" either the server or the network. After between 10-30 seconds the connection issue "goes away" and the application (and other users) can get back in, til the next time it happens again.

    I want to figure out exactly what is causing this and fix it. But, since I don't see a runaway number of connections being created in SQL Server (all connections being opened in the code are being closed properly from what I see in the Activity Monitor) I am not sure how to proceed here.

    I believe that when this app was first written, all of the individual row-level transactions were in the same connection instead of having a single connection that opened and closed for each row, and that the change was made to individual connections to try to fix it, but it didn't make a difference. I currently have connection pooling turned off in my connection string -- I had it on previously but still had issues (in fact they seemed to be more frequent!)

    Most of the individual queries being run are fairly simple, there are just a lot of them happening in a short amount of time. There are no cursors or triangular joins anything evil like that. The most complicated thing I am doing is using a paging CTE for the data retrieval query against database A; they are mostly simple insert statements against database B with the occasional lookup to make sure we don't have duplicate rows. Looking at a trace in the profiler, the individual queries (even the paging cte) are all quite short when this is running successfully ( usually 10-50 microseconds, 100 tops). The individual queries don't timeout - it's always an inability to connect to the server at all that I am seeing. This application does cause the CPUs on the sql server box to go up to about 50-60% while it is running.

    I know I'm being somewhat vague on the details of the queries themselves, but I really don't think any one query here is the issue and honestly it would be more trouble than any of you would want to go to to actually reproduce the db and application environment that we have here. I can't reproduce this with a simple example because I've never had it happen with any of my other applications.

    Does anyone have any ideas how I can proceed with figuring out what is causing this? I'm not a DBA and unfortunately we don't have a DBA on staff, I myself am pretty amateur at using the performance and monitoring features of SQL Server. I have been trying to figure this one out for awhile now and it's stretching the limits of my knowledge pretty far. I don't know if SQL Server "thinks" there are too many connections even though the number seems to be the same, or if there is a limit to the # of queries in a given timespan, or if the network itself is getting bogged down. If we need Enterprise Edition for the amount of load we're putting on it I can recommend that but I want to make sure it isn't something I'm doing wrong before I blame the database server.

    Thanks in advance for any thoughts you might have.

    --
    Anye Mercy
    "Service Unavailable is not an Error" -- John, ENOM support
    "You keep using that word. I do not think it means what you think it means." -- Inigo Montoya in "Princess Bride"
    "Civilization exists by geologic consent, subject to change without notice." -- Will Durant

  • most connectivity issues I've run into has more to do with server name resolution or DNS than anything else;

    for the connection string, are they using the machine name, an IP address or a DNS alias?

    for example, at work my machine on the internal network has all three:

    machine name:D223,

    IP: 192.168.0.40

    DNS: lowell.ourdomainname.com

    connecting by servername or DNS alias, since it depends on a server to answer back to give the IP address is where i've seen all my issues.

    assuming your SQL server has a fixed IP, change the connection string to use that, and see if the issue goes away. if it does, it means you can get with your network admin to see why the WINS service or the DNS service is acting up.

    Lowell


    --help us help you! If you post a question, make sure you include a CREATE TABLE... statement and INSERT INTO... statement into that table to give the volunteers here representative data. with your description of the problem, we can provide a tested, verifiable solution to your question! asking the question the right way gets you a tested answer the fastest way possible!

  • Hmm, I actually use machine name when running against our QA server and IP address on staging, but see the problem occurring in both cases.

    --
    Anye Mercy
    "Service Unavailable is not an Error" -- John, ENOM support
    "You keep using that word. I do not think it means what you think it means." -- Inigo Montoya in "Princess Bride"
    "Civilization exists by geologic consent, subject to change without notice." -- Will Durant

  • Lots of network errors I've seen also come from network issues, meaning the network is too busy. That would explain your randomness.

    The other thing is a relatively short timeout and a busy SQL Server. If the server is pegged when you try to connect, it might not respond. Have you correlated the issues with load in any way? Perhaps with Profiler and/or the DAC connection?

  • Well, this morning I did some more testing and this is what I found.

    During the process the CPU on the server is stable at about 15% and drops to almost nothing when the problem occurs.

    Pages/sec is very high, sometimes in the hundreds or even thousands.

    Disk queue length is usually very low but occasionally spikes then drops, not necessarily right before a hiccup.

    I can ping the server during the connection interruption but cannot telnet to port 1433 which the server is supposedly listening on.

    After 10-30 seconds I can again telnet in successfully.

    I do think this is somehow traffic related since this process bangs on the server harder than any other. I am not sure which timeout is probably too short though, could you please elaborate?

    --
    Anye Mercy
    "Service Unavailable is not an Error" -- John, ENOM support
    "You keep using that word. I do not think it means what you think it means." -- Inigo Montoya in "Princess Bride"
    "Civilization exists by geologic consent, subject to change without notice." -- Will Durant

  • Just posting a followup since I've resolved this - it turns out the problem was related to connection pool settings.

    I'm not sure why bad connection pool settings from one app would cause ALL connections to the server to fail, but that was indeed the problem.

    Learn something new every day 🙂 Thanks to all who made suggestions.

    --
    Anye Mercy
    "Service Unavailable is not an Error" -- John, ENOM support
    "You keep using that word. I do not think it means what you think it means." -- Inigo Montoya in "Princess Bride"
    "Civilization exists by geologic consent, subject to change without notice." -- Will Durant

  • Could you elaborate on the bad connection pool settings? I'm seeing similar symptoms--intermittent connectivity in my app which has low traffic but connections are load balanced on an F5.

    Thanks

  • This ended up being pretty complicated.

    With connection pooling turned off, this intermittent lack of connectivity happened frequently and was caused by just too much load for a non-pooled scenario. I had turned pooling off because I had been seeing wonky behavior with it turned on. When I turned it back on, I eventually figured out that the problems were related to the pool not being able to expand past its minimum pool size in our situation:

    In our environment we have two kinds of databases - common databases and customer specific databases. We use the Microsoft Enterprise Libraries for data access and keep the connection strings in the web.config.

    With the common databases, we have a static connection string in the web.config. With the customer databases, we have tags in the connection string which we use to swap in the customer-specific database name, login information etc. When we instantiate the static connections we could use DatabaseFactory.CreateDatabase(ConnStringName) which looks for a pre-existing Database object and its associated pool using the exact conn string straight from the web.config. When we instantiate the dynamic connections, we have to use the SqlDatabase(ConnString) constructor which allows us to replace the tags the specific values in the connection string instead of using the verbatim connection string text from the web.config (which obviously wouldn't work since it contains placeholders instead of actual database name and login info).

    What was happening was, in the WCF service environment, each service call that was instantiating the SqlDatabase using the constructor was having some domain separation from each other call that was seemingly causing security issues when trying to grow the pool - it could use the pool from the first one instantiated for that connection string, but as soon as the pool tried to grow - it couldn't and hence connections started failing.

    For some reason running in a WPF environment this doesn't occur, presumably each call to the constructor is inside the same appdomain. I just thought it was odd that in the WCF service environment it could use the pool at all -- I would have expected a new pool be created each time before I would have expected what actually happened.

    I was able to validate this by changing the min pool size and sending a known number of concurrent requests at once that would exceed this pool size - and sure enough, every time I could run as many queries as the min pool size before it bombed, but only with the dynamic conn strings and only in the WCF environment.

    So my solution was to create in the WCF service layer itself a pool of Database objects. When a request came in for a particular customer database, I would look in my custom pool to see if I'd already created one and pass its instance along to the class that would perform the actual operations instead of just passing the connection information and letting it manage its own Database object. This cured the problem, and now the connection pools grow past the minimum pool size when necessary.

    Now, if you are using static connection strings and/or not using the Enterprise Libraries this isn't likely to be your problem. The DatabaseFactory.CreateDatabase() has no problem managing the pool even in the WCF service environment and its pool grows just fine.

    --
    Anye Mercy
    "Service Unavailable is not an Error" -- John, ENOM support
    "You keep using that word. I do not think it means what you think it means." -- Inigo Montoya in "Princess Bride"
    "Civilization exists by geologic consent, subject to change without notice." -- Will Durant

  • Wow. that's a new one on me. Thanks for the update.

Viewing 9 posts - 1 through 8 (of 8 total)

You must be logged in to reply to this topic. Login to reply