SQL Server 2005 Cluster - [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed

  • Hi all,

    Am hoping that you can help me with a problem that seems to be quite popular on the web but has very few definitive fixes.

    Server/SQL Background:

    2 nodes with Windows Server 2003 Enterprise Edition SP1 and 16GB RAM.  Memory is fixed at 14GB, 2Gb reserved for the OS.  Default instance.

    SQL Server 2005 Standard Edition (Patch Level 9.0.2153).

    Active/Passive Cluster configuration.

    Application Log Errors:

    [sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][SQL Native Client]Communication link failure

    [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed

    [sqsrvres] OnlineThread: QP is not online.

    [sqsrvres] printODBCError: sqlstate = 08S01; native error = 2746; message = [Microsoft][SQL Native Client]Communication link failure

    [sqsrvres] printODBCError: sqlstate = 08S01; native error = 2746; message = [Microsoft][SQL Native Client]TCP Provider: An existing connection was forcibly closed by the remote host.

    [sqsrvres] printODBCError: sqlstate = 08S01; native error = 40; message = [Microsoft][SQL Native Client]Communication link failure

    [sqsrvres] printODBCError: sqlstate = 08S01; native error = 40; message = [Microsoft][SQL Native Client]TCP Provider: The specified network name is no longer available.

    Cluster Log Errors:

    00000f00.00001e08::2007/09/15-11:45:08.874 ERR  SQL Server <SQL Server>: [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed

    00000f00.00001e08::2007/09/15-11:45:08.874 ERR  SQL Server <SQL Server>: [sqsrvres] printODBCError: sqlstate = 08S01; native error = 2746; message = [Microsoft][SQL Native Client]TCP Provider: An existing connection was forcibly closed by the remote host.

    00000f00.00001e08::2007/09/15-11:45:08.874 ERR  SQL Server <SQL Server>: [sqsrvres] printODBCError: sqlstate = 08S01; native error = 2746; message = [Microsoft][SQL Native Client]Communication link failure

    00000f00.00001e08::2007/09/15-11:45:08.874 ERR  SQL Server <SQL Server>: [sqsrvres] OnlineThread: QP is not online.

    00000f00.00001e08::2007/09/15-11:45:08.874 ERR  SQL Server <SQL Server>: [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed

    00000f00.00001e08::2007/09/15-11:45:08.874 ERR  SQL Server <SQL Server>: [sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][SQL Native Client]Communication link failure

    00000f00.00001e08::2007/09/15-11:45:08.874 ERR  SQL Server <SQL Server>: [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed

    00000f00.00001e08::2007/09/15-11:45:08.874 ERR  SQL Server <SQL Server>: [sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][SQL Native Client]Communication link failure

    00000f00.00001e08::2007/09/15-11:45:08.874 ERR  SQL Server <SQL Server>: [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed

    00000f00.00001e08::2007/09/15-11:45:08.874 ERR  SQL Server <SQL Server>: [sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][SQL Native Client]Communication link failure

    00000f00.00001e08::2007/09/15-11:45:08.874 ERR  SQL Server <SQL Server>: [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed

    00000f00.00001e08::2007/09/15-11:45:08.874 ERR  SQL Server <SQL Server>: [sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][SQL Native Client]Communication link failure

    00000f00.00001e08::2007/09/15-11:45:08.874 ERR  SQL Server <SQL Server>: [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed

    00000f00.00001e08::2007/09/15-11:45:08.874 ERR  SQL Server <SQL Server>: [sqsrvres] printODBCError: sqlstate = 08S01; native error = 0; message = [Microsoft][SQL Native Client]Communication link failure

    Have tried a few things that have been suggested. The ones that spring to mind are:

    - Ensured that the cluster service account has sufficient SQL access to run 'SELECT @@Servername'

    - Fixed memory size

    - Disabled SynAttackProtect in the registry

    - Upgraded the HP iLO network card drivers

    - Have followed the MS guidelines for cluster configuration

    Any further suggestion would be great and much appreciated.

    The error occurs completely randomly, around 4-5 times in any 24 hour period.  Also, failover does not occur, the cluster stays up and running, it just disappears (from a SQL point of view) off the network for a couple of seconds....

    Cheers,

    Dave.

  • Hi Grasshopper

    Did you ever get an answer to this, as I'm facing the exact same problem.

    Thanks in advance!

  • Hi,

    Nope, never did! I've done just about everything that I can do - except for opening a call with MS or rebuilding the cluster. I will be opening a all with MS this week though and will report anything back here.

    Cheers,

    Dave.

  • I am experiencing some issues with SQL Server 2005 installation.

    Task Manager on the active node of the server shows Total Physical Memory :16776172 and Available Physical Memory : 15915488. Does this mean that SQL is not utlizing any memory? Also, SQL server agent log has events like "8 processor(s) and 4096 MB RAM detected.

    Any help is very much appreciated.

    Prakash B

  • I have exactly the same issure on my cluster, any positive comments now?

  • Disable TCP Chimney, TCPA and RSS

    HKLM-->System-->CurrentControlSet-->Services--TCPIP-->Parameters

    Change the following entries to 0 and reboot

    EnableTCPChimney

    EnableTCPA

    EnableRSS

    Alternatively, set the KeepAlive setting via SQL Configuration Manager to a higher setting than the application side connect pool timeout setting.

  • so, guys do you have answer to this issue?

  • So after talking with MS and EMC for over two weeks, we've gotton down to what our problem was\is.

    When we upgrade our machine from 6gb to 36gb of memory, this allowed us to keep all our tables in memory, but the side effect was that now when running our maintance job, the in-memory tables now are flooding the disks with Write requests and it's overloading our disks with the number of iops per sec, which once the (San 4gb write cache) disk cache fills up on the san, then io comes to a screeching hault on the machine as it has to start retrying and it has to deal with the large amount of things waiting in queue on the server. This causes our node to become less than fully responsive as it pegs the first cpu (Which ironically in a defult setup is the only cpu that handles network requests). So for us the "Fix" was a few things"

    1. We turned off the write cache on the lun that the db was attached to which in turn made the disks slower so that the host did not throw all the thousands of write requests onto the lun and it throttled better.

    2. We also appplied a registery change as requested by MS that allows the network load to be assigned to any open processor rather than rely on one proc (I'll paste the article number and some more info below). This change by it's self may be one of the root causes of your cluster "Network" errors.

    3. We also tuned the query to reduce the io load as well.

    4. We change the lun from a raid 5 to a raid 10 (With a few extra disks for io performance) and turned back on the write cache on the san.

    I'd check to see if your getting any io requests taking longer than 15 seconds as well, as this also points to a disk issue as well. As a test lowering our sql server memory settings back down to 6gb elimited the issue as well for us, as the disks had to read which slowed down the write issue.

    So while we really improved performance for our users by adding in the additional ram, we didn't take into account the additional load that was now being able to be placed on the disks.

    Fun stuff.

    -Greg

    Ms info:

    The processor load is not distributed across multiple processors on a computer that is running Windows Server 2003, Windows 2000 Server, or Windows NT 4.0

    System error 64 has occurred. The specified network name is no longer available.

    http://support.microsoft.com/kb/892100

    Procs HEX BIN

    2 0x3 0b11

    3 0x7 0b111

    4 0xF 0b1111

    8 0XFF 0b11111111

    Also, some additional notes from our MS case:

    Apply MANDATORY Microsoft Hotfix 946448, required for all STORport driver installations (for Windows 2003 SP1/2) - http://support.microsoft.com/kb/946448

  • I called Microsoft and was on phone with them for 6 hours yesterday. They were not able to give me an explanantion. They turned around and said that, this is the way, SQL 2005 is designed to work.

    I still do not agree with Microsoft because, one of my friend who works for another IT firm do have SQL 2005 and he says that SQL 2005 should allocate all the available memory. If maximum 14 gigs of memory is configured on SQL SERVER, SQL should utlize every bit of it.

    If any one has corrected the issue, please let me know, i shall be eager to know the solution.

    Prakash B

  • Sql 2005 will not immediatly allocate all the memory, It will just grab what is configured as it's min and grow from there. But once it's grown it wont release it unless called for by the os. Are you also using the "lock pages in memory" option? Definatly leave at least 2gb for the os outside of Sql though.

    http://blogs.technet.com/askperf/archive/2008/03/25/lock-pages-in-memory-do-you-really-need-it.aspx

  • yes, the service account that sql server services run on does have lock memory pages rights...

    Prakash B

  • This is a tough problem I remember one of this should fix it. Change the network card speed from Auto to the speed of switch specifically and also make sure network card driver is updated one.

  • We too have the following error on our win 2k3 X64 cluster system using SQL2005 . Here is the cluster log leading to cluster failure as pointed out by earlier posters on this thread .

    000008d0.00000610::2008/06/19-08:56:49.653 ERR SQL Server : [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed

    We plan to change following para in registry and eliminate the write cache on storage .

    HKLM-->System-->CurrentControlSet-->Services--TCPIP-->Parameters

    Change the following entries to 0 and reboot

    EnableTCPChimney

    EnableTCPA

    EnableRSS

    Will this solve the problem.

    We have also increased the isalive and lookslive values from 5000 and 60000 to 6500 and 80000.

    Does this looks ok .

    We use VNC , at time RDP and performance logs monitoring . Further SQL memory is et to utilize max physical memory (physical memory - 8GB) .

  • We have resolved this issue on a couple of Clusters by setting the following in Registry on the nodes(After having updated service packs, driver, firmware and disabling TCP/IP offload)

    1. Set TcpMaxDataRetransmissions to 30 (decimal);

    http://technet2.microsoft.com/WindowsServer/en/library/7dac9001-3e55-4e9c-b0fa-52841ece2fdd1033.mspx

    2. Set KeepAliveInterval to 25000 (decimal).

    http://technet2.microsoft.com/WindowsServer/en/library/734570a2-06d6-450e-b765-ccfa7530af491033.mspx

    It worked for us you mileage may.

  • Hi Thomas,

    Many thanks for that. I'll certainly give it a try.

    Cheers,

    Dave.

Viewing 15 posts - 1 through 15 (of 32 total)

You must be logged in to reply to this topic. Login to reply