SQL Server 2005 Cluster - [sqsrvres] CheckQueryProcessorAlive: sqlexecdirect failed

  • Thomas --

    We have had the same problem in our cluster environment for a while now... and I tried the two additional reg-pokes you suggested.

    We had been thinking it was disk issues up until this point... we are going to get EMC involved sometime today.

    Anyhow, I tried the pokes and they look to have worked... I have a process that would cause the error most every time... once I put these in, it wouldn't fail. I took them back out, and things started failing again. So... I'm proceeding with cautious optimism that this works for us...

    BUT... not being very strong from a network perspective... how did you come across these reg-pokes, what exactly do they do, and why would I need to be concerned about them on the cluster but not on our other non-clustered SQL servers? Just curious if you have an answer.

    Thanks!

    Mike Metcalf

    In my organization we had this issue with a number of SQL2005 64-bit clusters.

    I went through every posting on the net.

    We updated NIC drivers and disabled TCP/IP offload. We checked storage adapters and SAN provider bugs. It was getting to the point where we were considering some rather drastic measures regarding RAID layout when, on the 5th or 6th call to Microsoft PSS, a tech mentioned these settings as a possible resolution.

    As we were getting to the point where we might have to ask the client to blow away their LUNs to test a theory, we thought we would give the registry edits a try. They worked and continue to work. I got to thinking about all the hours of trawling through online postings I did and decided to post the info so someone else's life could be more fulfilling.

    Hope it works out for you. On the upside, I now know more about SQL architecture than I ever really wanted to know (I am an OS guy primarily).

    BTW what process would you use to generate the error?

    Funny thing is that it crept into our main cluster a few months ago... but at that time, it wasn't impacting anyone... but it's gotten progressively worse... We have gone through the same exact motions you did... except considering rebuilding the LUNs. 🙂

    We have sent off this info to Microsoft (we have an ongoing open case about this) to get some more clarification as to why this would work... and we're still going to work with our SAN vendor 'just to be sure'... but my team and I were talking, and to us it is PLAUSIBLE that maybe it really is a network/disconnect thing... and the reason for the spike in disk activity is secondary... i.e. a disconnect forces SQL Server to commit or rollback... potentially momentarily causing a disk spike.. (a thought anyway).

    How we replicated the error... hmmm... well... we are getting a new/upgraded system in here, and last week they went through a mock conversion. One part of the process runs a stored procedure that creates and runs dynamic SQL against the server... so it was running against a node in the cluster... and it kept failing for most of the day... until it finally ran clean -- it took about 2 hours. I took the database to another newly made node, re-did my test, and it failed every time. I applied the poke, and it ran cleanly off the bat. I removed the poke, and the proc failed... then SQL Profiler started getting disconnects, as did another query window I was using to query the status of the running proc. Put the poke back in, and the node was happy. I'm not sure if I could tell you how to make a generalized script/test out of it... if you're interested I'll see what I can do.

    -- Mike

  • Hi Mike,

    ....'We have sent off this info to Microsoft (we have an ongoing open case about this)....'

    Could I ask that you post whatever MS comes back with? I'd be very interested to know what they say. 🙂

    Many thanks,

    Dave.

  • Thomas / Mike,

    Thanks for all your valuable contributions. Here are my 2 cents. We too had similar problems. What I noticed was that our SQL Server instance was configured with a virtually unlimited maximum memory setting. I set it to a little less than the OS memory (7 GB for 8 GB), and now it works fine without any errors.

    I'll also try the reg settings suggested here.
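
    For anyone wanting to apply a similar cap, here is a minimal sketch that builds the T-SQL for it. The helper name is hypothetical, and the 1 GB of headroom only mirrors the 7-of-8 GB figure above; it is an assumption, not a general recommendation. `sp_configure` and the `max server memory (MB)` option are standard SQL Server settings.

```python
def max_server_memory_tsql(total_gb: int, os_reserve_gb: int = 1) -> str:
    """Build T-SQL that caps SQL Server memory, leaving os_reserve_gb for the OS.

    The default 1 GB reserve is an assumption taken from the 7 GB / 8 GB
    example in this thread; size it for your own server.
    """
    if os_reserve_gb <= 0 or os_reserve_gb >= total_gb:
        raise ValueError("os_reserve_gb must be between 0 and total_gb")
    cap_mb = (total_gb - os_reserve_gb) * 1024  # the option is expressed in MB
    statements = [
        # 'max server memory (MB)' is an advanced option.
        "EXEC sp_configure 'show advanced options', 1;",
        "RECONFIGURE;",
        f"EXEC sp_configure 'max server memory (MB)', {cap_mb};",
        "RECONFIGURE;",
    ]
    return "\n".join(statements)


# 8 GB server with 1 GB left for the OS gives a 7168 MB cap.
print(max_server_memory_tsql(8))
```

    The generated script can be pasted into a query window; review it before running it against a production node.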

  • Everyone --

    We are still working through the issue with Microsoft and EMC. The latest is that we have been able to single out a node in our cluster to be our guinea pig. As of this morning...

    - We have found out that we had outdated HBA drivers for our HBA to our EMC SAN. There is also a mandatory (says EMC) patch from Microsoft required for the HBA. It should be in place now too.

    - We have also put in some Server Service/LanMan reg pokes Microsoft suggested.

    One of the new things they found was related to this:

    "The errors we’ve been getting indicate that the Server service is unable to keep up with the demand for network work items that are queued by the network layer of the input/output (IO) stream.

    There are many causes some of which only cause brief logging of error conditions (but may not cause failover) and these may be addressed by tuning the server service.

    Disk subsystem not being able to keep up is the most common cause of the accumulation of work items in the server service."

    He then went on to suggest this reg poke:

    To increase the capacity of the server service to handle incoming IO requests, please set the following registry settings (hexadecimal values) using regedit.exe:

    HKLM\SYSTEM\CurrentControlSet\Services\lanmanserver\parameters

    "MaxFreeConnections"=dword:00000064

    "MinFreeConnections"=dword:00000020

    "MaxRawWorkItems"=dword:00000200

    "MaxWorkItems"=dword:00002000
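
    Since the values above are given in hexadecimal, it is easy to mistype them if regedit is set to decimal entry. A minimal sketch that decodes them follows; the dictionary simply restates the four values listed above, and the helper name is hypothetical.

```python
# The four lanmanserver parameters exactly as listed above (hex dword strings).
SETTINGS_HEX = {
    "MaxFreeConnections": "00000064",
    "MinFreeConnections": "00000020",
    "MaxRawWorkItems": "00000200",
    "MaxWorkItems": "00002000",
}


def decode_dword(hex_text: str) -> int:
    """Convert a .reg-style hexadecimal dword string to its decimal value."""
    return int(hex_text, 16)


decoded = {name: decode_dword(value) for name, value in SETTINGS_HEX.items()}

for name, value in decoded.items():
    print(f"{name} = {value} (0x{value:08x})")
```

    This gives decimal values of 100, 32, 512 and 8192 respectively.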

    --- We'll see where this takes us... I am also collecting additional information for him as well... will keep you posted.

    -- Mike

  • Thanks for your continued efforts on this. The registry fixes I posted appear to prevent the unneeded SQL Cluster resource failures and the attendant 19019 errors, but I have suspected the underlying issue has not been resolved.

    We have systems using HP and 3PAR based SANs that have been affected by this issue (we may have some systems attached to EMC SANs that are affected, but I am unaware of any at this time). With the application of the fix recommended by the Microsoft tech, the cluster resource failures have stopped and therefore the clusters are no longer on the front burner, as it were.

    I am continuing to see evidence of disk subsystem issues (VSS and VDS errors which are interrupting backups).

    Systems have been checked against the respective SAN configuration matrices for HBA drivers/firmware, MPIO, etc.

    We have been applying http://support.microsoft.com/kb/943295 against some of the affected systems and the jury is still out, but I have the feeling that we are not out of the woods regarding this issue.

    I will be paying close attention to this thread.

  • Everyone --

    Still not out of the woods yet. This is becoming a long (and painful) process. I've tried everything Microsoft has suggested and still nothing.

    As for hotfixes, Thomas mentioned that he applied 943295. EMC told us to put in hotfix 943545... which I'm assuming is a newer fix than the one Thomas put in?

    Today I thought that possibly the communication link failures I'm getting *may* be fixed by CU10... yes, they're up to CU10 for SP2. But I think I disproved that this afternoon.

    I re-ran the process on another SQL2005-x64 machine that is SAN connected but not in a cluster... ran clean as a whistle.

    I'll keep you posted.

    -- Mike

  • Everyone --

    Wanted to check back in... we think we got it. It'll take a few days to pull together everything we did and what Microsoft recommended... but the tweak that seems to have nailed it for us was that we had made the (in hindsight not so smart) mistake of having SQL Server Priority Boost checked on the nodes in our cluster.

    If any of you following this have done the same... uncheck it as soon as reasonably possible. Again, in hindsight, if you google around long enough you'll hit the articles that say you shouldn't have this checked in a cluster... but they only say it could cause networking problems, with no details or specific messages...

    In our case, we got to a point where we stripped a node down to its bare bones... we uninstalled EVERYTHING on the node that wasn't critical and (with boost on) ran a test where I would run a procedure that reliably causes the 19019 events, along with Profiler. SQL Profiler would see that the cluster service would periodically get dropped as a connection from SQL Server. THIS DROP is what was generating the 19019 errors! On a hunch, one of our DBAs thought that *maybe* this priority boost thing might be choking other processes on the node... the cluster service being one of them... to give way to the higher-priority SQL Server.

    Sure enough, we switched off this setting... not a single doggone 19019 since.

    Like I said, I will try to get back to you folks within a few days, maybe a week, to compile everything we tried and all of Microsoft's recommendations based on our particular environment. In short, in no particular order, things Microsoft cited in our environment:

    1. Spikes in disk activity on our SAN (got better after applying latest drivers/hotfixes)

    2. They claimed that our NIC cards in our nodes were teamed, which is a cluster no-no (our cards were not teamed, period.)

    3. They tried to reference an obscure match with Quest's SQL LiteSpeed causing the problem when using native command substitution... not buying this one... we've had LiteSpeed for years and have always used the xp_ procs... not command substitution.

    4. They suggested that we look at / play with our MAXDOP options... currently set to 0 on each of our nodes. (This was suggested after we mentioned to them that we had stumbled upon the priority boost thing.)

    Take it easy -- Mike

  • Hi

    I have a newly built SQL Server cluster and am getting these errors with no application volume at all.

  • Hi, my name is Fabio Pereira and I'm Brazilian. Did the modification solve this problem for you?

    Thanks.

  • Prakash.Bhojegowda (5/7/2008)


    I called Microsoft and was on the phone with them for 6 hours yesterday. They were not able to give me an explanation. They turned around and said that this is the way SQL 2005 is designed to work.

    I still do not agree with Microsoft, because one of my friends who works for another IT firm has SQL 2005, and he says that SQL 2005 should allocate all the available memory. If a maximum of 14 gigs of memory is configured on SQL Server, SQL should utilize every bit of it.

    If anyone has corrected the issue, please let me know; I am eager to know the solution.

    Your friend is correct within certain contexts.

    If you are using a 32-bit version of SQL2005, it will not use all the memory unless PAE and AWE are enabled.

    On a 64-bit system it will use all of the memory; however, it will not take all of that memory immediately. It will start off small and then ramp up as memory is needed, up to the maximum that it is allocated.

    Shameless self promotion - read my blog http://sirsql.net

  • Still looking for some answers on this. Did anyone figure it out?

  • Does anyone have a solution for this issue?

    Please reply..

    Thanks in advance

  • Hi

    We had almost the same problem - looked at CPU affinity, network cards, etc.

    In the end I found it was because Priority Boost was enabled on the installation (it was already there and the server failing before I arrived.)

    Once I set this to 0 the mysterious reboots ended - along with the event viewer errors which used to happen 2 or 3 times a day.

    Priority Boost changes require a service restart before they take effect.

    Hope this helps some of you
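
    For those who prefer a script to the checkbox, here is a minimal sketch that builds the equivalent T-SQL. The helper name is hypothetical; `sp_configure` and the `priority boost` option are standard SQL Server settings, and as noted above the change only applies after the SQL Server service is restarted.

```python
def priority_boost_tsql(enabled: bool = False) -> str:
    """Return T-SQL that sets the 'priority boost' option (0 = off, 1 = on)."""
    value = 1 if enabled else 0
    statements = [
        # 'priority boost' is an advanced option, so expose advanced options first.
        "EXEC sp_configure 'show advanced options', 1;",
        "RECONFIGURE;",
        f"EXEC sp_configure 'priority boost', {value};",
        "RECONFIGURE;",
    ]
    return "\n".join(statements)


# Default is to turn the boost off, which is what fixed things in this thread.
script = priority_boost_tsql(enabled=False)
print(script)
```

    On a cluster, run the generated script once per instance and plan the restart or failover for a quiet window.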

    Seth

    (actually just noticed someone else has said the same thing a few posts earlier - I'll leave this for those like me who get forum thread blindness)
