Clussvc.exe And Lsass.exe Use 100% CPU

  • We have a Windows 2000 Advanced Server (SP2) 2-node cluster that is running as a SQL Server 2000 failover cluster (on a Compaq ProLiant CL380). The system has been working quite steadily for 7 or 8 months now, but every once in a while the CPU jumps to 100%, with Clussvc.exe using between 60 and 70% of the CPU and Lsass.exe using the other 30 to 40%. It stays like this for 2 or 3 minutes and then settles back down to normal. This happened to us yesterday at a fairly low-traffic time of day (only 5 or 6 users accessing the database at the time) with no batch processes running. We have checked our event logs and have not found anything there that would indicate what is causing this condition.

    At this point, I am basically stumped. If anyone has had a similar experience and might be able to help, I would greatly appreciate it.

    Thanks,

    Jeremy Antonini

  • Nope. We have autogrow set up on both of the databases residing on that server, but not autoshrink. The weird thing is that there was just not much going on when that happened.

    Thanks,

    Jeremy

  • Clussvc.exe is the Cluster Service executable, which handles communication between cluster nodes, failover, etc. Lsass.exe is the Local Security Authority Subsystem Service, which is responsible for validating user access privileges on the server. Knowing that, I tend to suspect that the spike has something to do with determining permissions for accessing the server, maybe even trouble contacting our AD domain controller to get user information. I am just not sure.

    Jeremy

  • There is a private network between the two nodes, which is simply a crossover cable connecting the NIC on the first node to the NIC on the second. There is also a public/mixed network connection that serves as the public connection but can also carry the "heartbeat" communications if the private network fails. So far, we have not had any trouble with the nodes being unable to communicate with one another via the private network, and the public network portion has generally performed well (except when we had an extended power outage that drained the UPS on the router, but that's another problem altogether).

    Jeremy

  • Thanks for taking the time to try.

  • First, check Event Viewer, if you have not already done so, and see if anything is reporting issues. Also check the SQL Server logs for similar issues. Finally, if this occurs frequently enough that you can reliably catch it, run Profiler with the CPU data column turned on against events such as SQL:StmtStarting, errors, and any others that may give you an idea of why this occurs.

    "Don't roll your eyes at me. I will tape them in place." (Teacher on Boston Public)

  • 1st question: Was anyone using cluadmin at the time looking at the cluster? Perhaps there was someone doing something they shouldn't have been?

    This one is an even greater stab in the dark because it wouldn't explain lsass.exe spiking.

    2nd question: Are you caught up on your SoftPaqs? Especially for the backend? We saw very odd behaviour on both our shared storage array and our SAN cluster until we made sure the firmware was up to date. We were even getting unexplained system failures, to the point where we had Microsoft and Compaq on-site to troubleshoot. Several very strange OS occurrences disappeared after the updates.

    K. Brian Kelley

    http://www.truthsolutions.com/

    Author: Start to Finish Guide to SQL Server Performance Monitoring

    http://www.netimpress.com/shop/product.asp?ProductID=NI-SQL1


  • We went through the Event Log and the SQL Server logs, and we did not find any problems that seemed to be related. However, I spoke with an MS Product Support Specialist (actually a couple of them), and they thought that recurring instances of the "Browser" service giving error messages along the lines of "The browser service has failed to retrieve the backup list too many times..." might indicate an incorrect configuration of the private network. As it turned out, someone had enabled NetBIOS on the private network and set a DNS server that did not exist on that network (since only the two nodes are on it). We changed those settings, but they did not give us much other advice. Whether or not that actually solves the problem, I don't know, since this happens quite sporadically, generally only every couple of weeks (which makes it difficult to catch when it happens).

    In terms of Cluster Administrator, I am pretty sure that no one was accessing the server with it, since the machine is fairly well locked down and the three people who have access to it were all together at the time it happened. On the SoftPaqs, I believe that everything is current as of January or February of this year. I will check whether any more recent ones have been released that we should install.

    Thanks Mr. Kelley and Antares for your advice on the situation.

    Jeremy Antonini
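
Because the spike is sporadic and lasts only 2 or 3 minutes, one way to catch it after the fact is to leave a counter log running and scan it later for the episodes. The following is only a minimal sketch of that scanning step, not anything posted in the thread. It assumes the per-process CPU percentages for Clussvc.exe and Lsass.exe have already been exported as (timestamp, clussvc %, lsass %) samples. The function name `find_spikes`, the 90% threshold, and the sample layout are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical sample layout: (timestamp, clussvc CPU %, lsass CPU %)
Sample = Tuple[str, float, float]


def find_spikes(samples: List[Sample],
                threshold: float = 90.0,
                min_samples: int = 2) -> List[Tuple[str, str]]:
    """Return (start, end) timestamps of runs where the combined CPU of
    the two processes stays at or above `threshold` for at least
    `min_samples` consecutive samples."""
    episodes = []
    run = []  # timestamps of the current consecutive high-CPU run
    for ts, clussvc, lsass in samples:
        if clussvc + lsass >= threshold:
            run.append(ts)
        else:
            # The run just ended; keep it only if it lasted long enough.
            if len(run) >= min_samples:
                episodes.append((run[0], run[-1]))
            run = []
    # Handle a run that is still open when the log ends.
    if len(run) >= min_samples:
        episodes.append((run[0], run[-1]))
    return episodes


if __name__ == "__main__":
    # Made-up illustrative samples at 30-second intervals, not real data.
    log = [
        ("10:00:00", 2.0, 1.0),
        ("10:00:30", 65.0, 35.0),
        ("10:01:00", 70.0, 30.0),
        ("10:01:30", 60.0, 38.0),
        ("10:02:00", 3.0, 1.0),
    ]
    for start, end in find_spikes(log):
        print(f"spike from {start} to {end}")
```

The start and end times of each episode can then be lined up against the Event Log, the SQL Server logs, and the Browser service errors to see what else was happening on the node during those minutes.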
