Clustered SQL Server 2005 SP2 and SQL Agent problem

  • Hi,

    We have been having this maddening problem with SQL Agent and SQL Server 2005 SP2 and we need to get some ideas on how to diagnose and fix this issue.

    First a bit about our environment:

    We migrated our cluster to new hardware a few weeks ago (Dell PowerEdge R900, 4 quad-core CPUs, 64 GB RAM), and updated both the OS and the SQL Server instance to SP2.

    SQL Server 2005:

    Microsoft SQL Server Enterprise Edition (64-bit)

    Version 9.00.3042.00

    OS:

    Windows Server 2003 R2

    Enterprise x64 Edition

    Service Pack 2

    We are running a 2-node Active-Passive failover cluster with an EMC Clariion SAN for the shared disk.

    The problem did not start right away; it took a few days to manifest itself.

    Some of the symptoms:

    1. A couple of SQL Agent jobs hang and take forever to finish.

    Job 1: Back up the transaction logs via Red Gate SQL Backup (using the SQL Backup xp procs)

    Job 2: Select from and update several tables in another database.

    Before the cluster migration and service pack update, the typical runtime of these jobs was 5 to 40 seconds.

    After the service pack update, the jobs started to take anywhere from 90 minutes to 18 hours to finish when they run concurrently.

    2. Once those two jobs are running concurrently in the 'stupid' state, they cannot be stopped via Management Studio.

    3. If we take the SQL Agent resource offline in the Cluster Manager it moves to the failed state and never comes back. There are a few log messages from the cluster service but these are limited to messages like: 'Cluster resource SQL Server Agent (PROD) failed to come offline.' Very informative.

    Only a reboot of the cluster node could make the SQL Agent start again successfully.

    Once during the cluster resource failure I saw the SQL Agent service stuck in the 'Stopping' state. I could not kill it!

    Not even Process Explorer could kill it.

    Only a reboot could fix the problem and, of course, the problem would come back after a day or so.

    Not wanting to give up and roll back to the old cluster hardware, we pushed this problem up to Microsoft PSS.

    Working with PSS, we were not able to replicate the problem (by executing the SQL Agent job contents via Management Studio).

    At this point the MS support engineer basically said that the transaction log backup and the other job must be blocking each other. That, they said, was the only explanation for SQL Agent hanging.

    As far as I know, backing up the transaction logs in an OLTP environment will not wait for some transaction in some database to finish.

    Microsoft PSS was not able to help.

    Has anyone ever come across a problem like this?

    Thanks,

    Andy
