Server Puke, CPU Pressure, or Something else?

  • Hey everyone,

    I am having a rather challenging issue that I am hoping to get some help on.

    In the last 3 weeks, we have 3 instances where the server decided to take a puke and start stumbling along like a drunk. Slow moving, huge waits, etc. Currently, we haven't been able to figure out what the issues are. Sadly, restarting the issues resolves the issues for 5-7 days before it crops up again. What I do know is below. I just need help figuring out some different directions to look next or some really good idea to figure out what is going on.

    Issues:

    On 4/10, 4/17, and 4/22, the SQL 2005 SP4 64 instance simply starts to drag.

    -No blocking

    -Light waits on TempDB GAM/SGAM pages, but very minimal

    -60+ percent of total instance waits (once the issue starts to occur) are sos_scheduler_yeilds. Prior, it is 15-20%. During a single hour last night this reached 31k seconds of signal waits and over 7m task waiting on it.

    -Sys.dm_os_schedulers go from a normal running value of 0-1 to 3-5.

    Top 2 users of CPU per Confio come from the same TVF. Both are deletes. The TVF is heavily used, often being called 30K/hour

    Server Specs and Config:

    SQL 2005 SP4 Standard Ed 64 bit

    Server 2008 R2

    DAS Box with 2 SAS drives for OS and Backups

    32 SSD’s for the remaining drives

    128GB of ram with max server memory set to 100GB

    2 x 6 core processors for a total of 24 threads with hyper threading enabled.

    MAXDOP = 1

    TempDB, Indexes, Data, and Logs are all on separate drives

    Total of 14 TempDB files

    Replicate (transactional) 95% of all reporting to reporting system. Only 1 large processes runs on the server, but was not running when the issues began.

    From what I have read, when we start to have running_tasks from sys.dm_os_schedulers continually above 2-3, that it is a sign of CPU pressure. However, I don't see it on the CPU counters.

    Regardless, anyone got any ideas on what might be going on or what to look at next?

    Thanks,

    Fraggle

  • I think you must already have checked all these before but just wanted to ask?

    Have you checked the execution plan for long running sqls/ stored procedures? It may be good case for execution plans going bad.

    Have you tried re-compiling Sproc ?

    Are the stats are being updated?

    How TVF is being used in query/SP?

  • DevDB (4/23/2013)


    I think you must already have checked all these before but just wanted to ask?

    Have you checked the execution plan for long running sqls/ stored procedures? It may be good case for execution plans going bad.

    Have you tried re-compiling Sproc ?

    Are the stats are being updated?

    How TVF is being used in query/SP?

    Yes, we actually tried flushing the entire cache a couple of times to no effect.

    Fraggle

  • Flushing the entire cache, I think is never a good idea. But what I was more interested in to see the execution plan plan for the slow running sql when it is running fast and when it is bad. Do you have anything to compare for good vs bad plan?

    What I believe that when the query/ SP is running fast, is using parallel plan but in bad mode it is using the serial plan. Which will cause the query to run very- slow.

    Multi- statement TVF may cause the query to use serial plan.

    So if you could share some more information, that will be helpful.

  • DevDB (4/23/2013)


    Flushing the entire cache, I think is never a good idea. But what I was more interested in to see the execution plan plan for the slow running sql when it is running fast and when it is bad. Do you have anything to compare for good vs bad plan?

    What I believe that when the query/ SP is running fast, is using parallel plan but in bad mode it is using the serial plan. Which will cause the query to run very- slow.

    Multi- statement TVF may cause the query to use serial plan.

    So if you could share some more information, that will be helpful.

    I have looked at the plans between the two and they are the same. As the MAXDOP = 1 for the server, parallelism isn't really a concern. The MAXDOP has been set to 1 for 12-18 months now, so it isn't a new change to the system.

    If needed, I can get you the plans. Maybe you will see something that I am not. However, as TVF don't really show up well in execution plans, not sure how much value there will be.

    Fraggle

  • If you can show the code using the TVF and the code for the TVF would be helpful as well as the actual execution plans for both a fast and slow run of the query. As this appears to be on SQL Server 2005, please save and post the plans as .sqlplan files.

  • Yes you are right. For Optimizer TVF is like black box.

    Simple Single statement TVF can be in lined by SQL but for multi line TVF SQL Server will repeatedly

    execute the function – for every row in the result set. Which will cause the CPU to spike a lot.

    Have you tried replacing the TVF with inline SQL query logic?

    Also one of the issues with functions that they don't get good stats from optimizer and sql server has a no way to know at compile time how much data it is going to return and it works on assumption that only one row is going to return from function.

    Have you tried seeing the execution plan just for TVF for different kind of parameter to see if that will help?

Viewing 7 posts - 1 through 6 (of 6 total)

You must be logged in to reply to this topic. Login to reply