Serious intermittent performance issues

  • Hi,

    We are getting intermittent performance issues on our SQL server that happen once or twice a month. When it happens the whole sql server becomes unresponsive and I have to use the dedicated administrator connection to connect. The last two times that it has happened I have looked in sys.dm_exec_requests and could see that there is one stored procedure in particular that has about 70 entries. Out of the last 2 times it happened, the first time we just failed over to another node so I didn't really have time to look into the issue however the second time (yesterday) I made sure that we gathered some info before failing over.

    I found that again there were about 70 lines in sys.dm_exec_requests for the stored procedure in question. I thought that I would comment out the SP and sure enough the requests for this SP started to dissapear, however this did not fix the problem overall. We still had about 70 lines in sys.dm_exec_requests for various SPs.

    In the wait resource for each of these many have command EXECUTE, wait_resource OBJECT: 5:1417080505:0 [COMPILE], wait_type LCK_M_X and sql_handle and plan_hanlde of 0x000000000000000000000000000000000000000000000000. These ones have the highest wait times of all the requests.

    Many other lines in here have wait_type LCK_M_SCH_M and wait_resource of either METADATA: database_id = 5 SECURITY_CACHE($hash = 0x2532319d:0x1) or METADATA: database_id = 5 METADATA_CACHE($hash = 0x4ae22766:0x0).

    As a side note we do have one SP that calles OPEN SYMMETRIC KEY and CLOSE SYMMETRIC KEY in the SP. This stored procedure was running at the time the issue was happening. When this was running there were a few exec requests as follows: OBJECT: 9:366624349:0 [COMPILE]. Using SQL Load generator on our dedicated Dev instance and loading it up with 20 concurrent requests running this SP I can get it to cause the OBJECT: 9:366624349:0 [COMPILE] wait_resource.

    There are no jobs running that would be modifying schema however we do have a job running every 5 minutes that does a bulk insert of data for each customer. Maybe this contributed the the LCK_M_SCH_M locks??

    Does anybody have any idea as to what could be the issue here as any help would be really great.

    I have attached an excel dump of the exec requests if that helps.

    I can provide more info if I have not been clear enough.

    Our setup is 32bit, dual core xeon 3ghz, 6gb ram (AWE not enabled) if this helps

  • Why is AWE not enabled?

    Is /3GB switch on?

    What is the definition of the procedure with objectID 1417080505 in database 5? Do you know why it's compiling frequently? Same for procedure 366624349 in database 9.

    Is moving to x64 planned?

    What does your compiles/sec perfmon counter look like normally? What does it look like during the intermittent problems? Same for recompiles/sec

    Can you try and get the T-SQL associated with those Sch-M locks?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • We are moving to 64 bit yes I have been pushing for this. Basically this server is not up to the job and I think that they have realized this now.

    In regards to compilations per second I have just checked and it is currently running at 726 batch requests per second with 32 compilations per second so this does not seem like overkill to me. When the server goes through one of these episodes I do not know what the compilations per second is exactly I'm afraid.

    I am not really comfortable posting the stored procedure definitions onto the web so are there any suggestions that you could provide me with in regards to the compilations happening?

    What I can say is that the 2 object Ids you asked for do not query the same data in any way shape or form. The same goes for the queries that have schema lock waits.

  • Follow up question what do these wait_resources mean?

    METADATA: database_id = 5 SECURITY_CACHE($hash = 0x2532319d:0x1)

    METADATA: database_id = 9 METADATA_CACHE($hash = 0x3317e3d:0x0)

    and also could the compilations cause schema locks in any way at all?

  • william.rees.howells (12/20/2013)


    In regards to compilations per second I have just checked and it is currently running at 726 batch requests per second with 32 compilations per second so this does not seem like overkill to me. When the server goes through one of these episodes I do not know what the compilations per second is exactly I'm afraid.

    Recompiles/sec?

    Can you monitor and see if that value is representative, high or low? Monitor for a couple of days, then you'll have a base line to compare to when another problem occurs

    I am not really comfortable posting the stored procedure definitions onto the web so are there any suggestions that you could provide me with in regards to the compilations happening?

    Not without seeing them. There are a bunch of things that cause recompiles (described in Books Online), but without seeing the procedure, it's going to be very hard to say anything useful.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • william.rees.howells (12/20/2013)


    Follow up question what do these wait_resources mean?

    METADATA: database_id = 5 SECURITY_CACHE($hash = 0x2532319d:0x1)

    METADATA: database_id = 9 METADATA_CACHE($hash = 0x3317e3d:0x0)

    Offhand, not sure.

    and also could the compilations cause schema locks in any way at all?

    Schema stability, yes. Shouldn't cause any sch-m locks

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • So in that case would you say that the schema locks are being caused by something else? I.e. an alter table or something like that?

    Also can you clarify which way round this is as well. The wait type is schema lock. Does this mean that the sp is waiting to get a schema lock so that it can modify schema or is it that something else has a schema lock and it is waiting on that other request to lift the schema lock before it can continue?

  • If a process has a Sch-M wait type, then it is waiting to get an Sch-M lock, meaning it is trying to get a schema modification lock and is waiting to be granted that lock.

    What's preventing it from getting the lock, you'll have to look for. For locks you can use the blocking session id column in sys.dm_exec_requests to see what session holds locks preventing it from getting the lock it wants.

    You may find this useful: http://www.red-gate.com/community/books/accidental-dba

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Ok so that would indicate that those stored procedures with schema lock waits are trying to modify the schema in some way? This sounds strange because the SPs that are waiting for those locks do not make any schema modifications at all. They are just simple select statements. Does this sound right to you?

    I would like to add one more thing to all of this. At the beginning when I said that each time this happens there is one particular stored procedure in here that is listed about 70 times? Well, if I use my load testing tool and load it up with about 20-30 concurrent queries running this SP then I can reliably recreate a load of RESOURCE_SEMAPHORE waits. These wait types indicate that this SP is chewing up the memory. If the memory is being chewed up then am I right in saying that SQL Server will start ditching cached plans from the plan cache as it starts running out of memory. If this is correct then would this in turn cause all of these Compilations?

    Just a thought?

  • william.rees.howells (12/20/2013)


    Ok so that would indicate that those stored procedures with schema lock waits are trying to modify the schema in some way? This sounds strange because the SPs that are waiting for those locks do not make any schema modifications at all. They are just simple select statements. Does this sound right to you?

    Doesn't sound right, but then the wait resources weren't tables, so could be something else going on. I don't know what those wait resources are.

    Well, if I use my load testing tool and load it up with about 20-30 concurrent queries running this SP then I can reliably recreate a load of RESOURCE_SEMAPHORE waits. These wait types indicate that this SP is chewing up the memory.

    Not necesssarily. Resource Semaphore is about query workspace memory, in this case probably the space the optimiser needs to compile. The waits indicate that queries want workspace memory and aren't getting it, that can be caused by having lots and lots of concurrent queries wanting memory grants

    If the memory is being chewed up then am I right in saying that SQL Server will start ditching cached plans from the plan cache as it starts running out of memory. If this is correct then would this in turn cause all of these Compilations?

    Maybe, maybe not. Workspace memory and plan cache aren't the same thing.

    You never answered my question about why AWE wasn't enabled and if /3GB was on. If it isn't on, then SQL is limited to 2GB of memory, which may well be a large portion of the problem.

    Have a read through the troubleshooting book, especially chapter 1. That has some counters that you may want to start monitoring so that you have an idea what's normal so that when the problem happens again you can see immediately what's not normal. I'd add some SQL memory-related counters, especially around data cache, plan cache and memory grants to Jonathan's list in this case.

    I'd also suggest next time it happens, get a lot more info on memory, CPU, waits before failing over. I think we're going to need a lot more info.

    And push for that 64-bit move. Hopefully to a server with a lot more memory.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (12/20/2013)


    william.rees.howells (12/20/2013)


    Well, if I use my load testing tool and load it up with about 20-30 concurrent queries running this SP then I can reliably recreate a load of RESOURCE_SEMAPHORE waits. These wait types indicate that this SP is chewing up the memory.

    Not necesssarily. Resource Semaphore is about query workspace memory, in this case probably the space the optimizer needs to compile. The waits indicate that queries want workspace memory and aren't getting it, that can be caused by having lots and lots of concurrent queries wanting memory grants

    One correction here, RESOURCE_SEMPHORE is execution workspace memory only, associated with hash, sort, bulk copy, or index creation operations. Compile memory is a completely separate resource and would trigger RESOURCE_SEMAPHORE_QUERY_COMPILE waits if compile memory resulted in pressure on one of the three internal gateways for compile memory. For a 32-bit instance the small gateway threshold is fixed at 250,000 bytes, and can support four times the number of CPUs for concurrent compilations before a new compile would have to wait. The medium and big gateways have dynamic thresholds that adjust based on current compile memory usage by the workload.

    If the memory is being chewed up then am I right in saying that SQL Server will start ditching cached plans from the plan cache as it starts running out of memory. If this is correct then would this in turn cause all of these Compilations?

    Maybe, maybe not. Workspace memory and plan cache aren't the same thing.

    Gail is correct that workspace memory and plan cache aren't the same thing, but excessive workspace memory grants occurring concurrently would reserve pages from the buffer pool and could result in internal cache store pressure that clears plans during external clock sweeps of the cache entries. You would see this in the performance counters for the instance by a drop in Free Pages, associated with an increase in Reserved Pages/Granted Workspace Memory and then shortly after, a drop in Database Pages (data cache) and Stolen Pages (plan cache/other stores). You can track the plan cache activity with the SP:CacheHit, SP:CacheInsert, SP:CacheMiss, SP:CacheRemove, and SP:Recompile events, and the SP:CacheRemove event would let you know why it was being removed for further troubleshooting.

    I've seen this behavior in other workloads where triggers were present, so one question I'd have is are there triggers associated with either of the object_id's? If you have a plan invalidation for an object, it would compile the trigger and if the procedure is long running, the OBJECT compile lock becomes the bottleneck.

    Jonathan Kehayias | Principal Consultant | MCM: SQL Server 2008
    My Blog | Twitter | MVP Profile
    Training | Consulting | Become a SQLskills Insider
    Troubleshooting SQL Server: A Guide for Accidental DBAs[/url]

  • Jonathan Kehayias (12/20/2013)


    One correction here, RESOURCE_SEMPHORE is execution workspace memory only, associated with hash, sort, bulk copy, or index creation operations. Compile memory is a completely separate resource and would trigger RESOURCE_SEMAPHORE_QUERY_COMPILE waits if compile memory resulted in pressure on one of the three internal gateways for compile memory.

    Ack! I keep forgetting that. Only time I've faced those waits was on a SQL 2000 box, and the query compile wait was added in 2005. 🙁

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Hi thanks for the response.

    I have managed to replicate the compiles on our Dev box.

    I set the max memory to 2GB the same as our production server (GilaMonster said that this is the max for 32 bit without /3GB switch and AWE enabled). I then ran 50 concurrent queries from my machine with the load testing tool. These queries were the query that is causing the RESOURCE_SEMAPHORE waits. I then loaded up the load testing tool on the server and started running about 30 concurrent queries. These were the additional queries that were running in the excel spread sheet attached to this discussion.

    Once this was all running I looked at sys.dm_exec_requests and low and behold the additional queries that do not cause RESOURCE_SEMAPHORE waits have a mixture of the following:

    METADATA: database_id = 5 SECURITY_CACHE($hash = 0x2532319d:0x1)

    METADATA: database_id = 9 METADATA_CACHE($hash = 0x3317e3d:0x0)

    OBJECT: 5:1417080505:0 [COMPILE]

    (Ignore the IDs I just copied them from earlier in the thread).

    Each time the production server starts becoming really unresponsive there is a massive queue in sys.dm_exec_requests for this query that is causing RESOURCE_SEMAPHORE waits. From my test I can clearly show that this stored proc can cause compiles on the server. This stored procedure is basically run on every web page on every page load so this is clearly an issue.

    How come then, when this issue happened on production the other day, I commented out the SP and could clearly see the number of requests for this problem SP falling slowly to 0, however the compiles continued happening even after this SP was no longer showing in sys.dm_exec_requests.

    The question here then is this: Could it be that the server resources are so severely limited that once this problem SP starts causing compiles that the load gets to much and the server basically just falls over and cannot reach equilibrium once more?

  • Is there any way you can post that procedure? Obfuscate the table and column names, those aren't important, it's the form of the procedure (including any hints) that I'm interested in.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • CREATE procedure [dbo].[Problem_Stored_Procedure] (

    @blank uniqueidentifier,

    @blank int,

    @blank bit = null,

    @blank bit = null,

    @blank datetime = null

    )

    as

    if @blank is null

    begin

    set @Now = getdate()

    end

    SELECT COUNT(distinct imr.column1) AS CountMessage

    FROM Table1 imr

    inner join Table2 im on

    (

    im.column1 = imr.column1

    and im.column2 = @blank

    and im.column3 = @blank

    )

    WHERE imr.column2 > CONVERT(DATETIME, @blank , 103)

    and case when @blank is null then im.column4 else @blank end = im.column4

    and case when @blank is null then case when imr.column3 is null then 0 else 1 end else @blank end = case when imr.column3 is null then 0 else 1 end

    Basically the point is to get the number of messages that a user has sent out, that have been read by the recipient. You have the variables left in as hints as well.

    I am also getting this missing index hint

    CREATE NONCLUSTERED INDEX [<Name of Missing Index, sysname,>]

    ON imr (column2)

    INCLUDE (column1,column3)

    I tried adding this on production but it actually made the problem worse (this SP started queuing up like crazy in exec_requests) even though it gives a much nicer execution plan and stops the REOURCE_SEMEPHORE wait types happening.

Viewing 15 posts - 1 through 15 (of 20 total)

You must be logged in to reply to this topic. Login to reply