Serious intermittent performance issues

  • I would recommend that you start off by doing a comprehensive read of the following two links:

    http://www.sqlskills.com/blogs/kimberly/stored-procedure-parameters-giving-you-grief-in-a-multi-purpose-procedure/[/url]

    http://www.sommarskog.se/dynamic_sql.html

    Your nullable parameters are most likely a big part of the problem and this is a pattern that is routinely problemattic in the ways that you describe. I'd rebuild that procedure using the parameterized dynamic SQL execution by dynamically building the string to execute with parameterization and then calling it with sp_executesql, as shown in both of the above articles.

    Jonathan Kehayias | Principal Consultant | MCM: SQL Server 2008
    My Blog | Twitter | MVP Profile
    Training | Consulting | Become a SQLskills Insider
    Troubleshooting SQL Server: A Guide for Accidental DBAs[/url]

  • Jonathan beat me to it (I started rewriting the procedure into dynamic SQL and got busy with other things). Here's another blog post on the subject http://sqlinthewild.co.za/index.php/2009/03/19/catch-all-queries/

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Hi thanks a lot for this answer. You two have been very helpful. I will definitely be rewriting this query in the suggested manner. Again thanks

  • It may be worth looking through the codebase for other similar procedures. I've typically found that if this kind of pattern is used in one place, it's used in a lot others too.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • GilaMonster (12/23/2013)


    It may be worth looking through the codebase for other similar procedures. I've typically found that if this kind of pattern is used in one place, it's used in a lot others too.

    +1, usually when this type of code exist in one place, it exists elsewhere as well, and it is worth fixing everywhere in one major push than combating performance issues one by one over time.

    Jonathan Kehayias | Principal Consultant | MCM: SQL Server 2008
    My Blog | Twitter | MVP Profile
    Training | Consulting | Become a SQLskills Insider
    Troubleshooting SQL Server: A Guide for Accidental DBAs[/url]

  • Hi mate I have done a bit of further investigation into this and was wondering if you could help me out further.

    I set perfmon running yesterday morning with every counter under the sun collecting as much data as possible. I didn't have to wait long for this issue to happen again so now I have a lot more information. Basically looking at the results, I don't think that the compilations we were talking about are the cause of the issue, I think they are a result of it. I did notice however, that yet again exec requests was filled up with that same problem SP again though. I have actually done as you have said and split that multi-purpose SP up into seperate SPs however the changes have not yet been deployed so still waiting to see whether that fixes this issue altogether.

    Anyway onto the data: If you want I can send you the .blg perfmon file to inspect but if not then here is a general overview of the data that I collected.

    Looking at the data I can see that this isn't a gradual thing, it looks like it comes on hard and fast. Here is what happens

    Average temp tables: Drop to nearly 0 from 50

    Temp Tables For Destruction: Skyrocket

    Compilations/sec: Jump from average of about 9/s to about 70/sec

    Active Parallel threads: Skyrocket off the chart briefly and then go to 0. Previous to the crash they were averaging about 10

    Average Latch Wait Time (ms): Averaging about 7 and then shoots up to about 40-50

    Read and Write Disk Queue Length: Lots of activity until the crash. They then go down to basically 0.

    Batch Requests per second: Stable until the point of the crash and then a sharp blip up and then drop drastically.

    Blocked Tasks: Huge Spikes off the chart after the crash.

    Bulk Copy Throughput/sec: Frequent spikes before the crash for a couple of hours and then one single spike right at the point of the crash. Another shortly after the crash.

    Cache Hit Ratio: Drops by about 40%

    Connection Resets/Sec: Sharp blip up before crash and then drop off to very low levels:

    Extended procedures: A few tiny blips 15 mins before the crash. Huge spikes after the crash

    Failed Auto-Params/sec: Few blips and spikes for about an hour before the crash. Big spikes after the crash.

    Free-List Stalls/sec: Big spikes for about an hour before the crash. Big spike on crash and then 0

    Latch Waits/sec: Steadily increase until the crash and then drop right off to nearly 0

    Lazy Writes/Sec: Huge spikes consistently for about an hour before the crash and then 0

    Lock Timeouts/Sec: Quite big spikes before the crash. Drop right off to 0 during the crash

    Lock Wait Time(ms): Starts going haywire around about the time of the crash, during and after the crash.

    Log Flush Wait Time: Quite big spikes for about an hour before the crash. 40 mins before the crash wait time increases significantly (80%) for about 20 mins and then drops back to normal levels. Spikes back up again at about the time of the crash.

    Memory Grants/Sec: Stable until the crash and then drop right off

    Memory Grants Pending: Around about 0 until the crash and then increase significantly

    Non page latch waits: Consistently spiky until the crash and then drop down to 0

    Page IO Latch Waits: Consistent medium sized spikes until the crash and then increases significantly

    Page Lookups/Sec: Stable until crash, small spike during crash, gradual reduction after crash

    Page Reads/Sec: Consistent medium sized spikes until the crash and then drops to 0

    Page Splits/Sec: Consistent massive spikes and then drop to 0 immediately after the crash.

    Query Optimizations/Sec: Consistent mini spikes leading up to crash and then huge surge during/after crash

    Readahead pages/sec: Consistent big spikes until crash and then 0

    SQL Attention Rate: Basically non existant until the crash and then increases quite significantly

    SQL Compilations/Sec: consistent medium spikes until crash and then an absolutely massive surge in the number

    SQL ReCompilations/Sec: A Few big spikes 30 mins before the crash. More big spikes about 10 mins before the crash. One spike during the crash

    Stolen Pages: Very Stable until the crash and then drops off to nearly 0

    Table Lock Escaltions/Sec: Very consistent mini spikes until about 10 mins before the crash. Right before the crash this increases significantly and remains high after the crash.

    Temp table creation rate: Very low leading up to crash and then increases significantly after the crash

    Temp Tables for destruction: 0 right up until the crash and then huge spikes during and after the crash

    Total App Domains Unloaded: Very stable until crash and then increases during and after the crash significantly

    Total Latch Wait Time(ms): Medium sized consistent spikes for an hour before the crash. Big spike during crash. 0 After crash

    Transactions/Sec: Stable until the crash. Big spike right before the crash. Drop off to low levels immediately after the crash.

    Unsafe Auto-Params/Sec: 0 until after the crash. 3 Big Spikes after the crash.

    Usage: Korean_Wansung_Unicode: Stable level until after the crash. Increases significantly after the crash

    Usage: Table Hint Without With: Stable until shortly after the crash when it starts to increase significantly

    Usage: XP_API: Stable until right after the crash when this skyrockets off the chart

    Usage: DataTypes: text ntext or image: Low levels and stable before crash. Increases a lot shortly after crash

    Usage: Syslogins: Stable until crash. Skyrockets to top of chart after the crash.

    I want to just add to this that we have a job that runs every 10 minutes and does bulk loads of CSV files for any clients who have sent new data over to us. It does this by using a CLR procedure to check if there are any new folders on the network share and then bulk loads them using a format file if there are. I don't know how relevant this is but I have a sneaking suspicion that this might have something to do with the problems we are getting.

    I've also checked on the server for any potential scheduled tasks that could cause problems and there basically aren't any. The only thing installed on there as well is EMC Powerpath 5.3. I don't know what this is but could this potentially cause issues?

    One more thing to note is that we are running on a fail over cluster and basically it just isn't configured correctly. It sounds as though the guy who set it up originally (who has now left the company) didn't do too great a job at it and the guy that looks after it at the moment just doesn't have any experience looking after these things and doesn't want to touch it in case he breaks anything. He tried to fail it over the other day and nothing happened so that is the state that it is in. I had a quick look into it today and there are error messages such as the folowing:

    "Cluster node '****' was removed from the active failover cluster membership. The Cluster service on this node may have stopped. This could also be due to the node having lost communication with other active nodes in the failover cluster. Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapters on this node. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges."

    "The Cluster service is shutting down because quorum was lost. This could be due to the loss of network connectivity between some or all nodes in the cluster, or a failover of the witness disk.

    Run the Validate a Configuration wizard to check your network configuration. If the condition persists, check for hardware or software errors related to the network adapter. Also check for failures in any other network components to which the node is connected such as hubs, switches, or bridges."

    "Cluster service failed to update the cluster configuration data on the witness resource. Please ensure that the witness resource is online and accessible."

    "The cluster service detected a problem with the witness resource. The witness resource will be failed over to another node within the cluster in an attempt to reestablish access to cluster configuration data."

    "Cluster resource 'Cluster Disk 5' in clustered service or application 'Cluster Group' failed."

    Perhaps this has something to do with the issues that we are seeing as well. If you could add some comments in about this as well then that would be great. Hopefully this will be enough information for you to go on. I just hope I can get to the bottom of this soon!

    Again if you want I can send you the .blg perfmon file to inspect if that would help. I really look forward to your response.

    Thanks again.

Viewing 6 posts - 16 through 20 (of 20 total)

You must be logged in to reply to this topic. Login to reply