Dumps

  • We are on SQL 2017. There was Stack dump latch timeout and server stopped responding. After looking in to further found there was a simple update query in the dump file. Other than running checkdb to find the corruption any other thoughts? Thanks in Advance!

  • To add more details. I found this. But i am already on CU21. So it doesn't look like below link is applicable to me. The error is "stack dump latch timeout"

    https://support.microsoft.com/en-us/help/4089819/time-out-error-when-using-availability-groups-in-sql-server-2016-2017

  • I'd check in your logs to see what happened.  both windows and SQL Server logs.

    A good write up on latch timeout can be found here:

    https://mssqlwiki.com/2012/09/07/latch-timeout-and-sql-server-latch/

    It includes some steps to troubleshoot the problem too.

    I would still run checkdb to make sure nothing is corrupted.  I would also try to replicate the dump in test and if you can, then it may be easier to troubleshoot and debug.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • Thanks Brian. A query is showing in the dump file is that means that is the query is cause for the dumps to generate?

  • Side note there was spike in CPU usage before the dump got generated. To monitor proactively for future events,I was thinking to set the alert when cpu spikes in warning state instead of critical. Reading the dumps and setting the alerts would be an option too i am thinking as well. Any thoughts?

     

  • My approach - dig into the facts before digging into possible symptoms.  Spikes in CPU don't mean there is a SQL issue.  Monitoring for spikes MAY help or may not do anything.

    Like the link I sent, I would dig into the dump file as it will tell you exactly what went wrong and fix that problem.

    My approach - read the dump, read any and all logs, find the root cause, correct the root cause.  Don't treat the symptoms (CPU spikes), treat the problem (SQL crashed and created a dump file).

    Might not hurt to reach out to Microsoft for support on this too as they may need to create a patch or you may need to install a patch.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • Thanks Brian. To identify the stack dump issue quicker. Is there a way to configure alerts when this happens?

  • generally, when a stack dump happens, the process that caused the stack dump shuts down.  In this case, SQL Server service is likely no longer running after the stack dump happened.

    If you wanted to identify the issue quicker, I would monitor for an instance being offline and if any instance does go offline, you can address that immediately.

    There are other reasons that SQL Server can go offline, so monitoring for that on a regular basis is a good idea for any that you want to make sure have high up-time.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • Thanks for the response. In my case the system was up and running after the stack dumps were generated so SQL service was never went down. Only symptom was spike in CPU and query's were running longer due to this. Memory looked okay as well.

  • I would monitor the logs in that case.  Or long running queries.  If most queries complete in under 5 seconds, but I start seeing a large number of transactions being open for a minute or more, I should look into that.

    Monitoring for long running spikes in CPU may not be a bad idea, but that also may be normal behavior.  If your system frequently runs at 20% CPU or less but has these abnormal spikes up to 100% for extended periods of time during a dump, then monitoring for that may be helpful.  But it also may not.  If the dump only causes a spike on a single core (for example), you may miss the spike.  Or if the dump doesn't cause a spike.  Generally, dumps cause memory spikes as it dumps the process and all related debugging information to disk for further analysis.  But if it only needs to dump 20 MB of data, you are likely not going to catch that spike as it is pretty small.

    On the other hand, if your ETL process often causes spikes after hours for hours on end, you may get a lot of false positives.  False positives are very dangerous as it trains you to ignore those alerts.

     

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • So alerting for stack dumps is not possible?

  • Pretty sure SQL has no built in way to monitor and alert for stack dumps.

    This link has a powershell script for it though:

    https://www.sqltechnet.com/2014/04/monitor-and-alert-sql-stack-dumps.html

     

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • Thanks. Sorry for shifting the gears. I installed the tool to read the dump file and would it be okay to read from the server or copy the dmp files to different location and read using the tool?

  • You can read it from another location, but you may be missing the symbol libraries.  If possible, it is best to do it on the server that had the problem OR an equivalent machine.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • Ok. Another finding from me is i found the update query in the dump.txt file which seems to be human readable. But going back to queries during the CPU spike  i got the query it is a same query running from different application nodes causing high reads and cpu usage. Do you think this particular query could be the cause of dump? But the question why the query is running longer and it never showed up in other days also the disk and memory looks good on the server. Please advise?

Viewing 15 posts - 1 through 15 (of 16 total)

You must be logged in to reply to this topic. Login to reply