SQL 2016 SP1 CU2 - sudden and unexplained shutdowns

  • Hi there,

    Anyone had problems with unplanned instance shut-downs on SQL 2016 SP1 CU2?

    We patched our Data Warehouse to CU2 in May (upgrade from CU1). Since then we've had two SQL Server shutdowns during heavy (but not unusually heavy) load. There are stack dumps along with the crash, but not much else.

    Up to now we've been putting all our effort into looking for user behaviour that may have overloaded the server, but we've come up with nothing so far.

    I'd like to hear if any others have experienced any similar issues.

    Thanks
    Stephen.

  • I'd be curious what is in the SQL logs and the windows logs.
    I imagine that one of those 2 places will have more information about what is wrong.   When SQL Server shuts down, is it just the one instance that goes down?
    How many instances do you have on the system?
    How much memory is utilized when the crashes occur (roughly)?
    What features do you have enabled (CDC, SSIS, SSRS, etc)?
    How much memory does the system have?
    How much free disk space do you have?
    How large is your page file and how large is the max value for it?
    What is your Max memory value for the instance(s) that crash?

    My last few questions above are because my guess is that the SQL instance is becoming starved for resources and is crashing.

    We have a SQL 2016 SP1 CU2 instance that has had a good month of uptime (since install) with no issues that I've seen.  Mind you it is very light loads.
    Presuming it has no confidential data, can you post your stack dump?

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • What's in the error log from the time of the shutdowns?

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • We also have SQL 2016 SP1 CU2 running on other servers with OLTP workload without issue, but the "problem" server has a very different load pattern, with heavy scheduled and ad-hoc processes. 

    Instances and memory

    We have three instances running on the server, but the other two are less utilised ,have memory restrictions and were not affected. Max server memory is split so that there is 3.5% of server memory available for the server (out of 768GB total), and available memory was steady at 3.5% at the time the instance shut down.

    Features
    SQL Server engine and Agent are installed on this instance and also the other two on the server.  We use Resource Governor to split between high-priority processes and ad-hoc users. There's no SSIS, SSRS or SSAS features installed on any instances on this server, although other servers with those features regularly connect to the server to query data from one of the instances.
    We use Redgate backup and restore, which are regularly in use on the server.

    Page life expectancy was low at the time of the crash (~200 seconds) but it is low at the same times every day, usually without crashing (though when our next hardware/architecture review kicks in, this will be something we'll need to address).

  • GilaMonster - Tuesday, July 4, 2017 4:49 PM

    What's in the error log from the time of the shutdowns?

    First sign of trouble is "ex_raise2 - exception handler raised exception , major=9, minor=24, severity=14 - Server terminating"
    Followed by a stack dump. 
    We're going to patch to CU3 next week as a precaution.

  • Can you post the error log entries please?

    If SP3 doesn't fix it, you may need to open a case with Microsoft Customer Support. These kind of things are hard to debug.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  •  I have a server running that build and am not experiencing that issue.  However, this is a very low end application.

  • did a quick google search of that error and came across this blog post:
    https://blog.sqlauthority.com/2007/04/25/sql-server-error-messages-sysmessages-error-severity-level/

    Severity 14 doesn't sound like it should take down the server but I can see that it did.

    When was the last time you ran DBCC CHECKDB and was it successful?
    Your environment is a lot different than mine (we are only running SQL Standard, you have enterprise) so I am not a good comparison.
    Do you notice any pattern in the crashes?   Like is it always on a specific date or a specific time of day?  If you can find some pattern, it might help diagnose the problem.
    And can you post more details about your logs like a few lines before and after the error message?  And is there anything interesting in the windows logs at the time of the crash?
    Do you have an antivirus installed on the system that is set to scan the MDF and/or LDF files?  That can create a lock on the file at times which can prevent SQL from writing to it and cause it to crash.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • GilaMonster - Wednesday, July 5, 2017 1:40 AM

    Can you post the error log entries please?

    If SP3 doesn't fix it, you may need to open a case with Microsoft Customer Support. These kind of things are hard to debug.

    Hi Gila
    We've had some help from Microsoft, after sending our stack dump overnight.
    Here's the relevant section of errorlog.  The crash happened during a proc call to a custom proc that checks times of database backups. This proc calls come DMVs, including sys.dm_exec_requests.

    According to MS,
    "The issue here is actually trying to provide the user_id from dm_exec_requests when one of the db’s that a current request is running against is in single user mode. That is what it appears to be from looking closer at the dump (which does match the call stack ... gleaned from the ERRORLOG)."
    "In our code, we try to handle Msg 924 for this scenario but it is in a part of our code that cannot handle that type of error just based on the sequence of where we are executing. This is extremely rare but I have seen this in the past. In these situations, it is not safe to proceed so we fail fast."

    It's highly likely that this rare bug affected us (we use single user mode in some scenarios when restoring and scrubbing DBs).  We can program around this by avoiding using dm_exec_requests, and see if the system returns to normal stability.
    However, we'll still plan to update to CU3 (one does not simply decline a maintenance window).

  • Please keep us updated on which CU actually fixes this; we do sometimes, though rarely, have single user mode databases, and we use dm_exec_requests for some monitoring, so this is of significant concern to us.

  • Our patching window has been postponed by a week - will update the thread if there is any return of the bug.  
    As well as the patch to CU3 we are reducing our usage of dm_exec_requests in one of our regular monitoring checks.
    We've also had a couple of non-yielding scheduler issues, not resulting in a server crash, taking place around the time Redgate backups of large DBs kick off. We are planning to switch some of these backups to native backups just in case  here is some incompatibility with Redgate in our current environment.

  • Bleeping DBA - Wednesday, July 5, 2017 4:49 PM

    GilaMonster - Wednesday, July 5, 2017 1:40 AM

    Can you post the error log entries please?

    If SP3 doesn't fix it, you may need to open a case with Microsoft Customer Support. These kind of things are hard to debug.

    Hi Gila
    We've had some help from Microsoft, after sending our stack dump overnight.
    Here's the relevant section of errorlog.  The crash happened during a proc call to a custom proc that checks times of database backups. This proc calls come DMVs, including sys.dm_exec_requests.

    According to MS,
    "The issue here is actually trying to provide the user_id from dm_exec_requests when one of the db’s that a current request is running against is in single user mode. That is what it appears to be from looking closer at the dump (which does match the call stack ... gleaned from the ERRORLOG)."
    "In our code, we try to handle Msg 924 for this scenario but it is in a part of our code that cannot handle that type of error just based on the sequence of where we are executing. This is extremely rare but I have seen this in the past. In these situations, it is not safe to proceed so we fail fast."

    It's highly likely that this rare bug affected us (we use single user mode in some scenarios when restoring and scrubbing DBs).  We can program around this by avoiding using dm_exec_requests, and see if the system returns to normal stability.
    However, we'll still plan to update to CU3 (one does not simply decline a maintenance window).

    It doesn't sound like they admitted to any type of bug.  It sounds to me like the MS code is doing its job in protecting the server and it sounds like they identified what part of the code you folks wrote that needs to be changed.  I would be extremely surprised if the next CU fixed this problem because I don't believe the MS code is the problem.  I'd plan on making the work around to your code that you've already identified.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • Jeff Moden - Thursday, July 13, 2017 12:11 AM

    It doesn't sound like they admitted to any type of bug.  It sounds to me like the MS code is doing its job in protecting the server and it sounds like they identified what part of the code you folks wrote that needs to be changed.  I would be extremely surprised if the next CU fixed this problem because I don't believe the MS code is the problem.  I'd plan on making the work around to your code that you've already identified.

    We applied CU3 today and it's got a few good fixes in it.  However, I don't expect CU3 to fix the bug we had, after seeing Microsoft's explanation. One man's feature is another man's bug, and just because Microsoft could technically defend the crashes by claiming "exception handling", from a customer point of view users are doing legitimate things. If the way to avoid the behaviour (i.e. the crash) is to stop using certain features in combination, in my view it's a bug.

  • Bleeping DBA - Tuesday, July 18, 2017 7:46 PM

    Jeff Moden - Thursday, July 13, 2017 12:11 AM

    It doesn't sound like they admitted to any type of bug.  It sounds to me like the MS code is doing its job in protecting the server and it sounds like they identified what part of the code you folks wrote that needs to be changed.  I would be extremely surprised if the next CU fixed this problem because I don't believe the MS code is the problem.  I'd plan on making the work around to your code that you've already identified.

    We applied CU3 today and it's got a few good fixes in it.  However, I don't expect CU3 to fix the bug we had, after seeing Microsoft's explanation. One man's feature is another man's bug, and just because Microsoft could technically defend the crashes by claiming "exception handling", from a customer point of view users are doing legitimate things. If the way to avoid the behaviour (i.e. the crash) is to stop using certain features in combination, in my view it's a bug.

    True enough and agreed.  I'm afraid that I got used to the proverbial "Blue Screen" a very long time ago and that it's actually a bug. The ol' "Works as designed.  Won't fix" or "Scheduled for fix in a future release" replies that MS frequently gives it's customers has made me a bit numb to such problems and now I just sigh (or curse them out loud) and do a work around.  It's a really bad thing for me to have gotten used to.  I've also gotten used to partial (the crap side of "agile") releases of new functionality (like when SUM() OVER, and a wad of other stuff first came out), poorly performing new functionality (FORMAT() is a good example of that), and poorly designed functionality (PIVOT, SPLIT_STRING(), TABLE PARTITIONING, etc) were it didn't seem that they actually understood what the real life needs might be but won't fix what they're released.

    According to what MS responded with in your post, the failure appears to be the code (and it appears to be their code... you just called it) asked something of a database that's in the single user mode and the single available connection was already in use.  Rather than a crash, you would think that it would be easy enough to just return a "Sorry Dave.  I can't do that message" and continue on.

    --Jeff Moden


    RBAR is pronounced "ree-bar" and is a "Modenism" for Row-By-Agonizing-Row.
    First step towards the paradigm shift of writing Set Based code:
    ________Stop thinking about what you want to do to a ROW... think, instead, of what you want to do to a COLUMN.

    Change is inevitable... Change for the better is not.


    Helpful Links:
    How to post code problems
    How to Post Performance Problems
    Create a Tally Function (fnTally)

  • So far so good - no sudden shutdowns since we patched to CU3 and also reduced usage of sys.dm_exec_requests.

Viewing 15 posts - 1 through 15 (of 15 total)

You must be logged in to reply to this topic. Login to reply