SQL Server 2016 crashes

  • Hello,
    After updating our company's SQL Server 2016 SP1 instances to CU6 last week, the service has been crashing constantly with stack dumps. We have 6 VMs running on Azure, each with 4 cores and 28 GB of memory, and we run about 200 databases on each (total size about 200 GB). In the first hours of operation a server can crash 3-4 times, but if it survives those first hours, it runs much more stably.
    On Monday, after the weekend update to CU6, I counted 20 crashes across these systems.
    Sometimes I see an access violation error, but not always. It looks to me like a memory-handling problem that improves after some days of running (maybe the server is adjusting its policy to the load?).
    I have set min server memory to 0 and max server memory to 24 GB, and enabled Lock Pages in Memory (LPIM). After these crashes I disabled LPIM on 3 of the servers after restarting.

    Any ideas will be appreciated.
    Thanks

  • In reply to mitsos 4066 (Sunday, December 3, 2017 2:58 AM):

    What are the errors in the error log and what do you see in the default trace? Any errors in the Windows Event Log?
    😎

  • Hello Eirikur,
    No errors in Windows Event Log. I am including the errors of 14 Stack dumps of only ONE Server:

    * BEGIN STACK DUMP:                               
    * 11/26/17 17:19:39 spid 5424                           
    * Exception Address = 00007FFDC626903C Module(sqllang+000000000003903C)             
    * Exception Code  = c0000005 EXCEPTION_ACCESS_VIOLATION                  
    * Access Violation occurred reading address FFFFFFFFFFFFFFFF  

    * BEGIN STACK DUMP: 
    * 11/26/17 17:20:43 spid 6156                           
    * ex_terminator - Last chance exception handling

    * BEGIN STACK DUMP:                               
    * 11/26/17 21:01:10 spid 12148                           
    * ex_terminator - Last chance exception handling   

    * BEGIN STACK DUMP:                               
    * 11/26/17 22:14:34 spid 11072                           
    * ex_terminator - Last chance exception handling

    * BEGIN STACK DUMP:                               
    * 11/27/17 11:10:04 spid 8924                           
    * ex_terminator - Last chance exception handling                     

    * BEGIN STACK DUMP:                               
    * 11/27/17 11:21:28 spid 12000                           
    * Exception Address = 00007FFEEA42903C Module(sqllang+000000000003903C)             
    * Exception Code  = c0000005 EXCEPTION_ACCESS_VIOLATION                  
    * Access Violation occurred reading address FFFFFFFFFFFFFFFF                 

    * BEGIN STACK DUMP:                               
    * 11/27/17 11:22:44 spid 4296                           
    * Non-yielding Scheduler                             

    * BEGIN STACK DUMP:                               
    * 11/27/17 12:04:12 spid 5044                           
    * ex_terminator - Last chance exception handling

    * BEGIN STACK DUMP:                               
    * 11/27/17 12:29:21 spid 5100                           
    * Exception Address = 00007FFABC01903C Module(sqllang+000000000003903C)             
    * Exception Code  = c0000005 EXCEPTION_ACCESS_VIOLATION                  
    * Access Violation occurred reading address FFFFFFFFFFFFFFFF 

    * BEGIN STACK DUMP:                               
    * 11/27/17 12:29:22 spid 5728                           
    * ex_terminator - Last chance exception handling

    * BEGIN STACK DUMP:                               
    * 11/27/17 13:17:48 spid 1896                           
    * ex_terminator - Last chance exception handling

    * BEGIN STACK DUMP:                               
    * 11/27/17 13:21:21 spid 6464                           
    * ex_terminator - Last chance exception handling

    * BEGIN STACK DUMP:                               
    * 11/27/17 13:28:01 spid 936                           
    * ex_terminator - Last chance exception handling          

    * BEGIN STACK DUMP:                               
    * 11/27/17 14:03:10 spid 1064                           
    * Non-yielding Scheduler  

    This is typical memory info from the dumps:
    MemoryLoad = 84%       
    Total Physical = 28671 MB    
    Available Physical = 4586 MB   
    Total Page File = 33023 MB   
    Available Page File = 8514 MB  
    Total Virtual = 134217727 MB   
    Available Virtual = 134174939 MB
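
As a quick sanity check on the figures above, here is a minimal Python sketch. It assumes MemoryLoad is simply the share of physical memory in use, and it uses the 24 GB max server memory value from the first post; the function and constant names are made up for illustration.

```python
def memory_load_percent(total_mb, available_mb):
    """Percentage of physical memory in use, rounded to a whole percent."""
    return round(100 * (total_mb - available_mb) / total_mb)

# Figures copied from the memory info above.
TOTAL_PHYSICAL_MB = 28671
AVAILABLE_PHYSICAL_MB = 4586

# Assumption: the 24 GB max server memory cap mentioned in the first post.
MAX_SERVER_MEMORY_MB = 24 * 1024

# Share of physical memory in use.
print(memory_load_percent(TOTAL_PHYSICAL_MB, AVAILABLE_PHYSICAL_MB))

# Physical memory left outside the max server memory cap, in MB.
print(TOTAL_PHYSICAL_MB - MAX_SERVER_MEMORY_MB)
```

Under those assumptions the reported 84% lines up with the total and available figures, and the available physical memory is slightly more than what the cap leaves outside SQL Server.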

    After this episode (14 crashes in one day), the server has been running continuously for 5 days.
    One other thing I changed 3 days ago: the server used to be exposed to the internet (it has to be accessed by our customer's utilities), and it is now behind a firewall.

    Thanks.
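
The dump headers above can be tallied by kind and by faulting module offset. The following is a rough Python sketch, not a tool anyone in this thread used: the sample text is an excerpt of four of the fourteen headers, and `summarize_dumps` is a hypothetical helper name.

```python
import re
from collections import Counter

# Excerpt of the dump headers posted above (four of the fourteen entries).
SAMPLE_LOG = """\
* BEGIN STACK DUMP:
* 11/26/17 17:19:39 spid 5424
* Exception Address = 00007FFDC626903C Module(sqllang+000000000003903C)
* Exception Code  = c0000005 EXCEPTION_ACCESS_VIOLATION
* Access Violation occurred reading address FFFFFFFFFFFFFFFF

* BEGIN STACK DUMP:
* 11/26/17 17:20:43 spid 6156
* ex_terminator - Last chance exception handling

* BEGIN STACK DUMP:
* 11/27/17 11:21:28 spid 12000
* Exception Address = 00007FFEEA42903C Module(sqllang+000000000003903C)
* Exception Code  = c0000005 EXCEPTION_ACCESS_VIOLATION
* Access Violation occurred reading address FFFFFFFFFFFFFFFF

* BEGIN STACK DUMP:
* 11/27/17 11:22:44 spid 4296
* Non-yielding Scheduler
"""

# Captures the absolute address, the module name and the offset inside the module.
MODULE_RE = re.compile(
    r"Exception Address\s*=\s*([0-9A-Fa-f]+)\s+Module\((\w+)\+([0-9A-Fa-f]+)\)"
)

def summarize_dumps(text):
    """Return (kinds, sites): counts per dump kind and per module+offset."""
    kinds = Counter()
    sites = Counter()
    # Every dump starts with a BEGIN STACK DUMP marker, so split on it.
    for block in text.split("BEGIN STACK DUMP:")[1:]:
        match = MODULE_RE.search(block)
        if match:
            _address, module, offset = match.groups()
            kinds["access violation"] += 1
            sites[f"{module}+0x{int(offset, 16):X}"] += 1
        elif "ex_terminator" in block:
            kinds["ex_terminator"] += 1
        elif "Non-yielding Scheduler" in block:
            kinds["non-yielding scheduler"] += 1
        else:
            kinds["other"] += 1
    return kinds, sites

kinds, sites = summarize_dumps(SAMPLE_LOG)
print(dict(kinds))
print(dict(sites))
```

In the full list, all three access violations have different absolute addresses but the same offset, sqllang+0x3903C, which suggests the same code location is faulting each time while the module is loaded at a different base address after each restart.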

  • What other services apart from SQL Server Service are running on the servers?
    😎

    Given this:
    Access Violation occurred reading address FFFFFFFFFFFFFFFF
    then this looks like either a bug or a hacking attempt. Have you checked the incoming connections prior to the crashes? I suggest you persist sys.dm_exec_connections and sys.dm_exec_sessions for post-crash analysis.

  • How can I check a hacking attempt?
    I know for sure that during the period with the crashes the server was known to hackers, and there were login attempts (from China, South America, Russia, etc.) using sa and a wrong password. I wrote a service that blocks the IP in the firewall as soon as it sees a failed login attempt logged. But could there be a hacking attempt without a logged login attempt?
    Anyway, I now permit logins only from IPs in my country, and I no longer see wrong-password attempts.

    Thanks
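
For readers wondering what such a blocking service might look at, here is a minimal Python sketch of the log-scanning half only. It is not the poster's actual service: the sample lines are invented (with IPs from documentation-only ranges), the helper names and the threshold are hypothetical, and the firewall step is left out. It also only sees attempts that get as far as a logged login failure, which is exactly the limitation raised in the question above.

```python
import re
from collections import Counter

# Hypothetical error-log lines; the IPs come from documentation-only ranges.
SAMPLE_LINES = [
    "2017-11-26 17:01:02.11 Logon       Login failed for user 'sa'. "
    "Reason: Password did not match that for the login provided. [CLIENT: 203.0.113.7]",
    "2017-11-26 17:01:03.45 Logon       Login failed for user 'sa'. "
    "Reason: Password did not match that for the login provided. [CLIENT: 203.0.113.7]",
    "2017-11-26 17:02:10.02 Logon       Login failed for user 'admin'. "
    "Reason: Could not find a login matching the name provided. [CLIENT: 198.51.100.23]",
    "2017-11-26 17:03:00.00 spid51      Starting up database 'Customer42'.",
]

# Captures the login name and the client address of a failed login message.
FAILED_LOGIN_RE = re.compile(r"Login failed for user '([^']*)'.*\[CLIENT: ([^\]]+)\]")

def failed_logins_by_ip(lines):
    """Count failed login messages per client address."""
    counts = Counter()
    for line in lines:
        match = FAILED_LOGIN_RE.search(line)
        if match:
            counts[match.group(2).strip()] += 1
    return counts

def ips_to_block(lines, threshold=2):
    """Addresses with at least `threshold` failed logins, sorted for stable output."""
    return sorted(ip for ip, n in failed_logins_by_ip(lines).items() if n >= threshold)

print(ips_to_block(SAMPLE_LINES))
```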

  • In reply to mitsos 4066 (Sunday, December 3, 2017 5:00 AM):

    What I normally do is persist sys.dm_exec_connections and sys.dm_exec_sessions by writing their deltas into permanent tables, correlate them with any available network logs and, on top of that, use Wireshark or a similar tool to gather packet-level information. My favorite is to introduce a Linux box connected to a full dump port on the closest managed switch; that normally catches everything in any direction.
    😎
    IPs can be forged, so IP filtering does not give full security unless one is using SSL etc. Are any of the connections unencrypted?
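
The "write the deltas" idea can be sketched roughly as follows. This is one possible shape in Python, under stated assumptions, and not the setup described above: the query is only held as a string (it is not executed here), the sample rows are invented, and `snapshot_delta` and `unencrypted` are hypothetical helpers. Rows are keyed on (session_id, connect_time) because session ids get reused.

```python
# Query a collector could run periodically; shown for context, not executed here.
SNAPSHOT_QUERY = """
SELECT c.session_id, c.connect_time, c.client_net_address, c.encrypt_option,
       s.login_name, s.host_name, s.program_name
FROM sys.dm_exec_connections AS c
JOIN sys.dm_exec_sessions AS s ON s.session_id = c.session_id;
"""

def _key(row):
    # session_id alone is not unique over time, so pair it with connect_time.
    return (row["session_id"], row["connect_time"])

def snapshot_delta(previous, current):
    """Return (opened, closed): rows that appeared or disappeared between snapshots."""
    prev = {_key(r): r for r in previous}
    curr = {_key(r): r for r in current}
    opened = [curr[k] for k in curr if k not in prev]
    closed = [prev[k] for k in prev if k not in curr]
    return opened, closed

def unencrypted(rows):
    """Rows whose encrypt_option is reported as FALSE."""
    return [r for r in rows if str(r["encrypt_option"]).upper() == "FALSE"]

# Invented sample snapshots; addresses are from documentation-only ranges.
before = [
    {"session_id": 51, "connect_time": "2017-12-03 09:00:00",
     "client_net_address": "198.51.100.10", "encrypt_option": "FALSE", "login_name": "cust_a"},
    {"session_id": 52, "connect_time": "2017-12-03 09:01:00",
     "client_net_address": "198.51.100.11", "encrypt_option": "TRUE", "login_name": "cust_b"},
]
after = [
    before[0],
    {"session_id": 52, "connect_time": "2017-12-03 09:05:00",
     "client_net_address": "203.0.113.7", "encrypt_option": "FALSE", "login_name": "sa"},
]

opened, closed = snapshot_delta(before, after)
print([r["client_net_address"] for r in opened])
print([r["client_net_address"] for r in closed])
print(len(unencrypted(after)))
```

Only the opened and closed rows would need to be written to a permanent table on each pass, which keeps the history small enough to survive until a crash can be analyzed.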

  • Connections are unencrypted, but each user has access only to his own database. I use contained databases. Do you think a hacking attack could be the reason for an access violation? Or can a user who has access only to his own database bring the server down?

    Using WinDbg to examine the dumps, I see the problem is always in the sqllang.dll module, which, as far as I can tell, is the server's T-SQL processor.

    Thank you

  • In reply to mitsos 4066 (Sunday, December 3, 2017 8:41 AM):

    This is very interesting. I haven't analyzed contained DBs from this perspective; probably about time I did so 😉
    😎

  • I did see one report of this. Try restarting the service with the -x startup option.

  • Hello Steve,

    The Microsoft docs note:
    Warning: When you use the -x startup option, the information that is available for you to diagnose performance and functional problems with SQL Server is greatly reduced.

    Do you think that this is OK?
    Operating without the ability to diagnose problems may be an even bigger problem. If you can provide a link to this report, I would like to check it.

    Thank you

  • I was just thinking of it as a test, since this worked for someone else.

  • In a new crash on another server (SQL 2016 CU6) I got the following error:

    2017-12-11 10:14:22.60 Server  Error: 17066, Severity: 16, State: 1.
    2017-12-11 10:14:22.60 Server  SQL Server Assertion: File: <sosmemobj.cpp>, line=2772 Failed Assertion = 'pvb->FInUse ()'. This error may be timing-related. If the error persists after rerunning the statement, use DBCC CHECKDB to check the database for structural integrity, or restart the server to ensure in-memory data structures are not corrupted.

    Databases are not corrupted.

    The stack dump is always at the same module (sqllang):

    2017-12-11 10:14:18.36 Server  * Short Stack Dump
    2017-12-11 10:14:18.41 Server  00007FFAE4C73C58 Module(KERNELBASE+0000000000033C58)
    2017-12-11 10:14:18.41 Server  00007FFAD4DCC54E Module(sqllang+000000000102C54E)
    2017-12-11 10:14:18.41 Server  00007FFAD4DD02C9 Module(sqllang+00000000010302C9)
    2017-12-11 10:14:18.42 Server  00007FFAD4E0AD49 Module(sqllang+000000000106AD49)
    2017-12-11 10:14:18.42 Server  00007FFAD309E294 Module(sqldk+000000000005E294)

    Thanks

  • At this point, I recommend you get help from Microsoft.  We're not going to be able to solve this for you especially since we have no clue about the firewall you also brought up.

    In the future, I recommend you only make one change at a time.

    --Jeff Moden

  • Ok Jeff, thank you.

    PS: The firewall was the standard Windows Advanced Firewall.

  • Somehow FInUse rings a bell. I've come across problems highlighting that function when the authentication mode was incorrectly configured; it might be worth looking into.
    😎
