Operating system error 1117 (I/O device error)

  • Hi,

    I have 3 Hyper-V servers in a cluster, with a SQL Server 2019 VM on one of them.

    Recently we got an error in the logs that we cannot find the root cause of.

    Could any of you point me in the right direction?

    SQL Log:

    07/29/2020 20:52:57,spid48s,Unknown,The attempt to flush file buffers failed during file close activity.: Operating system error (null) encountered.

    07/29/2020 20:52:57,spid48s,Unknown,Error: 17053, Severity: 16, State: 1.

    07/29/2020 20:52:57,Logon,Unknown,Login failed for user 'PAS'. Reason: Failed to open the explicitly specified database 'ASGLOBALData'. [CLIENT: 192.168.69.10]

    07/29/2020 20:52:57,Logon,Unknown,Error: 18456, Severity: 14, State: 38.

    07/29/2020 20:52:57,spid108,Unknown,Database ASGLOBALData was shutdown due to error 9001 in routine 'XdesRMFull::CommitInternal'. Restart for non-snapshot databases will be attempted after all connections to the database are aborted.

    07/29/2020 20:52:57,spid108,Unknown,The log for database 'ASGLOBALData' is not available. Check the operating system error log for related error messages. Resolve any errors and restart the database.

    07/29/2020 20:52:57,spid108,Unknown,Error: 9001, Severity: 21, State: 4.

    07/29/2020 20:52:57,spid8s,Unknown,Write error during log flush.

    07/29/2020 20:52:57,spid8s,Unknown,SQLServerLogMgr::LogWriter: Operating system error 1117(The request could not be performed because of an I/O device error.) encountered.

    07/29/2020 20:52:57,spid8s,Unknown,Error: 17053, Severity: 16, State: 1.

     

    Eventlog:

    07/29/2020 20:52:40 Event ID 140 NTFS -

    The system failed to flush data to the transaction log. Corruption may occur in VolumeId: E:, DeviceName: \Device\HarddiskVolume6.

    (The I/O device reported an I/O error.)

     

    I cannot find any fault in the physical drives or the physical servers themselves.

    I'm not fluent in SQL so bear with me 🙂

    Any tips?

     

    Anders

     

  • First thing I'd do is chkdsk on the physical disk to rule out disk corruption.

    But it sounds to me like a disk-level error and probably not a database error. The error is basically saying that SQL Server tried to flush the transaction log to disk but got an I/O device error back from the operating system. Since it couldn't write the log, SQL Server went into panic mode and took the database offline (error 9001) to prevent bad data from going in.

    The value in the event log is basically saying that when it tried to write to disk, the device (the disk) reported an I/O error.

    All signs that I see here are saying there is something wrong with the disk, not with SQL.

    The one alternative to disk corruption that I can think of is forced file locking. What I mean is something like an antivirus grabbing the database file, starting a scan on it, and locking it to prevent data changes while scanning. If SQL Server can't write to the log file or the database file because ANYTHING is locking it, it will shut down the database to prevent data corruption or data loss.

    TL;DR - make sure the disk is not corrupt or corrupting (chkdsk and if any bad sectors come up, replace the disk) and make sure nothing is locking the database files such as an antivirus or antimalware tool.  These must be configured to exclude the database files.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.
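The reasoning in the reply above can be checked against the posted logs with a small script. This is only a sketch: it takes a few lines from the SQL Server log in the first post, pulls out the SQL error numbers and the operating system error code, and compares the timestamps with the NTFS event. The names `parse_line` and `summarize` are made up for this example, and it assumes both logs were written against the same clock.

```python
import re
from datetime import datetime

# A few lines from the SQL Server log in the first post. Some log exports
# write a comma inside a message as "<c/>"; parse_line accepts both forms.
SQL_LOG = [
    "07/29/2020 20:52:57,spid108,Unknown,Error: 9001<c/> Severity: 21<c/> State: 4.",
    "07/29/2020 20:52:57,spid8s,Unknown,Write error during log flush.",
    "07/29/2020 20:52:57,spid8s,Unknown,SQLServerLogMgr::LogWriter: "
    "Operating system error 1117(The request could not be performed "
    "because of an I/O device error.) encountered.",
    "07/29/2020 20:52:57,spid8s,Unknown,Error: 17053, Severity: 16, State: 1.",
]
NTFS_EVENT_TIME = "07/29/2020 20:52:40"  # Event ID 140 from the event log

TIME_FORMAT = "%m/%d/%Y %H:%M:%S"
OS_ERROR = re.compile(r"Operating system error (\d+)")
SQL_ERROR = re.compile(r"Error: (\d+), Severity: (\d+)")


def parse_line(line):
    """Split one exported line into (timestamp, source, message)."""
    # Only the first three commas separate columns; the rest is the message.
    stamp, source, _severity, message = line.split(",", 3)
    return (datetime.strptime(stamp, TIME_FORMAT), source,
            message.replace("<c/>", ","))


def summarize(lines):
    """Return (os_error_codes, sql_error_numbers, earliest_timestamp)."""
    os_codes, sql_errors, stamps = [], [], []
    for line in lines:
        stamp, _source, message = parse_line(line)
        stamps.append(stamp)
        os_codes += [int(code) for code in OS_ERROR.findall(message)]
        sql_errors += [int(num) for num, _sev in SQL_ERROR.findall(message)]
    return os_codes, sql_errors, min(stamps)


os_codes, sql_errors, first_sql = summarize(SQL_LOG)
ntfs_time = datetime.strptime(NTFS_EVENT_TIME, TIME_FORMAT)
lag_seconds = (first_sql - ntfs_time).total_seconds()

print(f"OS errors: {os_codes}, SQL errors: {sql_errors}")
print(f"NTFS warning preceded the SQL Server errors by {lag_seconds:.0f} s")
```

If the clocks do agree, the NTFS flush warning was logged before SQL Server reported anything, which fits the reading that the problem started below SQL Server rather than inside it.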

  • Thank you for your input.

    What you say makes sense. I'll see if I can run chkdsk on a CSV-volume.

    From a quick Google search: it depends on whether the volume is ReFS or NTFS. ReFS should be self-healing, although I expect that only holds until the disks start failing. On NTFS, chkdsk can be run without a problem.

    See - https://techcommunity.microsoft.com/t5/failover-clustering/how-to-run-chkdsk-and-defrag-on-cluster-shared-volumes-in/ba-p/371905

    For ReFS, you would want to run "Get-Item -Path 'E:\*' | Get-FileIntegrity" from PowerShell, replacing E:\ with the path to your ReFS volume - https://social.technet.microsoft.com/Forums/ie/en-US/820e4dbf-ef21-413c-894c-54276ffee4f5/how-do-you-check-the-health-of-a-refs-volume

    In my opinion, on critical systems these commands should run on a schedule, similar to how you run CHECKDB, if possible. If the disk is starting to fail, it is MUCH nicer to find out from chkdsk that it is going bad and that the errors were corrected than to find out when your database becomes corrupted. On NTFS, using chkdsk to check for errors is an online operation with Windows Server 2012 and newer, so a maintenance window is not required. I would still recommend one, as the scan does a lot of disk I/O and may have a performance impact on your databases. I say "may" because if your systems have light database usage, or they are on SSDs where random read/write times are pretty low, you may not notice the performance hit.

    Alternatively, if your disks are providing SMART info (I am not sure whether that can be checked for a CSV volume; I pass those sorts of maintenance tasks off to the server admin guys), that may tell you whether a disk is failing and, if so, which one.

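The NTFS/ReFS split described in the post above could be written down as a small helper for a scheduled check. This is a minimal sketch, not a tested maintenance script: `build_check_command` is a hypothetical name, it only builds the command line and does not run it, and it assumes a plain drive path such as `E:`. For a Cluster Shared Volume, follow the linked Microsoft article rather than this sketch.

```python
def build_check_command(filesystem, path):
    """Return the argument list for a health check of `path`.

    Nothing is executed here; the caller decides when and how to run it.
    """
    fs = filesystem.strip().upper()
    if fs == "NTFS":
        # chkdsk /scan runs an online scan (Windows Server 2012 and newer).
        return ["chkdsk", path, "/scan"]
    if fs == "REFS":
        # Same pipeline as quoted in the thread, with the path substituted.
        script = f"Get-Item -Path '{path}\\*' | Get-FileIntegrity"
        return ["powershell", "-NoProfile", "-Command", script]
    raise ValueError(f"no check known for filesystem {filesystem!r}")


# Print the commands instead of running them, so they can be reviewed first.
print(build_check_command("NTFS", "E:"))
print(build_check_command("ReFS", "E:"))
```

Returning a list rather than a single string means the result can be handed to `subprocess.run` later without shell quoting issues, once the command has been reviewed and tested on a non-production server.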

  • Yeah, I found that article too, and from what I can gather, chkdsk has not been run since April 2020. I'm going to wait until later tonight to run it manually during off-peak hours.

    The problem has not occurred since July 29th though, so hopefully it was something other than disk corruption this time 🙂

     

  • Hopefully it isn't corruption, but if it isn't, then I'd be checking other things like the antivirus to make sure it is not scanning the database files.

    Might not hurt (if you can get the downtime) to do a check on the memory as well.

    It may just be a weird one-off thing or some sort of server hiccup, or it could be the start of something messy.  If it is something messy, catching it now and fixing it now will be easier than finding out months down the road that the server was starting to fail today and months from now it is dead AND the database has corruption in it.

    Just to confirm, CHECKDB came back clean, right?


  • Yeah, CHECKDB was clean 🙂
