What component of the I/O subsystem caused database corruption?

  • As per error 824, which has occurred in my SQL Server error log, I have found that the I/O subsystem is the cause of the database corruption.


    However, I want to dig further and understand which specific component of the I/O subsystem caused the corruption.

    Is there a methodology for identifying this? How can I find out which component of the I/O subsystem has caused the corruption in my database?

    Additionally, I have found information on how to check which component of the I/O subsystem has caused latency in the link below.


    However, this does not address whether I/O subsystem latency causes database corruption.

  • Bad news is, probably not. I mean, do run validations on all your hardware, because that could be the issue. However, data corruption can also come from external sources: power surges caused by lightning strikes, solar flares, extrasolar radiation (that one always gets me angry). All of these can lead to corruption on storage. If your hardware isn't to blame (and check every part of it: disks, controllers, network, switches, all of which can cause issues here), then you can attempt to replicate the issue through the software by repeating, as much as you're able, what led to the corruption. If you can replicate it, that might tell you it's something other than hardware. If you can't, then it could be extrasolar radiation (or a myriad of other causes). Sometimes, we never find this stuff out.

    Generally though, your best bet is the hardware. That's the most common cause.
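    If you need more than the single 824 message to hand to the storage team, SQL Server records the affected pages itself. A rough sketch, run from a command prompt or PowerShell on the server ("." and MyDatabase are placeholders for your instance and database names):

```powershell
# Pages that have raised 823/824 errors, with the file and page IDs involved:
sqlcmd -S . -E -Q "SELECT database_id, file_id, page_id, event_type, error_count, last_update_date FROM msdb.dbo.suspect_pages"

# Full consistency check; the output names the damaged objects and pages:
sqlcmd -S . -E -Q "DBCC CHECKDB (MyDatabase) WITH NO_INFOMSGS, ALL_ERRORMSGS"
```

    The file_id/page_id pairs map to an offset within a specific data file, which your storage admins can trace to a particular volume, LUN, or disk.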

    "The credit belongs to the man who is actually in the arena, whose face is marred by dust and sweat and blood"
    - Theodore Roosevelt

    Author of:
    SQL Server Execution Plans
    SQL Server Query Performance Tuning

  • Try working with your storage admins first.


    "Ya can't make an omelette without breaking just a few eggs" 😉

  • I agree with Grant; I just wanted to add another potential cause: drivers. For example, if you update your network drivers and your database storage is on network-attached storage (NAS), you MAY get a brief interruption, and depending on how fast that interruption is, SQL may not have noticed. I've updated the network drivers on my work laptop before while running a query in SSMS over VPN, and neither SSMS nor the VPN even noticed the connection drop briefly.

    Alternatively, it could be faulty drivers, a bad failover, or gremlins. Or, although a bit rarer, it could be malicious intent. Intentionally creating corruption in a database is a lot harder to do, especially while the database is online, but it is not impossible. It could also be a virus that hit the file before getting cleaned up, or an antivirus tool that thought a specific string of bits looked suspicious and "cleaned" it up.

    Just wanted to add that it's not ALWAYS hardware, but in my experience it usually is for corruption-related issues. I recommend running all the sanity checks on disk (chkdsk), memory (memtest), and the network (I don't know of good tools for this, but I'm sure they exist for disk traffic over the network). If those come back clean, it's either software or a one-off event (cosmic radiation, for example).
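    For what it's worth, most of those sanity checks can be kicked off from an elevated prompt on Windows; a minimal sketch, assuming the data files live on D: (adjust the drive letter to your setup):

```powershell
# Online NTFS consistency scan of the data volume (no dismount needed):
chkdsk D: /scan

# Quick pass/fail SMART status for each physical disk:
wmic diskdrive get model,status

# Schedule the built-in Windows Memory Diagnostic for the next reboot:
mdsched.exe
```

    memtest86 is more thorough than the built-in memory diagnostic, but it needs a bootable USB, so the built-in tool is the lower-friction first pass.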

    Basically, try to narrow down which part of the system between the app and the disk broke. IF the disk is local (not NAS/SAN), it may not hurt to check for BIOS updates and new motherboard drivers, as it COULD be the SATA/SCSI controller that needs some attention.

    Now, with the above being said, it is possible that some logs caught the bad data and how the system tried to correct it. It wouldn't be anything in SQL, mind you; it'd be the Windows logs, or the SAN or NAS logs. I've seen entries pop up in my Windows Application event log before that made it look like my disk was starting to fail; they indicated that the health check on my C: drive had failed. In that case, clone the disk, swap it in, and see if it boots.

    I would recommend starting with hardware diagnostic tools to make sure you aren't having a hardware failure, because when an SSD starts to fail, I've always found that it starts out small (a few bits go bad), then a whole chip can go bad, and before too long the whole SSD is toast with no way to recover. Checking the SMART status on the disks (if possible) is also a good thing to do periodically.
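    On the "check the Windows logs and SMART status" point, a quick sketch of where to look from PowerShell (Get-PhysicalDisk needs Windows 8 / Server 2012 or later):

```powershell
# Recent error-level System log entries from disk/storage providers:
Get-WinEvent -FilterHashtable @{LogName='System'; Level=2} -MaxEvents 200 |
    Where-Object { $_.ProviderName -match 'disk|ntfs|stor' } |
    Format-Table TimeCreated, ProviderName, Id, Message -AutoSize

# Health and operational status (SMART-derived) for each physical disk:
Get-PhysicalDisk | Select-Object FriendlyName, MediaType, HealthStatus, OperationalStatus
```

    A disk that reports Healthy here can still be the culprit, but Warning/Unhealthy or repeated provider errors narrow the search considerably.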

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum, you shouldn't blindly follow it. Always test on a test server to see if there are negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.
