OS error 665 - Not CHECKDB

  • Hi,

    SQL Server 2008 R2 Enterprise Edition - 10.50.6220 on a 2 node Windows 2008 R2 cluster.  It's a dedicated host running only this database.
    192GB RAM
    5TB database, 120 filegroups, monthly partitions.

    Last night, one of our databases started receiving the following messages:

    "The operating system returned error 665 (failed to retrieve the text for this error.  Reason: 15105) to SQL Server during a write at offset 0x00000ad1c8c00 in file 'X:\myfile.ndf:MSSQL_DBCC8'
    "Error 17053, Severity: 16, State: 1"

    I have restored the primary filegroup and affected filegroup to a separate instance, run DBCC CHECKDB on it, and it comes back clean.  The error was consistently being written to the SQL Server error log for 30 mins, but has stopped for now.
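For anyone wanting to pull the useful fields out of the error-log line quoted above, here is a minimal sketch. It assumes the message always follows the wording shown in this post; the pattern, function name and dictionary keys are illustrative, not part of any SQL Server tooling.

```python
import re

# Pattern for the error-log line quoted in the post; group names are our own labels.
LOG_PATTERN = re.compile(
    r"operating system returned error (?P<os_error>\d+)"
    r".*?during a (?P<operation>read|write) at offset (?P<offset>0x[0-9a-fA-F]+)"
    r" in file '(?P<path>[^']+)'",
    re.DOTALL,
)


def parse_os_error(line):
    """Return the OS error number, operation, byte offset and file path, or None."""
    match = LOG_PATTERN.search(line)
    if match is None:
        return None
    return {
        "os_error": int(match.group("os_error")),
        "operation": match.group("operation"),
        "offset": int(match.group("offset"), 16),  # hex string -> bytes
        "path": match.group("path"),
    }


sample = (
    "The operating system returned error 665 (failed to retrieve the text "
    "for this error.  Reason: 15105) to SQL Server during a write at offset "
    "0x00000ad1c8c00 in file 'X:\\myfile.ndf:MSSQL_DBCC8'"
)
info = parse_os_error(sample)
print(info["os_error"], info["operation"], info["path"])
print(f"offset = {info['offset']:,} bytes ({info['offset'] / 2**30:.2f} GiB)")
```

Converting the offset to a byte count makes it easier to compare against the size of the data file the stream is attached to.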

    Typically I would associate this error with DBCC CHECKDB and sparse files.  However, no CHECKDB was running when the errors were generated.  The SPID associated with the error belongs to an overnight delete job that removes data older than 18 months.  Last night it appears to have been deleting data from the affected partition, which contributed to the 665 error.  The delete job itself is quite considerate: it pauses if there is a high number of concurrent connections, deletes only in batches of 500, runs only between certain hours, etc.  There's plenty of space left in the file, so there were no autogrowth events.  We pre-size our files to 100GB with an autogrowth setting of 1GB, but that rarely gets hit.  In short, I would not expect there to be much fragmentation.
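To illustrate the batching pattern described here, a rough sketch follows. It is not the actual job: it uses an in-memory SQLite database as a stand-in for SQL Server, the `events` table and `created` column are hypothetical, and the pauses for concurrent connections and permitted hours are left out.

```python
import sqlite3


def purge_in_batches(conn, cutoff, batch_size=500):
    """Delete rows older than `cutoff` in small batches, committing after each."""
    total = 0
    while True:
        cur = conn.execute(
            "DELETE FROM events WHERE rowid IN "
            "(SELECT rowid FROM events WHERE created < ? LIMIT ?)",
            (cutoff, batch_size),
        )
        conn.commit()  # keep each transaction short
        if cur.rowcount == 0:
            break
        total += cur.rowcount
    return total


# Hypothetical data: 1200 old rows and 300 recent rows.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created TEXT)")
conn.executemany(
    "INSERT INTO events (created) VALUES (?)",
    [("2015-06-01",)] * 1200 + [("2017-03-01",)] * 300,
)
conn.commit()

deleted = purge_in_batches(conn, cutoff="2015-09-15")
remaining = conn.execute("SELECT COUNT(*) FROM events").fetchone()[0]
print(deleted, remaining)
```

Committing after every batch keeps each transaction small, which is the point of deleting 500 rows at a time rather than the whole range at once.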

    Very recently we extended some of the drives on our servers, and last night the SAN was levelling the disks at approximately the same time as this delete job was running, so I am fairly convinced the two are linked.

    Has anyone encountered anything similar?  I have read Bob Dorr's post on 665 errors not being limited to DBCC CHECKDB, but none of the other scenarios listed seem to fit our event, and I'd like to pinpoint this categorically.

    Thanks

    Andrew

  • adb2303 - Wednesday, March 15, 2017 9:41 AM

    Hi,

    SQL Server 2008 R2 Enterprise Edition - 10.50.6220 on a 2 node Windows 2008 R2 cluster.  It's a dedicated host running only this database.
    192GB RAM
    5TB database, 120 filegroups, monthly partitions.

    Last night, one of our databases started receiving the following messages:

    "The operating system returned error 665 (failed to retrieve the text for this error.  Reason: 15105) to SQL Server during a write at offset 0x00000ad1c8c00 in file 'X:\myfile.ndf:MSSQL_DBCC8'
    "Error 17053, Severity: 16, State: 1"

    I have restored the primary filegroup and affected filegroup to a separate instance, run DBCC CHECKDB on it, and it comes back clean.  The error was consistently being written to the SQL Server error log for 30 mins, but has stopped for now.

    Typically I would associate this error with DBCC CHECKDB and sparse files.  However, no CHECKDB was running when the errors were generated.  The SPID associated with the error belongs to an overnight delete job that removes data older than 18 months.  Last night it appears to have been deleting data from the affected partition, which contributed to the 665 error.  The delete job itself is quite considerate: it pauses if there is a high number of concurrent connections, deletes only in batches of 500, runs only between certain hours, etc.  There's plenty of space left in the file, so there were no autogrowth events.  We pre-size our files to 100GB with an autogrowth setting of 1GB, but that rarely gets hit.  In short, I would not expect there to be much fragmentation.

    Very recently we extended some of the drives on our servers, and last night the SAN was levelling the disks at approximately the same time as this delete job was running, so I am fairly convinced the two are linked.

    Has anyone encountered anything similar?  I have read Bob Dorr's post on 665 errors not being limited to DBCC CHECKDB, but none of the other scenarios listed seem to fit our event, and I'd like to pinpoint this categorically.

    Thanks

    Andrew

    Since these are file system errors, they could very well be related to the leveling, but I'm not sure how you could prove it conclusively. Have you checked with whoever manages the storage? Have you checked the Windows Event logs?
    Depending on how the SAN is configured, another option is to check the logs of other servers that could be using the same pool/disks and see if they have the same errors at the same time.

    Sue
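One way to test the suspected link is to line up the error timestamps against the SAN leveling windows. The sketch below assumes you can export both as timestamps; the function name and all the sample times are made up for illustration and are not taken from the logs in this thread.

```python
from datetime import datetime, timedelta


def fraction_in_windows(error_times, windows):
    """Fraction of error timestamps that fall inside any (start, end) window."""
    if not error_times:
        return 0.0
    hits = sum(
        any(start <= t <= end for start, end in windows) for t in error_times
    )
    return hits / len(error_times)


# Hypothetical leveling window and error timestamps.
leveling = [(datetime(2017, 3, 14, 23, 0), datetime(2017, 3, 15, 1, 0))]
errors = [
    datetime(2017, 3, 14, 23, 30) + timedelta(minutes=5 * i) for i in range(6)
]
later = [datetime(2017, 3, 16, 6, 15)]  # an error outside any window

print(fraction_in_windows(errors, leveling))
print(fraction_in_windows(errors + later, leveling))
```

A fraction close to 1 over several nights would support the link; errors that keep appearing outside every window would argue against it. The same check can be repeated with error times from other servers on the same pool.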

  • Hi,

    After further investigation, the write failures don't appear to correlate with SAN levelling.  It happened again about 40 minutes ago (nothing out of the ordinary was happening on the SAN), initiated by Service Broker, on the same partition that had previously reported errors.

    A couple of things to note:

    Available memory is low and the server's paging a bit... could this be related?
    The write failure is against 'X:\myfile.ndf:MSSQL_DBCC8'.

    Does anyone know what the :MSSQL_DBCC8 suffix is for?  When I look at Resource Monitor on the server, I can see lots of database files with this suffix.  The write failures have all been against this ndf with this suffix.  We don't use snapshot isolation, versioning or anything like that.  No DB growths, no index rebuilds/reorgs running.  Definitely nothing like CHECKDB running, though I can't help but feel it's some sort of NTFS sparse file limitation (a hunch, but I have been wrong before).

    The drive is a 6TB volume with 1.85 TB free.

    Again, today, we restored the PRIMARY, INDICES and affected filegroups to another instance and ran CHECKDB on it; it came back fine.  We ran the same delete job that has been generating the OS 665 errors against the restored version, and it completed with no errors.  We have disabled the scheduled job for tonight, and I have scheduled a DBCC CHECKFILEGROUP of the affected filegroup and adjacent filegroups, including PRIMARY and INDICES.  Tomorrow we will try to manually delete records (we prune anything over 18 months old) from this partition, which should be about 1.5 million rows.  We might get a clearer idea of what's going on if we can observe it happening.

  • adb2303 - Thursday, March 16, 2017 6:58 AM

    Hi,

    After further investigation, the write failures don't appear to correlate with SAN levelling.  It happened again about 40 minutes ago, initiated by Service Broker, on the same partition that had previously reported errors.

    A couple of things to note:

    Available memory is low and the server's paging a bit... could this be related?
    The write failure is against 'T:\MSSQL\DATADIR\XXXXX_109.ndf:MSSQL_DBCC8'.

    Does anyone know what the :MSSQL_DBCC8 suffix is for?  When I look at Resource Monitor on the server, I can see lots of database files with this suffix.  The write failures have all been against this ndf with this suffix.  We don't use snapshot isolation, versioning or anything like that.  No DB growths, no index rebuilds/reorgs running.  Definitely nothing like CHECKDB running, though I can't help but feel it's some sort of NTFS sparse file limitation (a hunch, but I have been wrong before).

    The drive is a 6TB volume with 1.85 TB free.

    Again, today, we restored the PRIMARY, INDICES and 109 filegroups to another instance and ran CHECKDB on it; it came back fine.  We ran the same delete job that has been generating the OS 665 errors against the restored version, and it completed with no errors.  We have disabled the scheduled job for tonight, and I have scheduled a DBCC CHECKFILEGROUP of the affected filegroup and adjacent filegroups, including PRIMARY and INDICES.  Tomorrow we will try to manually delete records (we prune anything over 18 months old) from this partition, which should be about 1.5 million rows.  We might get a clearer idea of what's going on if we can observe it happening.

    The :MSSQL_DBCC8 suffix is appended to the internal snapshot used by DBCC.
    Are you using Diskeeper?

    Sue

  • Hi,

    No, we're not using Diskeeper...

    Which DBCC commands would be constantly creating the DBCC8 files?  If they're snapshots, then this 665 issue could be the same one you see with CHECKDB, i.e. sparse files...

  • adb2303 - Thursday, March 16, 2017 7:44 AM

    Hi,

    No, we're not using Diskeeper...

    Which DBCC commands would be constantly creating the DBCC8 files?  If they're snapshots, then this 665 issue could be the same one you see with CHECKDB, i.e. sparse files...

    DBCC CHECKDB. I've never heard of it constantly creating files, though.

    Sue

  • That's the strange thing... we're not running CHECKDB (or anything remotely like it), yet they're constantly popping up.  There are no reads from them, only writes.  For every ndf file there may be a MSSQL_DBCC8 file, but there are no MSSQL_DBCC8 files without a corresponding ndf file... weird.

  • So, they're definitely associated with snapshot files, named <filename.extension>:MSSQL_DBCC<database_id_of_snapshot>

    https://support.microsoft.com/en-gb/help/2974455/dbcc-checkdb-behavior-when-the-sql-server-database-is-located-on-an-refs-volume

    Something other than CHECKDB is creating them, though...
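The naming convention above can be split apart mechanically. This is a small sketch that assumes every stream name follows the `<filename.extension>:MSSQL_DBCC<database_id_of_snapshot>` form exactly; the function and pattern names are illustrative.

```python
import re

# <filename.extension>:MSSQL_DBCC<database_id_of_snapshot>
STREAM_PATTERN = re.compile(r"^(?P<data_file>.+):MSSQL_DBCC(?P<database_id>\d+)$")


def split_snapshot_stream(path):
    """Return (data_file, snapshot_database_id), or None if there is no DBCC suffix."""
    match = STREAM_PATTERN.match(path)
    if match is None:
        return None
    return match.group("data_file"), int(match.group("database_id"))


print(split_snapshot_stream(r"X:\myfile.ndf:MSSQL_DBCC8"))
print(split_snapshot_stream(r"X:\myfile.ndf"))
```

Running this over the file names seen on the server gives the set of data files that currently have a snapshot stream and the id each one carries.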

  • adb2303 - Thursday, March 16, 2017 8:40 AM

    So, they're definitely associated with snapshot files, named <filename.extension>:MSSQL_DBCC<database_id_of_snapshot>

    https://support.microsoft.com/en-gb/help/2974455/dbcc-checkdb-behavior-when-the-sql-server-database-is-located-on-an-refs-volume

    Something other than CHECKDB is creating them, though...

    Yes, as I said, they are for DBCC snapshots. I have never heard of them being created for anything other than CHECKDB or subsets of it such as CHECKALLOC, CHECKTABLE, etc. Maybe check the default trace files and see if something is in there.

    Sue

  • Thanks.

    There's nothing in the default trace, unfortunately.
