Nutanix - 'misaligned log IOs which required falling back to synchronous IO’

  • Hi

    I have restored a SQL database onto a nutanix VM, the disks are formatted to 64k.

    When restoring or running I get error

    Error: ‘misaligned log IOs which required falling back to synchronous IO’

    The database was restored off an old Dell physical 710 server that had disks set to bytes per sector &physical sector of 512 bytes, however I can't understand if doing a restore to nutanix  disks with 64k cluster size would cause this.

    Thoughts welcome...

  • When is that error thrown?  Is it thrown during the database restore or at some other point?

    Quick google of the error is being thrown by an AG during a sync process.  If the server is part of an AG (availability group), and the block size is different between the systems on the AG, that is going to be the problem.

    I suspect this is the case and that it is actually unrelated to to Nutanix specifically.

    To "fix" it, you will need to migrate all of the SQL instances in the AG to have the same block size.

    Sources for this:

    https://sqlundercover.com/2019/07/08/error-misaligned-log-ios-which-required-falling-back-to-synchronous-io-and-slow-ag-synchronisation/

    https://www.mssqltips.com/sqlservertip/5942/sql-server-misaligned-log-ios-which-required-falling-back-to-synchronous-io/

    If it is NOT part of an AG, then I'd be curious to know when the error is being thrown.  Is it during the RESTORE DATABASE command?  Does it restore successfully but errors are appearing in the logs?  etc.

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • Hi

     

    It seems to occur in the last part of the restore as the last part of the log is being laid out on disk. The database appears to work OK. We did originally have this database in an AG but got these errors with it running in the AG. So we  then broke the AG, deleted the databases on the secondary, did a restore as a stand-alone box and still had the issue.

    This is a SQL 2014 box with SP3 CU3 + Feb 2020 Security GDR on it ( 2014.120.6118.4  ).  The old physical box is same SQL version.

    The weird thing is, we have migrated other SCOM SQL databases from a physical server of 2013  (Dell 720) server across to Nutanix, and we haven't seen this issue.

    As a double-check, I restored the database to another SQL 2014 server with same disk structure as the old SCOM physical server - without issue.  I have never seen SQL complain about disk structure before. Missing or offline disks...yes.

    The old physical disks from Dell 720  were of this format:

    Byes per sector  512

    Bytes per physical sector  512

    Bytes per cluster    4096

     

    The new disks on brand new Nutanix kit are :

    Byes per sector  512

    Bytes per physical sector  4096

    Bytes per cluster    65536

     

    This is the restore process . Bear in mind we also saw huge numbers of these errors also when the db was running in an AG.

    Can the SQL log be fussy about the underlying disk?

    06/03/2020 17:49:42,Backup,Unknown,RESTORE DATABASE successfully processed 87258190 pages in 3013.308 seconds (226.231 MB/sec).

    06/03/2020 17:49:42,Backup,Unknown,Database was restored: Database: DATABASE<c/> creation date(time): 2016/01/25(11:35:26)<c/> first LSN: 3977393:824661:1<c/> last LSN: 3977507:504922:1<c/> number of dump devices: 1<c/> device information: (FILE=1<c/> TYPE=DISK: {'DB.BAK'}). Informational message. No user action required.

    06/03/2020 17:49:42,Backup,Unknown,Restore is complete on database 'DATABASE'.  The database is now available.

    06/03/2020 17:49:41,spid56,Unknown,CHECKDB for database 'DATABASE' finished without errors on 2020-05-30 01:13:19.850 (local time). This is an informational message only; no user action is required.

    06/03/2020 17:49:40,spid56,Unknown,Recovery completed for database DATABASE (database ID 5) in 556 second(s) (analysis 265 ms<c/> redo 177276 ms<c/> undo 279714 ms.) This is an informational message only. No user action is required.

    06/03/2020 17:49:40,spid56,Unknown,last target outstanding: 7440<c/> avgWriteLatency 31

    06/03/2020 17:49:40,spid56,Unknown,average writes per second: 1816.24 writes/sec <nl/>            average throughput: 198.59 MB/sec<c/> I/O saturation: 61676<c/> context switches 73076

    06/03/2020 17:49:40,spid56,Unknown,FlushCache: cleaned up 2527893 bufs with 180614 writes in 99444 ms (avoided 0 new dirty bufs) for db 5:0

    06/03/2020 17:48:01,spid56,Unknown,Recovery is writing a checkpoint in database 'DATABASE' (5). This is an informational message only. No user action is required.

    06/03/2020 17:46:25,spid56,Unknown,There have been 204544 misaligned log IOs which required falling back to synchronous IO.  The current IO is on file F:\\DATABASE.ldf.

    06/03/2020 17:45:25,spid56,Unknown,There have been 169728 misaligned log IOs which required falling back to synchronous IO.  The current IO is on file F:\\DATABASE.ldf.

    06/03/2020 17:44:25,spid56,Unknown,There have been 118784 misaligned log IOs which required falling back to synchronous IO.  The current IO is on file F:\\DATABASE.ldf.

    06/03/2020 17:43:25,spid56,Unknown,There have been 85504 misaligned log IOs which required falling back to synchronous IO.  The current IO is on file F:\\DATABASE.ldf.

    06/03/2020 17:43:21,spid56,Unknown,The tail of the log for database DATABASE is being rewritten to match the new sector size of 4096 bytes.  3072 bytes at offset 169079190528 in file F:\\DATABASE.ldf will be written.

    06/03/2020 17:42:24,spid56,Unknown,There have been 66816 misaligned log IOs which required falling back to synchronous IO.  The current IO is on file F:\\DATABASE.ldf.

    06/03/2020 17:41:24,spid56,Unknown,There have been 50176 misaligned log IOs which required falling back to synchronous IO.  The current IO is on file F:\\DATABASE.ldf.

    06/03/2020 17:40:24,spid56,Unknown,There have been 34560 misaligned log IOs which required falling back to synchronous IO.  The current IO is on file F:\\DATABASE.ldf.

    06/03/2020 17:40:23,spid56,Unknown,Starting up database 'DATABASE'.

    06/03/2020 17:40:19,spid56,Unknown,last target outstanding: 4698<c/> avgWriteLatency 38

    06/03/2020 17:40:19,spid56,Unknown,average writes per second: 1654.73 writes/sec <nl/>            average throughput: 214.46 MB/sec<c/> I/O saturation: 113800<c/> context switches 144653

    06/03/2020 17:40:19,spid56,Unknown,FlushCache: cleaned up 3912762 bufs with 235854 writes in 142533 ms (avoided 0 new dirty bufs) for db 5:0

    06/03/2020 17:30:13,spid56,Unknown,The database 'DATABASE' is marked RESTORING and is in a state that does not allow recovery to be run.

    06/03/2020 17:30:13,spid56,Unknown,Starting up database 'DATABASE'.

    • This reply was modified 1 year, 7 months ago by  sqlguy7777.
  • Update

    ======

     

    We think we may have found the cause - this sytem was originally created on SQL 2008 R2 and later upgraded to SQL2014.

    SQL 2008 R2 system databases apparently were set up with 512 bytes per sector, 512 bytes per physical sector hard coded disk compatibility. I suspect when the SCOM system was upgraded, it kept this internal configuration limitation in SQL 2014.

    https://support.microsoft.com/en-au/help/2624272/fix-error-9013-when-you-try-to-start-the-sql-server-services-after-you

    https://docs.microsoft.com/en-us/archive/blogs/psssql/after-applying-sql-server-2008-r2-sp1-error-9013-is-logged-the-tail-of-the-log-for-database-ls-is-being-rewritten-to-match-the-new-sector-size-of-d-bytes

    https://docs.microsoft.com/en-us/archive/blogs/psssql/sql-server-new-drives-use-4k-sector-size

    I'm considering using the SSMS database copy utility, to copy the database from current old physical server  to a new virtual server, so it re-creates the database exactly on new kit, but from the ground up without disk structure limits.

    Thoughts welcome....

    Thanks.

     

     

  • After the restore completes, are you still seeing those errors?  The line:

    06/03/2020 17:43:21,spid56,Unknown,The tail of the log for database DATABASE is being rewritten to match the new sector size of 4096 bytes.  3072 bytes at offset 169079190528 in file F:\\DATABASE.ldf will be written.

    in the log makes me think it fixed the log file for the filesystem type change.  If those errors only occur during the restore, it is likely safe to ignore them.  If the errors continue after the restore has completed, might not hurt to do things fresh as it sounds to me like the server thinks the AG is still set up (mind you I have not set up or torn down an AG before, so I could be completely out to lunch on this).

    It could be something with using an in-place upgrade of your SQL instances?  I have done migration installs which I prefer as it gives me a chance to review settings and make sure that everything is set up the way I am expecting, plus it allows for fail back in the event something hits the fan.  It MIGHT be related to doing an in-place upgrade.

    As for a database copy utility, it is not a tool I have used before, so I cannot speak to its value.  My thought though is if it builds the database from the ground up and makes it identical on the new instance, what if the problem is some weird config issue and it pulls that config across?  What may be a safer option would be to script out the objects (or even better - if they are in source control, pull out the scripts and run them against the new system) and use something like SSIS to pull the data across.  Or if you have a license for SQL compare and SQL Data Compare, you can use those tools to pull everything across.  The advantage of this method is if there is a configuration issue on the database, it should correct it.  My concern with the SSMS Database Copy Utility is that I don't know what it pulls across.  You MAY be pulling everything across.

    BUT as I said at the start of my reply, if the error stops after the restore completes, it is likely nothing to worry about as it sounds like it may have fixed itself?

    The above is all just my opinion on what you should do. 
    As with all advice you find on a random internet forum - you shouldn't blindly follow it.  Always test on a test server to see if there is negative side effects before making changes to live!
    I recommend you NEVER run "random code" you found online on any system you care about UNLESS you understand and can verify the code OR you don't care if the code trashes your system.

  • Hi

    I have experienced a similar issue. Production Availability Group is on a SAN with 512 bytes per physical sector.  We have no issue in the AG but when we restore backups of these databases to our other servers which are 4096 bytes per physical sector we experience the same errors you have shown - ‘misaligned log IOs which required falling back to synchronous IO’.

    This is causing us huge issues because it massively increases the duration of the restore.  It seems to be worse when multiple databases are restoring at the same time.  This is only happening since we moved our production servers to this new SAN which is configured with the 512 byte sectors.  All servers running SQL Server 2017.

    All documentation says this is an issue within availability groups, mirroring or log shipping but I believe restoring a database to a server of a different sector size is technically the same issue.  When this is found within an AG, MS recommends enabling trace flag 1800 at startup however I do not have another 512 byte sector server to test this with before applying it to production. I also can't find enough documentation about this trace flag to be confident that it would not cause any other issues.

    Re copying the database... I actually tested restoring to a 4096 server (issue with misaligned IOs occurred as expected), I took a backup of it and then restored it again... no misaligned IO errors.  This is not a suitable solution for my issue unfortunately but it was worth the test.

    Really keen to get to the bottom of this.

    Any suggestions welcome

Viewing 6 posts - 1 through 5 (of 5 total)

You must be logged in to reply to this topic. Login to reply