Server losing access to disks

  • Hi,

    Last Friday we installed Incipient on our SAN switches. From what I know, it decouples the OS from the SAN and lets you move data around without making OS changes. I'm not a SAN expert, so I won't pretend to know the details. We have two paths to the arrays (two switches). They installed on one switch and then the other. At that point our production server lost access to tempdb, with these error messages:

    - - - -

    SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file [J:\Logs\Budget_log.ldf] in database [Budget] (7). The OS file handle is 0x0000000000000B78. The offset of the latest long I/O is: 0x000000c4e9d400

    For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

    - - - -

    LogWriter: Operating system error 21(The device is not ready.) encountered.

    For more information, see Help and Support Center at http://go.microsoft.com/fwlink/events.asp.

    - - - -

    The log for database 'tempdb' is not available. Check the event log for related error messages. Resolve any errors and restart the database.

    - - - -

    SQL 2005 restarted itself. It then happened again today, with a slightly different message, on another server on the same array.

    -----

    17053 :

    LogWriter: Operating system error 1784(The supplied user buffer is not valid for the requested operation.) encountered.

    -----

    I thought it was just an issue at install time, but now it's recurring, which is a major problem. SQL 2000 just sat there until I restarted the SQL service.

    Does anyone have insight on this? I'm sure our SAN guys will have some ideas, but I was curious.

    Thanks
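
To check whether the long-I/O warnings cluster around the switch work, the messages can be tallied per file. This is only a minimal Python sketch, assuming the error log has been exported as plain text. The function name `tally_long_io` and the sample input are illustrative, and the pattern is built solely from the message wording quoted above, so other SQL Server versions may need a different pattern.

```python
import re
from collections import Counter

# Pattern built from the warning text quoted in this thread; other SQL Server
# versions may word the message differently (assumption).
LONG_IO = re.compile(
    r"encountered (?P<count>\d+) occurrence\(s\) of I/O requests taking longer "
    r"than (?P<secs>\d+) seconds to complete on file \[(?P<file>[^\]]+)\] "
    r"in database \[(?P<db>[^\]]+)\]"
)


def tally_long_io(log_text):
    """Return a Counter mapping (database, file) to total long-I/O occurrences."""
    totals = Counter()
    for match in LONG_IO.finditer(log_text):
        totals[(match["db"], match["file"])] += int(match["count"])
    return totals


if __name__ == "__main__":
    # Sample input copied from the message in this post.
    sample = (
        "SQL Server has encountered 1 occurrence(s) of I/O requests taking "
        "longer than 15 seconds to complete on file [J:\\Logs\\Budget_log.ldf] "
        "in database [Budget] (7)."
    )
    for (db, path), n in tally_long_io(sample).items():
        print(f"{db}: {path}: {n}")
```

Grouping by file rather than by database keeps a slow log LUN distinguishable from a slow data LUN.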

  • We had one issue like that. Thankfully it was on our test environment. One of our administrators changed some settings on the switch and the switch had to be rebooted.

    One thing you can look for is whether the switch is getting rebooted for some reason. If it is, the disks go offline, and that could trigger this.

    -Roy

  • You did test this first? I would have thought you would have installed a storage virtualisation product with your SQL Servers safely shut down. I don't know anything about this product, but a SAN is no different from any other network, and if you start messing around with the switches you'll lose connectivity. You should check whether this software is certified for use with SQL Server. A properly set up SAN is great, but it doesn't take much induced latency to screw things up big time.

    I'd raise a call with the vendors.

    The GrumpyOldDBA
    www.grumpyolddba.co.uk
    http://sqlblogcasts.com/blogs/grumpyolddba/

  • I would recommend measuring Avg. Disk sec/Read, Avg. Disk sec/Write, and Avg. Disk Queue Length in Performance Monitor. You can use SQLH2 to help capture this information over a period of time. I would also suggest using IOMeter or SQLIOSim to benchmark performance.

    For more information -

    http://www.sql-server-performance.com/tips/monitor_io_counters_p1.aspx

    http://www.sqlteam.com/article/benchmarking-disk-io-performance-size-matters
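
Once the counters have been captured over a period, a quick summary makes slow spells easier to spot. This is a rough Python sketch, assuming the Avg. Disk sec/Read or sec/Write samples have already been loaded as numbers in seconds. The function name `summarize_latency` is hypothetical, and the 20 ms default threshold is only an assumed rule of thumb, not a vendor figure, so adjust it for your storage.

```python
from statistics import mean


def summarize_latency(samples_sec, threshold_sec=0.020):
    """Summarize disk latency samples given in seconds.

    threshold_sec is an assumed rule-of-thumb limit (20 ms), not an official value.
    """
    if not samples_sec:
        raise ValueError("no samples supplied")
    # Samples strictly above the threshold count as slow.
    slow = [s for s in samples_sec if s > threshold_sec]
    return {
        "mean_ms": mean(samples_sec) * 1000,
        "max_ms": max(samples_sec) * 1000,
        "slow_fraction": len(slow) / len(samples_sec),
    }


if __name__ == "__main__":
    # Made-up sample values, for illustration only.
    print(summarize_latency([0.005, 0.010, 0.045, 0.020]))
```

Comparing the summary from before and after a switch change would show whether the change added latency on the affected path.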

  • Supposedly, the SAN software only affects disk areas that have been 'imported' into it. Production systems were not imported, yet they are now being affected. We had one hiccup during the install, which should not have happened, since that server was not being imported but did use the switch being updated. The server we had a problem with today is a virtual server, so we think it may be something else: if the virtual data had been lost, Windows would not be happy, never mind SQL. Virtual data is stored in one location for all drives.

  • Colin,

    Yes, we did shut down the dev servers that were being affected by the install. The production servers should have taken an alternate route through the MDS that wasn't being rebooted.

    thanks.

  • The virtual server problem turned out to be a disaster recovery backup snapshot happening while SQL was trying to write.

    The vendor says he highly suspects a misconfiguration of our SAN multipathing. Thanks for your help!

  • We are experiencing very similar issues with our SAN. What type of SAN are you running? Also, are you using a 32-bit or 64-bit OS?

  • 64-bit SQL 2005

    We are using HP EVAs

    It turned out that we also had a problem with a virtual SQL Server: the backup software conflicted with SQL and caused it to lose access to the data files.

  • Are you using LiteSpeed for your backups? That's what we're using and we're trying to determine if this is a LiteSpeed issue or if we have some hardware issue. We have an IBM SAN and initially thought that was the problem, but apparently not.

  • We're using another product, not LiteSpeed.

  • Hi Guys,

    I have a client experiencing these errors in their event logs.

    Here are some notes:

    - Lost access to the D: (DATA) drive. The C: (OS) drive is fine.

    - StorageCraft snapshot completed about 10 minutes before Event ID: 17053 - LogWriter: Operating system error 21(error not found) encountered.

    - In Windows Disk Management, C: is fine and D: appears OK, but it does not have its drive letter (D:). It does have its 'DATA' label, though.

    - Other than the drive letter issue, the partition presents ok.

    - I have run Checkdisk via Acronis Disk Director Server and it has completed fine with no errors/bad sectors.

    - Acronis presents C: and D: as if there is no problem at all. I can see all D: attributes fine as well.

    This is a single server with a 70GB OS partition on RAID 1 and a 300GB DATA partition on a 3-disk RAID 5.

    Does this sound like the SAN issues you guys have been having?

    Did anyone find a fix for their issue?

    Any help is greatly appreciated.

    Michael

  • Please post new questions in a new thread. Thank you.

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass

  • Errors are related to this thread. Thanks.
