Faultfinding possible I/O issues

  • Hi all,

    Over the last few weeks, three of our secondary (log-shipped) databases have been marked 'Suspect', requiring a drop and restore. I've been advised to check the I/O and try to faultfind.

    What practices or native tools exist for SQL Server 2000 (SS2K) to get started on the investigation? If the initial diagnosis involves creating non-temp tables or objects, I would rather avoid it, as even slight changes mean raising an RFC.

    Also, would you recommend checking I/O on both Primary + Secondary servers?

  • If the secondary has gone suspect and the primary is fine, then it's the secondary's IO subsystem that's the problem.

    Start with the Windows error log, any RAID logs and any SAN logs. If you can, stop SQL Server on the secondary and run SQLIOSim (I wouldn't run it with SQL running; it generates too much load).

    Gail Shaw
    Microsoft Certified Master: SQL Server, MVP, M.Sc (Comp Sci)
    SQL In The Wild: Discussions on DB performance with occasional diversions into recoverability

    We walk in the dark places no others will enter
    We stand on the bridge and no one may pass
  • Hi Gail,

    I've done some Perfmon analysis covering the 100 seconds after each log shipping run (every 15 minutes, on the hour). So far I've only looked at the logical disk (physical tomorrow). The results for the W: drive, to which the logs are copied and from which they are restored, are below. I presume the values are milliseconds:

    Avg Disk Bytes/Read:

    - Avg = 18,199

    - Max = 26,021

    Avg Disk Bytes/Transfer:

    - Avg = 40,651

    - Max = 65,536

    Avg Disk Bytes/Write:

    - Avg = 53,696

    - Max = 65,536

  • Perfmon is not the place to look, you don't have disk performance problems, you have disk stability problems.

    And no, the figure for bytes/write is not milliseconds. It's bytes: the average number of bytes written per write operation.
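Since the counters quoted in the previous post are sizes in bytes rather than times, converting them to KB can make them easier to read. The snippet below is only an illustrative sketch using the figures posted above; the helper name `bytes_to_kb` is made up for this example.

```python
# Perfmon readings quoted in the post above: (average, maximum), all in bytes.
readings = {
    "Avg Disk Bytes/Read": (18199, 26021),
    "Avg Disk Bytes/Transfer": (40651, 65536),
    "Avg Disk Bytes/Write": (53696, 65536),
}

def bytes_to_kb(n_bytes):
    """Convert a byte count to kilobytes (1 KB = 1024 bytes)."""
    return n_bytes / 1024

for counter, (avg, peak) in readings.items():
    print(f"{counter}: avg {bytes_to_kb(avg):.1f} KB, max {bytes_to_kb(peak):.1f} KB")
```

The 65,536 maximum works out to exactly 64 KB. None of this says anything about whether the disk is stable, which is the actual question in this thread.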

    Gail Shaw
  • GilaMonster (4/12/2012)


    Perfmon is not the place to look, you don't have disk performance problems, you have disk stability problems.

    Agreed, but I don't have many immediate avenues of investigation left, so I was reaching. The event logs (Application/System) showed nothing suspicious around or immediately before the initial failure. We don't have the SAN/RAID guys in until Monday, and stopping the SQL service on the secondary, even temporarily, will require a bunch of form-filling. Actually swapping the disk out is not a lengthy procedure, but I need to make a business case for the switch, and so I need proof that the disk is unstable.

  • Nothing in any of the error logs?

    Gail Shaw
  • I hunted around in the SQL error log but couldn't make much sense of it...

    Error: 5180, Severity: 22, State: 1

    Could not open FCB for invalid file ID 0 in database 'XXXXXXXXXXXXX'
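If the error log is long, a small script can help pull out the entries worth showing to the storage team. The following is a rough sketch, assuming the log has been exported to plain text with header lines in the "Error: N, Severity: N, State: N" form shown above. The function name `find_severe_errors` and the cut-off of severity 20 (the level at which SQL Server errors are treated as fatal) are choices made for this example, not anything prescribed in the thread.

```python
import re

# Matches header lines such as "Error: 5180, Severity: 22, State: 1".
ERROR_HEADER = re.compile(r"Error:\s*(\d+),\s*Severity:\s*(\d+),\s*State:\s*(\d+)")

def find_severe_errors(lines, min_severity=20):
    """Return (error, severity, state, message) for headers at or above min_severity."""
    lines = list(lines)
    hits = []
    for i, line in enumerate(lines):
        match = ERROR_HEADER.search(line)
        if not match:
            continue
        error, severity, state = (int(g) for g in match.groups())
        if severity < min_severity:
            continue
        # Assumption: the message text is the next non-blank line after the header.
        message = next((l.strip() for l in lines[i + 1:] if l.strip()), "")
        hits.append((error, severity, state, message))
    return hits

# Sample input copied from the post above.
sample = [
    "Error: 5180, Severity: 22, State: 1",
    "",
    "Could not open FCB for invalid file ID 0 in database 'XXXXXXXXXXXXX'",
]

for hit in find_severe_errors(sample):
    print(hit)
```

This only reads a text copy of the log, so it creates no objects on the server and should not need an RFC.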

  • What about the windows event logs?

    Gail Shaw
  • GilaMonster (4/13/2012)


    What about the windows event logs?

    Zip. The Application log filled up with infomercials and doesn't stretch back that far. However, I DID check it on the morning in question (the 11th) and found nothing. The only other 'critical' error was in the System log: a Virtual Disk Service error, logged about 8 hours before and again after the restore job failed:

    "Unexpected failure. Error code: 2@0200001D"

  • Any further thoughts, anyone?

  • Both of the errors you've listed indicate there's some form of disk problem. Maybe contact the SAN vendor (assuming it's a SAN) and get them to check it out.

    Gail Shaw
  • SQLIOSIM is the tool to use to validate that an IO subsystem will properly handle SQL Server IO-style workloads.

    Best,
    Kevin G. Boles
    SQL Server Consultant
    SQL MVP 2007-2012
    TheSQLGuru on googles mail service

  • http://support.microsoft.com/default.aspx?scid=kb;en-us;815183

    The error "Could not open FCB for invalid file ID %d in database '%.*ls'" is known to cause data corruption, thread errors and runtime errors.

    Are there different service pack versions on the shipper and the shippee?

    MVDBA

  • TheSQLGuru (4/25/2012)


    SQLIOSIM is the tool to use to validate that an IO subsystem will properly handle SQL Server IO-style workloads.

    But SQL needs to be stopped when running that. The aim is to validate the IO subsystem, not slaughter it.

    Gail Shaw
  • By the way:

    FCB stands for File Control Block, the physical file structure used by SQL to write data to and read data from the storage.

    I've had these before when I defragged a database and the log shipping made the same changes on the target.

    MVDBA