Weird I/O Problem with server

  • Across our environment we are seeing a sporadic issue on our database servers. It first appeared about 4 months ago and is only resolved by rebooting the server. It has never appeared on the same server twice, and it has now happened on half a dozen servers. It has occurred on a variety of server types, two different OSes, and two different versions of SQL Server.

    The issue is always the same: we start to notice high CPU utilization and poor performance in SQL Server. The smallest action, even when nothing else is hitting the database, causes the CPU to jump. After checking all of the hardware, OS, and SQL logs and finding nothing, we decided to run IOMeter. We can now conclude that the issue always has the same symptom: IOPS drop by a huge amount (for example, 15k write IOPS drops to 2k). Note that this happens when nothing else is going on on the server. When this happens, we have tried everything we could think of to resolve it: stopping all services including SQL Server, resetting the cache policy on the disks, and monitoring all file access and processes in Process Explorer.

    There's never a pattern; it always just "starts happening" at various points in the day. We have disabled the antivirus to eliminate it as a factor. The hardware and software vary. The only consistent thing is that this happens only on our database servers and never on any other server.

    ---

    4 out of the 6 servers had this configuration:

    Dell PE R720

    RAID 1 - OS/Logs (2x 15k rpm drives)

    RAID 6 - Data (14x 15k rpm drives)

    Windows Server 2008 R2 SP1

    SQL Server 2008 R2 SP2

    96GB Memory

    1x 6-core 2.5 GHz CPU

    ----

    One scenario was on an HP server with Windows Server 2003 running SQL Server 2000 against a SAN.

    ----

    One scenario was Windows Server 2003 running SQL Server 2008 R2 against 4 spinning disks in RAID 6.

    Has anyone seen this problem before? If so, were you able to find the cause and implement a permanent fix rather than just rebooting? We are probably going to contact Microsoft to open a support case, but I figured I would see if any other sysadmins and/or SQL admins have seen this before. It's definitely a strange one.
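For anyone wanting a quick check between IOMeter runs, here is a minimal Python sketch of a synced small-write probe. It is not a replacement for IOMeter: it rewrites one block of a scratch file and forces each write to disk, so the number is only useful for comparing the same server against its own healthy baseline. The function names, the 8 KB block size, and the 2-second duration are illustrative assumptions, not anything taken from the original poster's setup.

```python
import os
import tempfile
import time


def measure_write_iops(directory, block_size=8192, duration_s=2.0):
    """Return synced small-block writes per second to a scratch file in `directory`.

    Hypothetical helper: block_size and duration_s are illustrative defaults.
    """
    if duration_s <= 0:
        raise ValueError("duration_s must be positive")
    payload = os.urandom(block_size)
    fd, path = tempfile.mkstemp(dir=directory)
    writes = 0
    try:
        start = time.perf_counter()
        deadline = start + duration_s
        while time.perf_counter() < deadline:
            os.lseek(fd, 0, os.SEEK_SET)  # rewrite the same block each time
            os.write(fd, payload)
            os.fsync(fd)  # flush OS buffers so the write reaches the storage stack
            writes += 1
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
        os.remove(path)
    return writes / elapsed


def drop_ratio(baseline_iops, current_iops):
    """Fraction of baseline throughput lost (0.0 means no drop)."""
    if baseline_iops <= 0:
        raise ValueError("baseline_iops must be positive")
    return 1.0 - current_iops / baseline_iops
```

Calling `measure_write_iops` on the data volume while a server is healthy, then again during an incident, gives two figures to pass to `drop_ratio`. With the numbers quoted above, `drop_ratio(15000, 2000)` comes out at roughly 0.87.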

  • Did you check whether this was caused by any Windows, software, or driver updates on the server? Also, did you check the SQL Server error logs and the Windows Application and System logs during the issue?

    --

    SQLBuddy

  • We didn't see any patches that would be related to this problem. There aren't any error log or Windows event log entries that indicate what is causing the IOPS to drop so drastically.

  • john_c_deprato (4/3/2014)


    We didn't see any patches that would be related to this problem. There aren't any error log or Windows event log entries that indicate what is causing the IOPS to drop so drastically.

    Do you do anything common across these servers, like pushing some software to all of them? Or do they have anything else in common, like third-party software such as a backup agent?

    --

    SQLBuddy

  • We went down the path of commonality, but nothing jumped out. We could always find something that was different that would make the case against commonality.

  • john_c_deprato (4/3/2014)


    We went down the path of commonality, but nothing jumped out. We could always find something that was different that would make the case against commonality.

    Do you anticipate this happening on any server? If so, you can run Perfmon with CPU counters and Profiler in parallel to do some troubleshooting.

    --

    SQLBuddy
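If counters are being logged anyway, the moment a drop begins can be picked out of the samples automatically. The sketch below assumes a plain list of per-interval write rates (for example, values exported from Perfmon's PhysicalDisk "Disk Writes/sec" counter); the function name, the window size, and the 50% threshold are hypothetical choices for illustration.

```python
from statistics import median


def find_onset(samples, window=5, drop_fraction=0.5):
    """Return the index of the first sample that falls below
    drop_fraction times the median of the preceding `window` samples,
    or None if no such sample exists.

    Hypothetical helper: window and drop_fraction are illustrative.
    """
    for i in range(window, len(samples)):
        baseline = median(samples[i - window:i])
        if baseline > 0 and samples[i] < drop_fraction * baseline:
            return i
    return None


# Illustrative data only: steady ~15k writes/sec, then a sudden fall.
writes_per_sec = [15000, 14800, 15100, 14900, 15050, 15000, 2000, 2100]
onset_index = find_onset(writes_per_sec)
```

Using a median of the preceding window rather than a fixed threshold keeps the check usable across servers with different healthy baselines. The returned index can be mapped back to the sample's timestamp and lined up against CPU counters and Profiler traces.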

  • We have RedGate SQL Monitor so making sure we know when the problem occurs isn't really an issue. We were just hoping someone else may have seen this problem before and may have a solution of some kind.

  • In the admin tool for your Dell RAID controllers, have you checked whether the batteries are showing warnings? I've seen Dell servers suddenly get slow with sluggish I/O, and a deeper look found that the RAID battery was going through its "relearning" cycle. While that happens, it looks like the Write-Back policy is disabled and Write-Through is enabled, so writes become very slow compared to normal operation.

    Worth a check...

    ______________________________________________________________________________
    Never argue with an idiot; they'll drag you down to their level and beat you with experience.

  • Thanks for the reply! We have actually seen that happen in the past as well, and it's one of the first things we check. In the cases I've described here, the cache policy was still enabled.

  • Hi, I saw your post and was wondering whether you have resolved it, and if so, what the resolution was.

    If you have not, have you tried reinstalling the system drivers?

    More specifically HBA drivers, MPIO drivers, etc?

    Or other applications or tools with drivers that are part of the windows driver stack?

    --------------------------------------------------
    ...0.05 points per day since registration... slowly crawl up to 1 pt per day hopefully 😀

  • Was this ever resolved? We've twice seen the same issue on a Dell R720. Every tiny action maxes out the CPU and everything performs horribly. Even a warm boot does not clear it up. We have to dispatch someone to the site to cold boot it (shut down, then pull the power cables) to get things back to normal.

    Back in the day, something like this would make me suspect the disk controller had regressed from DMA mode to PIO. Can that sort of thing happen with these PERC caching RAID controllers?

  • Well, we've had a lot of back and forth with Dell on the issue, and they were unable to identify the root cause. It does have to do with the RAID controller: essentially, the controller just stops caching. You can try all of their latest updates, but we have decided to go in a different direction with our new server builds. If you change to a different controller, e.g. Adaptec or the like, you may have better results. The other option is to try Dell's CacheCade configuration, which uses an SSD for additional caching support. We tried that on one of our servers and it seemed to help, but even after all of the updates they suggested, we still had the problem occur on one of our servers. All I can say is good luck! It's hard trying to fix a problem that even the manufacturer of the server can't figure out!

  • It's a feature!

    ______________________________________________________________________________
    Never argue with an idiot; they'll drag you down to their level and beat you with experience.

  • Thanks, John, for the reply.

  • No problem man. Good luck!

