RAID-5 Disk Crash

  • Has anyone experienced a disk crash on a RAID-5 disk set and, if so, how were you informed about it: flashing lights, beeping, or some notification software? We have servers in remote offices, housed in rooms where no-one goes, so no-one is going to hear or see anything unusual, and email notification would seem to be the answer. We use Dell tower servers.

    Any thoughts appreciated, thanks.

  • Do you mean a single disk of a R-5 array?

    We've had notifications set up before in software to notify administrators. At the least this should be written to the Windows event log by your hardware drivers and caught by admins.

    We've also had people catch a red light on a server and investigate.

    You should have some sort of alerting set up for this. Another thought: put in a spare and have someone check the servers once a week or so for drives that aren't working.

  • Yes, I was thinking of a single disk going down. In a previous job this happened over the weekend; by the Monday morning the other disk (of the two) had crashed too! Very painful. Our current servers have 3 disks. What software are you using for notification?

  • We used to use a few at different times. We've run DELL servers and they can notify you when there's a failure with their system software.

    We've also had the vendor's software write to the Windows System log and used What's Up (and Unicenter) to report on those critical errors.

  • Thanks Steve - I'll chase up those two packages, see if we can get something organised here!

  • Steve also mentioned "put in a spare": not sure if he means a hot spare, but I would recommend always populating an array with the disks you need plus a hot spare.  Then the array rebuilds onto the spare after the first failure, and it takes a second failure before you are left running degraded (the chance of two near-simultaneous disk failures is much higher than statistics suggest...!).

    Also, despite other threads on the subject, we run a lot of SQL Server on RAID 5 perfectly effectively, even (horror of horrors) with logs and data on the same array.  It's certainly not what I choose when given the chance and enough disks, and it will deliver sub-optimal performance; but for moderate, non-intensive use it makes better use of available storage than 1-to-1 mirroring, and despite the parity overhead (I think people often exaggerate this, though I expect it matters where there is really heavy usage), it provides a lot better protection than having no RAID.
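
    As a rough illustration of the storage point above, here is a small sketch comparing the share of purchased disk capacity that actually holds data. It assumes equal-sized disks and counts a hot spare as purchased but unused capacity; the function names are made up for this example.

```python
def usable_fraction_raid5(disks_in_array: int, hot_spares: int = 0) -> float:
    """Share of purchased capacity that holds data in a RAID 5 set.

    RAID 5 gives up one disk's worth of space to parity, and any
    hot spare sits idle until a failure. Assumes equal-sized disks.
    """
    if disks_in_array < 3:
        raise ValueError("RAID 5 needs at least 3 disks")
    purchased = disks_in_array + hot_spares
    return (disks_in_array - 1) / purchased


def usable_fraction_mirror() -> float:
    """1-to-1 mirroring stores every block twice, so half is usable."""
    return 0.5


if __name__ == "__main__":
    for disks, spares in [(3, 0), (3, 1), (5, 1)]:
        share = usable_fraction_raid5(disks, spares)
        print(f"{disks} disks + {spares} spare(s): {share:.0%} usable")
    print(f"mirror: {usable_fraction_mirror():.0%} usable")
```

    Note that on these assumptions a 3-disk RAID 5 set plus a hot spare is no more space-efficient than a mirror; the advantage only shows up without the spare or with more disks in the set.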

  • I have experienced both individual drive and multiple drive failures in RAID 5 configurations, as well as individual drive failures in RAID 1 configurations.

    I will concur with Steve Jones's comments about using the server manufacturers' utilities and management software to notify you of a failure.  Most of my drive / power supply failure notifications come from my management software.

    I've worked with Dell servers as well as HP Proliant servers, with more experience with HP servers.

    The first thing you need is to install the appropriate management utilities / agents on the server.   For Dell servers, this is Server Assistant.   For HP servers, it is the Proliant Support Pack.   Both use SNMP for remote management.  These items are free downloads.

    Once you have the proper agents on the system, you need a management server.  For Dell Servers, this is ITAssistant. For HP servers, it is Systems Insight Manager.  These are free downloads.

    Although you can run either on a desktop machine, they work better on server hardware.  If you have a significant number of machines to monitor, get a dedicated machine, even if it is an older machine being redeployed.

    Be aware that you will need a SQL database for either (and I think it can be on another machine), and will have the option of installing MSDE during the installation if you don't have one.  The current version of Insight Manager won't run on SQL Server Express, but will run on a SQL 2005 Standard / Enterprise, as stated in their documentation.   I haven't looked at the ITAssistant recently for SQL requirements.

    Once you have the management software installed, you will need to make sure it can see the managed servers.  In Insight Manager, this is done via a certificate that needs to be configured on the managed server.   ITAssistant, if I remember correctly, doesn't have this issue.

    Once you have the management software and the managed server communicating, the management server will poll the managed server at regular intervals to check on status.  A check of the management server's console will show you server status at a glance.

    To get notifications of failures, you will need to set up SNMP traps.   An SNMP trap is a message sent from a managed server to a management server.  You configure this by adding a destination address in the managed server's SNMP service configuration, under the Traps tab.

    The management server needs to be configured to act on the trap message in some way.   In my case, I have my Insight Manager set to page me (by sending an email to my pager) when it receives a trap from specific servers.  

    I currently have only my most critical servers (about 6) set to send SNMP traps to my management server, and of those, only 3 get all messages paged out. 

    When a drive fails on one of these 3 servers, the trap is sent to my management server.  The management server then pages me.  I can then open a service request (my hardware failures get resolved by a 3rd party, who has 24x7, 4 hour response for servers, and has the most common parts on hand or in the metro Chicago area)

    You will need to set which items you want to page for traps, otherwise you could get a large number of pages.  I also recommend starting small with the number of servers that trap failures, potentially using a test machine to simulate failures to see which issues you need to trap.

    As an example, rebooting one of the 3 servers generates at least 4 pages from 4 traps: a cold start trap (sent when SNMP starts) and 3 link up messages from the NICs in the server.  If the management server notices the server isn't reachable while it is going down / coming up (as part of the management server's regular polling), I get a page indicating the server is unreachable and another when it becomes reachable.  If your remote servers are regularly scheduled to reboot, you will want to not page for these traps (or at least not page during the reboot window).
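
    The reboot-window idea above can be sketched in a few lines. This is only an illustration of the filtering logic, not how Insight Manager or ITAssistant implement it: `should_page` and the `diskFailure` trap name are hypothetical, while `coldStart`, `linkUp` and `linkDown` are the standard generic SNMP trap names.

```python
from datetime import datetime, time

# Generic SNMP traps that a normal reboot is expected to produce.
REBOOT_NOISE = {"coldStart", "linkUp", "linkDown"}


def should_page(trap_name: str, received_at: datetime,
                window_start: time, window_end: time) -> bool:
    """Decide whether a trap should page someone.

    Reboot-related traps are suppressed only while inside the
    scheduled reboot window; everything else always pages.
    """
    if trap_name not in REBOOT_NOISE:
        return True
    now = received_at.time()
    if window_start <= window_end:
        in_window = window_start <= now < window_end
    else:
        # The window crosses midnight, e.g. 23:00 to 01:00.
        in_window = now >= window_start or now < window_end
    return not in_window


if __name__ == "__main__":
    start, end = time(2, 0), time(3, 0)  # assumed Sunday 02:00-03:00 reboot slot
    inside = datetime(2024, 1, 7, 2, 30)
    print(should_page("coldStart", inside, start, end))    # suppressed
    print(should_page("diskFailure", inside, start, end))  # hypothetical trap, still pages
```

    Keeping the suppression list short and explicit means a drive or power supply trap that happens to arrive during the reboot window still gets through.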

    Two more things about SNMP:  You will also need to configure the SNMP communities on your managed server, with corresponding settings on your management server, to make this all work.   Although the default settings will get you working, they are considered a security risk (because they are well known).   Also, Windows 2003 has SNMP defaulting to only accept SNMP packets from itself.  You will need to add the IP address of your management server to the accepted hosts to get these things working.

    If you are using HP hardware, they offer a 2 day class on Insight Manager for about $1200, although they don't deal with the SNMP communities settings much.  I haven't checked Dell's training offerings.

     

  • Thanks Scott, that's very comprehensive. For general server availability we have been using IPCheck Monitor, but I don't think that can check for disk crashes. I'm surprised no-one has mentioned IT Assistant here before, but having had a quick look now at the documentation it seems to be what we need.

    Thanks to everyone for their contributions.

  • Glad to help.

    From my experience with both Dell and HP servers, I will say the emphasis on ITAssistant / Insight Manager is kind of muted in the documentation.

    When you buy a server, you get a set of CDs containing the Hardware's OS installation assistance CD and usually the management CD.   If you haven't been taught how to use them, you put them aside for when someone tells you to use them and start loading the OS from the OS CD. 

    I got started working with Insight Manager initially when my group was starting to increase the number of servers we supported (and dedicating people to hardware / OS support).  On the first day I had it installed, I detected a memory error that had been causing issues on one server. 

    It was after I took a class on Insight Manager that I learned the power of using it (and importantly, learned how to use their tools for monitoring which drivers/agents/utilities/ROMs needed updating).

    After I saw what Insight Manager could do (and finding it didn't monitor the Dell servers as well as I'd like), I decided to look at ITAssistant for my Dell servers and found similar capabilities to Insight Manager, although I like the Dell driver/agent/utilities/ROM update indications a little better.

    I will say that once you get ITAssistant up and running, you will come to rely on it for primary notification of issues (and find it is easier for looking up information than going to individual servers).

    Good Luck.

     

  • There's a postscript to all of this...

    Just as one engineer was checking out the Dell software for managing this sort of scenario, another was grappling with a server making an 'odd whining noise'. Thinking it was the fan or the power supply, they transferred all the disks to a new box, but still had the noise. It eventually turned out to be the RAID controller making the noise: one of the disks had crashed, though the server was still operating. All that was needed was a new disk and to go through the rebuild and verify routines. After finding a new disk of the same size (!), it turned out it was quicker to do a restore from backup. Anyway, it's fully functional again now.

  • We've been using a product called 'LogMeister' for quite a few months now and found it to be excellent and very cheap for what it does (US$129 or so).

    Well worth checking out.

    A lack of planning on your part does not constitute an emergency on mine.

  • Hello,

    As Ewan H. mentioned above, the chance of 2 disks failing simultaneously is much higher than statistics suggest. This is especially true for servers sold by HP and Dell, because the drives they use tend to come from the same manufacturing run. If that particular run of drives had a problem with the read/write head assembly, then it's highly probable that all 3 drives of your RAID 5 will share the problem.
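
    For comparison, here is a sketch of the naive calculation that "statistics suggest". It assumes every drive fails independently with a constant failure rate, which is exactly the assumption that same-batch drives break; the 500,000-hour MTBF and the 48-hour window are made-up illustration values, not figures from any vendor.

```python
import math


def p_second_failure(remaining_disks: int, window_hours: float,
                     mtbf_hours: float) -> float:
    """Naive chance that at least one remaining disk fails in the window.

    Assumes independent drives with exponentially distributed
    lifetimes, so the combined failure rate is remaining_disks / MTBF.
    """
    combined_rate = remaining_disks / mtbf_hours
    return 1.0 - math.exp(-combined_rate * window_hours)


if __name__ == "__main__":
    # 3-disk RAID 5: one disk has failed, two remain, nobody
    # notices for a weekend (assumed 48 hours).
    p = p_second_failure(remaining_disks=2, window_hours=48,
                         mtbf_hours=500_000)
    print(f"naive probability of a second failure: {p:.4%}")
```

    Under these assumptions the number comes out at roughly 0.02%, which is why a double failure looks almost impossible on paper. Drives that share a manufacturing defect, age, workload and environment are not independent, so the real-world figure can be far higher.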

    As a RAID data recovery specialist, I see this exact scenario every day. It is a misconception that a RAID 1, RAID 5 or RAID 10 array does not require a backup. A lot of people feel that since their hardware is redundant or fault tolerant they are protected. Nothing could be further from the truth, since data loss can occur on a RAID system due to the data becoming corrupted or to multiple disk failures. Let's face it, a database with a load of corrupt page IDs is useless whether it's running on a single Quantum Fireball or an Enhance Tech RAID 6.

    As far as notifications go, I recommend just getting a controller that has email support built in. Some of my favorite products are from 3ware and Enhance Tech. Set up an email alias like failure@example.com and have it deliver to all your admins and a pager that beeps when it receives an email.
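
    If the controller can only run a script rather than send mail itself, the same alias approach can be sketched with Python's standard library. This is an assumption-laden illustration, not any vendor's mechanism: `build_failure_alert`, the sender address and the server name are hypothetical, and the message is only built here, not sent.

```python
from email.message import EmailMessage


def build_failure_alert(server: str, detail: str, sender: str,
                        alias: str = "failure@example.com") -> EmailMessage:
    """Build a short plain-text alert addressed to the shared alias."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = alias
    # Keep the subject short so it is readable on a pager.
    msg["Subject"] = f"RAID alert on {server}"
    msg.set_content(detail)
    return msg


if __name__ == "__main__":
    alert = build_failure_alert(
        server="branch-srv01",                    # hypothetical server name
        detail="Disk 1 of 3 reported as failed.",
        sender="raid-monitor@example.com",        # hypothetical sender
    )
    print(alert)
    # To actually send it, pass the message to
    # smtplib.SMTP("<your mail host>").send_message(alert).
```

    Addressing the alias rather than individual people keeps the recipient list in one place on the mail server, so admins can be added or removed without touching every monitored machine.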

    Wesley Gill

    Gillware Inc.

    gillware.com
