Ambiguous RAID failure

Question

Level 1

5 points

Early 2008 Mac Pro with Apple RAID card and 4 x 1TB drives installed.

I've had yet another RAID failure on my system (it's happened several times before), but this time the diagnostic is ambiguous and I need to make absolutely sure what's happened before I try to recover from it.

Overnight the RAID Utility log reported that Drive 3 had failed, that Raid Set RS1 was now degraded, and that there was no spare available for rebuild.

When I look now in RAID Utility at the status of the drives and of the array, all four drives show "green" (SMART: Verified and Status: Good), and Raid Set RS1 is "Viable (Degraded)". But it shows the drives in Bays 2, 3, and 4 as "Assigned" to Raid Set RS1, while the drive in Bay 1 is not: it shows as "Roaming".

I'm fairly sure that one of the drives actually is problematic, because I've been having increasingly frequent episodes of freezing and non-responsiveness on the system (spinning beachball). In the past couple of days it got so bad that it was difficult to do anything at all following a restart; the freeze/beachball happened very soon after. I remember now that I had exactly this symptom in the past, just prior to a drive failure that RAID Utility reported.

So I guess I need to replace one of the drives, mark it as "Spare" in RAID Utility, and let the array rebuild.

But WHICH drive should I replace? The log says that Drive 3 failed (I'm assuming that "Drive 3" is the drive in "Bay 3"), but now that drive shows as "good"--as do all four drives. It's Drive 1 (i.e. the drive in Bay 1) that's been taken out of the array; drives 2, 3, and 4 are in Raid Set RS1. Is that a red herring? Is it possible that Drive 1 is bad even though the report was about Drive 3? (Drive 1 is the only drive that has never been replaced at any time in the four years since I got this system.)

I think/fear that if I replace Drive 3, I'll blow away the array.

So it seems to me that I should rebuild the array by marking Drive 1 as spare (since it's the only drive that's unassigned), wait for it to complete, and then replace Drive 3 and rebuild again. Or maybe I should just replace Drive 1 pre-emptively.

I don't know, but it takes a full 72 hours for the re-build to complete, a nerve-wracking time because throughout it the system is vulnerable to a second drive failure, so I would prefer not to have to do it multiple times.

Can someone please tell me in detail what the safest/most correct way is to proceed in order to recover from this?

Thanks.

Mac Pro (Early 2008), Mac OS X (10.7.4)

Posted on May 18, 2012 8:39 AM

Answer 1

May 18, 2012 9:10 AM in response to rrgomes

Use Console utility and retrieve the EXACT TEXT of that message and post it.

Answer 2

rrgomes Author

Level 1

5 points

May 18, 2012 10:04 AM in response to Grant Bennet-Alder

Sorry, which messages exactly are you asking about?

The messages from the log in RAID Utility pertaining to this (I got them by exporting the log) were, in reverse time order (the 8:22 entry coincided with this morning's restart):

Friday, May 18, 2012 08:22:38 ET Degraded RAID set RS1 - No spare available for rebuild critical

Friday, May 18, 2012 03:51:09 ET Degraded RAID set RS1 - No spare available for rebuild critical

Friday, May 18, 2012 03:49:53 ET Degraded RAID set RS1 warning

Friday, May 18, 2012 03:49:51 ET Drive 3:5000cca216e13980 failure detected - Previous drive status was inuse critical

I can't find anything in the logs pertaining to the status of all the other drives, except that there are a few hundred "AppleRAIDCard" entries in kernel.log starting at 3:19, reporting scsi_request and scsi_task errors. Those stopped at 3:51 and have not recurred. I can post some (or all) of those if it would be helpful.
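
One way to take the guesswork out of reading those entries is to pull the drive identifier apart programmatically. The following is a minimal sketch, not anything RAID Utility provides: it assumes a drive event always looks like "Drive <n>:<16 hex digits>" (inferred only from the single entry quoted above), assumes the hex string is an NAA-5 World Wide Name, and uses a small vendor table (`OUI_HINTS`, a hypothetical name) whose prefixes should be verified against the IEEE OUI registry before being trusted.

```python
import re

# Lines copied from the exported RAID Utility log quoted above.
LOG_LINES = [
    "Friday, May 18, 2012 08:22:38 ET Degraded RAID set RS1 - No spare available for rebuild critical",
    "Friday, May 18, 2012 03:51:09 ET Degraded RAID set RS1 - No spare available for rebuild critical",
    "Friday, May 18, 2012 03:49:53 ET Degraded RAID set RS1 warning",
    "Friday, May 18, 2012 03:49:51 ET Drive 3:5000cca216e13980 failure detected - Previous drive status was inuse critical",
]

# Assumption: a drive event looks like "Drive <n>:<16 hex digits>".
DRIVE_RE = re.compile(r"Drive (\d+):([0-9a-fA-F]{16})")

# Assumption: vendor prefixes commonly seen in NAA-5 World Wide Names.
# Verify against the IEEE OUI registry before relying on them.
OUI_HINTS = {"000cca": "Hitachi/HGST", "000c50": "Seagate"}


def drive_events(lines):
    """Yield (drive_number, identifier, vendor_hint) for each drive event."""
    for line in lines:
        m = DRIVE_RE.search(line)
        if not m:
            continue
        number, ident = int(m.group(1)), m.group(2).lower()
        # In an NAA-5 WWN the first hex digit is 5 and the next six are the OUI.
        oui = ident[1:7] if ident.startswith("5") else None
        yield number, ident, OUI_HINTS.get(oui, "unknown")


events = list(drive_events(LOG_LINES))
for number, ident, vendor in events:
    print(f"Drive {number}: id={ident} vendor hint={vendor}")
```

If the vendor hint holds, the identifier in the failure entry would belong to a Hitachi drive rather than a Seagate, whatever the "3" means. That is an inference from the identifier format, not something the log itself states.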

Thanks.

Answer 3

May 18, 2012 10:53 AM in response to rrgomes

I see that there is a message with the word Drive and a 3 in it, but I think it is a huge assumption that any of those messages specifically calls out the drive in Bay 3.

I would listen to the advice of RAID Utility, and replace the drive in Bay 1 if that is the one that is not running with the rest.

Answer 4

rrgomes Author

Level 1

5 points

May 18, 2012 11:14 AM in response to Grant Bennet-Alder

Yeah, I had noticed that it said "Drive 3:xxxx" and realized that there was an ambiguity there. I might quibble with the contention that it's a "huge" assumption to think it's Drive 3 since in the past I've had such messages in which the number in the message did correspond to the bay number of the failed drive. But that could have been coincidental.

I'm going to take your advice and replace the drive in Bay 1 as soon as the bootable backup that I'm making via SuperDuper completes. (I have a TM backup but would prefer to have an easily bootable image in case something else goes wrong.) I'll report back.

One other question: the Hitachi drive in Bay 1 (the presumably failed drive) was supplied by Apple with the Mac Pro four years ago. Is there any reason to seek a replacement from Apple rather than from Hitachi? (Assuming the drive is still under warranty at all. If it were a Seagate, it might still be under warranty.)

Thanks for your help.

Answer 5

May 18, 2012 4:35 PM in response to rrgomes

For a Mac Pro, conventional wisdom is that the drives, even with the Apple logo on the label, are nothing special.

For an iMac, you need a specific set of drives -- ones that are equipped with a calibrated Heat sensor and extra pins that the machine uses to read the drive temperature and use that temperature for fan speed control.

Answer 6

rrgomes Author

Level 1

5 points

May 19, 2012 7:58 AM in response to Grant Bennet-Alder

Update: even with the failed drive removed from the RAID Set the system was too unstable and wouldn't stay up long enough to complete a full backup with SuperDuper.

So I replaced the drive in Bay 1 anyway and all is well so far. I marked it as a spare and the rebuild has begun. If past experience is any indication, it will take about 72 hours to complete.

Answer 7

rrgomes Author

Level 1

5 points

May 19, 2012 1:56 PM in response to rrgomes

The RAID rebuild is happening much more quickly than it has in the past. If the current speed is any indication then it will be done in closer to 24 hours than the usual 72 hours.

This is the first time that the system has had four matched Seagate drives in it; before now there were always three Seagates plus the Apple-supplied Hitachi that came with the system.

This prompts me to ask: does this suggest that the previous configuration was sub-optimal because of the mismatch? Does it suggest that the Hitachi might have been problematic all along?
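
To put the two rebuild times in perspective, here is a rough calculation of the average write rate each one implies. It is only a sketch under stated assumptions: that a rebuild writes the full 1 TB of the replacement drive, and that 1 TB means 10^12 bytes (the drive makers' decimal terabyte). The function name is made up for illustration.

```python
def rebuild_rate_mb_per_s(capacity_bytes, hours):
    """Average write rate implied by rebuilding capacity_bytes in the given hours."""
    return capacity_bytes / (hours * 3600) / 1e6


ONE_TB = 10**12  # assumption: decimal terabyte, whole drive rewritten

slow = rebuild_rate_mb_per_s(ONE_TB, 72)
fast = rebuild_rate_mb_per_s(ONE_TB, 24)
print(f"72 h rebuild: {slow:.1f} MB/s")  # about 3.9 MB/s
print(f"24 h rebuild: {fast:.1f} MB/s")  # about 11.6 MB/s
```

Both figures are only averages over the whole rebuild, so they say nothing about whether the pace was steady or interrupted by long stalls.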

Answer 8

May 19, 2012 3:58 PM in response to rrgomes

It is also possible that the earlier rebuilds were slow because of marginal blocks on that drive, which required multiple retries to read good data.

Answer 9

rrgomes Author

Level 1

5 points

May 21, 2012 4:43 AM in response to Grant Bennet-Alder

RAID array was rebuilt in under 24 hours and RAID Utility reports that all is fine.

Unfortunately my system is still misbehaving in the same way: unpredictable and frequent freezes, spinning beachball, etc. It did this several times during the rebuild (requiring a reboot) and a couple times since.

In the past when I had these symptoms, replacing the ultimately-failed disk drive solved it. So I'm not sure how to troubleshoot this further. I don't see anything obvious in the logs under /var/log that would point to the problem (like a ream of messages about a problematic disk drive) but perhaps I'm not looking closely enough.
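
For looking more closely at the logs, a small script can show whether RAID-card messages arrive in bursts that line up with the freezes. This is a minimal sketch: it assumes classic syslog-style timestamps ("May 18 03:19:02 ..."), and the sample lines below are invented for illustration, since the real text of the AppleRAIDCard entries was not posted. `count_per_minute` is a hypothetical helper, not an existing tool.

```python
import re
from collections import Counter

# Hypothetical sample lines: the timestamp layout follows classic syslog,
# and the message text after "AppleRAIDCard" is invented for illustration.
SAMPLE = """\
May 18 03:19:02 macpro kernel[0]: AppleRAIDCard: scsi_request error (example text)
May 18 03:19:40 macpro kernel[0]: AppleRAIDCard: scsi_task error (example text)
May 18 03:20:05 macpro kernel[0]: AppleRAIDCard: scsi_request error (example text)
May 18 03:20:07 macpro kernel[0]: some unrelated message
"""

# Captures "Mon DD HH:MM" and drops the seconds.
STAMP_RE = re.compile(r"^(\w{3}\s+\d+\s+\d{2}:\d{2}):\d{2}\s")


def count_per_minute(lines, keyword="AppleRAIDCard"):
    """Count lines containing keyword, grouped by 'Mon DD HH:MM'."""
    counts = Counter()
    for line in lines:
        if keyword not in line:
            continue
        m = STAMP_RE.match(line)
        if m:
            counts[m.group(1)] += 1
    return counts


counts = count_per_minute(SAMPLE.splitlines())
for minute, n in sorted(counts.items()):
    print(minute, n)
```

To run it against a real log, replace `SAMPLE.splitlines()` with the lines of a file such as kernel.log, and compare the busy minutes with the times the beachball appeared.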

Answer 10

May 21, 2012 7:36 AM in response to rrgomes

Use Activity Monitor to check memory usage:

Activity Monitor: View system memory usage

When you are confident in that information, especially that Pageouts are not killing your performance, change to this display:

Runaway applications can shorten battery runtime

Answer 11

rrgomes Author

Level 1

5 points

May 22, 2012 11:07 AM in response to Grant Bennet-Alder

Thanks, but it doesn't seem to be related to swapping/paging or to CPU usage by a runaway process. The system always seems to have plenty of RAM (18 GB is installed), and neither Activity Monitor nor iStat Menus ever seems to show anything obviously suspicious (nor does "top" when I run that). No swapping, no processes consuming all the CPU (also, it's an 8-core machine).

What typically happens is that the system will seem fine, but then running a new command, or (e.g.) asking my IMAP client to move some files, provokes a hang: everything locks up, spinning beachball, etc. Sometimes this is preceded by an obvious degradation in performance, but sometimes not.

It "feels" disk-related and in the past I had similar symptoms preceding a disk failure (as I did before replacing the drive a few days ago). But I can't see anything relevant in the log files, and in the past, when a disk failed, I didn't see clear evidence of this in the logs until RAID Utility declared the failure--nothing previous to that.

Is there some way to increase the verbosity of the disk-related logging? If a disk really is problematic, even if the problems are recoverable after retries, you'd think that would be detectable by the system and that it could be logged.

The only other things I can think of trying right now are: (1) replacing the remaining drives in the RAID array one by one, starting with the drive in Bay 3 (because of the earlier ambiguity), and/or (2) disconnecting as much as possible from the system and seeing if it becomes more stable in that configuration. I've already mostly done (2), though, without any obvious improvement.

Answer 12

The hatter

Level 9

60,990 points

May 22, 2012 11:20 AM in response to rrgomes

Do any third-party utilities show spare blocks remaining versus used, or do a background scan of the media?

Whether the Apple RAID card is 'worth' the trouble is another question to consider.

Have you thought of just using software RAID? Maybe SoftRAID 4.x, which is a solid product and will scan in the background while idle to ensure your drives do not experience I/O errors (even marginal ones; you set the threshold).

With hardware RAID, if it is anything like it was with SCSI/SAS, the drives all had the same model revision and firmware, and people would buy spare drives at the time of purchase to ensure they had extras on hand later.

If it feels like I/O, or like reads are taking longer (time-limited error recovery), try switching to WD RE series drives? Scan for bad blocks. Rebuild the directory (is Disk Warrior 4.4+ 64-bit yet, and would it work properly?).

2TB RE drives (high density and I/O) in a 3-drive mirror (stripe reads using SoftRAID)

4 x WD 10K using the new 1TB models (200MB/s) instead of 1TB Seagates....

May seem like drastic surgery but would I think provide better support and performance.

Answer 13

May 22, 2012 11:21 AM in response to rrgomes

If you are suspicious that Bad Blocks is the problem, third-party Utilities can do a live "Scan for Bad Blocks". Tech Tool Pro and Drive Genius come to mind.
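
To illustrate what a read-only scan amounts to, here is a minimal sketch that reads a file or device from start to end and notes any chunk that fails or takes unusually long. It is not how Tech Tool Pro or Drive Genius work internally; the function name, chunk size, and thresholds are all made up for illustration. It also assumes the path is something the OS lets you read: behind a hardware RAID card the OS may only see the array volume, not the individual drives. The demonstration runs on a temporary file rather than a real disk.

```python
import os
import tempfile
import time


def read_scan(path, chunk_size=1024 * 1024, slow_seconds=1.0, max_problems=100):
    """Read path sequentially; return (offset, problem) for slow or failed chunks."""
    problems = []
    offset = 0
    # buffering=0 avoids Python's own buffering; the OS cache may still serve reads.
    with open(path, "rb", buffering=0) as f:
        while True:
            start = time.monotonic()
            try:
                data = f.read(chunk_size)
            except OSError as exc:
                problems.append((offset, f"read error: {exc}"))
                if len(problems) >= max_problems:
                    break  # give up rather than loop on a dead device
                offset += chunk_size
                f.seek(offset)  # skip past the unreadable chunk
                continue
            elapsed = time.monotonic() - start
            if not data:
                break  # end of file or device
            if elapsed > slow_seconds:
                problems.append((offset, f"slow read: {elapsed:.2f}s"))
            offset += len(data)
    return problems


# Demonstration on a 3 MiB temporary file (nothing is written to any real disk).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(3 * 1024 * 1024))
    demo_path = tmp.name
problems = read_scan(demo_path, slow_seconds=60.0)
os.remove(demo_path)
print(problems or "no slow or failed reads")
```

A scan like this only reads, so it does not reformat anything, but it also cannot remap a bad block; it can only tell you roughly where the trouble is.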

Another way to do this is to get a spare drive like the others (don't you need one on hand anyway?), add it to the set as a Hot Spare, and kick out a suspect drive (or power off and remove one) to get it to rebuild onto the Hot Spare.

On the now surplus drive, perform an Erase with Security Erase Option: "Zero all Data" (One pass). This takes several hours to complete, but forces the drive to substitute spares from its private pool to replace any found to be Bad after Zeroing. Then that drive can become the new Hot Spare, and continue the process until all have been "laundered".

Occasionally, a drive with good SMART Status returns "Initialization Failed" caused by more than 10 Bad Blocks. The only proper response at that juncture is to throw your arms in the air and scream, "YES! I knew it!"

Answer 14

The hatter

Level 9

60,990 points

May 22, 2012 11:45 AM in response to Grant Bennet-Alder

TTP will scan but it isn't any use, not to me and others, when it doesn't map out or even tell you what the block's sector ID is (use a tool to manually map out sector).

What did work and people miss from OS 9's Drive Setup -

http://support.apple.com/kb/TA21976

Fix random lengthy pauses in OS X by correcting bad blocks ...

Those are Seagate drives, so use Seagate's own SeaTools (burn the ISO to a CD), or use WD Lifeguard the same way, which I know does map them out and cure the ills of disk drive diseases.

Answer 15

rrgomes Author

Level 1

5 points

May 22, 2012 12:41 PM in response to The hatter

Thanks. Scanning for bad blocks does seem like a good idea. I don't have Tech Tool Pro, Drive Genius, or Disk Warrior; they're each about $100, but I'll happily buy whichever one of them is likely to be most useful here. Would that be DG? Mr. Hatter doubts that TTP will be as useful here, and Disk Warrior sounds like a higher-level recovery utility, operating at the file system level rather than at the level of blocks.

But it also sounds like I could try Seagate's SeaTools first to achieve the same thing, is that right?

And can all these utilities scan for bad blocks without also reformatting the drive at the same time? And on a live system?

I've considered replacing all the drives with newer ones but of course it's pricey to do that.

Let me ask this: my Netgear ReadyNAS Pro allows drives in the array to be replaced with higher-capacity drives, and the array will be expanded to take advantage of the additional space. I guess it's safe to assume that the Apple RAID Card has no such capability?

In which case, if I want to grow the array, do I need to replace all the drives with higher-capacity drives (2 TB or 3 TB), recreate the array using RAID Utility, and then restore my system from backup?

I've chosen Seagate drives in the past because their warranty replacement process is so straightforward; essentially if a drive is in warranty they'll just replace it without making you jump through hoops. Not sure if other manufacturers are as easygoing in that respect.
