May 18, 2012 10:04 AM (in response to Grant Bennet-Alder)
Sorry, which messages exactly are you asking about?
The messages from the log in RAID Utility pertaining to this (got them by exporting the log) were, in reverse time order (the 8:22 entry coincided with this morning's restart):
Friday, May 18, 2012 08:22:38 ET Degraded RAID set RS1 - No spare available for rebuild critical
Friday, May 18, 2012 03:51:09 ET Degraded RAID set RS1 - No spare available for rebuild critical
Friday, May 18, 2012 03:49:53 ET Degraded RAID set RS1 warning
Friday, May 18, 2012 03:49:51 ET Drive 3:5000cca216e13980 failure detected - Previous drive status was inuse critical
I can't find anything in the logs pertaining to the status of all the other drives, except that there are a few hundred "AppleRAIDCard" entries in kernel.log starting at 3:19, reporting scsi_request and scsi_task errors. Those stopped at 3:51 and have not recurred. I can post some (or all) of those if it would be helpful.
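For anyone who wants to sort or filter an exported log like this, here is a minimal Python sketch that parses the four entries quoted above and prints them oldest first. The line layout (weekday, date, time, time zone, message, trailing severity word) is inferred from those four lines only, and the `info` severity is a guess; RAID Utility may emit other shapes that this pattern skips.

```python
import re
from datetime import datetime

# The four entries quoted in the post, verbatim.
EXPORTED = """\
Friday, May 18, 2012 08:22:38 ET Degraded RAID set RS1 - No spare available for rebuild critical
Friday, May 18, 2012 03:51:09 ET Degraded RAID set RS1 - No spare available for rebuild critical
Friday, May 18, 2012 03:49:53 ET Degraded RAID set RS1 warning
Friday, May 18, 2012 03:49:51 ET Drive 3:5000cca216e13980 failure detected - Previous drive status was inuse critical
"""

# Assumed layout: "<weekday>, <Month D, YYYY HH:MM:SS> <tz> <message> <severity>".
# Only "critical" and "warning" appear in the thread; "info" is a guess.
LINE = re.compile(
    r"^\w+, (?P<when>\w+ \d{1,2}, \d{4} \d{2}:\d{2}:\d{2}) (?P<tz>\w+) "
    r"(?P<message>.*) (?P<severity>critical|warning|info)$"
)


def parse_entries(text):
    """Return (timestamp, severity, message) tuples sorted oldest first."""
    entries = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # skip anything that does not fit the assumed layout
        when = datetime.strptime(match.group("when"), "%B %d, %Y %H:%M:%S")
        entries.append((when, match.group("severity"), match.group("message")))
    return sorted(entries)


entries = parse_entries(EXPORTED)
for when, severity, message in entries:
    print(when.isoformat(sep=" "), severity.upper(), message)
```

Read in time order, the drive failure at 03:49:51 comes first and the degraded-set messages follow it.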
I see that there is a message with the word Drive and a 3 in it, but I think it is a huge assumption that any of those messages specifically calls out the drive in Bay 3.
I would listen to the advice of RAID Utility, and replace the drive in Bay 1 if that is the one that is not running with the rest.
Mac Pro (Early 2009), Mac OS X (10.6.8), & Server, PPC, & AppleTalk Printers
May 18, 2012 11:14 AM (in response to Grant Bennet-Alder)
Yeah, I had noticed that it said "Drive 3:xxxx" and realized that there was an ambiguity there. I might quibble with the contention that it's a "huge" assumption to think it's Drive 3 since in the past I've had such messages in which the number in the message did correspond to the bay number of the failed drive. But that could have been coincidental.
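One way to narrow the ambiguity is to look at the part after the colon rather than the 3. The sketch below assumes (nothing in the log confirms this) that `5000cca216e13980` is a 64-bit NAA type 5 World Wide Name, where the first hex digit is 5 and the next six are the manufacturer's IEEE OUI. The small OUI table is hand-written from memory and should be checked against the IEEE registry before relying on it.

```python
# Hand-picked OUI table: an assumption to verify against the IEEE registry.
KNOWN_OUIS = {
    "000cca": "Hitachi Global Storage Technologies (HGST)",
    "000c50": "Seagate",
    "0014ee": "Western Digital",
}


def decode_drive_field(field):
    """Split 'N:<16 hex digits>' into the leading number and assumed WWN parts."""
    number, _, wwn = field.partition(":")
    wwn = wwn.lower()
    if len(wwn) != 16 or any(c not in "0123456789abcdef" for c in wwn):
        raise ValueError("not a 64-bit hex identifier")
    if wwn[0] != "5":
        raise ValueError("not an NAA type 5 identifier")
    oui = wwn[1:7]  # 24-bit manufacturer prefix
    return {
        "number": int(number),
        "oui": oui,
        "vendor": KNOWN_OUIS.get(oui, "unknown"),
        "vendor_specific": wwn[7:],  # remaining 36 bits
    }


info = decode_drive_field("3:5000cca216e13980")
print(info["number"], info["oui"], info["vendor"])
```

If the assumption holds, the identifier points at a drive vendor, not a bay, so it can be matched against the label on each drive. It says nothing about what the leading 3 means.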
I'm going to take your advice and replace the drive in Bay 1 as soon as the bootable backup that I'm making via SuperDuper completes. (I have a TM backup but would prefer to have an easily bootable image in case something else goes wrong.) I'll report back.
One other question: the Hitachi drive in Bay 1 (the presumably failed drive) was supplied by Apple with the Mac Pro four years ago. Is there any reason to seek a replacement from Apple rather than from Hitachi? (Assuming the drive is still under warranty at all. If it were a Seagate, it might still be under warranty.)
Thanks for your help.
For a Mac Pro, conventional wisdom is that the drives, even with the Apple logo on the label, are nothing special.
For an iMac, you need a specific set of drives -- ones that are equipped with a calibrated heat sensor and extra pins that the machine uses to read the drive temperature for fan speed control.
May 19, 2012 7:58 AM (in response to Grant Bennet-Alder)
Update: even with the failed drive removed from the RAID Set the system was too unstable and wouldn't stay up long enough to complete a full backup with SuperDuper.
So I replaced the drive in Bay 1 anyway and all is well so far. Marked it as a spare and the rebuild has begun. If past experience is any indication it will take about 72 hours to complete.
The RAID rebuild is happening much more quickly than it has in the past. If the current speed is any indication then it will be done in closer to 24 hours than the usual 72 hours.
This is the first time that the system has had four matched Seagate drives in it; before now there were always three Seagates plus the Apple-supplied Hitachi that came with the system.
This prompts me to ask: does this suggest that the previous configuration was sub-optimal because of the mismatch? Does it suggest that the Hitachi might have been problematic all along?
It is also possible that it was slow due to marginal blocks on that drive, which required multiple retries to read good data.
May 21, 2012 4:43 AM (in response to Grant Bennet-Alder)
RAID array was rebuilt in under 24 hours and RAID Utility reports that all is fine.
Unfortunately my system is still misbehaving in the same way: unpredictable and frequent freezes, spinning beachball, etc. It did this several times during the rebuild (requiring a reboot) and a couple times since.
In the past when I had these symptoms, replacing the ultimately-failed disk drive solved it. So I'm not sure how to troubleshoot this further. I don't see anything obvious in the logs under /var/log that would point to the problem (like a ream of messages about a problematic disk drive) but perhaps I'm not looking closely enough.
Use Activity Monitor to check memory usage:
When you are confident in that information, especially that Pageouts are not killing your performance, change to this display:
May 22, 2012 11:07 AM (in response to Grant Bennet-Alder)
Thanks, but it doesn't seem to be related to swapping/paging or to CPU usage by a runaway process. The system always seems to have plenty of RAM (18 GB is installed) and neither Activity Monitor nor iStat Menus ever seem to show anything obviously suspicious (nor does "top" when I run that). No swapping, no processes consuming all the CPU (also, it's an 8-core machine).
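For a command-line cross-check of the paging numbers, here is a minimal sketch that pulls the counters out of `vm_stat` output. The sample text is hypothetical (the numbers are made up, and the exact set of lines varies by OS X version); on a real machine the text would come from running `vm_stat` in Terminal.

```python
import re

# Hypothetical sample in the general shape of `vm_stat` output; numbers are made up.
SAMPLE = """\
Mach Virtual Memory Statistics: (page size of 4096 bytes)
Pages free:                         2000000.
Pages active:                       1500000.
Pageins:                             300000.
Pageouts:                                 0.
"""


def parse_vm_stat(text):
    """Map each 'Name: 123.' line to an integer; other lines are ignored."""
    stats = {}
    for line in text.splitlines():
        match = re.match(r'^"?([^":]+)"?:\s+(\d+)\.?$', line.strip())
        if match:
            stats[match.group(1)] = int(match.group(2))
    return stats


stats = parse_vm_stat(SAMPLE)
print("Pageouts since boot:", stats["Pageouts"])
```

The counter is cumulative since boot, so the useful signal is whether it grows between two samples taken around a freeze, not its absolute value.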
What typically happens is that the system will seem fine, but then running a new command, or (e.g.) asking my IMAP client to move some files, provokes a hang: everything locks up, spinning beachball, etc. Sometimes this is preceded by an obvious degradation in performance, but sometimes not.
It "feels" disk-related and in the past I had similar symptoms preceding a disk failure (as I did before replacing the drive a few days ago). But I can't see anything relevant in the log files, and in the past, when a disk failed, I didn't see clear evidence of this in the logs until RAID Utility declared the failure--nothing previous to that.
Is there some way to increase the verbosity of the disk-related logging? If a disk really is problematic, even if the problems are recoverable after retries, you'd think that would be detectable by the system and that it could be logged.
The only other things I can think of trying right now are either: (1) replacing the remaining drives in the RAID Array one by one, starting with the drive in Bay 3 (because of the earlier ambiguity), and/or (2) disconnecting as much as possible from the system and seeing if it becomes more stable in that configuration. Though I've already mostly done (2) without any obvious improvement.
Do any 3rd party utilities show spare blocks remaining vs used? Or do a background scan of the media?
Whether the Apple RAID card is 'worth' the trouble is another question to consider.
Have you thought of just software RAID? Maybe SoftRAID 4.x, which is a solid product and will scan in the background when idle to ensure your drives do not experience I/O errors (even marginal ones, you set the threshold).
With hardware RAID, if it is anything like it was with SCSI/SAS, the drives all had the same model revision and firmware, and people would buy spare drives at the time of purchase to ensure they had extras on hand later.
If it feels I/O-related, or like operations are taking longer (time-limited error recovery), try switching to WD RE series drives? Scan for bad blocks. Rebuild the directory (is Disk Warrior 4.4+ 64-bit yet, and would it work properly?).
2TB RE drives (high density and I/O) in a 3-drive mirror (stripe reads using SoftRAID)
4 x WD 10K using the new 1TB models (200MB/s) instead of 1TB Seagates....
May seem like drastic surgery, but it would, I think, provide better support and performance.
If you suspect that Bad Blocks are the problem, third-party utilities can do a live "Scan for Bad Blocks". Tech Tool Pro and Drive Genius come to mind.
Another way to do this is to get a spare drive like the others (don't you need one on hand anyway?), add it to the set as a Hot Spare, and kick out a suspect drive (or power off and remove one) to get it to rebuild onto the Hot Spare.
On the now surplus drive, perform an Erase with Security Erase Option: "Zero all Data" (One pass). This takes several hours to complete, but forces the drive to substitute spares from its private pool to replace any found to be Bad after Zeroing. Then that drive can become the new Hot Spare, and continue the process until all have been "laundered".
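The rotation described above can be sketched as a simple loop. This only generates a checklist; it does not touch any disks, and the drive labels are hypothetical placeholders.

```python
def laundering_plan(members, spare):
    """Return the ordered steps and the drive left over as hot spare at the end."""
    steps = []
    for suspect in members:
        steps.append(
            f"Add {spare} as hot spare, remove {suspect}, "
            f"let the set rebuild onto {spare}"
        )
        steps.append(f"Erase {suspect} with one-pass zeroing")
        spare = suspect  # the freshly zeroed drive becomes the next hot spare
    return steps, spare


# Hypothetical labels: three original members plus one newly bought drive.
steps, final_spare = laundering_plan(["drive A", "drive B", "drive C"], "new drive")
for number, step in enumerate(steps, start=1):
    print(f"{number}. {step}")
print("Left over as hot spare:", final_spare)
```

Each original member costs one full rebuild plus one zeroing pass, and the last drive to be zeroed ends up as the standing hot spare.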
Occasionally, a drive with good SMART Status returns "Initialization Failed" caused by more than 10 Bad Blocks. The only proper response at that juncture is to throw your arms in the air and scream, "YES! I knew it!"
May 22, 2012 11:45 AM (in response to Grant Bennet-Alder)
TTP will scan but it isn't any use, not to me and others, when it doesn't map out or even tell you what the block's sector ID is (use a tool to manually map out sector).
What did work and people miss from OS 9's Drive Setup -
Those are Seagate drives; use Seagate's own SeaTools (burn the ISO to a CD), or use WD Lifeguard the same way, which I know does map them out and cure the ills of disk drive diseases.