System Unreliable When RAID-1 Degraded
Several days ago, I became aware that my Time Machine backups were hanging at "Finishing Backup". After nearly two weeks of troubleshooting the problem, I discovered that the RAID-1 mirror set containing the system (boot) volume was degraded because one of the two disks had failed, and that this was causing my Time Machine problem.
So here's the new problem:
RAID-1 is supposed to provide high availability by allowing the system to continue operating when one drive of a pair fails. However, that is not proving to be the case. For me, when one drive in the mirror set fails, macOS 10.13.4 gets flaky and exhibits the following behaviors:
- Time Machine backups will hang at "Finishing Backup". The TM log file will indicate that the snapshot was successfully created, but none of the post-backup steps will be taken, the backup job never completes, nothing further is logged to the TM log, and future scheduled backups will fail to start because "Another backup is in progress".
- Disk Utility will hang at "Loading disks". No amount of waiting will ever allow this to clear. Nothing helpful gets logged to indicate what DU is waiting for. Some have suggested that these symptoms are indicative of a rogue fsck process or an improperly dismounted external USB drive, but neither is the case.
- The system will not shut down. If I issue a sudo shutdown -r now, the display will freeze and a small wait/progress spinner will be displayed, but the system will never halt or reboot no matter how long I wait. At that point, the only option is a forced power-off.
- If I do a forced power-off after a hung shutdown, the Open Directory database will get corrupted and I will have to restore it from a previous backup.
I've experienced the above behavior five times now. If, instead of trying to shut down the system, I replace the failed drive and rebuild the mirror using the command-line version of DU (in my case that's something like diskutil appleRAID repair disk2 /dev/disk1), the mirror set will rebuild and the RAID status will return to healthy. However, the unreliable behavior of TM, DU, and shutdown will continue until the system is rebooted. That is to say, even after the RAID is healthy, TM still hangs, DU still hangs, and the system won't shut down without hanging.
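For anyone comparing notes, here is a minimal sketch of how the set status can be read from the command line without opening Disk Utility. It assumes that diskutil appleRAID list prints one "Status:" line per RAID set (for example Online or Degraded). The helper name raid_status is made up for illustration and is not part of diskutil.

```shell
# Hypothetical helper: read `diskutil appleRAID list` output on stdin and
# print the value of each set-level "Status:" line.
# Assumption: the set status appears on a line starting with "Status:".
raid_status() {
    awk -F': *' '/^Status:/ { print $2 }'
}

# Usage on the Mac itself (left commented out because it needs macOS):
# diskutil appleRAID list | raid_status
```

Anything other than a healthy status in that output would be the cue to look at the per-member lines in the full listing.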
The first few times this happened, I didn't actually replace any disks, since rebooting the server caused the failed disk to rebuild. However, the time before last, I did replace the failed disk with a NOS ADM. 48 hours after the rebuild, the SECOND disk failed. I find it VERY difficult to believe these disks are failing after such light-duty use. I just replaced the second disk, and if my guess is correct, the first disk will turn up as failed within the next 48 hours.
Can anybody comment on my experience? From the threads I've read online, I thought Apple's software RAID-1 implementation was pretty stable and reliable. Yet when a disk fails in a mirror set, TM, DU, and shutdown all suddenly hang, and nothing resumes normal operation until the server is rebooted. You should NOT have to reboot after rebuilding a mirror set. That defeats the entire purpose of high availability through RAID! And with fairly new ADMs getting hardly any use at all, I can't understand why they're failing every 48 hours.