Where are RAID failures logged?

I wanted to ensure that Time Machine backups were stable, so I put them on a 3-way mirror of external 5 TB drives. All was good until this weekend. One drive showed as damaged/missing, but seemed to be responding OK (S.M.A.R.T. status: Verified). I tried adding it back as a member and the rebuild ran all day and night. The next morning, two drives were damaged/missing. I tried adding the second failed drive back as a member and the rebuild ran all day and night. This morning that drive is showing as failed, but currently responding OK. This morning I added the drive back as a member and it is currently rebuilding. I expect it to fail.


Right now I am down to a single functioning drive that is working very hard to rebuild other drives. I sincerely hope it does not fail.


I want to RMA the two failing drives, but they repeatedly report and act OK, except when 12+ hours into a rebuild. Are these failures logged anywhere? I expect to have some pushback from the drive manufacturer when I tell them, "The drives have failed but are responding perfectly. I have no logs of the failures."

MacBook Pro 14″, macOS 13.3

Posted on Apr 10, 2023 9:23 AM

Question marked as Top-ranking reply

Posted on Apr 10, 2023 11:41 AM

What's the RAID controller involved here? Apple hardware RAID? Apple software RAID? Or a RAID controller from some third-party vendor?


Apple hardware RAID uses the RAID Utility, and that shows activity.


Given the use of RAID-1 mirroring (and not, for instance, RAID-6), your phrasing, and the MacBook Pro footer, I'm guessing this is Apple software RAID on some external storage, which is about as low-end as RAID gets.


Apple software RAID logs to the system logs, which are available in the Console app:

... Find log messages and activities in Console on Mac - Apple Support
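
For anyone who prefers Terminal to the Console app, the same unified log can be searched with the `log show` command. Below is a minimal Python sketch that builds such a query. The keyword "AppleRAID" is only a guess at what the relevant messages contain, and the two-day window is arbitrary; try other keywords (a disk identifier, "RAID") if nothing turns up. The helper names are hypothetical.

```python
import subprocess


def build_log_query(keyword: str = "AppleRAID", last: str = "2d") -> list[str]:
    """Return the argv for a `log show` search of recent unified-log entries.

    keyword: case-insensitive text to look for in each message (an assumption;
             the actual wording of RAID-related messages is not known here).
    last:    how far back to search, e.g. "12h" or "2d".
    """
    predicate = f'eventMessage CONTAINS[c] "{keyword}"'
    return ["log", "show", "--last", last, "--predicate", predicate]


def run_log_query(keyword: str = "AppleRAID", last: str = "2d") -> str:
    """Run the query (macOS only) and return whatever matching text comes back."""
    result = subprocess.run(
        build_log_query(keyword, last),
        capture_output=True,
        text=True,
        check=False,
    )
    return result.stdout
```

On the Mac in question, `print(run_log_query())` would dump any matches; saving that output alongside the rebuild timestamps is one way to keep a record of when a member dropped out.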


RAID rebuilds are notorious for increased error rates, and that fundamental behavior of rebuilds is why using RAID-5 can be so hazardous, particularly as the storage volume capacities increase. RAID-6 reduces the exposure to the usual RAID-5 catastrophic second error. (Related: rebuild risk calculator, and background info, and Backblaze SSD and HDD drive stats review and reliability data, including 2022, and the canonical Google HDD drive reliability study.)
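
To put a rough number on the rebuild risk, here is a back-of-the-envelope sketch. It assumes an unrecoverable-read-error rate of 1 in 10^14 bits, a figure often quoted on consumer HDD spec sheets; that rate is an assumption, not a published spec for the drives in this thread, and real-world rates may differ.

```python
import math


def p_read_error(bytes_read: float, ure_per_bit: float = 1e-14) -> float:
    """Probability of at least one unrecoverable read error while reading
    `bytes_read` bytes, assuming independent errors at `ure_per_bit`."""
    bits = bytes_read * 8
    # 1 - (1 - p)^bits, computed via log1p/expm1 to stay accurate for tiny p
    return -math.expm1(bits * math.log1p(-ure_per_bit))


# Reading one full 5 TB mirror member end to end during a rebuild
print(f"{p_read_error(5e12):.1%}")
```

Under that assumed rate, a full read of a 5 TB drive has roughly a one-in-three chance of hitting at least one unreadable sector, and a single such error can be enough for low-end RAID to drop the member.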


If the storage lacks support for WRITE LONG or an analogous command to mark storage as bad, then the only option for the host RAID software or RAID controller is to toss the whole storage device out of the RAID set. And there's a lot of gear that doesn't send that command, and a lot of storage that doesn't support receiving that (or an equivalent) command. Which means more than a few configurations will toss out the whole storage device on any error. One error will kick the volume out of a lower-end RAID storage array, which is why vendors are loath to issue an RMA.


Here? You're likely headed for some new devices, and getting an immediate backup of any data not already archived elsewhere. I'd probably replace this with RAID-capable network-attached storage (NAS) and some higher-spec HDDs, and retire the use of the Apple software RAID, too. Preferably a NAS with support for Time Machine server. Or replace it with direct-attached storage (DAS) hardware RAID capable of RAID-6, such as a Promise Pegasus storage array, but DAS would require wiring it to the MacBook Pro, where a NAS can work via Wi-Fi.


And as mentioned earlier in the thread, RAID is not backup. Not even close.


14 replies

Apr 13, 2023 2:24 PM in response to martyscholes

As others have mentioned, it takes nothing to break a macOS software RAID. I have seen this occur with two seemingly healthy drives installed internally in a Mac Pro. With external RAID drives, your chances of issues increase dramatically, especially if each drive in the array is a separate external drive. If you want to use external drives in a software RAID (especially with macOS), then make sure to purchase an enclosure which can handle the number of drives desired. With a single enclosure, at least, all the drives should receive power at the same time and hopefully become ready close enough together to satisfy the software RAID driver.


Also, Disk Utility's "SMART Status Verified" is not enough to ensure the drives are healthy. macOS and Disk Utility will only show "SMART Status Failed" once the drive reports enough failures to exceed the manufacturer's allowed number of failures. With hard drives this rarely happens, and when it does, the drive will already be unusable anyway. If you want to monitor the health of a hard drive, then you need to use a dedicated app such as SMARTReporter to alert you when any bad blocks occur, which is the main symptom when hard drives start to fail. If you are using SSDs, an alert from such a utility is only a prompt to manually examine the SSD's health information to determine whether it is a serious issue or just the normal operational behavior of the SSD.


Most drive manufacturers will replace a hard drive under warranty if it is showing bad blocks and/or fails the SMART self-test. I've never had a problem with any of them honoring a warranty, even with a low number of reallocated/pending sectors. Drive manufacturers are unlikely to replace a drive just because it dropped out of a RAID; you will need to prove a hardware issue. Keep in mind that drive manufacturers now rate only certain drive models for RAID use, unlike many years ago, when any drive could be used.
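
As a rough illustration of the kind of evidence that helps with an RMA, here is a minimal Python sketch that pulls the bad-block counters out of a SMART attribute table. It assumes the table is in the text format printed by smartmontools' `smartctl -A`, which is a tool not otherwise discussed in this thread; the sample values below are made up for illustration and do not come from any drive in this thread.

```python
# SMART attribute IDs that count bad or suspect sectors
WATCHED_IDS = {5, 197, 198}  # reallocated, pending, offline-uncorrectable


def bad_block_counters(smartctl_output: str) -> dict[str, int]:
    """Return {attribute_name: raw_value} for the watched attributes."""
    counters: dict[str, int] = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows have 10 columns and start with a numeric ID
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        if int(fields[0]) in WATCHED_IDS and fields[9].isdigit():
            counters[fields[1]] = int(fields[9])
    return counters


# Made-up sample in the assumed `smartctl -A` layout
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   199   199   140    Pre-fail  Always       -       8
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       2
"""

print(bad_block_counters(SAMPLE))
```

Non-zero raw values for these counters, captured before and after a rebuild attempt, are the sort of hardware evidence a manufacturer is more likely to accept than "it dropped out of the RAID."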


What is the exact make & model of these hard drives?


So many contributors mention "RAID is not a backup" because we have seen so many users think otherwise. We are just trying to be cautious and helpful, especially since you seemed so concerned that the remaining drive may fail before you can rebuild the array; that raised warning flags for us.




Apr 13, 2023 8:40 PM in response to martyscholes

martyscholes wrote:

Thanks for the response. The SMARTReporter tool does not support external drives. The disks in question are WD 5 TB Elements Portable USB drives. These are low-end kit, to be sure. The Apple implementation of RAID is also low-end, to be sure.

You need to install a special USB driver in order to attempt to access the external drive's health information. Some adapters/docks/enclosures may prevent access to this health information even when using the special USB driver. I didn't think to mention it before since you mentioned Disk Utility showed "SMART Status Verified" so I thought these may have been Thunderbolt drives instead of USB drives.


The SMARTReporter FAQ does mention that a special driver is needed and has links to the open source project's GitHub page.

https://www.corecode.io/smartreporter/faq.html


You can get the special USB driver from the DriveDx website (they have some nice instructions there for uninstalling the driver) or from the USB driver project's GitHub page. The DriveDx version I believe is the better option as it is signed by the DriveDx developers and will likely be easier to install with macOS' security features.


FYI, while I normally suggest using DriveDx instead of SMARTReporter, DriveDx from BinaryFruit may no longer be an active business, since the download button has not worked for several months. Another user said the site will happily take people's money even though the download link is broken. I'm not sure what happened to them; perhaps it is a single developer who is, for whatever reason, unavailable to fix the broken button/link. The last update was in 2021. Our organization uses both apps, and technically it is still possible to download the DriveDx app from the site with the proper link.

https://binaryfruit.com/drivedx/usb-drive-support


https://github.com/kasbert/OS-X-SAT-SMART-Driver






Apr 10, 2023 10:46 AM in response to martyscholes

RAID is far, far, far more sensitive than the criteria Drive-makers use for RMA. ANY error tears apart the RAID.


The appropriate procedure is to re-write the bad blocks, which you have done by re-imaging. This will cause the drive controller to substitute a "spare" block for any blocks it has observed are Bad. Spare Blocks are blocks the drive keeps in a private reserve for exactly this purpose.


The criteria for RMA are that the drive has such an overwhelming number of Bad Blocks that it runs out of spares, and can no longer substitute spare blocks because the reserved blocks are depleted.


-----

The question YOU have to answer for yourself is, "is this drive still good enough to hold my precious data?"


If not, this drive is no longer good enough to be part of the RAID. However, it may not be BAD enough to be returned to the drive-maker under RMA.

Apr 10, 2023 10:57 AM in response to martyscholes

The R in RAID is for redundant, but RAID is NOT backup!


• Mirrored RAID is used to reduce the time-to-repair after a failure, and to keep drive failures from becoming a data disaster. It does not protect from human error, crazy software, or 'just-because' failures.


• Striped RAID can be somewhat faster in some cases (especially in an array built from Rotating Magnetic drives), but it is brittle, and you MUST have another copy nearby in case of failure. A striped RAID failure destroys EVERYTHING on it, with No hope of recovery. Most users would be better served by a faster SSD than a striped rotating magnetic RAID.


• Concatenated RAID is not really RAID at all, it is "just a bunch of drives" aka JBOD, pasted together and acting as if it were one HUGE drive. So you can take two larger drives, concatenate them into one Volume, and have a really big Backup drive, for example.


• RAID 5 computes parity blocks from the data blocks (in real time, coming and going), and stripes the data AND the parity blocks across the drives in such a way that a failure in any one of the drives (three, in the minimal configuration) still allows the data to be recovered from the remaining drives. It requires parity-computing hardware to be seen as anywhere near fast enough for most uses.
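
RAID 5's redundancy is commonly implemented as XOR parity. Here is a minimal sketch of the idea with two data blocks and one parity block, as on a three-drive set; the block contents are arbitrary and the helper name is made up for illustration.

```python
def xor_blocks(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    return bytes(x ^ y for x, y in zip(a, b))


# Two data blocks, as if stored on drives 0 and 1
d0 = b"TIME"
d1 = b"MACH"

# Parity block, as if stored on drive 2
parity = xor_blocks(d0, d1)

# Drive 1 dies: rebuild its block from the two survivors
recovered = xor_blocks(d0, parity)
assert recovered == d1
```

Any one of the three blocks can be rebuilt by XOR-ing the other two, but rebuilding means reading every surviving block in full.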


Criticisms of RAID-5 include the cost and delays induced by the extra hardware, and the HUGE amount of time it takes to re-create a large data set using RAID-5. Re-creation time is so large, another drive is non-trivially likely to fail in the time it takes to re-create the data, making the entire concept shaky.
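
A quick sketch of why re-creation takes so long: at minimum, the rebuild has to write the whole replacement drive. The 100 MB/s sustained rate used below is an assumption for a portable USB hard drive, not a measured figure.

```python
def rebuild_hours(capacity_bytes: float, mb_per_s: float) -> float:
    """Best-case hours to write a whole drive at a sustained rate in MB/s."""
    seconds = capacity_bytes / (mb_per_s * 1e6)
    return seconds / 3600


# One 5 TB member at an assumed 100 MB/s
print(f"{rebuild_hours(5e12, 100):.1f} hours")
```

That works out to roughly 14 hours in the best case, with the surviving drive being read flat out the whole time.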


Executive summary: Most users would be better served using multiple drives to make multiple backups, rather than dedicating multiple drives to RAID arrays.


Apr 10, 2023 11:20 AM in response to Grant Bennet-Alder

Thanks for the quick reply.


The criteria for RMA are that the drive has such an overwhelming number of Bad Blocks that it runs out of spares, and can no longer substitute spare blocks because the reserved blocks are depleted.


Presumably, there is a problem of some sort since MacOS has declared it as failed. Has it been determined to fail because of an overwhelming number of Bad Blocks? If so, where is that logged? Why did MacOS declare the drive has failed?


Executive summary: Most users would be better served using multiple drives to make multiple backups, rather than dedicating multiple drives to RAID arrays.


Is that possible with Time Machine, to spray the backups to multiple drives such that if a drive fails I can rebuild what was there? This "array" (and I use that term loosely) is the target for Time Machine backups, enabling PITR from human error, crazy software, or 'just-because' failures.


Thanks again,

Marty

Apr 10, 2023 11:49 AM in response to MrHoffman

Thanks for the quick reply. I could not find the logs in the console, but was unsure what keywords to use or which file to view.


You are correct that this is MacOS software RAID and it is also very low-end (yet it feels somewhat ZFS-ish), as are the spindles I am using. It seems WD is experiencing some sort of security incident right now and will not take RMAs, but I submitted a question to support. If I have to eat these drives and replace them, I can find a way to live with that. I was hoping that someone knew off the top of their head where the logs for the rebuild were. It seems that this is a mystery to many of us.


And as mentioned earlier in the thread, RAID is not backup. Not even close.


Literally every replier to this thread has stated that RAID is not backup. That tells me that people have this intense urge to state that something is not something else, but I do not understand the source of that urge. My post did not even hint that RAID == backup, so you and the other repliers are not correcting my post, but what are you really trying to say here? I don't run around stating similar expressions of inequality, such as "the sun is not water, not even close" without a reason for stating it. Where is this statement coming from?


Many thanks,

Marty

Apr 10, 2023 12:01 PM in response to martyscholes

Why the mention of RAID and backup? I'm personally familiar with multiple cases involving data loss due to RAID failures, and no backups. I've encountered other cases posted here. We don't know what you know or do not know, only what you have posted.


WD has been offline after a server security breach, including user data offline at the WD My Cloud servers. A server breach which reportedly occurred on March 26th. I deal with a number of HGST HDDs from the IBM heritage, and which WD now owns. Not Fun.

Apr 10, 2023 12:32 PM in response to martyscholes

<< Presumably, there is a problem of some sort since MacOS has declared it as failed. Has it been determined to fail because of an overwhelming number of Bad Blocks? If so, where is that logged? Why did MacOS declare the drive has failed? >>


MacOS has not declared anything failed. Your RAID blew apart because ONE drive encountered ONE error that could not immediately be repaired. When a drive has an error that can't be fixed, the RAID software declares that drive "degraded".


You fixed that error by re-imaging. If you had no other issues, your RAID would now be perking along happily.


As far as the drive makers are concerned, as long as you can re-image the drive successfully, your DRIVE is not failed, and that is all they care about. If your RAID blew apart because of the one error, that is NOT cause for drive replacement. Drive makers don't want your error logs and will NOT act on those errors, because they consider them correctable by re-imaging. (Years ago the standard was to Write Zeroes to every block first, then re-image.)


--------

Large-scale studies conducted in Google's server farms showed that, over an extremely large sample of drives running 24/7, the number of errors encountered generally tended to snowball once the first error had occurred, starting a cascade that got worse and worse. Over a very large sample, the average drive in their study was replaced due to an overwhelming accumulation of errors within about six months. However, the variation from this average was very high. Some failed sooner; some continued to be stable for a very long time.

Apr 13, 2023 3:50 PM in response to HWTech

Thanks for the response. The SMARTReporter tool does not support external drives. The disks in question are WD 5 TB Elements Portable USB drives. These are low-end kit, to be sure. The Apple implementation of RAID is also low-end, to be sure.


This volume is simply a target for Time Machine backups, as noted in the post. I am taking backups, using Time Machine, and storing them on a (low-end) RAID array. I am using both backups and RAID, never conflating the two.


The alternatives were either to skip backups, recognizing that my critical files are stored in the cloud, or backing up to a single disk with no redundancy, which I did for several months. I have since changed that policy to back up to a 3-disk RAID1 volume, to increase the likelihood of recovery. I am glad I did, since I have experienced a 66.67% failure rate.

Apr 13, 2023 6:51 PM in response to martyscholes

Allow me to propose something completely different, yet very similar.


Time Machine can accept multiple backup destination drives. Successive backups alternate among the destinations.


When added, each drive will be the start of a new stand-alone backup set, not dependent on any other drives, starting from now and alternating with your other backup sets. Drives can be removed and new drives added any time you like.


I am currently using one backup drive next to my computer, and one shared from a Server in the basement.

Sep 30, 2023 11:21 PM in response to martyscholes

I just uninstalled softRAID, because of their new $110/year subscription plan, and switched to Apple RAID which is free.


The drives are just as fast as before and they also "chatter" a lot less when not being accessed. I can use an affordable SMART monitor tool to keep tabs on the disks.


I agree softRAID has a decent interface, but just can't do another subscription... paying in advance for a promise of some future benefit. Nope, let me see the goods now and I will pay you now.

Sep 30, 2023 11:31 PM in response to HWTech

I would like to add that not every RAID implementation is the same. RAID 1+0 is not as risky as RAID 0, for instance. I agree that RAID is not the same as a snapshot of your data, i.e. a backup, but it can protect against losing one or even two HDDs if you choose the correct implementation.


I disagree with your comment about always putting every drive of a RAID in one enclosure. An example... softRAID actually suggests using two enclosures for a RAID 1+0. The idea being more redundancy. If one of the enclosures has a problem you still have a working mirror in the other. A mirror in each enclosure striped.
