Software RAID Failure - my experience and solution
I just wanted to share this information with the iCloud community.
I searched a bit and did not find much information that was useful with regard to my software RAID issue.
I have a 27-inch Mid 2011 iMac with an SSD and a hard drive, which has been great.
I added an external hard drive (I think if I mention any brand name the moderator will delete this post): a nice aluminum case with two 3 TB hard drives inside, a big blue light on the front, and a Thunderbolt connection. This unit is about 2 years old, and I have it set up as a 3 TB mirrored RAID (RAID 1), a software RAID configured with Mac OS Disk Utility.
At one point I had a minor glitch, which was fixed using another piece of software (again, if I mention a brand the moderator will delete this post) with a 'Harddrive Fighter' or similar type of name LOL. Otherwise that RAID has served me well as the home for my Time Machine backup, Aperture Vault, etc. (I created a 1.5 TB sparse bundle for Time Machine so that the backup would not use the entire 3 TB.)
I recently purchased a second aluminum block of drives, and set that up as a 4 TB RAID 1.
Each of the two RAIDs is set with the option “Automatically rebuild RAID mirror sets” checked.
I put only about 400 GB on the new RAID to let it sit for a ‘burn-in period.’
A few days ago the monitoring software from the vendor who sells the aluminum block of drives told me I had a problem. One of the drives had “Failed.” Strangely enough, the monitoring software does not identify the drives, so you cannot tell which pair has the issue, and I assumed it was the new 8 TB model. Long story short, it was the older 6 TB model, but that does not matter for this discussion.
I contacted the vendor, and this is part of their response:
“This is an indication that the Disk Utility application in Mac had a momentary problem communicating with the drive mechanism. As a result, it marked that drive as "failed" in the header information. Unfortunately, once this designation is applied to a drive by the OS, the Disk Utility will thereafter refuse to attempt any further operations with that disk until the incorrect "failed" marker is manually cleared off the drive.”
That did not sound very good to me…..backup killed by a SOFTWARE GLITCH?
“The solution is to remove the corrupted volume header, and allow the generation of a new one….This command will need to be done for each disk in the array… (using Terminal)…
diskutil zerodisk (identifier)
…3. After everything is finished, you should be able to exit Terminal, and go back into the Disk Utility Application to re-configure the RAID array on the device.”
Furthermore, they said:
“If the Disk Utility has placed a flag into the RAID array header (which exists on both drives) then performing this procedure on a single drive will not correct anything.”
And…
“When a drive actually does fail, it typically stops appearing in the Disk Utility application altogether. In that circumstance, it will never be marked "failed" by the Disk Utility, so the header erase operation is not needed.”
This all sounded like a bad idea to me. And what does the vendor's RAID monitor software say then? “Disk Really Really FAILED, check for a fire.”
As I tried to figure out which RAID pair actually had the bad drive, I stumbled on a solution.
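For anyone weighing the vendor's per-disk step quoted above, here is a minimal dry-run sketch. It only prints the command for each disk and runs nothing, since zeroing a disk erases it. The identifiers disk2 and disk3 are hypothetical placeholders (on a Mac, `diskutil list` shows the real ones), and `zeroDisk` is the spelling used in the diskutil man page, where the vendor wrote `zerodisk`.

```shell
#!/bin/sh
# Dry run only: prints the command for each disk instead of running it.
# Nothing here touches a drive.
print_zero_cmds() {
  for id in "$@"; do
    echo "diskutil zeroDisk $id"
  done
}

# disk2 and disk3 are hypothetical identifiers, for illustration only.
print_zero_cmds disk2 disk3
```

Printing first gives a chance to double-check that each identifier belongs to the RAID enclosure and not to the internal drive.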
First I noted that the OS Disk Utility did NOT show a fault in the RAID. It listed both RAIDs as “Online.” Thus no rebuilding was needed and it did not begin the rebuild process.
The vendor's disk monitor software saw some fault, but the Mac was still able to read and write to both disks in the mirror. I wrote a folder to the RAID and, with various rebooting steps, pulled the “Bad” drive and looked at the “Good” drive….the folder was there…I put the Bad drive back in and pulled the Good drive, and the folder was there on the “Bad” drive too. So it wrote to both drives. AND THE VENDOR'S MONITORING SOFTWARE SHOWED THE PREVIOUSLY LABELED ‘BAD’ DRIVE AS ‘GOOD’ AND THE MISSING DRIVE SLOT AS ‘BAD’.
My stumbled-upon FIX: I moved a bunch of files off the failed RAID to the new RAID, but before I moved the sparse bundle, a folder with 500 GB of movies, and some other really big folders, the DISK UTILITY WINDOW (which I still had open) now showed that the RAID had a defect and began rebuilding the mirror set by itself, out of the blue! I don't know why this happened. Perhaps moving about half of the data off of it did something? Any ideas?
The rebuild took a few hours as best I can tell (I let it run overnight), and the next day the RAID was fine and the vendor's RAID monitor no longer showed a fault.
So, the vendor's RAID monitoring software reports a “FAILED” drive without any specific error codes to look up. Perhaps they could give the user more info on the specific fault? The vendor's support line said with certainty that “the Volume Header is corrupted” and that THE ONLY FIX is to completely ZERO THE DRIVE! As it turns out, this was not necessary.
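The same status can also be read from Terminal with `diskutil appleRAID list`. Below is a rough sketch of a filter that pulls out just the status of each RAID set. It assumes the set status is printed on a line beginning with "Status:", which is how I understand the output to be laid out. The sample input is made up for illustration and is not output from my drives.

```shell
#!/bin/sh
# Reads `diskutil appleRAID list` output on stdin and prints each set's status.
# Assumption: the set status sits on a line that starts with "Status:".
raid_set_status() {
  awk -F': *' '/^Status:/ { print $2 }'
}

# On a Mac you would run:
#   diskutil appleRAID list | raid_set_status
# Hypothetical sample input, for illustration only:
printf 'Name:     Backup\nStatus:   Online\n' | raid_set_status
```

Anything other than "Online" here would be a reason to look closer before trusting either the OS or the vendor's monitor.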
And the stick in the eye to me…..
“I've also sometimes seen the drives get marked as "failed" by the disk utility due to a shaky connection. In some cases, swapping the ends of the Thunderbolt cable will help with this. Something to try, perhaps, if your problems come back.”
Ya Right…..
Mike
iMac, Mac OS X (10.7.2), 27