11 Replies Latest reply: Jan 30, 2013 11:37 PM by Seijo
Seijo Level 1 (0 points)

We have 3 JBOD drives set up: one as a shared network drive (12TB), and two 8TB units, one of which is the primary drive and the other a backup to that drive, which also holds the individual system backups for all the networked computers.  The reason for this configuration is that it has to deal with and encode RAW video, HD and otherwise, and we need ALL the files to be instantly accessible and searchable (with basic metadata), and Spotlight is such a POS that it has been unusable since its inception.  I don't need lectures on Spotlight or the use/build of JBODs/RAIDs, so if, like on many of the searches I made here, that is your goal... please don't.

 

At issue is that, for the first time in 3 years, we are now experiencing problems with the primary JBOD drive.  Unfortunately, nothing that comes with OSX, and none of about 15 different backup and recovery packages, allows any form of information retrieval or maintenance on such drives: no SMART info is available from any of the 12 individual SATA drives involved, no software we have sees the drives individually, and most don't even recognize the drives as a single JBOD.  Most packages only see a drive if it is mounted on the desktop, and they only see it as it is presented via the OSX system (so CCC and DiskWarrior see them sufficiently for their job, but iPartition fails miserably in any form, and it and TechTool are horrifyingly dangerous to operate with a JBOD mounted).

This means we have no way of determining which of the 4 drives in the primary set is failing, only that as it does so, it is leaving bad sectors which are being ignored by both the system and any hardware on the actual drive... our only notification that this was happening was that when an affected file was accessed (the actual contents read, not just the headers and TOCs), the resulting I/O error force-unmounted the drive from the system, claiming we had disconnected it.  This occurred during two backup attempts, which led us to the diagnosis... as soon as we removed the files from the backup process, the I/O errors and forced dismounts stopped.  We then confirmed this by simply reading the file without actually loading it into anything: the drive dismounted halfway through the read (but as the indicator lights on three of the drives were blinking, there was no way to figure out which one).

We then attempted to use three different programs to locate the affected sectors (using a sector scan, which is supposed to locate AND mark the sectors). But, as scanning 8TB takes a day via eSATA (god forbid FireWire or USB), each time we got back after leaving the software running overnight (or over the weekend), we discovered that none of the programs had survived hitting the affected sector(s).  When the drive got dismounted, every one of them crashed out (IIRC one caused OSX or the GUI to hang, and a cold reboot was required).  Thank the Lord for log files, even though they aren't very detailed.

 

Also note that the main system is a G5 quad running OSX 10.4.11, under hardware and software necessity, and that, even though we have 10.5, it is only used during times when other work is not on-going (typically during only about 16 hours on Sundays). So software CANNOT be MacIntel, or require OSX over 10.5.

 

So, with this knowledge, the questions are:

1. Is there a software package that can recognize and locate bad sectors, without crashing/unceremoniously dismounting the drive?

2. Is there a software package that can locate the exact position of the data of a file (without actually reading the data itself), across a JBOD or other RAID configuration (drive, tracks, sectors)?  Is this available via a terminal/unix command?

3. Is there recovery software that can access the files on the good drives in the set? (don't need this now, but might need it in the future)

4. Is there a general drive software that is actually aware of Apple's software RAID set-ups, and is configured to deal with them?

5. Is there any way to retrieve SMART data from drives over eSATA, or through a drive-enclosure chipset that doesn't normally pass it along?

 

Thanks


Mac OS X (10.4.11), Several incl. G5-quad, G4 AGPs, G3
  • BobHarris Level 6 (14,670 points)

    If you have a PC available, you could try SpinRite from GRC.com. You would put the suspect drives in the PC one at a time and run SpinRite on them. If the drive and its data can be recovered, it will recover them. However, once you have your data, you should replace the bad drives.

  • etresoft Level 7 (25,620 points)

    I don't know the answer to any of your questions. Being a Mac user, I don't want to deal with low-level issues like that. Using a Mac allows me to focus on the big picture. I will let the Linux people worry about disk sectors.

     

    What's wrong with Spotlight? I think that would have been the better question to address before you got to this point. If there is something going wrong with your Mac, it is best to investigate and solve the problem before you put 18 TB of data on it.

     

    I see now that you have issued a pre-emptive plea to ignore the big picture and focus just on the disk sectors. Sorry, no can do. You have said that your drive is failing, but is not yet dead. This means that you still have a chance to save most of your data. Stop trying to look for the failure. JBOD is just a neat trick. You aren't supposed to use it. Get yourself a real RAID array and move your data to it. Do that now. I will issue my own pre-emptive plea to avoid any petty arguments on a discussion forum. It is your data that is being corrupted here, not mine. You can have a real RAID array in a few days. In Boston, someone local might be able to do it. Then your problem is solved - gone. If you have that much data, you might also want to look into a small HSM. You have gone past consumer equipment here.

  • Seijo Level 1 (0 points)

    Unbelievable! You are worse than useless, and a condescending twit, as well!

     

    You clearly didn't read this, and you provided nothing but your senseless and ******** derision... here's something prime you ignored:

    "I don't need lectures on spotlight or the use/build of JBODs/RAIDs, so if, like on many of the searches I made here, that is your  goal... please don't."

     

    1. If you don't know the answer to any of the questions, *** did you even post for?

    2. If you only want to work on the big picture, why are you even looking at low-level help requests?

    3. Why do only Linux people have to worry about sector issues?

    4. What's wrong with spotlight? There's over 20,000 posts here telling you why!  Go read them!

    5. There's nothing wrong with the Mac.  You, that's different.

    6. We've had over 8TB running constantly under this system for over 3 years, there is nothing wrong with it.

     

    You obviously don't have a life, and you are a total @$$.  Your setup and beliefs about computer use are apparently your own, and have no basis in reality for thousands of us out here.

  • Seijo Level 1 (0 points)

    That might help, if SpinRite can take a drive located inside an external enclosure, and just find the bad sectors.  It would take more than a day for each drive, as the only PCs we have right now are laptops, and none of them can read eSATA via port multiplier. Also, the drive headers and catalog will primarily be on only a single one of the drives (of 4), while the file pointers and actual data will be on all of them.  So if SpinRite needs to look at and find TOCs and other structure data, it won't be able to work with the drives in this form.

     

    As far as replacing the drive, that's why I've made this query... I just have to figure which one is failing, then it's as simple as replacing it, running a restore from the backup drive, and returning to normal operation.

     

    I'll check out SpinRite, and see if it can do this.  I was hoping for a proper piece of Mac software that would do this.  You'd think it would be covered under Disk Utility, since it builds the various software RAIDs, but no, it can't even tell when there are structural problems with normal single-partition drives.

     

    Thanks.

  • etresoft Level 7 (25,620 points)

    Seijo wrote:

     

    Unbelievable! You are worse than useless, and a condescending twit, as well!

    And you are insulting someone who is trying to help you save your data before it is too late.

     

    You clearly didn't read this, and you provided nothing but your senseless and ******** derision... here's something prime you ignored:

    "I don't need lectures on spotlight or the use/build of JBODs/RAIDs, so if, like on many of the searches I made here, that is your  goal... please don't."

    If you wipe the froth from your mouth and re-read my reply, you will see that I did read your plea and decided it was more important to help you save your data than help you fiddle with disk sectors while your files were being lost forever.

     

    1. If you don't know the answer to any of the questions, *** did you even post for?

    To help you save your data.

     

    2. If you only want to work on the big picture, why are you even looking at low-level help requests?

    Because such low-level help requests are almost always misguided. People go down a rabbit hole and get stuck on minutiae and miss easy, higher-level solutions.

     

    3. Why do only Linux people have to worry about sector issues?

    Because that's what they do with their spare time.

     

    4. What's wrong with spotlight? There's over 20,000 posts here telling you why!  Go read them!

    This is a user-to-user tech support forum. Most posts in this discussion forum are about problems. Usually those problems are created by the user through misconfiguration. Sometimes they are hardware failures. Your problem seems like some of both. In any event, the forums are a poor way to judge the relative quality of any particular technology. Anything used as much as Apple devices is bound to have many reports of problems. In the big picture of hundreds of millions of users, thousands of such problem reports are insignificant.

     

    6. We've had over 8TB running constantly under this system for over 3 years, there is nothing wrong with it.

    If there is nothing wrong with it then why are you posting here?

     

    You obviously don't have a life, and you are a total @$$.  Your setup and beliefs about computer use are apparently your own, and have no basis in reality for thousands of us out here.

    You're welcome. Enjoy what files you have left.

  • Christopher Murphy Level 3 (555 points)

    What you're looking for is smartmontools. That's the set of tools that includes smartctl for polling the SMART information off ATA disks. So long as the drives are directly connected to the motherboard or to a PCI controller, it's easy to use. If the drives are in an enclosure of any type, then it gets a lot trickier because most of the bridge chipsets fail to pass through the SMART command set.

     

    Getting smartmontools on OS X is not easy. If you're familiar with MacPorts you can get it from there, but of course they provide source, not binaries, so you have to have Xcode to compile it for 10.4 or 10.5. I have a DMG/PKG installer of smartmontools ppc for 10.5. I doubt it will work on 10.4.

     

    The least invasive way to get access to smartmontools, requiring no installation, is to download a Linux distribution for PPC. I'd vote for Fedora 17 because I know it already contains smartmontools, at least for i386 and x86_64, and I don't see why the PPC build wouldn't include it too. Even if it doesn't, it's straightforward to install: although the computer is booted from a CD, you can still install software (it goes away every time you reboot, as it's installed in RAM), which is a huge plus of a LiveCD. I can't tell you for certain whether you need the ppc64 or the 32-bit version, however. I think for the G5 the 64-bit is what you want.

     

    In any case, smartctl needs to be pointed to the actual physical disk. An array is presented to the user by the OS as a sort of logical volume. That logical volume is made up of multiple physical disks. If you are on OS X you use:

     

    diskutil list


    You will see the individual disks and you will see the array, each will have different /dev/diskX designations.

     

    From a Linux LiveCD you'd get your listing of disks using:

     

    parted -l

     

    That's a lowercase L. And the designation there is /dev/sdX. You have to run smartctl on each individual physical disk making up the concat array. I do this on OS X on mounted volumes, as well as on mounted or unmounted volumes in Linux, so that part doesn't matter. It's a read-only command in any case. The basic command to get attributes on the disk and find out which one has problems is:

     

    smartctl -a /dev/diskX

    or

    smartctl -a /dev/sdX

     

    The first is OS X's disk designation and the second is Linux's. OS X uses numbers for devices, e.g. disk0, disk1; and Linux uses letters, e.g. sda, sdb, etc.
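
    To avoid retyping that command for each member disk, the per-disk check can be wrapped in a small loop. This is only a sketch under assumptions: the function name `smart_report_all` and the device names in the usage line are made up for illustration, and it presumes smartctl is on the PATH and can actually reach the drives (direct SATA, not through a bridge chipset that blocks SMART).

```shell
#!/bin/sh
# Sketch: run "smartctl -a" against every physical member disk passed
# as an argument, and keep going even if one of them reports trouble.
# smart_report_all is a hypothetical helper name, not part of smartmontools.
smart_report_all() {
    for dev in "$@"; do
        echo "=== $dev ==="
        # smartctl uses a non-zero exit status to flag problems,
        # so a failure here must not abort the loop.
        smartctl -a "$dev" || echo "smartctl reported a problem on $dev"
    done
}

# Usage (run as root; the device names are examples only, take the real
# ones from "diskutil list" on OS X or "parted -l" on Linux):
#   smart_report_all /dev/disk2 /dev/disk3 /dev/disk4 /dev/disk5
```

    Pass only the physical member disks, not the /dev/diskX entry of the assembled array, since the array itself has no SMART data.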

     

    But now that I've written all of this, the bad news is I don't think you can remove a disk from an OS X software linear array, unless it's a RAID 1+linear nested array. As far as I know, there is no command like LVM's pvmove to migrate the data from the bad disk to a new disk, whether the array is online or offline. I think your only choice is going to be to back up, find the bad drive, replace it, recreate the array from scratch, and then restore. This is a LOT easier to deal with on Linux, using LVM and pvmove for this exact situation.

     

    BTW, for DAS and video you're better off with RAID 0 than linear/concat. At least with RAID 0 you get scalable performance, whereas with linear/concat your performance is limited to that of a single disk.

  • Christopher Murphy Level 3 (555 points)

    Aha, so now I see that this is in an external enclosure. You have to know what chipset it is, then you have to read the exhaustive documentation for smartctl to find out what parameters to use to get through the chipset, assuming it's even possible. You're better off pulling the drives from the enclosure, direct connecting them to the SATA port on the motherboard, and polling the drives individually that way.

     

    Also, I think your storage expectations are flawed. You're talking enterprise storage expectations, but the hardware you've got isn't enterprise. The way to do it correctly is direct connect SATA to a SATA PCI card, no chipset in between. Or you build a NAS and run smartctl/smartd on the NAS to directly poll the drives. Enclosure bridge chipsets almost universally are junk. So expecting any utility to automagically deal with junk is not a good expectation. It isn't the fault of Disk Utility when bridge chipsets don't pass through all ATA commands, including SMART commands, to/from the drives.

  • Christopher Murphy Level 3 (555 points)

    Wait wait wait. I should be smarter than this. If this is a Disk Utility software concat RAID, then obviously it sees individual disks. And if you're getting an unrecoverable read error (URE) on a bad sector, then dmesg should report the device and sector of the failed read. So you should reproduce the read failure by reading the file causing the problem, and then go to Terminal and type

    sudo dmesg

    And now sift through that for a read error. You could even try:

    sudo dmesg | grep error

    And see if any lines show up. Other variations are possible too:

    sudo dmesg | grep sector

    sudo dmesg | grep bad

    sudo dmesg | grep read

     

    Although that's just a guess. I'm not sure how XNU reports bad sectors. But in any case, you can match the device it complains about getting an error from with the results from Apple System Profiler. There you will find the BSD disk name as disk0, disk1, disk2, and also what the serial number is. And you can then correlate the disk with bad sectors with a disk serial number and then remove the offending disk.
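
    The separate greps above can be folded into a single case-insensitive filter that only keeps lines naming a disk device. A rough sketch: `disk_errors` is a made-up helper name, and because it is not clear how XNU words these messages, the keyword list is a guess to be adjusted against whatever the log really contains.

```shell
#!/bin/sh
# Sketch: keep only kernel-log lines that mention a BSD disk device
# (disk0, disk1s2, ...) together with an error-like keyword.
# disk_errors is a hypothetical helper name; the keywords are a guess,
# not a documented XNU message format.
disk_errors() {
    grep -i 'disk[0-9]' | grep -iE 'error|sector|bad|read'
}

# Usage, right after reproducing the failed read:
#   sudo dmesg | disk_errors
```

    Requiring a diskN name in the line cuts out unrelated "read" or "error" chatter, and the name it leaves behind is the one to look up in Apple System Profiler.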

  • Christopher Murphy Level 3 (555 points)

    Another thing to do is to simply overwrite the offending file with a known good copy. This will cause the file system to want to replace those sectors. When the disk attempts to write to the bad sector, if there is a persistent write failure, then the disk firmware will remove that sector from use and replace it with a reserve sector.

     

    Bad sectors are somewhat normal and not inherently a reason for replacing a drive. But only by looking at the full SMART attribute data is it possible to determine whether there are bigger problems going on with a drive than just a few bad sectors that need to be written to in order to force reallocation.
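
    As a starting point for reading that attribute data, the sector-related counters can be pulled out of the attribute table. This is a sketch only: `bad_sector_attrs` is a hypothetical helper name, it assumes the usual ten-column table printed by `smartctl -A` for ATA disks, and the three attribute names are the ones smartmontools commonly prints; some drives label them differently.

```shell
#!/bin/sh
# Sketch: read "smartctl -A" output on stdin and print the sector-health
# attributes whose raw value (10th column) is above zero.
# bad_sector_attrs is a hypothetical helper name; the attribute names are
# the common smartmontools labels and may differ on some drives.
bad_sector_attrs() {
    awk '$2 ~ /^(Reallocated_Sector_Ct|Current_Pending_Sector|Offline_Uncorrectable)$/ && $10 + 0 > 0 { print $2, $10 }'
}

# Usage (device name is an example):
#   sudo smartctl -A /dev/sdX | bad_sector_attrs
```

    No output means none of the three counters is above zero; it does not by itself mean the drive is healthy, so the full `smartctl -a` report is still worth reading.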

     

    Consumer SATA disks really should be zeroed on a regular basis (once every year or two). Better still, issue the ATA Secure Erase command (which can be done from a Linux LiveCD with the hdparm command); it is quite a bit faster than writing zeros and can be run on multiple disks at once. The firmware itself does the erasure, so bus/controller bandwidth is a non-factor.

  • Seijo Level 1 (0 points)

    etresoft wrote:

    And you are insulting someone who is trying to help you save your data before it is too late.

     

    Bull****.  Your condescending tone from the start belies that!  On top of that it's obvious we have backups.  Your point, again, here, proves you are full of it.  We lost NOTHING but a little time to deal with this.  It didn't even impinge on work-time for projects.

     

    etresoft wrote:

    If you wipe the froth from your mouth and re-read my reply, you will see that I did read your plea and decided it was more important to help you save your data than help you fiddle with disk sectors while your files were being lost forever.

     

    Again, a full load of ****.  See the first response.

     

    1. If you don't know the answer to any of the questions, *** did you even post for?

    etresoft wrote:

    To help you save your data.

     

    Again, see the first response.

     

    2. If you only want to work on the big picture, why are you even looking at low-level help requests?

    etresoft wrote:

    Because such low-level help requests are almost always misguided. People go down a rabbit hole and get stuck on minutiae and miss easy, higher-level solutions.

     

    Again, a total load.  You obviously think you are a computer ghod, and everyone else in the world is a total idiot.  You're just like any other bigot.  And if you had really read the query, you would know your premise was completely erroneous.

     

    3. Why do only Linux people have to worry about sector issues?

    etresoft wrote:

    Because that's what they do with their spare time.

     

    WOW!  See the previous response.  You are a total twit.

    4. What's wrong with spotlight? There's over 20,000 posts here telling you why!  Go read them!

    etresoft wrote:

    This is a user-to-user tech support forum. Most posts in this discussion forum are about problems. Usually those problems are created by the user through misconfiguration.

     

    Wow, again the bigot insults everyone in the community.  And again if you had really read the query, you would know your bigoted premise was not applicable.

     

    6. We've had over 8TB running constantly under this system for over 3 years, there is nothing wrong with it.

    etresoft wrote:

    If there is nothing wrong with it then why are you posting here?

     

    Because, you total idiot, the configuration isn't the issue (and individual HDDs fail all the time before 3 years are up), it's the tools and identification methods (or lack thereof).

     

    You obviously don't have a life, and you are a total @$$.  Your setup and beliefs about computer use are apparently your own, and have no basis in reality for thousands of us out here.

    etresoft wrote:

    You're welcome. Enjoy what files you have left.

     

    Again, you are worse than useless, and apparently only troll here to get numbers that make you feel like a worthy humanoid.  It's really too bad you can't get penalized for being such, as well as just being a bad representation of the community... unlike the other responders here.

    Your responses proved every single point of mine, and your secondary responses prove what type of a troll you are.  Oh, well, on to those who actually want to be of help...

     

    BTW, no files were lost or permanently damaged, so your taunts and insults are having the opposite of your intended effects.

  • Seijo Level 1 (0 points)

    @Bob Harris

    I tried SpinRite first, but unfortunately something about it didn't like our HP hardware, and attempting to boot it, gave an inappropriate command, at which point it hung... we know it works on other hardware since one of our other employees had a copy for his own use.  He later showed the program to me at home, and if I had actually used it, it would have completely wrecked the drive (like it would have mattered anyway :-)  ).  Later experimentation on the drive showed that the issue was due to some form of a hardware error during writing, and the longer a file was the more damage was being done to the drive.  Unfortunately any attempt to read the trashed sectors dismounts the drive. However, no actual errors were occurring from reading good sectors.  Even when we later hooked the drive to the motherboard SATA, reading a "bad" sector dismounted the drive.

     

    Thanks for the info, though... that was very informative, and an option for later possible problems.

     

    @Christopher Murphy

    Smartmontools was the way I eventually went...

    Bizarrely, a Windows guru (now working at a Windows Store) pointed me to "The Mac Shop," which is only about 3 miles from my house.  A quick call to them pointed out that the only simple way to do this was to plug the drive into a motherboard with a direct SATA connection and some form of SMART reading software installed or launched off CD-ROM.  One of the HP towers had an empty secondary drive spot open, so we plugged it in and launched Linux and smartmontools (later we added a Windows-based SMART reader).  It took a while as we swapped out drive after drive (putting the drives in and taking them out of the trays took the longest time; the test was lightning in comparison).

    Turned out the bad drive was the fourth and final one, and showed up passing on all but a few of the tests, but those were catastrophic.

    For experimentation, when one of the G5s later became free, we added SMARTReporter (very weak compared to Smartmontools, but sufficient) to the machine, yanked the Data drive (each G5 has a small system drive and a huge data drive) and stuck the bad drive in... on powering up, OSX via Disk Utility immediately notified us that the drive failed SMART, but had NO details whatsoever, SMARTReporter gave a few details, but nothing to hint what was really happening to the drive.

     

    As to your third response, I did try looking into log files for anything, but found nothing...  however, I did not try dmesg, so that might work... though I'm not sure that it would work, if the first response to hitting the trashed sectors was to completely and unceremoniously eject the drive (OSX saw it as pulling the drive from the system, just like you would get if you pulled a USB stick out without dismounting it).  I'll keep this in mind if something similar happens.

     

    As to your fourth response ("simply overwrite with a good copy").  This doesn't really work... The file system here would remove the original file from the system, then write the new one. The new file, even if identical, does not usually land in the same place, afterwards (Don't know if that's true for all of OSX, or just the systems we've been using the last four or five years).  Besides, the first wrecked sector encountered dismounts the drive.  It might have been nice if it didn't dismount, as I could just find a very large file with the error and the drive with the strongest activity light would be the obvious choice. ^_^

     

    Well, the drive has already been recycled, and about 8 hours after replacing it, we were back to situation normal.

     

    Thanks to both of you.

     

    ==========

     

    In conclusion, if YOU have the same issue... A JBOD array in either individual external cases or a single combo case, where one drive is failing and you cannot determine visually (from the activity LEDs) which one:

    Locate a computer with a direct SATA connection and some form of SMART monitoring software and insert the drives one-by-one until you find the one that shows the failures.

     

    If your Mac has two drive bays, you can use the non-system drive bay as your test, and if SMART reports "FAILURE" Tiger or later OSX should reveal it right after booting up (I'm assuming this function was not removed in later versions of Disk Utility).

    If you have a Mac with only one drive bay, you can try booting from an OSX install disc (which has Disk Utility available on the disc).  If it doesn't notify you immediately, run Disk Utility manually from the install disc after boot.

    If your Mac doesn't have an accessible drive bay, try another computer with SMART monitoring software installed, or use a Linux CD with smartmontools.

    Just be prepared to spend quite a bit of time swapping out drives, and don't stop just because one of the drives shows a failure... TEST ALL THE DRIVES in your array.  The last thing you want to do is repeat this in a month or so, if another of the drives is near to failing.

     

    --Goodbye, and thank you