13 Replies Latest reply: May 28, 2011 12:26 PM by R C-R
Steve Darden Level 2 (155 points)

With four terabyte drives on my Mac Pro I am searching for a way to validate every bit on every drive. Steve Gibson's SpinRite utility will do this, but sadly SpinRite will not run on Intel/EFI Macs (it uses the old BIOS to do the I/O).


What is the best way to accomplish the same job on the Mac Pro SATA drives?


I've run TechTool Deluxe, but I cannot find any documentation on what the "Surface Scan" test actually does. I'm 99% sure it only reads, probably once per block.

Mac Pro, Mac OS X (10.6.7), 23" Apple Cinema HD, ATI Radeon 3870
  • C. D. Tavares Level 1 (120 points)

    The only way I know of is to erase the entire drive with "write zeroes once."  That triggers bad-block reallocation behavior in the drive.  Any block that fails to write properly will be spared out and replaced with a working block.

  • Steve Darden Level 2 (155 points)

    Thanks C. D. I agree that the zero-erase forces bad-block remapping. That is why I prepare new drives using Disk Utility seven-pass erase.


    But, how do we know that our two layers of backup drives are actually any good should the master drive have an OOPS?


    DiskWarrior is good for validating directories, but what about the actual data? As drive density increases the probability of block failure keeps going up. The Google and Carnegie Mellon studies are frightening - and based on our experience, indicative of the problem.

  • R C-R Level 6 (17,400 points)

    There is no practical way to do this besides writing some distinctive data pattern to every sector of the drive & then making sure that you get that same pattern on reads. Some utilities will do this non-destructively by shuffling the data from sectors in use to known good ones before writing anything to those sectors.
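The write-then-verify idea described above can be sketched as follows. This is only a minimal illustration in Python, not how any particular utility works: it runs against an ordinary scratch file rather than a raw device, and the function names (`make_block`, `write_pattern`, `find_mismatches`) and the 0xA5/0x5A pattern are assumptions chosen for the example. Reads of freshly written data are often served from the operating system's cache, so a real test would also have to bypass or flush that cache.

```python
import os


def make_block(pattern: bytes, block_size: int) -> bytes:
    """Repeat `pattern` until it fills exactly one block."""
    return (pattern * (block_size // len(pattern) + 1))[:block_size]


def write_pattern(path: str, n_blocks: int, block: bytes) -> None:
    """Write `n_blocks` copies of `block` to `path` (hypothetical scratch file)."""
    with open(path, "wb") as f:
        for _ in range(n_blocks):
            f.write(block)
        f.flush()
        os.fsync(f.fileno())  # ask the OS to push the data to the device


def find_mismatches(path: str, n_blocks: int, block: bytes) -> list:
    """Read the file back and return the indices of blocks that differ."""
    bad = []
    with open(path, "rb") as f:
        for i in range(n_blocks):
            if f.read(len(block)) != block:
                bad.append(i)
    return bad
```

Running the same two steps a second time with the bitwise complement of the pattern (0x5A/0xA5 here) would exercise both binary states of every bit.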


    But this is just about pointless, at least for modern drives.


    In the first place, modern drives internally monitor the raw results from reads, using CRC codes & analysis of the analog signals coming from the drive's heads to detect read errors before passing anything to the system. No drive, even a flawless one, reads every sector perfectly on the first pass 100% of the time (nor does it read anything less than an entire sector at a time, including the CRC info in the part of the sector reserved for its use).


    The drive has its own sophisticated algorithms to determine if a sector is bad, based on the internal data it collects about the retries required to get good data from it. Some utilities assume any retry, or an arbitrarily determined low number of them, indicates a so-called "weak" sector, even if the drive (with its full access to the raw process) deems it OK. Such utilities are likely to "detect" perfectly good sectors as weak or bad ones.


    In the second place, drives do this internal checking on every read because sectors do go bad from time to time over the life of the drive. You could run the most exhaustive tests possible on every sector (which would involve at least writing two complementary data patterns to each sector to make sure both the high & low binary states are stored correctly) & still have no assurance that the sector won't go bad the next time it is used. In fact, if you ran such tests you would be subjecting the drive to more wear & tear than it gets in normal use, which could in itself contribute to an earlier than normal drive failure.


    Regarding the zero data security erase provided by Disk Utility, it does not force the drive to read back data from any sector, only to write zeros to the appropriate securely erased ones (which does not include every sector on the drive). The drive might do a read-after-write ("RAW") if it has been in service only for a short time, but you cannot rely on this to detect bad sectors, in part because an all-zero data pattern is not a particularly good one for detecting bad sectors for the reasons given above.


    What it all boils down to in simplified terms is that contemporary high density hard drives are smarter than the external software tools used to check them. They have direct access to internal data the external tools can access either not at all or only indirectly with the help of the drive itself. The drive's internal checks operate independently of & are not reliant on anything supported by the OS, the CPU, the file system, etc. -- which includes any external software utilities you might use.


    You can certainly run any of these utilities to add a second layer of reliability testing that you want, but the only practical way to safeguard your data is redundancy, which means storing it in at least two physically separate & distinct places.

  • R C-R Level 6 (17,400 points)

    As I mentioned in my other reply, none of the security erase options provided by Disk Utility (or the underlying diskutil process) actually forces bad block remapping. That is up to the drive to do, & it does that only if it reads back the data pattern(s) the secure erase puts on the sectors. The 7 & 35 pass secure erases don't do this any better than the one pass one.


    In fact, since most contemporary high density HD's turn off RAW shortly after they are put into service for performance reasons, the longer multi-pass erase procedures may actually be less effective than the quickest possible one if bad block detection in new drives is your goal.


    Regarding validating the actual data stored on a drive, other than what the drive does itself there is no way to do that except by comparing multiple copies of the files byte by byte. The developer tools you can install from your system discs include a utility called FileMerge that has file & directory comparison capabilities, but it isn't well suited to this.
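For anyone who wants to try the byte-by-byte comparison without FileMerge, here is a rough Python sketch using only the standard library. The function name `compare_trees` and the directory layout are assumptions made for illustration. It compares file data only, not metadata such as permissions or dates, and it reports files that exist in the master tree but not in the backup.

```python
import filecmp
import os


def compare_trees(master: str, backup: str):
    """Compare every file under `master` with the same relative path under `backup`.

    Returns (missing, different): sorted lists of relative paths that are
    absent from the backup or whose contents differ byte for byte.
    """
    missing, different = [], []
    for root, _dirs, files in os.walk(master):
        for name in files:
            src = os.path.join(root, name)
            rel = os.path.relpath(src, master)
            dst = os.path.join(backup, rel)
            if not os.path.isfile(dst):
                missing.append(rel)
            # shallow=False forces a comparison of the actual file contents
            elif not filecmp.cmp(src, dst, shallow=False):
                different.append(rel)
    return sorted(missing), sorted(different)
```

Files that change while the comparison runs will naturally show up as different, so a result like this says something about the copies only at the moment they were read.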


    BTW, it is not directly related to this issue, but if you are interested in the details of security erases, you might want to check out this seminal paper by Peter Gutmann, the originator of the 35 pass erase.

  • C. D. Tavares Level 1 (120 points)

    The question as posed seems to be chasing one's tail, since data could go bad at any time, including right after it has been checked.


    But if you're mainly interested in verifying the integrity of backup data, you could consider a backup solution like Retrospect, which allows you to configure a post-backup verification pass that re-compares the backed-up data with the original.  As you would expect, it finds a lot of discrepancies in files that change in real time, and again, it says nothing about the integrity of the data beyond that time, but from your description it may be what you're looking for.

  • Steve Darden Level 2 (155 points)

    Thanks heaps R C-R for taking the time to write such a briefing. Much appreciated. And thanks for the link to the Peter Gutmann article - which I've archived as a useful reference.


    We are using Seagate Barracuda 7200.12 drives, 500GB and 1000GB. I have had 3 failures in Mac Pro service over the past 18 months - two on mirror backup volumes, one on the Time Machine volume. The recent terabyte drive failed after only 2 months of Time Machine service. It seems to have a damaged partition table that TestDisk cannot recover.


    That experience is motivation to look into methods for detecting incipient data loss - if possible. I may have been misled by all the hours of listening to Steve Gibson on Security Now. Certainly Steve and Leo Laporte convey the impression that SpinRite is useful not just for recovery, but for assessing "drive health" and, via the Surface Refresh function, extending the error-free life of a drive.


    For any readers that are motivated, Steve offers a 19pg document describing SpinRite's Technology "What's Under the Hood" available as PDF here: http://www.grc.com/files/technote.pdf.


    Back to bad block remapping. Running the Disk Utility zero-erase may not be totally voodoo. E.g., on the new 1000GB Barracuda I'm about to put into service, here are the six kernel.log errors generated during the 7-pass erase: 


    May 26 21:38:14 MacPro kernel[0]: disk2s2: I/O error.

    May 27 00:28:57 MacPro kernel[0]: disk2s2: I/O error.

    May 27 03:18:59 MacPro kernel[0]: disk2s2: I/O error.

    May 27 06:08:08 MacPro kernel[0]: disk2s2: I/O error.

    May 27 08:57:21 MacPro kernel[0]: disk2s2: I/O error.

    May 27 11:46:13 MacPro kernel[0]: disk2s2: I/O error.


    Is there anything to be learned from such errors?
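One thing that can be read directly from the timestamps is the spacing between the errors. The Python sketch below parses the six lines quoted above and prints the gaps between consecutive errors. The year 2011 is an assumption (syslog lines carry no year), and `error_intervals` is a name made up for this example. The gaps come out nearly equal, which would be consistent with one error at about the same point in each erase pass, though the log alone does not confirm that.

```python
from datetime import datetime

LOG = """\
May 26 21:38:14 MacPro kernel[0]: disk2s2: I/O error.
May 27 00:28:57 MacPro kernel[0]: disk2s2: I/O error.
May 27 03:18:59 MacPro kernel[0]: disk2s2: I/O error.
May 27 06:08:08 MacPro kernel[0]: disk2s2: I/O error.
May 27 08:57:21 MacPro kernel[0]: disk2s2: I/O error.
May 27 11:46:13 MacPro kernel[0]: disk2s2: I/O error.
"""


def error_intervals(log_text, year=2011):
    """Return the time gaps between consecutive 'I/O error' log lines."""
    stamps = []
    for line in log_text.splitlines():
        if "I/O error" in line:
            stamp = " ".join(line.split()[:3])  # e.g. "May 26 21:38:14"
            # The year is assumed; it is only needed to build a full date.
            stamps.append(datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S"))
    return [later - earlier for earlier, later in zip(stamps, stamps[1:])]


for gap in error_intervals(LOG):
    print(gap)
# prints:
# 2:50:43
# 2:50:02
# 2:49:09
# 2:49:13
# 2:48:52
```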


    FWIW, I think my backup strategy is consistent with your recommendations. I was considering adding a Drobo or similar NAS to get RAID5/6 redundancy. But I concluded that a simpler, cheaper and possibly more robust Mac Pro solution is four 1000GB Barracuda drives configured thusly:


    A: Master

    B: Time Machine incremental backup

    C: nightly SuperDuper! bootable mirror #1

    D: nightly SuperDuper! bootable mirror #2


    That provides 4 copies of the data. The C, D mirrors are rotated monthly with the 5th Barracuda offsite drive. When we outgrow the terabyte space I will replace the set with e.g. 3 terabyte drives.


    While I'm not sure it makes any real difference I have decided to avoid partitioning. Now that VMware Fusion works so well we have no compelling reason to partition.


    We continue to run Disk Warrior on all drives except Time Machine. Any thoughts on that?


    Does TechTool cause bad block remapping via the "Surface Scan" function? The utility of remapping on a RAW error is clear, given that the correct data are still in hand. It is less clear what is accomplished on detection of a read error, except where rereading is able to fetch the data with a valid CRC. Is that roughly how the drive remapping logic works?


    As a thank-you I will mark your last as "Correct Answer".

  • Steve Darden Level 2 (155 points)

    Thanks C. D. Perhaps it is tail chasing. See my reply below on SpinRite -- I am 98% sure that Steve Gibson believes that sectors often get progressively "weaker" until the drive reports the data "failed".


    The Gibson/SpinRite view is certainly appealing. And if it is valid, it sure would be nice to be able to run SpinRite on Intel Macs.


    I will speculate that R C-R would object to using a validating backup utility like Retrospect or ChronoSync - the objection being that the drive is already doing a validation on every write.


    What I would like to see is an empirical study on these questions - a study of the quality of the Google and Carnegie Mellon studies, which requires a massive number of drives and duty cycle hours. Here are the recent studies I mentioned on hard drive reliability: Google and Carnegie Mellon.

  • R C-R Level 6 (17,400 points)

    Before I say anything else, please understand that while I am very interested in hard drive technology & its evolution, & because of that have spent considerable time studying many white papers & related info over the years, I am not an expert on the subject. In particular, I don't have any "insider" knowledge of proprietary techniques drive makers don't make public.


    Also, please don't mark a reply as the 'correct answer' unless you think it deserves it on its own technical merits. Saying "thank you" is always appreciated, but words are enough for that.


    That said, I am somewhat skeptical of the claims made for SpinRite's abilities to increase the error-free life of modern high density HD's. The main reason is the "What's Under the Hood" description of how it does that more or less describes what these drives already do internally, just not all at once. For example, refer to the Error Management and Recovery pages from StorageReview.com, in particular the Error Notification and Defect Mapping page. (You also might want to check out the Data Encoding and Decoding pages for an overview of that topic.)


    Anyway, as it describes, modern drives attempt to detect unreliable sectors, move their data to spare sectors reserved for that purpose, & map out the unreliable ones, all without any external help. Typically, in these drives the sectors mapped out this way become permanently inaccessible externally (which can be a security concern if the drive is discarded, but that's another issue).


    To do a better job of this, utilities like SpinRite or TechTool must somehow both detect unreliable sectors better than the drive itself can do & map out those sectors so the drive won't use them. SpinRite claims its "Flux Synthesizer" lays down a data pattern that does the first, but a notable lack of any mention of encoding technologies like PRML makes me wonder if this way of tricking the drive really can be effective for drives using that kind of technology. TechTool does not to my knowledge document how its surface scan works, but I suspect from observing how long it takes that it just lays down one or two data patterns & notes how long it takes to read them back or something like that.


    Neither utility seems to document much about how it maps out whatever sectors it considers weak. I once read something a few years ago that suggests TechTool might do this in the file system by marking the sectors as in use, but I have never verified that for myself. SpinRite seems to work at a DOS/BIOS level, so I have no idea how that would apply to Intel Macs, since they have no BIOS.


    Likewise, I don't know what the 7 pass I/O errors mean, or even if they are anything to be concerned about. The logs tend to be "chatty" & report as errors things that are of no real consequence as well as things that are.


    AFAIK, Disk Warrior is a fine utility, especially for problems caused by directory damage. But that is a file system problem so I don't know how effective it would be for sector-level ones.


    I hope some of this helps. For what it doesn't maybe someone else will reply with their own better contribution. Collectively, we are a pretty knowledgeable group!

  • Steve Darden Level 2 (155 points)

    Again, thanks for your comments. I should clarify that personally I'm not very interested in the data recovery capabilities of tools like SpinRite. Given a proper backup procedure it should be much more efficient to use the backup than to attempt recovery. Of course recovery is unreliable by definition, and we want reliability. If a utility exists that really detects incipient failures, that should contribute to reliability. E.g., I would like to be able to verify that a drive being sent offsite was good when dispatched. And most people who do regular backups have only a single backup volume. If that drive has serious errors they would be happy to know before they need the backup "in anger".


    To do a better job of this, utilities like SpinRite or TechTool must somehow both detect unreliable sectors better than the drive itself can do & map out those sectors so the drive won't use them.

    A couple of tidbits on SpinRite: First, as you noted the s/w uses the BIOS for I/O, which means it will not run on EFI Macs. For my testing and recovery attempts I had to borrow a recent HP tower with SATA support.


    Second, the I/O rates are sufficiently high that a bus-connected DMA interface is required to get job times that are practical. Because of that I have not attempted to use something like a USB external enclosure. I don't have accurate timings from my tests, but a simple surface scan seems to run at about 20 MB/minute. The full data recovery mode is MUCH slower; reports indicate a day or 2 or 3 for relatively small drives like 250GB. The reason apparently is that SpinRite's DynaStat data recovery can spend remarkable elapsed time until it either succeeds in recovering a sector or decides to give up and mark it as failed (I gather it does not give up until it has exhausted all its counters).
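To put scan rates like these into job times, a quick back-of-the-envelope calculation helps. The Python sketch below is plain arithmetic with decimal units (1 GB = 1000 MB); the function name `scan_days` is made up for the example. Taken at face value, 20 MB/minute works out to more than a week for a 250GB drive, which is longer than the 1 to 3 days reported for full recovery, so the rate is probably best treated as a rough recollection, as the post itself cautions.

```python
def scan_days(capacity_gb: float, rate_mb_per_min: float) -> float:
    """Days needed to pass once over `capacity_gb` at `rate_mb_per_min`."""
    minutes = capacity_gb * 1000 / rate_mb_per_min  # decimal GB -> MB
    return minutes / (60 * 24)


for size_gb in (250, 500, 1000):
    print(f"{size_gb} GB at 20 MB/min: {scan_days(size_gb, 20):.1f} days")
# prints:
# 250 GB at 20 MB/min: 8.7 days
# 500 GB at 20 MB/min: 17.4 days
# 1000 GB at 20 MB/min: 34.7 days
```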


    Anecdotal testimonials, including from a couple of consultants I know, are that SpinRite succeeds where no other utility works. Consultants love SpinRite for its ability to recover grandma's only copy of her digital photographs.


    Neither utility seems to document much about how it maps out whatever sectors it considers weak.


    My recall is that SpinRite's method is to progressively dial down the effective read flux until the sector fails. If above some threshold it's good else bad.

    AFAIK, Disk Warrior is a fine utility

    DW has no knowledge of anything beyond directory structure. But directory damage will continue to plague us until we get bug-free software. I've been happy for 15+ years - but how would I know if the new DW-written directory contained subtle flaws?


    Anecdotally, DW finds directory errors that are not reported by Disk Utility. How important are these errors....?

  • R C-R Level 6 (17,400 points)
    My recall is that SpinRite's method is to progressively dial down the effective read flux until the sector fails. If above some threshold it's good else bad.

    That is for detecting a (supposedly) weak sector. How it then gets mapped out (to make the drive more reliable by avoiding the use of that sector) is a different subject. AFAIK, the only practical way to do this with HFS+ volumes is by adding the allocation block the sector is in to the bad block file. Obviously, since this is part of the file system this would not survive a reformat of the drive or erasure & replacement of a partition.
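To make the sector-to-allocation-block relationship concrete, here is a small Python sketch of the arithmetic. It assumes 512-byte sectors, a 4096-byte allocation block size, and sector numbers counted from the start of the volume; all three are illustrative assumptions, and the function names are made up for the example. It shows that sparing one weak sector this way takes the whole allocation block (8 sectors at these sizes) out of use.

```python
def allocation_block_for_sector(sector: int, sector_size: int = 512,
                                alloc_block_size: int = 4096) -> int:
    """Index of the allocation block containing a volume-relative sector."""
    if alloc_block_size % sector_size != 0:
        raise ValueError("allocation block size must be a multiple of sector size")
    return (sector * sector_size) // alloc_block_size


def sectors_in_allocation_block(block: int, sector_size: int = 512,
                                alloc_block_size: int = 4096) -> range:
    """All volume-relative sectors covered by one allocation block."""
    per_block = alloc_block_size // sector_size
    return range(block * per_block, (block + 1) * per_block)


block = allocation_block_for_sector(1003)
print(block, list(sectors_in_allocation_block(block)))
# prints: 125 [1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007]
```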

    But directory damage will continue to plague us until we get bug-free software.

    Directory damage can also be caused by abrupt loss of power to the drive, or a bad data cable or other hardware-related issues. I don't use DW these days but I periodically run Disk Utility's disk checks on all my drives since directory problems tend to get worse as more & more file segments are overwritten.

    Anecdotally, DW finds directory errors that are not reported by Disk Utility. How important are these errors....?

    It isn't so much directory errors as less than optimal directory B-tree structures in HFS or HFS+ volumes, especially noncontiguous (fragmented) allocation, catalog, extents overflow, or attribute files. (See the "Broad Structure" section of Technical Note TN1150 for more about HFS+ structures.)


    Alsoft claims that an optimized directory is both faster & more reliable than an un-optimized one. That was probably quite true in the OS 9 days (slower drives & CPU's, no journalling, etc.) but I suspect it is not that important now. Still, if you are looking for the most reliable HD storage, it might be worth periodically running DW.

  • R C-R Level 6 (17,400 points)

    Sorry, one other (maybe obvious) comment relevant to SpinRite's data recovery abilities I forgot to add to my previous post:


    Any utility that attempts data recovery by repeatedly reading a sector should be used with care. If the sector has become so unreliable that the drive can't recover the data without help, chances are that section of the platter has been damaged, for instance because some of its magnetic particles have flaked off due to age, thermal fatigue, mechanical shock, or a manufacturing defect.


    At the microscopic flying height of a drive's head, even a tiny speck of dust is like a boulder in the road, & wherever it gets lodged between head & platter it is likely to gouge the surface, causing more particles to be dislodged in what becomes a cascading, runaway failure mode all over the drive.


    So it is always a good idea to get what data you can off the drive with the 'gentlest' possible methods before trying to recover anything else with something more stressful.


    For the same reason, it probably isn't worth trying to return the drive to service after using more intensive SpinRite-type recovery techniques.

  • The hatter Level 9 (60,930 points)

    Actually I would not rely on the zero-all erase from Disk Utility, nor does Micromat. When I have had bad sectors, zeroing with DU did not fix them. The 7-pass erase is more reliable and is what Micromat recommended, but it is still not perfect.


    You are better off with the manufacturer's own disk utility to do long format / zero / extended test and restore a drive.

  • R C-R Level 6 (17,400 points)

    There really isn't any such thing as a "long format" (if by that you mean a low level format) for modern drives using embedded servo technology.


    The drive's heads are not capable of writing the servo tracks that supply the servo signals to position the head in the track with the required precision. For this reason, the drive's hardware will not allow writing in this area. If it did, the drive would be rendered useless.