spare blocks, bad blocks (BB) and console messages

Question

I have a few questions regarding spare blocks.

Please read:
Topic : How worried should I be? "The spare blocks... appear to be exhausted" ;
http://discussions.apple.com/thread.jspa?messageID=11906767#11906767

In that thread, the OP's DWarriors console message reported a total of 36 available spare blocks and 232 use attempts. Apparently, the OP was able to use the drive long enough to recover his data. We concluded that that particular drive was functioning on borrowed time, and that the OP could just as easily have lost his data.

*Q1: how is the drive functioning around those bad blocks (BB) and the smaller number of spares?* Copying and pasting to accommodate the bad blocks?

- - -

My console messages report on the 4 internal hard drives, yet I have many more eSATA drives:

*Q2: does the DW daemon only scan internal drives?*

My console message, like the OP's Seagate, reports my Seagate with: spare blocks (Total Available: 36). My three other drives are Hitachis, and for those: spare blocks (Total Available: 5)

*Q3: am I correct to assume that the allocation of spare blocks is 'company procedure' as per the brand (and subsequently, model); 36 for Seagate and 5 for Hitachi, for example?*

*Q4: are the 5 in the Hitachis statistically enough?*

- - -

*Q5: are BBs contagious, or are hard drives born with the BBs they've got and that's it?*

- - -

The hatter, you've mentioned Google's extensive statistical study of their HD population. Would you have a link to that study?

*Q6: are there any numbers on the statistical occurrence of BBs?*

For example,
"when BBs do occur, here is the statistical bell curve on how many there usually are."
"25% chance of between 1 and 5 BB. 32% of between 6 and 10 . . . etc."

I'm wondering whether, if we knew these statistics, the number of spare blocks per model would join the list of characteristics by which we choose one HD model over another.

MP2007, Mac OS X (10.5.8)

Posted on Jul 15, 2010 12:41 PM

Answer 1

Jul 15, 2010 3:35 PM in response to l_elephant

Ever since we have had magnetic media drives, there has been this vexing problem: How do you manufacture an absolutely perfect magnetic surface, with a gazillion magnetic bits on it, on which to record the data?

The solution: Include some "spare" blocks, that the disk controller can substitute for Bad Blocks, so that you don't have to trash a near-perfect platter because one speck of dust got in it. It did not take long to realize that if you could also spare out Bad Blocks that developed in the field, your disks could last longer.

Disk Drives have a warranty. The manufacturer MUST take back a Disk Drive that has Bad Blocks and no more spares on board; it is clearly defective. So manufacturers do large studies to determine how many spares is enough. Too few spares means they will have to eat a lot of bad drives. It is not a clever idea to choose the drives with the largest or the smallest number of spares. (Perhaps the manufacturer knows that lots of spares will never be needed.) The best way to choose drives is to choose ones with long warranties (in hopes that a long warranty means a reliable drive) and hope that you are not the poor slob who has to file for a replacement.

IDE/ATA and SATA drives keep a list of blocks that have needed corrections in use. They essentially "put them on probation". When the user supplies new data for one of those blocks, it can be thoroughly tested. If a block on probation cannot hold the new data, that data will be copied into a new "spare" block, and the Bad Blocks list in the drive will be semi-permanently adjusted to reflect the sparing.
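The probation-then-spare behaviour described above can be sketched as a toy model. This is only an illustration under stated assumptions, not how any real firmware is written: the class name `Drive`, the `marginal` and `failing` sets, and the returned strings are all hypothetical.

```python
class Drive:
    """Toy model of a controller's probation list and spare pool (hypothetical)."""

    def __init__(self, spares, marginal=(), failing=()):
        self.spares_left = spares        # unused spare blocks
        self.marginal = set(marginal)    # assumed: needed correction once, but can hold new data
        self.failing = set(failing)      # assumed: cannot hold new data at all
        self.probation = set()           # blocks that needed correction in use
        self.remapped = {}               # bad block number -> spare index

    def read(self, block):
        # A read that needed correction puts the block on probation.
        troubled = block in self.marginal or block in self.failing
        if troubled and block not in self.remapped:
            self.probation.add(block)

    def write(self, block):
        # New data from the user lets the controller test a block on probation.
        if block not in self.probation:
            return "ok"
        if block in self.failing:
            if self.spares_left == 0:
                return "no spares left"   # block stays on probation
            self.spares_left -= 1
            self.remapped[block] = len(self.remapped)
            self.probation.discard(block)
            return "spared"
        # The block held the new data: take it off probation.
        self.marginal.discard(block)
        self.probation.discard(block)
        return "cured"


drive = Drive(spares=1, marginal=[7], failing=[3, 9])
for b in (3, 7, 9):
    drive.read(b)
print(drive.write(7), drive.write(3), drive.write(9))
```

With only one spare in this example, the second failing block cannot be spared, which is the situation a warning about exhausted spares is pointing at.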

Answer 2

Jul 16, 2010 12:31 PM in response to Grant Bennet-Alder

Q1: how is the drive functioning around those bad blocks (BB) and the smaller number of spares? Copying and pasting to accommodate the bad blocks?

The drive stores each block of data semi-redundantly, using extra bits in an error-correcting code. The classic example of such a code is the Hamming Code (after the mathematician who invented it); drives typically use stronger codes built on the same idea. Most disk errors are of no consequence whatsoever, because a re-read yields enough good bits that the data can be corrected by the drive error-correction logic. The code is good at accommodating and correcting for single errors and small bursts of consecutive errors. It also gives an indication when the data read have too many errors to repair.
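To show how a few extra bits can locate and repair a flipped bit, here is a minimal sketch of the textbook Hamming(7,4) code. It is an illustration of the principle only; the function names are made up here, and real drive firmware uses its own, much stronger codes over whole sectors.

```python
def encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] as a 7-bit Hamming(7,4) codeword."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4          # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4          # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4          # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def decode(codeword):
    """Return (data bits, syndrome); a nonzero syndrome is the position that was fixed."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3    # 1-based position of a single flipped bit
    if syndrome:
        c[syndrome - 1] ^= 1           # flip it back
    return [c[2], c[4], c[5], c[6]], syndrome


word = encode([1, 0, 1, 1])
word[4] ^= 1                           # simulate one bit read back wrong
print(decode(word))
```

Note that this small code on its own repairs one bad bit per codeword and will miscorrect two; recognising "too many errors to repair" takes additional check bits.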

When the data for a particular block are messed up and uncorrectable, one or multiple re-reads may ensue. If it goes on for a long time, this can cause your Mac to hang, or if bad enough, this can cause an I/O Error. The disk controller will likely place that block "on probation". How many blocks are on probation can be read as part of the SMART status, a group of dozens to hundreds of statistics maintained by the drive firmware.
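As a rough illustration of reading such counters, the sketch below parses an attribute table in the layout printed by the third-party smartmontools command `smartctl -A`. The sample text and its numbers are invented for illustration and do not come from any drive in this thread; which attributes a drive reports, and under what names, varies by model.

```python
# Invented sample in the column layout that `smartctl -A` prints.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""


def parse_attributes(text):
    """Map attribute name -> raw value for rows whose raw value is a plain integer."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Data rows start with a numeric ID and have at least 10 columns.
        if len(fields) >= 10 and fields[0].isdigit() and fields[9].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs


attrs = parse_attributes(SAMPLE)
print("on probation (pending):", attrs["Current_Pending_Sector"])
print("already spared out:", attrs["Reallocated_Sector_Ct"])
```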

The list of 232 {potential} Bad Blocks means that is how many are giving trouble. (It is probably the number on the Probation list.) 36 Spares available MAY be enough, but you would have to re-write each one and see which were "cured" by re-writing and which were permanent errors.

A drive should always appear to be in "near-perfect" condition to hold your precious data. Anything less may be detrimental to the health of your data, and that is not OK.

Answer 3

The hatter

Jul 16, 2010 3:34 PM in response to l_elephant

Google.

Not just search, of which there is a wealth of info, but also to read the pdf results of Google's study. MIT also. Heck, any guesses as to how many 1000s of drives and (petabytes??) they use throughout all their locations?

Reminded me of how to plan for when to replace fluorescent bulbs before they fail.

And SMART data was not helpful or predictive, but use of spares was.

Like other things, if a drive doesn't fail in the first 6 months, maybe old age will get it.

Some drives, like a 250GB, were the same as the 320GB model of the same line, only with a higher number of failed sectors.

While it would be nice to automatically remap a weak sector's data to a new spare, with the old sector placed on the inactive list... in the real world sometimes you just have to do a 7-way write erase or something else.
In fact, I recommend, if you really want to be on the safe side, just break in any and most all new drives for a couple of days: give each a couple of zero-all or 7-way passes, then fill the whole drive to over 90% with a lot of large files and folders full of tens of thousands of files. Torture it to death.

Preventative maintenance is better than waiting for a problem.

SoftRAID 4 captures and reports I/O errors, so you know when writing or reading from a block takes more than x attempts.

Answer 4

Jul 16, 2010 5:40 PM in response to Grant Bennet-Alder

Q5: are BBs contagious, or are hard drives born with the BBs they've got and that's it?

"Factory" defects are often caused by surface imperfections that are a fact of life of the manufacturing process. They generally do not spread. A drive that has factory defects that are all spared out is a "perfect" drive. It is considered top quality, and as long as all are spared out, the "factory" Bad Blocks are not an issue.

"User" defects that occur after the drive has left the factory may be caused by a number of different mechanisms. These different mechanisms would produce different answers to the "Will it spread?" question.

Head Crashes: A Hard Drive head floats on a cushion of air induced by spinning the very smooth surface of the disk platter(s) at fairly high speeds. 7200 RPM is 120 spins a second. The head-to-disk clearance makes a human hair look huge, and a fingerprint, dust speck, or smoke granule enormous. If impurities find their way inside, or the drive is dropped while spinning, the head can fly up, then come crashing down onto the surface of the magnetic platter. At its worst, this can dig a divot in the magnetic media and spray magnetic dust inside the drive. That will cause one or more Bad Blocks, and in that case, the Bad Blocks will probably creep into adjacent areas over time.

"Just Because": The magnetic regions are very small, and in today's drives, are arranged "standing up" to pack more in. One of the quirks of magnetic material is that over time, marginal material may sometimes squirm a bit and what would magnetize just fine last week is just under the detection threshold today. These kinds of changes do not tend to spread.

Sometimes defects are limited to a single Block, sometimes to a group of Blocks on the same Track, and sometimes they can cross several Tracks and hit what seem to be random blocks (that are not consecutive, but are physically near each other track-to-track).
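The spin-rate figure in the Head Crashes paragraph above can be checked with a couple of lines of arithmetic. The function name is made up for this sketch, and the "average wait" is simply the usual half-a-revolution estimate.

```python
def rotation_stats(rpm):
    """Return (revolutions per second, ms per revolution, average rotational wait in ms)."""
    revs_per_second = rpm / 60          # minutes -> seconds
    ms_per_rev = 1000 / revs_per_second
    avg_wait_ms = ms_per_rev / 2        # on average the wanted block is half a turn away
    return revs_per_second, ms_per_rev, avg_wait_ms


revs, ms_per_rev, wait = rotation_stats(7200)
print(f"{revs:.0f} rev/s, {ms_per_rev:.2f} ms per revolution, {wait:.2f} ms average wait")
```

Every re-read of a troublesome block costs at least one more full revolution, which is part of why a drive that keeps retrying can appear to hang.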

Answer 5

Jul 17, 2010 9:44 AM in response to The hatter

This is the paper itself. There is plenty of commentary available as well, by searching for terms like:

google drive study

using any search engine.

Failure Trends in a Large Disk Drive Population

Answer 6

Jul 17, 2010 10:19 AM in response to Grant Bennet-Alder

"Our results confirm the findings of previous smaller population studies that suggest that some of the SMART parameters are well-correlated with higher failure probabilities. We find, for example, that after their first scan error [errors detected by the drive reading in the background] drives are 39 times more likely to fail within 60 days than drives with no such errors. First errors in [in-use] reallocations, offline [as a result of running a "scrubbing" pass offline] reallocations, and probational counts are also strongly correlated to higher failure probabilities."

"Despite those strong correlations, we find that failure prediction models based on SMART parameters alone are likely to be severely limited in their prediction accuracy, given that a large fraction of our failed drives have shown no SMART error signals whatsoever. This result suggests that SMART models are more useful in predicting trends for large aggregate populations than for individual components. It also suggests that powerful predictive models need to make use of signals beyond those provided by SMART."

Answer 7

okevin

Oct 12, 2010 8:21 AM in response to l_elephant

I have seen much discussion of whether Disk Utility will eliminate bad blocks.
Here is my experience:
Ran disk utility and wrote zeros once on my drive with bad blocks. (External Drive)
This appeared to trigger the drive to mask out these blocks.
(I am told the drive has to mask out its own bad blocks and Disk Utility only triggers this by writing zeros.)
Further scan showed no bad blocks.
Some have said you need to write zeros 7 times to get the blocks, but once seemed to succeed here.

I can also tell you that just formatting and initializing the drive with Disk Utility did NOT result in the masking out of the bad blocks.

Answer 8

Oct 12, 2010 8:56 AM in response to okevin

just formatting and initializing the drive with Disk Utility...

To save you time, Disk Utility's Initialize only re-writes the area of the Disk that will hold the Directory (the first thousand blocks or so). That is why it can finish in a minute instead of several hours.

Ran disk utility and wrote zeros once on my drive with bad blocks. (External Drive)
This appeared to trigger the drive to mask out these blocks.

Once new data are available, the drive can re-write bad blocks. The drive controller makes a semi-permanent substitution of a spare block (that the controller has on hand for just this purpose). In the future, when you read or write data from a block that has been spared, the controller will see that block is on the Bad Block List, look up the substitute block, move the head to the substitute block, and provide or write the data to/from the substitute block. The controller does the whole process for you.
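The lookup-and-redirect step described above can be sketched as follows. This is a simplified illustration, not actual controller firmware; the `Controller` class, its method names, and the in-memory lists standing in for the platter are all assumptions.

```python
class Controller:
    """Toy controller that redirects spared-out blocks to substitutes (hypothetical)."""

    def __init__(self, n_blocks, n_spares):
        self.blocks = [b""] * n_blocks   # ordinary blocks the computer can address
        self.spares = [b""] * n_spares   # reserved area the computer never sees
        self.bad_block_list = {}         # bad block number -> index of its substitute

    def spare_out(self, block):
        """Semi-permanently substitute the next free spare for a bad block."""
        if len(self.bad_block_list) >= len(self.spares):
            raise RuntimeError("spare blocks exhausted")
        self.bad_block_list[block] = len(self.bad_block_list)

    def write(self, block, data):
        if block in self.bad_block_list:             # on the Bad Block List?
            self.spares[self.bad_block_list[block]] = data
        else:
            self.blocks[block] = data

    def read(self, block):
        if block in self.bad_block_list:
            return self.spares[self.bad_block_list[block]]
        return self.blocks[block]


ctrl = Controller(n_blocks=8, n_spares=2)
ctrl.spare_out(3)
ctrl.write(3, b"new data")
print(ctrl.read(3))
```

The computer keeps asking for block 3 as before; only the controller knows the data now live in a spare.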

Some have said you need to write zeros 7 times to get the blocks [fixed].

A bad block is not actually fixed in place. If it cannot hold a new pattern of Zeroes, it is abandoned and a spare block is substituted. Writing more times may improve YOUR confidence, but the substitution will happen on the first re-write with Zeroes. More passes just take longer.

Answer 9

The hatter

Oct 12, 2010 9:50 AM in response to okevin

The reason is that to this day, the vendor's own utilities are better than Apple's Disk Utility for remapping and recertifying a drive's sectors.

You might be successful with zero all, you may not, in which case further measures are needed. That was the advice given by MicroMat Tech Support 3 (I miss when they were on MacFixit and we had such educational discussions all the time).

A quick format only tests the areas that the partition tables use. Those areas are hidden and can't be changed, and a bad block there can only be mapped out during a full format rather than a quick one. A bad block there can 'bring down' a hard drive, which is why, I believe, the first and last 1000 sectors are tested during even a quick format or initialize process.

But if you want to be sure, before you initialize, use the vendor's own utility to do a full scan and zero of the drive, that does work.
