
SMART Utility: What does "LBA of First Error" really mean?

I have a 300-GB (300069052416-byte, 586072368-sector) WD IDE drive. Although no other software indicates a SMART failure, *SMART Utility* always tells me that it's failing.

It reports that there are 25 total errors, but the last errors reported were 675 +Power-On Hours+ ago. It reports 23 Pending, 5 Removed, and 7 Reallocated Bad Sectors.

The drive now has 17819 +Power On Hours+. Back when that figure was in the range of 10515 to 13931, a total of five tests gave *LBA of First Error* as 506133194. From Power On Hours 14231 to now, 12 of 13 tests have given +LBA of First Error+ values between 548453258 and 548453380. The lowest +LBA of First Error+ ever reported was 456051244, at 17094 hours.

Do these consistently high numbers for +LBA of First Error+ mean that there is only a problem with the drive surface in a limited area, rather than with the mechanism? If that is the case, I can regard just the third of the three volumes on the drive as problematic, copy its contents elsewhere, and perhaps create a new, smaller third volume. Of course, I will not be using this drive, and especially the third volume, for anything irreplaceable.
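
To get a feel for where those LBAs sit, here is a quick back-of-the-envelope calculation in Python (the byte and sector counts above work out to 512-byte sectors; exactly which partition a given LBA falls in depends on the MBR partition map, which I have not listed):

# Locate the reported "LBA of First Error" values on this drive.
# Figures are the ones quoted above: 586072368 sectors of 512 bytes each.
SECTOR_SIZE = 512
TOTAL_SECTORS = 586072368

for lba in (456051244, 506133194, 548453258, 548453380):
    offset_gb = lba * SECTOR_SIZE / 1e9
    percent = 100.0 * lba / TOTAL_SECTORS
    print("LBA %d: about %.1f GB into the drive (%.1f%% of capacity)" % (lba, offset_gb, percent))

Every reported error LBA lands roughly 78% to 94% of the way into the drive, i.e. around 233 GB to 281 GB from the start.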

I should mention that I bought this drive over five years ago, and it's possible that, being naive in such matters, I did a low-level format! Also, in case it matters, the drive is partitioned with MBR, though all the volumes are HFS+. (I meant to make the drive bootable, but I was a bit ignorant at the time.)

Dual 1.42 GHz G4 MDD, Mac OS X (10.5.8), I have or use several other Macs, mostly MDD's.

Posted on Mar 19, 2011 12:07 AM


Mar 19, 2011 10:59 AM in response to aarons510

I think it means the Disk is running out of hidden replacement Sectors for the bad ones.

Modern storage devices tend to handle the simple cases automatically, for example by writing a disk sector that was read with difficulty to another area on the media. Even though such a remapping can be done by a disk drive transparently, there is still a lingering worry about media deterioration and the disk running out of spare sectors to remap.


http://smartmontools.sourceforge.net/badblockhowto.html
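
If you have the smartmontools package from that page installed, a quick sketch like this will pull out the raw counters that SMART Utility is summarizing (the /dev/disk1 path is only a placeholder for whatever identifier your WD drive has):

# Print the spare-sector bookkeeping attributes using smartctl (smartmontools).
# "/dev/disk1" is a placeholder; substitute the failing drive's device node.
import subprocess

# Popen rather than check_output, because smartctl can exit nonzero on a drive
# that has logged errors.
proc = subprocess.Popen(["smartctl", "-A", "/dev/disk1"],   # -A prints the SMART attribute table
                        stdout=subprocess.PIPE, universal_newlines=True)
output, _ = proc.communicate()

for line in output.splitlines():
    # Reallocated = already remapped to spares; Pending = read failed, awaiting a rewrite
    if any(name in line for name in ("Reallocated_Sector_Ct",
                                     "Current_Pending_Sector",
                                     "Offline_Uncorrectable")):
        print(line)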

Do these consistently high numbers for LBA of First Error mean that there is only a problem with the drive surface in a limited area, rather than with the mechanism?


Most likely the Surface, but not 100%.

More info on SMART...

http://www.noah.org/wiki/Disk_errors

Mar 20, 2011 7:32 AM in response to aarons510

Do these consistently high numbers for LBA of First Error mean that there is only a problem with the drive surface in a limited area, rather than with the mechanism?


Yes. That is one of the most common modes of failure. If the mechanism had a failure, the drive would probably stop working entirely (and possibly quite suddenly).

Don't celebrate yet. Those sorts of problems are sometimes caused by magnetic regions that simply "go bad" on their own, but they are more often caused by hard contact between the flying heads and the surface of the rapidly spinning platters. The contact can be severe enough to spray magnetic debris from the platters around the inside of the quasi-sealed portion of the drive, turning what was a "near clean room" environment into an area strewn with magnetic dust.

Data is stored using a semi-redundant code (from a class of codes called Hamming codes) that allows the drive controller to correct small bursts of errors within a block. If a block cannot be corrected, it is retried, and retried, and retried (dozens to hundreds of times) before the controller gives up and adds it to the possible-bad-blocks list. Because the correct data for that block are unknown (the block cannot be read and corrected), the drive cannot do anything more about it until you supply new data to be written to the block.

When you supply new data for a block on the possible-bad-blocks list, the controller holds onto the data, writes it to the block, and checks for a read with no correction needed. If the new data "sticks", the block is not permanently bad. If the new data cannot be read back without correction (or at all) immediately after being written, the block is declared permanently bad and the controller writes the retained data to a spare block and makes a permanent substitution.
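
In rough pseudocode, that bookkeeping looks something like this (a toy model of the idea, not the drive's actual firmware):

# Toy model of pending vs. reallocated sectors. A block whose read cannot be
# corrected goes on the pending list; its fate is decided only when the host
# writes new data to it.
pending = set()        # blocks that failed a read (SMART's "pending" count)
reallocated = {}       # logical block -> spare block substituted for it

def read_block(lba, ecc_recovered):
    # ecc_recovered stands in for "the data was readable after retries and correction"
    if not ecc_recovered:
        pending.add(lba)           # nothing more can be done without fresh data
    return ecc_recovered

def write_block(lba, readback_clean):
    # readback_clean stands in for "the block read back with no correction needed"
    if lba in pending:
        pending.discard(lba)
        if not readback_clean:
            # New data will not stick: declare the block bad and map in a spare.
            reallocated[lba] = "spare-%d" % len(reallocated)

# One block fails its read, then fails verification after the rewrite:
read_block(548453258, ecc_recovered=False)
write_block(548453258, readback_clean=False)
print(pending, reallocated)        # -> set() and {548453258: 'spare-0'}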

Mar 20, 2011 7:34 AM in response to Grant Bennet-Alder

Most drives have only a handful of spared-out blocks assigned at the factory to bypass minuscule manufacturing defects. When operating properly, they do not develop bad blocks (or even suspected bad blocks) in the field. Compared to what we expect from a drive, this drive is in terrible shape.

Google did a large study of the consumer-quality drives in its always-on server farms. Over that enormous sample, they determined that a drive that developed ANY bad blocks would be swapped out due to failure in six months or less.

You do not have a huge sample, you have one drive. It may last for years at light duty, or it may fail this afternoon. But the area with suspected bad blocks is troubling.

One way to proceed is, as you suggested, to create a smaller Volume that encompasses only the known-good area, and never send the head out beyond that. Then use the drive for non-essential data or backup duty.
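
As a rough illustration of how far in that smaller Volume could safely extend, based on the lowest suspect LBA reported above (this assumes the trouble really is confined to those high block numbers):

# Size the usable space to end comfortably below the lowest suspect LBA.
# The 10 GB margin is an arbitrary cushion, not a rule.
SECTOR_SIZE = 512
lowest_suspect_lba = 456051244        # lowest "LBA of First Error" in the test history
margin_bytes = 10 * 10**9

usable_bytes = lowest_suspect_lba * SECTOR_SIZE - margin_bytes
print("Keep the partitions within the first %.1f GB" % (usable_bytes / 1e9))
# prints: Keep the partitions within the first 223.5 GB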

Another way to proceed would be to re-write the entire surface with good data (Initialize with the "Zero all data" option). This would substitute up to 10 spares (a limit in Drive Setup) for blocks found to be permanently bad, then give up with "Initialization failed!" The process can be run again, and it will substitute more spares each time until the drive has no more spares. But starting with 25 suspects, this is not likely to yield a working drive.
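
For the curious, the same idea done by hand looks roughly like this: a destructive sketch that zero-fills the raw device a chunk at a time and notes where writes are refused. The /dev/rdisk1 path is a placeholder; this erases the whole drive and needs the volumes unmounted and root privileges.

# DESTRUCTIVE: writes zeros over the entire drive, like the "Zero all data"
# initialize, giving the drive a chance to substitute spares for bad blocks.
# "/dev/rdisk1" is a placeholder -- double-check the device before running.
import os

DEVICE = "/dev/rdisk1"
TOTAL_BYTES = 300069052416          # drive size quoted in the original post
CHUNK = 1024 * 1024                 # write 1 MiB of zeros at a time

failed_offsets = []
fd = os.open(DEVICE, os.O_WRONLY)
try:
    offset = 0
    while offset < TOTAL_BYTES:
        size = min(CHUNK, TOTAL_BYTES - offset)
        try:
            os.lseek(fd, offset, os.SEEK_SET)
            os.write(fd, b"\x00" * size)
        except OSError:
            failed_offsets.append(offset)   # the drive refused this chunk
        offset += size
finally:
    os.close(fd)

print("Chunks that could not be written, by byte offset:", failed_offsets)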

