What are hard disk ECCs?

The basis of all error detection and correction in hard disks is the inclusion of redundant information, together with special hardware or software to use it. Each sector on the hard disk contains 512 bytes, or 4,096 bits, of user data. Beyond these, extra bits are added to each sector to implement an error correcting code, or ECC (sometimes also called error correction code or error correcting circuits). These bits do not contain data; rather, they contain information about the data that can be used to correct any problems encountered when trying to access the real data bits.

Reed-Solomon codes are widely used for error detection and correction
Several different types of error correcting codes have been invented over the years, but the type commonly used on PCs is the Reed-Solomon algorithm, named for researchers Irving Reed and Gustave Solomon, who first described the general technique that the algorithm employs. Reed-Solomon codes are widely used for error detection and correction in various computing and communications media, including magnetic storage, optical storage, high-speed modems, and data transmission channels. They have been chosen because they are easier to decode than most other similar codes, can detect (and correct) large numbers of missing bits of data, and require the fewest extra ECC bits for a given number of data bits.

How does the ECC work?
When a sector is written to the hard disk, the appropriate ECC values are generated and stored in the bits reserved for them. When the sector is read back, the user data, combined with the ECC bits, tells the controller whether any errors occurred during the read. Errors that can be corrected using the redundant information are corrected before the data is passed to the rest of the system. The system can also tell when there is too much damage to the data to correct, and will issue an error notification in that event. The sophisticated firmware present in all modern drives uses ECC as part of its overall error management protocols. This is all done “on the fly” with no intervention from the user required, and generally with no noticeable slowdown in performance even when errors are encountered and must be corrected.
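The write-then-read cycle described above can be sketched with a toy code. This is not the ECC any real drive uses: it is a deliberately small illustration that borrows the finite-field arithmetic Reed-Solomon codes are built on, stores two check bytes per block, and can repair one damaged byte. All names (`write_block`, `read_block`, `UncorrectableError`) and the 255-byte block limit are assumptions made for this example.

```python
# Toy single-byte-correcting code over GF(2^8). Illustrative only.

class UncorrectableError(Exception):
    """Raised when the damage does not look like a single-byte error."""

# Log/antilog tables for GF(2^8), primitive polynomial 0x11d.
GF_EXP = [0] * 510
GF_LOG = [0] * 256
_x = 1
for _i in range(255):
    GF_EXP[_i] = _x
    GF_EXP[_i + 255] = _x          # duplicated so exponent sums need no modulo
    GF_LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11d


def gf_mul(a, b):
    """Multiply two field elements using the log tables."""
    if a == 0 or b == 0:
        return 0
    return GF_EXP[GF_LOG[a] + GF_LOG[b]]


def check_bytes(data):
    """Return (p, q): a plain XOR sum and a position-weighted XOR sum."""
    p = q = 0
    for i, d in enumerate(data):
        p ^= d
        q ^= gf_mul(GF_EXP[i], d)
    return p, q


def write_block(data):
    """'Write': append the two check bytes to the user data."""
    if len(data) > 255:
        raise ValueError("this toy code covers at most 255 data bytes")
    return bytes(data) + bytes(check_bytes(data))


def read_block(stored):
    """'Read': verify, repair one bad byte if possible, else raise."""
    data = bytearray(stored[:-2])
    p, q = check_bytes(data)
    s0 = p ^ stored[-2]            # error value, if one data byte is bad
    s1 = q ^ stored[-1]            # error value scaled by its position
    if s0 == 0 and s1 == 0:
        return bytes(data), "ok"
    if s0 == 0 or s1 == 0:
        # Only one syndrome is nonzero: assuming a single error,
        # a check byte was hit and the user data is intact.
        return bytes(data), "corrected"
    pos = (GF_LOG[s1] - GF_LOG[s0]) % 255
    if pos >= len(data):
        raise UncorrectableError("error location falls outside the block")
    data[pos] ^= s0                # undo the error at the located byte
    return bytes(data), "corrected"


if __name__ == "__main__":
    stored = bytearray(write_block(b"hard disk sector payload"))
    stored[5] ^= 0x5A              # simulate one damaged byte on the medium
    print(read_block(bytes(stored)))
```

Two check bytes buy exactly one correctable byte here. Damage to two or more bytes may be reported as uncorrectable or may be silently miscorrected, which is one reason real designs use many more check symbols and pair the ECC with a separate checksum.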

The capability of a Reed-Solomon ECC implementation is based on the number of additional ECC bits it includes. The more bits that are included for a given amount of data, the more errors can be tolerated. There are multiple tradeoffs involved in deciding how many bits of ECC information to use. Including more bits per sector of data allows for more robust error detection and correction, but means fewer sectors can be put on each track, since more of the linear distance of the track is used up with non-data bits. On the other hand, if you make the system more capable of detecting and correcting errors, you make it possible to increase areal density or make other performance improvements, which could pay back the “investment” of extra ECC bits, and then some. Another complicating factor is that the more ECC bits included, the more processing power the controller must possess to process the Reed-Solomon algorithm. The engineers who design hard disks take these various factors into account in deciding how many ECC bits to include for each sector.
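The tradeoff can be made concrete with some rough arithmetic. A standard property of Reed-Solomon codes is that 2t check symbols allow up to t symbol errors at unknown positions to be corrected. The sketch below assumes byte-sized symbols, ignores interleaving and every other per-sector overhead (gaps, sync fields, and so on), and uses a made-up track capacity of 100,000 bytes. The function name and all the figures are hypothetical, chosen only to show the shape of the tradeoff.

```python
def ecc_tradeoff(data_bytes, ecc_bytes, track_bytes):
    """Rough cost/benefit of adding ECC bytes to each sector.

    Assumes byte-sized Reed-Solomon symbols, so that 2t check bytes
    correct up to t byte errors. track_bytes is a hypothetical raw
    track capacity; real tracks carry other overhead as well.
    """
    correctable = ecc_bytes // 2                 # t = (check symbols) / 2
    sector_bytes = data_bytes + ecc_bytes
    overhead = ecc_bytes / sector_bytes          # share of the sector that is ECC
    sectors_per_track = track_bytes // sector_bytes
    return correctable, overhead, sectors_per_track


if __name__ == "__main__":
    for ecc in (16, 24, 40, 64):
        t, overhead, sectors = ecc_tradeoff(512, ecc, 100_000)
        print(f"{ecc:2d} ECC bytes: corrects up to {t:2d} bytes, "
              f"{overhead:.1%} overhead, {sectors} sectors per track")
```

Under these assumptions, each step up in ECC bytes raises the number of correctable bytes while lowering the number of sectors that fit on the track, which is the balance the designers have to strike.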

More useful information about hard disk ECC

All modern hard disk drives are ATA (Advanced Technology Attachment) compliant. Part of this compliance means that drives must be able to detect errors while reading data from individual sectors on the drive. This prevents corrupted data from being propagated through to the operating system, which could lead to system crashes.

To accomplish this, every sector has a built-in checksum and error correction code that are written at the same time as the data. Upon reading the sector, the drive recalculates the checksum and compares it to the one previously written. If they do not match, the drive uses the error correction code to attempt to correct the data. Every sector has a standard 512 bytes of user data. A typical ECC is capable of correcting between 10 and 12 bytes. If repairing the corruption is beyond the capability of the ECC, the drive does not return the data to the operating system and instead reports an error, typically a UNC (uncorrectable) error.
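The read path just described (recompute the checksum, compare, attempt a repair, otherwise report UNC) can be sketched as follows. The checksum here is CRC-32 from Python's `zlib` module. The repair step is only a stand-in: it searches for a single flipped bit that makes the checksum match again, which is far weaker than a Reed-Solomon ECC that corrects 10 to 12 bytes. `write_sector`, `read_sector`, and `UncorrectableError` are hypothetical names, and no real drive firmware is claimed to work this way.

```python
import zlib

SECTOR_SIZE = 512


class UncorrectableError(Exception):
    """Stands in for the UNC status a drive reports to the host."""


def write_sector(data):
    """Return the data plus the checksum stored alongside it."""
    if len(data) != SECTOR_SIZE:
        raise ValueError("a sector holds exactly 512 bytes of user data")
    return bytes(data), zlib.crc32(data)


def read_sector(raw, stored_crc):
    """Verify the checksum, try a repair, or raise UncorrectableError."""
    if zlib.crc32(raw) == stored_crc:
        return bytes(raw)                      # clean read, nothing to fix

    # Stand-in for the ECC: flip each bit in turn and re-check the CRC.
    buf = bytearray(raw)
    for bit in range(len(buf) * 8):
        mask = 1 << (bit % 8)
        buf[bit // 8] ^= mask
        if zlib.crc32(buf) == stored_crc:
            return bytes(buf)                  # repaired before returning
        buf[bit // 8] ^= mask                  # undo and keep searching

    # Beyond what this toy repair can handle: withhold the data.
    raise UncorrectableError("UNC: sector could not be corrected")


if __name__ == "__main__":
    data, crc = write_sector(bytes(SECTOR_SIZE))
    damaged = bytearray(data)
    damaged[100] ^= 0x08                       # one flipped bit: repairable
    print(read_sector(bytes(damaged), crc) == data)
    damaged[400] ^= 0x80                       # a second flipped bit
    try:
        read_sector(bytes(damaged), crc)
    except UncorrectableError as exc:
        print(exc)
```

The important design point is the last branch: when the repair fails, the caller gets an error and never sees the suspect data.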

These types of errors occur when data is written to the sector improperly or inadvertently. They can also result from read instability in the drive, where the data itself is not actually corrupted but the drive is unable to read it correctly. This can be caused by factors such as minute mechanical wear of the parts inside the head-disk assembly.