Explained: “Bad Block” Management
17/07/2021 by Tim Niggemeier
How do “bad blocks” occur in NAND flash, and what impact do they have on storage reliability? What do flash and storage medium manufacturers have to pay attention to during production in order to avoid early failures? And why does the firmware play a key role in dealing with bad blocks during the life of the storage medium? Let's have a look!
NAND flash is very inexpensive compared to other electronic storage devices such as NOR flash or DRAM. One reason for this is that with NAND flash not every single bit needs to be error free. Consequently, production yield is high; for example, 512 Gbit NAND flash typically yields in excess of 90 %. In return, error correction is absolutely necessary for NAND flash in order to correct bit errors. While bit errors occur so seldom in NOR flash and DRAM that most applications can do without error correction, NAND flash already contains hard bit errors when it leaves production.
Factory Bad Blocks
Flash memory components are tested immediately after production. The first tests are performed at wafer level, before the wafer is divided into individual dies. Various test patterns and temperatures are applied; ideally, a write/read test already covers the entire temperature range at this stage.
As semiconductor manufacturing processes operate in the nanometer range, imperfections are inevitable (lattice defects, lithography tolerances, positioning errors, dust, etc.). When these faults occur in active areas of the die, they mostly manifest as bit errors of the flash cells. If the bit error rate of a section exceeds a certain limit during the write/read test, the block is marked as a bad block.
In the unlikely event that the periphery of a flash block is affected by an error (e.g. in the address decoder or a register), the block is also marked as a bad block because it is not functional. If too many blocks are rejected, there are no longer enough blocks available to reach the nominal capacity and/or to ensure sufficient reserve blocks. In this case, the entire die is marked as bad and discarded after dicing the wafer.
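For illustration, here is a minimal sketch of how a controller's firmware might build its initial bad-block table from the factory markers. It assumes the common convention that a factory bad block carries a non-0xFF byte in the spare area of its first (and often last) page; the device geometry and the driver function are hypothetical:

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_BLOCKS       4096  /* hypothetical device geometry */
#define PAGES_PER_BLOCK  64
#define MARKER_OFFSET    0     /* first byte of the spare area */

/* Hypothetical low-level accessor provided by the NAND driver. */
extern uint8_t nand_read_spare_byte(uint32_t block, uint32_t page,
                                    uint32_t offset);

/* Many vendors mark a factory bad block with a non-0xFF byte in the
 * spare area of its first (and often also its last) page. */
static bool is_factory_bad(uint32_t block)
{
    return nand_read_spare_byte(block, 0, MARKER_OFFSET) != 0xFF ||
           nand_read_spare_byte(block, PAGES_PER_BLOCK - 1,
                                MARKER_OFFSET) != 0xFF;
}

/* Build the initial bad-block table before the device is ever erased:
 * erasing a block destroys the factory marker for good. */
static void scan_factory_bad_blocks(bool bbt[NUM_BLOCKS])
{
    for (uint32_t b = 0; b < NUM_BLOCKS; b++) {
        bbt[b] = is_factory_bad(b);
        if (bbt[b])
            printf("block %u: factory bad\n", (unsigned)b);
    }
}
```

This scan must run before the first erase, because erasing a block overwrites the factory marker and the information is then lost.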
Our internal tests have shown that, for the flash types used by Swissbit, the reliability of the flash memory does not correlate with the number of factory bad blocks. An above-average number of factory bad blocks does not increase the probability of an early failure.
Grown Bad Blocks
During the lifetime of a storage medium, additional bad blocks may occur (so-called “Grown Bad Blocks” or “Runtime Bad Blocks”). This is completely normal, and flash manufacturers cannot avoid this in advance.
The most common cause is a failure of the insulation layer of a NAND transistor when the high voltage needed to erase a cell is applied. If the bit error ratio of a block rises above a certain limit after an erase or program operation during its lifetime, the firmware hides the block and replaces it with another one. Without this, data would be lost as soon as the number of bit errors exceeded the number of correctable bit errors. Some care is required here, however, since soft errors can also occur due to charge loss (long storage times or ionizing radiation); these must not be mistaken for a hard defect.
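As a rough illustration, the following sketch shows how firmware might hide and replace such a block. All names (nand_erase_block, ftl_copy_valid_pages, etc.) and the bit-flip threshold are hypothetical; real flash translation layers differ considerably:

```c
#include <stdbool.h>
#include <stdint.h>

/* Retire a block well before the ECC limit is reached; the exact
 * threshold is vendor- and ECC-specific. */
#define BITFLIP_RETIRE_THRESHOLD 8

/* Hypothetical driver/FTL hooks; real firmware provides these. */
extern bool     nand_erase_block(uint32_t block);  /* false on status fail */
extern int      nand_page_bitflips(uint32_t block, uint32_t page);
extern uint32_t ftl_alloc_spare_block(void);
extern void     ftl_copy_valid_pages(uint32_t from, uint32_t to);
extern void     ftl_remap(uint32_t old_block, uint32_t new_block);
extern void     ftl_mark_bad(uint32_t block);

/* Hide a grown bad block: rescue the data it still holds, remap its
 * logical addresses to a spare block, and record it as bad. */
static void retire_block(uint32_t block)
{
    uint32_t spare = ftl_alloc_spare_block();
    ftl_copy_valid_pages(block, spare);  /* rescue stored data first */
    ftl_remap(block, spare);
    ftl_mark_bad(block);
}

/* Erase with retirement: a status failure means the block can no
 * longer be erased and must be replaced. */
bool erase_or_retire(uint32_t block)
{
    if (nand_erase_block(block))
        return true;
    retire_block(block);
    return false;
}

/* After a read: many bit flips point to a hard (grown) defect, while a
 * few isolated flips may be soft errors from charge loss -- hence a
 * threshold rather than retiring on the first flip. */
void check_page_health(uint32_t block, uint32_t page)
{
    if (nand_page_bitflips(block, page) > BITFLIP_RETIRE_THRESHOLD)
        retire_block(block);
}
```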
Safe handling of Grown Bad Blocks is an important function of the storage medium’s firmware: when such a block is detected, the data already stored in it must be reliably copied elsewhere, and the new data intended for it must be cached. The appearance of Grown Bad Blocks follows a so-called bathtub curve:
At the beginning of the life cycle, early failures occur more frequently, when the insulation layer fails after only a few erase cycles. A long period with a very low failure probability follows. Once the specified number of write/erase cycles is exceeded, the probability of Grown Bad Blocks rises again. Thanks to the continuous improvement of manufacturing processes, for established flash technologies the probability that even a single Grown Bad Block occurs during the lifetime is typically only in the single-digit percentage range.
What about 3D NAND?
In contrast to NAND memory with the typical planar structure, 3D NAND can exhibit further failure modes, depending on the memory type, which can lead to the failure of one block or of multiple pages. Due to the high voltages during erase and programming, short circuits or open circuits in conductor tracks may occur. In such cases, the affected blocks can also be disabled and replaced without affecting other blocks. However, the failure of a conductor track may affect not only the page currently being programmed, but the entire block, in which case all data already written to this block is lost. To prevent this data loss, two different techniques can be applied, both of which significantly increase the complexity of the firmware:
With block parity, a parity block is calculated over several blocks and stored in flash. If a block fails, its contents can be reconstructed from the remaining blocks and the parity information.
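A minimal sketch of the XOR principle behind block parity, analogous to RAID-5; the stripe of seven data blocks and the block size are made up for the example:

```c
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE    (128 * 1024)  /* hypothetical block size in bytes */
#define STRIPE_BLOCKS 7             /* data blocks per parity group */

/* The parity block is the XOR of all data blocks in the stripe. */
void compute_parity(const uint8_t data[STRIPE_BLOCKS][BLOCK_SIZE],
                    uint8_t parity[BLOCK_SIZE])
{
    memset(parity, 0, BLOCK_SIZE);
    for (int b = 0; b < STRIPE_BLOCKS; b++)
        for (size_t i = 0; i < BLOCK_SIZE; i++)
            parity[i] ^= data[b][i];
}

/* Rebuild one failed block by XOR-ing the parity with the surviving
 * data blocks: XOR-ing a value in twice cancels it out, so only the
 * failed block's contents remain. */
void rebuild_block(const uint8_t data[STRIPE_BLOCKS][BLOCK_SIZE],
                   const uint8_t parity[BLOCK_SIZE],
                   int failed, uint8_t out[BLOCK_SIZE])
{
    memcpy(out, parity, BLOCK_SIZE);
    for (int b = 0; b < STRIPE_BLOCKS; b++)
        if (b != failed)
            for (size_t i = 0; i < BLOCK_SIZE; i++)
                out[i] ^= data[b][i];
}
```

Note the trade-off: one block per stripe is given up for parity, and every write must also update the parity block.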
With read verify, after each write command to a block, the content of that block is read back and compared to a copy in the controller’s RAM. This requires a lot of RAM, because a complete copy must be kept in RAM for each open block – meaning a block that is no longer erased but not yet completely written.
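The following sketch illustrates the read-verify principle. The page/block geometry and the two driver functions are assumptions for the example:

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE        4096  /* hypothetical page size in bytes */
#define PAGES_PER_BLOCK  64

/* Hypothetical driver primitives. */
extern void nand_program_page(uint32_t block, uint32_t page,
                              const uint8_t *buf);
extern void nand_read_page(uint32_t block, uint32_t page, uint8_t *buf);

/* RAM shadow copy of one open block: every page written so far is
 * kept until the block is completely programmed. This is what makes
 * read verify so RAM-hungry. */
static uint8_t open_block_shadow[PAGES_PER_BLOCK][PAGE_SIZE];

/* Program a page, read it back, and compare against the shadow copy.
 * Returns false if the readback differs, i.e. the block just failed. */
bool program_page_verified(uint32_t block, uint32_t page,
                           const uint8_t *data)
{
    uint8_t readback[PAGE_SIZE];

    memcpy(open_block_shadow[page], data, PAGE_SIZE);
    nand_program_page(block, page, data);
    nand_read_page(block, page, readback);

    /* On a mismatch, the firmware would rewrite the whole block
     * elsewhere from the shadow copy before retiring the failed one. */
    return memcmp(readback, open_block_shadow[page], PAGE_SIZE) == 0;
}
```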
Combinations of both techniques can also be used.
Conclusion
The presence of factory bad blocks in NAND flash is just as inherent to the technology as the occurrence of further bad blocks during the lifetime. Neither has any (negative) quality implications. However, when bad blocks occur during the lifetime, correct handling by the firmware is critical in order to avoid losing both new data and the data already stored in those blocks.