Flexible and efficient reliability in memory systems

Date

2011-05

Journal Title

Journal ISSN

Volume Title

Publisher

Abstract

Future computing platforms will increasingly demand more stringent memory resiliency mechanisms due to shrinking memory cell size, reduced error margins, higher capacity, and higher reliability expectations. Traditional mechanisms, which apply error checking and correcting (ECC) codes uniformly across all memory locations, are inefficient -- Uniform protection dedicates resources to redundant information and demand higher cost for stronger protection, a fixed (worst-case based) error tolerance level, and a fixed access granularity.

The design of modern computing platforms is a multi-objective optimization, balancing performance, reliability, and many other parameters within a constrained power budget. If resiliency mechanisms consume too many resources, we lose an opportunity to improve performance. Hence, it is important and necessary to enable more efficient and flexible memory resiliency mechanisms.

This dissertation develops techniques that enable efficient, adaptive, and dynamically tunable memory resiliency mechanisms.

First, we develop two-tiered protection, apply it to the last-level cache, and present Memory Mapped ECC (MME) and ECC FIFO. Two-tiered protection provides low-cost error detection or light-weight correction in the common case read operations, while the uncommon case error correction overhead is off-loaded to main memory namespace. MME and ECC FIFO use different schemes for managing redundant information in main memory. Both achieve 15-25% reduction in area and 9-18% reduction in power consumption of the last-level cache, while performance is degraded by only 0.7% on average.

Then, we apply two-tiered protection to main memory and augment the virtual memory interface to dynamically adapt error tolerance levels according to user, system, and environmental needs. This mechanism, Virtualized ECC (V-ECC), improves system energy efficiency by 12% and degrades performance only by 1-2% for chipkill-correct level protection. V-ECC also supports ECC in a system with no dedicated storage for redundant information.

Lastly, we propose the adaptive granularity memory system (AGMS) that allows different access granularities, while supporting ECC. By not wasting off-chip bandwidth for transferring unnecessary data, AGMS achieves higher throughput (by 44%) and power efficiency (by 46%) in a 4-core CMP system. Furthermore, AGMS will provide further gains in future systems, where off-chip bandwidth will be comparatively scarce.

Description

text

Citation