Efficient modeling of soft error vulnerability in microprocessors

Date

2012-05

Journal Title

Journal ISSN

Volume Title

Publisher

Abstract

Reliability has emerged as a first class design concern, as a result of an exponential increase in the number of transistors on the chip, and lowering of operating and threshold voltages with each new process generation. Radiation-induced transient faults are a significant source of soft errors in current and future process generations. Techniques to mitigate their effect come at a significant cost of area, power, performance, and design effort. Architectural Vulnerability Factor (AVF) modeling has been proposed to easily estimate the processor's soft error rates, and to enable the designers to make appropriate cost/reliability trade-offs early in the design cycle. Using cycle-accurate microarchitectural or logic gate-level simulations, AVF modeling captures the masking effect of program execution on the visibility of soft errors at the output. AVF modeling is used to identify structures in the processor that have the highest contribution to the overall Soft Error Rate (SER) while running typical workloads, and used to guide the design of SER mitigation mechanisms.

The precise mechanisms of interaction between the workload and the microarchitecture that together determine the overall AVF is not well studied in literature, beyond qualitative analyses. Consequently, there is no known methodology for ensuring that the workload suite used for AVF modeling offers sufficient SER coverage. Additionally, owing to the lack of an intuitive model, AVF modeling is reliant on detailed microarchitectural simulations for understanding the impact of scaling processor structures, or design space exploration studies. Microarchitectural simulations are time-consuming, and do not easily provide insight into the mechanisms of interactions between the workload and the microarchitecture to determine AVF, beyond aggregate statistics.

These aforementioned challenges are addressed in this dissertation by developing two methodologies. First, beginning with a systematic analysis of the factors affecting the occupancy of corruptible state in a processor, a methodology is developed that generates a synthetic workload for a given microarchitecture such that the SER is maximized. As it is impossible for every bit in the processor to simultaneously contain corruptible state, the worst-case realizable SER while running a workload is less than the sum of their circuit-level fault rates. The knowledge of the worst-case SER enables efficient design trade-offs by allowing the architect to validate the coverage of the workload suite and select an appropriate design point, and to identify structures that may potentially have high contribution to SER. The methodology induces 1.4X higher SER in the core as compared to the highest SER induced by SPEC CPU2006 and MiBench programs.

Second, a first-order analytical model is proposed, which is developed from the first principles of out-of-order superscalar execution that models the AVF induced by a workload in microarchitectural structures, using inexpensive profiling. The central component of this model is a methodology to estimate the occupancy of correct-path state in various structures in the core. Owing to its construction, the model provides fundamental insight into the precise mechanism of interaction between the workload and the microarchitecture to determine AVF. The model is used to cheaply perform sizing studies for structures in the core, design space exploration, and workload characterization for AVF. The model is used to quantitatively explain results that may appear counter-intuitive from aggregate performance metrics. The Mean Absolute Error in determining AVF of a 4-wide out-of-order superscalar processor using model is less than 7% for each structure, and the Normalized Mean Square Error for determining overall SER is 9.0%, as compared to cycle-accurate microarchitectural simulation.

Description

text

Citation