Tuesday January 27, 2009
Hamerschlag Hall D-210
Carnegie Mellon University
The emergence of multi-core architectures, driven by continued technology scaling, has led to concerns about increasing soft- and hard-error rates in commodity designs. Because modern chip designs can no longer operate redundant modules in lockstep, we propose an asynchronous approach to redundant execution. In asynchronous redundancy, the two processors in a pair independently execute the same instruction stream and treat any divergence between them as a soft error, invoking rollback recovery.
In this talk, I present REDAC, our proposal for distributed, asynchronous redundancy within a shared-memory multiprocessor. REDAC provides scalable buffering for unchecked state updates, permitting the distribution of redundant execution across multiple nodes of a shared-memory server. We evaluate REDAC using cycle-accurate full-system simulation of common enterprise workloads and show that the performance overhead averages just 10% relative to a non-redundant system.
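The compare-and-rollback idea in the abstract can be illustrated with a toy software model. The sketch below is not REDAC's mechanism: it runs two replicas sequentially in one loop (so it does not model asynchrony or timing), compares their full state at a fixed interval, and restores the last agreed state on a mismatch. The names `step`, `run_redundant`, `check_interval`, and `fault_at`, and the dictionary-based "register file", are all assumptions made for illustration.

```python
def step(state, instr):
    """Apply one toy instruction (register_name, increment) to a state dict."""
    reg, inc = instr
    state[reg] = state.get(reg, 0) + inc


def run_redundant(program, check_interval=4, fault_at=None):
    """Run `program` on two replicas with periodic checking and rollback.

    fault_at is a hypothetical instruction index at which replica B suffers
    a one-time bit flip, standing in for a transient soft error.
    Returns (final_state, number_of_rollbacks).
    """
    checkpoint = {}        # last state on which both replicas agreed
    checkpoint_pc = 0      # instruction index matching that checkpoint
    a, b = {}, {}          # unchecked state of the two replicas
    pc = 0
    rollbacks = 0
    fault_pending = fault_at is not None

    while pc < len(program):
        step(a, program[pc])
        step(b, program[pc])
        if fault_pending and pc == fault_at:
            b[program[pc][0]] ^= 1   # flip the low bit once in replica B
            fault_pending = False
        pc += 1

        # Check at every interval boundary and at the end of the program.
        if pc % check_interval == 0 or pc == len(program):
            if a == b:
                checkpoint, checkpoint_pc = dict(a), pc
            else:
                # Any difference is treated as a soft error: discard the
                # unchecked updates and re-execute from the checkpoint.
                a, b = dict(checkpoint), dict(checkpoint)
                pc = checkpoint_pc
                rollbacks += 1
    return a, rollbacks


program = [("r1", 1), ("r2", 2), ("r1", 3), ("r2", 4), ("r1", 5)]
clean_state, clean_rollbacks = run_redundant(program)
faulty_state, faulty_rollbacks = run_redundant(program, fault_at=2)
```

In this model the injected fault is transient, so re-execution after rollback produces the same final state as the fault-free run; a persistent (hard) error would need a different recovery path.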
Brian Gold is a 6th (and final) year PhD student at Carnegie Mellon (ECE), advised by Babak Falsafi. His research interests include reliable/available computing, power and temperature in server architectures, and performance modeling.