Chip multi-processors (CMP), where a processor core is replicated multiple times on a single chip, are widely recognized as a means to exploit the exponentially growing transistor budgets predicted by Moore’s Law, while respecting design complexity and physical constraints. While some workloads (e.g., server, scientific) already exhibit sufficient thread-level parallelism, most software does not. For these other programs, the additional complexity of manual parallelization is unattractive, and traditional automatic parallelization techniques have yet to be widely adopted.
In this talk, I will present two hardware/software techniques to exploit thread-parallel resources to improve sequential program performance. The first technique uses “helper threads” to avoid stalls (due to cache misses and branch mispredictions) that would be incurred by a single-threaded execution of the program. The second technique uses a “master processor” to compute future checkpoints of program state to achieve a (speculative) parallel execution of the program. Both techniques exploit a common theme: the use of approximate versions of code to generate accurate value predictions.