Adaptive Single-Chip Multiprocessing
Dan Gibson (degibson@wisc.edu)
University of Wisconsin-Madison
Department of Electrical and Computer Engineering
Introduction
• Moore’s Law continues to provide more transistors
  • Devices are getting smaller
  • Devices are getting faster, leading to increases in clock frequency
• Memories are getting bigger
  • Large memories often require more time to access
• RC circuits continue to charge exponentially
  • Long-wire signal propagation time is not improving as rapidly as switching speed
  • On-chip communication is becoming slow relative to processor clock speeds
The Memory Wall
• Processors grow faster; memory grows slower
• Off-chip cache misses can stall even aggressive out-of-order processors
• On-chip cache accesses are becoming long-latency events
• Latency can sometimes be tolerated:
  • Caching
  • Prefetching
  • Speculation
  • Out-of-order execution
  • Multithreading
The “Power” Wall
• More devices and faster clocks => more power
• Power supply accounts for many pins in chip packaging (3,057 of 5,370 pins on the POWER5)
• Heat dissipation increases total cost of ownership (~34 W of cooling power required to remove 100 W of heat)
• Dynamic power in CMOS: P_dynamic ≈ α · C_L · V² · f
  • Devices get smaller, faster, and more numerous
  • More capacitance (C_L), higher frequency (f)
  • Architects can constrain α, C_L, and f (a worked example follows)
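To make those knobs concrete, here is a minimal C sketch of the dynamic-power relation above. Every numeric value is an assumed ballpark figure for illustration, not a measurement of any real chip:

#include <stdio.h>

/* Illustrative evaluation of P ~= alpha * C_L * V^2 * f.
 * All constants below are made-up ballpark values (assumptions). */
int main(void) {
    double alpha = 0.15;     /* activity factor: fraction of gates switching */
    double c_load = 50e-9;   /* total switched capacitance, farads */
    double vdd = 1.2;        /* supply voltage, volts */
    double freq = 2.0e9;     /* clock frequency, hertz */

    double p_dyn = alpha * c_load * vdd * vdd * freq;
    printf("Dynamic power: %.1f W\n", p_dyn);

    /* Halving f (or alpha, or C_L) scales power linearly; lowering V
     * helps quadratically, which is why voltage/frequency scaling is
     * such an effective power knob for architects. */
    printf("At half frequency: %.1f W\n", alpha * c_load * vdd * vdd * (freq / 2));
    return 0;
}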
Enter Chip Multiprocessors (CMPs)
• One chip, many processors
  • Multiple cores per chip
  • Often multiple threads per core
[Figure: dual-core AMD Opteron die photo, from Microprocessor Report: Best Servers of 2004]
CMPs
• CMPs can have good performance
  • Explicit thread-level parallelism
  • Related threads experience constructive prefetching
• CMPs can tolerate long-latency events well
  • Many concurrent threads => long-latency memory accesses can be overlapped
• CMPs can be power-efficient
  • Enable the use of simpler cores
  • Distribute “hot spots”
CMPs
• CMPs are very specialized
  • They assume a (highly) threaded workload
• Parallel machines are difficult to use
  • Parallel programming is not (yet) commonplace
• Many problems are shared with traditional multiprocessors
  • Cache coherence
  • Memory consistency
• Many new opportunities
  • Cache sharing
  • Tighter integration
Adaptive CMPs
• To combat specialization, adapt a CMP dynamically to its current workload and system:
  • Adapt caching policy (Beckmann et al., Chang et al., and more)
  • Adapt cache structure (Alameldeen et al., and more)
  • Adapt thread scheduling (Kihm et al., in the SMT space)
• Current idea:
  • Adaptive thread scheduling from the space of un-stalled and stalled threads
  • A union of single-core multithreading and runahead execution in the context of CMPs
Single-Core Multithreading
• Allows multiple (HW) threads within the same execution pipeline
  • Shares processor resources: FUs, decode logic, ROB, etc.
  • Shares local memory resources: L1 caches, LSQ, etc.
• Can increase processor and memory utilization (a minimal thread selector is sketched below)
[Figure: Sun’s Niagara pipeline block diagram (Kongetira et al.)]
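As an illustration, here is a minimal C sketch of round-robin selection among ready hardware threads, loosely in the spirit of Niagara’s fine-grained issue policy. The names (hw_thread_t, NUM_THREADS, select_thread) are hypothetical, not from the talk or any real design:

#include <stdbool.h>

#define NUM_THREADS 4

typedef struct {
    bool ready;   /* not stalled on a cache miss, TLB miss, etc. */
} hw_thread_t;

/* Pick the next ready thread after 'last', wrapping around.
 * Returns -1 if every thread is stalled (the pipeline idles). */
int select_thread(const hw_thread_t threads[NUM_THREADS], int last) {
    for (int i = 1; i <= NUM_THREADS; i++) {
        int cand = (last + i) % NUM_THREADS;
        if (threads[cand].ready)
            return cand;
    }
    return -1;
}

Because stalled threads are simply skipped, one thread’s long-latency miss is overlapped with useful work from the others, which is the utilization benefit the slide describes.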
Runahead Execution
• Continue execution in the face of a cache miss (see the sketch after this list):
  • “Checkpoint” architectural state
  • Continue execution speculatively
  • Convert memory accesses to prefetches
• “Runahead” prefetches can be highly accurate and can greatly improve cache performance (Mutlu et al.)
• It is possible to issue useless prefetches
  • Can be power-inefficient (Mutlu et al.)
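A toy, simulation-style C sketch of that checkpoint/runahead/restore sequence; all state and stub functions here are hypothetical stand-ins for what is really a hardware mechanism:

#include <stdbool.h>
#include <stdio.h>

/* Toy model: 'pc' stands in for all architectural state. */
static int miss_cycles_left;   /* cycles until the missing line returns */
static int pc, checkpointed_pc;

static void checkpoint(void)   { checkpointed_pc = pc; }
static void restore(void)      { pc = checkpointed_pc; }
static bool miss_pending(void) { return miss_cycles_left > 0; }

static void run_ahead_one_cycle(void) {
    /* Execute speculatively: results never commit, and any memory
     * access made here is issued to the cache purely as a prefetch.
     * Values that depend on the missing data are marked invalid. */
    pc++;
    miss_cycles_left--;
}

/* On a long-latency miss: checkpoint, run ahead to warm the caches,
 * then restore and re-execute from the checkpoint for real. */
void on_cache_miss(int latency) {
    checkpoint();
    miss_cycles_left = latency;
    while (miss_pending())
        run_ahead_one_cycle();
    restore();
    printf("resuming at pc=%d after %d runahead cycles\n", pc, latency);
}

int main(void) { pc = 100; on_cache_miss(300); return 0; }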
Runahead/Multithreaded Core Interaction
• Similar hardware requirements:
  • Additional register files
  • Additional LSQ entries
• Competition for similar resources:
  • Execution time (processor pipeline, functional units, etc.)
  • Memory bandwidth
  • TLB entries, cache space, etc.
Runahead/Multithreaded Core Interaction
• A multithreaded core in a CMP, with runahead, must make difficult scheduling decisions
• Thread scheduling considerations:
  • Which thread should run?
  • Should the thread use runahead?
  • How long should the thread run (or run ahead)?
• Scheduling implications:
  • Is an idle thread making forward progress at the expense of a useful thread?
  • Is a thread spinning on a lock held by another thread?
  • Is runahead effective for a given thread?
  • Is a given thread causing performance problems elsewhere in the CMP?
Proposed Mechanism
• Track per-thread state:
  • Runahead prefetching accuracy: high accuracy favors allowing a thread to run ahead
  • HW-assigned thread priority: highly “useful” threads are preferred
• Selection criteria (sketched in code below):
  • Heuristic-guided: select the best priority/accuracy pair
  • Probabilistically-guided: select a thread with likelihood proportional to its priority/accuracy
  • Useful-first: select non-runahead threads first, then select runahead threads
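A C sketch of the three selection criteria. The thread_state_t fields and the score() pairing of priority and accuracy are assumptions for illustration; the slides do not give exact formulas:

#include <stdlib.h>

#define NUM_THREADS 8

typedef struct {
    double priority;    /* HW-assigned usefulness, higher is better */
    double accuracy;    /* measured runahead prefetch accuracy, 0..1 */
    int    in_runahead; /* nonzero if the thread is currently runahead */
} thread_state_t;

/* Assumed priority/accuracy pairing; the real metric is unspecified. */
static double score(const thread_state_t *t) {
    return t->priority * t->accuracy;
}

/* Heuristic-guided: pick the best priority/accuracy pair. */
int select_heuristic(const thread_state_t t[NUM_THREADS]) {
    int best = 0;
    for (int i = 1; i < NUM_THREADS; i++)
        if (score(&t[i]) > score(&t[best])) best = i;
    return best;
}

/* Probabilistically-guided: likelihood proportional to score. */
int select_probabilistic(const thread_state_t t[NUM_THREADS]) {
    double total = 0.0;
    for (int i = 0; i < NUM_THREADS; i++) total += score(&t[i]);
    double r = ((double)rand() / RAND_MAX) * total;
    for (int i = 0; i < NUM_THREADS; i++) {
        r -= score(&t[i]);
        if (r <= 0.0) return i;
    }
    return NUM_THREADS - 1;
}

/* Useful-first: prefer non-runahead threads; fall back to runahead ones. */
int select_useful_first(const thread_state_t t[NUM_THREADS]) {
    int best = -1;
    for (int i = 0; i < NUM_THREADS; i++)
        if (!t[i].in_runahead && (best < 0 || score(&t[i]) > score(&t[best])))
            best = i;
    if (best >= 0) return best;
    return select_heuristic(t);  /* all threads stalled: best runahead wins */
}

Useful-first arguably matches the slide’s intent most directly: threads making committed progress always win the pipeline, and runahead threads only fill otherwise-idle issue slots.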
Future Directions
• Dynamically adaptable CMPs suggest several areas of future research:
  • Adapt for power savings / heat dissipation: computation relocation, load balancing, automatic low-power modes, etc.
  • Adapt to error conditions: dynamically allocate backup threads
  • Automatically relocate threads to improve resource sharing
  • Combined HW/SW/VM approach
Summary
• Latency now dominates off-chip communication
  • On-chip communication isn’t far behind
• Many techniques exist to tolerate latency, including multithreading
• CMPs present new challenges and opportunities to computer architects
  • Latency tolerance
  • Potential for power savings
• A CMP’s behavior can be adapted to its workload
  • Dynamic management of shared resources