Exploiting Eager Register Release in a Redundantly Multi-threaded Processor

Niti Madan, Rajeev Balasubramonian
University of Utah, School of Computing

Introduction
• Rising soft error rates due to shrinking transistor sizes and lower supply voltages
• Existing solutions:
  • Process level – SOI
  • Circuit level – Rad-hard cells, ECC, BISER
  • Architecture level –
    • Redundant multithreading
    • Reducing the time useful state spends in unprotected structures
    • Software-assisted fault tolerance

Introduction
• CMPs/SMTs enable redundant multi-threading (RMT)
• "Detailed Design and Evaluation of Redundant Multithreading Alternatives", ISCA 2002
• 2 processors/threads execute the same program

Chip-level Redundant Multi-threading (CRTR)
[Figure: two out-of-order (OoO) processors. Processor 1 runs leading thread 1 and trailing thread 2; Processor 2 runs trailing thread 1 and leading thread 2. Labels on the links between the cores: branch outcomes, loads, register values, stores. Each trailing thread lags behind its leading thread by some slack.]

Motivation
• Register file is already a critical resource:
  • impacts ILP
  • impacts cycle time
  • impacts peak temperature
• Multiple threads increase pressure on the register file

Motivation
• Out-of-order processors are "conservative" since they must preserve correctness
  • Example: registers are de-allocated conservatively
• Having a trailing thread allows the leading thread to be aggressive
  • improves the performance of the leading thread
  • trailer state can be used for ensuring correctness
  • some errors may go undetected

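[Figure: Processor 1 (Leading 1) and Processor 2 (Trailing 1) connected by the RVQ. In the leading thread, "lr5 = …" maps lr5 to R1; after a branch, a second "lr5 = …" maps lr5 to R2; the branch then mispredicts. Register labels shown in the figure: R1, R1, R2, R1.]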

[Figure: Processor 1 (Leading 1) and Processor 2 (Trailing 1) connected by the RVQ. A soft error turns one copy of R1 into R1'; on mispredict recovery the fault propagates.]
• Very few errors slip through: slack is less than the RVQ size most of the time

Our Approach
• An RMT processor has duplicate register value state in the RVQ/trailer's state
• Improve register file efficiency using Eager Register Release
• A smaller register file can deliver the same performance using the above technique
  • Reduced power
  • Increased reliability – ECC less expensive
  • Potentially faster clock speed

Outline
• Background on RMT design space
• Proposed technique
• Evaluation
• Conclusions & Future Work

Redundant Multi-threading
• Fault model
  • Trailer's state used for recovery
  • Does not provide complete recovery
  • Caches and Load Value Queue (LVQ) ECC protected
  • Can detect all single event upset faults
• Baseline RMT models include SRTR, CRTR, ST-P-CRTR, MT-P-CRTR

Baseline RMT Model
• SRTR – SMT level RMT
• CRTR – Chip level RMT
• Proposed by Mukherjee et al ISCA 2002, Gomaa et al ISCA 2002, ISCA 2003
[Figure: SRTR runs Leading Thread 1 and Trailing Thread 1 on one out-of-order processor. CRTR uses two out-of-order processors linked by LVQ, BOQ, RVQ: Processor 1 runs Leading 1 and Trailing 2, Processor 2 runs Trailing 1 and Leading 2.]

Power-efficient RMT model
• Our earlier work explores a power-efficient RMT model, P-CRTR (SELSE-2, Tech Report 2005)
• Observations
  • Trailing thread doesn't suffer from D-cache misses and branch mispredictions
  • Trailing thread is bound to have higher IPC
  • High trailer IPC enables power reduction
• Techniques proposed for power-efficiency:
  • Dynamic Frequency Scaling
  • In-order execution of trailer

Dynamic Frequency Scaling
• High trailer IPC enables frequency reduction
• Reduce trailer's frequency to match the leader's throughput
• Reduction in trailer's dynamic power
• Does not impact trailer's leakage power
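The frequency-matching idea on this slide can be sketched in a few lines. This is only an illustration, assuming throughput is IPC times frequency and that dynamic power scales linearly with frequency at a fixed voltage; the function names and the IPC and clock numbers are hypothetical and do not come from the slides.

```python
def matched_trailer_frequency(leader_ipc, trailer_ipc, leader_freq_ghz):
    """Lowest trailer frequency whose throughput (IPC x f) still matches the leader."""
    if trailer_ipc <= leader_ipc:
        # The trailer has no IPC headroom, so it cannot be slowed down.
        return leader_freq_ghz
    return leader_freq_ghz * leader_ipc / trailer_ipc


def dynamic_power_fraction(freq_ghz, nominal_freq_ghz):
    """Share of nominal dynamic power left, assuming P_dyn scales with f at fixed voltage."""
    return freq_ghz / nominal_freq_ghz


# Hypothetical numbers: leader IPC 1.0, trailer IPC 2.0, 4 GHz nominal clock.
f_trailer = matched_trailer_frequency(1.0, 2.0, 4.0)
print(f_trailer)                                # 2.0
print(dynamic_power_fraction(f_trailer, 4.0))   # 0.5
```

Leakage power does not appear in the sketch because, as the slide notes, frequency scaling leaves it unchanged.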

In-order Execution of Checker
• Our approach
  • Send all register values computed by the leading core to the trailer (register value prediction, RVP, with 100% accuracy if there is no fault)
  • Trailer reads source operands from RVQ
  • Trailer verifies source operands at commit
  • RVP enables perfect IPC – no stalls
• Cost: extra communication overhead
• Benefit: overall reduced dynamic and leakage power
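A toy model may help make the RVQ-based checking concrete. It is a sketch under stated assumptions: the RVQ entry layout (one pair of source operand values per instruction), the instruction tuple format, and all names are hypothetical, and real hardware would overlap execution and commit rather than run them back to back.

```python
import operator
from collections import deque


class FaultDetected(Exception):
    """Raised when the leader's operand values disagree with the trailer's state."""


def run_trailer(program, rvq, regs):
    """Run `program` in order, using leader values from `rvq` as operand predictions.

    program: list of (dest, op, src1, src2) tuples; register names are keys of `regs`.
    rvq:     deque of (src1_value, src2_value) pairs sent by the leader (assumed layout).
    regs:    the trailer's own architectural register state.
    """
    for dest, op, src1, src2 in program:
        pred1, pred2 = rvq.popleft()
        # Execute on the predicted operands instead of waiting for producers.
        result = op(pred1, pred2)
        # Commit-time check: predictions must match the trailer's own values.
        if regs[src1] != pred1 or regs[src2] != pred2:
            raise FaultDetected(f"operand mismatch while computing {dest}")
        regs[dest] = result
    return regs


regs = {"r1": 2, "r2": 3, "r3": 0}
program = [("r3", operator.add, "r1", "r2"), ("r1", operator.mul, "r3", "r3")]
run_trailer(program, deque([(2, 3), (5, 5)]), regs)
print(regs["r3"], regs["r1"])  # 5 25
```

If the leader had sent a corrupted operand for the second instruction, for example `(4, 5)`, the commit-time check would raise `FaultDetected`.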

ST-P-CRTR
• Single thread workloads
[Figure: Processor 1 (out-of-order) runs Leading 1; Processor 2 (in-order) runs Trailing 1; the two are linked by LVQ, BOQ, RVQ.]

MT-P-CRTR
• Multi-threaded workloads
[Figure: Processor 1 (out-of-order) runs Leading 1 and Leading 2; Processor 2 (in-order) runs Trailing 1; Processor 3 (in-order) runs Trailing 2. Each in-order processor is linked to Processor 1 by its own LVQ, BOQ, RVQ.]

Eager Register Release

Original Code        Renamed Code
lr3 = lr1, lr2       pr21 = pr8, pr11
lr5 = lr3, lr4       pr15 = pr21, pr12
Branch to x          Branch to x
lr3 = …              pr29 = …

• lr3 has 2 mappings – new pr29 and old pr21
• pr21 cannot be released until the branch resolves
• Eager Register Release
  • Involves releasing the older physical register once its logical register has been overwritten and the old value has been used by all consumers
  • Requires a mechanism to store the released state elsewhere
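The difference between the two release rules can be walked through on the slide's own example. This is a minimal sketch, assuming a conventional core frees the old mapping only when the overwriting instruction commits; the `PhysReg` class and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PhysReg:
    name: str
    overwritten: bool = False           # logical register renamed to a newer physical register
    pending_consumers: int = 0          # renamed consumers that have not read the value yet
    overwriter_committed: bool = False  # the overwriting instruction has committed


def conventional_release(reg):
    """Conservative rule: wait until the overwriting instruction commits."""
    return reg.overwriter_committed


def eager_release(reg):
    """Eager rule: overwritten and no consumer still needs the value."""
    return reg.overwritten and reg.pending_consumers == 0


pr21 = PhysReg("pr21")         # pr21 = pr8, pr11 (holds lr3)
pr21.pending_consumers += 1    # pr15 = pr21, pr12 is renamed
pr21.overwritten = True        # pr29 = ... remaps lr3 after the unresolved branch
pr21.pending_consumers -= 1    # the consumer reads pr21

print(conventional_release(pr21), eager_release(pr21))  # False True
```

The eager rule frees pr21 while the branch is still unresolved, which is why the released value has to remain recoverable from somewhere else.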

Implementation Details
• Need to keep track of various states for each physical register in a Usage Table
  • Bit that tracks if the logical register value is overwritten
  • RVQ address/register id in trailing thread
• Counters for each physical register
  • To track pending consumers
• Modification in ROB to initiate recovery upon mispredict
• Non-trivial complexity and overheads
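One way to picture the bookkeeping listed above is the sketch below. It is an assumption-heavy illustration rather than the authors' design: the slides name only the tracked fields (overwritten bit, RVQ address, consumer counter) and a ROB-initiated recovery, so the `UsageEntry` layout, the free list, and both helper functions are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class UsageEntry:
    rvq_addr: int               # where the duplicate of this value sits in the RVQ
    overwritten: bool = False   # logical register has been renamed again
    pending_consumers: int = 0  # consumers that still have to read the value
    released: bool = False      # physical register was handed back eagerly


def try_release(entry, free_list, preg):
    """Free `preg` eagerly once it is overwritten and has no pending consumers."""
    if entry.overwritten and entry.pending_consumers == 0 and not entry.released:
        entry.released = True
        free_list.append(preg)
        return True
    return False


def recover_on_mispredict(entry, free_list, regfile, rvq):
    """Undo a squashed overwrite; copy the value back from the RVQ if it was released."""
    entry.overwritten = False
    if not entry.released:
        return None                      # value never left the register file
    preg = free_list.pop()               # assumes squashed instructions freed a register
    regfile[preg] = rvq[entry.rvq_addr]
    entry.released = False
    return preg


regfile, rvq, free_list = {21: 42}, {7: 42}, []
entry = UsageEntry(rvq_addr=7, overwritten=True)
try_release(entry, free_list, 21)        # register 21 is freed eagerly
regfile[free_list.pop()] = 99            # wrong-path instruction reuses it
free_list.append(21)                     # mispredict squashes that instruction
restored = recover_on_mispredict(entry, free_list, regfile, rvq)
print(restored, regfile[restored])       # 21 42
```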

Evaluation Methodology
• SimpleScalar-3.0 (modified for CMP/SMT) for performance analysis and Wattch for processor power
• eCacti-3.0 to model register file power and area overheads
• SPEC2k Int, FP benchmark suite
  • 16 benchmarks for single thread experiments
  • 10 pairs of High/Low IPC/ Int/FP combinations for multi-thread experiments
• Evaluated all RMT models for comprehensive analysis of all combinations of leading/trailing threads
• RVQ size = 600 entries

Performance Evaluation

Effect of Register File Size - SRTR (ROB size 160)

Effect of Register File Size - ST-P-CRTR

Effect of Register File Size - CRTR

Effect of Register File Size - MT-P-CRTR

Effect of Register File Size
• For SRTR, CRTR, MT-P-CRTR:
  • A 100-entry RF with ER performs the same as a 160-entry baseline (37.5% size reduction)
  • Performance improvement of 34% for a 100-entry RF with ER compared to a 100-entry baseline
• For ST-P-CRTR:
  • A 50-entry register file with ER performs the same as an 80-entry baseline (37.5% size reduction)
  • Performance improvement of 12% for a 100-entry RF with ER compared to a 100-entry baseline
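The two 37.5% figures follow directly from the register file sizes quoted on the slide, as this quick check shows (the helper name is ours):

```python
def size_reduction_percent(baseline_entries, reduced_entries):
    """Relative register file size reduction, in percent."""
    return (baseline_entries - reduced_entries) / baseline_entries * 100


print(size_reduction_percent(160, 100))  # 37.5
print(size_reduction_percent(80, 50))    # 37.5
```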

Observations
• ER is more favorable to models where the leading thread co-executes with another leading/trailing thread
• Most FP benchmarks perform better with ER (greater than 20% improvement)
• Int benchmarks that have poor branch prediction rates do not benefit much (gcc, equake, eon, etc.: up to 3%)

Performance Overheads
• For a 100 million instruction single-thread execution:
  • 70 million registers are released eagerly
  • 6% are copied back upon mispredict recovery
• Cost of copying back depends on the program's mispredict rate
• Each mispredict requires 6.6 copy-back values
• Cost of copying can possibly be hidden within the branch recovery time
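The figures on this slide can be combined into a rough estimate of how often recovery happens. This is a back-of-the-envelope sketch, assuming the 6% is a share of the eagerly released registers and that "100 million" counts instructions; neither assumption is stated on the slide, and the derived values below are not reported by the authors.

```python
released = 70_000_000              # registers released eagerly (from the slide)
copied_back = released * 6 // 100  # assumption: 6% of the released registers
values_per_mispredict = 6.6        # copy-back values per mispredict (from the slide)
instructions = 100_000_000         # assumption: the "100 million" counts instructions

mispredicts = copied_back / values_per_mispredict
per_kilo_instructions = mispredicts / instructions * 1000

print(copied_back)                      # 4200000
print(round(mispredicts))               # 636364
print(round(per_kilo_instructions, 2))  # 6.36
```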

Performance Overheads
• Max IPC loss for a 5-cycle overhead is 4%

Power/Area Analysis
• 8 read/4 write ports assumed for the single-thread (ST) RF
• 16 read/8 write ports assumed for the multi-thread (MT) RF

Power/Area Analysis
• A single-thread 50-entry RF with ER, compared to an 80-entry baseline RF:
  • Improves clock speed by 19%
  • Consumes 11% less energy and 25% less area
• If SEC-DED ECC is implemented on the baseline register file:
  • 6% energy increase and 16% area increase
• A smaller RF can help afford ECC, even for multiple-bit soft error resilience
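For a sense of what SEC-DED adds per register, the check-bit count of an extended Hamming code can be computed as below. The 64-bit register width is an assumption (the slides do not state it), and the sketch covers storage bits only; it is not a substitute for the energy and area figures above, which come from the authors' modeling.

```python
def secded_check_bits(data_bits):
    """Check bits for an extended Hamming SEC-DED code over `data_bits` data bits."""
    r = 0
    # Single-error correction needs 2^r >= data_bits + r + 1.
    while 2 ** r < data_bits + r + 1:
        r += 1
    # One extra overall parity bit adds double-error detection.
    return r + 1


width = 64  # assumed register width
bits = secded_check_bits(width)
print(bits)                 # 8
print(bits / width * 100)   # 12.5
```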

Fault-Injection Analysis
• Modified SimpleScalar for fault analysis
• Conservative analysis, as masking effects cannot be modeled
• Every 1000 cycles, a register bit is flipped in the trailing register file
• Only 0.0004% of faults go undetected
• On average, 99% of the time a logical register is rewritten within a 100-instruction interval
  • Ensures that slack is less than RVQ size
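The injection schedule described above can be mimicked with a short sketch. The 1000-cycle interval is from the slide; choosing the register and bit uniformly at random, the 64-bit width, the fixed seed, and the function name are assumptions made for illustration, not details of the modified simulator.

```python
import random


def inject_faults(regfile, total_cycles, interval=1000, width=64, seed=0):
    """Flip one random bit of one random register every `interval` cycles.

    Returns the list of (cycle, register_index, bit_index) flips applied.
    """
    rng = random.Random(seed)
    flips = []
    for cycle in range(interval, total_cycles + 1, interval):
        reg = rng.randrange(len(regfile))
        bit = rng.randrange(width)
        regfile[reg] ^= 1 << bit   # single-bit upset
        flips.append((cycle, reg, bit))
    return flips


regfile = [0] * 80
flips = inject_faults(regfile, total_cycles=5000)
print(len(flips))  # 5
```

A checker would then count how many of these flips are caught by the redundant thread and how many slip through.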

Conclusions and Future Work
• The RMT model is very suitable for Eager Register Release
• A 100-entry RF can match the throughput of a 160-entry file and shows a 34% improvement over the 100-entry baseline
• Fault-coverage reduction is marginal (~0.0004%)
• Enables a smaller RF for lower power, higher clock speed, lower area overheads
• Enables reliability by making ECC affordable
• Nontrivial implementation overheads
• Need to explore complexity-effective solutions
