Exploiting Eager Register Release in a Redundantly Multi-threaded Processor


  1. Exploiting Eager Register Release in a Redundantly Multi-threaded Processor
     Niti Madan, Rajeev Balasubramonian
     University of Utah, School of Computing

  2. Introduction
     Rising soft error rates due to shrinking transistor sizes and lower supply voltages
     Existing solutions:
     • Process level – SOI
     • Circuit level – Rad-hard cells, ECC, BISER
     • Architecture level –
        • Redundant multithreading
        • Reducing the time useful state spends in unprotected structures
        • Software-assisted fault tolerance

  3. Introduction
     CMPs/SMTs enable redundant multi-threading (RMT)
     • "Detailed Design and Evaluation of Redundant Multithreading Alternatives", ISCA 2002
     • 2 processors/threads execute the same program

  4. Chip-level Redundant Multi-threading (CRTR)
     [Diagram: two out-of-order (OoO) processors. Processor 1 runs leading thread 1 and trailing thread 2; Processor 2 runs trailing thread 1 and leading thread 2. Labels between the cores: branch outcomes, loads, register values, stores. Each trailing thread lags behind its leading thread by some slack.]

  5. Motivation
     • Register file is already a critical resource:
        • impacts ILP
        • impacts cycle time
        • impacts peak temperature
     • Multiple threads increase pressure on register file

  6. Motivation
     • Out-of-order processors are "conservative" since they must preserve correctness
        • Example: registers are de-allocated conservatively
     • Having a trailing thread allows the leading thread to be aggressive
        • improves the performance of the leading thread
        • trailer state can be used for ensuring correctness
        • some errors may go undetected

  7. [Diagram: Processor 1 (Leading 1), RVQ, Processor 2 (Trailing 1), with registers R1 and R2. Leading-thread code: "lr5 = …" (lr5 mapped to R1); Branch; "lr5 = …" (lr5 mapped to R2); Mispredict.]

  8. [Diagram: Processor 1 (Leading 1), RVQ, Processor 2 (Trailing 1), with values R1 and R1’. Sequence: soft error, mispredict recovery, fault propagates.]
     Very few errors slip through: slack is most of the time less than RVQ size

  9. Our Approach
     • RMT processor has duplicate register value state in RVQ/trailer’s state
     • Improve register file efficiency using Eager Register Release
     • A smaller register file can deliver the same performance using the above technique
        • Reduced power
        • Increased reliability – ECC less expensive
        • Potentially faster clock speed

  10. Outline
      • Background on RMT design space
      • Proposed technique
      • Evaluation
      • Conclusions & Future Work

  11. Redundant Multi-threading
      • Fault model
         • Trailer’s state used for recovery
         • Does not provide complete recovery
         • Caches and Load Value Queue (LVQ) ECC protected
         • Can detect all single event upset faults
      • Baseline RMT models include SRTR, CRTR, ST-P-CRTR, MT-P-CRTR

  12. Baseline RMT Model
      • SRTR – SMT-level RMT
      • CRTR – Chip-level RMT
      • Proposed by Mukherjee et al ISCA 2002, Gomaa et al ISCA 2002, ISCA 2003
      [Diagram: one out-of-order processor running Leading Thread 1 and Trailing Thread 1.]
      [Diagram: two out-of-order processors connected by LVQ, BOQ, RVQ; Processor 1 runs Leading 1 and Trailing 2, Processor 2 runs Trailing 1 and Leading 2.]

  13. Power-efficient RMT model
      Our earlier work explores the power-efficient RMT model P-CRTR (SELSE-2, Tech Report 2005)
      • Observations
         • Trailing thread doesn’t suffer from D-cache misses and branch mispredictions
         • Trailing thread bound to have higher IPC
         • High trailer IPC enables power reduction
      • Techniques proposed for power-efficiency:
         • Dynamic Frequency Scaling
         • In-order execution of trailer

  14. Dynamic Frequency Scaling
      • High trailer IPC enables frequency reduction
      • Reduce trailer’s frequency to match the leader’s throughput
      • Reduction in trailer’s dynamic power
      • Does not impact trailer’s leakage power
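
The frequency-matching idea on this slide can be sketched as follows. This is only an illustration: it assumes throughput is IPC times clock frequency, and the function name and the numbers are hypothetical, not taken from the presentation.

```python
def trailer_frequency(leader_freq_ghz, leader_ipc, trailer_ipc):
    """Lowest trailer frequency that still matches the leader's throughput.

    Assumes throughput = IPC * frequency (an illustrative model).
    The trailer is never clocked faster than the leader.
    """
    if leader_ipc <= 0 or trailer_ipc <= 0:
        raise ValueError("IPC values must be positive")
    scale = min(1.0, leader_ipc / trailer_ipc)
    return leader_freq_ghz * scale


# Hypothetical numbers: a trailer with twice the leader's IPC
# can run at half the leader's frequency.
print(trailer_frequency(leader_freq_ghz=4.0, leader_ipc=1.0, trailer_ipc=2.0))  # 2.0
```

Under this model only dynamic power benefits from the lower clock, which is consistent with the slide's note that leakage power is not affected.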

  15. In-order Execution of Checker
      • Our approach
         • Send all register values computed by leading core to the trailer (register value prediction, RVP, is 100% accurate if there is no fault)
         • Trailer reads source operands from RVQ
         • Trailer verifies source operands at commit
         • RVP enables perfect IPC – no stalls
      • Cost: Extra communication overhead
      • Benefit: Overall reduced dynamic and leakage power
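
A toy model of the checker described on this slide may help. It is a sketch under assumptions: the instruction format, the `OPS` table, and both function names are hypothetical, and a single-bit flip in the leader's result stands in for a soft error. The trailer computes with operand values taken from the RVQ and, at commit, compares them with its own register state.

```python
from collections import deque
import operator

OPS = {"add": operator.add, "mul": operator.mul}  # hypothetical tiny ISA


def run_leader(program, regs, fault_at=None):
    """Execute the program and push each instruction's source values to the RVQ."""
    regs = dict(regs)
    rvq = deque()
    for i, (dest, op, s1, s2) in enumerate(program):
        rvq.append((regs[s1], regs[s2]))
        result = OPS[op](regs[s1], regs[s2])
        if i == fault_at:
            result ^= 1  # hypothetical single-bit soft error in the leader
        regs[dest] = result
    return rvq


def run_trailer(program, regs, rvq):
    """In-order checker: return the index of the first mismatching instruction, or None."""
    regs = dict(regs)
    for i, (dest, op, s1, s2) in enumerate(program):
        v1, v2 = rvq.popleft()           # operands come from the RVQ, so no stalls
        result = OPS[op](v1, v2)
        if (v1, v2) != (regs[s1], regs[s2]):
            return i                     # commit-time operand check failed
        regs[dest] = result
    return None


program = [("r3", "add", "r1", "r2"), ("r5", "mul", "r3", "r4")]
init_regs = {"r1": 1, "r2": 2, "r3": 0, "r4": 4, "r5": 0}

print(run_trailer(program, init_regs, run_leader(program, init_regs)))              # None
print(run_trailer(program, init_regs, run_leader(program, init_regs, fault_at=0)))  # 1
```

In this toy, a corrupted result is caught only when a later instruction consumes it; a fault in the final result would need a separate check, which the sketch does not model.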

  16. ST-P-CRTR
      • Single-thread workloads
      [Diagram: Processor 1 (out-of-order) runs Leading 1; Processor 2 (in-order) runs Trailing 1; the cores are connected by LVQ, BOQ, RVQ.]

  17. MT-P-CRTR
      • Multi-threaded workloads
      [Diagram: Processor 1 (out-of-order) runs Leading 1 and Leading 2; Processor 2 (in-order) runs Trailing 1 and Processor 3 (in-order) runs Trailing 2; each in-order core is connected to Processor 1 by its own LVQ, BOQ, RVQ.]

  18. Eager Register Release
      Original code        Renamed code
      lr3 = lr1, lr2       pr21 = pr8, pr11
      lr5 = lr3, lr4       pr15 = pr21, pr12
      Branch to x          Branch to x
      lr3 = …              pr29 = …
      lr3 has 2 mappings – new pr29 and old pr21; pr21 cannot be released until branch resolves
      • Eager Register Release
         • Involves releasing older physical register after the value is rewritten and used by all consumers
         • Requires a mechanism to store the released state elsewhere

  19. Implementation Details
      • Need to keep track of various states for each physical register in Usage Table
         • Bit that tracks if logical register value is overwritten
         • RVQ address/register id in trailing thread
      • Counters for each physical register
         • To track pending consumers
      • Modification in ROB to initiate recovery upon mispredict
      • Non-trivial complexity and overheads
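
The bookkeeping listed on this slide can be sketched roughly as below. This is a minimal illustration, not the authors' hardware design: the class names, method names, and the `written` flag are assumptions introduced here. A physical register is released once its value has been produced, its logical register has been overwritten, and no consumers are pending; the RVQ address is kept so the value can be copied back on a mispredict.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class UsageEntry:
    rvq_addr: Optional[int] = None   # where the value lives in the RVQ / trailer
    written: bool = False            # value has been produced (assumed extra flag)
    overwritten: bool = False        # logical register renamed to a newer physical register
    pending_consumers: int = 0       # renamed consumers that have not yet read the value


class UsageTable:
    """Hypothetical per-physical-register usage table for eager release."""

    def __init__(self):
        self.entries: Dict[str, UsageEntry] = {}
        self.released: Dict[str, Optional[int]] = {}  # preg -> RVQ address for copy-back

    def allocate(self, preg, rvq_addr):
        self.entries[preg] = UsageEntry(rvq_addr=rvq_addr)

    def add_consumer(self, preg):
        self.entries[preg].pending_consumers += 1

    def value_written(self, preg):
        self.entries[preg].written = True
        return self._try_release(preg)

    def consumer_read(self, preg):
        self.entries[preg].pending_consumers -= 1
        return self._try_release(preg)

    def overwrite(self, preg):
        self.entries[preg].overwritten = True
        return self._try_release(preg)

    def _try_release(self, preg):
        e = self.entries[preg]
        if e.written and e.overwritten and e.pending_consumers == 0:
            self.released[preg] = e.rvq_addr
            del self.entries[preg]
            return True
        return False

    def copy_back_source(self, preg):
        """RVQ address to restore from if a mispredict needs the released value."""
        return self.released.get(preg)


# pr21/pr15/pr29 follow the renamed-code example; the RVQ address 7 is made up.
table = UsageTable()
table.allocate("pr21", rvq_addr=7)
table.add_consumer("pr21")            # pr15 = pr21, pr12 is renamed
print(table.value_written("pr21"))    # False: not yet overwritten
print(table.overwrite("pr21"))        # False: pr29 now maps lr3, but a consumer is pending
print(table.consumer_read("pr21"))    # True: released before the branch resolves
print(table.copy_back_source("pr21")) # 7
```

The sketch omits the ROB changes that trigger the copy-back, which the slide flags as part of the non-trivial overhead.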

  20. Evaluation Methodology
      • SimpleScalar-3.0 (modified for CMP/SMT) for performance analysis and Wattch for processor power
      • eCacti-3.0 to model register file power and area overheads
      • SPEC2k Int, FP benchmark suite
         • 16 benchmarks for single-thread experiments
         • 10 pairs of High/Low IPC / Int/FP combinations for multi-thread experiments
      • Evaluated all RMT models for comprehensive analysis of all combinations of leading/trailing threads
      • RVQ size = 600 entries

  21. Performance Evaluation

  22. Effect of Register File Size – SRTR (ROB size 160)

  23. Effect of Register File Size – ST-P-CRTR

  24. Effect of Register File Size – CRTR

  25. Effect of Register File Size – MT-P-CRTR

  26. Effect of Register File Size
      • For SRTR, CRTR, MT-P-CRTR:
         • Performance of 100-entry RF with Eager Release (ER) same as baseline with 160 entries (37.5% size reduction)
         • Performance improvement of 34% in 100-entry RF with ER compared to baseline with 100 entries
      • For ST-P-CRTR:
         • Performance of 50-entry register file with ER same as baseline with 80 entries (37.5% size reduction)
         • Performance improvement of 12% in 100-entry RF with ER compared to baseline with 100 entries
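
The 37.5% figure quoted for both configurations follows directly from the register file sizes on the slide. A quick check (the function name is just for illustration):

```python
def size_reduction(baseline_entries, reduced_entries):
    """Fractional reduction in register file entries relative to the baseline."""
    return (baseline_entries - reduced_entries) / baseline_entries


print(size_reduction(160, 100))  # 0.375 (SRTR, CRTR, MT-P-CRTR)
print(size_reduction(80, 50))    # 0.375 (ST-P-CRTR)
```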

  27. Observations
      • More favorable to models where leading thread co-executes with another leading/trailing thread
      • Most FP benchmarks perform better with ER (greater than 20% improvement)
      • Int benchmarks that have poor branch prediction rates do not benefit much (gcc, equake, eon etc., up to 3%)

  28. Performance Overheads
      • For 100 million single thread execution
         • 70 million registers are released eagerly
         • 6% copied back upon mispredict recovery
      • Cost of copying back depends on program mispredict rate
         • Each mispredict requires 6.6 copy-back values
      • Cost of copying can possibly be hidden within branch recovery time
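
A back-of-the-envelope reading of these numbers, under two assumptions that the slide does not state explicitly: that the 6% is a fraction of the eagerly released registers, and that 6.6 is the average number of values copied back per mispredict. The derived counts below are not reported in the presentation.

```python
# Figures quoted on the slide
released_eagerly = 70_000_000   # registers released eagerly
copy_back_fraction = 0.06       # share copied back upon mispredict recovery
copies_per_mispredict = 6.6     # copy-back values per mispredict

# Assumption: the 6% applies to the eagerly released registers.
copied_back = round(released_eagerly * copy_back_fraction)

# Assumption: 6.6 is an average, so this is the implied number of
# mispredicts that triggered a copy-back.
implied_mispredicts = round(copied_back / copies_per_mispredict)

print(copied_back)          # 4200000
print(implied_mispredicts)  # 636364
```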

  29. Performance Overheads
      Max IPC loss for 5-cycle overhead is 4%

  30. Power/Area Analysis
      8 Rd/4 Wr ports assumed for ST RF
      16 Rd/8 Wr ports assumed for MT RF

  31. Power/Area Analysis
      • Single-thread RF size 50 with ER compared to baseline RF size 80 can:
         • Improve clock speed by 19%
         • Consume 11% less energy and 25% less area
      • If SEC-DED ECC is implemented on baseline register file:
         • 6% energy increase and 16% area increase
      • Smaller RF can help afford ECC for even multiple-bit soft error resilience

  32. Fault-Injection Analysis
      • Modified SimpleScalar for fault analysis
      • Conservative analysis as masking effects cannot be modeled
      • Every 1000 cycles, a register bit is flipped in trailing register file
      • Only 0.0004% of faults go undetected
      • On average, 99% of the time a logical register is rewritten in less than a 100-instruction interval
         • Ensures that slack is less than RVQ size
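
The injection schedule on this slide (one register bit flipped every 1000 cycles) can be illustrated with a small stand-alone sketch. It is not the modified SimpleScalar code: the function names, the 64-bit register width, the uniform random choice of register and bit, and the fixed seed are all assumptions made for the example.

```python
import random


def flip_random_bit(regfile, width, rng):
    """Flip one randomly chosen bit of one randomly chosen register in place."""
    idx = rng.randrange(len(regfile))
    bit = rng.randrange(width)
    regfile[idx] ^= 1 << bit
    return idx, bit


def inject_faults(regfile, total_cycles, interval=1000, width=64, seed=0):
    """Inject one bit flip every `interval` cycles; return a log of (cycle, register, bit)."""
    rng = random.Random(seed)  # fixed seed keeps the experiment repeatable
    log = []
    for cycle in range(interval, total_cycles + 1, interval):
        idx, bit = flip_random_bit(regfile, width, rng)
        log.append((cycle, idx, bit))
    return log


regfile = [0] * 100            # hypothetical 100-entry register file
log = inject_faults(regfile, total_cycles=5000)
print(len(log))                # 5
```

A real study would also need the detection check after each flip; the sketch covers only the injection side.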

  33. Conclusions and Future Work
      • RMT model very suitable for Eager Register Release
      • A 100-entry RF can match the throughput of a 160-entry file and shows 34% improvement over baseline
      • Fault-coverage reduction marginal (~0.0004%)
      • Enables smaller RF for lower power, higher clock speed, lower area overheads
      • Enables reliability by making ECC affordable
      • Nontrivial implementation overheads
         • Need to explore complexity-effective solution
