Fault Tolerance and Performance Enhancement using Chip Multi-Processors

Presentation Transcript


  1. Fault Tolerance and Performance Enhancement using Chip Multi-Processors Işıl ÖZ

  2. Outline • Introduction • Related Work • Dual-Core Execution (DCE) • DCE for Fault Tolerance • DCE with Energy Optimization • Experimental Results • Conclusion

  3. CMP • Single-chip multi-core or chip multiprocessor • High system throughput • Exploits only explicit parallelism • No single-thread performance benefit • Processor cores sit idle if there are not enough parallel tasks • Dual-core execution • Utilizes the idle cores to improve performance for single-thread workloads • Through computation redundancy

  4. Run-Ahead Execution • Blocked by a long-latency cache miss • Checkpoints the processor state • Enters run-ahead mode • When the blocking miss completes • Returns to normal mode • Re-executes using the warmed-up caches • Limitations • Re-execution even when the run-ahead results are correct • Multiple executions for miss-dependent misses
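
A minimal behavioral sketch of this run-ahead flow, in hypothetical Python (class and method names are my own, not from the papers): the state is checkpointed on a long-latency miss, execution continues speculatively to warm up the caches, and the checkpoint is restored once the miss completes.

# Hypothetical toy model of run-ahead execution (illustrative only).
import copy

class RunAheadCore:
    def __init__(self):
        self.regs = {}          # architectural register state
        self.checkpoint = None  # saved state for run-ahead recovery
        self.run_ahead = False

    def long_latency_miss(self):
        # Checkpoint the processor state and enter run-ahead mode.
        self.checkpoint = copy.deepcopy(self.regs)
        self.run_ahead = True

    def miss_completes(self):
        # Discard speculative results and return to normal mode;
        # the caches stay warm, so re-execution is faster.
        self.regs = self.checkpoint
        self.run_ahead = False

    def execute(self, dest, value):
        # In run-ahead mode results are speculative and will be thrown away.
        self.regs[dest] = value

core = RunAheadCore()
core.execute("r1", 10)
core.long_latency_miss()     # blocked by an L2 miss
core.execute("r2", 99)       # speculative work that only warms caches
core.miss_completes()        # state rolled back; re-execute from checkpoint
assert "r2" not in core.regs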

  5. CFP (Continual Flow Pipelines) • Similar to run-ahead execution • Sets the miss-dependent (slice) instructions aside • Executes independent instructions speculatively • Commits speculative results • Limitation • Requires a large centralized load/store queue

  6. Leader-Follower Architectures • Run a program on two processors • One leader • One follower that uses the leader's results to make faster progress • Limitations • If the leader falls behind, the follower cannot use its results • If the follower falls behind, the leader has to wait to retire

  7. Dual-Core Execution (DCE)

  8. Front Superscalar Core • Executes instructions in the normal way, except • For long-latency cache misses (L2 misses) • Substitutes an invalid value for the missing data • The INV bit is set in the physical register • Invalidates the dependent instructions • The INV flag propagates through data dependences • Retires instructions in order, except • Store instructions • No data cache or memory update • Update the run-ahead cache for use by subsequent loads • Exceptions (precise handling is left to the back processor)
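
A toy sketch of the INV-propagation rule described above (hypothetical Python; the names are invented for illustration): a load that misses in the L2 writes an INV value, any consumer of an INV source also produces INV, and stores update only the run-ahead cache.

# Toy model of INV-bit propagation in the front core (illustrative only).
INV = object()   # sentinel standing in for the INV bit on a physical register

class FrontCore:
    def __init__(self):
        self.regs = {}
        self.runahead_cache = {}   # holds store data for subsequent loads

    def load(self, dest, addr, l2_miss):
        if l2_miss:
            self.regs[dest] = INV            # substitute an invalid value
        else:
            self.regs[dest] = self.runahead_cache.get(addr, 0)

    def alu(self, dest, src1, src2):
        a, b = self.regs.get(src1, 0), self.regs.get(src2, 0)
        # Propagate INV through data dependences instead of stalling.
        self.regs[dest] = INV if a is INV or b is INV else a + b

    def store(self, addr, src):
        # Stores never update the data cache or memory in the front core;
        # they only update the run-ahead cache for later loads.
        val = self.regs.get(src, 0)
        if val is not INV:
            self.runahead_cache[addr] = val

front = FrontCore()
front.load("r1", 0x100, l2_miss=True)   # long-latency miss -> INV
front.alu("r2", "r1", "r1")             # dependent instruction invalidated
assert front.regs["r2"] is INV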

  9. Result Queue • First-in first-out structure • Keeps the retired instruction stream from the front processor • Provides continuous instruction stream to the back processor

  10. Back Superscalar Core • Instructions are fetched from the result queue • Processes instructions in the normal way, except • Mispredicted branches • All in-flight instructions are squashed in both the back and the front processor • The result queue is emptied • The back processor's register values are copied into the front processor's physical registers • The run-ahead cache is invalidated • Retires instructions in order • Store instructions update the data caches • Provides the precise state for exception handling
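
A rough sketch of the recovery steps listed above (hypothetical Python; structure and method names are my own): on a misprediction detected in the back core, the result queue is drained, the back core's registers are copied to the front core, and the run-ahead cache is invalidated.

# Toy sketch of the back core's branch-misprediction recovery (illustrative).
from collections import deque

class DCEPair:
    def __init__(self):
        self.result_queue = deque()   # FIFO between front and back cores
        self.front_regs = {}
        self.back_regs = {}
        self.runahead_cache = {}

    def front_retire(self, insn):
        # The front core pushes its retired instruction stream into the queue.
        self.result_queue.append(insn)

    def back_mispredict_recovery(self):
        # Squash everything in flight: drain the result queue, copy the
        # back core's architectural registers into the front core, and
        # invalidate the run-ahead cache, then restart on the correct path.
        self.result_queue.clear()
        self.front_regs = dict(self.back_regs)
        self.runahead_cache.clear()

pair = DCEPair()
pair.front_retire({"op": "add", "dest": "r1"})
pair.back_regs = {"r1": 42}
pair.back_mispredict_recovery()
assert pair.front_regs == {"r1": 42} and not pair.result_queue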

  11. Memory Hierarchy • Separate L1 data caches for the back and front processors • Shared L2 cache • An L1 D-cache miss in the front processor triggers a prefetch request for the L1 D-cache in the back processor • The back processor updates both L1 D-caches when store instructions retire
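
A small sketch of this cache interaction (hypothetical Python; a simple dictionary stands in for each cache): a front-core L1 miss also fills the back core's L1, and retiring stores in the back core update both L1 copies.

# Toy model of the dual L1 D-caches with a shared L2 (illustrative only).
class MemoryHierarchy:
    def __init__(self):
        self.front_l1 = {}
        self.back_l1 = {}
        self.l2 = {}           # shared L2 cache

    def front_load(self, addr):
        if addr not in self.front_l1:
            data = self.l2.get(addr, 0)
            self.front_l1[addr] = data
            # A front-core L1 miss also acts as a prefetch
            # into the back core's L1 D-cache.
            self.back_l1[addr] = data
        return self.front_l1[addr]

    def back_store_retire(self, addr, data):
        # When the back core retires a store it updates both L1 D-caches
        # and the shared L2, keeping the copies consistent.
        self.back_l1[addr] = data
        self.front_l1[addr] = data
        self.l2[addr] = data

mem = MemoryHierarchy()
mem.l2[0x40] = 7
mem.front_load(0x40)       # front L1 miss -> prefetch into the back L1
assert mem.back_l1[0x40] == 7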

  12. Simulation Methodology • Simulator infrastructure • SimpleScalar toolset • Baseline • MIPS-R10000-style superscalar processor • SPEC CPU 2000 benchmarks • Memory-intensive benchmarks

  13. DCE_R • DCE for transient fault tolerance • DCE with a redundancy check • Compares the non-invalidated results of the front processor with the results of the back processor • In case of a discrepancy • The branch-misprediction recovery mechanism provides fault tolerance by rewinding the processors • Only partial redundancy coverage
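
A sketch of this retirement-time check (hypothetical Python; the function and argument names are my own): front and back results are compared only when the front result is not INV, and a mismatch triggers the same rewind used for branch mispredictions.

# Toy sketch of DCE_R's redundancy check at retirement (illustrative only).
INV = object()   # stands for a result invalidated in the front core

def retire_with_check(front_result, back_result, rewind):
    """Retire one instruction in the back core under DCE_R.

    front_result -- value produced by the front core (or INV)
    back_result  -- value produced by the back core
    rewind       -- recovery callback (reuses misprediction recovery)
    """
    if front_result is INV:
        # Invalidated in the front core: no redundant copy to compare,
        # so this instruction is not covered (partial redundancy).
        return back_result
    if front_result != back_result:
        # Discrepancy: a transient fault hit one of the two copies.
        # Rewind both cores to a known-good state and re-execute.
        rewind()
        return None
    return back_result

retire_with_check(5, 5, rewind=lambda: None)              # results agree -> retire
retire_with_check(5, 6, rewind=lambda: print("rewind"))   # fault detected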

  14. Redundancy Checking Results • The percentage of retired instructions with redundancy checking

  15. DCE_FR • DCE_R with full redundancy coverage • An F_INV flag on each instruction indicates whether it was validated by the front processor • If invalidated, the back processor fetches the same instruction twice, once for normal execution and once for redundant execution • If validated, the front processor's result is used as the redundant copy • Changes in the renaming logic for redundant execution • Source operands access the rename table as usual • Destination registers obtain a new physical register but do not update the rename table • At the retire stage, destination registers are freed after the comparison
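
A sketch of how the F_INV flag steers redundancy in DCE_FR (hypothetical Python; the names are illustrative): invalidated instructions are executed twice in the back core, while validated ones reuse the front core's result as the redundant copy.

# Toy sketch of full redundancy coverage in DCE_FR (illustrative only).
def back_core_process(insn, f_inv, front_result, execute):
    """Process one instruction from the result queue under DCE_FR.

    f_inv        -- True if the front core invalidated this instruction
    front_result -- the front core's result (meaningful when not f_inv)
    execute      -- callable that runs the instruction on the back core
    """
    normal = execute(insn)
    if f_inv:
        # Invalidated in the front core: the back core fetches the same
        # instruction twice, once normally and once as the redundant copy.
        redundant = execute(insn)
    else:
        # Validated in the front core: its result serves as the redundant copy.
        redundant = front_result
    if normal != redundant:
        raise RuntimeError("transient fault detected -> rewind both cores")
    return normal

back_core_process("add r1, r2, r3", f_inv=True,
                  front_result=None, execute=lambda i: 7)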

  16. DCE_FR_t • DCE_FR with a renaming scheme • An additional renaming table (A_table) next to the original renaming table (R_table) • Invalidated normal execution: accesses and updates R_table • Invalidated redundant execution: accesses and updates A_table • Validated execution: accesses R_table, updates both R_table and A_table
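
The table usage can be summarized in a small sketch (hypothetical Python; the rename function and its arguments are invented for illustration):

# Toy sketch of the two rename tables in DCE_FR_t (illustrative only).
def rename(kind, r_table, a_table, arch_reg, new_phys):
    """Rename the destination of one instruction.

    kind -- 'inv_normal'    : invalidated instruction, normal execution
            'inv_redundant' : invalidated instruction, redundant execution
            'validated'     : instruction validated by the front core
    """
    if kind == "inv_normal":
        # Accesses and updates only the original rename table.
        r_table[arch_reg] = new_phys
    elif kind == "inv_redundant":
        # Accesses and updates only the additional table, so the redundant
        # copy never disturbs the normal mapping.
        a_table[arch_reg] = new_phys
    elif kind == "validated":
        # Accesses R_table, updates both tables so later redundant
        # executions see consistent mappings.
        r_table[arch_reg] = new_phys
        a_table[arch_reg] = new_phys

r_table, a_table = {}, {}
rename("validated", r_table, a_table, "r1", "p10")
rename("inv_redundant", r_table, a_table, "r2", "p11")
assert r_table == {"r1": "p10"} and a_table == {"r1": "p10", "r2": "p11"}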

  17. Performance Impact • DCE_R and DCE_FR perform better than Base, except for benchmarks with many branch mispredictions • DCE_R and DCE_FR are not much better than DCE • DCE_FR achieves a 23.5% performance improvement

  18. Energy Consumption • DCE_R and DCE_FR have high energy overhead

  19. Energy Overhead Problems • Wrong-path instructions • Large instruction window • A branch misprediction results in fetching and executing a large number of wrong-path instructions • Redundant execution for invalidated instructions • Must still access some structures (register file, access table, etc.) although it produces no useful results • DCE_FR has to dual-execute these instructions

  20. Energy Overhead Solutions-1 • FR_rs • Adapting the instruction window size • Reduce it for workloads with high misprediction rates • Keep it large for the others to exploit the large-window benefits • FR_rs_tl • Selective invalidation • Traversal-address loads are not invalidated • Only the special "load ra, x(ra)" instructions are recognized, since classifying other loads would require compiler support
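
A sketch of the selective-invalidation filter (hypothetical Python; the instruction encoding is simplified to a dictionary): a long-latency load is kept valid only when it has the traversal form "load ra, x(ra)", i.e. the same base and destination register.

# Toy sketch of selective invalidation for traversal-address loads (illustrative).
def should_invalidate(insn, l2_miss):
    """Decide whether a long-latency load is invalidated in the front core.

    insn is a dict like {"op": "load", "dest": "r4", "base": "r4", "offset": 8}.
    Loads of the form "load ra, x(ra)" (same base and destination register)
    are not invalidated, so the address chain they produce stays usable
    in the front core.
    """
    if not l2_miss:
        return False
    is_traversal = insn["op"] == "load" and insn["dest"] == insn["base"]
    return not is_traversal

assert should_invalidate({"op": "load", "dest": "r4", "base": "r5", "offset": 0},
                         l2_miss=True) is True
assert should_invalidate({"op": "load", "dest": "r4", "base": "r4", "offset": 8},
                         l2_miss=True) is False   # traversal load stays valid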

  21. Energy Overhead Solutions-2 • FR_rs_tl_in • Adaptively enables/disables the invalidation • Based on the workload's dynamic behavior • Invalidate if the workload is • Memory-intensive with moderate mispredictions, or • Memory-intensive with low mispredictions, or • Moderately memory-intensive with extremely low mispredictions • Otherwise do not invalidate
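
A sketch of the enable/disable decision (hypothetical Python; the numeric thresholds are placeholders chosen for illustration, not values from the paper):

# Toy sketch of the adaptive invalidation on/off decision (illustrative only;
# the classification thresholds below are placeholders, not the paper's values).
def enable_invalidation(mem_intensity, mispred_rate):
    """Return True if the front core should keep invalidating missed loads.

    mem_intensity -- e.g. L2 misses per 1000 instructions
    mispred_rate  -- e.g. branch mispredictions per 1000 instructions
    """
    memory_intensive = mem_intensity > 20             # placeholder threshold
    moderately_memory_intensive = mem_intensity > 10  # placeholder threshold
    low_mispred = mispred_rate < 2                    # placeholder threshold
    moderate_mispred = mispred_rate < 5
    extremely_low_mispred = mispred_rate < 0.5

    # Invalidate only when the prefetching benefit is likely to outweigh
    # the wasted invalidated work (decision rules from the slide).
    if memory_intensive and (moderate_mispred or low_mispred):
        return True
    if moderately_memory_intensive and extremely_low_mispred:
        return True
    return False

assert enable_invalidation(mem_intensity=30, mispred_rate=1) is True
assert enable_invalidation(mem_intensity=5, mispred_rate=10) is False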

  22. Performance Impact • Not much performance improvement over DCE_FR

  23. Energy Consumption • The optimizations significantly reduce the energy overhead

  24. Energy Overhead Solutions-3 • Reducing redundant execution • No redundant execution for non-invalidated instructions • Re-execute only loads and invalidated instructions • Switching between DCE and single-core execution • For workloads with high misprediction rates • Switch from dual-core mode to single-core mode
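
A sketch of these two optimizations (hypothetical Python; the misprediction threshold is a placeholder, not a value from the paper):

# Toy sketch of the remaining two energy optimizations (illustrative only).
def choose_mode(mispred_rate, threshold=8.0):
    """Pick dual-core (DCE) or single-core mode from dynamic behavior."""
    # High misprediction rates waste most of the front core's work,
    # so fall back to running the thread on a single core.
    return "single-core" if mispred_rate > threshold else "dual-core"

def needs_back_core_execution(insn_is_load, front_invalidated):
    """Re-execute in the back core only loads and invalidated instructions;
    other validated instructions are not redundantly executed."""
    return insn_is_load or front_invalidated

assert choose_mode(mispred_rate=12.0) == "single-core"
assert needs_back_core_execution(insn_is_load=False, front_invalidated=False) is False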

  25. Performance Impact • Executed instructions / retired instructions in the back processor • 41% on average

  26. Energy Optimization Results-1

  27. Energy Optimization Results-2

  28. Conclusion • DCE • Improves the performance of single-threaded applications on CMPs • Works best for memory-intensive workloads with a low misprediction rate • A dynamic scheme enables/disables DCE accordingly • DCE with full redundancy checking • 24.9% speedup, 87% energy overhead • DCE without the reliability requirement • 34% speedup, 31% energy overhead

  29. References • H. Zhou, "A Case for Fault-Tolerance and Performance Enhancement Using Chip Multiprocessors," IEEE Computer Architecture Letters, Sept. 2005. • H. Zhou, "Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window," Proc. 14th Int'l Conf. on Parallel Architectures and Compilation Techniques (PACT '05), 2005. • Y. Ma, H. Gao, M. Dimitrov, and H. Zhou, "Optimizing Dual-Core Execution for Power Efficiency and Transient-Fault Recovery," IEEE Transactions on Parallel and Distributed Systems, vol. 18, 2007.
