
The potential for Software-only thread-level speculation

The potential for Software-only thread-level speculation. Depth Oral Presentation. Co-Supervisors: Prof. Greg Steffan, Prof. Cristina Amza. Committee Members: Prof. Tarek Abdelrahman, Prof. Michael Voss, Prof. Ken Sevcik. By: Chuck (Chengyan) Zhao, April 25, 2005.



Presentation Transcript


  1. The potential for Software-only thread-level speculation
     Depth Oral Presentation
     Co-Supervisors: Prof. Greg Steffan, Prof. Cristina Amza
     Committee Members: Prof. Tarek Abdelrahman, Prof. Michael Voss, Prof. Ken Sevcik
     By: Chuck (Chengyan) Zhao
     April 25, 2005

  2. Chip Multi-Processor (CMP) is now everywhere
     From all major companies:
     • IBM: Power 4, Power 5, …
     • Intel: Montecito, Smithfield, …
     • AMD: Dual-core Opteron
     • Sun: MAJC
     • Sony, Toshiba, IBM: Cell
     • …
     [Figure: chip photos labeled Power 4, dual-core Intel chip, dual-core Opteron, Cell]
     Abundant Chip Multiprocessors

  3. Improving Throughput with a Chip Multi-Processor
     [Figure: multiprogramming workload; applications running on processors (P) with caches (C), plotted against execution time]
     improve throughput

  4. Improving Single Application Performance with a Chip Multi-Processor
     [Figure: a single application on processors (P) with caches (C), plotted against execution time]
     need parallel threads to reduce execution time

  5. Using Chip Multi-Processor for improvements
     • Improve throughput for multi-programming workloads
       • Easy
       • CMP behaves like a normal MP
     • Improve single-application performance
       • Hard
       • Control and data dependences
       • Proposed approach: Thread-Level Speculation (TLS)
     CMP trade-offs

  6. Thread-Level Speculation (TLS)
     • Enable the compiler to create parallel threads despite ambiguous data dependences
     • Optimistically parallelize at compile time
     • Detect violations and recover at runtime
     [Flowchart: at compile time, parallelize without dependency detection; at run time, detect violation: if no, commit modification; if yes, squash and re-execute]
     Optimistic at compile time, detect and recover at runtime

  7. Example of Thread-Level Speculation
     Code to parallelize:
         for ( … ) {
             …
             *p = …;
             …
             … = … *q;
             …
         }
     Cannot be parallelized by parallelizing compilers
     • Uncertain dependence between *p and *q
     • Might be runtime or user-input dependent
     Break loop iterations into threads, explore uncertainty in each thread
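To make the uncertainty on slide 7 concrete, here is a minimal sketch that is not part of the talk. It assumes the two pointers are derived from index arrays supplied at run time; the names `run_loop`, `has_cross_iteration_raw`, `w`, and `r` are hypothetical. The second function checks only for read-after-write dependences between different iterations, which is the case the slide's `*p` / `*q` pair raises.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical loop in the shape of the slide's example: iteration i
 * writes through p = &data[w[i]] and then reads through q = &data[r[i]].
 * Whether two iterations touch the same element depends on the runtime
 * contents of w[] and r[], which a compiler generally cannot know. */
int run_loop(int *data, const size_t *w, const size_t *r, size_t n)
{
    assert(data && w && r);
    int sum = 0;
    for (size_t i = 0; i < n; i++) {
        int *p = &data[w[i]];
        const int *q = &data[r[i]];
        *p = (int)i + 1;   /* *p = ...;     */
        sum += *q;         /* ... = ... *q; */
    }
    return sum;
}

/* Returns 1 if some later iteration j reads the element that an earlier
 * iteration i wrote (a cross-iteration read-after-write), else 0. */
int has_cross_iteration_raw(const size_t *w, const size_t *r, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (r[j] == w[i])
                return 1;
    return 0;
}
```

With index arrays that never overlap, the iterations could safely run as parallel threads; with overlapping indices they could not, and only the runtime values decide which case applies.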

  8. How Thread-Level Speculation works
     [Figure: TLS execution-time diagram with labels "*p…", "…*q", "violation", "Recover", and a second "…*q"]
     exploit available thread-level parallelism
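The violation check pictured on slide 8 can be sketched in software as follows. This is an illustration under stated assumptions, not the talk's implementation: each speculative thread is assumed to log the addresses it reads and writes, and the check is conservative in that any overlap between an earlier thread's writes and a later thread's reads counts as a violation. The type `addr_set` and the functions `set_add`, `set_contains`, and `violates` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SET_CAP 64

/* A small fixed-capacity set of addresses touched by one thread. */
typedef struct {
    uintptr_t addr[SET_CAP];
    size_t    count;
} addr_set;

void set_add(addr_set *s, const void *p)
{
    assert(s->count < SET_CAP);
    s->addr[s->count++] = (uintptr_t)p;
}

int set_contains(const addr_set *s, const void *p)
{
    for (size_t i = 0; i < s->count; i++)
        if (s->addr[i] == (uintptr_t)p)
            return 1;
    return 0;
}

/* Conservative check: the logically later thread is treated as violated
 * if it read any address that the logically earlier thread wrote. */
int violates(const addr_set *earlier_writes, const addr_set *later_reads)
{
    for (size_t i = 0; i < later_reads->count; i++)
        if (set_contains(earlier_writes, (const void *)later_reads->addr[i]))
            return 1;
    return 0;
}
```

When `violates` returns 1, the later thread would be squashed and re-executed; otherwise its work can be committed.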

  9. Thread-Level Speculation quick summary
     • Benefits
       • Reduce inter-thread communication time among cores
       • Scale
       • New parallel programming model
     • Types of implementations
       • Hardware only
       • Combined hardware and software
       • Software only
     Thread-Level Speculation is good for Chip Multi-Processor

  10. Thread-Level Speculation Implementation Diagram
      [Diagram: Thread-Level Speculation, divided into SW-only approach, HW-only approach, and our approach]
      Overall picture of Thread-Level Speculation

  11. Thread-Level Speculation Implementation Comparison
      • Hardware-only approach
        • Lots of research
        • Good speedup in simulation
        • Nobody has built it yet
          • costly, risky
          • needs both HW + SW at the same time
        • Outcome
          • HW-only TLS looks promising
          • Significant hardware changes
      • Software-only approach: limited work, limited progress
        • Major problem: high overhead
          • Buffer memory for speculative state
          • Track each memory read + write: violation detection
          • Recover from failed speculation: re-execution
      Quick summary on HW-only and SW-only approaches

  12. Outline for the rest of the talk
      • Hardware TLS schemes
      • Software TLS schemes
      • Our scheme
        • Our goals
        • Starting point
        • Potential applications
      • Conclusion

  13. Hardware-only Thread-Level Speculation
      [Diagram: Thread-Level Speculation, divided into SW-only approach, HW-only approach, and our approach]
      Overall picture of HW-only TLS approach

  14. Hardware Thread-Level Speculation Schemes
      • Lots of hardware TLS research
        • CMU Stampede
        • Stanford Hydra
        • Wisconsin Multiscalar
        • UIUC IA-COMA
        • UMN Super-threaded architecture
        • …
      • Convergence of hardware schemes
        • Use cache to buffer speculative state
        • Extend cache coherence protocol to track data dependence
      Convergence of HW-only Thread-Level Speculation

  15. Hardware TLS Schemes: quick summary
      Result
      • TLS is promising
      • SPECint improvement: 30% - 100%
      • Depends on aggressiveness of the hardware support
      [Figure: CMP with hardware speculative buffer and enhanced cache coherence protocol; four processors (P), each with a cache (C) labeled Sp-state, plus one cache (C) labeled non-speculative]
      Convergence of HW-only Thread-Level Speculation

  16. Software-only Thread-Level Speculation
      [Diagram: Thread-Level Speculation, divided into SW-only approach, HW-only approach, and our approach]
      Overall picture of SW-only TLS approach

  17. Software-only Thread-Level Speculation Schemes
      • LRPD Test: UIUC
      • VM for dependence tracking: Spiros’s work, CMU
      • Cintra’s SW TLS: U Edinburgh
      • Problem of the software-only approach: high overhead
        • Try to reduce it
      Overview of SW-only TLS approach

  18. LRPD Test (UIUC)
      + implemented entirely in software
      – applies only to array-based code
      – no partial parallelism: the entire loop will re-execute sequentially if there is any dependence
      [Figure: execution-time diagram with labels "software dependence tracking" and "was parallel execution safe?"]
      Pros + Cons of LRPD

  19. Dependence tracking using Virtual Memory
      [Figure: execution-time diagram with labels "Software dependence tracking through VM pages", "Virtual Memory", and "Synchronize: transfer VM pages ?"]
      Pros + Cons of VM Tracking

  20. CMU Spiros’s approach: dependence tracking using Virtual Memory
      • Coarse-grain, software-only
      • Based on memory tracking
        • virtual memory page protection mechanism
        • uses software DSM (TreadMarks)
      • Synchronization through VM pages through cost analysis
      • Overhead is prohibitive
        • 2 sec (sequential) vs. 5 min (parallel)
      • Not a viable approach at this coarse granularity
      SW-TLS through VM Tracking is not attractive

  21. Cintra’s SW TLS: Memory tracking tuned for performance
      [Figure: execution-time diagram with label "Efficient tracking for array references"]
      Efficient but custom-made for arrays only

  22. Cintra’s software-only Thread-Level Speculation: quick summary
      • Features
        • Software simulation of an extended cache coherence protocol
        • Provides a speculative state transition table
        • Violation detection through speculative state comparison
        • Instruments each load and store
      • Pros + Cons:
        + advanced implementation of the LRPD test
        + implemented entirely in software
        + covers partial parallelism
        – hand-crafted code for performance
        – applies only to array-based code
      Summary of Cintra’s work

  23. Problems with Software Thread-Level Speculation
      • High overhead
        • Buffer speculative state
        • Track data dependence for all memory references
        • Re-execute in case of failed speculation
      • Potential speedup
        • largely unexplored
      • Possible directions for future research
        • Reduce overhead
        • Achieve speedup from TLS parallelism
      Summary of Software TLS

  24. Our current Thread-Level Speculation approach
      [Diagram: Thread-Level Speculation, divided into SW-only approach, HW-only approach, and our approach]
      Overall position for our SW TLS approach

  25. Long term future plan
      • Goals
        • Target
          • Chip Multi-Processors
          • Tightly-coupled MPs
        • Apply to general-purpose code: not only arrays
        • Minimize overhead
      • Capitalize on compiler analysis and optimizations
        • Idempotency analysis <done>
        • Synchronization and communications <done>
        • PPA: Probabilistic Pointer Analysis framework (Jeff’s work) <progressing>
        • Minimal backup and buffer retrieval analysis <progressing>
        • … more analysis we will invent <todo>
      • SW-only approach: room to improve
      • Starting point: highly efficient software checkpointing
      Goals and Plans

  26. Starting point: efficient software checkpointing
      • Checkpoints are placed at some program points in the source code
      • Buffer state changes between the current execution point and its latest checkpoint
      • Execution can always efficiently rewind to its latest checkpoint
      [Figure: program execution timeline with software checkpoints; labels "Buffer memory changes" and "Buffer more memory changes"]
      Introduce software checkpointing

  27. Potential use of Software checkpointing
      • Software rollback
        • automatic software TLS support
        • foundation of future automatic TLS parallelization
      • Debug
        • controlled rewind
      • Enhance application reliability
      • Speculative optimizations in uni-processor programs
        • larger window size
        • deep branch speculation
        • speculative code motion
      What can software checkpointing do

  28. Software checkpointing schemes
      • Compiler analysis
        • Local: Basic Block level
          • Back up only needed memory writes
          • Optimize to minimize
            • number of backups
            • number of buffer retrievals
        • Global: procedural level
          • Populate buffers through the control-flow graph
          • Iterate until the buffer stabilizes
        • Inter-procedural level
      • Potential approaches for software backup
        • Undo backup
        • Todo backup
      Build software checkpointing

  29. Undo backup
      • Compile-time analysis
      • Back up once
        • per distinct memory write
        • per Basic Block
      • Program continues to operate on non-backup memory
      • Action upon execution completion
        • Commit: trash buffer
        • Rollback: restore from buffer
      Undo backup properties

  30. Undo backup example
      Program, Basic Block level:
          …
          a = 10;
          b = 12;
          …
          c = a + b;
          …
      Undo backup memory: (&a, [a]) (&b, [b]) (&c, [c])
      Undo backup action: conflicts check
      • Y: restore undo memory
      • N: trash undo memory
      Next Basic Block …
      Undo backup process
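The undo backup process of slides 29 and 30 can be sketched as a small log of (address, old value) pairs. This is a minimal illustration rather than the presenter's implementation: the names `undo_log`, `undo_save`, `undo_commit`, `undo_rollback`, and `run_block` are hypothetical, the log is limited to `int` values, and the conflict flag is passed in by the caller instead of being detected.

```c
#include <assert.h>
#include <stddef.h>

#define UNDO_CAP 32

/* One saved (address, old value) pair, as in "(&a, [a])" on the slide. */
typedef struct { int *addr; int old; } undo_entry;
typedef struct { undo_entry e[UNDO_CAP]; size_t n; } undo_log;

/* Save the current value once, before the block overwrites it. */
void undo_save(undo_log *log, int *addr)
{
    assert(log->n < UNDO_CAP);
    log->e[log->n].addr = addr;
    log->e[log->n].old = *addr;
    log->n++;
}

/* Commit: the program already wrote to real memory, so just trash the log. */
void undo_commit(undo_log *log) { log->n = 0; }

/* Rollback: restore saved values in reverse order. */
void undo_rollback(undo_log *log)
{
    while (log->n > 0) {
        log->n--;
        *log->e[log->n].addr = log->e[log->n].old;
    }
}

/* The slide's basic block, with backups hoisted to the block entry.
 * `conflict` stands in for the result of the conflicts check. */
void run_block(int *a, int *b, int *c, int conflict)
{
    undo_log log;
    log.n = 0;
    undo_save(&log, a);
    undo_save(&log, b);
    undo_save(&log, c);
    *a = 10;
    *b = 12;
    *c = *a + *b;
    if (conflict)
        undo_rollback(&log);
    else
        undo_commit(&log);
}
```

Reads inside the block go straight to memory, which is why the undo scheme needs no buffer lookup on the read path.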

  31. Todo backup
      • Performed at runtime
      • Happens on every memory write inside a Basic Block
      • Each following read might need to retrieve from the buffer
      • Action upon completion (reverse of the Undo type)
        • Commit: write back from buffer
        • Rollback: trash buffer
      Todo backup properties

  32. Todo backup example
      Program, Basic Block level:
          …
          *p = a;
          *q = b;
          …
          … *p + *q;
          …
      Todo backup memory: (p, a) (q, b)
      Conflicts check
      • Y: trash todo backup
      • N: write todo backup to memory
      Next Block …
      Todo backup process
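The todo backup process of slides 31 and 32 can be sketched as a write buffer keyed by address. The sketch below is an illustration under assumptions, not the talk's code: the names `todo_buf`, `todo_write`, `todo_read`, `todo_commit`, and `todo_rollback` are hypothetical, values are limited to `int`, and the buffer uses a linear search for simplicity.

```c
#include <assert.h>
#include <stddef.h>

#define TODO_CAP 32

/* One buffered write, as in "(p, a)" on the slide: target address, new value. */
typedef struct { int *addr; int val; } todo_entry;
typedef struct { todo_entry e[TODO_CAP]; size_t n; } todo_buf;

/* Buffer a write instead of touching memory. A second write to the same
 * address (for example when p and q alias) overwrites the buffered value. */
void todo_write(todo_buf *buf, int *addr, int val)
{
    for (size_t i = 0; i < buf->n; i++) {
        if (buf->e[i].addr == addr) {
            buf->e[i].val = val;
            return;
        }
    }
    assert(buf->n < TODO_CAP);
    buf->e[buf->n].addr = addr;
    buf->e[buf->n].val = val;
    buf->n++;
}

/* Every read must check the buffer first, then fall back to memory. */
int todo_read(const todo_buf *buf, const int *addr)
{
    for (size_t i = 0; i < buf->n; i++)
        if (buf->e[i].addr == addr)
            return buf->e[i].val;
    return *addr;
}

/* Commit: write buffered values back to memory, then empty the buffer. */
void todo_commit(todo_buf *buf)
{
    for (size_t i = 0; i < buf->n; i++)
        *buf->e[i].addr = buf->e[i].val;
    buf->n = 0;
}

/* Rollback: memory was never modified, so just trash the buffer. */
void todo_rollback(todo_buf *buf) { buf->n = 0; }
```

The lookup in `todo_read` is the per-read cost that slide 33 lists as the drawback of the todo scheme; in exchange, the target address only needs to be known at run time.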

  33. Backup Comparison
      • Undo
        • Pro: fast
          • Few backups
          • No need to retrieve from buffer for reads
        • Con: memory address needs to be known statically
          • Scalar
          • Pointer to fixed location
      • Todo
        • Pro
          • Handles both scalar and general-purpose pointer cases
        • Con: slow
          • Backup once per memory write
          • Each following read needs to retrieve from the buffer
      • In reality: both types are used
      Pros + cons of undo and todo

  34. An example in reality: mixed mode
      Code to execute:
          int a, b, c;
          int * p, * q;
          …
          (d) a = 1;
          (d) b = 2;
          (d) *p = 5;
          …
          (u) c = a + b;
          …
          (u) … = * q;
          …
      Undo buffer: (&a, [a]) (&b, [b]) (&c, [c])
      Todo buffer: (p, 5)
      Combined-backup process in reality

  35. Selection of backups in reality
      • Combined approach
        • Undo: memory address known
          • Scalars
          • Pointers to fixed address
          • Compile-time analysis
        • Todo: memory address unknown
          • Normal pointers
          • Run-time analysis
      • Plan for implementation
        • put into SUIF, as an optimization pass
        • Minimize performance drop
      Use both types together in reality

  36. Conclusion
      • Thread-Level Speculation is compelling
        • Potentially large performance gains
      • Challenge
        • Software overhead
      • Limited SW TLS work
        • No previous SW TLS working on general-purpose programs
      • Killer advantage: compiler analyses
      • Modest starting point
        • efficient software checkpointing
      Summary

  37. Questions and Answers

  38. Concurrent HW-only Related Work
      Another view of HW-only Thread-Level Speculation schemes
