
Thread-Level Speculation: Towards Ubiquitous Parallelism
Greg Steffan, School of Computer Science, Carnegie Mellon University


Presentation Transcript


  1. Thread-Level Speculation: Towards Ubiquitous Parallelism Greg Steffan School of Computer Science Carnegie Mellon University

  2. Moore’s Law: the Original Version [plot: log transistors on a chip vs. time] → exponentially increasing resources

  3. Moore’s Law: the Popular Interpretation [plot: log performance vs. time] increase resources → increase performance?

  4. A Superposition of Innovations [plot: log of performance vs. time, the sum of gains from datapath size (8b, 16b, 32b, 64b) and instruction-level parallelism (ILP)] → ILP is running out of steam

  5. Why ILP is Running Out of Steam: cross-chip wire latency (in cycles); development cost; power density; probability of a defect → these problems must be addressed

  6. How Do We Sustain the Performance Curve? [plot: log of performance vs. time; datapath size (8b, 16b, 32b, 64b), then ILP, then “?”; we are here now] → what is the next big win for micro-architecture?

  7. A New Path: Thread-Level Parallelism [figure: chip multiprocessor (CMP) with processor cores (P) and caches (C)]. Tolerate cross-chip wire latency: localized wires. Lower development cost: stamp out processor cores. Lower power: turn off idle processors. Tolerate defects: disable any faulty processor. → many advantages

  8. Multithreading in Every Scale of Machine [figure: threads on simultaneous multithreading (Alpha 21464, Intel Xeon) and chip multiprocessors (IBM Power4, Sun MAJC, SiByte SB-1250), from desktops to supercomputers] → multithreading on a chip!

  9. Improving Performance with a Chip Multiprocessor. Multiprogramming workload: independent applications spread across the processors and caches, reducing overall execution time → improves throughput

  10. Improving Performance with a Chip Multiprocessor. Single application: execution time does not improve on its own → need parallel threads to reduce execution time

  11. How Do We Parallelize Everything? 1) Programmers write parallel code from now on: time-consuming and frustrating; very hard to get right; not a broad solution. 2) System parallelizes automatically: no burden on the programmer; can parallelize any application. → automatic parallelization is preferred

  12. Current Technique: Prove Independence. Independent: for (i = 0; i < N; i++) A[i] = 0; (A[0]←0, A[1]←0, A[2]←0). Dependent: for (i = 1; i < N; i++) A[i] = A[i-1]; (A[1]←A[0], A[2]←A[1], A[3]←A[2]). → need to fully understand the data access pattern
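The contrast on this slide can be made concrete: a loop whose iterations are truly independent gives the same result in any iteration order, while the dependent loop does not. A minimal sketch of that check (names invented, not from the talk):

```c
#include <string.h>

enum { N = 8 };

/* The independent loop: every iteration writes a distinct element. */
static void zero_forward(int a[N]) { for (int i = 0; i < N; i++) a[i] = 0; }
static void zero_reverse(int a[N]) { for (int i = N - 1; i >= 0; i--) a[i] = 0; }

/* The dependent loop: iteration i reads what iteration i-1 wrote. */
static void copy_forward(int a[N]) { for (int i = 1; i < N; i++) a[i] = a[i-1]; }
static void copy_reverse(int a[N]) { for (int i = N - 1; i >= 1; i--) a[i] = a[i-1]; }

/* Returns 1 if running the loop forward vs. backward over identical
 * inputs produces identical results (necessary for independence). */
static int order_insensitive(void (*fwd)(int[N]), void (*rev)(int[N])) {
    int x[N], y[N];
    for (int i = 0; i < N; i++) x[i] = y[i] = i;
    fwd(x);
    rev(y);
    return memcmp(x, y, sizeof x) == 0;
}
```

The zeroing loop passes this check; the copying loop fails it, because reversing the iteration order shifts the array instead of flooding it with A[0].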

  13. Ubiquitous Parallelization: How Close Are We? Compiler can parallelize portions of numeric programs (scientific, floating-point, array-based codes, usually written in Fortran) by proving independence. What about everything else? For general-purpose, integer codes written in C, C++, Java, etc., little (if any) success so far: proving independence is infeasible.

  14. The Main Culprit: Indirection. Indirect array references: for (i = 0; i < N; i++) A[i] = A[B[i]]; (A[0]←A[B[0]]?, A[1]←A[B[1]]?, A[2]←A[B[2]]) → need to know the values of B[]. Pointers: while (...) { ... = *q; *p = ...; } → need to know the targets of p and q.
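Why the values of B[] matter can be seen directly: iteration i of A[i] = A[B[i]] writes A[i] and reads A[B[i]], so iterations i and j conflict exactly when i == B[j]. The check below is a hypothetical sketch of that reasoning; a compiler cannot perform it, because B[]'s contents exist only at run time:

```c
/* Sketch (invented helper, not from the talk): iteration j of
 * A[i] = A[B[i]] reads A[B[j]]; iteration B[j] writes that element.
 * So a cross-iteration dependence exists whenever some B[j] names an
 * element written by a different iteration. */
static int loop_has_cross_iteration_dep(const int *B, int n) {
    for (int j = 0; j < n; j++)
        if (B[j] != j && B[j] >= 0 && B[j] < n)
            return 1;   /* iteration B[j] writes what iteration j reads */
    return 0;
}
```

With B[j] == j for all j the loop is independent; a single off-diagonal entry is enough to make it dependent.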

  15. Summary. We need the next big performance win: instruction-level parallelism will run out of gas. Multithreading will soon be everywhere: we need automatically-parallelized programs. The scope of current techniques is extremely limited: proving independence is infeasible. → A solution: Thread-Level Speculation (TLS)

  16. Thread-Level Speculation: the Basic Idea [figure: threads run in parallel; a Store *p by an earlier thread conflicts with a Load *q already performed by a later thread → violation → recover and re-execute, reducing execution time overall] → exploit available thread-level parallelism

  17. Outline The Software/Hardware Sweet Spot • Compiler Support • Industry-Friendly Hardware • Improving Value Communication • Conclusions

  18. Support for TLS: What Do We Need? Break programs into speculative threads: to maximize thread-level parallelism. Track data dependences: to determine whether speculation was safe. Recover from failed speculation: to ensure correct execution. → three key elements of every TLS system
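The three elements above can be sketched in miniature. The toy runtime below (all names invented; the real system does the last two elements in hardware) buffers speculative stores, tracks speculative loads, and squashes a later thread when an earlier thread's committed writes overlap its reads:

```c
#include <string.h>

enum { MAXOPS = 8 };

typedef struct {
    int reads[MAXOPS], nreads;                 /* addresses speculatively loaded */
    int waddr[MAXOPS], wval[MAXOPS], nwrites;  /* buffered (uncommitted) stores  */
} Epoch;

/* 1) Recover from failed speculation: stores are buffered, never
 * touching memory until the epoch commits, so discarding them is free. */
static void spec_store(Epoch *e, int addr, int val) {
    e->waddr[e->nwrites] = addr;
    e->wval[e->nwrites++] = val;
}

/* 2) Track data dependences: every speculative load is recorded.
 * A load first checks the epoch's own buffered writes. */
static int spec_load(Epoch *e, const int *mem, int addr) {
    for (int i = e->nwrites - 1; i >= 0; i--)
        if (e->waddr[i] == addr) return e->wval[i];
    e->reads[e->nreads++] = addr;
    return mem[addr];
}

/* 3) Detect violations: an earlier epoch wrote something a later
 * epoch already read, so the later epoch saw a stale value. */
static int violates(const Epoch *earlier, const Epoch *later) {
    for (int i = 0; i < earlier->nwrites; i++)
        for (int j = 0; j < later->nreads; j++)
            if (earlier->waddr[i] == later->reads[j]) return 1;
    return 0;
}

/* Commit drains the write buffer to memory, in epoch order. */
static void commit(Epoch *e, int *mem) {
    for (int i = 0; i < e->nwrites; i++) mem[e->waddr[i]] = e->wval[i];
    e->nreads = e->nwrites = 0;
}
```

A squashed epoch simply discards its Epoch record and re-executes, which is why buffering writes (rather than writing memory directly) is the key to cheap recovery.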

  19. Compiler Researchers do it in Software

  20. LRPD Test (Illinois): software dependence tracking; after the loop runs in parallel, check: was parallel execution safe? + implemented entirely in software; – applies only to array-based code; – no partial parallelism
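The flavor of software dependence tracking can be sketched with shadow arrays. This is a heavily simplified sketch in the spirit of the LRPD test, not the test itself (the real algorithm also handles privatization and reductions): each iteration marks per-element read/write flags, and afterwards any element both read and written signals that the speculative parallel run was unsafe:

```c
enum { SZ = 16 };

/* One shadow flag pair per array element (invented structure). */
typedef struct { unsigned char rd[SZ], wr[SZ]; } Shadow;

static void mark_read(Shadow *s, int i)  { s->rd[i] = 1; }
static void mark_write(Shadow *s, int i) { s->wr[i] = 1; }

/* Conservative post-execution check: an element that was both read
 * and written may carry a cross-iteration dependence, so the
 * parallel execution cannot be declared safe. */
static int was_safe(const Shadow *s) {
    for (int i = 0; i < SZ; i++)
        if (s->rd[i] && s->wr[i]) return 0;
    return 1;
}
```

A write-only loop like A[i] = 0 passes; A[i] = A[i-1] fails, because each interior element is written by one iteration and read by the next.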

  21. Architects do it in Hardware

  22. Multiscalar (Wisconsin) [figure: processor cores feeding an Address Resolution Buffer (ARB)]: compiler breaks the program into threads; + the ARB tracks dependences in hardware; – highly specialized for speculation

  23. Our Approach: Find the Sweet Spot. Compiler: + global view of control flow; – hard/impossible to understand data dependences. Hardware: + observes dynamic memory accesses; – operates on a small window of instructions. → leverage their respective strengths

  24. The Sweet Spot. Compiler: break programs into speculative threads (why: compiler has a global view of control flow). Hardware: track data dependences (why: software comparison of all addresses is infeasible); recover from failed speculation (why: software buffering of all writes is infeasible). Important: minimize additional hardware

  25. Outline The Software/Hardware Sweet Spot Compiler Support • Industry-Friendly Hardware • Improving Value Communication • Conclusions

  26. Compiler Support for TLS [flow: sequential source code → region selection (which loops? guided by profile information) → transformation and optimization (inserts TLS instructions) → MIPS executable]

  27. Simple Performance Model [figure: 4 processors with perfect dependence tracking]: 4 processors; each processor issues one instruction per cycle; no communication latency between processors → shows potential performance benefit
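Under this idealized model, time is just instruction count: each processor retires one instruction per cycle and communication is free, so the parallel time of a batch of threads is the longest thread in it. A small sketch of that arithmetic (the function is invented for illustration):

```c
enum { PROCS = 4 };   /* the slide's 4 single-issue processors */

/* Idealized parallel execution time in cycles: threads run in
 * batches of PROCS; each batch costs as many cycles as its longest
 * thread; batches run back-to-back. */
static int parallel_cycles(const int insts_per_thread[], int nthreads) {
    int t = 0;
    for (int base = 0; base < nthreads; base += PROCS) {
        int batch_max = 0;
        for (int i = base; i < base + PROCS && i < nthreads; i++)
            if (insts_per_thread[i] > batch_max)
                batch_max = insts_per_thread[i];
        t += batch_max;
    }
    return t;
}
```

Four equal 100-instruction threads take 100 cycles instead of 400, the model's ideal 4x; imbalanced threads show how the longest thread limits the benefit.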

  28. Potential Improvement [results chart] → significant impact on execution time

  29. Outline ✓ The Software/Hardware Sweet Spot ✓ Compiler Support Industry-Friendly Hardware • Improving Value Communication • Conclusions

  30. Goals 1) Handle arbitrary memory accesses • i.e. not just array references 2) Preserve single-thread performance • keep hardware support minimal and simple 3) Apply to any scale of multithreaded architecture • within a chip and beyond effective, simple, scalable

  31. Requirements. 1) Recover from failed speculation: buffer speculative writes away from memory. 2) Track data dependences: detect data dependence violations. → each has several implementation options

  32. Recover from Failed Speculation: Option 1. Augment the store buffer: + a common device in superscalar processors (facilitates non-blocking stores); – too small

  33. Recover from Failed Speculation: Option 2. Add a new dedicated buffer: + can design an efficient speculation mechanism; – want to avoid large speculation-specific structures

  34. Recover from Failed Speculation: Option 3. Augment the cache: + very common structure; + relatively large → just maintain single-thread performance

  35. Tracking Data Dependences: Option 1. Add a dedicated “3rd-party” dependence tracker [figure: the tracker observes Store X and Load X and detects the violation]: – want to avoid large speculation-specific structures; – does not scale

  36. Tracking Data Dependences: Option 2. Detection at the producer [figure: the consumer sends each load address to the producer, which detects the violation]: producer is informed of all addresses consumed; – awkward: producer must notify the consumer of any violation

  37. Tracking Data Dependences: Option 3. Detection at the consumer [figure: the producer sends each store address to the consumer, which detects the violation]: consumers are informed of all addresses produced → similar to invalidation-based cache coherence!

  38. Augmenting the Cache [figure: a standard cache line holds a tag, state, and data]

  39. Augmenting the Cache [figure: each line adds an SL (speculatively loaded) bit and an SM (speculatively modified) bit] → modest amount of extra space

  40. Augmenting the Cache [figure: valid lines with various SL/SM bits set during speculation] when speculation fails…

  41. Augmenting the Cache [figure: lines with SL or SM set become invalid] …can quickly discard speculative state

  42. Extending Cache Coherence [figure: epoch 5 speculatively loads X; epoch 4 stores X, sending “invalidate X, from epoch 4”; since 4 < 5, the consumer detects a violation] → straightforward extension of cache coherence
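Slides 38-42 can be condensed into a few lines of logic. The sketch below (field and function names invented) shows the SL/SM bits on a cache line, the violation rule from the coherence extension (an invalidation from a logically-earlier epoch hitting a speculatively-loaded line), and the quick discard of speculative state:

```c
#include <stdbool.h>

/* A cache line augmented as on slide 39. */
typedef struct {
    bool valid;
    bool SL;     /* speculatively loaded   */
    bool SM;     /* speculatively modified */
    int  tag;
} Line;

/* Coherence extension (slide 42): an incoming invalidation carries
 * the storing thread's epoch number.  If this epoch speculatively
 * loaded the line and the store came from an earlier epoch, the
 * load read a stale value: violation. */
static bool on_invalidate(Line *l, int store_epoch, int my_epoch) {
    bool violation = l->valid && l->SL && store_epoch < my_epoch;
    l->valid = false;                       /* normal invalidation */
    return violation;
}

/* Recovery (slide 41): discard speculative state by invalidating
 * every line with SL or SM set (a flash-clear in real hardware). */
static void squash(Line *cache, int nlines) {
    for (int i = 0; i < nlines; i++)
        if (cache[i].SL || cache[i].SM) {
            cache[i].valid = false;
            cache[i].SL = cache[i].SM = false;
        }
}
```

Note how the epoch comparison (4 < 5 on the slide) is the only addition to an ordinary invalidation-based protocol; non-speculative lines are untouched by a squash, preserving single-thread performance.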

  43. Detailed Performance Model [figure: processors and caches connected by a crossbar]. Underlying architecture: single-chip multiprocessor implementing speculative coherence. Simulator: superscalar, a modernized MIPS R10K; models all bandwidth and contention → detailed simulation!

  44. Will it Work at All of These Scales? [figure: threads across simultaneous multithreading, chip multiprocessors, desktops, and supercomputers] yes: coherence scales up and down

  45. Performance on Multi-Chip Systems [results chart] → our scheme is scalable

  46. Performance on General-Purpose Applications [results chart] → significant performance improvements

  47. Outline ✓ The Software/Hardware Sweet Spot ✓ Compiler Support ✓ Industry-Friendly Hardware Improving Value Communication • Conclusions

  48. Speculate [figure: Load *q runs ahead of Store *p, both going to memory] → good when p != q

  49. Synchronize (and forward) [figure: the consumer waits (stalls) until the producer's Store *p and Signal, then performs Load *q] → good when p == q
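When the compiler expects p == q, stalling and forwarding beats speculating and getting squashed. The slide's wait/signal pattern is ordinary synchronization; a minimal sketch with POSIX threads (the variable and function names are invented; `shared` stands in for the location both *p and *q name):

```c
#include <pthread.h>

static pthread_mutex_t m  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cv = PTHREAD_COND_INITIALIZER;
static int shared, ready;   /* *p and *q alias this location */

/* Producer thread: Store *p, then Signal. */
static void *producer(void *arg) {
    (void)arg;
    pthread_mutex_lock(&m);
    shared = 42;
    ready = 1;
    pthread_cond_signal(&cv);
    pthread_mutex_unlock(&m);
    return 0;
}

/* Consumer: Wait (stall) until signaled, then Load *q sees the
 * forwarded value instead of a stale one. */
static int consume(void) {
    pthread_mutex_lock(&m);
    while (!ready)
        pthread_cond_wait(&cv, &m);
    int v = shared;
    pthread_mutex_unlock(&m);
    return v;
}
```

The stall is the price: as the next slide notes, the wait-to-signal distance becomes a critical path, so shrinking it (signaling as soon as the forwarded value is ready) directly reduces execution time.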

  50. Reduce the Critical Forwarding Path [figure: overview; shrinking the Wait → Load X → Store X → Signal critical path turns a big critical path into a small one, shortening execution time] → decreases execution time
