
Improving Value Communication for Thread-Level Speculation. Greg Steffan, Chris Colohan, Antonia Zhai, and Todd Mowry, School of Computer Science, Carnegie Mellon University.






Presentation Transcript


  1. Improving Value Communication for Thread-Level Speculation Greg Steffan, Chris Colohan, Antonia Zhai, and Todd Mowry School of Computer Science Carnegie Mellon University

  2. Multithreaded Machines Are Everywhere [figure: chips with multiple threads and multiple processors] • ALPHA 21464, Intel Xeon • SUN MAJC, IBM Power4, SiByte SB-1250 • How can we use them? Parallelism!

  3. Automatic Parallelization Proving independence of threads is hard: • complex control flow • complex data structures • pointers, pointers, pointers • run-time inputs How can we make the compiler’s job feasible? Thread-Level Speculation (TLS)

  4. Thread-Level Speculation [figure: epochs E1, E2, and E3 execute in parallel over time; a load that conflicts with an earlier epoch's store is detected, and the epoch retries the load] • exploit available thread-level parallelism

  5. Speculate [figure: epoch E2 loads *q from memory before epoch E1's store to *p completes] • good when p != q
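The speculate case on this slide can be illustrated with a toy simulation (function and variable names are illustrative, not the paper's API): epoch 2 loads *q early; if the addresses collide (p == q), it read a stale value, so the speculation fails and the load must be retried.

```python
def run_epochs(memory, p, q, store_value):
    """Return (value epoch 2 ends up with, whether speculation succeeded)."""
    speculative_load = memory[q]   # epoch 2 loads *q early, speculatively
    memory[p] = store_value        # epoch 1's store to *p happens afterwards
    if p == q:                     # dependence violation: squash and retry
        return memory[q], False    # the retried load sees the stored value
    return speculative_load, True

mem1 = {0x10: 1, 0x20: 2}
val_ok, ok = run_epochs(mem1, p=0x10, q=0x20, store_value=9)      # p != q
mem2 = {0x10: 1, 0x20: 2}
val_retry, ok2 = run_epochs(mem2, p=0x10, q=0x10, store_value=9)  # p == q
```

When p != q the speculative value stands; when p == q the epoch is squashed and re-executed, which is exactly the cost the later slides try to avoid.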

  6. Synchronize (and forward) [figure: epoch E2 waits (stalls) until epoch E1 stores *p and signals; the value is forwarded to E2's load of *q] • good when p == q
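The wait/signal/forward pattern on this slide can be sketched with Python threads (the helper names are assumptions; in TLS this is done by hardware and compiler-inserted instructions): the consumer epoch stalls at wait until the producer stores the value and signals.

```python
import threading

forwarded = {}                 # values forwarded from producer to consumer
signal = threading.Event()     # the "Signal" on the slide

def epoch1(memory, p):
    memory[p] = 42             # Store *p
    forwarded[p] = memory[p]   # forward the freshly stored value
    signal.set()               # Signal: the consumer may proceed

def epoch2(results, q):
    signal.wait()              # Wait: stall until epoch 1 signals
    results.append(forwarded[q])   # Load *q receives the forwarded value

memory = {0x10: 0}
results = []
consumer = threading.Thread(target=epoch2, args=(results, 0x10))
consumer.start()
epoch1(memory, 0x10)
consumer.join()
```

The stall inside `signal.wait()` is the cost of synchronization; the critical-forwarding-path slides that follow are about shrinking the time between the start of the producer epoch and `signal.set()`.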

  7. Overview: Reduce the Critical Forwarding Path [figure: a big critical forwarding path from Load X through Store X and Signal forces a long stall at the consumer's Wait; a small critical path decreases execution time]

  8. Predict [figure: instead of stalling (synchronize) or mis-speculating (speculate), epoch E2's load of *q takes its value from a value predictor] • good when p == q and *q is predictable

  9. Improving on Compile-Time Decisions [table: the compiler's choice of speculate or synchronize for each load can be refined by hardware into predict, synchronize, or a reduced critical forwarding path] • is there any potential benefit?

  10. Potential for Improving Value Communication [graph: U = un-optimized, P = perfect prediction; 4 processors] • efficient value communication is key

  11. Outline Our Support for Thread-Level Speculation • Compiler Support • Experimental Framework • Baseline Performance • Techniques for Improving Value Communication • Combining the Techniques • Conclusions

  12. Compiler Support (SUIF1.3 and gcc) 1) Where to speculate • use profile information, heuristics, loop unrolling 2) Transforming to exploit TLS • insert new TLS-specific instructions • synchronizes/forwards register values 3) Optimization • eliminate dependences due to loop induction variables • algorithm to schedule the critical forwarding path compiler plays a crucial role
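The induction-variable optimization in point 3 can be illustrated with a toy sketch (the names, the starting value, and the step of 4 are assumptions): instead of forwarding the induction variable i from epoch to epoch, each epoch recomputes it locally from its epoch number, removing the cross-epoch dependence entirely.

```python
def epoch_body(epoch_num, i0, step):
    # Local recomputation of the induction variable: no value needs to be
    # synchronized or forwarded from the previous epoch.
    i = i0 + epoch_num * step
    return i

# Four epochs of a loop whose induction variable advances by 4 each iteration
values = [epoch_body(e, i0=0, step=4) for e in range(4)]
```

Each epoch now depends only on its own epoch number, so the induction variable disappears from the critical forwarding path.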

  13. Experimental Framework [figure: four processors (P) with caches (C) connected by a crossbar] Benchmarks • from SPECint95 and SPECint2000, -O3 optimization Underlying architecture • 4-processor, single-chip multiprocessor • speculation supported by coherence Simulator • superscalar, similar to MIPS R10K • models all bandwidth and contention • detailed simulation!

  14. Compiler Performance [graph: S = sequential, T = TLS sequential, U = un-optimized, B = compiler optimized] • compiler optimization is effective

  15. Outline  Our Support for Thread-Level Speculation Techniques for Improving Value Communication • When Prediction is Best • Memory Value Prediction • Forwarded Value Prediction • Silent Stores • When Synchronization is Best • Combining the Techniques • Conclusions

  16. Memory Value Prediction [figure: epoch E2's load of *q would conflict with E1's store to *p; with value prediction, the load takes a predicted value instead and speculation need not fail] • avoid failed speculation if *q is predictable

  17. Value Predictor Configuration Aggressive hybrid predictor: • 1K x 3-entry context predictor and 1K-entry stride predictor • 2-bit, up/down, saturating confidence counters [figure: the load PC indexes both predictors; confidence selects the context value, the stride value, or no prediction] • predict only when confident
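The stride half of the hybrid predictor can be sketched as follows (the update policy and threshold here are illustrative assumptions; the slide specifies only the table sizes and 2-bit up/down saturating confidence counters):

```python
class StridePredictor:
    def __init__(self, confident_threshold=2):
        self.last = None        # last observed value for this load PC
        self.stride = 0         # current stride hypothesis
        self.confidence = 0     # 2-bit saturating counter: 0..3
        self.threshold = confident_threshold

    def predict(self):
        """Predict only when confident; otherwise make no prediction."""
        if self.last is None or self.confidence < self.threshold:
            return None
        return self.last + self.stride

    def update(self, actual):
        if self.last is not None:
            new_stride = actual - self.last
            if new_stride == self.stride:
                self.confidence = min(3, self.confidence + 1)  # count up
            else:
                self.confidence = max(0, self.confidence - 1)  # count down
                self.stride = new_stride
        self.last = actual

p = StridePredictor()
for v in [10, 14, 18, 22]:   # a steady stride of 4
    p.update(v)
prediction = p.predict()     # confident: predicts last + stride

cold = StridePredictor()
cold.update(5)
no_prediction = cold.predict()   # not yet confident: no prediction
```

Returning `None` when confidence is low corresponds to the slide's "no prediction" outcome: mispredicting under TLS squashes an entire epoch, so predicting only when confident is essential.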

  18. Throttling Prediction Only predict exposed loads: • hardware tracks which words are speculatively modified • used to determine whether a load is exposed [figure: within an epoch, a load of X after a store to X is not exposed; a load with no prior store to that word is exposed] • predict only exposed loads
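The exposed-load test can be sketched as a small per-epoch tracker (an illustrative model of the hardware bookkeeping, not its actual implementation):

```python
class EpochTracker:
    def __init__(self):
        self.modified = set()   # words this epoch has speculatively stored to

    def store(self, addr):
        self.modified.add(addr)

    def load_is_exposed(self, addr):
        # Exposed: no earlier store to this word within the epoch, so the
        # value comes from outside the epoch and is a prediction candidate.
        return addr not in self.modified

epoch = EpochTracker()
epoch.store(0x40)
covered = epoch.load_is_exposed(0x40)   # False: preceded by a store to X
exposed = epoch.load_is_exposed(0x44)   # True: no prior store this epoch
```

A covered load always sees its own epoch's store, so predicting it would be pointless; throttling prediction to exposed loads avoids wasting predictor capacity and confidence on them.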

  19. Memory Value Prediction [graph: predictability of exposed loads] • exposed loads are fairly predictable

  20. Memory Value Prediction [graph: B = baseline, E = predict exposed loads, V = predict violating loads] • effective if properly throttled

  21. Forwarded Value Prediction [figure: without prediction, epoch E2 stalls at Wait until E1 stores X and signals; with value prediction, E2's load of X proceeds immediately] • avoid the synchronization stall if X is predictable

  22. Forwarded Value Prediction [graph: predictability of forwarded values] • forwarded values are also fairly predictable

  23. Forwarded Value Prediction [graph: B = baseline, F = predict forwarded values, S = predict stalling values] • only predict loads that have caused stalls

  24. Exploiting Silent Stores [figure: epoch E1's store of X=5 writes the value X already holds; since the store is silent, epoch E2's earlier load of X=5 remains correct and speculation need not fail] • avoid failed speculation if the store is silent
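The silent-store check itself is simple, as this sketch shows (illustrative code, not the hardware's implementation): before performing a store, compare against the current memory value; if they match, the store changes nothing and can be suppressed, so it never triggers a dependence violation.

```python
def perform_store(memory, addr, value):
    """Return True if the store changed memory (i.e., it was not silent)."""
    if memory.get(addr) == value:
        return False            # silent store: suppress it entirely
    memory[addr] = value
    return True

mem = {0x10: 5}
was_real_1 = perform_store(mem, 0x10, 5)   # X already holds 5: silent
was_real_2 = perform_store(mem, 0x10, 7)   # genuinely changes X
```

In effect the silent store is converted into a load plus a comparison, which is why the next slides find it captures much of memory value prediction's benefit at lower cost.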

  25. Silent Stores [graph: fraction of stores that are silent] • silent stores are prevalent

  26. Impact of Exploiting Silent Stores [graph: B = baseline, SS = exploit silent stores] • most of the benefits of memory value prediction

  27. Outline  Our Support for Thread-Level Speculation Techniques for Improving Value Communication  When Prediction is Best • When Synchronization is Best • Hardware-Inserted Dynamic Synchronization • Reducing the Critical Forwarding Path • Combining the Techniques • Conclusions

  28. Hardware-Inserted Dynamic Synchronization [figure: without it, epoch E2's load of *q violates E1's store to *p; with dynamic synchronization, the load stalls until the store completes] • avoid failed speculation
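The basic policy can be sketched as a small state machine (an illustrative model; the slide's R and M variants, reset and minimum, are not modeled here): once a load PC has caused a violation, later instances of that load synchronize instead of speculating.

```python
class DynamicSync:
    def __init__(self):
        self.sync_loads = set()   # PCs of loads that have violated before

    def execute_load(self, load_pc, would_violate):
        if load_pc in self.sync_loads:
            return "synchronize"          # stall instead of speculating
        if would_violate:
            self.sync_loads.add(load_pc)  # remember this load for next time
            return "violated"             # squash and re-execute this time
        return "speculated"

dyn = DynamicSync()
first = dyn.execute_load(0x100, would_violate=True)    # pays one violation
second = dyn.execute_load(0x100, would_violate=False)  # now synchronizes
other = dyn.execute_load(0x200, would_violate=False)   # unrelated load
```

The hazard the results slide highlights is visible even here: once a load is marked, it stalls whether or not the dependence actually recurs, which is why dynamic synchronization "can help or hurt".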

  29. Hardware-Inserted Dynamic Synchronization B=Baseline, D=Sync. Violating Ld.s, R=D+Reset, M=R+Minimum overall average improvement of 9%

  30. Overview: Reduce the Critical Forwarding Path [figure, repeated from slide 7: a big critical forwarding path from Load X through Store X and Signal forces a long stall at the consumer's Wait; a small critical path decreases execution time]

  31. Prioritizing the Critical Forwarding Path • mark the input chain of the critical store • give marked instructions high issue priority [figure: in the sequence Load r1=X; op r2=r1,r3; op r5=r6,r7; op r6=r5,r8; Store r2,X; Signal, only Load r1=X and op r2=r1,r3 feed the critical Store r2,X; prioritization issues them ahead of the unrelated ops]
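Marking the input chain of the critical store amounts to computing its backward slice, which can be sketched as follows (the tuple representation of instructions is an illustrative assumption):

```python
def critical_slice(instructions, critical_src):
    """Return the indices of instructions feeding the critical store.

    Each instruction is (dest, srcs); the slice is found by walking
    backwards from the register the critical store reads.
    """
    needed = {critical_src}
    marked = set()
    for i in range(len(instructions) - 1, -1, -1):
        dest, srcs = instructions[i]
        if dest in needed:
            marked.add(i)           # this instruction is on the critical path
            needed.discard(dest)
            needed.update(srcs)     # its inputs now need producers too
    return marked

# The slide's example, ending in the critical "Store r2, X":
prog = [
    ("r1", ["X"]),         # Load r1 = X
    ("r2", ["r1", "r3"]),  # op   r2 = r1, r3   (feeds the critical store)
    ("r5", ["r6", "r7"]),  # op   r5 = r6, r7   (off the critical path)
    ("r6", ["r5", "r8"]),  # op   r6 = r5, r8   (off the critical path)
]
marked = critical_slice(prog, "r2")
```

Only the marked instructions are given high issue priority; the two unrelated ops can be deferred until after the store and signal.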

  32. Critical Path Prioritization [graph: extent of instruction reordering] • some reordering

  33. Impact of Prioritizing the Critical Path [graph: B = baseline, S = prioritizing critical path] • not much benefit, given the complexity

  34. Outline Our Support for Thread-Level Speculation  Techniques for Improving Value Communication Combining the Techniques • Conclusions

  35. Combining the Techniques Techniques are orthogonal with one exception: Memory value prediction and dynamic sync. • only synchronize memory values that are unpredictable • dynamic sync. logic checks prediction confidence • synchronize if not confident
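The combination rule above can be sketched in a few lines (the threshold and function name are illustrative assumptions): the dynamic synchronization logic checks the value predictor's confidence and only synchronizes when the predictor is not confident.

```python
def communicate(confidence, confident_threshold=2):
    """Pick how to obtain a violating load's value."""
    if confidence >= confident_threshold:
        return "predict"        # predictable: use the prediction, no stall
    return "synchronize"        # unpredictable: stall, don't mis-speculate

# Decisions across the range of a 2-bit confidence counter (0..3)
decisions = [communicate(c) for c in (0, 1, 2, 3)]
```

This keeps the two techniques from fighting each other: predictable values avoid the synchronization stall, while unpredictable values avoid repeated failed speculation.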

  36. Combining the Techniques [graph: B = baseline, A = all but dynamic sync., D = all, P = perfect prediction] • close to ideal for m88ksim and vpr

  37. Conclusions Prediction • memory value prediction: effective when throttled • forwarded value prediction: effective when throttled • silent stores: prevalent and effective Synchronization • dynamic synchronization: can help or hurt • hardware prioritization: ineffective, if the compiler is good • prediction is effective; synchronization has mixed results

  38. BACKUPS

  39. Goals 1) Parallelize general-purpose programs • difficult problem 2) Keep hardware support simple and minimal • avoid large, specialized structures • preserve the performance of non-TLS workloads 3) Take full advantage of the compiler • region selection, synchronization, optimization

  40. Potential for Further Improvement [graph]

  41. Pipeline Parameters

  42. Memory Parameters

  43. When Prediction is Best Predicting under TLS • only update predictor for successful epochs • cost of misprediction is high: must re-execute epoch • each epoch requires a logically-separate predictor Differentiation from previous work: • loop induction variables optimized by compiler • larger regions of code, hence larger number of memory dependences between epochs

  44. Benchmark Statistics: SPECint2000

  45. Benchmark Statistics: SPECint95

  46. Memory Value Prediction exposed loads are quite predictable

  47. Throttling Prediction Further [figure: on an exposed load, the cache tag indexes the Exposed Load Table, which records the load PC; on a dependence violation, the matching load PC is moved to the Violating Loads List] • only predict violating loads
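The two structures on this slide can be sketched together (an illustrative model of the bookkeeping, not the hardware tables themselves): exposed loads are remembered by cache tag, and a violation promotes the responsible load PC into the violating set that gates prediction.

```python
class ViolationThrottle:
    def __init__(self):
        self.exposed = {}       # cache tag -> load PC (Exposed Load Table)
        self.violating = set()  # load PCs that have caused violations

    def on_exposed_load(self, tag, load_pc):
        self.exposed[tag] = load_pc

    def on_violation(self, tag):
        # The violated line's tag identifies the exposed load responsible.
        if tag in self.exposed:
            self.violating.add(self.exposed[tag])

    def should_predict(self, load_pc):
        return load_pc in self.violating

t = ViolationThrottle()
t.on_exposed_load(0xA, 0x400)
before = t.should_predict(0x400)   # not yet violating: do not predict
t.on_violation(0xA)
after = t.should_predict(0x400)    # has violated: now a prediction candidate
```

Restricting prediction to loads that have actually violated concentrates predictor capacity on exactly the dependences that cost squashed epochs.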

  48. Forwarded Value Prediction synchronized loads are also predictable

  49. Silent Stores silent stores are prevalent

  50. Critical Path Prioritization significant reordering
