
Compiler Support for Trace-Level Speculative Multithreaded Architectures



Presentation Transcript


  1. Compiler Support for Trace-Level Speculative Multithreaded Architectures
     Antonio González (λ, ф), Carlos Molina (ψ), Jordi Tubella (ф)
     λ Intel Barcelona Research Center, Intel Labs - UPC, Barcelona, Spain, antoniox.gonzalez@intel.com
     ф Dept. Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona, Spain, {antonio,jordit}@ac.upc.edu
     ψ Dept. Enginyeria Informàtica, Universitat Rovira i Virgili, Tarragona, Spain, carlos.molina@urv.net
     INTERACT-9, San Francisco (USA) - February 13, 2005

  2. Trace Level Speculation
     • Two variants: with live output test, with live input test
     • Avoids serialization caused by data dependences
     • Skips multiple instructions in a row
     • Predicts values based on the past
     • Introduces penalties due to misspeculations

  3. Trace Level Speculation with Live Output Test
     [Diagram labels: ST; NST; Buffer; Live Output Update & Trace Speculation; Instruction Execution; Not Executed; Live Output Validation; Trace Misspeculation Detection & Recovery Actions]

  4. TSMA Block Diagram
     [Diagram components: Branch Predictor; I Cache; Fetch Engine; Decode & Rename; Functional Units; ST I Window; NST I Window; ST Ld/St Queue; NST Ld/St Queue; ST Reorder Buffer; NST Reorder Buffer; ST Arch. Register File; NST Arch. Register File; Data Cache; L1SDC; L1NSDC; L2NSDC; Verification Engine; Trace Speculation Engine; Look Ahead Buffer]

  5. Motivation
     • Two orthogonal issues
       • microarchitecture support for trace speculation
       • control and data speculation techniques
         • prediction of initial and final points
         • prediction of live output values
     • TSMA
       • does not introduce significant misspeculation penalties
       • does not impose constraints on building or predicting traces
     • This work focuses on
       • developing effective trace selection schemes for TSMA
       • based on static analysis that uses profiling data

  6. Outline
     • Trace Selection
     • Graph Construction
     • Graph Analysis
     • Performance Evaluation
     • Conclusions

  7. Graph Construction
     • Test input set of the analyzed benchmarks
     • Abstract data structure is built based on
       • control flow graph
       • data dependence graph
       • predictability of values
     • Each node represents one static instruction
       • type of instruction, number of dynamic executions
       • pointers and frequencies to succeeding instructions
       • pointers and frequencies to preceding instructions
       • predictability of live output values and dead values
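The node layout described on this slide could look roughly like the sketch below. The names `InstrNode` and `build_graph` and the input format (a profiled sequence of `(pc, kind)` pairs) are assumptions made for illustration, not part of the presented tool. Data dependences and value predictability are left out of the builder; the `predictability` field is only a placeholder for them.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class InstrNode:
    """One node per static instruction (hypothetical layout)."""
    pc: int
    kind: str                       # type of instruction, e.g. "alu" or "branch"
    executions: int = 0             # number of dynamic executions
    successors: Counter = field(default_factory=Counter)    # next pc -> frequency
    predecessors: Counter = field(default_factory=Counter)  # previous pc -> frequency
    predictability: float = 0.0     # placeholder, not filled in by build_graph


def build_graph(dynamic_trace):
    """Build the graph from a profiled sequence of (pc, kind) pairs."""
    graph = {}
    prev_pc = None
    for pc, kind in dynamic_trace:
        if pc not in graph:
            graph[pc] = InstrNode(pc, kind)
        node = graph[pc]
        node.executions += 1
        if prev_pc is not None:
            # Record the control-flow edge in both directions.
            graph[prev_pc].successors[pc] += 1
            node.predecessors[prev_pc] += 1
        prev_pc = pc
    return graph
```

Keeping edge frequencies on both ends of each edge lets later analysis walk forward (to extend a trace) or backward (to find producers) without rescanning the profile.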

  8. Graph Analysis
     • Two important issues
       • initial and final point of a trace
         • maximize trace length & minimize control flow misspeculations
       • predictability of live output values
         • prediction accuracy and utilization degree
     • Three basic heuristics
       • Procedure Trace Heuristic
       • Loop Trace Heuristic
       • Instruction Chaining Trace Heuristic

  9. Procedure Trace Heuristic
     • Procedures are relatively frequent
     • Computations that follow a subroutine are
       • fairly independent of the subroutine
       • except for return values and some memory locations
     • It is quite easy to predict the end of a trace

  10. Procedure Trace Heuristic
      [Diagram: control-flow graph with instructions I1-I14, a call, two branches (T/NT) and a return; callouts listed below]
      • Call instruction is marked as the initial point of the trace
      • Return address is marked as the final point of the trace
      • N instructions after the final point of the trace are checked; only significant paths are considered
      • For each instruction in a significant path, it is checked whether any of its operands are produced by any instruction of the procedure
      • In that case, the utilization degree of the value produced and the predictability of the producer instruction are evaluated
      • If it does not achieve a certain threshold, the trace is discarded
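The operand check on the N instructions after the return address can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `live_outputs_after` and the `(dest, sources)` instruction encoding are hypothetical, and the rule that a value overwritten after the return no longer counts is a refinement added here, not something the slide states.

```python
def live_outputs_after(produced, following, n):
    """Return the values produced by the procedure that are read by the
    first n instructions after the return address.

    produced:  set of registers/locations written inside the procedure
    following: list of (dest, sources) pairs for the instructions that
               follow the return address on one significant path
    """
    live = []
    overwritten = set()
    for dest, sources in following[:n]:
        for src in sources:
            # The operand comes from the procedure only if nothing after
            # the return has redefined it (assumption of this sketch).
            if src in produced and src not in overwritten and src not in live:
                live.append(src)
        if dest is not None:
            overwritten.add(dest)
    return live
```

Each value returned here would then be scored (utilization degree and predictability of its producer) and the trace kept only if the score reaches the threshold.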

  11. Loop Trace Heuristic
      • Traditional source of parallelization and speculation
      • We consider the whole execution of a loop as a trace
      • The objective is to detect loops whose live-output values after their whole execution are predictable

  12. Loop Trace Heuristic
      [Diagram: control-flow graph with instructions I1-I8, a branch (T/NT) and a backward branch; callouts listed below]
      • Backward branch target is marked as the initial point of the trace
      • Fall-through instruction of the same backward branch is marked as the final point of the trace
      • N instructions after the final point of the trace are checked; same behaviour as the procedure trace heuristic
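Marking the end points of loop traces is simple enough to show in a few lines. This is only a sketch: `loop_traces` and its input format are hypothetical, and the default `instr_size=4` assumes fixed 4-byte instructions, as on the Alpha ISA used in the evaluation.

```python
def loop_traces(branches, instr_size=4):
    """Return (initial_pc, final_pc) pairs for every backward branch.

    branches: iterable of (branch_pc, target_pc) pairs
    """
    return [
        (target, pc + instr_size)    # initial = branch target, final = fall-through
        for pc, target in branches
        if target <= pc              # backward branch: target does not lie ahead
    ]
```

For example, a branch at `0x20` that jumps back to `0x10` yields the candidate trace `(0x10, 0x24)`, while a forward branch is ignored.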

  13. Ichaining Trace Heuristic
      • Goal
        • to identify large sequences of dynamic instructions
        • besides procedures and loops
      • A trace is identified by:
        • initial point
        • final point
        • behaviour of conditional branches within the trace

  14. IChaining Trace Heuristic
      [Diagram: control-flow graph with instructions I1-I12 and three conditional branches (T/NT); callout listed below]
      • Taken and not-taken targets of all conditional branches are considered as initial points of a trace

  15. IChaining Trace Heuristic
      [Diagram: same control-flow graph; callouts listed below]
      • Given an initial point, a trace is extended by adding successive instructions
      • Every time a conditional branch is found, the trace is split into two

  16. IChaining Trace Heuristic [Diagram only: same control-flow graph with instructions I1-I12 and three conditional branches]

  17. IChaining Trace Heuristic [Diagram only: same control-flow graph with instructions I1-I12 and three conditional branches]

  18. IChaining Trace Heuristic
      [Diagram: same control-flow graph; callout listed below]
      • The final point is reached if the new instruction already belongs to the trace, the trace reaches a maximum size, or the new instruction is an indirect jump
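The trace-extension steps of the instruction chaining heuristic (initial points, splitting at conditional branches, stopping conditions) can be sketched as a small worklist algorithm. The `Instr` encoding and the function names are hypothetical. The sketch also assumes that an indirect jump ends the trace without being included in it, and it does not filter out infrequent paths; the slides do not settle either point here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Instr:
    kind: str                        # "plain", "cond_branch" or "indirect_jump"
    next_pc: Optional[int] = None    # fall-through (not-taken) successor
    taken_pc: Optional[int] = None   # taken target, conditional branches only


def initial_points(program):
    """Taken and not-taken targets of all conditional branches."""
    points = set()
    for instr in program.values():
        if instr.kind == "cond_branch":
            points.update((instr.taken_pc, instr.next_pc))
    return points


def extend_traces(program, start_pc, max_size):
    """Enumerate candidate traces from start_pc as (pcs, branch_outcomes)."""
    finished = []
    worklist = [(start_pc, (), ())]  # (next pc to add, pcs so far, outcomes so far)
    while worklist:
        pc, pcs, outcomes = worklist.pop()
        instr = program.get(pc)
        # Final point: instruction already in the trace, maximum size reached,
        # indirect jump, or (for this sketch) a pc outside the program.
        if (instr is None or pc in pcs or len(pcs) >= max_size
                or instr.kind == "indirect_jump"):
            finished.append((pcs, outcomes))
            continue
        pcs = pcs + (pc,)
        if instr.kind == "cond_branch":
            # Split the trace into two, one per branch direction.
            worklist.append((instr.taken_pc, pcs, outcomes + ("T",)))
            worklist.append((instr.next_pc, pcs, outcomes + ("NT",)))
        else:
            worklist.append((instr.next_pc, pcs, outcomes))
    return finished
```

Because every conditional branch doubles the number of candidates, a real implementation would likely prune paths below the significant-path frequency before extending them.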

  19. IChaining Trace Heuristic
      [Diagram: same control-flow graph; callouts listed below]
      • Live-output values are determined and their predictability is checked for every trace candidate (the higher of prediction accuracy and utilization degree)
      • A trace is considered predictable if the product of the percentages of all live-output values is above a certain threshold
      • If not, the final instruction is removed and the process starts again (until the trace reaches a minimum size)
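The predictability test and the trimming loop can be sketched as below. Both function names and the `live_outputs_of` callback (which returns one `(prediction_accuracy, utilization_degree)` pair per live output, each as a fraction between 0 and 1) are assumptions for illustration. The defaults of 25% and 16 instructions are the values listed on the profiling parameters slide.

```python
from math import prod


def trace_score(live_outputs):
    """Product, over all live outputs, of the higher of prediction
    accuracy and utilization degree."""
    return prod(max(accuracy, utilization) for accuracy, utilization in live_outputs)


def trim_until_predictable(trace, live_outputs_of, threshold=0.25, min_size=16):
    """Remove final instructions until the trace is predictable.

    Returns the (possibly shortened) trace, or None if it drops below
    min_size before its score rises above the threshold.
    """
    trace = list(trace)
    while len(trace) >= min_size:
        if trace_score(live_outputs_of(trace)) > threshold:
            return trace
        trace.pop()                  # drop the final instruction and retry
    return None
```

Taking a product means that every additional live output can only lower the score, so shorter traces with fewer live outputs pass the test more easily. That is why trimming from the end can rescue a candidate.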

  20. Trace Speculation Engine
      • Traces are communicated to the hardware
        • at program loading time
        • filling a special hardware structure (trace table)
      • Each entry of the trace table contains
        • initial PC
        • final PC
        • branch history
        • live-output values information
        • frequency counter
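A software model of the trace table might look like the sketch below. The entry fields follow the slide; everything else is an assumption made for illustration: the class names, indexing by the initial PC shifted right by two bits (4-byte instructions), and evicting the least frequent entry when a set is full. The default geometry (128 entries, 4-way set associative) is taken from the simulation parameters slide.

```python
from dataclasses import dataclass


@dataclass
class TraceEntry:
    initial_pc: int
    final_pc: int
    branch_history: str      # e.g. "TNT": one outcome per conditional branch
    live_outputs: dict       # live-output values information (format assumed)
    frequency: int = 0       # frequency counter


class TraceTable:
    def __init__(self, entries=128, ways=4):
        self.ways = ways
        self.sets = [[] for _ in range(entries // ways)]

    def _set_for(self, pc):
        # Hypothetical index function: drop the 2 low bits, then modulo.
        return self.sets[(pc >> 2) % len(self.sets)]

    def insert(self, entry):
        ways = self._set_for(entry.initial_pc)
        if len(ways) == self.ways:
            # Hypothetical replacement policy: evict the least frequent trace.
            ways.remove(min(ways, key=lambda e: e.frequency))
        ways.append(entry)

    def lookup(self, pc):
        """Return every stored trace that starts at pc."""
        return [e for e in self._set_for(pc) if e.initial_pc == pc]
```

`lookup` returns a list because several traces may share an initial PC and differ only in branch history, which the set-associative organization accommodates.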

  21. Experimental Framework
      • Simulator
        • Alpha version of the SimpleScalar Toolset
      • Benchmarks
        • Spec2000, ref input
      • Maximum Optimization Level
        • DEC C & F77 compilers with -non_shared -O5
      • Statistics Collected for 250 million instructions
        • Skipping an initial part of 500 million

  22. Simulation Parameters
      • Base microarchitecture
        • out of order machine, 4 instructions per cycle
        • I cache: 16KB, D cache: 16KB, L2 shared: 256KB
        • bimodal predictor
      • TSMA additional structures
        • each thread: I window, reorder buffer, register file
        • speculative data cache: 1KB
        • verification engine: up to 8 instructions per cycle
        • trace table: 128 entries, 4-way set associative
        • look ahead buffer: 128 entries

  23. Profiling Analysis Parameters
      • Value Predictors: Stride & Context
      • Minimum size of trace: 16
      • Maximum size of trace: 1024
      • Maximum number of live-outputs: 32
      • Threshold to consider a set of live outputs (LO) predictable: 25%
      • Significant path (minimum frequency): 10%

  24. Type of Speculated Instructions [Chart: percentage (0% to 100%) of speculated instructions by heuristic: Loop, Procedure, Ichaining]

  25. Type of Speculated Instructions
      • The share of procedure and loop traces is relatively low
      • But their sizes are significantly larger than those of Ichaining traces
      • Some statistics:
        • procedure trace size: 97.3
        • loop trace size: 215.8
        • Ichaining trace size: 36.4
        • average size of speculated traces: 65.7
        • average number of live output values: 16.4
        • branches within a trace (Ichaining): 5.3
        • traces with same initial PC (Ichaining): 1.57

  26. Type of Speculations [Chart: percentage (0% to 100%) of speculations in each category: Spec OK, Path OK; Spec OK, Path KO; Spec KO, Path OK; Spec KO, Path KO]

  27. Type of Speculations
      • Correct speculations: up to 70%
        • 65% for correctly predicted paths
        • 7% for incorrectly predicted paths (positive misprediction)
      • Incorrect speculations: close to 30%
        • 20% for correctly predicted paths
        • 8% for incorrectly predicted paths
      • This confirms that the proposed mechanism to predict paths and final points provides significant accuracy

  28. Speedup [Chart: speedup, vertical axis from 1.00 to 1.45]

  29. Speedup
      • Average speedup close to 38%
      • In spite of misspeculating close to 30%

  30. Type of Cycles of ST [Chart: percentage (0% to 100%) of ST cycles: ST can speculate vs. ST cannot speculate]

  31. Type of Cycles of ST
      • 25% of the time ST can speculate but does not find a trace to be speculated
        • performance could be improved with further analysis
      • 75% of the time ST cannot speculate because NST is executing and verifying a speculated trace
        • speculation may be performed only when NST catches up with ST

  32. Type of Cycles of NST [Chart: percentage (0% to 100%) of NST cycles: NST is executing instructions vs. NST is verifying instructions]

  33. Type of Cycles of NST
      • 65% of the time NST is executing traces speculated by ST
        • more speculated instructions imply more time executing instructions
      • 35% of the time NST is verifying instructions from the look ahead buffer
        • verifying instructions is faster than executing them

  34. Useless Cycles of ST [Chart: percentage of ST cycles, vertical axis from 0% to 100%]

  35. Useless Cycles of ST
      • Up to 20% of the time ST is executing instructions beyond the misspeculation point
        • ST is wasting up to 20% of the time executing instructions that will be discarded
      • In an ideal scenario this percentage would be negligible

  36. Branch Behaviour Distribution [Chart: percentage, vertical axis from 0% to 100%]

  37. Branch Behaviour Distribution
      • The instruction chaining heuristic does not provide many traces with the same initial point
        • despite the significant number of branches within a trace (5.3 on average)
      • The study concludes that the majority of branches almost always take the same direction
        • Close to 80% of the branches take the same direction more than 90% of the time

  38. Conclusions
      • Profile guided analysis to support TSMA
        • identifies large and highly predictable traces
        • reducing hardware complexity
      • Three basic heuristics are proposed
        • procedure trace heuristic
        • loop trace heuristic
        • instruction chaining heuristic
      • Results show
        • speedup of 38% with a 30% misprediction rate
      • Future work
        • aggressive trace level predictors
        • generalization to multiple threads

  39. Questions & Answers
      INTERACT-9, San Francisco (USA) - February 13, 2005
