Compiler Support for Trace-Level Speculative Multithreading

Compiler Support for Trace-Level Speculative Multithreaded Architectures
Antonio González λ,ф, Carlos Molina ψ, Jordi Tubella ф
λ Intel Barcelona Research Center, Intel Labs - UPC, Barcelona, Spain, antoniox.gonzalez@intel.com
ф Dept. Arquitectura de Computadors, Universitat Politècnica de Catalunya, Barcelona, Spain, {antonio,jordit}@ac.upc.edu
ψ Dept. Enginyeria Informàtica, Universitat Rovira i Virgili, Tarragona, Spain, carlos.molina@urv.net
INTERACT-9, San Francisco (USA) - February 13, 2005

Trace Level Speculation • Avoids serialization caused by data dependences • Skips multiple instructions in a row • Predicts values based on the past • Introduces penalties due to misspeculations • Two variants: with live input test, with live output test

Trace Level Speculation with Live Output Test (figure: timeline of the ST and NST threads, with labels: live output update & trace speculation, buffer, instruction execution, not executed, live output validation, trace misspeculation detection & recovery actions)

TSMA Block Diagram (figure; components shown: branch predictor, I cache, fetch engine, decode & rename, functional units, trace speculation engine, look ahead buffer, verification engine; for each of ST and NST: I window, Ld/St queue, reorder buffer, architectural register file; data caches: L1SDC, L1NSDC, L2NSDC)

Motivation • Two orthogonal issues • microarchitecture support for trace speculation • control and data speculation techniques • prediction of initial and final points • prediction of live output values • TSMA • does not introduce significant misspeculation penalties • does not impose constraints to build or predict traces • This work focuses on • developing effective trace selection schemes for TSMA • based on static analysis that uses profiling data

Outline • Trace Selection • Graph Construction • Graph Analysis • Performance Evaluation • Conclusions

Graph Construction • Test input set of the analyzed benchmarks • Abstract data structure is built based on • control flow graph • data dependences graph • predictability of values • Each node represents each static instruction • type of instruction, number of dynamic executions • pointers and frequencies to succeeding instructions • pointers and frequencies to preceding instructions • predictability of live output values and dead values
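The per-instruction node described on this slide could be laid out roughly as follows. This is a minimal Python sketch, not the authors' implementation: the names `InstrNode` and `record_transition`, and the choice of dictionaries keyed by PC, are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class InstrNode:
    """One node per static instruction (hypothetical layout)."""
    pc: int
    kind: str                      # type of instruction, e.g. "alu", "branch", "call"
    exec_count: int = 0            # number of dynamic executions seen in the profile
    succs: Dict[int, int] = field(default_factory=dict)  # successor pc -> frequency
    preds: Dict[int, int] = field(default_factory=dict)  # predecessor pc -> frequency
    predictability: float = 0.0    # fraction of produced values predicted correctly


def record_transition(graph: Dict[int, InstrNode], src: int, dst: int) -> None:
    """Update edge frequencies for one profiled dynamic transition src -> dst."""
    graph[src].succs[dst] = graph[src].succs.get(dst, 0) + 1
    graph[dst].preds[src] = graph[dst].preds.get(src, 0) + 1
    graph[dst].exec_count += 1


graph = {pc: InstrNode(pc, "alu") for pc in (0, 4, 8)}
for src, dst in [(0, 4), (4, 8), (0, 4)]:
    record_transition(graph, src, dst)
```

Keeping both successor and predecessor frequencies lets a later analysis pass walk the graph in either direction and discard paths below a minimum frequency.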

Graph Analysis • Two important issues • initial and final point of a trace • maximize trace length & minimize control flow misspeculations • predictability of live output values • prediction accuracy and utilization degree • Three basic heuristics • Procedure Trace Heuristic • Loop Trace Heuristic • Instruction Chaining Trace Heuristic

Procedure Trace Heuristic • Procedures are relatively frequent • Computations that follow a subroutine • are fairly independent of the subroutine • except for return values and some memory locations • It is quite easy to predict the end of a trace

Procedure Trace Heuristic (example on a control flow graph with a call, branches and a return) • The call instruction is marked as the initial point of the trace • The return address is marked as the final point of the trace • N instructions after the final point of the trace are checked; only significant paths are considered • For each instruction in a significant path, it is checked whether any of its operands are produced by an instruction of the procedure • In that case, the utilization degree of the value produced and the predictability of the producer instruction are evaluated • If the value does not reach a certain threshold, the trace is discarded
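The check on the instructions that follow the return address can be sketched in Python as below. The function name, the input shapes, and the use of a single per-producer score compared against a per-value threshold are all assumptions for illustration; the slides only say that utilization degree and predictability are evaluated against "a certain threshold".

```python
from typing import Dict, List, Set


def procedure_trace_ok(
    proc_pcs: Set[int],
    followers: List[Dict],
    score: Dict[int, float],
    threshold: float,
) -> bool:
    """Decide whether a call..return trace is kept (hypothetical sketch).

    proc_pcs:  static instructions belonging to the procedure.
    followers: the N instructions after the return address; each is a dict
               with a "producers" list naming the pcs that produce its operands.
    score:     per-producer predictability score in [0, 1] (assumed given).
    """
    for instr in followers:
        for producer in instr["producers"]:
            # Only values flowing out of the procedure are live outputs.
            if producer in proc_pcs and score.get(producer, 0.0) < threshold:
                return False  # an unpredictable live output: discard the trace
    return True


proc = {100, 104, 108}
after_return = [{"producers": [104, 40]}, {"producers": [108]}]
keep = procedure_trace_ok(proc, after_return, {104: 0.9, 108: 0.8}, 0.25)
drop = procedure_trace_ok(proc, after_return, {104: 0.9, 108: 0.1}, 0.25)
```

Producer 40 lies outside the procedure, so it is ignored: only values produced inside the skipped trace need to be predicted.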

Loop Trace Heuristic • Traditional source of parallelization and speculation • We consider the whole execution of a loop as a trace • The objective is to detect loops whose live-output values after their whole execution are predictable

Loop Trace Heuristic (example on a control flow graph with a backward branch) • The backward branch target is marked as the initial point of the trace • The fall-through instruction of the same backward branch is marked as the final point of the trace • N instructions after the final point of the trace are checked; same behaviour as the procedure trace heuristic
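Marking the initial and final points of a loop trace can be sketched as follows. The function name and input format are hypothetical, and the fixed 4-byte instruction size is an assumption (it holds for Alpha, the ISA used in the evaluation).

```python
from typing import List, Tuple

INSTR_BYTES = 4  # assumption: fixed-size instructions, as on Alpha


def loop_trace_points(branches: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Return (initial_pc, final_pc) for every backward branch.

    branches: (branch_pc, taken_target_pc) pairs for conditional branches.
    """
    traces = []
    for branch_pc, target_pc in branches:
        if target_pc <= branch_pc:                 # backward branch closes a loop
            initial = target_pc                    # loop head starts the trace
            final = branch_pc + INSTR_BYTES        # fall-through ends the trace
            traces.append((initial, final))
    return traces


points = loop_trace_points([(0x120, 0x100), (0x130, 0x140)])
```

The forward branch at 0x130 is ignored; only the backward branch at 0x120 yields a candidate loop trace.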

Ichaining Trace Heuristic • Goal • to identify large sequences of dynamic instructions • besides procedures and loops • A trace is identified by: • initial point • final point • behaviour of conditional branches within the trace

IChaining Trace Heuristic (example on a control flow graph with three conditional branches) • The taken and not-taken targets of all conditional branches are considered as initial points of a trace

IChaining Trace Heuristic (example, continued) • Given an initial point, a trace is extended by adding successive instructions • Every time a conditional branch is found, the trace is split into two

IChaining Trace Heuristic (example, continued; figure: candidate traces on the control flow graph)

IChaining Trace Heuristic (example, continued) • The final point is reached if the new instruction already belongs to the trace, the trace reaches a maximum size, or the new instruction is an indirect jump
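The extension, splitting and termination rules of the instruction chaining heuristic can be sketched as a recursive enumeration. This is an illustrative Python sketch under assumptions: the function `extend`, the successor-map encoding and the "N"/"T" history string are hypothetical, not taken from the slides.

```python
from typing import Dict, List, Optional, Set, Tuple

Trace = Tuple[List[int], str]  # (instruction pcs, branch history as "N"/"T" chars)


def extend(pc: int, succs: Dict[int, List[int]], indirect: Set[int],
           max_size: int, path: Optional[List[int]] = None,
           hist: str = "") -> List[Trace]:
    """Enumerate candidate traces starting at pc (hypothetical sketch).

    succs maps a pc to [next] or, for a conditional branch, [not_taken, taken].
    indirect holds the pcs of indirect jumps, which always end a trace.
    """
    path = (path or []) + [pc]
    nxt = succs.get(pc, [])
    if len(path) >= max_size or not nxt:
        return [(path, hist)]          # maximum size reached, or no successor
    out: List[Trace] = []
    for i, n in enumerate(nxt):
        # A conditional branch splits the trace and records its direction.
        h = hist + ("NT"[i] if len(nxt) == 2 else "")
        if n in path or n in indirect:
            out.append((path, h))      # final point reached
        else:
            out.extend(extend(n, succs, indirect, max_size, path, h))
    return out


succs = {1: [2], 2: [3, 5], 3: [4], 4: [1], 5: [6]}
traces = extend(1, succs, indirect={6}, max_size=16)
```

In this toy graph the branch at instruction 2 splits the trace in two: the not-taken side ends when it loops back to instruction 1, and the taken side ends at the indirect jump 6.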

IChaining Trace Heuristic (example, continued) • Live-output values are determined and their predictability is checked for every trace candidate (the higher of prediction accuracy and utilization degree) • A trace is considered predictable if the product of the percentages of all live-output values is above a certain threshold • If not, the final instruction is removed and the process starts again (until the trace reaches a minimum size)
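The predictability test and the trimming loop can be sketched as below. The names are hypothetical, and treating the last writer of every register as a live output is a simplifying assumption (the real analysis also accounts for dead values). The defaults of 25% and 16 instructions are the values listed under the profiling analysis parameters.

```python
from math import prod
from typing import List, NamedTuple, Optional


class Instr(NamedTuple):
    dest: Optional[str]   # destination register, or None if no value is produced
    accuracy: float       # value prediction accuracy of this instruction, in [0, 1]
    utilization: float    # utilization degree as profiled, in [0, 1]


def trace_score(trace: List[Instr]) -> float:
    """Product of per-live-output scores (max of accuracy and utilization)."""
    last_writer = {}
    for ins in trace:
        if ins.dest is not None:
            last_writer[ins.dest] = ins   # later writes shadow earlier ones
    return prod(max(i.accuracy, i.utilization) for i in last_writer.values())


def trim_until_predictable(trace, threshold=0.25, min_size=16):
    """Drop final instructions until the trace is predictable or too short."""
    trace = list(trace)
    while len(trace) >= min_size:
        if trace_score(trace) > threshold:
            return trace
        trace.pop()                       # remove the final instruction, retry
    return None                           # no predictable trace of minimum size


candidate = [Instr("r1", 0.9, 0.5), Instr("r2", 0.2, 0.1)]
kept = trim_until_predictable(candidate, threshold=0.25, min_size=1)
rejected = trim_until_predictable(candidate, threshold=0.25, min_size=2)
```

Because the score is a product, every extra live output can only lower it, which is why shortening a trace may turn an unpredictable candidate into a predictable one.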

Trace Speculation Engine • Traces are communicated to the hardware • at program loading time • filling a special hardware structure (trace table) • Each entry of the trace table contains • initial PC • final PC • branch history • live-output values information • frequency counter
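A trace table with the fields listed on this slide might look like the sketch below. The geometry (128 entries, 4-way set associative) is the one given in the simulation parameters; the class names, the modulo indexing on the initial PC, the eviction of the least frequent entry and the choice of the most frequent hit are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

ENTRIES, WAYS = 128, 4
SETS = ENTRIES // WAYS  # 32 sets


@dataclass
class TraceEntry:
    initial_pc: int
    final_pc: int
    branch_history: str                 # e.g. "TNT", one char per branch in the trace
    live_outputs: Dict[str, int] = field(default_factory=dict)  # live-output info
    freq: int = 0                       # frequency counter


class TraceTable:
    def __init__(self) -> None:
        self.sets: List[List[TraceEntry]] = [[] for _ in range(SETS)]

    @staticmethod
    def index(pc: int) -> int:
        return (pc >> 2) % SETS         # assumption: word-aligned pcs, modulo indexing

    def insert(self, e: TraceEntry) -> None:
        ways = self.sets[self.index(e.initial_pc)]
        if len(ways) == WAYS:
            # Assumed policy: evict the least frequently used entry.
            ways.remove(min(ways, key=lambda w: w.freq))
        ways.append(e)

    def lookup(self, pc: int) -> Optional[TraceEntry]:
        hits = [w for w in self.sets[self.index(pc)] if w.initial_pc == pc]
        return max(hits, key=lambda w: w.freq) if hits else None


table = TraceTable()
table.insert(TraceEntry(0x1000, 0x1080, "TN", freq=3))
table.insert(TraceEntry(0x1000, 0x10C0, "TT", freq=7))
best = table.lookup(0x1000)
```

Several entries may share an initial PC and differ only in branch history, so the lookup has to pick one of them; here the frequency counter is used as the tie-breaker.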

Experimental Framework • Simulator • Alpha version of the SimpleScalar Toolset • Benchmarks • Spec2000, ref input • Maximum Optimization Level • DEC C & F77 compilers with -non_shared -O5 • Statistics Collected for 250 million instructions • Skipping an initial part of 500 million

Simulation Parameters • Base microarchitecture • out of order machine, 4 instructions per cycle • I cache: 16KB, D cache: 16KB, L2 shared: 256KB • bimodal predictor • TSMA additional structures • each thread: I window, reorder buffer, register file • speculative data cache: 1KB • verification engine: up to 8 instructions per cycle • trace table: 128 entries, 4-way set associative • look ahead buffer: 128 entries

Profiling Analysis Parameters • Value predictors: stride & context • Minimum size of trace: 16 • Maximum size of trace: 1024 • Maximum number of live outputs: 32 • Threshold to consider a set of live outputs (LO) predictable: 25% • Significant path (minimum frequency): 10%

Type of Speculated Instructions (chart: percentage of speculated instructions per heuristic: loop, procedure, Ichaining; y-axis from 0% to 100%)

Type of Speculated Instructions • The share of procedure and loop traces is relatively low • But their sizes are significantly larger than those of Ichaining traces • Some statistics: • procedure trace size: 97.3 • loop trace size: 215.8 • Ichaining trace size: 36.4 • average size of speculated traces: 65.7 • average number of live output values: 16.4 • branches within a trace (Ichaining): 5.3 • traces with same initial PC (Ichaining): 1.57

Type of Speculations (chart: percentage of speculations in each category: Spec OK / Path OK, Spec OK / Path KO, Spec KO / Path OK, Spec KO / Path KO; y-axis from 0% to 100%)

Type of Speculations • Correct speculations: up to 70% • 65% for correctly predicted paths • 7% for incorrectly predicted paths (positive misprediction) • Incorrect speculations: close to 30% • 20% for correctly predicted paths • 8% for incorrectly predicted paths • This confirms that the proposed mechanism to predict paths and final points provides significant accuracy

Speedup (chart: y-axis from 1.00 to 1.45)

Speedup • Average speedup close to 38% • In spite of misspeculating close to 30%

Type of Cycles of ST (chart: percentage of cycles in which ST can speculate vs. cannot speculate; y-axis from 0% to 100%)

Type of Cycles of ST • 25% of the time ST can speculate but does not find a trace to speculate • performance could be improved with further analysis • 75% of the time ST cannot speculate because NST is executing and verifying a speculated trace • speculation may be performed only when NST catches up with ST

Type of Cycles of NST (chart: percentage of cycles in which NST is executing instructions vs. verifying instructions; y-axis from 0% to 100%)

Type of Cycles of NST • 65% of the time NST is executing traces speculated by ST • more speculated instructions imply more time executing instructions • 35% of the time NST is verifying instructions from the look ahead buffer • verifying instructions is faster than executing them

Useless Cycles of ST (chart: y-axis from 0% to 100%)

Useless Cycles of ST • Up to 20% of the time ST is executing instructions beyond the misspeculation point • ST is wasting up to 20% of the time executing instructions that will be discarded • Ideal scenario would be when this percentage is negligible

Branch Behaviour Distribution (chart: y-axis from 0% to 100%)

Branch Behaviour Distribution • The instruction chaining heuristic does not provide many traces with the same initial point • despite the significant number of branches within a trace (5.3 on average) • The study concludes that the majority of branches almost always take the same direction • Close to 80% of the branches take the same direction more than 90% of the time

Conclusions • Profile-guided analysis to support TSMA • identifies large and highly predictable traces • reducing hardware complexity • Three basic heuristics are proposed • procedure trace heuristic • loop trace heuristic • instruction chaining heuristic • Results show • speedup of 38% with a misprediction rate of 30% • Future work • aggressive trace level predictors • generalization to multiple threads

INTERACT-9, San Francisco (USA) - February 13, 2005 Questions & Answers
