Lecture on High Performance Processor Architecture (CS05162)

Value Prediction and Instruction Reuse
An Hong, han@ustc.edu.cn
Fall 2007
University of Science and Technology of China, Department of Computer Science and Technology

Outline
• What Are Data Hazards, and How Are They Solved?
• What Makes Data Speculation Possible?
• Value Prediction (VP)
• Instruction Reuse (IR)
CS of USTC AN Hong

A Taxonomy of Speculative Execution Techniques
What can we speculate on?
• Speculative Execution
  • Control Speculation
    • Branch Direction (binary: taken/not-taken)
    • Branch Target (multi-valued: anywhere in the address space)
  • Data Speculation
    • Data Location
      • Aliased (binary)
      • Address (multi-valued) (prefetching)
    • Data Value (multi-valued)
What makes speculation possible?

What's the Problem? Data Hazards

Time              T1   T2   T3   T4   T5   T6   T7   T8   T9
Content of $2     10   10   10   10   10   -20  -20  -20  -20
sub $2, $1, $3    IF   ID   EX   ME   WB
and $12, $2, $3        IF   ID   EX   ME   WB
or $13, $6, $2              IF   ID   EX   ME   WB
add $14, $5, $4                  IF   ID   EX   ME   WB
sw $15, 100($6)                       IF   ID   EX   ME   WB

Data hazard: the value of $2 is needed before the WB stage writes it back to the register file.

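Data Dependences
Within a basic block of a program, the following kinds of data dependence can occur:
• True data dependence: data flows between the two instructions, so they are genuinely dependent
  • RAW (Read After Write) dependence: for instructions i and j,
    (1) if instruction j uses a result produced by instruction i, then j is RAW-dependent on i; or
    (2) if j is RAW-dependent on i and instruction k is RAW-dependent on j, then k is RAW-dependent on i
• False data dependence (also called name dependence): the register or memory location an instruction uses is called a name. Two instructions use the same name but no data flows between them. There are two cases:
  • WAR (Write After Read) dependence: for instructions i and j, if i executes first and j writes a name that i reads, then j is WAR-dependent on i (also called anti-dependence)
  • WAW (Write After Write) dependence: for instructions i and j, if i and j write the same name, then j is WAW-dependent on i (also called output dependence)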
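The three definitions above can be sketched as a small classifier. This is only an illustration: the `Instr` record and the `classify` helper are hypothetical names, only direct (not transitive) dependences are reported, and intervening writes between i and j are ignored.

```python
from typing import NamedTuple

class Instr(NamedTuple):
    text: str
    writes: frozenset  # names (registers) written
    reads: frozenset   # names (registers) read

def instr(text, writes, reads):
    return Instr(text, frozenset(writes), frozenset(reads))

def classify(i: Instr, j: Instr) -> set:
    """Direct dependences of the later instruction j on the earlier instruction i."""
    kinds = set()
    if i.writes & j.reads:
        kinds.add("RAW")  # j reads a name that i writes (true dependence)
    if i.reads & j.writes:
        kinds.add("WAR")  # j writes a name that i reads (anti-dependence)
    if i.writes & j.writes:
        kinds.add("WAW")  # both write the same name (output dependence)
    return kinds

sub = instr("sub $2, $1, $3", {"$2"}, {"$1", "$3"})
and_ = instr("and $12, $2, $3", {"$12"}, {"$2", "$3"})
# The next two instructions are made up for illustration only.
add = instr("add $1, $5, $4", {"$1"}, {"$5", "$4"})
or_ = instr("or $2, $6, $7", {"$2"}, {"$6", "$7"})

print(classify(sub, and_))  # RAW on $2
print(classify(sub, add))   # WAR on $1
print(classify(sub, or_))   # WAW on $2
```

Only the RAW case carries a value from i to j; the WAR and WAW cases disappear if the later instruction is given a different name, which is why they are called false dependences.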

Data Hazard on r1: Read after write hazard (RAW)
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11

Data Hazard on r1: Read after write hazard (RAW)
• Dependences backwards in time are hazards
[Pipeline diagram: time (clock cycles) across stages IF, ID/RF, EX, MEM, WB; instructions listed in program order]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11

Data Hazard Solution (1): Stall Pipeline
• Introduce bubbles (stalls) into the pipeline
• Stall the dependent instructions until the instruction that causes the dependence leaves the pipeline

Time              T1   T2   T3   T4   T5   T6   T7   T8   T9   T10  T11  T12
sub $2, $1, $3    IF   ID   EX   ME   WB
and $12, $2, $3        IF   **   **   **   ID   EX   ME   WB
or $13, $6, $2                             IF   ID   EX   ME   WB
add $14, $5, $4                                 IF   ID   EX   ME   WB
sw $15, 100($6)                                      IF   ID   EX   ME   WB
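The stall timing can be reproduced with a few lines of arithmetic. The sketch below is a simplification under stated assumptions: a 5-stage in-order pipeline with no forwarding, a consumer may enter ID only in the cycle after its producer's WB, and the next instruction is fetched in the cycle the previous one enters ID. The function name `schedule` and the tuple layout are hypothetical.

```python
def schedule(program):
    """program: list of (text, dest_register_or_None, source_registers).
    Returns a list of (text, IF cycle, ID cycle, WB cycle)."""
    ready = {}    # register -> first cycle in which a reader may be in ID
    rows = []
    prev_id = 0
    for text, dest, srcs in program:
        fetch = prev_id if rows else 1
        # Decode waits for the previous stage and for every source register.
        decode = max([fetch + 1] + [ready.get(r, 0) for r in srcs])
        writeback = decode + 3            # ID -> EX -> ME -> WB
        if dest is not None:
            ready[dest] = writeback + 1   # assumption: readable the cycle after WB
        rows.append((text, fetch, decode, writeback))
        prev_id = decode
    return rows

program = [
    ("sub $2, $1, $3", "$2", ("$1", "$3")),
    ("and $12, $2, $3", "$12", ("$2", "$3")),
    ("or $13, $6, $2", "$13", ("$6", "$2")),
    ("add $14, $5, $4", "$14", ("$5", "$4")),
    ("sw $15, 100($6)", None, ("$15", "$6")),
]

for text, fetch, decode, writeback in schedule(program):
    stalls = decode - fetch - 1
    print(f"{text:18} IF=T{fetch} ID=T{decode} WB=T{writeback} stalls={stalls}")
```

Under these assumptions the `and` instruction waits three cycles and the last instruction writes back in T12.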

Data Hazard Solution (2): Forwarding (or Bypassing)
• "Forward" the result from one stage to another
[Pipeline diagram: time (clock cycles) across stages IF, ID/RF, EX, MEM, WB; instructions listed in program order]
add r1,r2,r3
sub r4,r1,r3
and r6,r1,r7
or r8,r1,r9
xor r10,r1,r11

Forwarding: What about Loads?
• Dependences backwards in time are hazards
• Can't solve with forwarding
• Must delay/stall the instruction dependent on the load
[Pipeline diagram: time (clock cycles) across stages IF, ID/RF, EX, MEM, WB]
lw r1,0(r2)
sub r4,r1,r3

Forwarding (or Bypassing): What about Loads?
• Dependences backwards in time are hazards
• Can't solve with forwarding
• Must delay/stall the instruction dependent on the load
[Pipeline diagram: time (clock cycles) across stages IF, ID/RF, EX, MEM, WB; a stall is inserted before sub]
lw r1,0(r2)
sub r4,r1,r3

Data Hazard Solution (3): Out-of-Order Execution
• Need to detect data dependences at run time
• Need for precise exceptions:
  • Out-of-order execution, in-order completion

Time              T1   T2   T3   T4   T5   T6   T7   T8   T9   T10  T11  T12
sub $2, $1, $3    IF   ID   EX   ME   WB
add $14, $5, $4        IF   ID   EX   ME   WB
sw $15, 100($6)             IF   ID   EX   ME   WB
and $12, $2, $3                  IF   **   ID   EX   ME   WB
or $13, $6, $2                             IF   ID   EX   ME   WB

Data Hazard Solution (4): Data Speculation
• In a wide-issue processor, e.g. 8 ~ 12 instructions per clock cycle
  • The issue width is larger than a basic block (5 ~ 7 instructions)
  • Multiple branches: use multiple-branch prediction (e.g. trace cache)
  • Multiple data dependence chains: very hard to execute them in the same clock cycle
• Value speculation is primarily used to resolve data dependences:
  • In the same clock cycle
  • Long latency operations (e.g. load operations)

Data Hazard Solution (4): Data Speculation
• Why is speculation useful?
  • Speculation lets all of these instructions run in parallel on a superscalar machine:
      addq $3 $1 $2
      addq $4 $3 $1
      addq $5 $3 $2
• What is value prediction?
  • Predict the values produced by instructions before they are executed
• Compare:
  • Branch prediction
    • Eliminates control dependences
    • The predicted data has just two values (taken or not taken)
  • Value prediction
    • Eliminates data dependences
    • The predicted data is drawn from a much larger range of values
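A rough way to see the benefit for the three addq instructions is to compute the dataflow-limited cycle count with and without a prediction for $3. The sketch assumes unit latency, unlimited issue width, and a correct prediction; `dataflow_cycles` is a hypothetical helper, not part of any tool.

```python
def dataflow_cycles(program, predicted=()):
    """program: list of (dest, sources) in program order, unit latency.
    predicted: destination registers whose values are assumed correctly predicted."""
    ready = {}   # register -> cycle after which its value is available
    finish = 0
    for dest, srcs in program:
        start = max([ready.get(r, 0) for r in srcs], default=0)
        done = start + 1
        # A predicted value is available to consumers immediately.
        ready[dest] = 0 if dest in predicted else done
        finish = max(finish, done)
    return finish

program = [
    ("$3", ("$1", "$2")),  # addq $3 $1 $2
    ("$4", ("$3", "$1")),  # addq $4 $3 $1
    ("$5", ("$3", "$2")),  # addq $5 $3 $2
]

print(dataflow_cycles(program))                    # 2: both consumers wait for $3
print(dataflow_cycles(program, predicted={"$3"}))  # 1: all three run in parallel
```

A misprediction would add recovery cost that this sketch does not model.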

Data Hazard Solution (4): Data Speculation
• Value locality: the likelihood of a previously seen value recurring repeatedly within a storage location
  • Observed in any storage location
    • Registers
    • Cache memory
    • Main memory
  • Most work focuses on values stored in registers, to break potential data dependences: register value locality
• Why value prediction?
  • The results of many instructions can be accurately predicted before they are issued or executed.
  • Dependent instructions are no longer bound by the serialization constraints imposed by data dependences.
  • More parallelism can be exploited.
  • Predicting values for dependent instructions can lead to beneficial speculative execution

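Redundant Instructions
• If the dynamic instances of every static instruction generated during program execution are cached, each result-producing dynamic instruction falls into one of three classes:
  • New-result instructions: dynamic instructions that produce a value for the first time (<5%)
  • Repeated-result instructions: dynamic instructions whose result equals that of another dynamic instance of the same static instruction (80%~90%)
  • Derivable instructions: dynamic instructions whose result can be derived from earlier results (<5%)
• Redundant instructions
  • Repeated-result instructions
  • and derivable instructions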

Sources of Value Locality (Sources of value predictability)
How often does the same value result from the same instruction twice in a row?
Question: Where does value locality occur?

Instruction type                                Value locality?
Single-cycle Arithmetic (e.g. addq $1 $2)       Somewhat
Single-cycle Logical (e.g. bis $1 $2)           Yes
Multi-cycle Arithmetic (e.g. mulq $1 $2)        No
Register Move (e.g. cmov $1 $2)                 Yes
Integer Load (e.g. ldq $1 8($2))                Yes
Store with base register update                 No
FP Multiply                                     Somewhat
FP Add                                          Somewhat
FP Move                                         Yes
FP Load                                         Yes

Sources of Value Locality (Sources of predictability)
• Data redundancy: text files with white space, empty cells in spreadsheets
• Error checking
• Program constants
• Computed branches
• Virtual function calls
• Glue code: allows calling from one compilation unit to another
• Addressability: pointer tables store constant addresses loaded at runtime
• Call contexts: caller-saved/callee-saved registers
• Memory alias resolution: conservative assumptions from the compiler regarding aliasing
• Register spill code
• ……

Load Value Locality

Why Is Value Prediction Possible? Value Locality

Why Is Value Prediction Possible?
• Register value locality
  • (No. of times each static instruction writes a register value that matches a previously seen value for that static instruction) / (Total no. of dynamic register writes in the program)
  • With a history depth of one: average 49%
  • With a history depth of four: average 61%
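The ratio above can be measured on a trace of register writes. The following is a sketch, assuming the trace is a list of hypothetical `(pc, value)` pairs and that "history depth" means the most recent distinct values kept per static instruction; the slides do not spell out the replacement policy, so the recency-based policy here is an assumption.

```python
from collections import defaultdict

def value_locality(trace, depth):
    """Fraction of dynamic register writes whose value matches one of the
    last `depth` distinct values written by the same static instruction."""
    history = defaultdict(list)   # pc -> recent distinct values, oldest first
    hits = total = 0
    for pc, value in trace:
        seen = history[pc]
        total += 1
        if value in seen:
            hits += 1
            seen.remove(value)    # move the value to the most-recent position
        seen.append(value)
        if len(seen) > depth:
            seen.pop(0)           # assumption: drop the least recently seen value
    return hits / total if total else 0.0

# Made-up trace: one instruction alternates 1, 2; another always writes 7.
trace = [(0x10, 1), (0x10, 2), (0x10, 1), (0x10, 2), (0x14, 7), (0x14, 7)]
print(value_locality(trace, depth=1))
print(value_locality(trace, depth=4))
```

On this toy trace the deeper history captures the alternating pattern that a depth of one misses, which mirrors the gap between the two averages quoted above.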

Register Value Locality by Instruction Type (Table 2, Figure 3)
• Integer and floating-point double loads are the most predictable frequently occurring instructions
• Single-cycle instructions
  • fewer input operands -> higher value locality
• Multi-cycle instructions
  • more input operands -> lower value locality

Register Value Locality by Instruction Type (Table 2, Figure 3)

Register Value Locality by Instruction Type (Table 2, Figure 3)

Value Sequence Types
• Basic sequences
  • Constant: 3 3 3 3 3 3 …… Δ=0
    • Sources: surprisingly often
  • Stride: 1 2 3 4 5 6 7 …… Δ=1
    • Sources: most common case, an array being accessed in a regular fashion, loop induction variables
  • Non-stride: 29 31 12 34 56 ……
• Composed sequences
  • Repeated stride: 1 2 3 1 2 3 1 2 3 ……
  • Repeated non-stride: 1 13 35 41 1 13 35 41 ……
    • Sources: nested loops ……
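A short classifier makes the five categories concrete. It is only a sketch: `classify_sequence` is a hypothetical helper, it looks at the whole finite sequence at once, and a repeated pattern is recognised only if at least two full periods are present.

```python
def classify_sequence(values):
    """Label a finite value sequence with one of the sequence types above."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    if not deltas:
        return "unknown"
    if all(d == 0 for d in deltas):
        return "constant"
    if all(d == deltas[0] for d in deltas):
        return "stride"
    n = len(values)
    # Look for the shortest period that repeats across the whole sequence.
    for period in range(2, n // 2 + 1):
        if all(values[i] == values[i - period] for i in range(period, n)):
            block = values[:period]
            block_deltas = [b - a for a, b in zip(block, block[1:])]
            if all(d == block_deltas[0] for d in block_deltas):
                return "repeated stride"
            return "repeated non-stride"
    return "non-stride"

print(classify_sequence([3, 3, 3, 3, 3, 3]))
print(classify_sequence([1, 2, 3, 4, 5, 6, 7]))
print(classify_sequence([29, 31, 12, 34, 56]))
print(classify_sequence([1, 2, 3, 1, 2, 3, 1, 2, 3]))
print(classify_sequence([1, 13, 35, 41, 1, 13, 35, 41]))
```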

Classification of Value Predictors
• Computational predictors: use previous values to compute the prediction
  • Last value predictors
    • Predict the previous value
    • A saturating counter can be used, with values only being predicted above a threshold
    • A new value is only predicted if it occurs successively
  • Stride predictors
    • V_N = V_{N-1} + (V_{N-1} - V_{N-2})
    • 2-delta
      • Two strides: s1 is V_{N-1} - V_{N-2}, and s2 is used to compute predictions. If s1 is repeated, then s2 is updated.
• Context-based predictors (history-based or pattern-based predictors): match the recent value history against previous value history
  • Finite context method (FCM) predictors
    • A k-th order predictor uses the previous k values to make the prediction
    • Counts the occurrences of each value following a pattern and predicts the one with the maximum count.
• Hybrid predictors
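The predictor classes above can be sketched for a single static instruction as follows. The class names and the `count_correct` helper are hypothetical; the tables are unbounded, there is no confidence counter, and the FCM sketch assumes an order of at least one.

```python
from collections import Counter, defaultdict

class LastValuePredictor:
    def __init__(self):
        self.last = None
    def predict(self):
        return self.last
    def update(self, actual):
        self.last = actual

class TwoDeltaStridePredictor:
    def __init__(self):
        self.last = None
        self.s1 = 0   # most recently observed stride
        self.s2 = 0   # stride actually used for predictions
    def predict(self):
        return None if self.last is None else self.last + self.s2
    def update(self, actual):
        if self.last is not None:
            stride = actual - self.last
            if stride == self.s1:   # same stride twice in a row: adopt it
                self.s2 = stride
            self.s1 = stride
        self.last = actual

class FCMPredictor:
    def __init__(self, order):
        self.order = order                   # k previous values form the context
        self.history = ()
        self.counts = defaultdict(Counter)   # context -> counts of following values
    def predict(self):
        followers = self.counts.get(self.history)
        if len(self.history) < self.order or not followers:
            return None
        return followers.most_common(1)[0][0]
    def update(self, actual):
        if len(self.history) == self.order:
            self.counts[self.history][actual] += 1
        self.history = (self.history + (actual,))[-self.order:]

def count_correct(predictor, values):
    correct = 0
    for v in values:
        if predictor.predict() == v:
            correct += 1
        predictor.update(v)
    return correct

print(count_correct(LastValuePredictor(), [3, 3, 3, 3, 3, 3]))
print(count_correct(TwoDeltaStridePredictor(), [1, 2, 3, 4, 5, 6, 7]))
print(count_correct(FCMPredictor(order=2), [1, 2, 3, 1, 2, 3, 1, 2, 3]))
```

The 2-delta predictor needs to see the same stride twice before it predicts with it, so it warms up more slowly than a plain stride predictor but is not disturbed by a single irregular value. The FCM predictor only starts hitting once each context has been seen at least once.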

Performance Measures of a Value Predictor
• Three factors determine the efficacy
  • Accuracy
    • the ability to avoid mispredictions
  • Coverage
    • the ability to predict as many instruction outcomes as possible
  • Scope
    • the set of instructions that the predictor targets
• Relationships between factors
  • Accuracy ↔ Coverage trade-off
  • Narrowing the scope
    • Low implementation cost
    • Achieves better accuracy and coverage
    • Mispredictions for useless predictions are eliminated
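Accuracy and coverage can be computed from a log of per-instruction outcomes. The definitions used below are one common choice and are an assumption here (some authors define coverage as correct predictions over all instructions); `accuracy_and_coverage` is a hypothetical helper.

```python
def accuracy_and_coverage(outcomes):
    """outcomes: one entry per dynamic instruction in the predictor's scope.
    None = no prediction was made, True = predicted correctly, False = mispredicted."""
    made = [o for o in outcomes if o is not None]
    accuracy = sum(made) / len(made) if made else 0.0           # correct / predictions made
    coverage = len(made) / len(outcomes) if outcomes else 0.0   # predictions made / in scope
    return accuracy, coverage

# Made-up outcome log for eight dynamic instructions.
log = [True, True, False, None, None, True, None, True]
print(accuracy_and_coverage(log))
```

Raising the confidence threshold turns some `False` entries into `None`: accuracy goes up and coverage goes down, which is the trade-off named above.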

Exploiting Value Locality
• Value Prediction (VP): "predict the results of instructions based on previously seen results"
  • Pipeline: Fetch -> Decode -> Issue -> Execute -> Commit
  • Predict Value at the front of the pipeline; Verify after Execute; re-issue if mispredicted
• Instruction Reuse (IR): "recognize that a computation chain has been previously performed and therefore need not be performed again"
  • Pipeline: Fetch -> Decode -> Issue -> Execute -> Commit
  • Check for previous use; verify arguments are the same; skip execution if reused

Value Prediction
[Pipeline diagram: Fetch, Decode, Issue, Execute, Commit; Predict Value at the front, Verify after Execute, loop back if mispredicted]
• Speculative prediction of register values
  • Values are predicted during fetch and dispatch, then forwarded to dependent instructions.
  • Dependent instructions can be issued and executed immediately.
  • Before committing a dependent instruction, we must verify the predictions. If wrong: must restart the dependent instruction with the correct values.

Value Prediction Units
[Figure: prediction units indexed by PC; labels: PC, "Should I predict?"]

How to Predict Values? Value Prediction Table (VPT)
• Cache indexed by instruction address (PC)
• Mapped to one or more 64-bit values
• Values are replaced (LRU) when the instruction is first encountered or when the prediction is incorrect.
• 32 KB cache: 4K 8-byte entries
[Figure: PC indexes the Classification Table (CT, prediction history) and the Value Prediction Table (VPT, value history); output: prediction]

Estimating Prediction Accuracy: Classification Table (CT)
• Cache indexed by instruction address (PC)
• Mapped to a 2-bit saturating counter, incremented when the prediction is correct and decremented when it is wrong.
  • 0, 1 = don't use the prediction
  • 2 = use the prediction
  • 3 = use the prediction and don't replace the value if wrong
• 1K entries are sufficient
[Figure: PC indexes the Classification Table (CT, prediction history) and the Value Prediction Table (VPT, value history); outputs: predicted value and prediction]
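The VPT and CT can be combined into one small model. This is a sketch under stated assumptions: both tables are direct-mapped and untagged, each VPT entry holds a single value (history depth one), and instructions are 4 bytes so the PC is shifted right by two before indexing. The class name `ValuePredictionUnit` is hypothetical.

```python
class ValuePredictionUnit:
    def __init__(self, vpt_entries=4096, ct_entries=1024):
        self.vpt = [None] * vpt_entries   # last value seen per slot
        self.ct = [0] * ct_entries        # 2-bit saturating counters (0..3)

    def _slots(self, pc):
        word = pc >> 2                    # assumption: 4-byte instructions
        return word % len(self.vpt), word % len(self.ct)

    def predict(self, pc):
        """Return the predicted value, or None when the counter says not to predict."""
        v, c = self._slots(pc)
        if self.ct[c] >= 2 and self.vpt[v] is not None:
            return self.vpt[v]
        return None

    def update(self, pc, actual):
        """Train both tables with the value the instruction actually produced."""
        v, c = self._slots(pc)
        stored = self.vpt[v]
        if stored == actual:
            self.ct[c] = min(3, self.ct[c] + 1)
        else:
            # State 3 means: tolerate one wrong value without replacing it.
            keep = stored is not None and self.ct[c] == 3
            self.ct[c] = max(0, self.ct[c] - 1)
            if not keep:
                self.vpt[v] = actual

unit = ValuePredictionUnit()
pc = 0x400                                # made-up instruction address
for value in (42, 42, 42):
    unit.update(pc, value)
print(unit.predict(pc))                   # counter has reached 2, so 42 is predicted
```

Because the tables are untagged in this sketch, two instructions that map to the same slot share an entry; a real design would have to decide whether the extra tag storage is worth it.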

Verifying Predictions
[Pipeline diagram: Fetch, Decode, Issue, Execute, Commit; Predict Value at the front, Verify after Execute, loop back if mispredicted]
• The predicted instruction executes normally.
• A dependent instruction cannot commit until the predicted instruction has finished executing.
• The computed result is compared to the predicted one; if they match, the dependent instructions can commit.
• If not, the dependent instructions must reissue and execute with the computed value. Miss penalty = 1 cycle later than with no prediction.

Instruction Reuse
[Pipeline diagram: Fetch, Decode, Issue, Execute, Commit; check for previous use, verify arguments are the same, bypass execution if reused]
• Obtain the results of instructions from their previous executions.
• If the previous results are still valid, don't execute the instruction again, just commit the results!
• Non-speculative, early verification
  • Previous results are read in parallel with fetch.
  • The reuse test runs in parallel with decode.
  • Only execute if the reuse test fails.

How to Reuse Instructions?
• Reuse buffer
  • Cache indexed by instruction address (PC)
  • Stores the result of the instruction along with the information needed for establishing reusability:
    • Operand register names
    • Pointer chain of dependent instructions
  • Assume 4K entries (each entry takes 4x as much space as a VPT entry, so compare against a 16K-entry VPT)
  • 4-way set-associative.

Reuse Scheme
• Dependent chain of results (each points to the previous instruction in the chain)
• An entry is reusable if the entries on which it depends have been reused (can't reuse out of order).
• Start of chain: reusable if the "valid" bit is set; invalidated when its operand registers are overwritten.
• Special handling of loads and stores.
• An instruction will not be reused if:
  • Its inputs are not ready for the reuse test (decode stage)
  • It has different operand registers
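The "start of chain" rule can be sketched on its own: an entry is reusable while its valid bit is set, and any write to one of its operand registers clears that bit. The model below deliberately omits the dependence pointer chain and the special handling of loads and stores, and it uses an unbounded dictionary instead of a 4-way set-associative cache. `ReuseBuffer` and its `run` method are hypothetical names.

```python
class ReuseBuffer:
    def __init__(self):
        self.entries = {}   # pc -> {"srcs": operand register names, "result", "valid"}

    def run(self, pc, srcs, dest, compute):
        """Return (result, reused). `compute` is called only when the reuse test fails."""
        srcs = tuple(srcs)
        entry = self.entries.get(pc)
        reused = entry is not None and entry["valid"] and entry["srcs"] == srcs
        result = entry["result"] if reused else compute()
        # Writing `dest` invalidates every entry that names it as an operand.
        for other in self.entries.values():
            if dest in other["srcs"]:
                other["valid"] = False
        # An instruction that overwrites its own operand is never reusable as is.
        self.entries[pc] = {"srcs": srcs, "result": result, "valid": dest not in srcs}
        return result, reused

regs = {"r1": 2, "r2": 3, "r3": 0}   # made-up register file
rb = ReuseBuffer()

def add_r1_r2():
    return regs["r1"] + regs["r2"]

regs["r3"], reused = rb.run(0x10, ("r1", "r2"), "r3", add_r1_r2)
print(regs["r3"], reused)            # first execution: computed
regs["r3"], reused = rb.run(0x10, ("r1", "r2"), "r3", add_r1_r2)
print(regs["r3"], reused)            # operands untouched: reused
regs["r1"], _ = rb.run(0x14, ("r2",), "r1", lambda: regs["r2"] * 2)
regs["r3"], reused = rb.run(0x10, ("r1", "r2"), "r3", add_r1_r2)
print(regs["r3"], reused)            # r1 was overwritten: computed again
```

Unlike value prediction, nothing here is speculative: a result is reused only when the buffer can prove the operands are unchanged, so no verification step is needed afterwards.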

Comparing VP and IR
• Value Prediction (VP): "predict the results of instructions based on previously seen results"
• Instruction Reuse (IR): "recognize that a computation chain has been previously performed and therefore need not be performed again"

Comparing VP and IR
Which captures more redundancy?
• IR cannot reuse a result when:
  • The inputs aren't ready
  • The same result follows from different inputs
  • VP makes a lucky guess

Comparing VP and IR
Which handles misprediction better?
• IR is non-speculative, so it never mispredicts

Comparing VP and IR
Which integrates best with branches?
• IR
  • Mispredicted branches are detected earlier
  • Instructions from mispredicted branches can be reused.
• VP
  • Causes more mispredictions

Comparing VP and IR
Which is better for resource contention?
• IR might not even need to execute the instruction

Comparing VP and IR
Which is better for execution latency?
• VP causes some instructions to be executed twice (when values are mispredicted); IR executes an instruction once or not at all.

Possible class project: Can we get the best of both techniques?
• Value Prediction (VP): "predict the results of instructions based on previously seen results"
• Instruction Reuse (IR): "recognize that a computation chain has been previously performed and therefore need not be performed again"

Summary
• 84-97% of redundant instructions are reusable.
• With a realistic configuration on simulated (current and near-future) PowerPC processors, VP gave speedups of 4.5-6.8%.
  • 3-4x more speedup than devoting the extra space to cache.
• VP's speedups vary between benchmarks (grep: 60%)
• VP's potential speedups reach up to 70% for idealized configurations.
  • Can exceed the dataflow limit (on an idealized machine).
  • Are these really realistic?
• Net performance: VP is better on some benchmarks; IR is better on others. All speedups are typically 5-10%.
• More interesting question: can the two schemes be combined?
