Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era

Presentation Transcript

1. Energy Efficient Latency Tolerance: Single-Thread Performance for the Multi-Core Era
Andrew Hilton, University of Pennsylvania (adhilton@cis.upenn.edu)
Duke, March 18, 2010

2. Multi-Core Architecture
Single-thread performance growth has diminished:
• Clock frequency has hit an energy wall
• Instruction-level parallelism (ILP) has hit energy, memory, and idea walls
Future chips will be heterogeneous multi-cores:
• A few high-performance out-of-order cores (Core i7) for serial code
• Many low-power in-order cores (Atom) for parallel code
[Diagram: one large Core i7 alongside many small Atom cores]

3. Multi-Core Performance
Obvious performance key: write more parallel software
Less obvious performance key: speed up existing cores
• Core i7? Keep the serial portion from becoming a bottleneck (Amdahl's law)
• Atoms? Parallelism is typically not elastic
Key constraint: energy
• Thermal limits of the chip, cost of energy, cooling costs, …
[Diagram: one Core i7 alongside many Atom cores]

4. “TurboBoost”
Existing technique: Dynamic Voltage and Frequency Scaling (DVFS)
• Increase clock frequency (requires increasing voltage)
• Simple
• Applicable to both types of cores
• Not very energy-efficient (energy ≈ frequency²)
• Doesn’t help “memory bound” programs (speedup is less than the frequency gain)
[Diagram: clocks raised on all cores]

5. Effectiveness of “TurboBoost”
[Charts: speedup (higher is better) and ED² (lower is better) across benchmarks]
• Example: TurboBoost takes 3.2 GHz → 4.0 GHz (+25%)
• Ideal conditions: 25% speedup at constant Energy × Delay² (ED²)
• Memory bound programs: far from ideal
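To make the ED² comparison concrete, here is a minimal sketch of the arithmetic behind this example. It assumes idealized DVFS, where voltage scales with frequency so energy grows roughly as frequency², and it uses a hypothetical mem_fraction parameter for the share of runtime spent waiting on memory (which the clock does not accelerate); both modeling choices are illustrative assumptions, not numbers from the talk.

```python
# Sketch of the "TurboBoost" ED^2 accounting under idealized DVFS.
# Assumptions (not from the talk): voltage ~ frequency, so energy ~ f^2
# over fixed work; 'mem_fraction' is the share of runtime that is memory
# latency and does not shrink with the clock.

def ed2_after_boost(f_old=3.2, f_new=4.0, mem_fraction=0.0):
    """Return (speedup, relative ED^2) for a frequency boost."""
    ratio = f_new / f_old                    # 1.25 for 3.2 -> 4.0 GHz
    # Core-bound time shrinks with frequency; memory time does not.
    delay = (1 - mem_fraction) / ratio + mem_fraction
    energy = ratio ** 2                      # energy ~ frequency^2
    return 1 / delay, energy * delay ** 2

# Fully core-bound (the ideal case): 25% speedup at constant ED^2.
print(ed2_after_boost(mem_fraction=0.0))    # (1.25, 1.0)
# Half the runtime in memory: only ~11% speedup, ED^2 up ~27%.
print(ed2_after_boost(mem_fraction=0.5))    # (~1.11, ~1.27)
```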

6. “Memory Bound”
Main memory is slow relative to the core (~250 cycles)
The cache hierarchy makes most accesses fast:
• “Memory bound” = many L3 misses
• … or in some cases many L2 misses
• … or, for in-order cores, many L1 misses
• Clock frequency (“TurboBoost”) accelerates only the core/L1/L2
[Diagram: memory hierarchy from main memory (250 cycles) down through a shared L3 (40 cycles), per-core L2s (10 cycles), and L1s above one Core i7 and several Atoms]
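As a rough illustration of why L3 misses are what make a program “memory bound”, the sketch below computes an average access cost for this hierarchy. The latencies are the ones on the slide; the hit rates are illustrative assumptions.

```python
# Average cycles per access for the slide's hierarchy (L2 = 10, L3 = 40,
# memory = 250 cycles; an L1 hit is treated as free). Hit rates below
# are illustrative assumptions, not measurements.

def avg_access_cycles(l1_hit, l2_hit, l3_hit,
                      l2_lat=10, l3_lat=40, mem_lat=250):
    miss1 = 1 - l1_hit                 # fraction reaching L2
    miss2 = miss1 * (1 - l2_hit)       # fraction reaching L3
    miss3 = miss2 * (1 - l3_hit)       # fraction reaching main memory
    return miss1 * l2_lat + miss2 * l3_lat + miss3 * mem_lat

# Cache-friendly program: misses rarely reach memory.
print(avg_access_cycles(0.95, 0.80, 0.90))   # ~1.15 cycles/access
# "Memory bound" program: L3 misses dominate the total cost.
print(avg_access_cycles(0.90, 0.50, 0.20))   # ~13 cycles/access
```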

7. Goal: Help Memory Bound Programs
Wanted: a technique complementary to TurboBoost
Successful applicants should:
• Help “memory bound” programs
• Be at least as energy efficient as TurboBoost (ED² no worse than constant)
• Work well with both out-of-order and in-order cores
Promising previous idea: latency tolerance
• Helps “memory bound” programs
My work: energy efficient latency tolerance for all cores
• Today: primarily out-of-order (BOLT) [HPCA’10]

8. Talk Outline
Introduction
Background: memory latency & latency tolerance
My work: energy efficient latency tolerance in BOLT
• Implementation aspects
• Runtime aspects
Other work and future plans

9. LLC (Last-Level Cache) Misses
[Timeline: loads A and H each stall for ~250 cycles, one after the other (not to scale)]
What is this picture? Loads A & H miss all the caches
This is an in-order processor:
• Misses serialize → latencies add → dominate performance
We want Miss-Level Parallelism (MLP): overlap A & H

10. Miss-Level Parallelism (MLP)
[Timeline: the two 250-cycle misses now overlap instead of serializing]
One option: prefetching
• Requires predicting the address of H at A
Another option: out-of-order execution (Core i7)
• Requires a sufficiently large “window” to do this
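The timeline arithmetic, sketched with the slide's 250-cycle latency and everything except the misses ignored; the miss counts and overlap degrees below are illustrative.

```python
# Serialized vs. overlapped LLC misses, assuming the slide's ~250-cycle
# memory latency and ignoring all non-miss work (illustrative only).

MEM_LAT = 250

def serialized(n_misses):
    """In-order core: each miss completes before the next one issues."""
    return n_misses * MEM_LAT

def overlapped(n_misses, mlp):
    """Misses issue 'mlp' at a time; each batch costs one memory latency."""
    batches = -(-n_misses // mlp)      # ceiling division
    return batches * MEM_LAT

print(serialized(2))        # 500 cycles: A, then H
print(overlapped(2, 2))     # 250 cycles: A and H overlap
print(overlapped(8, 4))     # 500 cycles: eight misses, four in flight
```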

11. Out-of-Order Execution & the “Window”
[Pipeline diagram: I$ → Fetch → Rename → Reorder Buffer, with an Issue Queue and Register File feeding a FU and D$; an LLC miss (load A) sits at the window's head, with completed and unexecuted instructions behind it]
Important “window” structures:
• Register file (bounds the number of in-flight instructions): 128 instructions on Core i7
• Issue queue (bounds the number of un-executed instructions): 36 entries on Core i7
• Sized to “tolerate” (keep the core busy for) ~30-cycle latencies
• To tolerate ~250 cycles, these structures need to be an order of magnitude bigger
Latency tolerance big idea: scale the window virtually
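A minimal, Little's-law-style sketch of the sizing point: to keep a core busy across a latency, the window must hold roughly (sustained IPC) × (latency) instructions. The IPC value is an illustrative assumption; only the 128-entry window and the ~30 vs. ~250 cycle latencies come from the slide.

```python
# Window size needed to cover a latency, estimated as ipc * latency.
# ipc = 4 is an illustrative assumption for a wide out-of-order core.

def window_needed(ipc, latency_cycles):
    return ipc * latency_cycles

print(window_needed(4, 30))    # ~120: roughly what a Core i7 provides
print(window_needed(4, 250))   # ~1000: an order of magnitude bigger
```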

12. Latency Tolerance
[Pipeline diagram: the same core, now with a Slice Buffer next to the Reorder Buffer]
Prelude: add a slice buffer
• A new structure (not in conventional processors)
• Can be relatively large: low bandwidth, not in the critical execution core

13. Latency Tolerance
Phase #1: a long-latency cache miss triggers slice out
• Pseudo-execute: copy the miss to the slice buffer, release its register & IQ slot
[Diagram: load A, the miss, is selected for slice out]

14. Latency Tolerance
Phase #1, continued
[Diagram: A now sits in the slice buffer; its register and issue-queue slot are free]

15. Latency Tolerance
Phase #1, continued
• Propagate “poison” to identify the miss's dependents
[Diagram: D, which reads A's result, is marked miss-dependent]

16. Latency Tolerance
Phase #1, continued
• Pseudo-execute the dependents too
[Diagram: D follows A into the slice buffer, freeing its register and IQ slot]

17. Latency Tolerance
Phase #1, continued
• Proceed under the miss
[Diagram: the slice buffer holds the miss-dependent instructions A, D, E, H; the independent instructions B, C, F, G, I continue to execute]
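Putting phase #1 together, here is a minimal sketch of slice out as these slides describe it: the miss and everything its poison reaches are pseudo-executed into the slice buffer, while independent instructions keep the core busy. The instruction encoding and names are illustrative, not BOLT's actual hardware interfaces.

```python
# Phase #1 (slice out): pseudo-execute the miss and its dependents into
# a slice buffer by propagating "poison" through destination registers.
# Encoding is illustrative: (name, dest_reg, src_regs) in program order.

def slice_out(instructions, missing_load):
    poisoned_regs = set()
    slice_buffer, keep = [], []
    for name, dest, srcs in instructions:
        if name == missing_load or any(s in poisoned_regs for s in srcs):
            # Miss or miss-dependent: move to the slice buffer and poison
            # its destination so its dependents follow it out.
            slice_buffer.append((name, dest, srcs))
            poisoned_regs.add(dest)
        else:
            keep.append(name)   # independent: executes under the miss
    return slice_buffer, keep

# A misses; D reads A's result; E reads D's; H reads A's. B, C, F, G, I
# are independent and keep the core busy while the miss is outstanding.
prog = [("A", "r1", []), ("B", "r2", []), ("C", "r3", ["r2"]),
        ("D", "r4", ["r1"]), ("E", "r5", ["r4"]), ("F", "r6", []),
        ("G", "r7", ["r6"]), ("H", "r8", ["r1"]), ("I", "r9", ["r3"])]
sliced, executed = slice_out(prog, "A")
print([n for n, _, _ in sliced])   # ['A', 'D', 'E', 'H']
print(executed)                    # ['B', 'C', 'F', 'G', 'I']
```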

18. Latency Tolerance
Phase #2: the cache miss returns → slice in
[Diagram: the miss data arrives; the slice buffer begins draining back into the pipeline]

19. Latency Tolerance
Phase #2, continued
• Allocate new registers
[Diagram: A receives a fresh register]

20. Latency Tolerance
Phase #2, continued
• Put the instruction in the issue queue

21. Latency Tolerance
Phase #2, continued
• Re-execute the instruction

22. Latency Tolerance
Phase #2, continued
• Problems with sliced-in instructions (exceptions, mis-predictions)?
[Diagram: E raises an exception during re-execution]

23. Latency Tolerance
Phase #2, continued
• Recover to a checkpoint (taken before A)
[Diagram: the checkpoint (Chk) restores state from before the slice]
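A minimal sketch of phase #2 under the same illustrative encoding as the slice-out sketch above: each slice-buffer entry is re-renamed to a fresh register, re-inserted into the issue queue, and re-executed, and any exception or misprediction in the slice falls back to the pre-slice checkpoint. This is a reading of the slides, not BOLT's actual recovery interface.

```python
# Phase #2 (slice in): drain the slice buffer in program order once the
# miss data returns. Names and data layout are illustrative.

class CheckpointRecovery(Exception):
    pass

def slice_in(slice_buffer, free_regs, issue_queue, faulting=()):
    rename = {}
    for name, dest, srcs in slice_buffer:
        new_dest = free_regs.pop()                   # allocate a new register
        rename[dest] = new_dest
        new_srcs = [rename.get(s, s) for s in srcs]  # slice-internal deps
        issue_queue.append((name, new_dest, new_srcs))  # re-issue, re-execute
        if name in faulting:
            # Exceptions/mispredictions among sliced-in instructions are
            # handled by restoring the pre-slice checkpoint.
            raise CheckpointRecovery(f"{name} faulted: restore checkpoint")
    return rename

iq = []
try:
    slice_in([("A", "r1", []), ("D", "r4", ["r1"]), ("E", "r5", ["r4"])],
             free_regs=["p9", "p8", "p7"], issue_queue=iq, faulting={"E"})
except CheckpointRecovery as err:
    print(err)   # E faulted: restore checkpoint
print(iq)        # [('A', 'p7', []), ('D', 'p8', ['p7']), ('E', 'p9', ['p8'])]
```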

24. Slice Self-Containment
[Dataflow graph: the slice (A, D, E, H) with miss-independent producers B, C, F, G feeding values into it]
Important for latency tolerance: self-contained slices
• A, D, & E have miss-independent inputs
• Capture these values during slice out
• This decouples the slice from the rest of the program
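A minimal sketch of the capture step, assuming an illustrative slice-buffer entry format: ready (miss-independent) source values are copied into the entry at slice out, while poisoned sources are left symbolic.

```python
# Build a self-contained slice-buffer entry: capture the values of
# miss-independent sources at slice out; only poisoned sources remain
# unresolved. Entry format and regfile dict are illustrative.

def slice_out_entry(name, dest, srcs, regfile, poisoned):
    captured = {s: regfile[s] for s in srcs if s not in poisoned}
    return {"insn": name, "dest": dest,
            "live_ins": captured,                       # values, not names
            "waits_on": [s for s in srcs if s in poisoned]}

# D reads r1 (poisoned by the miss) and r2 (ready: capture its value).
entry = slice_out_entry("D", "r4", ["r1", "r2"],
                        regfile={"r2": 42}, poisoned={"r1"})
print(entry)
# {'insn': 'D', 'dest': 'r4', 'live_ins': {'r2': 42}, 'waits_on': ['r1']}
```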

25. Latency Tolerance
[Timeline diagram; energy ≈ number of boxes, i.e., instruction executions]
Latency tolerance example:
• Slice out the miss and its dependent instructions → “grow” the window
• Slice in after the miss returns
Combined into ED²: Delay 0.5×, Energy 1.5× → ED² = 1.5 × 0.5² ≈ 0.38× (ED² < 1.0 = good)

26. Previous Design: CFP
[Chart: speedup over baseline; higher is better]
Prior design: Continual Flow Pipelines [Srinivasan’04]
• Obtains speedups, but…

27. Previous Design: CFP
[Chart: the same speedups, now also showing slowdowns on some benchmarks]
• Obtains speedups, but also slowdowns

28. Previous Design: CFP
[Charts: speedup (higher is better) and ED² (lower is better)]
• Obtains speedups, but also slowdowns
• Typically not energy efficient

29. Energy-Efficient Latency Tolerance?
Efficient implementation:
• Re-use existing structures when possible
• New structures must be simple, low-overhead
Runtime efficiency:
• Minimize superfluous re-executions
Previous designs have not achieved (or considered) these:
• Waiting Instruction Buffer [Lebeck’02]
• Continual Flow Pipeline [Srinivasan’04]
• Decoupled Kilo-Instruction Processor [Pericas ’06,’07]

30. Sneak Preview: Final Results
[Charts: speedup (higher is better) and ED² (lower is better)]
This talk: my work on efficient latency tolerance
• Improved performance
• Performance robustness (do no harm)
• Performance is energy efficient

31. Talk Outline
Introduction
Background: memory latency & latency tolerance
My work: energy efficient latency tolerance in BOLT
• Implementation aspects
• Runtime aspects
Other work and future plans

32. Examination of the Problem
[Pipeline diagram: the slice buffer holds the miss-dependent instructions while younger instructions (K, L) enter the window]
Problem with the existing design: register management
• Miss-dependent instructions free registers when they execute

33. Examination of the Problem
Problem with the existing design: register management
• Miss-dependent instructions free registers when they execute
• Actually, all instructions free registers when they execute
What’s wrong with this?
• No instruction-level precise state → hurts on branch mispredictions
• Execution-order slice buffer → hard to re-rename & re-acquire registers

34. BOLT Register Management
[Pipeline diagram: a Reorder Buffer for the youngest instructions, a Slice Buffer for miss-dependent ones, plus a checkpoint (Chk)]
Youngest instructions: keep in the re-order buffer
• Conventional, in-order register freeing
Miss-dependent instructions: keep in the slice buffer
• Execution-based register freeing

35. BOLT Register Management
In-order speculative retirement stage
• Is the head of the ROB completed or poisoned?
[Diagram: the retirement stage examines A at the ROB head]

36. BOLT Register Management
In-order speculative retirement stage, continued
• Release registers
[Diagram: A's register is released]

37. BOLT Register Management
In-order speculative retirement stage, continued
[Diagram: A has left the register file]

38. BOLT Register Management
In-order speculative retirement stage, continued
• Poisoned instructions enter the slice buffer
[Diagram: A moves into the slice buffer]

39. BOLT Register Management
In-order speculative retirement stage, continued
• Completed instructions are done and simply removed
[Diagram: B, completed, retires without entering the slice buffer]

40. BOLT Register Management
In-order speculative retirement stage, continued
[Diagram: B is gone; C and D remain in the window]
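Slides 35 through 40 step through this stage one build at a time; the sketch below condenses the whole loop, under an illustrative data layout that is not BOLT's actual hardware.

```python
# BOLT's in-order speculative retirement stage, as described above: when
# the ROB-head instruction is completed or poisoned, it speculatively
# retires and releases a register; poisoned instructions move to the
# slice buffer, completed ones are simply removed.

from collections import deque

def speculative_retire(rob, slice_buffer, free_regs):
    """rob entries: dicts with 'name', 'state' (completed / poison /
    unexecuted), and 'prev_reg' (the register freed at retirement)."""
    while rob and rob[0]["state"] in ("completed", "poison"):
        insn = rob.popleft()                  # in order, from the ROB head
        free_regs.append(insn["prev_reg"])    # conventional register freeing
        if insn["state"] == "poison":
            slice_buffer.append(insn["name"]) # miss-dependent: slice out
        # completed instructions need nothing further: just removed

rob = deque([{"name": "A", "state": "poison",     "prev_reg": "p1"},
             {"name": "B", "state": "completed",  "prev_reg": "p2"},
             {"name": "C", "state": "unexecuted", "prev_reg": "p3"}])
sb, free = [], []
speculative_retire(rob, sb, free)
print(sb, free, [e["name"] for e in rob])   # ['A'] ['p1', 'p2'] ['C']
```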

41. BOLT Register Management
Benefits of BOLT’s register management
• Youngest instructions (ROB) get conventional recovery (do no harm)
[Diagram: the youngest instructions (T, U, V) sit in the ROB; the older miss-dependent slice (A, D, E, H, K, L) sits in the slice buffer]

42. BOLT Register Management
Benefits of BOLT’s register management, continued
• A program-order slice buffer allows re-use of SMT (“HyperThreading”) support

43. BOLT Register Management
Benefits of BOLT’s register management, continued
• Scales a single, conventionally sized register file
Contribution #1: hybrid register management, the best of both worlds

44. BOLT Register Management
Challenging part: two algorithms, one register file
• Note: two register files would not be a good solution

45. Two Algorithms, One Register File
Conventional algorithm (ROB):
• In-order allocation/freeing from a circular queue
• Efficient squashing support by moving a queue pointer

46. Two Algorithms, One Register File
Aggressive algorithm (slice instructions):
• Execution-driven reference-counting scheme

47. Two Algorithms, One Register File
How to combine these two algorithms?
• The execution-based algorithm uses reference counting
• Efficiently encode the conventional algorithm as reference counting
• Combine both into one reference-count matrix
Contribution #2: efficient implementation of the new hybrid algorithm
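A minimal sketch of the unifying idea, using plain counters rather than BOLT's actual reference-count matrix: encode the conventional scheme as a single reference held by the register's next writer (dropped at its retirement), let slice instructions add per-reader references (dropped at execution), and free a register when its count reaches zero.

```python
# Both freeing algorithms expressed as reference counting. The counter
# dictionary is an illustration of the idea; BOLT implements it as a
# hardware reference-count matrix.

class RefCountedRegfile:
    def __init__(self):
        self.counts = {}

    def add_ref(self, reg):
        """One reference per slice-buffer reader, plus one for the
        register's next writer (encoding the conventional scheme)."""
        self.counts[reg] = self.counts.get(reg, 0) + 1

    def drop_ref(self, reg, free_list):
        self.counts[reg] -= 1
        if self.counts[reg] == 0:      # last reference gone:
            free_list.append(reg)      # register returns to the free list

rf, free = RefCountedRegfile(), []
rf.add_ref("p5")          # conventional: next writer of the logical reg
rf.add_ref("p5")          # aggressive: a sliced-out reader of p5
rf.drop_ref("p5", free)   # the reader executes out of the slice buffer
rf.drop_ref("p5", free)   # the next writer retires
print(free)               # ['p5'], freed only after both drops
```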

48. Management of Loads and Stores
[Pipeline diagram: the window now effectively spans instructions A–V]
A large window requires support for many loads and stores
• The window is effectively A–V now; what about the loads & stores?
• This could be an hour+ talk by itself… so just a small piece

49. Store-to-Load Dependences
[Diagram: a sliced-out load with an unresolved address (?) among stores B–F]
Different from register state: we cannot capture inputs
• Store → load dependences are determined by addresses
• They cannot be “captured” like registers
• We must be able to find the proper (older, matching) store

50. Store-to-Load Dependences, continued
• We must also avoid younger matching stores (“write-after-read” hazards)
[Diagram: the same picture, now with a younger matching store (X) that the load must not forward from]
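The matching rule itself is easy to state, and a minimal sketch follows: a load must forward from the youngest older store to its address and skip any younger matching store. The store-queue layout is illustrative; BOLT's actual load/store structures are more involved.

```python
# Store-to-load matching: forward from the youngest *older* store to the
# same address; skip younger matching stores (forwarding from one would
# be a write-after-read violation). Store-queue layout is illustrative.

def forward_from(store_queue, load_seq, load_addr):
    """store_queue: (seq, addr, value) tuples in program order."""
    match = None
    for seq, addr, value in store_queue:
        if addr == load_addr and seq < load_seq:
            match = value            # later iterations keep the youngest
        # addr matches but seq > load_seq: younger store, must be ignored
    return match                     # None -> load reads the cache instead

sq = [(1, 0x100, "old"), (3, 0x100, "fwd"), (7, 0x100, "younger")]
print(forward_from(sq, load_seq=5, load_addr=0x100))   # fwd
print(forward_from(sq, load_seq=5, load_addr=0x200))   # None (use cache)
```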