Implementing Optimizations at Decode Time


Presentation Transcript


  1. Implementing Optimizations at Decode Time
     Ilhyun Kim, Mikko H. Lipasti
     PHARM Team, University of Wisconsin—Madison
     http://www.ece.wisc.edu/~pharm

  2. What this talk is about
  • It's not about new optimizations
    • Memory reference combining
    • Silent store squashing
  • It's not about decode
    • How to build an instruction decoder
  • It is about implementation
    • A way to implement dynamic optimizations in a pipeline w/ speculative scheduling: "Implementing Optimizations at Decode Time"

  3. Outline
  • Speculative Scheduling
    • Why it causes problems with dynamic optimizations
  • Speculative Decode
    • Enables dynamic optimizations in the processor core
  • Case Study: Memory Reference Combining
  • Case Study: Silent Store Squashing
  • Conclusions

  4. Where do you want to put optimizations?
  • Optimization trade-offs: from most global to most dynamic
  [Figure: spectrum of optimization points, from most global to most dynamic: compiler (binary); binary translation / optimization (virtual machine → host machine); decode / trace-cache fill (instr cache → fetch → decode → trace cache → execution core, within the processor); and the execution core itself.]
  • Can we achieve fully dynamic optimizations?
    • Dynamic events affect execution for the very next clock cycle

  5. Speculative Scheduling
  [Figure: pipeline diagrams (Fetch → Decode → Schedule → Dispatch → RF → Exe → Writeback/Recover → Commit). With atomic wakeup/select, dependents wait for the parent's result; with speculative (non-atomic) wakeup/select, dependents issue on a predicted latency and must be re-scheduled when the latency is mispredicted or an input value is invalid.]
  • Overview
    • Unlike the original Tomasulo's algorithm:
      • Instructions are scheduled based on a pre-determined latency
      • Resources are allocated at schedule time
    • Once instructions leave the scheduler, it is impractical to change resource/execution scheduling
    • The pipeline CANNOT react to observed events immediately (see the sketch below)
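A minimal sketch of this scheduling discipline, assuming a toy single-issue model (the Instr/SpecScheduler names are illustrative, not the paper's simulator):

    # Dependents are woken a fixed, predicted number of cycles after the
    # parent issues, instead of waiting for the parent's result to appear
    # (as in Tomasulo-style atomic wakeup).

    class Instr:
        def __init__(self, name, predicted_latency):
            self.name = name
            self.predicted_latency = predicted_latency
            self.actual_latency = None      # discovered only at execute time

    class SpecScheduler:
        def __init__(self):
            self.cycle = 0

        def issue(self, instr):
            # Dependents are scheduled NOW, from the predicted latency:
            # the scheduler commits to this timing before the result exists.
            return self.cycle + instr.predicted_latency

        def verify(self, instr):
            # Cycles later, the true latency is known (e.g. a cache miss).
            if instr.actual_latency != instr.predicted_latency:
                # Too late to adjust quietly: speculatively issued
                # dependents are already in flight and must be replayed.
                return "re-schedule dependents"
            return "ok"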

  6. What becomes harder?
  • Fully dynamic optimization in the execution stage is hard
  • Example optimization: avoid the cache access if the value is already available in the RF (load and store reuse using register file contents, ICS 2001)
  [Figure: pipeline timing for "lw r1, 4(r29); add r2, r1, 1". The scheduler wakes the add assuming load latency 2. By the time the load executes and the value is found in the RF (latency 1 would have sufficed), the add is already scheduled for the longer latency: the cache access can be cancelled and the value moved, but the reduced load latency yields NO BENEFIT.]

  7. Speculative scheduling breaks fully dynamic optimizations
  • Optimizing a parent instruction is not enough
    • Benefits come from dependent (data, resource) instructions that execute sooner
    • Instructions cannot react immediately under speculative scheduling
  • Some techniques become less efficient, or even unavailable, if they depend on:
    • Instant re-execution
    • Variable execution latency
    • Instant resource allocation/deallocation
  • The scheduler should know what will happen in advance
    • Not fully dynamic – predictor required
    • How to communicate with the scheduler?

  8. Our Solution
  • Optimization: avoid the cache access if the value is available in the RF (load and store reuse using register file contents, ICS 2001)
  • A predictor lets decode transform the instruction (e.g. load → move) assuming the optimization will happen

    lw r1, 4(r29)           move r1, p1
    add r2, r1, 1      →    add r2, r1, 1

  • The optimization is invisible to the scheduler (it is just a 'move' instruction), so it wakes the dependent instructions with the move's 1-cycle latency
  • The add appears to be optimized and benefits from the reduced latency

  9. Speculative Decode (SD)
  • Decoding instructions into an optimistic sequence rather than one that works correctly in all cases (unsafe)
    • Reaps the benefits of fully dynamic optimization when correctly predicted
    • Requires verification code for correctness
    • Flushes the pipeline when mispredicted
  • Example: load value prediction (a decode sketch follows this slide)

    lw r1, 36(r29)          lw r1, 36(r29)
    add r3, r1, r5     →    bne r1, p2, softtrap    (p2 = predicted value)
                            add r3, p2, r5
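A sketch of this decode-time expansion; the instruction objects, predictor interface, and the with_source_replaced helper are assumptions made for illustration:

    # Expand one load into the optimistic sequence above: keep the real
    # load, verify it against the predicted value, and let consumers use
    # the prediction so they need not wait for the load to complete.

    def speculative_decode_lvp(load, consumers, value_predictor):
        predicted = value_predictor.lookup(load.pc)   # predicted value, or None
        if predicted is None:
            return [load] + consumers      # no prediction: decode normally

        p2 = ("imm", predicted)            # predicted value, materialized in
                                           # a free physical register (p2)
        seq = [load]                       # still perform the real load
        # Verify: if the loaded value differs from the prediction, take a
        # soft trap and refetch (same recovery as a branch misprediction).
        seq.append(("bne", load.dest, p2, "softtrap"))
        for c in consumers:
            # Dependents consume p2 instead of waiting on the load.
            seq.append(c.with_source_replaced(load.dest, p2))
        return seq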

  10. Benefits of SD
  [Figure: front-end with a predictor: original instructions are transformed at decode and sent to the OoO execution core; when mispredicted, squash and refetch (same as branch mispredictions).]
  • Pre-schedules optimizations outside the OoO core
    • Enables dynamic optimizations
    • Eliminates resource contention more effectively than even fully dynamic optimizations – leads to better performance
  • Implements optimizations using existing I-ISA primitives
    • Implements microarchitectural ideas with minimal core change
    • Reuses the existing data/control paths in the core
    • Minimizes negative effects on the scheduler – invisible to the scheduler

  11. Translation Layer for SD
  • Many decoders already have a translation layer between the user-ISA and the implementation-ISA
    • Because direct implementation of complex instructions is difficult
    • P6, Pentium 4, Power4, K7, ...
  • Functionality required for SD
    • One-to-multiple instruction expansion (x86 decoders)
    • Dynamically variable mapping between U-ISA and I-ISA (experimental S/390)
  • Reducing the decode overhead
    • Trace cache / decoded instruction cache (Pentium 4)
    • Instruction-path coprocessors (Chou and Shen, ISCA 2000)
    • The performance drop is not drastic w/ extra decode stages (sensitivity study in the paper)

  12. Outline
  • Speculative Scheduling
  • Speculative Decode
  • Case Study: Memory Reference Combining
  • Case Study: Silent Store Squashing
  • Conclusions

  13. Case Study: Memory Reference Combining
  • One cache access satisfies multiple loads ("load all" scheme)
    • Cache port / latency benefits
  [Figure: combinable loads in the LSQ (LW 400, LW 404, ..., LW 100, LW 104) share a 64-bit data buffer; one combinable load issues over the 64-bit datapath to the cache/memory, and byte selection completes the others.]
  • Discussed extensively in the literature
    • Wilson et al.: Increasing cache port efficiency for dynamic superscalar microprocessors, ISCA 1996
  • BUT, a speculative scheduler must know in advance whether loads can be combined
    • Otherwise it fails to achieve both benefits

  14. Reference Combining via SD
  • Wide data paths exist in support of instruction set extensions
    • AMD Hammer project: x86-64
    • PowerPC 64-bit implementations
    • Multimedia extensions (SSE, MMX, AltiVec, ...)
  • Many programs are still written in 32-bit mode for backward compatibility
  • SD enables existing binaries to benefit from wider data paths w/o recompilation
  • Wider (128-bit) combining leads to more benefits (performance data in the paper)

  15. Reference Combining via SD
  • Detecting combinable pairs statically (a detector sketch follows this slide)
    • Same base register with a word-size offset difference
    • Two adjacent word memory instructions in program order
  • Predict the alignment of references:

    Original:            Doubleword-aligned:      Word-aligned only:
    lw r1, 0(r10)        dlw r1, 0(r10)           lw r1, 0(r10)
    lw r2, 4(r10)        exthi r2, r1             dlw r2, 4(r10)
    lw r3, 8(r10)        dlw r3, 8(r10)           exthi r3, r2
    lw r4, 12(r10)       exthi r4, r3             lw r4, 12(r10)
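A sketch of the static pairing rule above; the (op, dest, base, offset) tuple encoding is an assumption made for illustration:

    # Two adjacent word loads are combinable candidates when they use the
    # same base register and their offsets differ by one word (4 bytes).

    WORD = 4

    def find_combinable_pairs(instrs):
        """instrs: list of (op, dest, base_reg, offset) in program order."""
        pairs = []
        for a, b in zip(instrs, instrs[1:]):
            if (a[0] == b[0] == "lw"
                    and a[2] == b[2]               # same base register
                    and b[3] - a[3] == WORD):      # adjacent words
                pairs.append((a, b))
        return pairs

    # Example from the slide: all three adjacent pairs are candidates;
    # which pair becomes one dlw depends on the predicted alignment.
    loads = [("lw", "r1", "r10", 0), ("lw", "r2", "r10", 4),
             ("lw", "r3", "r10", 8), ("lw", "r4", "r10", 12)]
    print(find_combinable_pairs(loads))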

  16. Reference Combining via SD
  • Pipeline front-end
  [Figure: a sequence detector after fetch finds adjacent loads/stores; a predictor keeps the alignment history of loads/stores. When combining is predicted, decode emits dlw + extract (or merge + dsw) to the execution core; otherwise it emits the original lw+lw / sw+sw.]
  • When misaligned:
    • The memory system detects it (same as the base case)
    • After the pipeline is drained, the original instructions are fetched again and decoded without transformation

  17. Microarchitectural Assumptions
  • SimpleScalar PISA w/ speculative scheduling
    • 4-wide, 8-stage pipeline
    • Hybrid branch predictor (gshare + bimodal)
    • 64-entry RUU; 32-entry load / 16-entry store schedulers
    • 64KB I/D L1, 512KB unified L2
    • 2 load / 1 store ports (mutually exclusive)
    • 2 store buffers outside the OoO core
  • HW memory reference combining (HWC)
    • Magic scheduler w/ perfect combining knowledge + store merging in the store buffer (for store combining)

  18. HWC vs. SDC – cache access reductions
  • HWC reduces more cache accesses than SDC

  19. HWC vs. SDC – LSQ contention reductions
  [Figure: bars compare Base, HWC w/ an oracle scheduler, and SDC.]
  • SDC reduces LSQ contention more (fewer memory instructions)

  20. HWC vs. SDC – Speedups
  • SD can reap many of the benefits of pure hardware implementations

  21. Outline
  • Speculative Scheduling
  • Speculative Decode
  • Case Study: Memory Reference Combining
  • Case Study: Silent Store Squashing
  • Conclusions

  22. Case Study: Silent Store Squashing (SSS)
  • Eliminates stores that do not change architectural state
    • Reduces core and memory system contention
  [Figure: hardware SSS implicitly converts a store into 3 operations: (1) a converted load accesses memory, (2) the loaded value is compared against the store value, (3) the store is nullified when silent. Separate load/store schedulers imply replication → more contention. The alternative: do the conversion explicitly.]

  23. Silent Store Squashing via SD
  • Explicitly removes predicted-silent stores (a transform sketch follows this slide)
    • Reduces store scheduler contention

    add r1, 1, r1           add r1, 1, r1
    sw r1, 16(r29)     →    lw p1, 16(r29)
    lw r5, 4(r10)           bne p1, r1, trap
                            lw r5, 4(r10)      (no RAW)

  • Load + trap for store verify
    • The pipeline is drained when the store is not silent
  • No store, no aliasing
    • Silent stores do not change the value → no RAW hazard
    • Later loads may bypass earlier unresolved stores even with true dependences
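A sketch of the store-to-verify transform above; the instruction tuples and predictor interface are illustrative assumptions:

    # A store the predictor believes is silent is replaced by a load plus
    # a trapping compare, so no store ever enters the store scheduler.

    def speculative_decode_sss(store, silence_predictor):
        """store: ('sw', src_reg, offset, base_reg)."""
        _, src, off, base = store
        if not silence_predictor.predicts_silent(store):
            return [store]                    # decode normally

        verify_reg = "p1"                     # a free physical register
        return [
            ("lw", verify_reg, off, base),    # read the current memory value
            # If memory differs from the value we were about to store, the
            # store was NOT silent: trap, drain the pipeline, and refetch
            # the original (untransformed) store.
            ("bne", verify_reg, src, "trap"),
        ]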

  24. HWSSS vs. SDSSS – memory disambiguation
  • Better memory disambiguation is achieved
    • No stores → fewer store-to-load block cycles

  25. HWSSS vs. SDSSS – Speedups
  • SDSSS outperforms HWSSS
    • Better memory disambiguation
    • HWSSS does not reduce contention in the store scheduler

  26. Conclusions
  • Speculative scheduling makes optimizations in the execution stage impractical → pre-schedule optimizations by transforming instructions
  • Advantages of SD-based implementations
    • Enable execution-stage optimizations
    • Reuse existing data/control paths
    • No negative effect on instruction scheduling
    • Reduce contention inside the core better
  • Two case studies show that SD can reap many of the benefits of pure hardware implementations
    • Memory reference combining: less queue contention
    • Silent store squashing: less queue contention, better memory disambiguation

  27. Backup slides

  28. Is HWC easy to integrate?
  [Figure: datapath (Schedule → Register File → Addrgen/Exe1 → Load unit/Exe2 → result bus); the HWC version needs a "magic" scheduler over the LSQ.]
  • Schedule
    • Must detect piggyback loads to be issued in the same clock cycle
    • Actual values are not involved in scheduling → how to detect them without effective addresses?
  • Register file & result bus
    • More loads satisfied at the same time → more result bus bandwidth, more RF write ports

  29. SD Combining Prediction
  • Over 80% of adjacent combinable references are captured (1024 entries, ~4KB)
  • Miss rate: ~0.1% of all references

  30. Silence Prediction
  • 45% of silent stores detected (1024 entries, ~2.5KB)
  • Low miss rate (~1%)

  31. Silence Predictor
  [Figure: each predictor entry, indexed by store PC, holds the last store value (lower n bits; 8 in the example), a confidence counter (+1 when the current store value matches the last value, -1 when it differs), and a threshold counter trained by the store-verify result (-1 when silent, +4 when not silent). A walk-through (a)-(e) shows repeated "Store 100" instances raising confidence until the store is decoded into a load/compare verify sequence, and an intervening "Store 50" changing the value so the store executes normally.]
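A sketch of such a predictor entry and its update rules; the -1/+4 training, counter roles, and 1024-entry size follow the slides, while the exact widths and interface are illustrative assumptions:

    class SilencePredictor:
        def __init__(self, entries=1024, value_bits=8):
            self.entries = entries
            self.mask = (1 << value_bits) - 1
            self.table = {}    # index -> [last_value, confidence, threshold]

        def _entry(self, pc, value):
            return self.table.setdefault(pc % self.entries,
                                         [value & self.mask, 0, 4])

        def predict_silent(self, pc, value):
            e = self._entry(pc, value)
            return e[1] >= e[2]    # confident enough to pay for a verify load

        def train_value(self, pc, value):
            e = self._entry(pc, value)
            v = value & self.mask
            e[1] += 1 if v == e[0] else -1   # same value again -> more confident
            e[0] = v

        def train_outcome(self, pc, was_silent):
            e = self.table.get(pc % self.entries)
            if e is not None:
                # -1 when the verified store was silent, +4 when it was not
                # (per the slide), saturating; a high threshold discourages
                # issuing verify loads for stores that keep changing memory.
                e[2] = max(0, min(7, e[2] + (-1 if was_silent else 4)))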

  32. Silence Squashing via SD
  [Figure: the predictor supplies "is silent?" and the last store value to decode, which emits load+comp+store, load+branch, or a plain store to the execution core.]
  • Explicit store verify depending on predictor state:
    • store
    • load + compare + store
    • load + branch
  • Eliminates negative effects on the scheduling logic
    • Explicit load issue for verify → SSS is virtually invisible to the scheduling logic
    • Explicit compare/branch operation → the existing branch unit maintains correct machine state

  33. SD Combining Prediction
  [Figure: PC-indexed, tagged table; each entry keeps a 4-bit alignment history (one bit shifted in per reference) and a next-target register updated by +4/-4.]
  • Combining is predicted when the alignment history is (a predictor sketch follows this slide):
    • 1111: aligned 4 times in a row
    • 1010: the base is increasing by 4
  • Over 80% of adjacent combinable references are captured (1024 entries, ~4KB)
  • Miss rate: ~0.1% of all references
  • Captures up to 26% of all memory references
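A sketch of the alignment-history lookup; the 1111/1010 patterns and 1024-entry size follow the slide, and the rest of the table layout is an illustrative assumption:

    class CombiningPredictor:
        def __init__(self, entries=1024):
            self.entries = entries
            self.history = {}              # index -> 4-bit alignment history

        def update(self, pc, addr):
            idx = pc % self.entries
            aligned = 1 if addr % 8 == 0 else 0   # doubleword-aligned?
            h = ((self.history.get(idx, 0) << 1) | aligned) & 0b1111
            self.history[idx] = h

        def predict_combine(self, pc):
            # 1111: aligned 4 times in a row; 1010: base advancing by 4,
            # so every other instance starts a doubleword.
            return self.history.get(pc % self.entries, 0) in (0b1111, 0b1010)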

  34. Silence Prediction
  • Avoiding unnecessary store verifies (load issues)
  • How do we train the predictor without the silence outcome?
    • The silence outcome is only available when we do a store verify
    • → correlate the last-value information for training
  [Table: decode decision: same value for several instances and predicted silent → squash (load+trap); predicted not silent / different values → check (load+compare+store); otherwise → no squash, no SD.]
  • 45% of silent stores detected (1024 entries, ~2.5KB)
  • Low miss rate (~1%)

  35. Future Work
  • Spectrum of power/performance design points attainable by speculative decode
    • Single core, multiple marketing targets
  • Exposing complex control paths to the I-ISA
    • Improving controllability of the processor core → achieving more benefit from SD
  • Developing an I-ISA for complexity-effective core design
  • And more...
