Dynamically Collapsing Dependencies for IPC and Frequency Gain

Peter G. Sassone, D. Scott Wills
Georgia Tech Electrical and Computer Engineering
{sassone, scott.wills}@ece.gatech.edu

Motivation
• Outside of the pipeline, global communication dominates
• Memory wall is well studied
• Inside, traditionally computation or logic dominated
[Figure: pipeline stages fetch, decode, rename, issue, exec, commit, with I cache, D cache, L2 cache, and memory]
Sassone & Wills / Georgia Tech / Dynamic Strands

Motivation
• Now dominated by local communication paths:
  • issue window
  • reorder buffer
  • register file
  • bypass network
• Bottlenecks both IPC and frequency
[Figure: issue logic, issue queue, reg file, and three ALUs]

Motivation
• RISC instruction sets create superfluous traffic
• All instructions and operands are treated as equal
• Little focus on exposing sequentiality
[Figure: issue logic, issue queue, reg file, and three ALUs]

Contributions
• Dynamic Strands:
  • collapse dependence-chains without fan-out
  • exploit properties for simple value precomputation
  • increase efficiency of critical resources
  • preserve binary compatibility
• IPC improvements:
  • 17-20% speedup on Spec2000int and MediaBench
• Frequency improvements:
  • 37% fewer in-flight instructions
  • reduced dependence on dependencies

Outline
• Motivation
• Transient Operands and Strands
• Instruction Replacement Hardware
• Results
• Conclusion

Dyadic Dilemma
Performing any operation on more than two sources requires temporary values

    int sum( int a, int b, int c, int d ) { return a + b + c + d; }

    . . .
    add R1 ← R1, R2
    add R1 ← R1, R3
    add R9 ← R1, R4
    . . .

[Figure: dataflow chain R1 + R2 gives R1', R1' + R3 gives R1'', R1'' + R4 gives R9]

Transient Operands
• We term these temporary values transient operands:
  • values produced by an ALU instruction
  • values consumed only once, and only by an ALU instruction
• Common in modern integer workloads…
On average, about 40% of all dynamic operands are transient
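The two conditions above can be checked mechanically over an instruction trace. The following is a minimal sketch, not the authors' hardware: the `Inst` record, the `RegState` bookkeeping, and `count_transients` are hypothetical names introduced for illustration. It assumes one destination and at most two register sources per instruction, and it only confirms a value as transient once its register is overwritten, since no further readers can appear after that point.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NUM_REGS 32

/* Hypothetical trace record: register numbers, -1 means "unused". */
typedef struct {
    bool is_alu;
    int dest;
    int src1;
    int src2;
} Inst;

/* Per-register bookkeeping for the value currently held. */
typedef struct {
    bool live;            /* a produced value is being tracked */
    bool alu_made;        /* it was produced by an ALU instruction */
    bool all_alu_readers; /* every reader so far was an ALU instruction */
    int reads;            /* number of reads so far */
} RegState;

static bool is_transient(const RegState *r) {
    return r->live && r->alu_made && r->reads == 1 && r->all_alu_readers;
}

/* Count values that were produced by an ALU instruction and consumed
 * exactly once, by an ALU instruction, before being overwritten. */
size_t count_transients(const Inst *trace, size_t n) {
    RegState regs[NUM_REGS] = {0};
    size_t count = 0;
    for (size_t i = 0; i < n; i++) {
        const Inst *in = &trace[i];
        const int srcs[2] = { in->src1, in->src2 };
        /* Sources are read before the destination is written. */
        for (int s = 0; s < 2; s++) {
            if (srcs[s] < 0) continue;
            assert(srcs[s] < NUM_REGS);
            RegState *r = &regs[srcs[s]];
            r->reads++;
            if (!in->is_alu) r->all_alu_readers = false;
        }
        if (in->dest >= 0) {
            assert(in->dest < NUM_REGS);
            RegState *r = &regs[in->dest];
            if (is_transient(r)) count++;   /* old value is now dead */
            *r = (RegState){ true, in->is_alu, true, 0 };
        }
    }
    return count;
}
```

Run on the three adds of the sum example followed by any later instruction that overwrites R1, this sketch reports two transients, corresponding to the two intermediate values of R1. Values still live when the trace ends are deliberately not counted.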

Strands
• Strands:
  • linear chains of instructions joined by transient operands
  • non-consecutive
  • span basic blocks
  • three instructions
  • only the final output needs to be committed
• Strands are common
  • dyadic temporaries
  • compiler strategies
  • language semantics
[Figure: dataflow chain a + b + c + d]

Hardware Overview
[Figure: pipeline (fetch, decode, rename, issue queue, reg file, ALUs, commit) with the added dispatch engine, strand cache, strand cache fill unit, and closed-loop ALUs; the fill unit is labeled "off the critical path"]

Algorithm Example
[Figure: the hardware overview diagram annotated with numbered steps]

Strand Cache Fill Unit
• Based around the operand table
• Detects conditions of transients
• When found…
  • append to existing strand
  • begin new strand
[Figure: operand table with columns arch reg, last producer instruction, last consumer instruction, and consumer count, filled in for example instructions at PC 1404 through 1416 that produce and consume R5]
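One way to picture the fill unit's bookkeeping is the sketch below. It is an illustration under stated assumptions, not the authors' design: `FillUnit`, `OperandEntry`, `Strand`, `fill_unit_observe`, and the sizes are all hypothetical, PCs are treated as unique instruction identifiers (a straight-line trace), and a strand is capped at three instructions. Each architectural register remembers its last producer, last consumer, and consumer count; when the register is redefined and its old value qualifies as transient, the producer and consumer are appended to an existing strand or begin a new one.

```c
#include <assert.h>
#include <stdbool.h>

#define NUM_REGS    32
#define MAX_STRANDS 16
#define STRAND_LEN  3

/* Hypothetical decoded instruction; register -1 means "unused". */
typedef struct { unsigned pc; bool is_alu; int dest, src1, src2; } Inst;

/* One operand table row per architectural register. */
typedef struct {
    bool valid;
    unsigned producer_pc; bool producer_alu;
    unsigned consumer_pc; bool consumer_alu;
    int consumer_count;
} OperandEntry;

typedef struct { unsigned pcs[STRAND_LEN]; int len; } Strand;

typedef struct {
    OperandEntry table[NUM_REGS];
    Strand strands[MAX_STRANDS];
    int num_strands;
} FillUnit;

static bool in_any_strand(const FillUnit *fu, unsigned pc) {
    for (int i = 0; i < fu->num_strands; i++)
        for (int j = 0; j < fu->strands[i].len; j++)
            if (fu->strands[i].pcs[j] == pc) return true;
    return false;
}

/* Join producer -> consumer: extend the strand ending at the producer,
 * or start a new two-instruction strand. */
static void link_pair(FillUnit *fu, unsigned producer, unsigned consumer) {
    if (in_any_strand(fu, consumer)) return;  /* keep strands linear */
    for (int i = 0; i < fu->num_strands; i++) {
        Strand *s = &fu->strands[i];
        if (s->pcs[s->len - 1] == producer) {
            if (s->len < STRAND_LEN) s->pcs[s->len++] = consumer;
            return;                           /* full strands are left alone */
        }
    }
    if (fu->num_strands < MAX_STRANDS) {
        Strand *s = &fu->strands[fu->num_strands++];
        s->pcs[0] = producer;
        s->pcs[1] = consumer;
        s->len = 2;
    }
}

/* Feed one instruction, in program order, to the fill unit. */
void fill_unit_observe(FillUnit *fu, const Inst *in) {
    const int srcs[2] = { in->src1, in->src2 };
    for (int s = 0; s < 2; s++) {
        if (srcs[s] < 0) continue;
        assert(srcs[s] < NUM_REGS);
        OperandEntry *e = &fu->table[srcs[s]];
        e->consumer_pc = in->pc;
        e->consumer_alu = in->is_alu;
        e->consumer_count++;
    }
    if (in->dest < 0) return;
    assert(in->dest < NUM_REGS);
    OperandEntry *e = &fu->table[in->dest];
    /* Redefinition: the old value can have no more readers. */
    if (e->valid && e->producer_alu && e->consumer_count == 1 && e->consumer_alu)
        link_pair(fu, e->producer_pc, e->consumer_pc);
    *e = (OperandEntry){ .valid = true, .producer_pc = in->pc,
                         .producer_alu = in->is_alu };
}
```

A zero-initialized `FillUnit` is a valid empty state. The sketch only extends a strand at its tail, so a pair discovered before its predecessor pair starts a separate strand; a real design would need a policy for that case.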

Strand Cache
About 175 bytes per line, though very few lines are needed for effect
[Figure: strand cache line layout, with fields labeled status bits, instructions (i1, i2, i3), previous reader info (seen, pc, inst for this instruction, source 1, and source 2), and pc, ready, value]

Dispatch Engine
• Watches for strand cache matches
• Inserts ready strands into the stream eagerly
• Removes component instructions when seen
• Correctness checking with dirty table
[Figure: decode passes pre-renamed instructions to the dispatch engine, which works with the strand cache and dirty table and passes strands, recovery strands, and kill signals on to rename]

Closed-Loop ALUs
• Full bypass is half of the execute stage delay
• Regular ALUs with double-speed closed-loop mode
  • two dependent ALU operations in a single cycle
  • intermediate values (the transients) are discarded!
  • final result still takes ½ cycle for full bypass
[Figure: ALU (½ cycle) with a "free" local bypass, a mode switch, and the full bypass network (½ cycle)]
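The timing argument can be made concrete with a little arithmetic. The sketch below is a simplified model under assumptions taken from the bullets above: one ALU operation takes half a cycle, the full bypass takes the other half, and the local bypass is free. It counts latency in half-cycles and ignores clock-edge quantization; the function names are hypothetical.

```c
#include <assert.h>

/* Latency, in half-cycles, of a chain of n dependent ALU operations when
 * every result goes through the full bypass network (ALU + bypass each). */
unsigned normal_chain_half_cycles(unsigned n_ops) {
    return 2u * n_ops;
}

/* Same chain in closed-loop mode: intermediate results use the free local
 * bypass, and only the final result pays the half-cycle full bypass. */
unsigned closed_loop_half_cycles(unsigned n_ops) {
    assert(n_ops > 0);
    return n_ops + 1u;
}
```

Under this model a three-instruction strand takes 4 half-cycles (2 cycles) instead of 6 half-cycles (3 cycles), and a two-instruction strand takes 3 half-cycles instead of 4.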

Oops… Dirty Read
• R1 is dirty!
• insert recovery sub-strand to recover R1
[Figure: the strand R1 + R2, + R3, + R4 giving R9, shown alongside an instruction load ← 16[R1] that reads R1]

Oops… Anti-Dependence Violation
• R9 has already been replaced
• insert load immediate of previous value
• renaming is not sufficient: this is outside the reorder buffer safety net
[Figure: the strand R1 + R2, + R3, + R4 giving R9, shown alongside an instruction load 32[R9] that needs the previous value of R9]

Instruction Coverage
Average ALU instruction coverage:
16-entry strand cache: 12%
1024-entry strand cache: 27%
High coverage rates, but only with a big strand cache.
Less than a 15% replacement rate, regardless of cache size
[Figure: coverage with various strand cache sizes]

IPC Improvements
Average IPC speedup:
4-wide: 17%
8-wide: 20%
Some benchmarks almost double in IPC
Some see almost no speedup at all
[Figure: 4-wide IPC speedup with 16-entry strand cache]

Resource Occupancy
• CISCification of instructions reduces traffic
  • reorder buffer occupancy is reduced by up to 37%
  • issue queue occupancy is reduced by up to 34%
  • traffic reduction tracks coverage
• Reduced dependence on dependencies
  • opportunity for pipelined bypass
  • opportunity for pipelined issue
[Figure: two strands, each a chain of + operations]

Resource Occupancy
• Caveat emptor
  • more worst case issue CAMs
  • more worst case register ports
• Prior work applicable
  • only 1.2 live inputs / strand
[Figure: two strands, each a chain of + operations]

Conclusion
• Key points:
  • eagerly executing macro-instructions yields value precomputation
  • limiting focus to transient operands
  • all new hardware off critical path
• Results:
  • IPC speedup of 17-20% with 3KB strand cache
  • potential for frequency gains
  • full binary compatibility
• Lots of current and future research:
  • relaxed constraint of ALU instructions
  • quantified frequency improvements
  • static detection of strands
Questions?

Backup Slides

Sensitivity to Dispatch Delay
On average, speedup only drops 1% with three cycles of delay
Most benchmarks lose a small amount of speedup
Some actually get faster due to fewer errant strands
[Figure: 4-wide IPC speedup with 16-entry strand cache]
