Decoupled Pipelines: Rationale, Analysis, and Evaluation

Presentation Transcript


  1. Decoupled Pipelines: Rationale, Analysis, and Evaluation • Frederick A. Koopmans, Sanjay J. Patel • Department of Computer Engineering, University of Illinois at Urbana-Champaign

  2. Outline • Introduction & Motivation • Background • DSEP Design • Average Case Optimizations • Experimental Results

  3. Motivation • Why Asynchronous? • No clock skew • No clock distribution circuitry • Lower power (potentially) • Increased modularity • But what about performance? • What is the architectural benefit of removing the clock? • Decoupled Pipelines!

  4. Motivation • Advantages of a Decoupled Pipeline • Pipeline achieves average-case performance • Rarely taken critical paths no longer affect performance • New potential for average-case optimizations

  5. Synchronous vs. Decoupled • [Figure: side-by-side pipelines. Synchronous: a global clock drives the latches between Stage1, Stage2, and Stage3. Decoupled: per-stage self-timed controllers (Control1, Control2, Control3) exchange go/ack events and pass data between stages through elastic buffers.] • The synchronizing mechanism changes from a synchronous latch plus global clock to an asynchronous communication protocol plus self-timed logic and elastic buffers

  6. Outline • Introduction & Motivation • Background • DSEP Design • Average Case Optimizations • Experimental Results

  7. Self-Timed Logic • [Figure: a self-timing circuit sits beside the unchanged computational circuit; start enters the self-timing circuit, done leaves it, and input/output flow through the computational circuit] • Bounded Delay Model • Definition: event = signal transition • A start event is provided when the inputs are available • A done event is produced when the outputs are stable • The delay is fixed, based on critical-path analysis • The computational circuit is unchanged
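
A minimal discrete-event sketch of the bounded delay model in Python. The simulator, class names, and the 120-unit delay are illustrative assumptions, not the authors' implementation:

    import heapq

    class EventSim:
        """Tiny discrete-event simulator: schedule(fn, delay) then run()."""
        def __init__(self):
            self.now, self._q, self._n = 0, [], 0
        def schedule(self, fn, delay):
            self._n += 1                      # tie-breaker so the heap never compares fns
            heapq.heappush(self._q, (self.now + delay, self._n, fn))
        def run(self):
            while self._q:
                self.now, _, fn = heapq.heappop(self._q)
                fn()

    class SelfTimedStage:
        """Self-timing circuit wrapped around an unchanged computational circuit."""
        def __init__(self, sim, compute, fixed_delay):
            self.sim, self.compute = sim, compute
            self.fixed_delay = fixed_delay    # fixed, from critical-path analysis

        def start(self, inputs, done):
            # start event: inputs are available. The done event fires once the
            # outputs are guaranteed stable, i.e. after the bounded delay.
            outputs = self.compute(inputs)
            self.sim.schedule(lambda: done(outputs), self.fixed_delay)

    sim = EventSim()
    stage = SelfTimedStage(sim, lambda ab: ab[0] + ab[1], fixed_delay=120)
    stage.start((2, 3), lambda out: print(f"t={sim.now}: done, out={out}"))
    sim.run()   # prints: t=120: done, out=5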

  8. Asynchronous Logic Gates • [Figure: symbols for the C, XOR, and SEL gates] • C-gate (logical AND): waits for events to arrive on both inputs • XOR-gate (logical OR): waits for an event to arrive on either input • SEL-gate (logical DEMUX): routes the input event to one of its outputs
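
The event semantics of the three gates can be sketched in a few lines of Python; the class shapes are assumptions for illustration, only the wait/route behavior comes from the slide:

    class CGate:                       # logical AND: waits for events on both inputs
        def __init__(self, fire):
            self.pending, self.fire = set(), fire
        def event(self, port):         # port is "a" or "b"
            self.pending.add(port)
            if self.pending == {"a", "b"}:
                self.pending.clear()   # consume the pair, emit one output event
                self.fire()

    class XorGate:                     # logical OR: either input event propagates
        def __init__(self, fire):
            self.fire = fire
        def event(self, _port):
            self.fire()

    class SelGate:                     # logical DEMUX: routes the event by condition
        def __init__(self, out0, out1):
            self.outs = (out0, out1)
        def event(self, select):       # select is 0 or 1
            self.outs[select]()

    c = CGate(lambda: print("C fired"))
    c.event("a"); c.event("b")         # fires only once both events have arrived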

  9. Asynchronous Communication Protocol • [Figure: a sender stage and a receiver stage joined by go, ack, and data wires; the waveform shows go and ack transitions framing data_1 and data_2 in Transactions 1 and 2] • 2-step, event-triggered, level-insensitive protocol • Transactions are encoded in go/ack events • Asynchronously passes instructions between stages
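
One plausible encoding of the 2-step protocol in Python: each transaction is a single go transition answered by a single ack transition, and only transitions (not wire levels) carry meaning. The class and its API are assumptions for illustration:

    class Channel:
        """Two-phase (event-triggered) handshake between two pipeline stages."""
        def __init__(self):
            self.go = self.ack = 0      # current wire levels; equal means idle
            self.data = None

        def send(self, value):          # sender side
            assert self.go == self.ack, "previous transaction still outstanding"
            self.data = value
            self.go ^= 1                # go event: data is valid

        def receive(self):              # receiver side
            assert self.go != self.ack, "no pending transaction"
            value = self.data
            self.ack ^= 1               # ack event: data has been consumed
            return value

    ch = Channel()
    ch.send("inst_1"); print(ch.receive())   # transaction 1
    ch.send("inst_2"); print(ch.receive())   # transaction 2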

  10. Outline • Introduction & Motivation • Background • DSEP Design • Average Case Optimizations • Experimental Results

  11. DSEP Microarchitecture • Decoupled, Self-Timed, Elastic Pipeline • [Figure: pipeline diagram feeding from the I-cache through Fetch, Decode, Rename, Read/Reorder, Issue, Execute (with a data-read path), Commit, Writeback, and Retire, with flush and result paths] • At a high level: • 9-stage dynamic pipeline • Multiple instruction issue • Multiple functional units • Out-of-order execution • Looks like the Intel P6 microarchitecture • So what's the difference?

  12. DSEP Microarchitecture • Decoupled, Self-Timed, Elastic Pipeline • [Figure: the same pipeline diagram as slide 11] • Decoupled: • Each stage controls its own latency, based on its local critical path • Stage balancing is not important • Each stage can have several different latencies, selected based on its inputs • The pipeline operates at several different speeds simultaneously!

  13. Pipeline Elasticity • [Figure: Fetch, Execute, and Retire separated by buffers] • Definition: the pipeline's ability to stretch with the latency of its instruction stream • Global elasticity • Provided by the reservation stations and reorder buffer • Same for synchronous and asynchronous pipelines • When Execute stalls, the buffers allow Fetch and Retire to keep operating

  14. Pipeline Elasticity • [Figure: the slide 11 pipeline diagram with micropipelines between the stages] • Local elasticity • Needed for a completely decoupled pipeline • Provided by micropipelines: variable-length queues between stages • Efficient implementation, little overhead • They behave like shock absorbers (see the sketch below)
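
A micropipeline can be modeled as a small bounded FIFO between two decoupled stages. A minimal sketch, assuming an illustrative capacity and API:

    from collections import deque

    class Micropipeline:
        """Variable-length queue between two decoupled stages (sketch)."""
        def __init__(self, capacity=2):
            self.q, self.capacity = deque(), capacity
        def can_push(self):             # producer stalls only when the queue is full
            return len(self.q) < self.capacity
        def can_pop(self):              # consumer stalls only when the queue is empty
            return bool(self.q)
        def push(self, inst):
            self.q.append(inst)
        def pop(self):
            return self.q.popleft()

    # A fast producer can run ahead of a momentarily slow consumer by up to
    # 'capacity' instructions, absorbing latency spikes like a shock absorber.
    mp = Micropipeline(capacity=2)
    for i in range(3):
        if mp.can_push():
            mp.push(f"inst_{i}")        # inst_2 must wait until the consumer pops
    print(mp.pop(), len(mp.q))          # inst_0 1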

  15. Outline • Introduction & Motivation • Background • DSEP Design • Average Case Optimizations • Experimental Results

  16. Analysis • Synchronous processor: • Each stage runs at the speed of the worst-case stage running its worst-case operation • Designer's focus: critical paths and stage balancing • DSEP: • Each stage runs at the speed of its own average operation • Designer's focus: optimizing the most common operation • This is the fundamental advantage of the decoupled pipeline (a worked example follows)
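
To make the contrast concrete, here is a toy back-of-the-envelope comparison in Python. The latencies and operation mix are hypothetical, chosen only for illustration; they are not measurements from the paper.

    # Hypothetical per-operation latencies (time units) and dynamic mix.
    lat = {"short": 50, "long": 120}
    mix = {"short": 0.90, "long": 0.10}

    # A clocked stage must budget its period for the worst-case operation;
    # a decoupled stage pays only the mix-weighted average.
    sync_latency = max(lat.values())                     # 120
    dsep_latency = sum(mix[op] * lat[op] for op in lat)  # 0.9*50 + 0.1*120 = 57.0
    print(sync_latency, dsep_latency)                    # the DSEP stage is ~2.1x faster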

  17. Average-Case Optimizations • [Figure: a generic stage in which select logic steers the inputs to either a short or a long operation and a mux merges the outputs] • Consider a generic example: if the short op is much more common, stage latency is dominated by the select logic plus the short path • Designer's strategy: • Implement fine-grained latency tuning • Avoid the latency of untaken paths
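
A minimal sketch of that generic stage, assuming illustrative delays for the select logic and the two paths (none of these numbers or names come from the talk):

    # Sketch: average-case stage = cheap select logic + short/long paths + mux.
    # Only the taken path contributes to the instruction's latency.
    def generic_stage(x, is_short, short_op, long_op,
                      t_select=5, t_short=20, t_long=120):
        """Return (result, elapsed time units) for one input."""
        if is_short(x):                              # select logic picks the path
            return short_op(x), t_select + t_short   # common case: fast path
        return long_op(x), t_select + t_long         # rare case: full latency

    # With a 90% short-op mix this stage averages 5 + 0.9*20 + 0.1*120 = 35
    # time units, versus a clock period that must cover 5 + 120.
    print(generic_stage(7, lambda v: v < 16, lambda v: v + 1, lambda v: v * 2))
    # -> (8, 25): the short path was taken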

  18. Average-Case ALU • [Figure: in the ALU's self-timing circuit, a SEL gate routes the start event down an arithmetic, logic, shift, or compare delay chain and an XOR gate merges them into done; the ALU computational circuit is unchanged] • Tune the ALU latency to closely match the input operation • ALU performance becomes proportional to the average op rather than the worst-case op • The computational circuit is unchanged
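
A sketch of the ALU's input-dependent timing, assuming a four-way split matching the slide's arithmetic/logic/shift/compare delay chains; the delay values and opcode names are invented for illustration:

    # Sketch: the SEL gate routes the start event to the delay chain matching
    # the operation class; the XOR gate merges whichever chain fires into done.
    ALU_DELAY = {"logic": 20, "shift": 40, "compare": 40, "arith": 80}  # assumed

    def alu_self_timed(op, a, b):
        """Return (result, time units until the done event) for one operation."""
        result = {"arith": a + b,
                  "logic": a & b,
                  "shift": a << (b & 31),
                  "compare": int(a < b)}[op]   # computational circuit: unchanged
        return result, ALU_DELAY[op]           # done fires after the op's own delay

    print(alu_self_timed("logic", 12, 10))     # (8, 20): fast op, fast done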

  19. Average-Case Decoder • [Figure: in the decoder's self-timing circuit, a SEL gate routes the start event down a delay chain for Format 1, 2, or 3 and an XOR gate merges them into done] • Tune the decoder latency to match the input instruction • Common instructions often have simple encodings • Prioritize the most frequent instructions

  20. Average-Case Fetch Alignment • [Figure: an "Aligned?" check selects between the raw fetched instruction block and the align/mask path] • Optimize for aligned fetch blocks • If the fetch block is aligned on a cache line, it skips the alignment and masking overhead • The optimization is effective when software/hardware alignment optimizations are effective
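
A sketch of the aligned-fetch fast path, assuming a 32-byte line size and toy read/align helpers; all of the names here are hypothetical:

    LINE_SIZE = 32  # bytes per cache line (assumed)

    def fetch_block(addr, read_line, align_and_mask):
        """Return the fetch block at addr, skipping align/mask when aligned."""
        line = read_line(addr)
        if addr % LINE_SIZE == 0:           # block starts on a line boundary
            return line                     # fast path: no align/mask overhead
        return align_and_mask(line, addr)   # slow path: rotate and mask the block

    # Toy usage: lines are byte strings; alignment slices into position.
    mem = {0: b"A" * LINE_SIZE, 32: b"B" * LINE_SIZE}
    read = lambda a: mem[a - a % LINE_SIZE]
    align = lambda line, a: line[a % LINE_SIZE:]
    print(len(fetch_block(32, read, align)))   # 32: aligned, fast path
    print(len(fetch_block(40, read, align)))   # 24: unaligned, pays align/mask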

  21. Average-Case Cache Access • [Figure: a "To same line?" check selects between the previously read line and a fresh cache read] • Optimize for consecutive reads to the same cache line • Subsequent references to the same line skip the cache access entirely • Effective for small-stride access patterns and, in the I-cache, tight loops • Very little overhead for non-consecutive references
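
A sketch of the same-line short-circuit, assuming a one-entry "previous line" buffer in front of the cache; the field names and line size are illustrative:

    class LineBuffer:
        """One-entry previous-line buffer in front of a cache (sketch)."""
        def __init__(self, read_line, line_size=32):
            self.read_line, self.line_size = read_line, line_size
            self.tag = self.line = None

        def read(self, addr):
            tag = addr // self.line_size
            if tag != self.tag:              # non-consecutive reference:
                self.tag = tag               # pay the full cache access
                self.line = self.read_line(addr)
            return self.line                 # same-line references skip the cache

    accesses = []
    lb = LineBuffer(lambda a: accesses.append(a) or f"line@{a}")
    for addr in (100, 104, 108, 140):        # small-stride access pattern
        lb.read(addr)
    print(len(accesses))                     # 2 cache accesses for 4 references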

  22. Average-Case Comparator • [Figure: a 4-bit compare and a 32-bit compare feed a mux; the 4-bit result is selected when the low bits already differ] • Optimize for the case that a difference exists in the lower 4 bits of the inputs • The 4-bit comparison is more than 50% faster than the 32-bit comparison • Very effective for iterative loops • Can be extended to tag comparisons
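
A sketch of the comparator's fast path as an inequality test: try the low 4 bits first and fall back to the full 32-bit compare only when they tie. The structure follows the slide; the code itself is an illustrative assumption:

    def not_equal(a, b):
        """32-bit inequality with an average-case 4-bit fast path."""
        if (a ^ b) & 0xF:                    # low 4 bits differ: short compare wins
            return True                      # done, without the 32-bit path's latency
        return ((a ^ b) & 0xFFFFFFFF) != 0   # rare full-width fallback

    print(not_equal(7, 8))     # True  via the 4-bit path (loop counters usually hit this)
    print(not_equal(16, 32))   # True  via the 32-bit fallback (low bits tie)
    print(not_equal(5, 5))     # False via the 32-bit fallback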

  23. Outline • Introduction & Motivation • Background • DSEP Design • Average Case Optimizations • Experimental Results

  24. Simulation Environment • VHDL simulator using the Renoir Design Suite • MIPS I instruction set • Fetch and Retire bandwidth = 1 • Execute bandwidth ≤ 4 • 4-entry split instruction window • 64-entry reorder buffer • Benchmarks: • BS: 50-element bubble sort • MM: 10x10 integer matrix multiply

  25. Two Pipeline Configurations • "Synchronous" clock period = 120 time units

    Operation               DSEP Latencies                  Fixed Latencies
    Fetch                   100                             120
    Decode                  50/80/120                       120
    Rename                  80/120/150                      120
    Read                    120                             120
    Execute                 20/40/80/100/130/150/360/600    120/360/600
    Retire                  5/100/150                       120
    Caches                  100                             120
    Main Memory             960                             960
    Micropipeline Register  5                               5

  26. DSEP Performance • [Chart: execution time for the Fixed and DSEP configurations on each benchmark] • Compared the Fixed and DSEP configurations • DSEP increased performance by 28% for BS and 21% for MM

  27. Micropipeline Performance • Goals: • Determine the need for local elasticity • Determine appropriate queue lengths • Method: • Evaluate DSEP configurations of the form AxBxC • A = micropipeline length in Decode, Rename, and Retire • B = micropipeline length in Read • C = micropipeline length in Execute • All configurations include a fixed-length instruction window and reorder buffer

  28. Micropipeline Performance • [Chart: percent speedup over 1x1x1 for Bubble-Sort and Matrix-Multiply] • Measured percent speedup over the 1x1x1 configuration • 2x2x1 was best for both benchmarks: a 2.4% performance improvement for BS and 1.7% for MM • Stalls in Fetch were reduced by 60% with 2x2x1

  29. OOO Engine Utilization • [Chart: utilization of the instruction window and reorder buffer] • Measured out-of-order engine utilization for the instruction window (IW) and reorder buffer (RB) • Utilization = average number of instructions in the buffer • IW utilization was up 75%, RB utilization up 40%

  30. Total Performance • [Chart: execution time for the Fixed and DSEP 2x2x1 configurations] • Compared the Fixed and DSEP configurations • DSEP 2x2x1 increased performance by 29% for BS and 22% for MM

  31. Conclusions • Decoupled self-timing • Average-case optimizations significantly increase performance • Rarely taken critical paths no longer matter • Elasticity • Removes the pipeline jitter that decoupled operation introduces • Increases utilization of existing resources • Not as important as the average-case optimizations (at least in our experiments)

  32. Questions?
