A Scalable Front-End Architecture for Fast Instruction Delivery


Presentation Transcript


  1. A Scalable Front-End Architecture for Fast Instruction Delivery Paper by: Glenn Reinman, Todd Austin and Brad Calder Presenter: Alexander Choong

  2. Conventional Pipeline Architecture • High-performance processors can be broken down into two parts • Front-end: fetches and decodes instructions • Execution core: executes instructions

  3. Front-End and Pipeline [Diagram: a simple front-end, with Fetch followed by Decode repeated in successive cycles]

  4. Front-End with Prediction [Diagram: the simple front-end with a Predict stage added alongside Fetch and Decode in each cycle]

  5. Front-End Issues I • Flynn’s bottleneck: • IPC is bounded by the number of instructions fetched per cycle • Implies: as execution performance increases, the front-end must keep up to ensure overall performance
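
A concrete illustration (using the baseline machine described later in this talk): if the execution core can issue 16 instructions per cycle but the front-end fetches only 8 instructions per cycle, sustained IPC can never exceed 8, no matter how fast the core is.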

  6. Front-End Issues II • Two opposing forces • Designing a faster front-end pushes toward increasing the I-cache size • The interconnect scaling problem (wire performance does not scale with feature size) pushes toward decreasing the I-cache size

  7. Key Contributions I

  8. Key Contributions: Fetch Target Queue • Objective • Avoid using a large cache with branch prediction • Purpose • Decouple the I-cache from branch prediction • Results • Improves throughput

  9. Key Contributions: Fetch Target Buffer • Objective • Avoid large caches with branch prediction • Implementation • A multi-level buffer • Results • Delivers performance 25% better than a single-level design • Scales better with “future” feature sizes

  10. Outline • Scalable Front-End and Components • Fetch Target Queue • Fetch Target Buffer • Experimental Methodology • Results • Analysis and Conclusion

  11. Fetch Target Queue • Decouples the I-cache from branch prediction • Branch predictor can generate predictions independent of when the I-cache uses them [Diagram: simple front-end, with Fetch and Predict coupled in each cycle]

  12. Fetch Target Queue • Decouples the I-cache from branch prediction • Branch predictor can generate predictions independent of when the I-cache uses them [Diagram: front-end with the FTQ, where Predict runs ahead of Fetch]

  13. Fetch Target Queue • Fetch and predict can have different latencies • Allows the I-cache to be pipelined • As long as they have the same throughput
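
A minimal C sketch of this decoupling (queue depth, field names, and types are illustrative assumptions, not taken from the paper): the branch predictor enqueues fetch-block descriptors into a small ring buffer, and the I-cache stage dequeues them whenever it is ready, so the two only have to agree on throughput, not latency.

      #include <stdbool.h>
      #include <stdint.h>

      #define FTQ_ENTRIES 8                  /* illustrative queue depth */

      typedef struct {                       /* one predicted fetch block */
          uint32_t start_pc;                 /* first instruction of the block */
          uint32_t fallthrough_pc;           /* address just past the block */
          bool     taken;                    /* predicted direction of the ending branch */
          uint32_t target_pc;                /* predicted target if taken */
      } ftq_entry;

      typedef struct {
          ftq_entry q[FTQ_ENTRIES];
          unsigned  head, tail, count;
      } ftq;

      /* Predictor side: enqueue a fetch block if there is room. */
      static bool ftq_push(ftq *f, ftq_entry e) {
          if (f->count == FTQ_ENTRIES) return false;   /* FTQ full: predictor stalls */
          f->q[f->tail] = e;
          f->tail = (f->tail + 1) % FTQ_ENTRIES;
          f->count++;
          return true;
      }

      /* I-cache side: dequeue the next fetch block when the fetch stage is ready. */
      static bool ftq_pop(ftq *f, ftq_entry *out) {
          if (f->count == 0) return false;             /* FTQ empty: fetch stalls */
          *out = f->q[f->head];
          f->head = (f->head + 1) % FTQ_ENTRIES;
          f->count--;
          return true;
      }

Because the producer and consumer touch only the queue, the I-cache can be pipelined over several cycles while the predictor keeps running ahead.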

  14. Fetch Blocks • FTQ stores fetch blocks • A fetch block is a sequence of instructions • Starting at a branch target • Ending at a strongly biased branch • Instructions are fed directly into the pipeline

  15. Outline • Scalable Front-End and Components • Fetch Target Queue • Fetch Target Buffer • Experimental Methodology • Results • Analysis and Conclusion

  16. Fetch Target Buffer: Outline • Review: Branch Target Buffer • Fetch Target Buffer • Fetch Blocks • Functionality

  17. Review: Branch Target Buffer I • Previous work (Perleberg and Smith [2]) • Makes fetch independent of predict [Diagram: simple front-end vs. a front-end with a Branch Target Buffer, where Fetch and Predict proceed in parallel]

  18. Review: Branch Target Buffer II • Characteristics • Hash table • Makes predictions • Caches prediction information
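
A minimal C sketch of that structure (table size, tag layout, and the 2-bit counter are illustrative assumptions): the fetch PC hashes into a small table whose entries cache the branch's target and prediction state, so a hit produces a predicted next PC right away.

      #include <stdbool.h>
      #include <stdint.h>

      #define BTB_INDEX_BITS 9
      #define BTB_ENTRIES (1u << BTB_INDEX_BITS)     /* illustrative: 512 entries */

      typedef struct {
          uint32_t tag;          /* upper PC bits identifying the branch */
          uint32_t target;       /* cached branch target address */
          uint8_t  counter;      /* 2-bit saturating prediction counter */
          bool     valid;
      } btb_entry;

      static btb_entry btb[BTB_ENTRIES];

      /* Look up the fetch PC; on a hit, produce a predicted next PC in the same cycle. */
      static bool btb_lookup(uint32_t pc, uint32_t *next_pc) {
          unsigned idx = (pc >> 2) & (BTB_ENTRIES - 1);      /* simple hash: low PC bits */
          uint32_t tag = pc >> (2 + BTB_INDEX_BITS);
          if (!btb[idx].valid || btb[idx].tag != tag)
              return false;                                  /* miss: fall back to PC + 4 */
          *next_pc = (btb[idx].counter >= 2) ? btb[idx].target : pc + 4;
          return true;
      }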

  19. Review: Branch Target Buffer III [Diagram: BTB indexed by the PC]

  20. FTB Optimizations over BTB • Multi-level • Solves a conundrum • Need a small cache (for fast access) • Need enough space to successfully predict branches

  21. FTB Optimizations over BTB • Oversize bit • Indicates if a block is larger than a cache line • With a multi-port cache • Allows several smaller blocks to be loaded at the same time

  22. FTB Optimizations over BTB • Only stores a partial fall-through address • The fall-through address is close to the current PC • Only need to store an offset
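
A small C sketch of this encoding (the 5-bit width and 4-byte instructions are assumptions for illustration): only the distance from the block's start PC is stored, and the full fall-through address is rebuilt on a hit.

      #include <stdint.h>

      #define FALLTHRU_BITS 5     /* illustrative: enough for short fetch blocks */

      /* Pack: keep only the distance (in instructions) from the block's start PC.
         Blocks whose fall-through lies farther away are simply not encoded this way. */
      static uint8_t encode_fallthrough(uint32_t start_pc, uint32_t fallthrough_pc) {
          return (uint8_t)(((fallthrough_pc - start_pc) >> 2) & ((1u << FALLTHRU_BITS) - 1));
      }

      /* Unpack: rebuild the full fall-through address from the start PC and the offset. */
      static uint32_t decode_fallthrough(uint32_t start_pc, uint8_t offset) {
          return start_pc + ((uint32_t)offset << 2);
      }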

  23. FTB Optimizations over BTB • Doesn’t store every block: • Fall-through blocks • Blocks that are seldom taken

  24. Fetch Target Buffer • Target: target of the branch that ends the block • Type: conditional, subroutine call/return • Oversize: set if block size > cache line [Diagram: FTB entry fields used to form the next PC]
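
Put together, an FTB entry might look roughly like the C struct below (field widths are illustrative; a real design packs these into a handful of hardware bits, with the partial fall-through offset from the earlier slide):

      #include <stdint.h>

      typedef struct {
          uint32_t tag;              /* identifies the fetch block's start PC */
          uint32_t target;           /* target of the branch that ends the block */
          uint8_t  fallthru_offset;  /* partial fall-through address (a few bits) */
          uint8_t  branch_type;      /* conditional, subroutine call, or return */
          uint8_t  oversize;         /* 1 if the block is larger than a cache line */
          uint8_t  valid;
      } ftb_entry;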

  25. Fetch Target Buffer

  26. PC used as index into FTB

  27. L1 Hit [Diagram: the lookup hits in the L1 FTB]

  28. Branch Not Taken [Diagram: L1 hit with the branch predicted not taken; fetch continues at the fall-through address]

  30. Branch Taken [Diagram: L1 hit with the branch predicted taken; fetch continues at the branch target]

  31. L1 Miss [Diagram: the L1 FTB misses; the front-end falls through to the next sequential block]

  32. L1 Miss, L2 Hit [Diagram: the L1 FTB misses and the front-end falls through; the L2 FTB hits after an N-cycle delay]

  33. L1 and L2 Miss [Diagram: both FTB levels miss; the fall-through guess eventually results in a misprediction]
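
The walkthrough on the last few slides can be summarised in roughly the following self-contained C sketch (the function names, the stubbed lookups, and the fall-through distance are illustrative stand-ins, not the paper's interfaces):

      #include <stddef.h>
      #include <stdint.h>

      #define FETCH_BLOCK_BYTES 64    /* illustrative fall-through distance on a miss */

      typedef struct {                /* simplified view of an L1 FTB hit */
          uint32_t target;            /* predicted target if taken */
          uint32_t fallthru;          /* fall-through address if not taken */
          int      taken;             /* prediction for the branch ending the block */
      } ftb_hit;

      /* Stubs standing in for the real L1/L2 FTB arrays. */
      static ftb_hit *ftb_l1_lookup(uint32_t pc) { (void)pc; return NULL; }
      static void     ftb_l2_probe(uint32_t pc)  { (void)pc; }  /* L2 hit, if any, arrives after an N-cycle delay */

      /* Pick the start PC of the next fetch block (slides 26-33). */
      static uint32_t next_fetch_pc(uint32_t pc) {
          ftb_hit *e = ftb_l1_lookup(pc);
          if (e)                                   /* L1 hit */
              return e->taken ? e->target          /* predicted taken     -> target */
                              : e->fallthru;       /* predicted not taken -> fall through */
          ftb_l2_probe(pc);                        /* L1 miss: probe the larger, slower L2 */
          return pc + FETCH_BLOCK_BYTES;           /* fall through for now; if the L2 also
                                                      misses, this guess is later squashed
                                                      as a misprediction */
      }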

  34. Hybrid branch prediction • Meta-predictor selects between • Local history predictor • Global history predictor • Bimodal predictor
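
A rough C sketch of such a hybrid scheme (the table size, selection encoding, and stubbed component predictors are all illustrative): each component produces a taken/not-taken guess, and a meta-predictor table indexed by the PC chooses which component to believe.

      #include <stdbool.h>
      #include <stdint.h>

      #define META_ENTRIES 1024      /* illustrative meta-predictor table size */

      /* 0 = use bimodal, 1 = use local history, 2 = use global history (illustrative encoding) */
      static uint8_t meta[META_ENTRIES];

      /* Component predictors, stubbed out; each returns a taken/not-taken guess. */
      static bool bimodal_predict(uint32_t pc)        { (void)pc; return false; }
      static bool local_history_predict(uint32_t pc)  { (void)pc; return false; }
      static bool global_history_predict(uint32_t pc) { (void)pc; return true;  }

      static bool hybrid_predict(uint32_t pc) {
          unsigned idx = (pc >> 2) & (META_ENTRIES - 1);
          switch (meta[idx]) {
          case 1:  return local_history_predict(pc);
          case 2:  return global_history_predict(pc);
          default: return bimodal_predict(pc);
          }
      }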

  35. Branch Prediction [Diagram: global history predictor]

  36. Branch Prediction

  37. Committing Results • When full, the SHQ (speculative history queue) commits its oldest value to the local history or the global history
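
A minimal sketch of that commit step (the queue depth, table sizes, and the assumption that each entry updates either the global or one local history are illustrative guesses, not details from the paper):

      #include <stdbool.h>
      #include <stdint.h>

      #define SHQ_ENTRIES 8                   /* illustrative queue depth */

      typedef struct {
          uint32_t pc;                        /* branch this speculative outcome belongs to */
          bool     taken;                     /* speculative outcome */
          bool     is_global;                 /* which history it should update */
      } shq_entry;

      static shq_entry shq[SHQ_ENTRIES];
      static unsigned  shq_head, shq_count;

      static uint32_t global_history;         /* architectural global history register */
      static uint8_t  local_history[1024];    /* architectural per-branch local histories */

      /* Called when the SHQ is full: retire the oldest speculative outcome into real history. */
      static void shq_commit_oldest(void) {
          shq_entry *e = &shq[shq_head];
          if (e->is_global)
              global_history = (global_history << 1) | (uint32_t)e->taken;
          else {
              unsigned idx = (e->pc >> 2) & 1023;
              local_history[idx] = (uint8_t)((local_history[idx] << 1) | (uint8_t)e->taken);
          }
          shq_head = (shq_head + 1) % SHQ_ENTRIES;
          shq_count--;
      }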

  38. Outline • Scalable Front-End and Components • Fetch Target Queue • Fetch Target Buffer • Experimental Methodology • Results • Analysis and Conclusion

  39. Experimental Methodology I • Baseline Architecture • Processor • 8-instruction fetch with 16-instruction issue per cycle • 128-entry reorder buffer with a 32-entry load/store buffer • 8-cycle minimum branch misprediction penalty • Cache • 64K 2-way instruction cache • 64K 4-way data cache (pipelined)

  40. Experimental Methodology II • Timing Model • CACTI cache compiler • Models on-chip memory • Modified for 0.35 um, 0.18 um, and 0.10 um processes • Test set • 6 SPEC95 benchmarks • 2 C++ programs

  41. Outline • Scalable Front-End and Components • Fetch Target Queue • Fetch Target Buffer • Experimental Methodology • Results • Analysis and Conclusion

  42. Comparing FTB to BTB • FTB provides slightly better performance • Tested for various cache sizes: 64, 256, 1K, 4K, and 8K entries

  43. Comparing Multi-level FTB to Single-Level FTB • Two-level FTB performance • Smaller average fetch size • Two-level: 6.6 • Single-level: 7.5 • Higher accuracy on average • Two-level: 83.3% • Single-level: 73.1% • Higher performance • 25% average speedup over the single-level design

  44. Fall-through Bits Used • Number of fall-through bits needed: 4-5 • Because fetch distances over 16 instructions do not improve performance
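
One way to see why 4-5 bits are enough (assuming the offset is stored at instruction granularity): a fall-through distance capped at 16 instructions needs only log2(16) = 4 offset bits, with a fifth bit giving some slack for longer or differently aligned blocks.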

  45. FTQ Occupancy • Roughly indicates throughput • On average, the FTQ is • Empty: 21.1% of the time • Full: 10.7% of the time

  46. Scalability • The two-level FTB scales well with feature size • Higher slope is better

  47. Outline • Scalable Front-End and Components • Fetch Target Queue • Fetch Target Buffer • Experimental Methodology • Results • Analysis and Conclusion

  48. Analysis • 25% improvement in IPC over the best-performing single-level designs • System scales well with feature size • On average, FTQ is non-empty 21.1% of the time • The FTB design requires at most 5 bits for the fall-through address

  49. Conclusion • FTQ and FTB design • Decouples the I-cache from branch prediction • Produces higher throughput • Uses a multi-level buffer • Produces better scalability

  50. References • [1] A Scalable Front-End Architecture for Fast Instruction Delivery. Glenn Reinman, Todd Austin, and Brad Calder. ACM/IEEE 26th Annual International Symposium on Computer Architecture, May 1999. • [2] Branch Target Buffer Design and Optimization. Chris Perleberg and Alan Smith. Technical Report, December 1989.
