Flexicache: Software-based Instruction Caching for Embedded Processors


1. Flexicache: Software-based Instruction Caching for Embedded Processors
Jason E. Miller and Anant Agarwal, Raw Group, MIT CSAIL

2. Outline
• Introduction
• Baseline Implementation
• Optimizations
• Energy
• Conclusions

3. Hardware Instruction Caches
• Used in virtually all high-performance general-purpose processors
• Good performance
  • Decreases average memory access time
• Easy to use
  • Transparent operation
[Figure: a processor chip with an on-chip I-cache feeding the pipeline, backed by off-chip DRAM]

4. I-Cache-less Processors
• Embedded processors and DSPs
  • TMS470, ADSP-21xx, etc.
• Embedded multicore processors
  • IBM Cell SPE
• No special-purpose hardware
  • Less design/verification time
  • Less area
  • Shorter cycle time
  • Less energy per access
  • Predictable behavior
• Much harder to program!
  • Manually partition code and transfer pieces from DRAM
[Figure: a processor chip with a plain on-chip SRAM instead of an I-cache, backed by off-chip DRAM]

5. Software-based I-Caching
• Use a software system to virtualize instruction memory by recreating hardware cache functionality
  • Automatic management of simple SRAM memory
  • Good performance with no extra programming effort
• Integrated into each individual application
  • Customized to the program's needs
  • Optimize for different goals
  • Real-time predictability
• Maintain low-cost, high-speed hardware

6. Outline
• Introduction
• Baseline Implementation
• Optimizations
• Energy
• Conclusions

7. Flexicache System Overview
[Figure: the programmer's original binary passes through the Binary Rewriter; the rewritten binary is linked with the Flexicache runtime library into the Flexicache binary, which resides in DRAM while the runtime manages the processor's I-mem]

8. Binary Rewriter
• Break up the user program into cache blocks
• Modify control flow that leaves the blocks
[Figure: the Binary Rewriter splits the program into cache blocks and redirects their exits to the Flexicache runtime]

9. Rewriter: Details
• One basic block in each cache block, but…
  • Fixed size of 16 instructions
  • Simplifies bookkeeping
  • Requires padding of small blocks and splitting of large ones
• Control-flow instructions that leave a block are modified to jump to the runtime system
  • E.g., BEQ $2,$3,foo → JEQL $2,$3,runtime
  • Original destination addresses stored in a table (sketched below)
  • Fall-through jumps added at the ends of blocks
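
A minimal sketch of the bookkeeping this implies; all names here (dest_entry, record_rewrite, the table size) are illustrative, not from the Flexicache source:

    #include <stdint.h>

    #define BLOCK_WORDS 16                 /* fixed cache-block size */

    /* Maps each rewritten branch back to its original virtual
     * destination ("original destination addresses stored in table"). */
    typedef struct { uint32_t site; uint32_t orig_dest; } dest_entry;

    static dest_entry dest_table[4096];
    static int n_dests;

    /* Called once per rewritten control-flow instruction.  The real
     * rewriter also patches the instruction itself, e.g.
     * BEQ $2,$3,foo -> JEQL $2,$3,runtime. */
    static void record_rewrite(uint32_t site, uint32_t orig_dest)
    {
        dest_table[n_dests].site = site;
        dest_table[n_dests].orig_dest = orig_dest;
        n_dests++;
    }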

10. Runtime: Overview
• Stays resident in I-mem
• Receives requests from cache blocks (hit/miss path sketched below)
  • See if the requested block is resident
  • Load the new block from DRAM if necessary
  • Evict blocks to make room
  • Transfer control to the new block
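
A minimal C sketch of that hit/miss path, with illustrative sizes and an assumed dram_fetch helper (none of these names come from the Flexicache source):

    #include <stdint.h>

    #define BLOCK_WORDS 16
    #define NUM_FRAMES  128       /* illustrative I-mem capacity */
    #define NUM_BLOCKS  65536     /* illustrative program size in blocks */

    extern void dram_fetch(uint32_t block_id, uint32_t *dst);  /* assumed */

    static uint32_t imem[NUM_FRAMES][BLOCK_WORDS];  /* stand-in for I-mem */
    static int frame_of[NUM_BLOCKS];   /* block id -> frame; init to -1 */
    static uint32_t block_in[NUM_FRAMES];
    static int fifo_head;              /* oldest frame = next victim */

    /* Check residency, evict the oldest block if needed, fetch from
     * DRAM, then return the I-mem address to transfer control to. */
    static uint32_t *lookup_or_load(uint32_t block_id)
    {
        if (frame_of[block_id] < 0) {              /* miss */
            int victim = fifo_head;                /* FIFO: evict oldest */
            fifo_head = (fifo_head + 1) % NUM_FRAMES;
            frame_of[block_in[victim]] = -1;       /* invalidate old owner */
            dram_fetch(block_id, imem[victim]);    /* load from DRAM */
            frame_of[block_id] = victim;
            block_in[victim] = block_id;
        }
        return imem[frame_of[block_id]];
    }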

11. Runtime Operation
[Figure: loaded cache blocks (Block 0–Block 3) enter the runtime system through separate entry points for branches, fall-throughs, and indirect jumps (JR); on a miss, the miss handler sends a request to DRAM and receives the block in reply]

12. System Policies and Mechanisms
• Fully-associative cache block placement
• Replacement policy: FIFO
  • Evict the oldest block in the cache
  • Matches sequential execution
• Pinned functions (victim selection sketched below)
  • Key feature for timing predictability
  • No cache overhead within the function
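
One way pinning can coexist with FIFO eviction, as a hedged sketch (the pinned/fifo_head names are illustrative, and it assumes at least one frame is always unpinned):

    #include <stdbool.h>

    #define NUM_FRAMES 128
    static bool pinned[NUM_FRAMES];   /* set when a pinned function loads */
    static int fifo_head;

    /* FIFO victim selection that skips pinned frames, so a pinned
     * function never leaves I-mem and incurs no cache overhead. */
    static int pick_victim(void)
    {
        while (pinned[fifo_head])
            fifo_head = (fifo_head + 1) % NUM_FRAMES;
        int v = fifo_head;
        fifo_head = (fifo_head + 1) % NUM_FRAMES;
        return v;
    }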

13. Experimental Setup
• Implemented for a tile in the Raw multicore processor
  • Similar to many embedded processors
  • 32-bit single-issue in-order MIPS pipeline
  • 32 kB SRAM I-mem
• Raw simulator
  • Cycle-accurate
  • Idealized I/O model
  • SRAM I-mem or traditional hardware I-cache models
  • Uses Wattch to estimate energy consumption
• Mediabench benchmark suite
  • Multimedia applications for embedded processors

14. Baseline Performance
[Chart: Flexicache overhead per benchmark]
Overhead: number of additional cycles relative to a 32 kB, 2-way HW cache

15. Outline
• Introduction
• Baseline Implementation
• Optimizations
• Energy
• Conclusions

16. Basic Chaining
• Problem: The hit case in the runtime system takes about 40 cycles
• Solution: Modify the jump to the runtime system so that it jumps directly to the loaded code the next time (sketched below)
[Figure: without chaining, Blocks A–D all transfer control through the runtime system; with chaining, resident blocks jump directly to each other]
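
A hedged sketch of the patch itself (encode_jump is an assumed helper, not a Flexicache API; the real system patches Raw instruction words in I-mem):

    #include <stdint.h>

    extern uint32_t encode_jump(uint32_t target);   /* assumed helper */

    /* Once the destination block is resident, overwrite the call
     * site's jump-to-runtime with a direct jump to the block's I-mem
     * address; later executions skip the ~40-cycle hit path. */
    static void chain(uint32_t *call_site, const uint32_t *dest_in_imem)
    {
        /* models a 32-bit embedded target, so the cast is lossless */
        *call_site = encode_jump((uint32_t)(uintptr_t)dest_in_imem);
    }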

17. Basic Chaining Performance
[Chart: Flexicache overhead per benchmark with basic chaining]

19. Function Call Chaining
• Problem: Function calls were not being chained
  • Compound instructions (like jump-and-link) handle two virtual addresses
    • Load the return address into the link register
    • Jump to the destination address
• Solution:
  • Decompose them in the rewriter (sketched below)
  • The jump can then be chained normally at runtime
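
A rough rewriter-side sketch of that decomposition (encode_load_ra and encode_jump are assumed helpers, not Flexicache APIs):

    #include <stdint.h>

    extern uint32_t encode_load_ra(uint32_t ret_addr);  /* assumed */
    extern uint32_t encode_jump(uint32_t target);       /* assumed */

    /* Split one jump-and-link into a link step plus a plain jump:
     *   Original:   JAL foo       -- sets the link register and jumps
     *   Rewritten:  <load $ra>    -- return address handled separately
     *               J runtime     -- ordinary jump, chainable later   */
    static void decompose_jal(uint32_t *slot, uint32_t ret_addr,
                              uint32_t runtime_entry)
    {
        slot[0] = encode_load_ra(ret_addr);
        slot[1] = encode_jump(runtime_entry);
    }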

20. Function Call Chaining Performance
[Chart: Flexicache overhead per benchmark with function call chaining]

21. Replacement Policy
• Problem: Too much bookkeeping
  • Chains must be backed out if the destination block is evicted
• Idea 1: With a FIFO replacement policy, there is no need to record chains from old to young
• Idea 2: Limit the number of chains to each block
• Solution: Flush replacement policy (sketched below)
  • Evict everything and start fresh
  • No need to undo or track chains
  • Increased miss rate vs. FIFO
[Figure: blocks A–D ordered from older to newer, with an unchaining table listing the chains into each block that would have to be backed out on eviction]
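
A minimal sketch of the flush policy (illustrative names; compare it with the per-eviction unchaining it replaces):

    #define NUM_BLOCKS 65536          /* illustrative program size */
    static int frame_of[NUM_BLOCKS];  /* block id -> frame; -1 if absent */
    static int next_free;

    /* Rather than unchaining on every eviction, evict everything at
     * once.  All existing chains pointed into code that is now gone,
     * so no unchaining table is needed. */
    static void flush_all(void)
    {
        for (int i = 0; i < NUM_BLOCKS; i++)
            frame_of[i] = -1;
        next_free = 0;                /* refill I-mem from the top */
    }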

22. Flush Policy Performance
[Chart: Flexicache overhead per benchmark with the flush policy]

23. Indirect Jump Chaining
• Problem: A different destination on each execution (e.g., JR $31)
• Solution: Pre-screen addresses and chain each one individually (sketched below)
  • if $31==A: JMP A
  • if $31==B: JMP B
  • if $31==C: JMP C
• But…
  • Screening takes time
  • Which addresses should we chain?
[Figure: an indirect jump into block A, B, or C is replaced by the compare chain above, which dispatches directly to whichever resident block matches]
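
A C rendering of that compare chain (the runtime actually emits it as patched instructions; screen_entry, runtime_indirect, and the chain limit are illustrative assumptions):

    #include <stdint.h>

    #define MAX_CHAINS 3              /* per-jump chain limit (assumed) */

    typedef struct {
        uint32_t  vaddr;              /* virtual target seen earlier */
        uint32_t *imem;               /* that block's I-mem location */
    } screen_entry;

    extern uint32_t *runtime_indirect(uint32_t vaddr);  /* assumed slow path */

    /* "if $31==A: JMP A; if $31==B: JMP B; ..." with a fallback to
     * the runtime for any target not yet screened. */
    static uint32_t *screen(const screen_entry *chain, int n, uint32_t reg31)
    {
        for (int i = 0; i < n && i < MAX_CHAINS; i++)
            if (chain[i].vaddr == reg31)
                return chain[i].imem;
        return runtime_indirect(reg31);
    }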

24. Indirect Jump Chaining Performance
[Chart: Flexicache overhead per benchmark with indirect jump chaining]

25. Fixed-size Block Padding

    00008400 <L2B1>:
      8400: mfsr  $r9,28
      8404: rlm   $r9,$r9,0x4,0x0
      8408: jnel+ $r9,$0, _dispatch.entry1
      840c: jal   _dispatch.entry2
      8410: nop
      8414: nop
      8418: nop
      841c: nop
      …

• Padding for small blocks wastes more space than expected
  • The average basic block contains 5.5 instructions
  • The most common size is 3
  • 60–65% of storage space is wasted on NOPs

26. 8-word Cache Blocks
• Reduce the cache block size to better fit basic blocks
  • Less padding → less wasted space → lower miss rate
  • Bookkeeping structures get bigger → higher miss rate
  • More block splits → higher miss rate and overhead
• Allow up to 4 consecutive blocks to be loaded together (sketched below)
  • Effectively creates 8-, 16-, 24- and 32-word blocks
  • Avoids splitting up large basic blocks
• Performance benefits
  • Amortizes the cost of a call into the runtime
  • Overlaps DRAM fetches
  • Eliminates jumps used to split large blocks
• Also used to add extra space for runtime JR chaining
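
A hedged sketch of the multi-block load (run_length_of, alloc_frames, and dram_fetch_words are assumed helpers standing in for rewriter metadata, the block allocator, and the DRAM interface):

    #include <stdint.h>

    #define BLOCK_WORDS 8
    #define MAX_RUN     4             /* up to 4 consecutive blocks */

    extern int run_length_of(uint32_t block_id);                /* assumed */
    extern uint32_t *alloc_frames(int nblocks);                 /* assumed */
    extern void dram_fetch_words(uint32_t first_block,
                                 uint32_t *dst, int nwords);    /* assumed */

    /* Fetch a run of consecutive 8-word blocks with a single runtime
     * call, amortizing the entry cost and overlapping DRAM fetches. */
    static uint32_t *load_run(uint32_t block_id)
    {
        int n = run_length_of(block_id);       /* 1..MAX_RUN */
        if (n > MAX_RUN) n = MAX_RUN;
        uint32_t *dst = alloc_frames(n);       /* contiguous I-mem frames */
        dram_fetch_words(block_id, dst, n * BLOCK_WORDS);
        return dst;
    }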

27. 8-word Blocks Performance
[Chart: Flexicache overhead per benchmark with 8-word blocks]

28. Performance Summary
• Good performance on 6 of 9 benchmarks: 5–11% overhead
• G721 (24.2% overhead)
  • Indirect jumps
• Mesa (24.4% overhead)
  • Indirect jumps, high miss rate
• Rasta (93.6% overhead)
  • High miss rate, indirect jumps
• The majority of the remaining overhead is due to modifications to user code, not runtime calls
  • Fall-through jumps added by the rewriter
  • Indirect jump chain comparisons

29. Outline
• Introduction
• Baseline Implementation
• Optimizations
• Energy
• Conclusions

30. Energy Analysis
• SRAM uses less energy than a cache for each access
  • No tags and no unused cache ways
  • Saves about 9% of total processor power
• Additional instructions for software management use extra energy
• Total energy is roughly proportional to the number of cycles
• The software I-cache will use less total energy if instruction overhead stays below about 9% (worked through below)
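
A quick sanity check of that threshold, treating the slide's 9% power saving as exact: if the SRAM design draws 0.91× the cached design's power and the software cache takes (1 + x)× as many cycles, total energy scales as E_sw / E_hw ≈ 0.91 × (1 + x), which stays below 1 for x < 0.09 / 0.91 ≈ 9.9%, consistent with the roughly-9% break-even quoted above.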

31. Energy Results
• Wattch used with CACTI models for the SRAM and the I-cache
• The 32 kB, 2-way set-associative HW cache accounts for 25% of total power
• Total energy to complete each benchmark was calculated

32. Conclusions
• Software-based instruction caching can be a practical solution for embedded processors
• Provides the programming convenience of a HW cache
  • Performance and energy similar to a HW cache
  • Overhead < 10% on several benchmarks
  • Energy savings of up to 3.8%
• Maintains the advantages of an I-cache-less architecture
  • Low-cost hardware
  • Real-time guarantees
http://cag.csail.mit.edu/raw

33. Questions?
http://cag.csail.mit.edu/raw
