Flexicache: Software-based Instruction Caching for Embedded Processors


1. Flexicache: Software-based Instruction Caching for Embedded Processors
Jason E. Miller and Anant Agarwal, Raw Group, MIT CSAIL

2. Outline
• Introduction
• Baseline Implementation
• Optimizations
• Energy
• Conclusions

3. Hardware Instruction Caches
• Used in virtually all high-performance general-purpose processors
• Good performance
  • Decreases average memory access time
• Easy to use
  • Transparent operation
[Figure: a processor chip with an on-chip I-cache feeding the pipeline, backed by off-chip DRAM]

4. I-Cache-less Processors
• Embedded processors and DSPs
  • TMS470, ADSP-21xx, etc.
• Embedded multicore processors
  • IBM Cell SPE
• No special-purpose hardware
  • Less design/verification time
  • Less area
  • Shorter cycle time
  • Less energy per access
  • Predictable behavior
• Much harder to program!
  • Manually partition code and transfer pieces from DRAM
[Figure: a processor chip with a plain on-chip SRAM instead of an I-cache, backed by off-chip DRAM]

5. Software-based I-Caching
• Use a software system to virtualize instruction memory by recreating hardware cache functionality
  • Automatic management of simple SRAM memory
  • Good performance with no extra programming effort
• Integrated into each individual application
  • Customized to the program's needs
  • Optimize for different goals
  • Real-time predictability
• Maintain low-cost, high-speed hardware

6. Outline
• Introduction
• Baseline Implementation
• Optimizations
• Energy
• Conclusions

7. Flexicache System Overview
[Figure: the programmer's original binary passes through the Binary Rewriter; the rewritten binary is linked with the Flexicache runtime library into the Flexicache binary, which resides in DRAM while the runtime manages the processor's I-mem]

8. Binary Rewriter
• Break up the user program into cache blocks
• Modify control flow that leaves the blocks
[Figure: the Binary Rewriter splits the program into cache blocks and redirects their exits to the Flexicache runtime]

9. Rewriter: Details
• One basic block in each cache block, but…
  • Fixed size of 16 instructions
  • Simplifies bookkeeping
  • Requires padding of small blocks and splitting of large ones
• Control-flow instructions that leave a block are modified to jump to the runtime system
  • E.g., BEQ $2,$3,foo → JEQL $2,$3,runtime
  • Original destination addresses stored in a table (sketched below)
  • Fall-through jumps added at the ends of blocks
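
A minimal sketch of the bookkeeping this implies; all names here (dest_entry, record_rewrite, the table size) are illustrative, not from the Flexicache source:

    #include <stdint.h>

    #define BLOCK_WORDS 16                 /* fixed cache-block size */

    /* Maps each rewritten branch back to its original virtual
     * destination ("original destination addresses stored in table"). */
    typedef struct { uint32_t site; uint32_t orig_dest; } dest_entry;

    static dest_entry dest_table[4096];
    static int n_dests;

    /* Called once per rewritten control-flow instruction.  The real
     * rewriter also patches the instruction itself, e.g.
     * BEQ $2,$3,foo -> JEQL $2,$3,runtime. */
    static void record_rewrite(uint32_t site, uint32_t orig_dest)
    {
        dest_table[n_dests].site = site;
        dest_table[n_dests].orig_dest = orig_dest;
        n_dests++;
    }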

10. Runtime: Overview
• Stays resident in I-mem
• Receives requests from cache blocks (hit/miss path sketched below)
  • See if the requested block is resident
  • Load the new block from DRAM if necessary
  • Evict blocks to make room
  • Transfer control to the new block
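
A minimal C sketch of that hit/miss path, with illustrative sizes and an assumed dram_fetch helper (none of these names come from the Flexicache source):

    #include <stdint.h>

    #define BLOCK_WORDS 16
    #define NUM_FRAMES  128       /* illustrative I-mem capacity */
    #define NUM_BLOCKS  65536     /* illustrative program size in blocks */

    extern void dram_fetch(uint32_t block_id, uint32_t *dst);  /* assumed */

    static uint32_t imem[NUM_FRAMES][BLOCK_WORDS];  /* stand-in for I-mem */
    static int frame_of[NUM_BLOCKS];   /* block id -> frame; init to -1 */
    static uint32_t block_in[NUM_FRAMES];
    static int fifo_head;              /* oldest frame = next victim */

    /* Check residency, evict the oldest block if needed, fetch from
     * DRAM, then return the I-mem address to transfer control to. */
    static uint32_t *lookup_or_load(uint32_t block_id)
    {
        if (frame_of[block_id] < 0) {              /* miss */
            int victim = fifo_head;                /* FIFO: evict oldest */
            fifo_head = (fifo_head + 1) % NUM_FRAMES;
            frame_of[block_in[victim]] = -1;       /* invalidate old owner */
            dram_fetch(block_id, imem[victim]);    /* load from DRAM */
            frame_of[block_id] = victim;
            block_in[victim] = block_id;
        }
        return imem[frame_of[block_id]];
    }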

11. Runtime Operation
[Figure: loaded cache blocks (Block 0–Block 3) enter the runtime system through separate entry points for branches, fall-throughs, and indirect jumps (JR); on a miss, the miss handler sends a request to DRAM and receives the block in reply]

12. System Policies and Mechanisms
• Fully-associative cache block placement
• Replacement policy: FIFO
  • Evict the oldest block in the cache
  • Matches sequential execution
• Pinned functions (victim selection sketched below)
  • Key feature for timing predictability
  • No cache overhead within the function
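
One way pinning can coexist with FIFO eviction, as a hedged sketch (the pinned/fifo_head names are illustrative, and it assumes at least one frame is always unpinned):

    #include <stdbool.h>

    #define NUM_FRAMES 128
    static bool pinned[NUM_FRAMES];   /* set when a pinned function loads */
    static int fifo_head;

    /* FIFO victim selection that skips pinned frames, so a pinned
     * function never leaves I-mem and incurs no cache overhead. */
    static int pick_victim(void)
    {
        while (pinned[fifo_head])
            fifo_head = (fifo_head + 1) % NUM_FRAMES;
        int v = fifo_head;
        fifo_head = (fifo_head + 1) % NUM_FRAMES;
        return v;
    }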

13. Experimental Setup
• Implemented for a tile in the Raw multicore processor
  • Similar to many embedded processors
  • 32-bit single-issue in-order MIPS pipeline
  • 32 kB SRAM I-mem
• Raw simulator
  • Cycle-accurate
  • Idealized I/O model
  • SRAM I-mem or traditional hardware I-cache models
  • Uses Wattch to estimate energy consumption
• Mediabench benchmark suite
  • Multimedia applications for embedded processors

14. Baseline Performance
[Chart: Flexicache overhead per benchmark]
Overhead: number of additional cycles relative to a 32 kB, 2-way HW cache

15. Outline
• Introduction
• Baseline Implementation
• Optimizations
• Energy
• Conclusions

16. Basic Chaining
• Problem: The hit case in the runtime system takes about 40 cycles
• Solution: Modify the jump to the runtime system so that it jumps directly to the loaded code the next time (sketched below)
[Figure: without chaining, Blocks A–D all transfer control through the runtime system; with chaining, resident blocks jump directly to each other]
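
A hedged sketch of the patch itself (encode_jump is an assumed helper, not a Flexicache API; the real system patches Raw instruction words in I-mem):

    #include <stdint.h>

    extern uint32_t encode_jump(uint32_t target);   /* assumed helper */

    /* Once the destination block is resident, overwrite the call
     * site's jump-to-runtime with a direct jump to the block's I-mem
     * address; later executions skip the ~40-cycle hit path. */
    static void chain(uint32_t *call_site, const uint32_t *dest_in_imem)
    {
        /* models a 32-bit embedded target, so the cast is lossless */
        *call_site = encode_jump((uint32_t)(uintptr_t)dest_in_imem);
    }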

17. Basic Chaining Performance
[Chart: Flexicache overhead per benchmark with basic chaining]

19. Function Call Chaining
• Problem: Function calls were not being chained
  • Compound instructions (like jump-and-link) handle two virtual addresses
    • Load the return address into the link register
    • Jump to the destination address
• Solution:
  • Decompose them in the rewriter (sketched below)
  • The jump can then be chained normally at runtime
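
A rough rewriter-side sketch of that decomposition (encode_load_ra and encode_jump are assumed helpers, not Flexicache APIs):

    #include <stdint.h>

    extern uint32_t encode_load_ra(uint32_t ret_addr);  /* assumed */
    extern uint32_t encode_jump(uint32_t target);       /* assumed */

    /* Split one jump-and-link into a link step plus a plain jump:
     *   Original:   JAL foo       -- sets the link register and jumps
     *   Rewritten:  <load $ra>    -- return address handled separately
     *               J runtime     -- ordinary jump, chainable later   */
    static void decompose_jal(uint32_t *slot, uint32_t ret_addr,
                              uint32_t runtime_entry)
    {
        slot[0] = encode_load_ra(ret_addr);
        slot[1] = encode_jump(runtime_entry);
    }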

20. Function Call Chaining Performance
[Chart: Flexicache overhead per benchmark with function call chaining]

21. Replacement Policy
• Problem: Too much bookkeeping
  • Chains must be backed out if the destination block is evicted
• Idea 1: With a FIFO replacement policy, there is no need to record chains from old to young
• Idea 2: Limit the number of chains to each block
• Solution: Flush replacement policy (sketched below)
  • Evict everything and start fresh
  • No need to undo or track chains
  • Increased miss rate vs. FIFO
[Figure: blocks A–D ordered from older to newer, with an unchaining table listing the chains into each block that would have to be backed out on eviction]
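
A minimal sketch of the flush policy (illustrative names; compare it with the per-eviction unchaining it replaces):

    #define NUM_BLOCKS 65536          /* illustrative program size */
    static int frame_of[NUM_BLOCKS];  /* block id -> frame; -1 if absent */
    static int next_free;

    /* Rather than unchaining on every eviction, evict everything at
     * once.  All existing chains pointed into code that is now gone,
     * so no unchaining table is needed. */
    static void flush_all(void)
    {
        for (int i = 0; i < NUM_BLOCKS; i++)
            frame_of[i] = -1;
        next_free = 0;                /* refill I-mem from the top */
    }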

22. Flush Policy Performance
[Chart: Flexicache overhead per benchmark with the flush policy]

23. Indirect Jump Chaining
• Problem: A different destination on each execution (e.g., JR $31)
• Solution: Pre-screen addresses and chain each one individually (sketched below)
  • if $31==A: JMP A
  • if $31==B: JMP B
  • if $31==C: JMP C
• But…
  • Screening takes time
  • Which addresses should we chain?
[Figure: an indirect jump into block A, B, or C is replaced by the compare chain above, which dispatches directly to whichever resident block matches]
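
A C rendering of that compare chain (the runtime actually emits it as patched instructions; screen_entry, runtime_indirect, and the chain limit are illustrative assumptions):

    #include <stdint.h>

    #define MAX_CHAINS 3              /* per-jump chain limit (assumed) */

    typedef struct {
        uint32_t  vaddr;              /* virtual target seen earlier */
        uint32_t *imem;               /* that block's I-mem location */
    } screen_entry;

    extern uint32_t *runtime_indirect(uint32_t vaddr);  /* assumed slow path */

    /* "if $31==A: JMP A; if $31==B: JMP B; ..." with a fallback to
     * the runtime for any target not yet screened. */
    static uint32_t *screen(const screen_entry *chain, int n, uint32_t reg31)
    {
        for (int i = 0; i < n && i < MAX_CHAINS; i++)
            if (chain[i].vaddr == reg31)
                return chain[i].imem;
        return runtime_indirect(reg31);
    }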

24. Indirect Jump Chaining Performance
[Chart: Flexicache overhead per benchmark with indirect jump chaining]

25. Fixed-size Block Padding

    00008400 <L2B1>:
      8400: mfsr  $r9,28
      8404: rlm   $r9,$r9,0x4,0x0
      8408: jnel+ $r9,$0, _dispatch.entry1
      840c: jal   _dispatch.entry2
      8410: nop
      8414: nop
      8418: nop
      841c: nop
      …

• Padding for small blocks wastes more space than expected
  • The average basic block contains 5.5 instructions
  • The most common size is 3
  • 60–65% of storage space is wasted on NOPs

26. 8-word Cache Blocks
• Reduce the cache block size to better fit basic blocks
  • Less padding → less wasted space → lower miss rate
  • Bookkeeping structures get bigger → higher miss rate
  • More block splits → higher miss rate and overhead
• Allow up to 4 consecutive blocks to be loaded together (sketched below)
  • Effectively creates 8-, 16-, 24- and 32-word blocks
  • Avoids splitting up large basic blocks
• Performance benefits
  • Amortizes the cost of a call into the runtime
  • Overlaps DRAM fetches
  • Eliminates jumps used to split large blocks
• Also used to add extra space for runtime JR chaining
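
A hedged sketch of the multi-block load (run_length_of, alloc_frames, and dram_fetch_words are assumed helpers standing in for rewriter metadata, the block allocator, and the DRAM interface):

    #include <stdint.h>

    #define BLOCK_WORDS 8
    #define MAX_RUN     4             /* up to 4 consecutive blocks */

    extern int run_length_of(uint32_t block_id);                /* assumed */
    extern uint32_t *alloc_frames(int nblocks);                 /* assumed */
    extern void dram_fetch_words(uint32_t first_block,
                                 uint32_t *dst, int nwords);    /* assumed */

    /* Fetch a run of consecutive 8-word blocks with a single runtime
     * call, amortizing the entry cost and overlapping DRAM fetches. */
    static uint32_t *load_run(uint32_t block_id)
    {
        int n = run_length_of(block_id);       /* 1..MAX_RUN */
        if (n > MAX_RUN) n = MAX_RUN;
        uint32_t *dst = alloc_frames(n);       /* contiguous I-mem frames */
        dram_fetch_words(block_id, dst, n * BLOCK_WORDS);
        return dst;
    }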

27. 8-word Blocks Performance
[Chart: Flexicache overhead per benchmark with 8-word blocks]

28. Performance Summary
• Good performance on 6 of 9 benchmarks: 5–11% overhead
• G721 (24.2% overhead)
  • Indirect jumps
• Mesa (24.4% overhead)
  • Indirect jumps, high miss rate
• Rasta (93.6% overhead)
  • High miss rate, indirect jumps
• The majority of the remaining overhead is due to modifications to user code, not runtime calls
  • Fall-through jumps added by the rewriter
  • Indirect jump chain comparisons

29. Outline
• Introduction
• Baseline Implementation
• Optimizations
• Energy
• Conclusions

30. Energy Analysis
• SRAM uses less energy than a cache for each access
  • No tags and no unused cache ways
  • Saves about 9% of total processor power
• Additional instructions for software management use extra energy
• Total energy is roughly proportional to the number of cycles
• The software I-cache will use less total energy if instruction overhead stays below about 9% (worked through below)
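
A quick sanity check of that threshold, treating the slide's 9% power saving as exact: if the SRAM design draws 0.91× the cached design's power and the software cache takes (1 + x)× as many cycles, total energy scales as E_sw / E_hw ≈ 0.91 × (1 + x), which stays below 1 for x < 0.09 / 0.91 ≈ 9.9%, consistent with the roughly-9% break-even quoted above.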

31. Energy Results
• Wattch used with CACTI models for the SRAM and the I-cache
• The 32 kB, 2-way set-associative HW cache accounts for 25% of total power
• Total energy to complete each benchmark was calculated

32. Conclusions
• Software-based instruction caching can be a practical solution for embedded processors
• Provides the programming convenience of a HW cache
  • Performance and energy similar to a HW cache
  • Overhead < 10% on several benchmarks
  • Energy savings of up to 3.8%
• Maintains the advantages of an I-cache-less architecture
  • Low-cost hardware
  • Real-time guarantees
http://cag.csail.mit.edu/raw

33. Questions?
http://cag.csail.mit.edu/raw
