
Flexicache: Software-based Instruction Caching for Embedded Processors

Jason E. Miller and Anant Agarwal

Raw Group - MIT CSAIL


Outline

  • Introduction

  • Baseline Implementation

  • Optimizations

  • Energy

  • Conclusions


Hardware Instruction Caches

  • Used in virtually all high-performance general-purpose processors

  • Good performance

    • Decreases average memory access time

  • Easy to use

    • Transparent operation

[Diagram: the on-chip I-cache sits between the processor and off-chip DRAM]


I-Cache-less Processors

  • Embedded processors and DSPs

    • TMS470, ADSP-21xx, etc.

  • Embedded multicore processors

    • IBM Cell SPE

  • No special-purpose hardware

    • Less design/verification time

    • Less area

    • Shorter cycle time

    • Less energy per access

    • Predictable behavior

  • Much harder to program!

    • Manually partition code and transfer pieces from DRAM

[Diagram: on-chip SRAM replaces the I-cache between the processor and off-chip DRAM]


Software-based I-Caching

  • Use a software system to virtualize instruction memory by recreating hardware cache functionality

  • Automatic management of simple SRAM memory

    • Good performance with no extra programming effort

  • Integrated into each individual application

    • Customized to program’s needs

    • Optimize for different goals

    • Real-time predictability

  • Maintain low-cost, high-speed hardware


Outline

  • Introduction

  • Baseline Implementation

  • Optimizations

  • Energy

  • Conclusions


Flexicache System Overview

[Diagram: the programmer's original binary passes through the binary rewriter, which combines it with the runtime library to produce a rewritten binary; the linker then emits the final Flexicache binary, which runs out of the processor's on-chip I-mem with DRAM as the backing store]

Binary Rewriter

  • Break up user program into cache blocks

  • Modify control-flow that leaves the blocks

[Diagram: the binary rewriter splits the program into cache blocks whose exits jump into the Flexicache runtime]


Rewriter: Details

  • One basic block in each cache block, but…

    • Fixed size of 16 instructions

      • Simplifies bookkeeping

      • Requires padding of small blocks and splitting of large ones

  • Control-flow instructions that leave a block are modified to jump to the runtime system

    • E.g. BEQ $2,$3,foo → JEQL $2,$3,runtime

    • Original destination addresses stored in table

    • Fall-through jumps at end of blocks
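
As an illustration of this bookkeeping, here is a minimal C sketch of the fixed-size block layout and the destination-address table; the structs and field names are assumptions for illustration, not Flexicache's actual data structures.

    #include <stdint.h>

    #define BLOCK_WORDS 16                 /* fixed cache-block size */

    /* One rewritten cache block: a single basic block, NOP-padded
     * (or split) to exactly 16 instructions. */
    struct cache_block {
        uint32_t insns[BLOCK_WORDS];
    };

    /* Original destinations for a block's exits, recorded by the
     * rewriter so the runtime can translate them after the branch has
     * been redirected to the runtime system. */
    struct exit_table_entry {
        uint32_t taken_vaddr;              /* original branch target */
        uint32_t fallthru_vaddr;           /* jump added at end of block */
    };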


Runtime: Overview

  • Stays resident in I-mem

  • Receives requests from cache blocks

  • Checks whether the requested block is resident

  • Loads the new block from DRAM if necessary

    • Evicts blocks to make room

  • Transfers control to the new block
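
To make the hit/miss path concrete, a minimal self-contained C sketch, assuming a flat DRAM image, a fully-associative block table, and FIFO eviction (the baseline policies described below); dram_image, the slot count, and all names are illustrative assumptions, not Flexicache's actual code.

    #include <stdint.h>
    #include <string.h>

    #define BLOCK_WORDS 16               /* instructions per cache block */
    #define NUM_SLOTS   64               /* I-mem block slots (illustrative) */

    extern const uint32_t dram_image[];  /* program image in DRAM (assumed) */

    static uint32_t imem[NUM_SLOTS][BLOCK_WORDS];  /* block storage in I-mem */
    static uint32_t tags[NUM_SLOTS];     /* virtual block number per slot */
    static int used = 0, oldest = 0;     /* FIFO state */

    /* Hit/miss path: return the I-mem copy of the block containing vaddr,
     * loading it from DRAM (and FIFO-evicting) if it is not resident. */
    uint32_t *runtime_lookup(uint32_t vaddr)
    {
        uint32_t vblock = vaddr / (BLOCK_WORDS * 4);

        for (int i = 0; i < used; i++)   /* fully-associative search */
            if (tags[i] == vblock)
                return imem[i];          /* hit: block is already loaded */

        int slot;                        /* miss: pick a slot */
        if (used < NUM_SLOTS) {
            slot = used++;
        } else {
            slot = oldest;               /* evict the oldest block (FIFO) */
            oldest = (oldest + 1) % NUM_SLOTS;
        }
        memcpy(imem[slot], &dram_image[vblock * BLOCK_WORDS],
               sizeof imem[slot]);       /* fetch the block from "DRAM" */
        tags[slot] = vblock;
        return imem[slot];               /* caller transfers control here */
    }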


Runtime Operation

[Diagram: a branch or fall-through jump in a loaded cache block enters the runtime system through Entry Point 1 or Entry Point 2, while a JR enters through the indirect entry point; on a miss, the miss handler sends a request to DRAM (holding Blocks 0-3), copies the reply, Block 2, into the loaded cache blocks in I-mem, and transfers control to it]


System Policies and Mechanisms

  • Fully-associative cache block placement

  • Replacement Policy: FIFO

    • Evict oldest block in cache

    • Matches sequential execution

  • Pinned functions

    • Key feature for timing predictability

    • No cache overhead within function


Experimental Setup

  • Implemented for a tile in the Raw multicore processor

    • Similar to many embedded processors

    • 32-bit single-issue in-order MIPS pipeline

    • 32 kB SRAM I-mem

  • Raw simulator

    • Cycle-accurate

    • Idealized I/O model

    • SRAM I-mem or traditional hardware I-cache models

    • Uses Wattch to estimate energy consumption

  • Mediabench benchmark suite

    • Multimedia applications for embedded processors


Baseline Performance

[Chart: Flexicache overhead per benchmark. Overhead is the number of additional cycles relative to a 32 kB, 2-way HW cache.]


Outline

  • Introduction

  • Baseline Implementation

  • Optimizations

  • Energy

  • Conclusions



Block

A

Runtime

System

Block

B

Block

C

With

Chaining

Block

D

Basic Chaining

Block

A

Runtime

System

  • Problem: Hit case in runtime system takes about 40 cycles

Block

B

Block

C

Without

Chaining

Block

D

  • Solution: Modify jump to runtime system so that it jumps directly to loaded code the next time
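
A C sketch of the patching step, continuing the earlier sketches; the jump encoding is a hypothetical J-format placeholder, not Raw's actual instruction encoding.

    #include <stdint.h>

    /* Hypothetical J-format encoder: 26-bit word-aligned target. */
    static inline uint32_t encode_jump(const uint32_t *dest)
    {
        return 0x08000000u | (((uintptr_t)dest >> 2) & 0x03FFFFFFu);
    }

    /* Chaining: overwrite the jump-to-runtime instruction at the call
     * site so the next execution bypasses the ~40-cycle hit path. */
    void chain(uint32_t *call_site, const uint32_t *dest)
    {
        *call_site = encode_jump(dest);  /* self-modifying write into I-mem */
        /* A real system must also keep the fetch pipeline coherent here. */
    }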


Basic Chaining Performance

[Chart: Flexicache overhead per benchmark with basic chaining enabled]

Function Call Chaining

  • Problem: Function calls were not being chained

  • Compound instructions (like jump-and-link) handle two virtual addresses

    • Load return address into link register

    • Jump to destination address

  • Solution:

    • Decompose them in the rewriter

    • Jump can be chained normally at runtime
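
For example (illustratively), JAL foo becomes an instruction that loads the return address into the link register, followed by a plain J foo; that plain jump misses into the runtime once and is then chained like any other direct jump.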




older

newer

Unchaining table

A:

B:

C:

D:

Replacement Policy

  • Problem: Too much bookkeeping

    • Chains must be backed out if destination block is evicted

    • Idea 1: With FIFO replacement policy, no need to record chains from old to young

    • Idea 2: Limit # of chains to each block

Block

A

Runtime

System

Block

B

Block

C

Block

D

  • Solution: Flush replacement policy

    • Evict everything and start fresh

    • No need to undo or track chains

    • Increased miss rate vs FIFO

 D

 C

 A
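
Continuing the runtime_lookup sketch (same translation unit, same assumed names), flush eviction reduces to a state reset; because every resident block is discarded at once, chained jumps vanish along with the code that contained them:

    /* Flush policy: evict everything and start fresh (reuses the
     * used/oldest FIFO state from the runtime_lookup sketch). */
    void flush(void)
    {
        used = 0;                /* forget all resident blocks */
        oldest = 0;
        /* No unchaining table needed: the chained code itself is gone. */
    }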


Flush Policy Performance

[Chart: Flexicache overhead per benchmark under the flush replacement policy]


Indirect Jump Chaining

  • Problem: An indirect jump (e.g. JR $31) has a different destination on each execution

  • Solution: Pre-screen addresses and chain each one individually

    • if $31==A: JMP A

    • if $31==B: JMP B

    • if $31==C: JMP C

  • But…

    • Screening takes time

    • Which addresses should we chain?
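
A C sketch of that screen; the number of chained targets per jump and all names are assumptions, and runtime_lookup is the slow path from the earlier sketch:

    #include <stdint.h>

    #define JR_SLOTS 3                      /* chained targets per JR (assumed) */

    struct jr_slot { uint32_t vaddr; uint32_t *loc; };
    static struct jr_slot jr_chain[JR_SLOTS];

    extern uint32_t *runtime_lookup(uint32_t vaddr);

    /* Pre-screen: compare the indirect target against recently chained
     * addresses before falling back into the runtime. */
    uint32_t *jr_dispatch(uint32_t target_vaddr)
    {
        for (int i = 0; i < JR_SLOTS; i++)
            if (jr_chain[i].vaddr == target_vaddr)
                return jr_chain[i].loc;      /* chained hit: no runtime call */
        return runtime_lookup(target_vaddr); /* miss: translate, maybe chain */
    }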



Fixed-size Block Padding

00008400 <L2B1>:
    8400:  mfsr  $r9,28
    8404:  rlm   $r9,$r9,0x4,0x0
    8408:  jnel+ $r9,$0,_dispatch.entry1
    840c:  jal   _dispatch.entry2
    8410:  nop
    8414:  nop
    8418:  nop
    841c:  nop

  • Padding for small blocks wastes more space than expected

    • Average basic block contains 5.5 instructions

    • Most common size is 3

    • 60-65% of storage space is wasted on NOPs
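
That figure follows directly from the block geometry: with 16-word blocks and an average basic block of 5.5 instructions, roughly (16 − 5.5) / 16 ≈ 66% of each block is padding, ignoring the control-flow instructions the rewriter adds at block exits.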


8-word Cache Blocks

  • Reduce cache block size to better fit basic blocks

    • Less padding → less wasted space → lower miss rate

    • Bookkeeping structures get bigger → higher miss rate

    • More block splits → higher miss rate, overhead

  • Allow up to 4 consecutive blocks to be loaded together

    • Effectively creates 8, 16, 24 and 32 word blocks

    • Avoid splitting up large basic blocks

  • Performance Benefits

    • Amortize cost of a call into the runtime

    • Overlap DRAM fetches

    • Eliminate jumps used to split large blocks

    • Also used to add extra space for runtime JR chaining
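
As a rough worked example (ignoring the jump the rewriter adds at a split): a 20-instruction basic block occupies three consecutive 8-word blocks (24 words, 4 of them padding), versus two fixed 16-word blocks (32 words, 12 of them padding).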


8-word Blocks Performance

[Chart: Flexicache overhead per benchmark with 8-word cache blocks]


Performance Summary

  • Good performance on 6 of 9 benchmarks: 5-11% overhead

  • G721 (24.2% overhead)

    • Indirect jumps

  • Mesa (24.4% overhead)

    • Indirect jumps, High miss rate

  • Rasta (93.6% overhead)

    • High miss rate, indirect jumps

  • Majority of remaining overhead is due to modifications to user code, not runtime calls

    • Fall-through jumps added by rewriter

    • Indirect jump chain comparisons


Outline

  • Introduction

  • Baseline Implementation

  • Optimizations

  • Energy

  • Conclusions


Energy Analysis

  • SRAM uses less energy than cache for each access

    • No tags and unused cache ways

    • Saves about 9% of total processor power

  • Additional instructions for software management use extra energy

    • Total energy roughly proportional to number of cycles

  • Software I-cache will use less total energy if instruction overhead is below 9%
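
As a back-of-envelope check, with total energy proportional to cycles times average power: E_sw / E_hw = (1 + overhead) × (1 − 0.09), which is below 1 whenever overhead < 0.09 / 0.91 ≈ 9.9%, i.e. roughly the 9% threshold quoted above.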


Energy Results

  • Wattch used with CACTI models for SRAM and I-cache

    • 32 kB, 2-way set-associative HW cache, 25% of total power

  • Total energy to complete each benchmark was calculated


Conclusions

  • Software-based instruction caching can be a practical solution for embedded processors

  • Provides the programming convenience of a HW cache

  • Performance and energy similar to a HW cache

    • Overhead < 10% on several benchmarks

    • Energy savings of up to 3.8%

  • Maintains the advantages of an I-cache-less architecture

    • Low-cost hardware

    • Real-time guarantees

http://cag.csail.mit.edu/raw



Questions?

http://cag.csail.mit.edu/raw