Block Cache for Embedded Systems

Dominic Hillenbrand and Jörg Henkel

Chair for Embedded Systems (CES)

University of Karlsruhe

Karlsruhe, Germany


Outline

  • Motivation

  • Related Work

  • State of the art: “Instruction Cache”

  • Our approach: “Block Cache”

    • Workflow (Instruction Selection / Simulation)

    • Assumptions & Constraints

    • Algorithm

  • Results

  • Summary


Motivation

  • On-chip area is expected to increase enormously(!)

[Figure: several CPU + I-Cache pairs sharing off-chip memory, contrasted with CPUs served by a Block Cache built from 1..N memory blocks of instructions (SRAM cells)]

  • “Latency lags bandwidth”, David A. Patterson, Commun. ACM, 2004 [1]

  • Efficiency: power consumption and area

  • Caches generally consume more power than on-chip memory [2,3,4]


Related Work

  • S. Steinke, L. Wehmeyer, B. Lee, P. Marwedel, “Assigning Program and Data Objects to Scratchpad for Energy Reduction” – DATE ’02

    • Statically partition on- and off-chip memory

  • S. Steinke, N. Grunwald, L. Wehmeyer, R. Banakar, M. Balakrishnan, P. Marwedel, “Reducing energy consumption by dynamic copying of instructions to on-chip memory” – ISSS ’02

    • Statically determine code copying points

  • P. Francesco, P. Marchal, D. Atienza, L. Benini, F. Catthoor, J. Mendias, “An integrated hw/sw-approach for run-time scratchpad management” – DAC ’04

    • DMA to accelerate run-time movement of data into on-chip memory

  • B. Egger, J. Lee, H. Shin, “Scratchpad memory management for portable systems with a memory management unit” – EMSOFT ’06

    • MMU to map between on- and off-chip memory (we share the µTLB)


“State of the Art”: Instruction Cache

[Figure: the conventional design; each CPU fetches through its own I-Cache from the shared off-chip memory]


Architecture: Instruction Cache

[Figure: a set-associative instruction cache; the address splits into Tag / Set / Offset fields, the Set bits index the tag and data arrays, the stored tags are compared in parallel, and a MUX selects the data of the matching way]
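
To make the field split concrete, here is a minimal sketch of the address decomposition such a cache performs. The line size and set count are assumed values for illustration, not taken from the slides.

```python
# Hypothetical cache geometry (not specified on the slide).
LINE_SIZE = 32    # bytes per cache line  -> 5 offset bits
NUM_SETS = 128    # number of sets        -> 7 set-index bits

def split_address(addr: int) -> tuple[int, int, int]:
    """Split an instruction address into (tag, set index, offset) fields."""
    offset = addr % LINE_SIZE
    set_index = (addr // LINE_SIZE) % NUM_SETS
    tag = addr // (LINE_SIZE * NUM_SETS)
    return tag, set_index, offset

# The set index selects one row of the tag/data arrays; all ways of that set
# compare their stored tag against `tag` in parallel, and a MUX picks the
# data of the way whose comparison hits.
print(split_address(0x0040_12A8))   # -> (1025, 21, 8)
```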


Our approach: Block Cache

[Figure: the same system, with the per-CPU I-Caches replaced by a single Block Cache between the CPUs and the off-chip memory]


Our approach: Block Cache

[Figure: the Block Cache is built from memory blocks B1, B2, …, BN (SRAM cells) plus a small amount of logic]


Architectural Overview: Block Cache

[Figure: on-chip memory blocks B1..BN hold the instructions; the CPU's instruction address is looked up in a µTLB, and on a miss the ControlUnit issues a DMA block load from off-chip memory]

  • Exploits burst transfers (DRAM memory)

  • Area-efficient (SRAM cells)

  • Scalable (up to application size)
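
As a behavioral sketch (not the hardware itself), the fetch path might look like this in software. Block size, block count, and the LRU replacement used here are assumptions; the slides only name the µTLB, the ControlUnit, and the DMA block load.

```python
BLOCK_SIZE = 4096       # fixed code-block size in bytes (assumed)
NUM_BLOCKS = 8          # number of on-chip memory blocks (assumed)

utlb = {}               # off-chip block address -> on-chip block index
lru = []                # block addresses, least recently used first

def dma_copy(block_addr: int, block_index: int) -> None:
    """Stand-in for the DMA engine's burst copy from DRAM into SRAM."""
    pass

def fetch(addr: int) -> tuple[int, int]:
    """Map an instruction address to (on-chip block index, offset)."""
    block_addr = addr - addr % BLOCK_SIZE
    if block_addr not in utlb:                  # µTLB miss -> block load
        if len(utlb) >= NUM_BLOCKS:
            victim = lru.pop(0)                 # evict a block (LRU here)
            index = utlb.pop(victim)
        else:
            index = len(utlb)
        dma_copy(block_addr, index)             # one burst transfer
        utlb[block_addr] = index
    lru[:] = [b for b in lru if b != block_addr] + [block_addr]
    return utlb[block_addr], addr % BLOCK_SIZE

print(fetch(0x1234))    # -> (0, 564), i.e. offset 0x234 inside block 0
```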



Architectural Overview: Block Cache

[Figure: each memory block B1..BN holds 1..N functions; a function such as F1 (PUSH R1 / PUSH R2 / … / POP R2 / POP R1 / RET in assembler) is stored in its binary encoding inside a block]


Function to Block Mapping

[Figure: functions F1..FN are packed into blocks B1, B2, B3, …; the oversized function F6 is split into F6a, F6b, F6c so that every piece fits into a block]

  • Eviction: LRU, Round Robin, ARC, Belady (see the sketch below)
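
The slides leave the policy choice open. As a rough illustration of how the candidates differ, here is a small miss-count simulator for LRU and Belady's optimal policy; the block reference string is made up for the example.

```python
def misses_lru(refs, capacity):
    cache, misses = [], 0
    for b in refs:
        if b in cache:
            cache.remove(b)                  # will re-append as most recent
        else:
            misses += 1
            if len(cache) == capacity:
                cache.pop(0)                 # evict the least recently used
        cache.append(b)                      # most recently used at the end
    return misses

def misses_belady(refs, capacity):
    cache, misses = set(), 0
    for i, b in enumerate(refs):
        if b in cache:
            continue
        misses += 1
        if len(cache) == capacity:
            # Evict the block whose next use lies farthest in the future.
            def next_use(x):
                try:
                    return refs.index(x, i + 1)
                except ValueError:
                    return float("inf")      # never referenced again
            cache.remove(max(cache, key=next_use))
        cache.add(b)
    return misses

refs = ["B1", "B2", "B3", "B1", "B4", "B2", "B1", "B3"]
print(misses_lru(refs, 3), misses_belady(refs, 3))   # -> 6 5
```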


Design Flow : Analysis

[Figure: a software component plus input data / parameters runs through instrumented execution or simulation; the resulting trace of executed instructions (function enter/exit events with function addresses) yields the dynamic call graph, while disassembly yields the static call graph. Functions not called during profiling still need to be included]
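
A minimal sketch of how such a trace could be folded into a dynamic call graph. The textual "enter &lt;addr&gt;" / "exit" event format is an assumption; the slides only state that the trace records function enter/exit events and function addresses.

```python
from collections import defaultdict

def build_dynamic_call_graph(trace_lines):
    graph = defaultdict(lambda: defaultdict(int))  # caller -> callee -> count
    stack = []                                     # current call stack
    for line in trace_lines:
        event, _, addr = line.partition(" ")
        if event == "enter":
            if stack:                              # record caller -> callee
                graph[stack[-1]][addr] += 1
            stack.append(addr)
        elif event == "exit":
            stack.pop()
    return graph

trace = ["enter 0x100", "enter 0x200", "exit", "enter 0x200", "exit", "exit"]
g = build_dynamic_call_graph(trace)
print(dict(g["0x100"]))   # -> {'0x200': 2}
```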


Besides: Assumptions & Constraints

  • Software Behavior Analysis

    • Component level

    • Trace composition reflects deployment usage (parameters / input set)

  • Hardware

    • External memory: High bandwidth / high latency

    • Block size (fixed) / Number of code blocks (fixed)

  • Compiler / Linker

    • Function splitting (function size < block size)



Design Flow : Block composition

[Figure: the dynamic and static call graphs feed the block composition algorithm, which emits a linker file]
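
The slides do not show the linker file format. As a sketch of what the block composition step might emit, the fragment below assumes GNU-ld-style output sections and one .text.&lt;function&gt; input section per function; both conventions are assumptions for illustration.

```python
# Hypothetical block assignment produced by the composition algorithm.
blocks = {"block1": ["F1", "F3"], "block2": ["F2"], "block3": ["F4", "F5"]}

def emit_linker_fragment(blocks, block_size=4096):
    """Emit a linker-file fragment placing each function into its block."""
    lines = []
    for i, (name, funcs) in enumerate(sorted(blocks.items())):
        lines.append(f".{name} 0x{i * block_size:08x} : {{")
        for f in funcs:
            lines.append(f"    *(.text.{f})")   # one input section per function
        lines.append("}")
    return "\n".join(lines)

print(emit_linker_fragment(blocks))
```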


Design Flow : Re-linking

[Figure: guided by the linker file, the functions of the original binary (Function 1, 2, 3, …) are re-linked into the fixed-size code blocks 1..4 of the re-linked binary]


Design Flow : Re-linking

[Figure: within a code block, data references, function references, and function pointers must be patched during re-linking; the compiler supplies the relocation table, symbol table, and ELF headers needed for this. The figure contrasts the original code section size with the code section size after re-linking, next to the data section size]


Overview: Algorithm

  • Input: Dynamic function call graph

    (Node = function)

  • Output: Block graph

    (Node = 1..n functions)

  • Challenge: “Merge appropriate functions into a block”

  • 3 steps (differing in merging distance; a sketch of the skeleton follows below):

    (1) combine_neighbor

    (2) merge_direct_children

    (3) bubble_merge
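
As a reading aid, here is a skeleton of the three passes under the stated input/output contract. The graph representation and the block size are assumptions; the pass bodies are stubs here and are sketched per step on the following slides.

```python
BLOCK_SIZE = 4096   # fixed block size in bytes (assumed value)

def combine_neighbor(graph, sizes, blocks):
    pass   # step 1: merge a node with a direct neighbor (sketched below)

def merge_direct_children(graph, sizes, blocks):
    pass   # step 2: merge the hot direct children of a node (sketched below)

def bubble_merge(graph, sizes, blocks):
    pass   # step 3: merge across larger graph distances (sketched below)

def compose_blocks(graph, sizes):
    """Turn a dynamic call graph (node = function) into a block graph
    (node = 1..n functions); every merge must respect BLOCK_SIZE."""
    blocks = {f: [f] for f in sizes}     # initially one function per node
    for step in (combine_neighbor, merge_direct_children, bubble_merge):
        step(graph, sizes, blocks)       # the passes differ in merge distance
    return blocks

# graph: {caller: {callee: call_count}}, sizes: {function: size in bytes}
print(compose_blocks({"F1": {"F2": 100}}, {"F1": 512, "F2": 256}))
```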


Algorithm Step 1/3

combine_neighbor

[Figure: dynamic call graph; F1 calls F2, F3, F4 (edge weights 100, 1, 4), deeper edges among F5..F9 carry weights such as 30, 1e4, 1e6, 1e8, 1e10, and node size reflects function size (architecture-dependent)]


Algorithm Step 1/3

combine_neighbor

[Figure: the same call graph annotated with a centrality measure (values between 0.00 and 1.00); the highest-scoring neighboring pair, F4 and F7, is combined into the block node F4,7]
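
The centrality measure itself is not defined on the slides. As one plausible stand-in, the sketch below scores each caller/callee edge by its share of the total call volume and combines the top-scoring pair if it fits into a block; all names, weights, and sizes are illustrative.

```python
BLOCK_SIZE = 4096   # assumed fixed block size in bytes

def combine_neighbor(graph, sizes):
    """Pick one neighboring (caller, callee) pair to merge, or None."""
    total = sum(w for callees in graph.values() for w in callees.values())
    scored = [(w / total, caller, callee)
              for caller, callees in graph.items()
              for callee, w in callees.items()
              if sizes[caller] + sizes[callee] <= BLOCK_SIZE]
    if not scored:
        return None
    _, caller, callee = max(scored)     # highest "centrality" score wins
    return caller, callee

# Edge weights follow the slide: F1's calls are rare, F4 -> F7 dominates.
graph = {"F1": {"F2": 100, "F3": 1, "F4": 4}, "F4": {"F7": 10**10}}
sizes = {"F1": 512, "F2": 256, "F3": 1024, "F4": 300, "F7": 700}
print(combine_neighbor(graph, sizes))   # -> ('F4', 'F7'), merged as F4,7
```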


Algorithm Step 2/3

merge_direct_children

[Figure (animation): F3 calls its direct children F5, F6, F7, F8 with edge weights 30, 1e6, 1e4, 1e6; the hot children F6, F7, F8 are merged stepwise (F6,7, then F6,7,8), the merged node's incoming edge weight becomes the sum 1e6+1e6+1e4, and F5 remains separate]
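
A sketch of step 2 under the same assumptions: the hot direct children of a node are merged into one block node, and the parent's edge weights are summed (1e6 + 1e6 + 1e4 on the slide). The hotness threshold and the sizes are invented for the example.

```python
BLOCK_SIZE = 4096   # assumed fixed block size in bytes
HOT = 10**4         # assumed threshold: children called this often are merged

def merge_direct_children(graph, sizes, parent):
    """Merge the hot direct children of `parent` into one block node."""
    children = graph.get(parent, {})
    hot = [c for c, w in children.items() if w >= HOT]
    merged, used = [], 0
    for c in sorted(hot, key=lambda c: -children[c]):   # hottest first
        if used + sizes[c] <= BLOCK_SIZE:               # must fit one block
            merged.append(c)
            used += sizes[c]
    if len(merged) < 2:
        return None
    name = ",".join(sorted(merged))                 # e.g. "F6,F7,F8"
    weight = sum(children[c] for c in merged)       # summed call count
    return name, weight

# Weights as on the slide: F5 is cold (30), F6/F7/F8 are hot.
graph = {"F3": {"F5": 30, "F6": 10**6, "F7": 10**4, "F8": 10**6}}
sizes = {"F5": 900, "F6": 800, "F7": 600, "F8": 700}
print(merge_direct_children(graph, sizes, "F3"))   # -> ('F6,F7,F8', 2010000)
```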


Algorithm Step 3/3

bubble_merge

[Figure (animation): in the full dynamic call graph, bubble_merge combines functions across larger distances than the earlier passes; in the example, F3 and F8 end up in one block node, F3,F8]
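
The slides give only the name of this step and the example result (F3 and F8 sharing a block). Absent a formal definition, the sketch below reads bubble_merge as repeatedly merging the heaviest remaining edge whose combined group still fits one block, so merges "bubble up" through the graph; treat this as an assumption, not the authors' algorithm.

```python
BLOCK_SIZE = 4096   # assumed fixed block size in bytes

def bubble_merge(edges, sizes):
    """Repeatedly merge the heaviest edge whose groups still fit a block."""
    group = {f: frozenset([f]) for f in sizes}   # function -> current group
    while True:
        cand = [(w, a, b) for a, b, w in edges
                if group[a] != group[b]
                and sum(sizes[f] for f in group[a] | group[b]) <= BLOCK_SIZE]
        if not cand:
            return set(group.values())           # the final block nodes
        _, a, b = max(cand)                      # heaviest edge merges first
        merged = group[a] | group[b]
        for f in merged:
            group[f] = merged

edges = [("F3", "F5", 30), ("F3", "F8", 10**6), ("F8", "F9", 10**8)]
sizes = {"F3": 1500, "F5": 900, "F8": 2500, "F9": 3000}
print(bubble_merge(edges, sizes))   # F3 and F8 end up in one block node
```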


Results

  • What is interesting?

    • Memory efficiency: Block Fragmentation

    • Technology scaling: Misses

    • Energy: Amount of transferred data

    • Performance: Number of cycles

  • Benchmark: MediaBench (CJPEG)


Results: Block Fragmentation

[Plot: function size distribution and block fragmentation; x-axis: binary size [Byte]; curves for different block sizes [Byte]; CJPEG – JPEG encoding (MediaBench)]


Results: Misses: LRU [6-12 blocks]

[Plot: cache misses; x-axis: total cache size [Byte]; CJPEG – JPEG encoding (MediaBench)]


Results: Transferred Code: LRU [6-12 blocks]

[Plot: amount of transferred code; x-axis: total cache size [Byte]; CJPEG – JPEG encoding (MediaBench)]


Results: LRU/ARC/RR Transferred Code [8 blocks]

[Plot: transferred code under LRU, ARC, and Round Robin; x-axis: total cache size [Byte]; CJPEG – JPEG encoding (MediaBench)]


Results: Copy Cycles: LRU [6-12 blocks]

[Plot: cycles spent copying blocks; x-axis: total cache size [Byte]; CJPEG – JPEG encoding (MediaBench)]


Summary

  • Introduced: Block Cache for Embedded Systems

    • Motivated by growing on-chip area and external memory latency

      • These challenge the utilization and suitability of traditional cache designs

      • Scales to megabyte-sized on-chip memories

  • Block Cache:

    • Hardware

      • Simple hardware structure:

        Logic + memory (plain SRAM, not cache memory)

    • Design Flow

      • Profile the software component, compose blocks (3-step algorithm), re-link the binary

    • Results

      • Exploits high-bandwidth memory

      • Good performance


References

  • [1] D. A. Patterson, “Latency lags bandwidth” – Commun. ACM, 2004

  • [2] R. Banakar, S. Steinke, B. Lee, M. Balakrishnan, P. Marwedel, “Scratchpad memory: A design alternative for cache on-chip memory in embedded systems” – CODES, 2002

  • [3] F. Angiolini, F. Menichelli, A. Ferrero, L. Benini, M. Oliveri, “A post-compiler approach to scratchpad mapping of code” – CASES, 2004

  • [4] S. Steinke, L. Wehmeyer, B. Lee, P. Marwedel, “Assigning program and data objects to scratchpad for energy reduction” – DATE, 2002


Motivation

  • Off-chip memory: bandwidth improves, but latency does not [1]

  • Caches generally consume more power than on-chip memory [2,3,4]

  • On-chip area will increase enormously

  • A significant amount of power will be spent in the memory hierarchy

[Figure: CPUs with I-Caches attached to shared off-chip memory]




Architectural Overview: Block Cache

[Figure: detailed view of the fetch path; the µTLB holds one entry per code block (a block address such as Addr. B1 plus block status), the ControlUnit compares the instruction address against these entries and, on a miss, triggers a DMA block load from off-chip memory into one of the on-chip code blocks B1..B3. Notes: exploits burst transfers (DRAM memory), area-efficient (SRAM cells), scalable up to application size]



