
Single-pass Cache Optimization

Clive Butler and Ruofan Yang


Introduction of Problem

  • Embedded systems execute a single application or a class of applications repeatedly
  • An emerging methodology for designing embedded systems utilizes configurable processors
  • The configurable cache parameters are size, associativity, and line size
  • An energy model and an execution-time model are developed to find the best cache configuration for a given embedded application
  • Current processor design methodologies rely on reserving a chip area large enough for caches while conforming to area, performance, and energy-cost constraints
  • A customized cache allows designers to meet tighter energy-consumption, performance, and cost constraints

Introduction of Problem

  • In existing low-power processors, cache memory is known to consume a large portion of the on-chip energy
  • Cache consumes up to 43%–50% of the total system power of a processor
  • In embedded systems, where a single application or a class of applications is repeatedly executed on a processor, the memory hierarchy can be customized so that an optimal configuration is achieved
  • The right choice of cache configuration for a given application can have a significant impact on overall performance and energy consumption

Introduction of Problem

  • Estimating hit and miss rates for a single configuration is fairly easy using tools such as Dinero
  • Doing so for many cache sizes, associativities, and line sizes, however, can be enormously time-consuming
  • Using Dinero to estimate the cache miss rate for a number of cache configurations means that a large program trace must be repeatedly read and evaluated, once per configuration
Dinero
  • Dinero is a trace-driven cache simulator
  • Simulations are repeatable
  • One can simulate either a unified cache (mixed: data and instructions cached together) or separate instruction and data caches
  • Cheaper than evaluating candidate caches in hardware
Dinero

  • A din record is a two-tuple: a label and an address.
  • Labels: 0 = read data, 1 = write data, 2 = instruction fetch, 3 = escape record, 4 = escape record (causes a cache flush).
  • Cache parameters are set by command-line options.
  • Dinero uses the priority-stack method of memory-hierarchy simulation to increase flexibility and improve simulator performance for highly associative caches.
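As a concrete illustration, din records can be read with a few lines of code. This is a hypothetical sketch (`parse_din` is not part of Dinero), assuming each record is a whitespace-separated decimal label and hexadecimal address:

```python
# Hypothetical din-record reader, assuming "label address" per line,
# with a decimal label (0-4) and a hexadecimal address.

LABELS = {
    0: "read data",
    1: "write data",
    2: "instruction fetch",
    3: "escape record",
    4: "escape record (cache flush)",
}

def parse_din(lines):
    """Yield (label, address) two-tuples from din trace lines."""
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        label, addr = int(fields[0]), int(fields[1], 16)
        if label not in LABELS:
            raise ValueError(f"unknown din label: {label}")
        yield label, addr

trace = ["2 400", "0 1000", "1 1004"]
print(list(parse_din(trace)))  # [(2, 1024), (0, 4096), (1, 4100)]
```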
Introduction: Tree-based Method

  • Presents a methodology to rapidly and accurately explore the cache design space
  • This is done by estimating cache miss rates for many different cache configurations simultaneously, and by investigating the effect of different cache configurations on the energy and performance of a system
  • Simultaneous evaluation can be performed rapidly by taking advantage of the high correlation between the cache behaviors of different cache configurations
ASP-DAC paper

General Simulation Process

  • Step 1: Split the cache address (bits m(max)…m(min)…0) into tag and index; use the index to look up an array that stores tree addresses.
  • Step 2: Go to the tree address and traverse the tree.
  • Step 3: Find the node and follow its linked list.
  • Step 4: Look for a tag match in the list; on a miss, update the cache miss table.
ASP-DAC paper

Tree example

[Figure: a binary tree built over the cache address 1010 (tag). Splitting off one more index bit per level gives the sets for each cache size: cache size 2 → 101(0), 101(1); cache size 4 → 10(00), 10(01), 10(10), 10(11); cache size 8 → 1(000) through 1(111). Each forest assumes a fixed line size; the parenthesized bits (k) are used to find the path down the tree.]

ASP-DAC paper

Linked-list set-associative lookup

[Figure: each set is a linked list ordered from the most recently used element to the least recently used element. Walking the list classifies each reference as a hit or miss for associativity 1, 2, and 4 simultaneously; misses are accumulated in a miss-count table indexed by line size (L), cache size (N), and associativity (A). The rest of the address is used as the tag.]

ASP-DAC paper

Linked-list LRU update

[Figure: after a reference, the accessed element moves to the head of the list (most recently used) while the tail remains the least recently used element; the same miss-count table, indexed by (L, N, A), is updated for associativity 1, 2, and 4.]


Detail Trace Example

Example Specifications:

  • Cache size (N) will vary from 32 bits (max) to 2 bits (min)
  • Associativity (A) will vary from 4 (max) to 1 (min)
  • Cache set size (M) will vary from 8 (max) to 1 (min)
  • Assume a fixed line size (L)

Detail Trace Example

Instruction Trace

k | m

1. 000000 => 0

2. 001000 => 8

3. 010000 => 16

4. 000000 => 0

5. 001000 => 8

6. 000000 => 0

7. 010000 => 16

Assoc. = 1

[Figure: cache contents after each of the seven references for set counts M = 1, 2, 4, and 8 at associativity 1, showing which references hit or miss in each configuration.]

Detail Trace Example

Instruction Trace (same seven references as the previous slide)

Assoc. = 2

[Figure: cache contents after each of the seven references for set counts M = 1, 2, 4, and 8 at associativity 2, showing which references hit or miss in each configuration.]
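The hit/miss behavior that the two trace slides walk through can be checked with a small single-configuration simulator. A minimal sketch, not the paper's tree-based structure, of an LRU set-associative cache run once per configuration over the example trace (line size assumed to be 8 bytes, matching the k | m split):

```python
from collections import OrderedDict

def simulate(trace, num_sets, assoc, line_size=8):
    """Count misses for an LRU set-associative cache over an address trace."""
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in trace:
        block = addr // line_size      # strip the block-offset bits
        s = sets[block % num_sets]     # index bits select the set
        if block in s:
            s.move_to_end(block)       # hit: make it most recently used
        else:
            misses += 1
            if len(s) == assoc:
                s.popitem(last=False)  # evict the least recently used line
            s[block] = True
    return misses

trace = [0, 8, 16, 0, 8, 0, 16]        # the slides' instruction trace

# M=1,A=1 -> 7; M=2,A=1 -> 5; M=4,A=1 -> 3; M=1,A=2 -> 6; M=1,A=4 -> 3
for m, a in [(1, 1), (2, 1), (4, 1), (1, 2), (1, 4)]:
    print(f"M={m}, assoc={a}: {simulate(trace, m, a)} misses")
```

With M = 8 sets (or full associativity of 4) only the three compulsory misses for blocks 0, 8, and 16 remain. Looping like this still makes one pass over the trace per configuration; the tree-based method's contribution is a shared structure that updates all configurations in a single pass.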

ASP-DAC Results
  • Evaluated using benchmarks from MediaBench
  • On average 45 times faster than Dinero IV at exploring the design space
  • While still achieving 100% accuracy
Introduction: Table-based Method
  • Two cache evaluation techniques, analytical modeling and execution-based evaluation, are used to evaluate the design space
  • SPCE presents a simplified yet efficient way to extract locality properties for an entire cache configuration design space in a single pass
  • This part covers related work, an overview of SPCE, the properties of addressing-behavior analysis used to estimate the cache miss rate, and the experiments and results
Related Work
  • Much existing research in this area needs multiple passes to explore all configurable parameters, or employs large and complex data structures, which restricts its applicability
  • Algorithms for single-pass cache simulation examine a set of caches concurrently: Mattson; Hill and Smith; Sugumar and Abraham; Cascaval and Padua
  • Janapsatya et al. present a technique to evaluate all different cache parameters simultaneously, but it was not designed with a hardware implementation in mind
  • This paper's methodology uses simple array structures, which are more amenable to a lightweight hardware implementation
Definitions
  • A time-ordered sequence of referenced addresses T[t] (t a positive integer) of length |T|, such that T[t] is the t-th address referenced
  • If ⌊T[ti]/2^b⌋ = ⌊T[ti + d]/2^b⌋, then the addresses T[ti] and T[ti + d] are references to the same cache block of 2^b words
  • Define d as the delay: the number of unique cache references occurring between any two references for which ⌊T[ti]/2^b⌋ = ⌊T[ti + d]/2^b⌋

Definitions

  • Evaluate the locality in the address sequence T[ti] of a running application ai by counting the occurrences where ⌊T[ti]/2^b⌋ = ⌊T[ti + d]/2^b⌋ and registering each in cell L(b, d) of the locality table (2^b is the block size, d is the delay)

Fully-Associative
  • A fully-associative cache configuration is defined by the notation cj(b, n), where b defines the line size in terms of words, and n the total number of lines in the cache
  • The locality table L(b, d) provides an efficient way to estimate the cache miss rate of fully-associative caches: under LRU replacement, a reference with delay d hits in a cache of n lines exactly when d < n
Fully-Associative Example

[Figure: a sequence of addresses T with block field b, and the locality table L(b, d) built from the trace, with entries at delays d = 2 and d = 3.]
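The locality table can be built from LRU stack (reuse) distances: the delay d of a reference is simply the number of unique blocks touched since the previous reference to the same block. A minimal sketch, not the paper's implementation, assuming block size 2^b with b = 3 (8-byte blocks):

```python
def reuse_distances(trace, b=3):
    """For each reference, return the number of unique blocks touched since
    the previous reference to the same block (None for a first reference)."""
    stack = []                      # LRU stack, most recent block first
    out = []
    for addr in trace:
        block = addr >> b           # drop the 2^b block-offset bits
        if block in stack:
            d = stack.index(block)  # unique blocks since last reference
            stack.remove(block)
        else:
            d = None                # compulsory (first) reference
        out.append(d)
        stack.insert(0, block)
    return out

def fully_assoc_misses(trace, n, b=3):
    """Miss iff first reference, or delay d >= n (evicted before reuse)."""
    return sum(1 for d in reuse_distances(trace, b) if d is None or d >= n)

trace = [0, 8, 16, 0, 8, 0, 16]
print(reuse_distances(trace))        # [None, None, None, 2, 2, 1, 2]
print(fully_assoc_misses(trace, 2))  # 6 misses with 2 lines
print(fully_assoc_misses(trace, 4))  # 3 compulsory misses with 4 lines
```

Counting how often each delay d occurs per block size gives exactly the cells L(b, d) of the locality table.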

Set-Associative
  • Most real-world cache devices are built as direct-mapped or set-associative structures
  • Because of conflicts, L cannot be used to estimate their misses, so define s as the number of sets, independent of the associativity; for a direct-mapped cache, set size = 1 and s = n
  • To analyze cache conflicts, we build the conflict table Kα(b, s) (b is the block size, s the number of sets), which is composed of α layers, one for each associativity explored
Set-Associative

  • The value stored in each element of the table Kα(b, s) indicates how many times the same block (of size 2^b) is repeatedly referenced and results in a hit.
  • A given cache configuration with associativity w is capable of overcoming no more than w − 1 mapping conflicts.
  • The number of cache hits for a configuration with associativity w is determined by summing the cache hits from layer α = 1 up to layer α = w.
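One way to realize this layered counting, sketched here under the assumption of LRU sets rather than as the paper's exact data structure: track per-set reuse distances, credit each re-reference with distance d to layer d + 1, and sum layers 1 through w to obtain the hits for associativity w:

```python
from collections import defaultdict

def conflict_layers(trace, num_sets, b=3, max_assoc=4):
    """Layer a+1 counts references whose within-set reuse distance is a;
    hits for associativity w = sum of layers 1..w (LRU replacement)."""
    stacks = defaultdict(list)       # per-set LRU stacks of block numbers
    layers = defaultdict(int)
    total = 0
    for addr in trace:
        block = addr >> b
        stack = stacks[block % num_sets]
        total += 1
        if block in stack:
            d = stack.index(block)   # unique same-set blocks since last use
            if d < max_assoc:
                layers[d + 1] += 1   # becomes a hit once associativity > d
            stack.remove(block)
        stack.insert(0, block)
    return layers, total

layers, total = conflict_layers([0, 8, 16, 0, 8, 0, 16], num_sets=1)
# assoc=1: 0 hits; assoc=2: 1 hit; assoc=4: 4 hits (3 misses left)
for w in (1, 2, 4):
    hits = sum(layers[a] for a in range(1, w + 1))
    print(f"assoc={w}: {hits} hits, {total - hits} misses")
```

The miss counts agree with simulating each configuration directly, which is the point: one pass over the trace fills a table that answers every associativity at once.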
Experiment Setup
  • SPCE is implemented as a standalone C++ program that processes an instruction address trace file; instruction address traces were gathered with SimpleScalar for 9 arbitrarily chosen benchmarks from Motorola's PowerStone suite
  • Since 64 bytes is the largest block size in the design space utilized, bmax = 3; smax is defined by the configuration with the maximum number of sets in the design space
  • Performance is examined for the benchmark suite with SPCE and also with a very popular trace-driven cache simulator (Dinero IV)
Results
  • Compare the performance of SPCE and Dinero IV across the 45 cache configurations.
Conclusion
  • Both the tree-based method and the table-based method (SPCE) ease cache miss-rate estimation and reduce simulation time.
  • Compared to Dinero IV, the average speedup is around 30 times.
  • Future work includes extending the design space exploration by considering a second level of cache.