

Clustered Data Cache Designs for VLIW Processors

PhD Candidate: Enric Gibert

Advisors: Antonio González, Jesús Sánchez



Motivation

  • Two major problems in processor design

    • Wire delays

    • Energy consumption

D. Matzke, "Will Physical Scalability Sabotage Performance Gains?", IEEE Computer 30(9), pp. 37-39, 1997

Data from www.sandpile.org


Clustering

[Figure: clustered VLIW architecture. Four clusters (CLUSTER 1 to CLUSTER 4), each with its own functional units (FUs) and register file, are connected by register-to-register communication buses; all clusters access a shared L1 cache through the memory buses, backed by the L2 cache.]



Data Cache

  • Latency

  • Energy

    • Leakage will soon dominate energy consumption

    • Cache memories will probably be the main source of leakage

(S. Hill, Hot Chips 13)

  • In this Thesis:

    • Latency Reduction Techniques

    • Energy Reduction Techniques



Contributions of this Thesis

  • Memory hierarchy for clustered VLIW processors

    • Latency Reduction Techniques

      • Distribution of the Data Cache among clusters

      • Cost-effective cache coherence solutions

      • Word-Interleaved distributed data cache

      • Flexible Compiler-Managed L0 Buffers

    • Energy Reduction Techniques

      • Heterogeneous Multi-module Data Cache

        • Unified processors

        • Clustered processors



Evaluation Framework

  • IMPACT C compiler

    • Compile + optimize + memory disambiguation

  • Mediabench benchmark suite

  • Microarchitectural VLIW simulator



Presentation Outline

  • Latency reduction techniques

    • Software memory coherence in distributed caches

    • Word-interleaved distributed cache

    • Flexible Compiler-Managed L0 Buffers

  • Energy reduction techniques

    • Multi-Module cache for clustered VLIW processor

  • Conclusions


Distributing the Data Cache

[Figure: the L1 cache is distributed into one L1 cache module per cluster. Each of the four clusters has its own FUs, register file, and L1 cache module; the modules are connected to the L2 cache through the memory buses, and the clusters to each other through the register-to-register communication buses.]


MultiVLIW

[Figure: distributed L1 cache with one cache module per cluster; cache blocks are kept coherent across the modules in hardware by an MSI cache coherence protocol.]

(Sánchez and González, MICRO-33)



Presentation Outline

  • Latency reduction techniques

    • Software memory coherence in distributed caches

    • Word-interleaved distributed cache

    • Flexible Compiler-Managed L0 Buffers

  • Energy reduction techniques

    • Multi-Module cache for clustered VLIW processor

  • Conclusions


Memory Coherence

[Figure: one cluster updates X in its cache module and another cluster later reads X; the new value of X has to reach the other modules through the memory buses, which also carry remote accesses, misses, replacements, and other traffic, so the bus latency is non-deterministic.]


Coherence Solutions: Overview

  • Local scheduling solutions → applied to loops

    • Memory Dependent Chains (MDC)

    • Data Dependence Graph Transformations (DDGT)

      • Store replication

      • Load-store synchronization

  • Software-based solutions with little hardware support

  • Applicable to different configurations

    • Word-interleaved cache

    • Replicated distributed cache

    • Flexible Compiler-Managed L0 Buffers


Scheme 1: Mem. Dependent Chains

  • Sets of memory dependent instructions

    • Memory disambiguation by the compiler

      • Conservative assumptions

    • Assign instructions in the same set to the same cluster (see the sketch below)

[Figure: a store to X and a load from X form a memory dependent set, together with the ADD they reach through register dependences; the whole chain is assigned to one cluster, so the load reads X from the same cache module the store wrote.]
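A minimal sketch (not the thesis implementation) of how memory dependent sets could be formed and pinned to clusters: instructions that the compiler's conservative disambiguation cannot prove independent are merged with union-find, and every instruction of a set is mapped to the same cluster. The may_alias helper and all constants are illustrative assumptions.

    /* Sketch: group memory instructions that conservative disambiguation
       cannot prove independent, then pin each group to one cluster so all
       of its accesses use the same cache module.                          */
    #include <stdio.h>

    #define N_MEM  6                  /* memory instructions in the loop body */
    #define N_CLUS 4                  /* clusters                             */

    static int parent[N_MEM];         /* union-find parent array              */

    static int find(int i) { return parent[i] == i ? i : (parent[i] = find(parent[i])); }
    static void unite(int a, int b) { parent[find(a)] = find(b); }

    static int may_alias(int i, int j) {
        /* Stand-in for compile-time disambiguation: assume accesses to the
           same array may alias.                                             */
        static const int array_of[N_MEM] = { 0, 0, 1, 1, 2, 0 };
        return array_of[i] == array_of[j];
    }

    int main(void) {
        for (int i = 0; i < N_MEM; i++) parent[i] = i;

        /* Any pair the compiler cannot disambiguate joins the same set. */
        for (int i = 0; i < N_MEM; i++)
            for (int j = i + 1; j < N_MEM; j++)
                if (may_alias(i, j)) unite(i, j);

        /* Every instruction of a set is assigned to the same cluster. */
        for (int i = 0; i < N_MEM; i++)
            printf("mem instr %d -> cluster %d\n", i, find(i) % N_CLUS);
        return 0;
    }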


Scheme 2: DDG Transformations (I)

  • 2 transformations applied together

  • Store replication → overcomes memory-flow (MF) and memory-output (MO) dependences (see the sketch below)

    • Little support from the hardware

[Figure: the store to X is replicated into all four clusters, one local instance and three remote instances, so every cache module receives the new value and a later load from X hits locally in whichever cluster it is scheduled.]
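A toy model, with invented names and sizes, of what store replication buys: the compiler issues the store in every cluster, so a later load served by any cache module returns the new value.

    /* Toy model of store replication: one copy of X per cluster's cache
       module; the store is issued in every cluster.                        */
    #include <stdio.h>

    #define N_CLUS 4

    static int X_in_module[N_CLUS];            /* value of X held by each module */

    static void store_X_replicated(int value) {
        for (int c = 0; c < N_CLUS; c++)       /* one store instance per cluster */
            X_in_module[c] = value;
    }

    static int load_X_local(int cluster) {     /* a load only probes its own module */
        return X_in_module[cluster];
    }

    int main(void) {
        store_X_replicated(42);
        for (int c = 0; c < N_CLUS; c++)       /* the load hits locally everywhere */
            printf("cluster %d loads X = %d\n", c + 1, load_X_local(c));
        return 0;
    }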


Scheme 2: DDG Transformations (II)

  • Load-store synchronization → overcomes memory-anti (MA) dependences

[Figure: the load from X executes in one cluster and forwards a value through the register file to the cluster that holds the store to X, so the store is not issued until the load has read the old value.]


Results: Memory Coherence

  • Memory Dependent Chains (MDC)

    • Bad, since it restricts the assignment of instructions to clusters

    • Good when memory disambiguation is accurate

  • DDG Transformations (DDGT)

    • Good when there is pressure in the memory buses

      • Increases number of local accesses

    • Bad when there is pressure in the register buses

      • Big increase in inter-cluster communications

  • Solutions useful for different cache schemes



Presentation Outline

  • Latency reduction techniques

    • Software memory coherence in distributed caches

    • Word-interleaved distributed cache

    • Flexible Compiler-Managed L0 Buffers

  • Energy reduction techniques

    • Multi-Module cache for clustered VLIW processor

  • Conclusions


Word-Interleaved Cache

  • Simplify hardware

    • As compared to MultiVLIW

  • Avoid replication

  • Strides of +1/-1 element are predominant

    • Page interleaved

    • Block interleaved

    • Word interleaved → best suited (see the mapping sketch below)
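A minimal sketch, under the 4-cluster, 4-byte-word interleaving used throughout (cluster 1's module holds W0 and W4, cluster 2's holds W1 and W5, and so on), of how an address maps to its home cache module; whether an access is local or remote then depends on which cluster issues it. Constants and names are assumptions.

    /* Word-interleaved mapping sketch: 4 clusters and 4-byte words, so
       word i of a block lives in the module of cluster i % 4.             */
    #include <stdio.h>

    #define N_CLUS    4
    #define WORD_SIZE 4

    static int home_cluster(unsigned addr) {
        return (int)((addr / WORD_SIZE) % N_CLUS);
    }

    int main(void) {
        unsigned base = 0x1000;                         /* &a[0], word aligned */
        for (int i = 0; i < 8; i++) {
            unsigned addr = base + (unsigned)i * WORD_SIZE;
            /* An access from cluster c is local if c == home, remote otherwise;
               hit or miss is then decided by that module's tags as usual.     */
            printf("a[%d] at 0x%x -> cache module of cluster %d\n",
                   i, addr, home_cluster(addr) + 1);
        }
        return 0;
    }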


Architecture

[Figure: word-interleaved distributed L1 cache. A cache block (words W0 to W7) is split into per-cluster subblocks, so the module of cluster 1 holds W0 and W4, the module of cluster 2 holds W1 and W5, and so on; each cluster has its own tags, functional units, and register file, the modules are backed by the L2 cache, and an access can be a local hit, a remote hit, a local miss, or a remote miss.]


Instruction Scheduling (I): Unrolling

[Figure: array a[] is word-interleaved across the cache modules, so a[0] and a[4] sit in cluster 1's module, a[1] and a[5] in cluster 2's, and so on.]

Original loop (the single load is bound to one cluster, so only 25% of its accesses are local):

    for (i=0; i<MAX; i++) {
        ld r3, @a[i]
    }

Unrolled by 4, with each load assigned to the cluster that holds its word (100% of the accesses are local):

    for (i=0; i<MAX; i=i+4) {
        ld r3, @a[i]      /* cluster 1 */
        ld r3, @a[i+1]    /* cluster 2 */
        ld r3, @a[i+2]    /* cluster 3 */
        ld r3, @a[i+3]    /* cluster 4 */
    }


Instruction Scheduling (II)

  • Assign an appropriate latency to each memory instruction

    • Small latencies → ILP ↑, stall time ↑

    • Large latencies → ILP ↓, stall time ↓

    • Start with a large latency (remote miss) + iteratively reassign appropriate latencies (local miss, remote hit, local hit); see the sketch below

[Figure: scheduling a LD and a dependent add across clusters. With a small assumed latency the add is placed in cycle 3, close to the LD (more ILP, but more stall time if the LD misses); with a large assumed latency the add is placed in cycle 5 (less ILP, but less stall time).]
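A rough sketch, with invented latency values and a stand-in scheduling check, of the iterative latency assignment: every load starts at the remote-miss latency, and its assumed latency is lowered step by step (local miss, remote hit, local hit) while the schedule still tolerates it.

    /* Sketch of iterative latency assignment for loads.  schedule_ok()
       stands in for rescheduling the loop and checking the result;
       latencies and all names are illustrative.                           */
    #include <stdio.h>

    enum lat { LOCAL_HIT = 1, REMOTE_HIT = 4, LOCAL_MISS = 10, REMOTE_MISS = 13 };
    static const enum lat step[] = { REMOTE_MISS, LOCAL_MISS, REMOTE_HIT, LOCAL_HIT };

    static int schedule_ok(int load, enum lat assumed) {
        /* Placeholder check: pretend even-numbered loads tolerate anything
           down to a remote hit, odd-numbered ones only down to a local miss. */
        return (load % 2 == 0) ? assumed >= REMOTE_HIT : assumed >= LOCAL_MISS;
    }

    int main(void) {
        for (int load = 0; load < 4; load++) {
            enum lat assigned = REMOTE_MISS;            /* start pessimistic     */
            for (int s = 1; s < 4; s++) {               /* try shorter latencies */
                if (!schedule_ok(load, step[s])) break;
                assigned = step[s];
            }
            printf("load %d scheduled with latency %d\n", load, (int)assigned);
        }
        return 0;
    }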


Instruction Scheduling (III)

  • Assign instructions to clusters

    • Non-memory instructions

      • Minimize inter-cluster communications

      • Maximize workload balance among clusters

    • Memory instructions → 2 heuristics

      • Preferred cluster (PrefClus)

        • Average preferred cluster of memory dependent set

      • Minimize inter-cluster communications (MinComs)

        • Min. Comms. for 1st instruction of the memory dependent set



Memory Accesses

  • Sources of remote accesses:

    • Indirect, chains restrictions, double precision, …


Attraction Buffers

  • Cost-effective mechanism → ↑ local accesses (see the sketch after the results)

[Figure: a stride-4 loop (i=0; load a[i]; i=i+4) issued from one cluster finds a[0], a[4], … in a remote cache module; a small Attraction Buffer (AB) in the issuing cluster keeps copies of the remotely fetched words (a[0] and a[4] in the example), raising the loop's local accesses from 0% to 50%.]

  • Results

    • ~ 15% INCREASE in local accesses

    • ~30-35% REDUCTION in stall time

    • 5-7% REDUCTION in overall execution time
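A toy simulation, with invented sizes and behaviour, of the attraction-buffer idea for the stride-4 loop in the figure: a remote access also deposits the fetched word and its subblock partner into the issuing cluster's buffer, so roughly every second access becomes local.

    /* Toy attraction-buffer model: a remote access to a[i] also brings its
       subblock partner a[i+4] into the issuing cluster's Attraction Buffer,
       so the following iteration hits locally.  Sizes/names illustrative.  */
    #include <stdio.h>

    #define AB_ENTRIES 8

    static int ab_word[AB_ENTRIES];        /* word indices held by the buffer */
    static int ab_used = 0;

    static int ab_lookup(int w) {
        for (int e = 0; e < ab_used; e++) if (ab_word[e] == w) return 1;
        return 0;
    }
    static void ab_insert(int w) { if (ab_used < AB_ENTRIES) ab_word[ab_used++] = w; }

    int main(void) {
        int local = 0, remote = 0;
        for (int i = 0; i < 32; i += 4) {              /* load a[i]; i = i + 4  */
            if (ab_lookup(i)) { local++; continue; }   /* served by the AB      */
            remote++;                                  /* go to the home module */
            ab_insert(i);                              /* keep the subblock:    */
            ab_insert(i + 4);                          /* a[i] and a[i+4]       */
        }
        printf("local %d, remote %d accesses\n", local, remote);  /* 4 and 4 */
        return 0;
    }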



Performance



Presentation Outline

  • Latency reduction techniques

    • Software memory coherence in distributed caches

    • Word-interleaved distributed cache

    • Flexible Compiler-Managed L0 Buffers

  • Energy reduction techniques

    • Multi-Module cache for clustered VLIW processor

  • Conclusions


Why L0 Buffers

  • Still keep the hardware simple, but…

  • … allow dynamic binding between addresses and clusters


L0 Buffers

  • Small number of entries → flexibility

    • Adaptive to the application + dynamic address-cluster binding

  • Controlled by software → load/store hints

    • Mark instructions to access the buffers: which and how

  • Flexible Compiler-Managed L0 Buffers

[Figure: each cluster (INT, FP and MEM units plus a register file) is given a small L0 Buffer in front of the shared L1 cache; unpack logic moves data between L1 blocks and the L0 Buffers.]


Mapping Flexibility

[Figure: an L1 block (16 bytes, four 4-byte words) can be brought into the L0 Buffers in two ways. With linear mapping, the block goes into a single cluster's buffer (e.g. "load a[0] with stride 1 element" keeps a[0], a[1], … together in one buffer). With interleaved mapping (1 cycle penalty in the unpack logic), the words are spread across the clusters' buffers (e.g. all loads with a 4-element stride leave a[0] and a[4] in one buffer, a[1] and a[5] in the next, and so on), which gives a dynamic binding between addresses and clusters.]
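A small sketch, with invented structures, contrasting the two mappings a load hint can request: linear mapping copies the whole L1 block into one cluster's L0 Buffer, while interleaved mapping scatters consecutive words across the four buffers.

    /* Sketch of linear vs. interleaved mapping of a 16-byte L1 block
       (4 words) into the per-cluster L0 Buffers.                          */
    #include <stdio.h>

    #define N_CLUS          4
    #define WORDS_PER_BLOCK 4
    #define L0_WORDS        8

    static int l0_buf[N_CLUS][L0_WORDS];
    static int l0_fill[N_CLUS];

    static void map_linear(const int *block, int cluster) {
        for (int w = 0; w < WORDS_PER_BLOCK; w++)       /* whole block, one buffer */
            l0_buf[cluster][l0_fill[cluster]++] = block[w];
    }

    static void map_interleaved(const int *block) {     /* word w -> cluster w % 4 */
        for (int w = 0; w < WORDS_PER_BLOCK; w++)
            l0_buf[w % N_CLUS][l0_fill[w % N_CLUS]++] = block[w];
    }

    int main(void) {
        int block[WORDS_PER_BLOCK] = { 10, 11, 12, 13 };   /* a[0]..a[3] */
        map_linear(block, 0);          /* hint: linear, e.g. for a stride-1 load */
        map_interleaved(block);        /* hint: interleaved (1 cycle penalty)    */
        for (int c = 0; c < N_CLUS; c++) {
            printf("cluster %d L0 Buffer:", c + 1);
            for (int w = 0; w < l0_fill[c]; w++) printf(" %d", l0_buf[c][w]);
            printf("\n");
        }
        return 0;
    }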


Hints and L0-L1 Interface

  • Memory hints

    • Access or bypass the L0 Buffers

    • Data mapping: linear/interleaved

    • Prefetch hints → next/previous blocks

  • L0 Buffers are write-through with respect to L1

    • Simplifies replacements

    • Makes hardware simple

      • No arbitration

      • No logic to pack data back correctly

    • Simplifies coherence among L0 Buffers



Instruction Scheduling

  • Selective loop unrolling

    • No unroll vs. unroll by N

  • Assign latencies to memory instructions

    • Critical instructions (slack) use L0 Buffers

    • Do not overflow L0 Buffers

      • Use counter of L0 Buffer free entries / cluster

      • Do not schedule a critical instruction into a cluster whose counter == 0 (see the sketch after this list)

    • Memory coherence

  • Cluster assignment + schedule instructions

    • Minimize global communications

    • Maximize workload balance

    • Critical → priority to clusters where the L0 Buffer can be used

  • Explicit prefetching
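A minimal sketch, with invented names and sizes, of the overflow guard mentioned above: the scheduler keeps one counter of free L0 Buffer entries per cluster and never places a critical memory instruction in a cluster whose counter has reached zero.

    /* Sketch: per-cluster counter of free L0 Buffer entries used to keep
       the scheduler from overflowing the buffers.                         */
    #include <stdio.h>

    #define N_CLUS     4
    #define L0_ENTRIES 8

    static int free_entries[N_CLUS] = { L0_ENTRIES, L0_ENTRIES, L0_ENTRIES, L0_ENTRIES };

    /* Place a critical (low-slack) memory instruction in its preferred
       cluster; fall back to any cluster that still has a free L0 entry.   */
    static int assign_critical_load(int preferred) {
        if (free_entries[preferred] > 0) { free_entries[preferred]--; return preferred; }
        for (int c = 0; c < N_CLUS; c++)
            if (free_entries[c] > 0) { free_entries[c]--; return c; }
        return -1;   /* no room: schedule it to bypass the L0 Buffers */
    }

    int main(void) {
        for (int i = 0; i < 12; i++) {
            int c = assign_critical_load(i % N_CLUS);
            if (c < 0) printf("load %2d bypasses the L0 Buffers\n", i);
            else       printf("load %2d -> cluster %d (free entries left: %d)\n",
                              i, c + 1, free_entries[c]);
        }
        return 0;
    }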



Number of Entries



Performance



Global Comparative



Presentation Outline

  • Latency reduction techniques

    • Software memory coherence in distributed caches

    • Word-interleaved distributed cache

    • Flexible Compiler-Managed L0 Buffers

  • Energy reduction techniques

    • Multi-Module cache for clustered VLIW processor

  • Conclusions


Motivation

  • Energy consumption → a first-class design goal

  • Heterogeneity

    • ↓ supply voltage and/or ↑ threshold voltage

  • Cache memory → ARM10

    • D-cache → 24% of dynamic energy

    • I-cache → 22% of dynamic energy

  • Exploit heterogeneity in the L1 D-cache?

[Figure: processor front-end → structures tuned for performance; processor back-end → structures tuned for energy.]


Multi-Module Data Cache

  • Variable-Based Multi-Module

[Figure: the address space is divided into a FAST space and a SLOW space, mapped to a fast and a slow L1 D-cache module in front of the L2 D-cache; global data, heap data, and (distributed) stack frames are placed in one space or the other, with two stack pointers (SP1, SP2) for the stack, and accesses reach the modules through the load/store queues.]

    • It is possible to exploit heterogeneity!

  • Instruction-Based Multi-Module (Abella and González, ICCD 2003)

[Figure: a criticality table indexed by the instruction PC steers each memory access from the processor to either the fast or the slow cache module, both backed by the L2 D-cache.]


Cache Configurations

[Figure: two-cluster configurations of the L1 D-cache modules in front of the L2 D-cache. Each module is 8KB with 1 R/W port; a FAST module has latency L=2, a SLOW module has latency L=4 (latency x2) and roughly 1/3 of the energy. The configurations are FAST+NONE, FAST+FAST, FAST+SLOW, SLOW+NONE, and SLOW+SLOW, with the first and second modules attached to the two clusters (FUs + register files) over the register buses.]


Instr.-to-Variable Graph (IVG)

  • Built with profiling information (see the sketch below)

  • Variables = global, local, heap

[Figure: a bipartite graph that links memory instructions (LD1..LD5, ST1, ST2) to the variables they access (V1..V4); each variable, and with it the instructions that touch it, is then mapped to the FIRST or the SECOND address space.]
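A small sketch, with invented profile data, of how the IVG could be built: each profiled memory access adds weight to the edge between the memory instruction and the variable (global, local, or heap object) it touched.

    /* Sketch: accumulate an instruction-to-variable graph (IVG) from a
       profile of memory accesses.                                         */
    #include <stdio.h>

    #define N_INSTR 4
    #define N_VAR   3

    struct access { int instr; int var; };     /* one profiled memory access */

    int main(void) {
        static const struct access profile[] = {
            {0, 0}, {0, 0}, {1, 0}, {2, 1}, {2, 2}, {3, 2}, {3, 2}, {3, 2},
        };
        int ivg[N_INSTR][N_VAR] = { 0 };        /* edge weights (access counts) */

        for (unsigned i = 0; i < sizeof profile / sizeof profile[0]; i++)
            ivg[profile[i].instr][profile[i].var]++;

        for (int in = 0; in < N_INSTR; in++)
            for (int v = 0; v < N_VAR; v++)
                if (ivg[in][v])
                    printf("instr %d -- var %d : %d accesses\n", in, v, ivg[in][v]);
        return 0;
    }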


Greedy Mapping Algorithm

  • Initial mapping → all variables to the first @ space

  • Assign affinities to instructions

    • Express a preferred cluster for memory instructions: [0,1]

    • Propagate affinities to other instructions

  • Schedule code + refine mapping

[Flow: Compute IVG → Compute affinities + propagate affinities → Compute mapping → Schedule code]


Computing and Propagating Affinity

[Figure: a DDG whose loads (LD1..LD4) access variables V1..V4 mapped to the FIRST or the SECOND module. Each load gets affinity 0 or 1 according to the module its variable lives in, and affinities are propagated along the register dependences taking the slack of each instruction into account, so intermediate instructions (the adds, mul1, and the final ST1) end up with values in between, such as 0.4.]
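A rough sketch, with an invented DDG, of the affinity step: loads are seeded with affinity 0 (variable mapped to the first module) or 1 (second module), and non-memory instructions take the average affinity of their predecessors; the slack-based weighting used in the figure is omitted here.

    /* Sketch: seed affinities on loads from their variables' modules and
       propagate them through the DDG by averaging over predecessors.      */
    #include <stdio.h>

    #define N 7

    int main(void) {
        /* 0..3: LD1..LD4 (seeded); 4: mul1 fed by LD1,LD2; 5: add fed by
           LD3,LD4; 6: final add fed by instructions 4 and 5.              */
        double aff[N]        = { 0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0 };
        const int pred[N][2] = { {-1,-1}, {-1,-1}, {-1,-1}, {-1,-1},
                                 { 0, 1}, { 2, 3}, { 4, 5} };

        for (int i = 4; i < N; i++)                    /* topological order */
            aff[i] = (aff[pred[i][0]] + aff[pred[i][1]]) / 2.0;

        for (int i = 0; i < N; i++)
            printf("instr %d affinity %.2f\n", i, aff[i]);
        return 0;
    }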


Cluster Assignment

  • Cluster affinity + affinity range → used to:

    • Define a preferred cluster

    • Guide the instruction-to-cluster assignment process (see the sketch below)

  • Strongly preferred cluster (affinity outside the range, e.g. ≤ 0.3 or ≥ 0.7)

    • Schedule the instruction in that cluster

  • Weakly preferred cluster (affinity inside the range)

    • Schedule the instruction where global comms. are minimized

[Figure: with an affinity range of (0.3, 0.7), instruction IA (affinity 0) is strongly tied to cluster 1 and IC (affinity 0.9) to cluster 2, while IB (affinity 0.4) is only weakly preferred and can be placed in either cluster.]
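A minimal sketch, assuming the (0.3, 0.7) affinity range from the figure, of turning an instruction's affinity into a cluster decision: outside the range the instruction is pinned to its strongly preferred cluster, inside the range the scheduler picks the cluster that minimizes global communications.

    /* Sketch: map an instruction's affinity to a cluster decision using an
       affinity range; values and helper names are illustrative.           */
    #include <stdio.h>

    #define LOW  0.3
    #define HIGH 0.7

    /* Returns 0 or 1 for a strongly preferred cluster, -1 if only weakly
       preferred (the scheduler then minimizes global communications).     */
    static int preferred_cluster(double affinity) {
        if (affinity <= LOW)  return 0;      /* strongly prefers cluster 1 */
        if (affinity >= HIGH) return 1;      /* strongly prefers cluster 2 */
        return -1;                           /* weakly preferred           */
    }

    int main(void) {
        const double aff[] = { 0.0, 0.4, 0.9 };      /* IA, IB, IC from the figure */
        const char *name[] = { "IA", "IB", "IC" };
        for (int i = 0; i < 3; i++) {
            int c = preferred_cluster(aff[i]);
            if (c < 0) printf("%s: weakly preferred, minimize comms\n", name[i]);
            else       printf("%s: schedule in cluster %d\n", name[i], c + 1);
        }
        return 0;
    }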



EDD Results



Other Results

  • ED

    • The SLOW schemes are better

  • In all cases, these schemes are better than unified cache

    • 29-31% better in EDD, 19-29% better in ED

  • No configuration is best for all cases



Reconfigurable Cache Results

  • The OS can set each module in one state:

    • FAST mode / SLOW mode / Turned-off

  • The OS reconfigures the cache on a context switch

    • Depending on the applications scheduled in and scheduled out

  • Two different VDD and VTH for the cache

    • Reconfiguration overhead: 1-2 cycles [Flautner et al. 2002]

  • Simple heuristic to show potential

    • For each application, choose the estimated best cache configuration



Presentation Outline

  • Latency reduction techniques

    • Software memory coherence in distributed caches

    • Word-interleaved distributed cache

    • Flexible Compiler-Managed L0 Buffers

  • Energy reduction techniques

    • Multi-Module cache for clustered VLIW processor

  • Conclusions


Conclusions

  • Cache partitioning is a good latency reduction technique

  • Cache heterogeneity can be exploited to improve energy efficiency

  • The most energy- and performance-efficient scheme is a distributed data cache

    • Dynamic vs. Static mapping between addresses and clusters

      • Dynamic for performance (L0 Buffers)

      • Static for energy consumption (Variable-Based mapping)

    • Hardware vs. Software-based memory coherence solutions

      • Software solutions are viable



List of Publications

  • Distributed Data Cache Memories

    • ICS, 2002

    • MICRO-35, 2002

    • CGO-1, 2003

    • MICRO-36, 2003

    • IEEE Transactions on Computers, October 2005

    • Concurrency & Computation: Practice and Experience

      • (to appear late ’05 / ’06)

  • Heterogeneous Data Cache Memories

    • Technical report UPC-DAC-RR-ARCO-2004-4, 2004

    • PACT, 2005



Questions…

