Lucía G. Menezo Valentín Puente José Ángel Gregorio University of Cantabria (Spain)

MOSAIC: The Case for a Scalable Coherence Protocol for Complex On-Chip Cache Hierarchies in Many-Core Systems

Lucía G. Menezo

Valentín Puente

José Ángel Gregorio

University of Cantabria (Spain)

Outline
  • Motivation
  • Directory Schemas
    • In-cache
    • Sparse
  • MOSAIC Coherence Protocol
    • Examples
  • Evaluation Results
  • Conclusions
Motivation
  • Performance improvement: more processors per chip
  • Major challenge: the off-chip bandwidth wall
  • Introduce more cache into the chip
  • Complex on-chip cache hierarchies
  • The coherence protocol has a fundamental role to play
  • Which coherence protocol to use with a large number of cores?
    • Broadcast-based protocols → high energy requirements
    • Directory-based protocols → more storage needed for sharing information
  • MOSAIC: a new coherence protocol
    • Directory without inclusiveness
    • Token Coherence to guarantee correctness
Directory schemas: In-cache
  • Each block in the LLC includes tag, data and the sharers information
  • The LLC receives the requests → it needs precise knowledge of the sharers
  • Inclusiveness is necessary: any block in the private levels must also be allocated in the LLC
  • Advantage: less complex coherence protocol
  • Disadvantage: every LLC block carries the storage overhead
Figure: LLC with in-cache directory on top of an interconnection network that links the processors (P) and their private caches. The directory information held in the LLC is marked as overhead.

Directory schemas: Sparse
  • Directory entries are separated from the data
  • Entries are allocated on demand
  • Overhead is proportional to the aggregate size of the private levels (not the LLC)
  • Capacity and associativity have to be sufficient to hold the private-level cache tags
Figure: LLC with a separate sparse directory, connected through the interconnection network to the processors (P) and their private caches.

Directory schemas: Sparse
  • Duplicate-tag directory: holds all the tags of the private levels
  • Associativity = # cores * private caches associativity
  • # sets = # private caches sets
  • Example: 16 cores with 4-way 32KB L1 → 64-way

Figure: array of duplicated tags.
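The duplicate-tag sizing rule can be checked with a few lines of arithmetic. This is only a sketch: the function name is hypothetical, and the 64-byte block size is an assumption that the slides do not state.

```python
def duplicate_tag_geometry(cores, l1_bytes, l1_ways, block_bytes=64):
    """Return (sets, ways) of a duplicate-tag directory.

    block_bytes=64 is an assumed cache-block size (not given in the slides).
    """
    # Number of sets in one private cache.
    l1_sets = l1_bytes // (l1_ways * block_bytes)
    # The directory needs one way for every way of every private cache.
    dir_ways = cores * l1_ways
    # The directory has as many sets as one private cache.
    return l1_sets, dir_ways


sets, ways = duplicate_tag_geometry(cores=16, l1_bytes=32 * 1024, l1_ways=4)
print(f"{sets} sets x {ways} ways")
```

With the slide's example (16 cores, 4-way 32KB L1) the associativity comes out as 64, matching the figure quoted on the slide; the set count depends on the assumed block size.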

Directory schemas: Sparse
  • Decrease associativity: now << # cores * private caches associativity
  • Increase the number of sets
  • One tag may be in several private caches
  • More than one tag may map to the same entry → conflicts
  • Inclusiveness needed → invalidate private data (recall messages)

Figure: directory sets in which entries pair a tag with its sharers.
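The cost of inclusiveness in a reduced-associativity sparse directory can be illustrated with a toy model. This is a sketch under stated assumptions, not the evaluated design: the class name, the LRU replacement policy and the modulo set indexing are all hypothetical choices made for illustration.

```python
from collections import OrderedDict


class SparseDirectory:
    """Toy set-associative sparse directory that enforces inclusiveness."""

    def __init__(self, num_sets, ways):
        self.ways = ways
        # One LRU-ordered map per set: tag -> set of sharer ids.
        self.sets = [OrderedDict() for _ in range(num_sets)]
        # (tag, sharers) pairs whose private copies had to be invalidated.
        self.recalls = []

    def add_sharer(self, tag, core):
        # Assumption: the set index is the tag modulo the number of sets.
        entries = self.sets[tag % len(self.sets)]
        if tag not in entries and len(entries) == self.ways:
            # Conflict: evict the least recently used entry and recall
            # (invalidate) every private copy it was tracking.
            victim, sharers = entries.popitem(last=False)
            self.recalls.append((victim, sharers))
        entries.setdefault(tag, set()).add(core)
        entries.move_to_end(tag)


directory = SparseDirectory(num_sets=2, ways=1)
directory.add_sharer(tag=0, core=0)
directory.add_sharer(tag=2, core=1)   # maps to the same set as tag 0
print(directory.recalls)              # the recall caused by the conflict
```

Every entry in `recalls` stands for private data that is invalidated only because the directory ran out of room, which is the overhead the recall messages represent.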

MOSAIC Protocol
  • In-cache or sparse directory: it does not matter
  • No inclusiveness
  • No invalidations of data in private caches
  • Sharing information is reconstructed on demand
  • Uses token counting to avoid extra traffic and guarantee correctness
  • Token Coherence protocol:
    • Initially each block has a fixed number of tokens (equal to the number of processors)
    • Read request: data and 1 token
    • Write request: data and all tokens
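The token rules listed above can be sketched as simple bookkeeping for one block. This is an illustrative model only: the class and method names are hypothetical, and placing all tokens at a "home" node initially is an assumption rather than something the slides specify.

```python
class TokenBlock:
    """Token bookkeeping for a single cache block (illustrative only)."""

    def __init__(self, num_procs, home="home"):
        self.total = num_procs
        # Assumption: all tokens start at a home node (e.g. LLC or memory).
        self.tokens = {home: num_procs}

    def held(self, node):
        return self.tokens.get(node, 0)

    def can_read(self, node):
        # Reading requires at least one token.
        return self.held(node) >= 1

    def can_write(self, node):
        # Writing requires every token of the block.
        return self.held(node) == self.total

    def read(self, requester, supplier):
        """Read request: the supplier passes data and one token."""
        if self.held(supplier) < 1:
            raise ValueError("supplier holds no token")
        self.tokens[supplier] -= 1
        self.tokens[requester] = self.held(requester) + 1

    def write(self, requester):
        """Write request: the requester gathers data and all tokens."""
        self.tokens = {requester: self.total}


block = TokenBlock(num_procs=4)
block.read("P0", supplier="home")
block.read("P1", supplier="home")
assert block.can_read("P0") and not block.can_write("P0")
block.write("P1")
assert block.can_write("P1") and not block.can_read("P0")
```

Because the number of tokens per block never changes, a writer holding all of them knows that no other cache can hold a readable copy.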
MOSAIC Conceptual Approach

Figure: private caches of P0, P1 and P2, whose entries hold State, Num. Tokens and Data (values shown: I / 0 / N/A, O / 2 / DATA, S / 1 / DATA), linked by the on-chip network to the Last Level Cache (Data_slice and Dir_slice with Sharers) and to the Memory Controller.

MOSAIC Key Facts
  • When data is not present in the LLC → broadcast for reconstruction
  • Private caches report the number of tokens they hold
  • Token counting avoids negative acknowledgements and timeouts
  • The reconstruction message piggybacks the type of request and the requestor
  • Key: the directory may replace entries silently → no invalidations
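One way to read the key facts is that the directory can tell when reconstruction is complete simply by adding up reported tokens. The sketch below illustrates that idea under explicit assumptions: the function name is hypothetical, caches holding no tokens are assumed to stay silent, and `home_tokens` (tokens remaining at the LLC or memory) is an assumed parameter. The slides do not give the exact message rules.

```python
def reconstruct(private_tokens, total_tokens, home_tokens=0):
    """Rebuild the sharer list of one block after a directory miss.

    private_tokens: dict mapping cache id -> tokens it holds for the block.
    home_tokens: tokens assumed to remain at the LLC/memory (assumption).
    Returns (sharers, complete).
    """
    sharers = []
    counted = home_tokens
    # The loop stands in for the broadcast reaching every private cache.
    for cache, tokens in private_tokens.items():
        if tokens == 0:
            continue  # assumed: caches without tokens send no reply (no NACK)
        sharers.append(cache)
        counted += tokens
        if counted == total_tokens:
            break     # every token is accounted for, so no timeout is needed
    return sharers, counted == total_tokens


sharers, complete = reconstruct(
    {"P0": 0, "P1": 3, "P2": 1, "P3": 0}, total_tokens=4
)
print(sharers, complete)
```

Since the directory can rebuild an entry this way, dropping the entry silently loses no information, which is why no invalidations are required on a directory replacement.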
MOSAIC Read Request

Figure: message sequence among P0, P1, P2, P3, Dir and LLC. Initial holdings shown: 3 tokens and 1 token; the LLC copy is Invalid. Labels in order of appearance:
  • Read, Reconstruction
  • State IS
  • Data + token
  • Info 1 token
  • State S
  • Sharers [P2], Owner: ?
  • Info 2 tokens (Owner, State O)
  • Unblock (info 1 token)
  • Sharers [P2, P1], Owner: P1
  • State C, State A
  • Sharers [P2, P1, P0], Owner: P1
  • Read, Forward GETS to Owner, Data + token, Unblock
  • Sharers [P2, P1, P0, P3], Owner: P1
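The labels of the read-request slide can be replayed as a small trace to check that tokens are conserved. The assignment of the initial "3 tokens" to P1 and "1 token" to P2, and of P0 and P3 as the two readers, is an interpretation of the flattened figure, not something the transcript states explicitly; the function name is hypothetical.

```python
def replay_read_example():
    """Replay the read-request slide under an assumed reading of its labels."""
    # Assumed initial holdings: P1 is the owner with 3 tokens, P2 holds 1.
    tokens = {"P0": 0, "P1": 3, "P2": 1, "P3": 0}
    directory = {"sharers": [], "owner": None}

    # P0 reads; the directory has no entry, so reconstruction is broadcast.
    directory["sharers"].append("P2")     # P2 reports "Info 1 token"
    tokens["P1"] -= 1                     # P1 sends "Data + token" to P0
    tokens["P0"] += 1
    directory["sharers"].append("P1")     # P1 reports "Info 2 tokens"
    directory["owner"] = "P1"
    directory["sharers"].append("P0")     # P0 sends "Unblock (info 1 token)"

    # P3 reads; the rebuilt entry lets the directory forward GETS to the owner.
    tokens["P1"] -= 1                     # P1 sends "Data + token" to P3
    tokens["P3"] += 1
    directory["sharers"].append("P3")     # P3 sends "Unblock"
    return tokens, directory


tokens, directory = replay_read_example()
assert sum(tokens.values()) == 4          # one token per processor, conserved
print(tokens, directory)
```

Under this reading the second read needs no broadcast at all, because the first one already rebuilt the sharing information.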

MOSAIC Write Request

Figure: message sequence among P0, P1, P2, P3, Dir and LLC. Initial holdings shown: 3 tokens and 1 token; the LLC copy is Invalid. Labels in order of appearance:
  • Write, Reconstruction
  • State IS
  • Data + 3 tokens
  • 1 token
  • State S, State O, State C
  • Unblock (info all tokens)
  • State A
  • Sharers [P0], Owner: P0
  • State IM, State M
  • Directory Eviction

Evaluation methodology

Figure: tiled chip layouts made of cores (numbered up to Core 15), LLC slices (numbered up to Slice 31) and routers (R).

Simulation stack and Workloads
  • GEMS: full-system evaluation
    • SLICC: Specification Language for Implementing Cache Coherence
MOSAIC Performance: Reducing associativity

Figure: normalized execution time. Directory size: 128KB → 16K entries (8 bytes per entry).

Number of misses

Figure: normalized number of misses (annotation: x2).

MOSAIC Performance: Reducing associativity and capacity

Figure: normalized execution time. Directory sizes: 128KB → 16K entries (8 bytes per entry) and 16KB → 2K entries.
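The entry counts quoted for the two directory sizes follow directly from the 8 bytes per entry stated on the slide. A minimal check (the function name is hypothetical):

```python
def directory_entries(capacity_bytes, entry_bytes=8):
    """Number of directory entries that fit in the given capacity."""
    return capacity_bytes // entry_bytes


KB = 1024
print(directory_entries(128 * KB) // KB, "K entries")  # 128KB directory
print(directory_entries(16 * KB) // KB, "K entries")   # 16KB directory
```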

MOSAIC Latency

Figure: latency results. Directory size: 16KB → 2K entries.

MOSAIC Link Utilization

Figure: average network link utilization.

MOSAIC Scalability
  • 16 cores configuration

Figure: normalized link utilization.

Conclusions
  • Low complexity and great scalability
  • Very low storage overhead
  • No noticeable energy cost
  • Alternative for future many-core cache-coherent CMPs
  • MOSAIC Coherence Protocol: the bandwidth scalability of a directory combined with the elegance of Token Coherence
Realistic Cache Configuration
  • L1: 4-way 32KB / L2: 8-way 256KB
  • Same experiment with BASE: 20% impact in some cases

Figure: normalized execution time (annotations: x2 full dir, 1/10 full dir).

MOSAIC Energy

Figure: normalized dynamic energy.