Improving Multiple-CMP Systems with Token Coherence

Mike Marty¹, Jesse Bingham², Mark Hill¹, Alan Hu², Milo Martin³, and David Wood¹

¹ University of Wisconsin-Madison

² University of British Columbia

³ University of Pennsylvania

Thanks to Intel, NSERC, NSF, and Sun


Summary

  • Microprocessor → Chip Multiprocessor (CMP)

  • Symmetric Multiprocessor (SMP) → Multiple CMPs

  • Problem: Coherence with Multiple CMPs

  • Old Solution: Hierarchical Protocol (Complex & Slow)

  • New Solution: Apply Token Coherence

    • Developed for glueless multiprocessor [ISCA 2003]

    • Keep: Flat for Correctness

    • Exploit: Hierarchical for performance

  • Less Complex & Faster than Hierarchical Directory


Outline

  • Motivation and Background

    • Coherence in Multiple-CMP Systems

    • Example: DirectoryCMP

  • Token Coherence: Flat for Correctness

  • Token Coherence: Hierarchical for Performance

  • Evaluation


Coherence in Multiple-CMP Systems

  • Chip Multiprocessors (CMPs) emerging

  • Larger systems will be built with Multiple CMPs

[Figure: one CMP with four processors (P), each with instruction (I) and data (D) caches, connected by an on-chip interconnect to four L2 banks; four such chips (CMP 1 to CMP 4) joined by an interconnect]


Problem: Hierarchical Coherence

  • Intra-CMP protocol for coherence within CMP

  • Inter-CMP protocol for coherence between CMPs

  • Interactions between protocols increase complexity

    • explodes state space

[Figure: CMP 1 to CMP 4 joined by an interconnect, labeled with Intra-CMP Coherence inside a chip and Inter-CMP Coherence between chips]


Improving Multiple-CMP Systems with Token Coherence

  • Token Coherence allows Multiple-CMP systems to be...

    • Flat for correctness, but

    • Hierarchical for performance

[Figure: CMP 1 to CMP 4 joined by an interconnect, labeled with a Correctness Substrate and a Performance Protocol; callouts read "Low Complexity" and "Fast"]


Example: DirectoryCMP

2-level MOESI Directory

RACE CONDITIONS!

[Figure: CMP 0 (P0 to P3) and CMP 1 (P4 to P7), each processor with L1 I&D caches over a Shared L2 / directory, backed by Memory/Directory. One processor on each chip issues Store B; the animation traces getx, fwd, inv, ack, data/ack, and WB messages, with L1 copies in states S and O and directory entries B: [M I] and B: [S O].]


Outline

  • Motivation and Background

  • Token Coherence: Flat for Correctness

    • Safety

    • Starvation Avoidance

  • Token Coherence: Hierarchical for Performance

  • Evaluation


Example: Token Coherence [ISCA 2003]

  • Each memory block initialized with T tokens

  • Tokens stored in memory, caches, & messages

  • At least one token to read a block

  • All tokens to write a block

[Figure: processors P0 to P3, each with L1 I&D caches and an L2, joined by an interconnect to memory modules (mem 0, mem 3); the animation shows Load B and Store B requests]
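
The four token rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the value of T, the `Holder` class, and the `transfer` helper are hypothetical names chosen here, and details of the real protocol (such as which message carries the data) are left out.

```python
from dataclasses import dataclass

T = 4  # assumed token count per block; the slide only calls it "T"


@dataclass
class Holder:
    """A cache or memory module holding tokens for one block (hypothetical)."""
    name: str
    tokens: int = 0

    def can_read(self) -> bool:
        # At least one token is needed to read the block.
        return self.tokens >= 1

    def can_write(self) -> bool:
        # All T tokens are needed to write the block.
        return self.tokens == T


def transfer(src: Holder, dst: Holder, count: int) -> None:
    """Move tokens in a message; tokens are never created or destroyed."""
    if not 0 < count <= src.tokens:
        raise ValueError("cannot send more tokens than are held")
    src.tokens -= count
    dst.tokens += count


mem = Holder("mem", T)    # memory starts with every token for the block
p0 = Holder("P0")
p1 = Holder("P1")

transfer(mem, p0, 1)      # P0 holds one token: it may read
transfer(mem, p1, T - 1)  # P1 holds the rest: it may read, but not write
transfer(p0, p1, 1)       # P1 now holds all T tokens: it may write
```

Because a writer must hold every token, no other holder can have a token at the same time, which is what rules out a reader coexisting with a writer.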


Extending to Multiple-CMP System

[Figure: P0 to P3 with their L1 I&D caches and L2s are regrouped into two chips, CMP 0 and CMP 1; on each chip an on-chip interconnect joins the L1s to a Shared L2, and a further interconnect links the chips with mem 0 and mem 1]


Extending to Multiple-CMP System

  • Token counting remains flat

  • Tokens are held by caches

    • Handles shared caches and other complex hierarchies

[Figure: CMP 0 and CMP 1, each with two processors, L1 I&D caches, an on-chip interconnect, and a Shared L2, linked by an interconnect with mem 0 and mem 1; the animation shows Store B requests]
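
One way to read "token counting remains flat" is that the safety check looks at a single flat collection of token holders and never at which chip a holder sits on. A minimal sketch under that assumption follows; the holder names, the initial token distribution, and the value of T are all hypothetical.

```python
T = 8  # assumed token count for block B (hypothetical value)

# One flat map of token counts for block B. The key names mirror the figure's
# two chips, but the chip grouping plays no role in the safety check below.
tokens = {
    "cmp0.P0.L1": 0, "cmp0.P1.L1": 2, "cmp0.L2": 1,
    "cmp1.P2.L1": 1, "cmp1.P3.L1": 0, "cmp1.L2": 0,
    "mem0": 4,
}


def writers(tok):
    """Holders allowed to write: those holding all T tokens."""
    return [h for h, n in tok.items() if n == T]


def readers(tok):
    """Holders allowed to read: those holding at least one token."""
    return [h for h, n in tok.items() if n >= 1]


def safe(tok):
    """Token conservation, plus: a writer is the only holder that can read."""
    if sum(tok.values()) != T:
        return False
    w = writers(tok)
    return not w or readers(tok) == w


def store(tok, requester):
    """Gather every token at the requester, from any holder on any chip."""
    for holder in tok:
        if holder != requester:
            tok[requester] += tok[holder]
            tok[holder] = 0


safe_before = safe(tokens)
store(tokens, "cmp0.P0.L1")
safe_after = safe(tokens)
```

The same check would apply unchanged if the hierarchy had more levels, since it only iterates over holders.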


Starvation Avoidance

  • Tokens move freely in the system

    • Transient requests can miss in-flight tokens

    • Incorrect speculation, filters, prediction, etc.

[Figure: CMP 0 and CMP 1, each with two processors, L1 I&D caches, an on-chip interconnect, and a Shared L2, linked by an interconnect with mem 0 and mem 1; three processors issue Store B and their GETX requests race]


Starvation Avoidance

  • Solution: issue Persistent Request

    • Heavyweight request guaranteed to succeed

    • Methods: Centralized [2003] and Distributed (New)

[Figure: the same two-CMP system, with three outstanding Store B requests]


Old Scheme: Central Arbiter [2003]

  • Processors issue persistent requests

[Figure: three Store B requests time out; the processors send persistent requests for block B to the arbiter at memory, which holds entries B: P0, B: P2, and B: P1]


Old Scheme: Central Arbiter [2003]

  • Processors issue persistent requests

  • Arbiter orders and broadcasts activate

[Figure: the arbiter activates P0's request; every L1, Shared L2, and memory module records B: P0, while B: P2 and B: P1 wait at the arbiter]


Old Scheme: Central Arbiter [2003]

  • Processor sends deactivate to arbiter

  • Arbiter broadcasts deactivate (and next activate)

  • Bottom Line: handoff is 3 message latencies

[Figure: messages numbered 1, 2, and 3 mark the handoff; the B: P0 entries at every L1, Shared L2, and memory module are replaced by B: P2, and B: P1 still waits at the arbiter]
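
The three-message handoff can be made concrete with a small sketch. It assumes one plausible reading of the numbered steps: the finishing processor tells the arbiter, the arbiter broadcasts the deactivate together with the next activate, and the tokens then travel to the newly activated requester. The class and method names are hypothetical.

```python
from collections import deque


class CentralArbiter:
    """FIFO arbiter for persistent requests to one block (a sketch)."""

    def __init__(self):
        self.queue = deque()  # requesting processors, in arrival order

    def request(self, proc):
        """Queue a persistent request; return the currently active one."""
        self.queue.append(proc)
        return self.queue[0]

    def handoff(self):
        """Retire the active request; return (next_active, message_hops)."""
        done = self.queue.popleft()
        nxt = self.queue[0] if self.queue else None
        hops = [f"{done} -> arbiter: deactivate"]
        hops.append("arbiter -> all: deactivate"
                    + (f", activate {nxt}" if nxt else ""))
        if nxt:
            # Tokens can move only after the new activate has arrived.
            hops.append(f"{done} -> {nxt}: tokens")
        return nxt, hops


arb = CentralArbiter()
for proc in ("P0", "P2", "P1"):  # arrival order as suggested by the figure
    arb.request(proc)

next_active, hops = arb.handoff()
```

The hops are serialized, so the next requester waits for three message latencies before it can complete its store.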


Improved Scheme: Distributed Arbitration [NEW]

  • Processors broadcast persistent requests

[Figure: P0, P1, and P2 each broadcast a persistent request for block B; every L1, Shared L2, and memory module records entries P0: B, P1: B, and P2: B]


Improved Scheme: Distributed Arbitration [NEW]

  • Processors broadcast persistent requests

  • Fixed priority (processor number)

[Figure: every table holds P0: B, P1: B, and P2: B; P0: B is marked as the active request in each table]


Improved Scheme: Distributed Arbitration [NEW]

  • Processors broadcast persistent requests

  • Fixed priority (processor number)

  • Processors broadcast deactivate

[Figure: P0 broadcasts its deactivate (message 1); the tables drop P0: B and P1: B becomes the active request]


Improved Scheme: Distributed Arbitration [NEW]

  • Bottom line: Handoff is a single message latency

    • Subtle point: P0 and P1 must wait until next “wave”

[Figure: after the handoff (message 1), every table holds P1: B as the active request, followed by P2: B]
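
A small sketch can show why the "wave" rule matters with fixed priorities. It rests on two assumptions that go beyond the slides: broadcasts reach every node, so all local tables agree and can be modeled as one set, and a processor that deactivates must wait until every request pending at that moment has also been deactivated. Class and method names are hypothetical.

```python
class DistributedArbitration:
    """All nodes' persistent-request tables for one block, as one shared set."""

    def __init__(self):
        self.table = set()    # processor numbers with a pending request
        self.blocked_on = {}  # proc -> earlier-wave requests it must wait out

    def active(self):
        # Fixed priority: the lowest processor number wins.
        return min(self.table) if self.table else None

    def request(self, proc):
        """Broadcast a persistent request; refuse while the old wave drains."""
        if self.blocked_on.get(proc):
            return False
        self.table.add(proc)
        return True

    def deactivate(self, proc):
        """Broadcast deactivate; return the processor that gets the tokens."""
        self.table.discard(proc)
        for waiting in self.blocked_on.values():
            waiting.discard(proc)
        # Everything still pending belongs to this processor's wave.
        self.blocked_on[proc] = set(self.table)
        return self.active()


arb = DistributedArbitration()
for p in (0, 1, 2):
    arb.request(p)

next_holder = arb.deactivate(0)  # P0 finishes; its table already names P1
retry = arb.request(0)           # refused: P1 and P2 are still pending
```

Since the finishing processor already knows the next active request from its own table, it can send the tokens directly, which is the single message latency on the slide. Without the wave rule, P0 could reissue immediately and always outrank P2.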


Outline

  • Motivation and Background

  • Token Coherence: Flat for Correctness

  • Token Coherence: Hierarchical for Performance

  • Evaluation


Hierarchical for Performance: TokenCMP

  • Target System:

    • 2-8 CMPs

    • Private L1s, shared L2 per CMP

    • Any interconnect, but high-bandwidth

  • Performance Policy Goals:

    • Aggressively acquire tokens

    • Exploit on-chip locality and bandwidth

    • Respect cache hierarchy

    • Detect and handle missed tokens


Hierarchical for Performance: TokenCMP

  • Approach:

    • On L1 miss, broadcast within own CMP

      • Local cache responds if possible

    • On L2 miss, broadcast to other CMPs

    • Appropriate L2 bank responds or broadcasts within its CMP

      • Optionally filter

    • Responses between CMPs carry extra tokens for future locality

  • Handling missed tokens:

    • Timeout after average memory latency

    • Invoke persistent request (no retries)

  • Larger systems can use filters, multicast, soft-state directories
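
The escalation order in the approach above can be sketched as a small function. This only illustrates the order of the steps; the function name and its two boolean parameters are hypothetical, and filtering and actual timing are not modeled.

```python
def tokencmp_miss(found_on_chip, found_off_chip):
    """Return the steps one miss goes through under the policy above.

    found_on_chip:  a cache in the requester's own CMP can supply tokens.
    found_off_chip: another CMP (or memory) supplies tokens before the timeout.
    """
    steps = ["L1 miss: broadcast within own CMP"]
    if found_on_chip:
        steps.append("local cache responds")
        return steps
    steps.append("L2 miss: broadcast to other CMPs")
    if found_off_chip:
        steps.append("remote response arrives, possibly with extra tokens")
        return steps
    # Tokens were missed: no retry, go straight to the persistent request.
    steps.append("timeout: invoke persistent request")
    return steps


on_chip = tokencmp_miss(found_on_chip=True, found_off_chip=False)
off_chip = tokencmp_miss(found_on_chip=False, found_off_chip=True)
missed = tokencmp_miss(found_on_chip=False, found_off_chip=False)
```

The common case stays on chip in two steps, and only a request that misses in-flight tokens falls back to the heavyweight persistent request.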


Outline

  • Motivation and Background

  • Token Coherence: Flat for Correctness

  • Token Coherence: Hierarchical for Performance

  • Evaluation

    • Model checking

    • Performance w/ commercial workloads

    • Robustness


TokenCMP Evaluation

  • Simple?

    • Model checking

  • Fast?

    • Full-system simulation w/ commercial workloads

  • Robust?

    • Micro-benchmarks to simulate high contention


Complexity Evaluation with Model Checking

  • Methods:

    • TLA+ and TLC

    • DirectoryCMP omits all intra-CMP details

    • TokenCMP’s correctness substrate modeled

  • Result:

    • Complexity similar between TokenCMP and non-hierarchical DirectoryCMP

    • Correctness Substrate verified to be correct and deadlock-free

      • Small configuration, varied parameters

    • All possible performance protocols correct


Performance Evaluation

  • Target System:

    • 4 CMPs, 4 processors per CMP

    • 2GHz OoO SPARC, 8MB shared L2 per chip

    • Directly connected interconnect

  • Methods: Multifacet GEMS simulator

    • Simics augmented with timing models

    • Released soon: http://www.cs.wisc.edu/gems

    • ISCA 2005 Tutorial!

  • Benchmarks:

    • Performance: Apache, Spec, OLTP

    • Robustness: Locking micro-benchmark


Full-system Simulation: Runtime

  • TokenCMP performs 9-50% faster than DirectoryCMP


[Chart: runtime comparison of TokenCMP and DirectoryCMP; chart annotations: DRAM Directory, Perfect L2]


Full-system Simulation: Traffic

  • TokenCMP traffic is reasonable (or better)

    • DirectoryCMP's control overhead exceeds the cost of broadcast for a small system


Performance Robustness

[Chart: locking micro-benchmark (correctness substrate only); axis labeled from less contention to more contention]


Performance Robustness

[Chart: locking micro-benchmark; axis labeled from less contention to more contention]


Summary

  • Microprocessor → Chip Multiprocessor (CMP)

  • Symmetric Multiprocessor (SMP) → Multiple CMPs

  • Problem: Coherence with Multiple CMPs

  • Old Solution: Hierarchical Protocol (Complex & Slow)

  • New Solution: Apply Token Coherence

    • Developed for glueless multiprocessor [2003]

    • Keep: Flat for Correctness

    • Exploit: Hierarchical for performance

  • Less Complex & Faster than Hierarchical Directory


Full-system Simulation: Traffic


Full-system Simulation: Intra-CMP Traffic

