
Alternatives to Eager Hardware Cache Coherence on Large-Scale CMPs

University of Edinburgh

http://www.homepages.inf.ed.ac.uk/mc

Marcelo Cintra


Acknowledgments

  • Research Team

    • Dr. Tom Ashby (now at IMEC)

    • Christian Fensch

    • Pedro Jimenez

    • Constantino Ribeiro

    • Marius Kuhn

  • Funding

    • UK – EPSRC

    • EC – HiPEAC (NE)

    • EC – SARC (IP)

ORNL - March 2007


Why Cache Coherence?

  • When is it necessary?

    • Shared-memory parallel applications (including OS)

    • Thread migration under multiprogramming

    • Implementing synchronization (e.g., spin-locks and Dekker’s algorithm)

  • When is it not so important?

    • Sequential applications with no migration

    • Applications with not much locality or reuse of data

    • Message-passing parallel applications

ORNL - March 2007


Hardware Cache Coherence?

  • What is good about it?

    • Portability and transparency: no need to compile applications for different hierarchies

    • High performance: high selectivity of invalidations, low traffic, low latency operations

  • What is not so good about it?

    • Extra hardware: for coherence state, for controllers, and for suitable interconnects

    • Overkill:

      • When coherence is not necessary (see previous slide)

      • When software can do a reasonable, or even better, job

ORNL - March 2007


Hardware Cache Coherence?

  • Extra hardware:

    • Shared bus fabrics and crossbars will take up a large amount of chip real estate (easily >50% overhead for 32 processors or more by some accounts)

    • Protocols and controllers for distributed directories and multi-segment buses are complex and difficult to verify

  • Overkill:

    • Communication in applications does not occur all the time, but the hardware assumes it does

    • Hardware is always there even when not needed

ORNL - March 2007


Current CMPs

  • Example: IBM Power4

    • 2 8-issue cores with private 32KB L1

    • Crossbar to 3 L2 banks, 1.5MB total

    • 174M transistors in 0.22 µm on 412 mm²

  • and Power5

    • 2 8-issue, 2-way SMT cores with private 32KB L1

    • Crossbar to 3 L2 banks, 1.9MB total

    • 276M transistors in 0.13 µm on 389 mm²

ORNL - March 2007


Current CMPs

  • Example: Sun T1

    • 8 1-issue, 4-way MT cores with private 8KB L1

    • Crossbar to 4 L2 banks, 3MB total

    • ?M transistors in 0.09 µm on ? mm²

  • and T2

    • 8 1-issue, 8-way MT cores

    • Crossbar to 8 L2 banks, 4MB total

    • ?M transistors in 0.065 µm on ? mm²

ORNL - March 2007


Where Are We Heading?

Is it time to re-think traditional hardware cache coherence?

  • Moore’s Law applied to number of cores? Not yet.

    • Multithreading delays the need for more cores, but unlikely to go beyond 8-way

    • An ever-larger shared on-chip cache (i.e., with a directory) is the easy solution; but how large can/should we go?

  • But what if we want it to be so?

    • Must find scalable and area-efficient solutions to cache coherence problem

    • Must find ways to make the most out of the off-chip data bandwidth

    • Must (ideally) make architecture that is good for various workload types

ORNL - March 2007


Our Tiled Architecture Approach

  • Simple single- or dual-issue, possibly multithreaded, cores

  • Point-to-point interconnect

  • Cache coherence mostly handled by OS and compiler/programmer with minimal hardware support

  • Cache coherence can be turned off with little hardware waste for certain applications

  • Reconfigurable NUCA caches maximize the usefulness of the on-chip storage

ORNL - March 2007


Outline

  • Motivation

  • Page level placement and remote cache access

    • Design

    • Evaluation

    • Related Work and Conclusions

  • Software self invalidations and writebacks

    • Design

    • Evaluation

    • Related Work and Conclusions

  • Current and Future Directions

ORNL - March 2007


Baseline Tiled Architecture

  • Single issue RISC processors

  • Point-to-point interconnects

  • Caches are only for local data storage and there is no hardware cache coherence

  • Shown to be good for ILP- and DLP-based applications (e.g., MIT’s Raw)

  • But, no support for shared-memory parallel applications

ORNL - March 2007


Key Ideas

    • There is only one copy of each data item across all L1 caches (thus, no coherence problem)

    • Later we will relax this constraint a little bit

  • Data is allocated to L1 caches on a page level

    • OS controls placement

    • New hardware tables ensure that location of data can be easily determined

    • Placement can change at quiescent points (e.g., barriers)

  • Hardware supports remote cache accesses

This should work well if most data is placed locally and the on-chip distances/latencies are not too high

ORNL - March 2007


Mechanism Overview

  • Data placement

    • There is a spectrum of options

    • Dynamic per line is what previous proposals for NUCA-like and adaptive shared/private cache schemes do

    • Static per page is very restrictive and compiler/OS has to get the mapping absolutely right

    • Our proposal: dynamic OS controlled mapping per page

[Spectrum: from static per-page mapping (lowest hardware complexity, least mapping flexibility) to dynamic per-line mapping (highest hardware complexity, most mapping flexibility), with our proposal, dynamic OS-controlled per-page mapping, in between]

ORNL - March 2007


Data Placement

  • New level of indirection allows OS to map any given page to any L1 cache

  • OS extensions

    • The page table is extended to keep the owner tile ID alongside the (per-page) virtual-to-physical address translation (note that this exposes the chip structure to the OS)

    • Default mapping policy is first-touch

  • Hardware extensions

    • TLB-like table that caches the most recent translations and is used to locate the line upon loads and stores (MAP)

    • Instead of a directory coherence controller, a device to handle remote cache access requests (RAC); both extensions are sketched below
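
To make the division of labor concrete, here is a minimal C sketch of the extensions just listed; the types and names (pte_ext, map_entry, place_page_first_touch, map_fill) are hypothetical illustrations, not the actual OS or hardware interface.

```c
#include <stdint.h>

#define NO_OWNER 0xFF

/* Page-table entry extended with a per-page owner tile ID (hypothetical layout). */
struct pte_ext {
    uintptr_t phys_frame;   /* the usual virtual-to-physical translation */
    uint8_t   owner_tile;   /* tile whose L1 may cache lines of this page */
    uint8_t   flags;
};

/* TLB-like MAP entry that routes subsequent loads and stores (hypothetical). */
struct map_entry {
    uintptr_t vpage;        /* virtual page number */
    uint8_t   owner_tile;   /* local tile: normal access; remote tile: RAC request */
};

/* Default first-touch policy: the first tile to touch a page becomes its owner. */
void place_page_first_touch(struct pte_ext *pte, uint8_t touching_tile)
{
    if (pte->owner_tile == NO_OWNER)
        pte->owner_tile = touching_tile;
}

/* On a MAP miss, the owner ID is copied from the extended page table into the MAP. */
void map_fill(struct map_entry *e, uintptr_t vpage, const struct pte_ext *pte)
{
    e->vpage      = vpage;
    e->owner_tile = pte->owner_tile;
}
```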

ORNL - March 2007


Data Placement

  • Why not simply divide the physical address space across tiles and use the virtual-to-physical mapping to place pages?

    • More restrictive

    • Fragmentation in main memory

    • More difficult to change mapping

[Diagram: pages mapped to per-tile L1 caches either indirectly, through a partitioned physical address space (virtual address → physical address → tile), or directly from the virtual address via the per-page owner mapping]

ORNL - March 2007


Data Placement

  • Difficulty:

    • L1 caches are often virtually-indexed to speed up local processor access

    • Remote cache requests with physical addresses would require some inverse translation at the remote tile

  • Solution:

    • Ship virtual addresses over the on-chip interconnect

    • TLBs only keep virtual-to-physical translations for the data allocated locally

ORNL - March 2007


Load/Store Mechanism

  • Virtual address (v-addr) is used to index L1, TLB, and MAP

  • If MAP entry points to local L1 then proceed with cache and TLB lookups as usual

  • If MAP entry points to remote L1 then abort L1 lookup and ignore TLB result (including miss)

  • v-addr is sent over network to home tile’s RAC

  • At the home tile, the v-addr is used to index both the L1 and the TLB

  • Remote cache and TLB misses/hits are handled as usual (the lookup path is sketched below)
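
A sketch of this lookup path in C, assuming hypothetical stand-ins (map_lookup, l1_load, rac_remote_load, my_tile_id) for the MAP, L1/TLB, and RAC; the real mechanism is in hardware, so this only captures the decision logic.

```c
#include <stdint.h>

/* Stand-ins for hardware structures; declarations only, not a real interface. */
extern int      my_tile_id(void);
extern int      map_lookup(uintptr_t v_addr);                 /* returns owner tile ID */
extern uint64_t l1_load(uintptr_t v_addr);                    /* local L1 + TLB lookup */
extern uint64_t rac_remote_load(int home, uintptr_t v_addr);  /* ship v-addr to home RAC */

uint64_t load_word(uintptr_t v_addr)
{
    int home = map_lookup(v_addr);      /* MAP is indexed by the virtual address */

    if (home == my_tile_id()) {
        /* Local page: proceed with the usual L1 and TLB lookups. */
        return l1_load(v_addr);
    }

    /* Remote page: abort the local L1 lookup, ignore the local TLB result
     * (even a miss), and send the virtual address to the home tile's RAC.
     * The home tile indexes its own L1 and TLB with the same v-addr and
     * handles hits/misses as usual. */
    return rac_remote_load(home, v_addr);
}
```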

ORNL - March 2007


Allowing Migration of Data

  • To migrate pages:

    • Invalidate MAP entries

    • Write-back dirty cache lines

    • Invalidate cache contents

    • The next access will generate a (possibly new) mapping

  • Must be done at a quiescent state: e.g., barrier

  • Instructions for cache writeback and invalidation already exist in most ISAs; all we need to add is invalidation of the MAP (the barrier-time sequence is sketched below)

  • Note that it is not necessary to invalidate the TLBs
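
A minimal sketch of the barrier-time sequence, with hypothetical wrappers around the existing ISA cache-management instructions and the new MAP invalidation.

```c
/* Hypothetical wrappers: the first two map onto cache-management instructions
 * that already exist in most ISAs; only the MAP invalidation is new. */
extern void cache_writeback_dirty(void);
extern void cache_invalidate_all(void);
extern void map_invalidate_all(void);

/* Executed by every tile at a quiescent point (e.g., a barrier). */
void remap_quiescent_point(void)
{
    cache_writeback_dirty();   /* write back dirty lines */
    cache_invalidate_all();    /* invalidate cache contents */
    map_invalidate_all();      /* invalidate MAP entries; TLBs can be left intact */

    /* The next access to each page generates a (possibly new) first-touch
     * mapping, which is what allows pages to migrate between tiles. */
}
```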

ORNL - March 2007


Allowing Migration of Data: Example

[Diagram: tiles P0, P1, P2, each with an L1, TLB, and MAP; page pa is mapped to P0]

1. P0 ld 0xA0 (pa→P0)
2. P1 ld 0xA1 (remote)
3. P0 st 0xA2
4. Barrier (write-back + inv. L1, inv. MAP)

ORNL - March 2007


Allowing Migration of Data: Example (cont.)

[Diagram: after the barrier, page pa is remapped to P1 on first touch]

1. P0 ld 0xA0 (pa→P0)
2. P1 ld 0xA1 (remote)
3. P0 st 0xA2
4. Barrier (write-back + inv. L1, inv. MAP)
5. P1 ld 0xA1 (pa→P1)
6. P0 ld 0xA0 (remote)

ORNL - March 2007


Allowing Replication of Read-Only Data

  • Controlled replication: multiple readers but single, fixed, writer

  • Caveat: we assume release-consistent (RC) programs

  • Operation

    • Processors reading data in a page get a local mapping in MAP

    • The first write traps to the OS and the writing tile becomes the home/owner

    • Subsequent reads with old local mappings can continue

    • Subsequent writes, and reads without mapping, are directed to home tile

    • At lock acquires, MAP entries must also be invalidated

    • At barriers, the same procedure as for migration applies

  • Key features

    • OS does not have to keep track of sharers

    • Ownership of modified data is not transferred (the first-write handling is sketched below)
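
A sketch of the first-write handling, reusing the hypothetical pte_ext layout from the placement sketch earlier; field and function names are illustrative only.

```c
#include <stdint.h>

/* Hypothetical page-table entry, as in the earlier placement sketch. */
struct pte_ext {
    uintptr_t phys_frame;
    uint8_t   owner_tile;   /* fixed home/writer once the first write occurs */
    uint8_t   writable;     /* 0 while the page is replicated read-only */
};

/* The first write to a replicated read-only page traps to the OS. */
void on_first_write_trap(struct pte_ext *pte, uint8_t writing_tile)
{
    /* The writing tile becomes the home/owner; ownership is not transferred
     * afterwards and the OS never tracks the set of sharers. */
    pte->owner_tile = writing_tile;
    pte->writable   = 1;

    /* Readers holding old local MAP entries may keep using them until the next
     * synchronization point; at lock acquires and barriers those MAP entries
     * are invalidated, after which reads are directed to the home tile. */
}
```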

ORNL - March 2007


Allowing Replication of Data: Example

[Diagram: tiles P0, P1, P2, each with an L1, TLB, and MAP; page pa is replicated read-only in P0 and P1]

1. P0 ld 0xA0 (pa→P0)
2. P1 ld 0xA1 (pa→P1)
3. P0 st 0xA2 (owner)
4. Barrier (write-back + inv. L1, inv. MAP)

ORNL - March 2007


Allowing Replication of Data: Example (cont.)

[Diagram: after the barrier, P2 reads and then writes 0xA1 and becomes the owner of pa, while P0 keeps its old local mapping]

1. P0 ld 0xA0 (pa→P0)
2. P1 ld 0xA1 (pa→P1)
3. P0 st 0xA2 (owner)
4. Barrier (write-back + inv. L1, inv. MAP)
5. P0 ld 0xA2 (pa→P0)
6. P2 ld 0xA1 (pa→P2)
7. P2 st 0xA1 (owner)
8. P0 acq. lock (invalidate MAP)

ORNL - March 2007


Allowing Replication of Data: Example (cont.)

[Diagram: after P0's lock acquire invalidates its MAP, P0's read of 0xA1 is directed to the owner P2]

1. P0 ld 0xA0 (pa→P0)
2. P1 ld 0xA1 (pa→P1)
3. P0 st 0xA2 (owner)
4. Barrier (write-back + inv. L1, inv. MAP)
5. P0 ld 0xA2 (pa→P0)
6. P2 ld 0xA1 (pa→P2)
7. P2 st 0xA1 (owner)
8. P0 acq. lock (invalidate MAP)
9. P0 ld 0xA1 (pa→P2)

ORNL - March 2007


Evaluation Environment

  • Simulator:

    • PowerPC ISA

    • Pipelined single-issue cores and network modeled with Liberty

    • Caches and memory borrowed from SimpleScalar

    • All latencies modeled in detail; contention modeled in detail, except at internal nodes in the network

  • Compiler: gcc 3.4.4

  • Applications: Splash-2

  • Systems simulated:

    • Proposed scheme with and without migration+sharing (NUCA-Dist and NUCA-Dist+MS)

    • Idealized distributed directory coherent system (Dir-Coh)

ORNL - March 2007


Overall performance

3% Large-Scale CMPs

No gap

42%

NUCA-Dist+MS performs close (within 42%, and 19% on avg.) to idealized directory system

Overall Performance

ORNL - March 2007


Memory Access Breakdown

[Chart: memory accesses broken down into local hit, local miss, remote hit, and remote miss]

Off-chip accesses are very rare for all systems

ORNL - March 2007


Memory Access Breakdown

[Chart: memory accesses broken down into local hit, local miss, remote hit, and remote miss]

Migration and replication significantly reduce the amount of remote cache accesses

ORNL - March 2007


Memory Access Breakdown

[Chart: memory accesses broken down into local hit, local miss, remote hit, and remote miss]

ORNL - March 2007


Memory Access Breakdown

[Chart: memory accesses broken down into local hit, local miss, remote hit, and remote miss]

NUCA-Dist+MS has on average 90% local accesses

ORNL - March 2007


Overhead of Writebacks and Invalidations

Writebacks and invalidations don't seem to be a major bottleneck

  • Recall:

    • At barriers: write back dirty cache lines, invalidate cache and MAP

    • At lock acquires: write back dirty cache lines, invalidate MAP

ORNL - March 2007


Related Work

  • Tiled architectures:

    • Raw (MIT); TRIPS (U. Texas); Cyclops (IBM); Wavescalar (U. Washington); Vector-Thread (MIT)

  • Shared/Private on-chip caches

    • NUCA (U. Texas); CMP-NUCA (U. Wisconsin); NuRAPID (Purdue); Victim Replication (MIT); Cooperative Caching (U. Wisconsin)

  • Alternatives to Snooping and Directory Coherence:

    • Token Coherence and Ring Coherence (U. Wisconsin)

  • Software DSM

ORNL - March 2007


Conclusions

  • Alternative to full hardware distributed directory cache coherence that divides work between OS and hardware

  • Mostly acceptable performance degradation:

    • Within 50% for all applications and within 15% for 8 out of 12 applications w.r.t. an ideal hardware coherent system

    • Small number of remote cache accesses (10% on average and 38% maximum)

  • Likely much less complex than distributed directory coherence:

    • Only request-response transactions; no forwarding or multiple invalidations necessary

    • Less storage required compared to directory state

ORNL - March 2007


Outline

  • Motivation

  • Page level placement and remote cache access

    • Design

    • Evaluation

    • Related Work and Conclusions

  • Software self invalidations and writebacks

    • Design

    • Evaluation

    • Related Work and Conclusions

  • Current and Future Directions

ORNL - March 2007


Hardware vs. Software Cache Coherence

  • Hardware

    + Fast communication and possibly more accurate invalidations

    - Eager write propagation at the time of store

      - Unnecessary invalidations on false sharing

    - No exploitation of knowledge of memory consistency model, synchronization points, and communication patterns

    + Has been done commercially for many years and proven to work well

ORNL - March 2007


Hardware vs. Software Cache Coherence

  • Software

    + Compiler/programmer may know exactly what needs to be communicated and when

    - Exact sharing information not usually easy to identify statically

    - Hard to deal with hardware artifacts such as false sharing, speculative execution, and prefetching

    - Never really left the niche of research prototypes and DSPs

      - Only worked for loop-based data-parallel scientific applications

    + Less hardware required

ORNL - March 2007


Software Coherence for Release Consistency

  • Our approach: look at the communication requirements of the memory consistency model (RC in our case)

    • Synchronization points explicitly defined

    • No communication between threads happens in-between synchronization points

    • Lock acquires:

      • Data modified by previous lock owners must be visible to the acquirer (transitively)

    • Lock releases:

      • Data modified must be made visible to future lock owners (transitively)

    • Barriers:

      • Data modified by all threads before the barrier must be visible to all threads after the barrier

ORNL - March 2007


Software Coherence for Release Consistency

  • To enforce cache coherence on RC

    • Lock acquires:

      • Write back dirty lines

      • Invalidate cache contents

    • Lock release:

      • Write back dirty lines

    • Barrier:

      • Write back dirty lines

      • Invalidate cache contents

    • Use per-word dirty bits to perform selective writeback in memory to handle false sharing (the synchronization hooks are sketched below)
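
A minimal sketch of how these rules could be attached to the synchronization primitives; lock_acquire, lock_release, barrier_wait, and the cache-flush helpers are hypothetical names rather than a real runtime API.

```c
/* Hypothetical cache-management helpers; per-word dirty bits let the writeback
 * merge selectively in memory, which is what handles false sharing. */
extern void cache_writeback_dirty(void);
extern void cache_invalidate_all(void);

/* Underlying synchronization primitives (placeholder names). */
extern void raw_lock_acquire(int *lock);
extern void raw_lock_release(int *lock);
extern void raw_barrier_wait(void);

void lock_acquire(int *lock)
{
    raw_lock_acquire(lock);
    cache_writeback_dirty();   /* write back dirty lines */
    cache_invalidate_all();    /* drop stale copies so previous owners' updates are re-fetched */
}

void lock_release(int *lock)
{
    cache_writeback_dirty();   /* make this thread's updates visible to future owners */
    raw_lock_release(lock);
}

void barrier_wait(void)
{
    cache_writeback_dirty();   /* publish updates made before the barrier */
    raw_barrier_wait();
    cache_invalidate_all();    /* then re-fetch what other threads produced */
}
```

Invalidating after the acquire (and after the barrier completes) keeps stale lines from being read inside the critical section or after the barrier; the slides leave the exact placement open, so this is just one reasonable ordering.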

ORNL - March 2007


Evaluation Environment

  • Simulator:

    • SPARC ISA

    • Non-pipelined single-issue cores, caches, and network modeled with Simics

    • All latencies modeled in detail; contention modeled in detail at bus

  • Compiler: Sun C compiler

  • Applications: Splash-2 (scientific & engineering) and ALPBench (media)

  • Systems simulated:

    • Proposed scheme with self-invalidations and writebacks (flush)

    • Hardware coherent system (MSI)

ORNL - March 2007


Overall Performance

For a surprising number of applications, software coherence performs close (within 10%) to MSI

ORNL - March 2007


Overall Performance

Somewhat worse results with larger caches (256KB vs. 64KB), but still fairly close to MSI

ORNL - March 2007


Overall Performance

Even better results with a shared L2 cache

ORNL - March 2007


Related Work

  • Software cache coherence:

    • Pfister et al., ICPP’85; Cheong and Veidenbaum, ISCA’88; Cytron et al., ICPP’88

    • Compiler-directed, only successful with regular data-parallel applications

  • Most recent/relevant work: shared regions, Sandhu et al., PPoPP’93

    • Also based on explicit synchronization

    • Required user help to identify critical sections and the associated data

  • Software DSM

ORNL - March 2007


Conclusions

  • Revisited software-only cache coherence mechanisms

  • Pure software-based cache coherence for release-consistent applications performs surprisingly well

    • Within 50% for all applications and within 10% for 9 out of 13 applications w.r.t. hardware coherent system

  • Some applications still suffer due to unnecessary cache invalidations and writebacks

  • Hardware needed to perform more selective cache invalidation

ORNL - March 2007


Outline

  • Motivation

  • Page level placement and remote cache access

    • Design

    • Evaluation

    • Related Work and Conclusions

  • Software self invalidations and writebacks

    • Design

    • Evaluation

    • Related Work and Conclusions

  • Current and Future Directions

ORNL - March 2007


Current and Future Directions

  • Tiled architectures

    • Hybrid hardware-software cache coherence schemes

    • Mechanisms to support streaming applications

    • Reconfigurable on-chip cache hierarchies

  • Speculative parallelization

    • Tuning of common microarchitectural features

    • Automatic compilation and optimization with gcc

    • Software-only speculative parallelization

  • Performance prediction and design space exploration

    • With machine learning techniques

  • Database Systems

    • JIT code generation and optimization for relational queries

ORNL - March 2007


Alternatives to Eager Hardware Cache Coherence on Large-Scale CMPs

University of Edinburgh

http://www.homepages.inf.ed.ac.uk/mc

Marcelo Cintra


Implementing Locks

  • Straightforward to implement Compare&Swap-style primitives (there is only one copy of the lock variable)

  • Load-link/Store-conditional primitives are not as easy to implement without cache coherence

Coherence guarantees that all caches will be notified of any conflicting store

The absence of coherence makes it impossible to detect the conflict

ORNL - March 2007


Implementing Locks

  • Solution:

    • Place RESERVE register in the tile that is home to the lock

    • RESERVE register is shared by all locks in pages assigned to the tile

    • Must not allow new LL requests to overwrite the RESERVE register (to avoid livelock)

      • Must also add a timeout mechanism to prevent deadlock when there is no matching SC

    • Must add an ID tag to the RESERVE register to identify the processor with the pending LL (to avoid simultaneous acquisition of a lock); this handling is sketched below
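
A behavioral sketch of the RESERVE register at the home tile's RAC; the structure, constant, and routines are hypothetical and only illustrate the no-overwrite and ID-tag rules above (the timeout field is shown but its countdown is omitted).

```c
#include <stdint.h>
#include <stdbool.h>

#define RESERVE_TIMEOUT 1024   /* arbitrary value; prevents deadlock when no SC ever follows */

/* One RESERVE register per tile, shared by all locks in pages homed there. */
struct reserve_reg {
    bool      valid;
    uintptr_t addr;      /* address named by the pending LL */
    int       proc_id;   /* ID tag: processor that issued the pending LL */
    uint32_t  timeout;   /* decremented elsewhere; clears the reservation on expiry */
};

/* Load-linked arriving at the home RAC: never overwrite an existing
 * reservation, otherwise competing processors could livelock. */
bool rac_handle_ll(struct reserve_reg *r, uintptr_t addr, int proc_id)
{
    if (r->valid)
        return false;          /* requester retries later */
    r->valid   = true;
    r->addr    = addr;
    r->proc_id = proc_id;
    r->timeout = RESERVE_TIMEOUT;
    return true;
}

/* Store-conditional: succeeds only for the processor holding the reservation,
 * so two processors cannot acquire the same lock at the same time. */
bool rac_handle_sc(struct reserve_reg *r, uintptr_t addr, int proc_id)
{
    bool ok = r->valid && r->addr == addr && r->proc_id == proc_id;
    if (ok)
        r->valid = false;      /* reservation consumed; the store is performed */
    return ok;
}
```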

ORNL - March 2007


Evaluation Environment

  • Configurations simulated:

ORNL - March 2007


Evaluation Environment

  • Applications: Splash-2 benchmarks with reference input sets (except for radiosity)

ORNL - March 2007


Memory Latencies

NUCA-Dist+MS performs very close (within x%) to the ideal coherent system

ORNL - March 2007


Evaluation Environment

  • Configurations simulated: 2 to 16 processors

ORNL - March 2007


Evaluation Environment

  • Applications: Splash-2 applications and ALPBench benchmarks, both with reference input sets (except MPGDEC/ENC)

ORNL - March 2007


Sources of Largest Errors (top 10%)

[Table: source of error, number of occurrences, and error (%); the sources listed are biased conditionals, unknown iteration counts (i<P), incorrect IR workload estimation, and unknown inner-loop iteration counts, with errors ranging from roughly 54% to 161%]

ORNL - March 2007


Application Characteristics

Application | Loops | % of Seq. Time | Spec Data Size (KB)
TREE | accel_10 | 94 | < 1
WUPWISE | muldeo_200’, muldoe_200’ | 41 | 12,000
MDG | interf_1000 | 86 | < 1
LUCAS | mers_mod_square (line 444) | 20 | 4,000
AP3M | Shgravll_700 | 78 | 3,000

ORNL - March 2007


Breaking News!!

  • Intel has just announced an 80-core chip at ISSCC

    • Simple cores

    • Point-to-point interconnect

    • No cache coherence (?)

ORNL - March 2007

