

Cache coherence for CMPs

Miodrag Bolic


Private cache

  • Each cache bank is private to a particular core

  • Cache coherence is maintained at the L2 cache level

  • Intel Montecito [81], AMD Opteron [56], or IBM POWER6 [63]


Private cache

Advantages

  • Short L2 cache access latency.

  • Small amount of network traffic: since the local L2 cache bank can filter most of the memory requests, the number of coherence messages injected into the interconnection network is limited.

Disadvantages

  • Data blocks can get duplicated across the private banks.

  • If the working set accessed by the different cores is not well balanced, some caches can be over-utilized while others are under-utilized.


Shared cache

Shared cache

  • Cache coherence is maintained at the L1 level

  • The bits chosen for mapping a block to a particular bank are usually the least-significant ones (above the block offset)

  • Piranha [16], Hydra [47], Sun UltraSPARC T2 [105] and Intel Merom [104]
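The bank-mapping rule above can be sketched as follows; the block size and bank count are illustrative assumptions, not values from the slides.

```python
# Map a physical address to a shared-L2 bank using the least-significant
# bits of the block address. BLOCK_SIZE and NUM_BANKS are assumed values.

BLOCK_SIZE = 64   # bytes per cache block (assumption)
NUM_BANKS = 16    # one L2 bank per tile (assumption)

def l2_bank(addr: int) -> int:
    """Select the home L2 bank from the low-order block-address bits."""
    block_addr = addr // BLOCK_SIZE   # strip the block-offset bits
    return block_addr % NUM_BANKS     # low bits of the block address pick the bank

# Consecutive blocks are interleaved across consecutive banks:
print([l2_bank(i * BLOCK_SIZE) for i in range(4)])   # -> [0, 1, 2, 3]
```

Choosing the low-order bits spreads consecutive blocks across different banks, which is what gives the shared cache its load balancing.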


Shared caches

Advantages

  • Single copy of each block on-chip.

  • Workload balancing: the utilization of each cache bank does not depend on the working set accessed by each core, because blocks are distributed uniformly among the cache banks in a round-robin fashion; the effective aggregate cache capacity is therefore larger.

Disadvantage

  • Many requests will be serviced by remote banks (an L2 NUCA architecture), which increases access latency.


Hammer protocol

  • Used in AMD Opteron systems

  • It relies on broadcasting requests to all tiles to solve cache misses

  • It targets systems that use unordered point-to-point interconnection networks

  • On every cache miss, Hammer sends a request to the home tile. If the memory block is present on-chip, the request is forwarded to the rest of the tiles to obtain the requested block

  • All tiles answer to the forwarded request by sending either an acknowledgement or the data message to the requesting core.

  • The requesting core needs to wait until it receives the response from each other tile. When the requester receives all the responses, it sends an unblock message to the home tile.
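The per-miss message flow described above can be tallied with a simple sketch. The exact breakdown (the home forwards to every tile except itself and the requester, and each forwarded tile replies) is a simplifying assumption for illustration.

```python
# Rough per-miss message count for the Hammer broadcast protocol on an
# n-tile CMP, following the flow described above (simplified model).

def hammer_miss_messages(n_tiles: int) -> int:
    request   = 1            # requester -> home tile
    forwards  = n_tiles - 2  # home -> every tile except itself and the requester
    responses = n_tiles - 2  # each forwarded tile acks or sends data to the requester
    unblock   = 1            # requester -> home once all responses have arrived
    return request + forwards + responses + unblock

print(hammer_miss_messages(16))   # -> 30 messages for a single miss
```

The count grows linearly with the number of tiles, which is exactly the traffic and power concern listed under the protocol's disadvantages.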


Hammer protocol

Disadvantages

  • Requires three hops in the critical path before the requested data block is obtained.

  • Broadcasting invalidation messages considerably increases the traffic injected into the interconnection network and, therefore, its power consumption.


Directory protocol

  • A directory keeps track of which caches hold a copy of each memory block. In order to accelerate cache misses, this directory information is not stored in main memory; instead, it is usually stored on-chip at the home tile of each block.

  • In tiled CMPs, the directory structure is split into banks which are distributed across the tiles.

  • Each directory bank tracks a particular range of memory blocks.


Directory protocol

  • The indirection problem

    • every cache miss must reach the home tile before any coherence action can be performed.

    • adds unnecessary hops into the critical path of the cache misses

  • The directory memory overhead to keep track of the sharers of each memory block could be intolerable for large-scale configurations.

    • Example: block size 16 bytes, 64 tiles
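The slide's example (16-byte blocks, 64 tiles) works out as follows, assuming a full-map bit-vector directory with one presence bit per tile:

```python
# Directory overhead for a full-map bit-vector directory:
# one presence bit per tile for every tracked memory block.

block_bits = 16 * 8          # 16-byte block -> 128 bits of data
tiles = 64                   # one sharer/presence bit per tile

sharer_vector_bits = tiles   # 64 bits of directory state per block
overhead = sharer_vector_bits / block_bits

print(f"{overhead:.0%}")     # -> 50% directory state relative to data
```

Fifty percent overhead for these parameters illustrates why full-map sharer vectors stop being tolerable as the tile count grows.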


Comparison of protocols


Interleaving


Mapping between cache entries and directory entries

  • One way to keep the size of the directory entries constant is to store duplicate tags.
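A sketch of the contrast, with an assumed tag width: a full-map entry grows with the number of tiles, while a duplicate-tag entry is always one tag wide regardless of system size.

```python
# Per-entry directory state: full-map sharer vector vs. duplicate tags.
# The 24-bit tag width is an illustrative assumption.

TAG_BITS = 24   # assumed tag width of a private cache entry

def full_map_entry_bits(num_tiles: int) -> int:
    """Full-map: one presence bit per tile, so entries grow with the system."""
    return num_tiles

def duplicate_tag_entry_bits(num_tiles: int) -> int:
    """Duplicate tags: each entry holds one cache-tag copy, independent of tiles."""
    return TAG_BITS

for n in (16, 64, 256):
    print(n, full_map_entry_bits(n), duplicate_tag_entry_bits(n))
```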

