Cache coherence for CMPs



Miodrag Bolic


Private cache

  • Each cache bank is private to a particular core

  • Cache coherence is maintained at the L2 cache level

  • Examples: Intel Montecito [81], AMD Opteron [56], and IBM POWER6 [63]


Private cache

Advantages

  • Short L2 cache access latency

  • Small amount of network traffic generated: Since the local L2 cache bank can filter most of the memory requests, the number of coherence messages injected into the interconnection network is limited.

Disadvantages

  • Data blocks can get duplicated

  • If the working set accessed by the different cores is not well-balanced, some caches can be over-utilized whilst others can be under-utilized


Shared cache

  • Cache coherence is maintained at the L1 level

  • The address bits used to map a block to a particular bank are usually the less significant ones

  • Examples: Piranha [16], Hydra [47], Sun UltraSPARC T2 [105], and Intel Merom [104]
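
The bank mapping described above can be sketched in a few lines of Python. This is only an illustration: the block size, the number of banks, and the name `home_bank` are assumptions chosen for the example, not values taken from the slides.

```python
BLOCK_SIZE = 64   # bytes per cache block (assumed for illustration)
NUM_BANKS = 16    # number of L2 banks, a power of two (assumed)

def home_bank(address: int) -> int:
    """Return the index of the L2 bank that holds the block containing `address`."""
    block_number = address // BLOCK_SIZE   # discard the block-offset bits
    return block_number % NUM_BANKS        # the low-order block bits select the bank

# Consecutive blocks map to consecutive banks and wrap around after NUM_BANKS.
banks = [home_bank(block * BLOCK_SIZE) for block in range(18)]
print(banks)
```

Because the low-order bits change fastest between neighbouring blocks, a core that walks through a contiguous region spreads its accesses over all banks instead of filling one.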


Shared caches

Advantages

  • Single copy of blocks

  • Workload balancing: The utilization of each cache bank does not depend on the working set accessed by each core. Blocks are instead distributed uniformly among the cache banks in a round-robin fashion, which increases the effective aggregate cache capacity.

Disadvantages

  • Many requests will be serviced by remote banks (L2 NUCA architecture)


Hammer protocol

  • Used in AMD Opteron systems

  • It relies on broadcasting requests to all tiles to resolve cache misses

  • It targets systems that use unordered point-to-point interconnection networks

  • On every cache miss, Hammer sends a request to the home tile. If the memory block is present on-chip, the request is forwarded to the rest of the tiles to obtain the requested block

  • All tiles answer the forwarded request by sending either an acknowledgement or the data message to the requesting core.

  • The requesting core must wait until it has received a response from every other tile. When the requester has received all the responses, it sends an unblock message to the home tile.
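
The message flow of a single Hammer miss can be traced with a small simulation. This is a rough sketch under stated assumptions, not the protocol as implemented by AMD: it assumes the block is on-chip, that the requester is not the home tile, that exactly one tile (`owner`) supplies the data, and that the home tile answers without forwarding the request to itself. The names `Message` and `hammer_miss` are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Message:
    kind: str  # "request", "forward", "data", "ack", or "unblock"
    src: int   # sending tile
    dst: int   # receiving tile

def hammer_miss(requester: int, home: int, owner: int, num_tiles: int) -> list:
    """List the messages exchanged for one miss (assumes requester != home)."""
    msgs = [Message("request", requester, home)]           # hop 1
    others = [t for t in range(num_tiles) if t != requester]
    for tile in others:
        if tile != home:                                   # home checks its own copy locally
            msgs.append(Message("forward", home, tile))    # hop 2 (broadcast)
    for tile in others:
        kind = "data" if tile == owner else "ack"
        msgs.append(Message(kind, tile, requester))        # hop 3
    msgs.append(Message("unblock", requester, home))       # releases the home tile
    return msgs

msgs = hammer_miss(requester=0, home=5, owner=9, num_tiles=16)
print(len(msgs))  # 31 messages for 16 tiles under these assumptions
```

Under these assumptions a miss costs 2N - 1 messages for N tiles, and the data arrives after the request, forward, and data hops.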


Hammer protocol

Disadvantages

  • Requires three hops in the critical path before the requested data block is obtained.

  • Broadcasting invalidation messages considerably increases the traffic injected into the interconnection network and, therefore, its power consumption.


Directory protocol

  • To accelerate cache misses, the directory information is not stored in main memory. Instead, it is usually stored on-chip at the home tile of each block.

  • In tiled CMPs, the directory structure is split into banks which are distributed across the tiles.

  • Each directory bank tracks a particular range of memory blocks.
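
A minimal sketch of such a distributed directory follows. It assumes that blocks are assigned to home tiles by interleaving on the block number and that each entry is a plain set of sharer tiles; the slides specify neither, and the class and method names are hypothetical.

```python
class Directory:
    """Directory split into one bank per tile; each bank tracks its own blocks."""

    def __init__(self, num_tiles: int):
        # banks[t] maps block number -> set of tiles holding a copy
        self.banks = [dict() for _ in range(num_tiles)]

    def home_tile(self, block: int) -> int:
        # Assumed mapping: interleave blocks across tiles by block number.
        return block % len(self.banks)

    def read_miss(self, block: int, tile: int) -> None:
        """Record `tile` as a new sharer in the home bank of `block`."""
        bank = self.banks[self.home_tile(block)]
        bank.setdefault(block, set()).add(tile)

    def write_miss(self, block: int, tile: int) -> list:
        """Make `tile` the only holder; return the tiles that must be invalidated."""
        bank = self.banks[self.home_tile(block)]
        to_invalidate = sorted(bank.get(block, set()) - {tile})
        bank[block] = {tile}
        return to_invalidate

d = Directory(num_tiles=4)
d.read_miss(block=6, tile=0)
d.read_miss(block=6, tile=3)
print(d.home_tile(6), d.write_miss(block=6, tile=1))  # 2 [0, 3]
```

Unlike a broadcast, the write miss here contacts only the tiles recorded as sharers, which is the traffic saving a directory is meant to provide.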


Directory protocol

  • The indirection problem

    • Every cache miss must reach the home tile before any coherence action can be performed.

    • This adds unnecessary hops to the critical path of cache misses.

  • The directory memory overhead needed to keep track of the sharers of each memory block could be intolerable for large-scale configurations.

    • Example: block size 16 bytes, 64 tiles
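
The example can be worked through numerically. The calculation below assumes a full-map bit-vector directory, that is, one presence bit per tile for every block; the slide does not name the sharing code, so treat this as one plausible reading. State and tag bits are ignored.

```python
def full_map_overhead(block_bytes: int, num_tiles: int) -> float:
    """Directory bits per data bit, assuming one presence bit per tile per block."""
    block_bits = block_bytes * 8
    sharer_bits = num_tiles          # full-map bit vector (assumed)
    return sharer_bits / block_bits

print(full_map_overhead(16, 64))   # 64 / 128 = 0.5
print(full_map_overhead(64, 64))   # 64 / 512 = 0.125
```

With 16-byte blocks and 64 tiles, the sharer information alone adds half as many bits as the data it tracks, and the ratio grows linearly with the number of tiles.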


Comparison of protocols


Interleaving


Mapping between cache entries and directory entries

  • One way to keep the size of the directory entries constant is to store duplicate tags.
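
The idea can be illustrated with a toy duplicate-tag directory. The slide only names the technique, so everything concrete here is an assumption: direct-mapped private caches, the set count, and the names `DuplicateTagDirectory`, `fill`, and `sharers` are all hypothetical.

```python
class DuplicateTagDirectory:
    """Mirrors the tag array of every tile's private cache (direct-mapped, assumed)."""

    def __init__(self, num_tiles: int, sets_per_cache: int):
        self.sets = sets_per_cache
        # tags[tile][index] is a copy of the tag held in that cache set, or None
        self.tags = [[None] * sets_per_cache for _ in range(num_tiles)]

    def _split(self, block: int):
        return block % self.sets, block // self.sets   # (set index, tag)

    def fill(self, tile: int, block: int) -> None:
        """Record that `tile` cached `block`, replacing whatever the set held."""
        index, tag = self._split(block)
        self.tags[tile][index] = tag

    def sharers(self, block: int) -> list:
        """Find the sharers by comparing the block's tag against every mirrored set."""
        index, tag = self._split(block)
        return [t for t, mirror in enumerate(self.tags) if mirror[index] == tag]

d = DuplicateTagDirectory(num_tiles=4, sets_per_cache=8)
d.fill(tile=0, block=10)
d.fill(tile=2, block=10)
d.fill(tile=2, block=18)   # same set as block 10, so tile 2 evicts block 10
print(d.sharers(10), d.sharers(18))  # [0] [2]
```

Each entry holds a single tag, so its width does not depend on the number of tiles; the cost moves to the lookup, which must compare against one mirrored set per tile.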

