Non-Uniform Cache Architectures for Wire Delay Dominated Caches

Non-Uniform Cache Architectures for Wire Delay Dominated Caches

Abhishek Desai

Bhavesh Mehta

Devang Sachdev

Gilles Muller


Plan

  • Motivation

  • What is NUCA

  • UCA and ML-UCA

  • Static NUCA

  • Dynamic NUCA

  • Simulation Results


Motivation

  • Bigger L2 and L3 caches are needed

    • Programs are larger

    • SMT requires large caches to maintain locality across threads

    • Bandwidth demands on the package have increased

    • Smaller process technologies permit more bits per mm²

  • Wire delays dominate in large caches

    • Bulk of the access time will involve routing to and from the banks, not the bank accesses themselves


What is NUCA?

Data residing closer to the processor is accessed much faster than data residing physically farther away.

Example:

The closest bank in a 16MB on-chip L2 cache built in 50nm process technology could be accessed in 4 cycles, while an access to the farthest bank might take 47 cycles.
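This latency spread can be sketched with a toy model in which each bank's access time is a fixed array-access cost plus a per-hop routing delay from the cache controller. The 4×4 grid and cycle counts below are illustrative assumptions, not figures from the slides:

```python
# Sketch (assumed numbers): per-bank access latency in a NUCA modeled as
# a fixed bank access time plus a per-hop wire/routing delay.

def bank_latency(row, col, bank_cycles=3, hop_cycles=2):
    """Latency = bank access + routing hops from the controller at (0, 0)."""
    hops = row + col  # Manhattan distance from controller to bank
    return bank_cycles + hop_cycles * hops

latencies = [bank_latency(r, c) for r in range(4) for c in range(4)]
print(min(latencies), max(latencies))  # closest vs. farthest bank -> 3 15
```

Even in this tiny grid the farthest bank is 5× slower than the closest, which is the effect NUCA designs exploit.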


UCA and ML-UCA

[Figure: a monolithic UCA bank vs. ML-UCA, a 10-cycle L2 backed by a 41-cycle L3]

ML-UCA

  • Avg. access time: 11/41 cycles (L2/L3)

  • Banks: 8/32 (L2/L3)

  • Size: 16MB

  • Technology: 50nm

UCA

  • Avg. access time: 255 cycles

  • Banks: 1

  • Size: 16MB

  • Technology: 50nm
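The gap between the monolithic and banked organizations follows from simple averaging: one big bank charges every access its full-array latency, while many small banks let nearby accesses complete quickly. The banked latency range below is an illustrative assumption, not a slide figure:

```python
# Sketch: average access time for one monolithic bank vs. many small banks.
# 255 cycles is the UCA figure from the slide; the banked latency range is
# an assumed, evenly spaced closest-to-farthest spread.

uca = [255]                        # every access pays the full-array latency
nuca = list(range(4, 48, 2))       # 22 banks spanning 4..46 cycles (assumed)

avg = lambda xs: sum(xs) / len(xs)
print(avg(uca), avg(nuca))         # 255.0 vs 25.0
```

With a uniform access distribution the banked design is an order of magnitude faster on average, before any smart placement is applied.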


Static-NUCA-1

[Figure: bank access times range from 17 cycles (closest bank) to 41 cycles (farthest)]

S-NUCA-1

  • Avg. access time: 34 cycles

  • Banks: 32

  • Size: 16MB

  • Technology: 50nm

  • Area: wire overhead 20.9%


[Figure: S-NUCA-1 cache design — banks divided into sub-banks, each bank with private address and data buses; labeled components include the predecoder, tag array, wordline driver and decoder, and sense amplifiers]


Static-NUCA-2

[Figure: bank access times range from 9 cycles (closest bank) to 32 cycles (farthest)]

S-NUCA-2

  • Avg. access time: 24 cycles

  • Banks: 32

  • Size: 16MB

  • Technology: 50nm

  • Area: channel overhead 5.9%


[Figure: S-NUCA-2 cache design — banks connected by a switched network instead of private buses; labeled components include the switches, tag array, predecoder, wordline driver and decoder, sense amplifiers, and address/data buses]


Dynamic-NUCA

[Figure: data migration between banks; access times range from 4 cycles (closest bank) to 47 cycles (farthest)]

D-NUCA

  • Avg. access time: 18 cycles

  • Banks: 256

  • Size: 16MB

  • Technology: 50nm


Management of Data in D-NUCA

  • Mapping:

    • How are data mapped to the banks, and in which banks can a datum reside?

  • Search:

    • How is the set of possible locations searched to find a line?

  • Movement:

    • Under what conditions should data be migrated from one bank to another?


Simple Mapping (implemented)

[Figure: banks arranged in columns below the memory controller; each of the 8 bank sets is one column, and the banks within a column form ways 1–4 of that set]
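The simple mapping above can be sketched as address decoding: low-order bits give the line offset, the next bits select one of the 8 bank sets (one column of banks), and the line may then reside in any bank of that column. The field widths below (64-byte lines) are assumptions for illustration:

```python
# Sketch of the simple-mapping idea (assumed field widths, 64-byte lines):
# offset bits, then bank-set index bits selecting one of 8 bank columns.

LINE_BITS = 6        # 64-byte cache lines (assumption)
BANK_SETS = 8        # number of bank columns, as on the slide

def bank_set_of(addr):
    """Return the bank set (column) a physical address maps to."""
    return (addr >> LINE_BITS) & (BANK_SETS - 1)

print(bank_set_of(0x1C0))  # 0x1C0 >> 6 = 7 -> bank set 7
```

Once the column is fixed, only the banks in that column need to be searched for the line.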


Fair and Shared Mapping

[Figure: two alternative layouts of bank sets around the memory controller — fair mapping and shared mapping]


Searching Cached Lines

  • Incremental search

  • Multicast search (implemented)

  • Limited multicast

  • Partitioned multicast

    Smart Search:

  • ss-performance

  • ss-energy
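The trade-off between the two basic policies can be sketched numerically: incremental search probes banks closest-first and stops at the hit, while multicast probes all candidate banks in parallel. Bank latencies and the probe-count cost model below are illustrative assumptions:

```python
# Sketch (assumed latencies): incremental vs. multicast search over the
# candidate banks of one bank set. 'target' is the bank holding the line.

def incremental(latencies, target):
    """Probe closest-first; total time is the sum of each probe's latency."""
    time = sum(latencies[: target + 1])
    probes = target + 1
    return time, probes

def multicast(latencies, target):
    """Probe every candidate at once; time is the hit bank's latency."""
    return latencies[target], len(latencies)

lat = [4, 8, 12, 16]          # closest-to-farthest bank latencies (assumed)
print(incremental(lat, 2))    # (24, 3): slower, but fewer probes (less energy)
print(multicast(lat, 2))      # (12, 4): faster, but probes every bank
```

The limited and partitioned multicast variants, and the smart-search predictors, sit between these two extremes.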


Dynamic Movement of Lines

  • Keep the LRU line furthest from the processor and the MRU line closest

  • One-bank promotion on a hit (implemented)

    Policy on miss:

  • Which line is evicted?

    • Line in the furthest (slowest) bank -- (implemented)

  • Where is the new line placed?

    • Closest (fastest) bank

    • Furthest (slowest) bank -- (implemented)

  • What happens to the victim line?

    • Zero copy policy (implemented)

    • One copy policy
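The one-bank promotion policy above can be sketched as a swap: on a hit, the line exchanges places with whatever occupies the next-closer bank, so frequently used lines gradually migrate toward the processor. The four-bank list and tags below are illustrative, not from the slides:

```python
# Sketch of D-NUCA one-bank promotion: on a hit, the line swaps with the
# line one bank closer to the processor. Bank contents here are just tags;
# real hardware swaps whole cache lines between banks.

banks = ["A", "B", "C", "D"]   # bank 0 is closest to the processor

def access(banks, tag):
    """Look up 'tag'; on a hit, promote it one bank toward the processor."""
    i = banks.index(tag)
    if i > 0:
        banks[i - 1], banks[i] = banks[i], banks[i - 1]
    return i                   # bank where the hit occurred

access(banks, "C")             # hit in bank 2 -> "C" moves to bank 1
access(banks, "C")             # hit in bank 1 -> "C" moves to bank 0
print(banks)                   # ['C', 'A', 'B', 'D']
```

Two consecutive hits pull "C" into the fastest bank, while the displaced lines drift one bank farther away, approximating the LRU-far/MRU-near goal without global reordering.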


Advantages of D-NUCA over ML-UCA

  • D-NUCA does not enforce inclusion, avoiding redundant copies of the same line

  • In ML-UCA the faster level may not match an application's working set size: either it is too large (and thus slow) or too small (and thus incurs misses)


Configuration for simulation

  • Used the sim-alpha simulator and the CACTI cache model

  • Simple mapping

  • Multicast search

  • One-bank promotion on each hit

  • Replacement policy that chooses the block in the slowest bank as the victim of a miss


Hit Rate Distribution for D-NUCA


Simulation results – integer benchmarks


Simulation results – FP benchmarks


Summary

D-NUCA has the following advantages:

  • Low Access Latency

  • Technology scalability

  • Performance stability

  • Flattens the memory hierarchy

