
Non-Uniform Cache Architectures for Wire Delay Dominated Caches

Abhishek Desai

Bhavesh Mehta

Devang Sachdev

Gilles Muller



Plan

  • Motivation

  • What is NUCA

  • UCA and ML-UCA

  • Static NUCA

  • Dynamic NUCA

  • Simulation Results



Motivation

  • Bigger L2 and L3 caches are needed

    • Programs are larger

    • SMT requires large caches to preserve spatial locality across threads

    • Bandwidth demands on the package have increased

    • Smaller process technologies permit more bits per mm²

  • Wire delays dominate in large caches

    • The bulk of the access time goes to routing to and from the banks, not the bank accesses themselves



What is NUCA?

Data residing closer to the processor is accessed much faster than data residing physically farther away.

Example:

The closest bank in a 16MB on-chip L2 cache built in 50nm process technology could be accessed in 4 cycles, while an access to the farthest bank might take 47 cycles.
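A minimal sketch (my own model, not from the presentation) of why placement matters in a wire-delay-dominated cache: total latency is a fixed bank access time plus round-trip routing delay, so distant banks pay mostly for wires. The grid coordinates, hop delay, and bank access time below are illustrative assumptions chosen to echo the 4-to-47-cycle range above.

```python
# Illustrative latency model for a banked, wire-delay-dominated cache:
# total latency = fixed bank access time + round-trip routing delay.
# All constants here are assumptions, not the paper's exact parameters.

def access_latency(bank_x, bank_y, bank_access=3, hop_delay=1):
    """Latency to reach bank (bank_x, bank_y) from the cache controller."""
    hops = bank_x + bank_y                      # Manhattan distance to the bank
    return bank_access + 2 * hops * hop_delay   # route there, access, route back

print(access_latency(0, 0))    # closest bank:  3 cycles (mostly bank access)
print(access_latency(15, 7))   # farthest bank: 47 cycles (mostly routing)
```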



UCA and ML-UCA

[Figure: UCA vs. ML-UCA organizations; ML-UCA backs a fast L2 (10-cycle banks) with a larger L3 (41-cycle banks)]

UCA

  • Avg. access time: 255 cycles

  • Banks: 1

  • Size: 16MB

  • Technology: 50nm

ML-UCA

  • Avg. access time: 11/41 cycles

  • Banks: 8/32

  • Size: 16MB

  • Technology: 50nm


Static-NUCA-1 (S-NUCA-1)

[Figure: S-NUCA-1 bank array; the closest bank is reached in 17 cycles, the farthest in 41]

  • Avg. access time: 34 cycles

  • Banks: 32

  • Size: 16MB

  • Technology: 50nm

  • Area: 20.9% wire overhead


S-NUCA-1 cache design

[Figure: S-NUCA-1 bank design, with labeled parts: bank, sub-bank, address bus, data bus, predecoder, tag array, wordline driver and decoder, sense amplifier]


Static-NUCA-2 (S-NUCA-2)

[Figure: S-NUCA-2 bank array; the closest bank is reached in 9 cycles, the farthest in 32]

  • Avg. access time: 24 cycles

  • Banks: 32

  • Size: 16MB

  • Technology: 50nm

  • Area: 5.9% channel overhead


S-NUCA-2 cache design

[Figure: S-NUCA-2 switched-network design, with labeled parts: bank, switch, address bus, data bus, predecoder, tag array, wordline driver and decoder, sense amplifier]


Dynamic-NUCA (D-NUCA)

[Figure: D-NUCA bank array with data migration; the closest bank is reached in 4 cycles, the farthest in 47]

  • Avg. access time: 18 cycles

  • Banks: 256

  • Size: 16MB

  • Technology: 50nm



Management of Data in D-NUCA

  • Mapping:

    • How are data mapped to the banks, and in which banks can a datum reside?

  • Search:

    • How is the set of possible locations searched to find a line?

  • Movement:

    • Under what conditions should data be migrated from one bank to another?



Simple Mapping (implemented)

[Figure: simple mapping; the memory controller sits at one edge of the array, the 8 bank sets are the 8 columns of banks, and the 4 ways of one set occupy the 4 banks of its column, way 1 closest to the controller through way 4 farthest]
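A minimal sketch of the simple mapping pictured above, under my own assumptions about line size and index bits (the presentation does not give them): the low-order set bits pick one of the 8 bank-set columns, and the 4 ways of that set live in the 4 banks of the column, ordered by distance from the memory controller.

```python
NUM_BANK_SETS = 8   # columns of banks, as in the figure
WAYS = 4            # one way per bank within a column
LINE_BYTES = 64     # assumed cache-line size

def candidate_banks(addr):
    """All banks that may hold the line at addr, as (column, way) pairs;
    way 0 is closest to the memory controller, way WAYS-1 is farthest."""
    line = addr // LINE_BYTES
    column = line % NUM_BANK_SETS   # low-order set bits choose the bank set
    return [(column, way) for way in range(WAYS)]

print(candidate_banks(0x1234C0))  # the 4 banks to search for this address
```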



Fair and Shared Mapping

[Figure: the fair mapping and shared mapping alternatives, each drawn with the memory controller at one edge of the bank array]



Searching Cached Lines

  • Incremental search

  • Multicast search (implemented)

  • Limited multicast

  • Partitioned multicast

    Smart Search:

  • ss-performance

  • ss-energy
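A minimal sketch (stand-in data structures, not the simulator's code) contrasting the first two policies above: incremental search probes the candidate banks one at a time, closest first, while multicast search probes them all at once, trading energy for latency.

```python
# banks: candidate banks ordered closest-first; latency[b]: access time of
# bank b; contents[b]: tags held by bank b. All stand-ins, assumptions mine.

def incremental_search(banks, latency, contents, tag):
    """Probe one bank at a time, closest first; stop at the first hit."""
    elapsed = 0
    for b in banks:
        elapsed += latency[b]          # each sequential probe adds its latency
        if tag in contents[b]:
            return elapsed             # hit: minimal energy, possibly slow
    return None                        # miss in every candidate bank

def multicast_search(banks, latency, contents, tag):
    """Probe all candidate banks in parallel; the hit bank sets the latency."""
    hits = [latency[b] for b in banks if tag in contents[b]]
    return min(hits) if hits else None # faster, but every probe costs energy

banks = [0, 1, 2, 3]
latency = {0: 4, 1: 9, 2: 14, 3: 19}
contents = {0: set(), 1: set(), 2: {"X"}, 3: set()}
print(incremental_search(banks, latency, contents, "X"))  # 4 + 9 + 14 = 27
print(multicast_search(banks, latency, contents, "X"))    # 14
```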



Dynamic Movement of Lines

  • LRU line kept furthest, MRU line kept closest

  • One-bank promotion on a hit (implemented)

    Policy on miss:

  • Which line is evicted?

    • Line in the furthest (slowest) bank -- (implemented)

  • Where is the new line placed?

    • Closest (fastest) bank

    • Furthest (slowest) bank -- (implemented)

  • What happens to the victim line?

    • Zero copy policy (implemented)

    • One copy policy
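A minimal sketch (assumptions mine) of the implemented choices above, for one bank-set column ordered closest-first: a hit swaps the line one bank closer, and a miss evicts the line in the slowest bank, places the new line there, and simply drops the victim (zero-copy policy).

```python
def on_hit(column, way):
    """One-bank promotion: swap the hit line with its neighbor one bank closer."""
    if way > 0:
        column[way - 1], column[way] = column[way], column[way - 1]

def on_miss(column, new_line):
    """Evict from the farthest (slowest) bank, place the new line there,
    and apply the zero-copy policy: the victim leaves the cache entirely."""
    victim = column[-1]
    column[-1] = new_line
    return victim

column = ["A", "B", "C", "D"]   # one bank set, closest bank first
on_hit(column, 2)               # hit on "C": it migrates one bank closer
print(column)                   # ['A', 'C', 'B', 'D']
print(on_miss(column, "E"))     # 'D' is the victim
print(column)                   # ['A', 'C', 'B', 'E']
```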



Advantages of D-NUCA over ML-UCA

  • D-NUCA does not enforce inclusion, which prevents redundant copies of the same line

  • In ML-UCA, the faster level may not match an application's working-set size: it can be too large (and thus slow) or too small (and thus incur misses)



Configuration for simulation

  • Used Sim-Alpha and Cacti

  • Simple mapping

  • Multicast search

  • One-bank promotion on each hit

  • Replacement policy that chooses the block in the slowest bank as the victim of a miss
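For concreteness, the evaluated configuration collected into a small config record; the field names are my own, while the values restate the slides above (the sim-alpha + Cacti flow).

```python
# Hypothetical config record summarizing the simulated D-NUCA setup;
# field names are mine, values restate the slides.
DNUCA_CONFIG = {
    "cache_size": "16MB",
    "technology": "50nm",
    "banks": 256,
    "mapping": "simple",                  # column-per-bank-set mapping
    "search": "multicast",                # probe all candidate banks in parallel
    "promotion": "one_bank_per_hit",      # swap toward the controller on a hit
    "replacement": "evict_slowest_bank",  # victim comes from the farthest bank
    "victim_policy": "zero_copy",         # evicted line is not copied elsewhere
}
```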



Hit Rate Distribution for D-NUCA



Simulation results – integer benchmarks



Simulation results – FP benchmarks



Summary

D-NUCA has the following strengths:

  • Low access latency

  • Technology scalability

  • Performance stability

  • A flatter memory hierarchy

