CSCE 432/832 High Performance: An Introduction to Multicore Memory Hierarchy

Dongyuan Zhan

What We Learnt from the Video
  • The Motivation of Multi-core Processors
    • Better utilization of on-chip transistor resources as technology scales
    • Use thread-level parallelism to increase throughput
  • Two Models of Multi-core Processors
    • Homogeneous vs. Heterogeneous CMPs
  • Communication & Synchronization among Cores
    • Communicate with each other via the shared cache/memory
    • Synchronize reads/writes via locks, mutexes, or transactional memory
  • How to Program Multi-core Processors
    • Using OpenMP to write parallel programs
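
As a reminder of what OpenMP code looks like, here is a minimal sketch of a parallel reduction in C. The function name parallel_sum is hypothetical and not from the video. If the compiler is run without OpenMP support, the pragma is ignored and the loop runs serially with the same result.

```c
#include <assert.h>
#include <stddef.h>

/* Sum an array with an OpenMP parallel-for reduction.
 * Each thread accumulates a private partial sum; OpenMP combines
 * the partial sums when the loop finishes. */
long parallel_sum(const int *data, size_t n)
{
    assert(data != NULL || n == 0);
    long sum = 0;
    #pragma omp parallel for reduction(+:sum)
    for (long i = 0; i < (long)n; i++)
        sum += data[i];
    return sum;
}
```

The reduction clause avoids a lock around sum, which would otherwise serialize the cores on a single shared variable.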

CSCE 432/832, CMP Memory Hierarchy

From Teraflop Multiprocessor to Teraflop Multicore

ASCI RED (1997~2005)

Intel Teraflop Multicore Prototype

From Teraflop Multiprocessor to Teraflop Multicore
  • Pictured here is ASCI Red, the first computer to reach one teraflops, i.e., one trillion floating-point operations per second.
    • Used about 10,000 Pentium processors running at 200 MHz
    • Consumed 500 kW of power for computation and another 500 kW for cooling
    • Occupied a very large room
  • Just over 10 years later, Intel announced the world’s first processor to deliver the same teraflops performance on a single chip
    • 80 cores on a single chip running at 5 GHz
    • Consumes only 62 watts of power
    • Small enough to rest on the tip of your finger

A Commodity Many-core Processor

Tile64 Multicore Processor (2007~now)

The Schematic Design of Tile64

[Figure: Tile64 block diagram. Tile labels: PROCESSOR (P0, P1, P2, Reg File), CACHE (L1I, L1D, ITLB, DTLB, L2 CACHE, 2D DMA), SWITCH (MDN, TDN, UDN, IDN, STN). Chip-edge labels: DDR2 Memory Controllers 0-3; XAUI MAC/PHY 0 and 1; PCIe 0 and 1 MAC/PHY; Serdes; GbE 0 and 1; Flexible IO; UART, HPI, JTAG, I2C, SPI.]

  • 4 essential components
    • Processor Core
    • On-chip Cache
    • Network-on-Chip (NoC)
    • I/O controllers

Agenda Today
  • An Introduction to the Multi-core Memory Hierarchy
    • Why do we need a memory hierarchy for any processor?
      • A tradeoff between capacity and latency
      • Make common cases fast as a result of programs’ locality (general principle in computer architecture)
    • What is the difference between the memory hierarchies of single-core and multi-core CPUs?
      • Quite distinct from each other in on-chip caches
    • Managing the CMP caches is of paramount importance to performance
      • Again, we still have the capacity and latency issues for CMP caches
      • How to keep CMP caches coherent
      • Hardware & software management schemes

The Motivation for Mem Hierarchy
  • Trading off between capacity and latency: upper levels are faster, lower levels are larger
  • CPU Registers (on chip): 100s of bytes, 0.3-0.5 ns access time
    • Registers <-> L1 cache: instruction operands, 4-8 bytes, managed by the program/compiler
  • L1 and L2 Cache (on chip): 10s-100s of KBytes, ~1 ns to ~10 ns access time
    • L1 cache <-> L2 cache: blocks, 32 or 64 bytes, managed by the cache controller
    • L2 cache <-> main memory: blocks, 64 or 128 bytes, managed by the cache controller
  • Main Memory (off chip): GBytes, 200 ns to 300 ns access time, ~$15/GByte
    • Memory <-> disk: pages, 4K to 64K bytes, managed by the OS
  • Disk (off chip): 1s-10s of TBytes, ~10 ms access time, ~$0.15/GByte

Programs’ Locality
  • Two Kinds of Basic Locality
    • Temporal:
      • if a memory location is referenced, then it is likely that the same memory location will be referenced again in the near future.

        int i; register int j;
        for (i = 0; i < 20000; i++)
            for (j = 0; j < 300; j++)
                ;   /* empty body: i and j are reused on every iteration */

    • Spatial:
      • if a memory location is referenced, then it is likely that nearby memory locations will be referenced in the near future.
  • Locality + smaller (faster) hardware to make the common case fast = memory hierarchy
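
To make spatial locality concrete, the following sketch sums the same matrix in two traversal orders. The function names are hypothetical, and the matrix is assumed to be stored row by row in one contiguous array, as C does for 2D arrays.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major traversal: consecutive accesses touch adjacent addresses,
 * so every fetched cache block is fully used (good spatial locality). */
long sum_row_major(const int *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    long sum = 0;   /* sum is reused on every iteration: temporal locality */
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            sum += m[r * cols + c];
    return sum;
}

/* Column-major traversal: consecutive accesses are cols * sizeof(int)
 * bytes apart, so for wide rows they fall in different cache blocks. */
long sum_col_major(const int *m, size_t rows, size_t cols)
{
    assert(m != NULL);
    long sum = 0;
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            sum += m[r * cols + c];
    return sum;
}
```

Both functions return the same value; only the order of memory accesses differs, which is what the cache sees.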

The Challenges of Memory Wall
  • The Truths:
    • In many applications, 30-40% of the total instructions are memory operations
    • CPU speed scales much faster than DRAM speed
      • In 1980, CPUs and DRAMs operated at almost the same speed, about 4 MHz to 8 MHz
      • CPU clock frequency has doubled every 2 years;
      • DRAM speed has only doubled about every 6 years.
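
Taking the two doubling rates above at face value, the gap compounds quickly. The sketch below is only that arithmetic; the function names are hypothetical, and it counts whole doublings only.

```c
#include <assert.h>

/* Relative speed after `years` for a quantity that doubles every
 * `period` years (whole doublings only), starting from 1. */
unsigned long relative_speed(unsigned years, unsigned period)
{
    assert(period > 0 && years / period < 32);
    return 1UL << (years / period);
}

/* CPU-to-DRAM speed ratio, assuming both start equal (as in 1980),
 * the CPU doubles every 2 years and DRAM doubles every 6 years. */
unsigned long cpu_dram_gap(unsigned years)
{
    return relative_speed(years, 2) / relative_speed(years, 6);
}
```

For example, cpu_dram_gap(30) is 2^15 / 2^5 = 1024: under these assumptions, 30 years open up a three-orders-of-magnitude gap.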

Memory Wall
  • DRAM bandwidth is quite limited: two DDR2-800 modules provide a peak bandwidth of 12.8 GB/s (about 6.4 bytes per CPU cycle if the CPU runs at 2 GHz). So, in a multicore processor, when multiple 64-bit cores access memory at the same time, they contend for the DRAM bandwidth.
  • Memory Wall: the CPU spends a lot of time on off-chip memory accesses. E.g., Intel XScale spends on average 35% of its total execution time on memory accesses. The high latency and low bandwidth of the DRAM system become a bottleneck for CPUs.
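
The bytes-per-cycle figure is a simple unit conversion, sketched below. The function names are hypothetical, and the even split across cores is an idealization that ignores contention overhead.

```c
#include <assert.h>

/* Bytes of DRAM bandwidth available per CPU cycle.
 * bandwidth_mb_s: peak DRAM bandwidth in MB/s (10^6 bytes per second)
 * cpu_mhz:        CPU clock in MHz (10^6 cycles per second)
 * The 10^6 factors cancel, leaving bytes per cycle. */
double bytes_per_cycle(double bandwidth_mb_s, double cpu_mhz)
{
    assert(cpu_mhz > 0.0);
    return bandwidth_mb_s / cpu_mhz;
}

/* Idealized even share per core when n_cores access memory at once. */
double bytes_per_cycle_per_core(double bandwidth_mb_s, double cpu_mhz,
                                unsigned n_cores)
{
    assert(n_cores > 0);
    return bytes_per_cycle(bandwidth_mb_s, cpu_mhz) / n_cores;
}
```

With 12800 MB/s and a 2000 MHz clock this gives 6.4 bytes per cycle; split across four cores it is 1.6 bytes per cycle each, well below the 8 bytes of a single 64-bit access.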

Solutions
  • How to alleviate the memory wall problem
    • Hiding the memory access latency: prefetching
    • Reducing the latency: moving memory closer to the CPU, e.g., 3D-stacked on-chip DRAM
    • Increasing the bandwidth: optical I/O
    • Reducing the number of memory accesses: keeping as much reusable data in the cache as possible

CMP Cache Organizations (Shared L2 Cache)

CMP Cache Organizations (Private L2 Cache)

How to Address Blocks in a CMP
  • How to address blocks in a single-core processor
    • L1 caches are typically virtually indexed but physically tagged, while L2 caches are mostly physically indexed and tagged (related to virtual memory).
  • How to address blocks in a CMP
    • L1 caches are accessed in the same way as in a single-core processor
    • If the L2 caches are private, the addressing of a block is still the same
    • If the L2 caches are shared among all of the cores, then

How to Address Blocks in a CMP

How to Address Blocks in a CMP

CMP Cache Coherence
  • Snoop based
    • All caches on the bus snoop the bus to determine whether they hold a copy of the requested data block. Multiple copies of a block can be read without any coherence problem; however, before writing, a processor must gain exclusive access to the block by invalidating or updating the other copies over the bus.
    • Sufficient for small-scale CMPs with a bus interconnect
  • Directory based
    • Shared data is tracked in a common directory that maintains coherence between caches. When a cache line is changed, the directory either updates or invalidates the copies of that line in other caches.
    • Necessary for many-core CMPs with interconnects such as a mesh
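
A minimal sketch of the invalidate variant of snooping for a single block, assuming a three-state MSI protocol. MSI is an assumption made here for illustration; the slide does not name a specific protocol, and all type and function names below are hypothetical.

```c
#include <assert.h>

typedef enum { INVALID, SHARED, MODIFIED } line_state;
typedef enum { BUS_READ, BUS_READ_EXCLUSIVE } bus_op;

/* New state of a cache's copy after it snoops another cache's request. */
line_state snoop(line_state s, bus_op op)
{
    if (s == INVALID)
        return INVALID;             /* no copy here, nothing to do */
    if (op == BUS_READ_EXCLUSIVE)
        return INVALID;             /* another core will write: drop copy */
    return SHARED;                  /* BUS_READ: a MODIFIED copy is
                                       written back and downgraded */
}

/* Core `reader` reads the block; caches[i] is cache i's state for it. */
void core_read(line_state caches[], int n, int reader)
{
    assert(reader >= 0 && reader < n);
    if (caches[reader] != INVALID)
        return;                     /* hit: no bus traffic needed */
    for (int i = 0; i < n; i++)
        if (i != reader)
            caches[i] = snoop(caches[i], BUS_READ);
    caches[reader] = SHARED;
}

/* Core `writer` writes the block: all other copies are invalidated. */
void core_write(line_state caches[], int n, int writer)
{
    assert(writer >= 0 && writer < n);
    if (caches[writer] == MODIFIED)
        return;                     /* already the exclusive owner */
    for (int i = 0; i < n; i++)
        if (i != writer)
            caches[i] = snoop(caches[i], BUS_READ_EXCLUSIVE);
    caches[writer] = MODIFIED;
}
```

Every request here is seen by every cache, which is why this style suits a shared bus; a directory replaces the broadcast with a lookup of which caches actually hold the line.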

Non-Uniform Cache Access Time in Shared L2 Caches

Non-Uniform Cache Access Time in Shared L2 Caches
  • Let’s assume that Core0 needs to access a data block stored in Tile15
    • Assume that accessing an L2 cache bank takes 10 cycles;
    • Assume that transferring a data block from one router to an adjacent one takes 2 cycles;
    • Then a remote access to the block in Tile15 takes 10 + 2*(2*6) = 34 cycles (6 hops each way at 2 cycles per hop), much longer than a local L2 access.
  • Non-Uniform Cache Access Time (NUCA) means that the latency of accessing a cache is a function of the physical locations of both the requesting core and the cache.
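
The 34-cycle figure can be reproduced with a small latency model. The sketch assumes a 4x4 mesh with tiles numbered row by row (Tile0 in one corner, Tile15 in the opposite one) and shortest-path routing; this layout is an assumption chosen to match the 6-hop distance in the example, and the names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

#define MESH_DIM     4    /* assumed 4x4 mesh, 16 tiles */
#define BANK_CYCLES 10    /* L2 bank access time from the example */
#define HOP_CYCLES   2    /* router-to-router transfer time from the example */

/* Number of router hops between two tiles on the mesh (Manhattan distance). */
int hops(int src_tile, int dst_tile)
{
    assert(src_tile >= 0 && src_tile < MESH_DIM * MESH_DIM);
    assert(dst_tile >= 0 && dst_tile < MESH_DIM * MESH_DIM);
    int dx = abs(src_tile % MESH_DIM - dst_tile % MESH_DIM);
    int dy = abs(src_tile / MESH_DIM - dst_tile / MESH_DIM);
    return dx + dy;
}

/* Cycles for a core in core_tile to access the L2 bank in bank_tile:
 * the bank access plus the request and reply traversals. */
int l2_access_cycles(int core_tile, int bank_tile)
{
    return BANK_CYCLES + 2 * HOP_CYCLES * hops(core_tile, bank_tile);
}
```

Here l2_access_cycles(0, 15) gives 34 and l2_access_cycles(0, 0) gives 10, so in this model the farthest bank costs 3.4 times as much as the local one.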

How to Reduce the Latency of Remote Cache Access
  • At least two solutions:
    • Place the data close enough to the requesting core
      • Victim replication [1]: placing L1 victim blocks in the local L2 cache;
      • Change the layout of the data: I will talk about one approach pretty soon;
    • Use faster transmission
      • Use special on-chip interconnect to transmit data via radio-wave or light-wave signals

The RF-Interconnect [2]

Interference in Caching in Shared L2 Caches
  • The Problem: because the shared L2 caches are accessible to all cores, one core can interfere with another in placing blocks in L2 caches
    • For example, in a dual-core CMP, if a streaming application like a video player is co-scheduled with a scientific computation application that has good locality, then the aggressive streaming application will continuously place new blocks in the L2 cache and evict the computation application’s cached blocks, thus hurting the computation application’s performance.
  • Solution:
    • Regulate each core’s usage of the L2 cache based on how much it benefits from the cache (its cache utility) [3]

The Capacity Problems in Private L2 Caches
  • The Problems:
    • The L2 capacity accessible to each core is fixed, regardless of the core’s real cache capacity demand. E.g., if two applications are co-scheduled on a dual-core CMP with two 1 MB private L2 caches, and one application has a cache demand of 0.5 MB while the other asks for 1.5 MB, then one private L2 cache is underutilized while the other is overwhelmed.
    • If a parallel program is running on the CMP, different cores will have a lot of data in common. However, the private L2 cache organization requires each core to maintain a copy of the common data in its local cache, leading to a lot of data redundancy and degrading the effective cache capacity.
  • A Solution: Cooperative Caching [4]

A Comparison Between Shared and Private L2 Caches

Using OS to Manage CMP Caches [5]
  • Two kinds of address space:
    • virtual (or logical) & physical
  • Page coloring: there is a correspondence between a physical page and its location in the cache
  • In CMPs with a shared L2 cache, by changing the mapping scheme, we can use the OS to determine where a virtual page required by a core is located in the L2 cache
    • Tile#(where a page is cached) = physical page number % #Tiles
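
The mapping rule above can be sketched as follows, together with one way an OS could exploit it: when a core needs a new page, pick a free physical page whose number maps to the desired tile. The function names and the simple free-page array are hypothetical and not taken from [5].

```c
#include <assert.h>

/* Tile whose L2 bank caches a given physical page:
 * tile = physical page number % number of tiles. */
unsigned tile_of_page(unsigned long ppn, unsigned n_tiles)
{
    assert(n_tiles > 0);
    return (unsigned)(ppn % n_tiles);
}

/* Find a free physical page that maps to `tile`.
 * is_free[ppn] is nonzero if physical page ppn is unallocated.
 * Returns the page number, or -1 if no such page is free. */
long find_page_for_tile(const unsigned char *is_free, unsigned long n_pages,
                        unsigned n_tiles, unsigned tile)
{
    assert(n_tiles > 0 && tile < n_tiles);
    /* Pages tile, tile + n_tiles, tile + 2*n_tiles, ... all map to `tile`. */
    for (unsigned long ppn = tile; ppn < n_pages; ppn += n_tiles)
        if (is_free[ppn])
            return (long)ppn;
    return -1;
}
```

Asking for the requesting core's own tile places the page in the nearest L2 bank; when that fails, a caller could retry with a neighboring tile.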

Using OS to Manage CMP Caches

Using OS to Manage CMP Caches
  • The Benefits
    • Improved Data Proximity
    • Capacity Sharing
    • Data Sharing (to be introduced next time)

Summary
  • What we have covered in this class
    • The Memory Wall problem for CMPs
    • The two basic cache organizations for CMPs
    • HW & SW approaches to managing the last-level cache

References

[1] M. Zhang, et al. Victim Replication: Maximizing Capacity while Hiding Wire Delay in Tiled Chip Multiprocessors. ISCA’05.

[2] F. Chang, et al. CMP Network-on-Chip Overlaid With Multi-Band RF-Interconnect. HPCA’08.

[3] A. Jaleel, et al. Adaptive Insertion Policies for Managing Shared Caches. PACT’08.

[4] J. Chang, et al. Cooperative Caching for Chip Multiprocessors. ISCA’06.

[5] S. Cho, et al. Managing Distributed, Shared L2 Caches through OS-Level Page Allocation. MICRO’06.
