
SE-292 High Performance Computing

Memory Hierarchy



Reality Check

  • Question 1: Are real caches built to work on virtual addresses or physical addresses?

  • Question 2: What about multiple levels in caches?

  • Question 3: Do modern processors use pipelining of the kind that we studied?



Virtual Memory System

  • To support memory management when multiple processes are running concurrently.

    • Page based, segment based, ...

  • Ensures protection across processes.

  • Address Space: range of memory addresses a process can address (includes program (text), data, heap, and stack)

    • 32-bit address → 4 GB address space

    • With VM, the address generated by the processor is a virtual address



Page-Based Virtual Memory

  • A process’s address space is divided into a number of fixed-size pages.

  • A page is the basic unit of transfer between secondary storage and main memory.

  • Different processes share the physical memory

  • Virtual addresses must be translated to physical addresses.


Virtual Pages to Physical Frame Mapping

[Figure: the virtual pages of Process 1 through Process k are mapped onto physical frames in Main Memory]



Page Mapping Info (Page Table)

  • A page can be mapped to any frame in main memory.

  • Where to store/access the mapping?

    • Page Table

    • Each process will have its own page table!

  • Address Translation: virtual to physical address translation


Address Translation

[Figure: a virtual address (18-bit virtual page no. + 14-bit offset) indexes the Page Table; each entry holds Pr/V/D bits and a physical page no., which is concatenated with the offset to form the physical address (18-bit physical frame no. + 14-bit offset)]

  • Virtual address (issued by processor) to Physical Address (used to access memory)
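A minimal sketch of this translation in C, assuming the slide’s 18-bit virtual page number and 14-bit offset (the single-level table, PTE layout, and fault handling are all simplified for illustration):

    #include <stdint.h>

    #define OFFSET_BITS 14                    /* 16 KB pages, as on the slide */
    #define VPN_BITS    18                    /* 32-bit virtual addresses     */

    /* Hypothetical page-table entry: a valid bit plus the frame number */
    typedef struct { uint32_t valid : 1; uint32_t frame : 18; } PTE;

    static PTE page_table[1u << VPN_BITS];    /* 256K entries = the 1 MB table */

    /* Translate a virtual address; returns 0 here to signal a page fault */
    uint32_t translate(uint32_t vaddr) {
        uint32_t vpn    = vaddr >> OFFSET_BITS;              /* upper 18 bits */
        uint32_t offset = vaddr & ((1u << OFFSET_BITS) - 1); /* lower 14 bits */
        if (!page_table[vpn].valid)
            return 0;                        /* page fault: the OS takes over */
        return ((uint32_t)page_table[vpn].frame << OFFSET_BITS) | offset;
    }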



Memory Hierarchy: Secondary to Main Memory

  • Analogous to the main memory to cache relationship.

  • When the required virtual page is in main memory: Page Hit

  • When the required virtual page is not in main memory: Page fault

  • Page fault penalty is very high (tens of milliseconds) as it involves disk (secondary storage) access.

  • Page fault ratio should therefore be very low (10^-4 to 10^-5).

  • Page faults are handled by the OS.
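As a rough worked example (assuming, say, a 100 ns main-memory access, a 10 ms fault penalty, and a 10^-4 fault ratio): effective access time ≈ 100 ns + 10^-4 × 10 ms ≈ 1.1 µs, i.e. memory appears more than ten times slower. This is why the fault ratio must be kept so small.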



Page Placement

  • A virtual page can be placed anywhere in physical memory (fully associative).

  • Page Table keeps track of the page mapping.

  • Separate page table for each process.

  • Page table size is quite large!

    Assume a 32-bit address space and 16 KB page size.

    # of entries in page table = 2^32 / 2^14 = 2^18 = 256K

    Page table size = 256K × 4 B = 1 MB = 64 pages!

  • Page table itself may be paged (multi-level page tables)!
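The same arithmetic as a small C sketch (the 4-byte entry size matches the slide’s assumption):

    #include <stdio.h>

    int main(void) {
        unsigned addr_bits   = 32;    /* 32-bit virtual address space    */
        unsigned offset_bits = 14;    /* 16 KB pages                     */
        unsigned pte_bytes   = 4;     /* assumed 4-byte page-table entry */

        unsigned long entries = 1ul << (addr_bits - offset_bits); /* 2^18 = 256K */
        unsigned long bytes   = entries * pte_bytes;              /* 1 MB        */
        unsigned long pages   = bytes >> offset_bits;             /* 64 pages    */

        printf("%lu entries, %lu bytes (= %lu pages)\n", entries, bytes, pages);
        return 0;
    }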


Page Identification

  • Use the virtual page number to index into the page table.

  • Accessing the page table causes one extra memory access!

[Figure: the translation diagram again — the 18-bit virtual page no. selects a page table entry (Pr/V/D bits, physical page no.), which is combined with the 14-bit offset to form the 18-bit physical frame no. + 14-bit offset]



Page Replacement

  • Page replacement can use more sophisticated policies than a hardware cache can.

    • Least Recently Used

    • Second-chance algorithm

    • Recency vs. frequency

  • Write policies

    • Write-back

    • Write-allocate
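A minimal sketch of the second-chance (clock) policy in C, with a hypothetical per-frame reference-bit array (a real OS tracks much more state per frame):

    #include <stdbool.h>

    #define NFRAMES 64                 /* hypothetical number of frames  */

    static bool referenced[NFRAMES];   /* set on each access to a frame  */
    static int  hand = 0;              /* clock hand sweeping the frames */

    /* Return a victim frame: referenced frames get their bit cleared and
       are skipped once; the first frame whose bit is already clear loses. */
    int choose_victim(void) {
        for (;;) {
            if (!referenced[hand]) {
                int victim = hand;
                hand = (hand + 1) % NFRAMES;
                return victim;
            }
            referenced[hand] = false;  /* give the frame a second chance */
            hand = (hand + 1) % NFRAMES;
        }
    }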



Translation Look-Aside Buffer

  • Accessing the page table causes one extra memory access!

  • To reduce translation time, use a translation look-aside buffer (TLB), which caches recent address translations.

  • TLB organization is similar to cache organization (direct mapped, set-associative, or fully associative).

  • The TLB is small (4 to 512 entries).

  • TLBs are important for fast translation.


Translation using TLB

  • Assume a 128-entry, 4-way associative TLB (32 sets → 5-bit index, 13-bit tag from the 18-bit virtual page no.).

[Figure: the virtual page no. splits into a 13-bit tag and 5-bit index; the indexed TLB set’s tags are compared in parallel (Pr/V/D bits checked), and on a TLB hit the physical page no. is concatenated with the 14-bit offset to form the physical frame no. and offset]
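A C sketch of the lookup with the slide’s geometry (the entry layout is hypothetical; real hardware compares all four ways in parallel):

    #include <stdint.h>
    #include <stdbool.h>

    #define WAYS       4
    #define SETS       32              /* 128 entries / 4 ways */
    #define INDEX_BITS 5               /* log2(32)             */

    typedef struct { bool valid; uint32_t tag, ppn; } TLBEntry;

    static TLBEntry tlb[SETS][WAYS];

    /* Look up a virtual page number; fills *ppn and returns true on a hit */
    bool tlb_lookup(uint32_t vpn, uint32_t *ppn) {
        uint32_t index = vpn & (SETS - 1);   /* low 5 bits of the VPN */
        uint32_t tag   = vpn >> INDEX_BITS;  /* remaining 13 bits     */
        for (int way = 0; way < WAYS; way++) {
            TLBEntry *e = &tlb[index][way];
            if (e->valid && e->tag == tag) { *ppn = e->ppn; return true; }
        }
        return false;                        /* miss: walk the page table */
    }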


Q1: Caches and Address Translation

[Figure, physical addressed cache: the virtual address first goes through the MMU; the resulting physical address is used to access the cache and, on a cache miss, main memory]

[Figure, virtual addressed cache: the virtual address accesses the cache directly; only on a cache miss does the MMU translate it to a physical address for main memory]



Which is less preferable?

  • Physical addressed cache

    • Hit time is higher (cache access starts only after translation)

  • Virtual addressed cache

    • Data/instructions of different processes with the same virtual address can be in the cache at the same time; either

      • Flush the cache on context switch, or

      • Include a process id as part of each cache directory entry

    • Synonyms

      • Virtual addresses that translate to the same physical address

      • More than one copy in the cache can lead to a data consistency problem
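Synonyms are easy to create on a real OS: mapping the same file twice gives two virtual addresses for one physical page. A hedged POSIX sketch (the file name and sizes are illustrative):

    #include <stdio.h>
    #include <string.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>

    int main(void) {
        int fd = open("/tmp/syn_demo", O_RDWR | O_CREAT | O_TRUNC, 0600);
        if (fd < 0) return 1;
        ftruncate(fd, 4096);                 /* one page of backing store */

        /* Two mappings of the same physical page: synonyms */
        char *a = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        char *b = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (a == MAP_FAILED || b == MAP_FAILED) return 1;

        strcpy(a, "written via a");          /* write through one synonym */
        printf("read via b: %s\n", b);       /* visible through the other */
        printf("a=%p b=%p\n", (void *)a, (void *)b);
        return 0;
    }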


Another Possibility: Overlapped Operation

[Figure: the virtual address indexes into the cache directory while, in parallel, the MMU translates it to a physical address; the tag comparison then uses the physical address]

  • Virtual indexed, physical tagged cache



Addresses and Caches

  • `Physical Addressed Cache’

    Physical Indexed Physical Tagged

  • `Virtual Addressed Cache’

    Virtual Indexed Virtual Tagged

  • Overlapped cache indexing and translation

    Virtual Indexed Physical Tagged

  • Physical Indexed Virtual Tagged (?)
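When is the overlap safe? Indexing can use untranslated bits only if the index plus block-offset bits fit inside the page offset. A quick check with this deck’s parameters (16 KB pages; 64 KB direct-mapped cache, 32 B blocks):

    #include <stdio.h>

    int main(void) {
        unsigned page_offset_bits  = 14;  /* 16 KB pages                  */
        unsigned block_offset_bits = 5;   /* 32 B blocks                  */
        unsigned index_bits        = 11;  /* 64 KB direct mapped: 2K sets */

        unsigned used = block_offset_bits + index_bits;
        if (used <= page_offset_bits)
            printf("index lies within the page offset: no aliasing issue\n");
        else
            printf("%u index bit(s) are virtual: synonyms may land in "
                   "different sets unless handled\n", used - page_offset_bits);
        return 0;
    }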


Physical Indexed Physical Tagged Cache

  • 16 KB page size; 64 KB direct-mapped cache with 32 B block size.

[Figure: the virtual address (18-bit virtual page no. + 14-bit page offset) is first translated by the MMU; the resulting physical address (18-bit physical page no. + 14-bit page offset) is split into a 16-bit cache tag, 11-bit cache index, and 5-bit block offset to access the physical cache, where the stored tag is compared against the address tag]
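How a physical address decomposes for this cache, as a small C sketch (the example address is arbitrary):

    #include <stdint.h>
    #include <stdio.h>

    #define BLOCK_OFFSET_BITS 5     /* 32 B blocks            */
    #define INDEX_BITS        11    /* 64 KB / 32 B = 2K sets */

    int main(void) {
        uint32_t paddr  = 0x12345678u;   /* arbitrary 32-bit physical address */
        uint32_t offset = paddr & ((1u << BLOCK_OFFSET_BITS) - 1);
        uint32_t index  = (paddr >> BLOCK_OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
        uint32_t tag    = paddr >> (BLOCK_OFFSET_BITS + INDEX_BITS); /* 16 bits */
        printf("tag=0x%x index=%u offset=%u\n", tag, index, offset);
        return 0;
    }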


Virtual Index Virtual Tagged Cache

[Figure: the 11-bit cache index and 5-bit block offset come straight from the virtual address, and the 18-bit virtual page no. serves as the cache tag, giving hit/miss without translation; the MMU (18-bit physical page no. + 14-bit page offset) is consulted only on a miss]


Virtual Index Physical Tagged Cache

[Figure: the 11-bit cache index and 5-bit block offset are taken from the virtual address, so cache indexing starts immediately; in parallel the MMU produces the physical address, and its 16-bit cache tag is compared against the stored tag]



Multi-Level Caches

  • Small L1 cache: gives a low hit time, and hence a faster CPU cycle time.

  • Large L2 cache to reduce the L1 cache miss penalty.

  • L2 cache is typically set-associative to reduce the L2 cache miss ratio.

  • Typically, L1 is direct mapped with separate I- and D-caches; L2 is unified and set-associative.

  • L1 and L2 are on-chip; L3 is also moving on-chip.


Multi-Level Caches

[Figure: the CPU with its MMU feeds separate L1 I-Cache and L1 D-Cache (access time 2-4 ns), backed by a unified L2 cache (16-30 ns), backed by Memory (100 ns)]



Cache Performance

  • One-level cache:

    AMAT = Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1

  • Two-level caches:

    AMAT = Hit Time_L1 + Miss Rate_L1 × Miss Penalty_L1

    where Miss Penalty_L1 = Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2

    so AMAT = Hit Time_L1 + Miss Rate_L1 × (Hit Time_L2 + Miss Rate_L2 × Miss Penalty_L2)
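The two-level formula in a small C sketch; the hit times echo the access-time figure above, while the miss rates are made-up illustrative values:

    #include <stdio.h>

    int main(void) {
        double t_l1 = 2.0,  m_l1 = 0.05;  /* L1 hit time (ns), miss rate */
        double t_l2 = 20.0, m_l2 = 0.10;  /* L2 hit time (ns), miss rate */
        double t_mem = 100.0;             /* memory access time (ns)     */

        double miss_penalty_l1 = t_l2 + m_l2 * t_mem;        /* 30.0 ns */
        double amat            = t_l1 + m_l1 * miss_penalty_l1;
        printf("L1 miss penalty = %.1f ns, AMAT = %.2f ns\n",
               miss_penalty_l1, amat);                       /* 3.50 ns */
        return 0;
    }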



Putting it Together: Alpha 21264

  • 48-bit virtual addr. and 44-bit physical address.

  • 64 KB 2-way assoc. L1 I-Cache with 64-byte blocks (→ 512 sets)

  • L1 I-Cache is virtually indexed and tagged (address translation required only on a miss)

  • 8-bit ASID for each process (to avoid cache flush on context switch)



Alpha 21264

  • 8 KB page size (→ 13-bit page offset)

  • 128 entry fully associative TLB

  • 8MB Direct mapped unified L2 cache, 64B block size

  • Critical word (16B) first

  • Prefetch next 64 B into the instruction prefetcher
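A quick sanity check of the geometry these figures imply (derived arithmetic only):

    #include <stdio.h>

    int main(void) {
        unsigned l1_sets = (64 * 1024) / (2 * 64); /* 64 KB, 2-way, 64 B -> 512 */
        unsigned l2_sets = (8 * 1024 * 1024) / 64; /* 8 MB direct mapped, 64 B  */
        unsigned page_offset_bits = 13;            /* 8 KB pages                */
        printf("L1 I-cache sets: %u, L2 sets: %u, page offset bits: %u\n",
               l1_sets, l2_sets, page_offset_bits);
        return 0;
    }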



21264 Data Cache

  • L1 data cache uses a virtual address index, but a physical address tag.

  • Address translation proceeds in parallel with the cache access.

  • 64 KB 2-way assoc. L1 data cache with write back.



Q3: High Performance Pipelined Processors

  • Pipelining

    • Overlaps execution of consecutive instructions

    • Performance of processor improves

  • Current processors use more aggressive techniques for more performance

  • Some exploit Instruction Level Parallelism: often, many consecutive instructions are independent of each other and can be executed in parallel (at the same time)



Instruction Level Parallelism Processors

  • Challenge: identifying which instructions are independent

  • Approach 1: build processor hardware to analyze and keep track of dependences

    • Superscalar processors: Pentium 4, RS6000,…

  • Approach 2: compiler does analysis and packs suitable instructions together for parallel execution by processor

    • VLIW (very long instruction word) processors: Intel Itanium
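A source-level illustration of the idea (hardware and compilers work on compiled instructions, but the dependence structure is the same):

    /* The first three updates are mutually independent, so a superscalar
       or VLIW machine can execute them in the same cycle; the chain
       below them must run serially, one step per result.               */
    void ilp_demo(double *a, double *b, double *c, double *d, double x) {
        *a += x;                      /* independent          */
        *b += x;                      /* independent          */
        *c += x;                      /* independent          */

        double t = *a * x;            /* chain: needs *a      */
        t = t + *b;                   /* chain: needs t       */
        *d = t * x;                   /* chain: needs t again */
    }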


ILP Processors (contd.)

[Figure: instruction timing diagrams comparing a pipelined processor (one instruction enters the IF/ID/EX/MEM/WB stages per cycle), a superscalar processor (several instructions enter each stage per cycle), and a VLIW/EPIC processor (one long instruction word issues multiple EX operations together)]



Multicores

  • Multiple cores in a single die

  • Early efforts utilized multiple cores for multiple programs

    • Throughput oriented rather than speedup-oriented!

  • Can they be used by Parallel Programs?

