

  1. Lecture on High Performance Processor Architecture (CS05162)
  Review of Memory Hierarchy
  An Hong, han@ustc.edu.cn
  Fall 2009
  School of Computer Science and Technology, University of Science and Technology of China

  2. Quick review of everything you should have learned

  3. Who Cares About the Memory Hierarchy?
  [Figure: processor vs. DRAM performance, 1980-2000. Processor performance grows ~60%/yr (2X every 1.5 years, Moore's Law); DRAM improves ~9%/yr (2X every 10 years); the resulting processor-memory performance gap grows ~50% per year.]

  4. Levels of the Memory Hierarchy (upper level = faster, lower level = larger; capacity, access time, cost; staging/transfer unit; who manages it)
  • Registers: 100s bytes, <10s ns; staged as 1-8 byte instruction operands; managed by the program/compiler
  • Cache: K bytes, 10-100 ns, 1-0.1 cents/bit; staged as 8-128 byte blocks; managed by the cache controller
  • Main memory: M bytes, 200-500 ns, 0.0001-0.00001 cents/bit; staged as 512 B - 4 KB pages; managed by the OS
  • Disk: G bytes, 10 ms (10,000,000 ns), 10^-6 - 10^-5 cents/bit; staged as Mbyte files; managed by the user/operator
  • Tape: T bytes or infinite, sec-min, 10^-8 cents/bit

  5. The Principle of Locality
  • The Principle of Locality: programs access a relatively small portion of the address space at any instant of time.
  • Two different types of locality:
  • Temporal locality (locality in time): if an item is referenced, it will tend to be referenced again soon (e.g., loops, reuse)
  • Spatial locality (locality in space): if an item is referenced, items whose addresses are close by tend to be referenced soon (e.g., straight-line code, array access)
  • For the last 20 years, hardware has relied on locality for speed and cost. Locality is a property of programs that is exploited in machine design.
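A minimal C sketch of both kinds of locality (the function and sizes are illustrative, not from the slides):

    #include <stddef.h>

    #define N 1024

    /* Temporal locality: `sum` is re-referenced on every iteration.
       Spatial locality: a[i] walks consecutive addresses, so each cache
       block fetched on a miss supplies the next several elements. */
    long sum_array(const long a[N])
    {
        long sum = 0;                 /* reused every iteration: temporal */
        for (size_t i = 0; i < N; i++)
            sum += a[i];              /* sequential access: spatial */
        return sum;
    }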

  6. Memory Hierarchy: Terminology (blocks move between a lower-level memory and the upper-level memory that feeds the processor)
  • Hit: the information the CPU needs (e.g., block X) is found in the upper level
  • Hit rate: the fraction of CPU accesses that are found in the upper level
  • Hit time: time to access the upper level on a hit = RAM access time + time to determine hit/miss
  • Miss: the information the CPU needs (e.g., block Y) must be brought from the lower level into the upper level
  • Miss rate = 1 - (hit rate)
  • Miss penalty: time to find the block in the lower-level memory + time to transfer it to the upper level
  • Hit time << miss penalty

  7. Cache Measures
  • Hit rate: usually so high that we talk about the miss rate instead
  • Miss-rate fallacy: just as MIPS can mislead about CPU performance, miss rate can mislead about memory performance; the better measure is average memory access time
  • Miss penalty: time to replace a block from the lower level, including the time to deliver it to the CPU
  • access time: time to find the block in the lower-level memory = f(latency to access lower level)
  • transfer time: time to move the block to the upper level = f(bandwidth between upper and lower levels)
  • Average memory access time = Hit time + Miss rate x Miss penalty (ns or clocks)
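As a worked illustration of the formula, a one-line C helper plus an example (all numbers invented):

    /* Average memory access time: AMAT = hit_time + miss_rate * miss_penalty.
       Example: 1-cycle hit, 5% miss rate, 20-cycle penalty -> 2.0 cycles. */
    double amat(double hit_time, double miss_rate, double miss_penalty)
    {
        return hit_time + miss_rate * miss_penalty;
    }
    /* amat(1.0, 0.05, 20.0) == 1.0 + 0.05 * 20.0 == 2.0 cycles */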

  8. Four Questions for Memory Hierarchy Designers
  • Q1: Where can a block be placed in the upper level? (block placement)
  • Fully associative, set associative, direct mapped
  • Q2: How is a block found if it is in the upper level? (block identification)
  • Tag/block
  • Q3: Which block should be replaced on a miss? (block replacement)
  • Random, LRU
  • Q4: What happens on a write? (write strategy)
  • Write through (with a write buffer) or write back

  9. Q1: Where can a block be placed?
  • Example: block 12 placed in an 8-block cache under fully associative, direct mapped, and 2-way set associative placement
  • Set-associative mapping: set = block number modulo number of sets
  • Fully associative: block 12 can go anywhere (blocks 0-7)
  • Direct mapped: block 12 can go only into block 4 (12 mod 8)
  • 2-way set associative: block 12 can go anywhere in set 0 (12 mod 4); see the sketch below
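The placement rules reduce to a modulo; a small C sketch reproducing the slide's block-12 example:

    #include <stdio.h>

    /* Set-associative placement: set = block number mod number of sets.
       Direct mapped is the 1-way case; fully associative has one set. */
    int main(void)
    {
        unsigned block = 12, blocks_in_cache = 8;

        printf("direct mapped:   block %u -> block %u\n",
               block, block % blocks_in_cache);        /* 12 mod 8 = 4 */
        printf("2-way set assoc: block %u -> set %u\n",
               block, block % (blocks_in_cache / 2));  /* 12 mod 4 = 0 */
        printf("fully assoc:     block %u -> any of the %u blocks\n",
               block, blocks_in_cache);
        return 0;
    }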

  10. Q2: How is a block found?
  • The block-frame address splits into a tag, an index, and a block offset. The index selects the set, the tag comparison decides whether the access hits, and the block offset selects the data within the block.
  • Every cache block carries an address tag giving its block address. The tag of every cache block that might contain the desired information is checked against the block address coming from the CPU.
  • Increasing associativity decreases the number of sets and increases the number of blocks per set: the index field shrinks and the tag field grows.

  11. 1 KB Direct Mapped Cache, 32 B blocks
  • For a 2^N-byte cache with 2^M-byte blocks:
  • The uppermost (32 - N) bits are always the cache tag
  • The lowest M bits are the byte select (block size = 2^M)
  • The bits in between are the cache index
  • Here N = 10, M = 5, so the address splits at bits 31-10 (cache tag, example: 0x50), bits 9-5 (cache index, example: 0x01), and bits 4-0 (byte select, example: 0x00)
  • Each of the 32 cache entries holds a valid bit, a cache tag, and a 32-byte block of cache data (bytes 0-31 in entry 0, bytes 32-63 in entry 1, ..., up to byte 1023 in entry 31); the example access (tag 0x50, index 0x01) selects entry 1
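The field extraction can be written directly from the bit positions above; a C sketch (function names are my own):

    #include <stdint.h>

    /* 1 KB direct-mapped cache with 32-byte blocks:
       bits [4:0]   byte select (block size 2^5),
       bits [9:5]   cache index (32 blocks),
       bits [31:10] cache tag. */
    #define OFFSET_BITS 5
    #define INDEX_BITS  5

    static inline uint32_t byte_select(uint32_t addr) { return addr & 0x1F; }
    static inline uint32_t cache_index(uint32_t addr) { return (addr >> OFFSET_BITS) & 0x1F; }
    static inline uint32_t cache_tag(uint32_t addr)   { return addr >> (OFFSET_BITS + INDEX_BITS); }

    /* e.g. addr 0x00014020 -> tag 0x50, index 0x01, byte select 0x00 */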

  12. Two-way Set Associative Cache
  • N-way set associative: N entries for each cache index
  • N direct mapped caches operate in parallel (N is typically 2 to 4)
  • Example: two-way set associative cache
  • The cache index selects a set from the cache
  • The two tags in the set are compared against the address tag in parallel
  • Data is selected by a mux driven by the tag-compare results (Sel1/Sel0); the OR of the two compares signals a hit
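A minimal C sketch of the two-way lookup (the structure names and 16-set geometry are illustrative):

    #include <stdint.h>
    #include <stdbool.h>

    #define SETS 16   /* illustrative geometry */

    struct line { bool valid; uint32_t tag; uint8_t data[32]; };
    static struct line way0[SETS], way1[SETS];

    /* Hardware probes both ways of the selected set in parallel;
       in C we simply check both tags and "mux" the matching block. */
    bool lookup(uint32_t tag, uint32_t index, uint8_t **block)
    {
        if (way0[index].valid && way0[index].tag == tag) {
            *block = way0[index].data;   /* Sel0 */
            return true;
        }
        if (way1[index].valid && way1[index].tag == tag) {
            *block = way1[index].data;   /* Sel1 */
            return true;
        }
        return false;                    /* miss */
    }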

  13. N-way Set Associative Cache vs. Direct Mapped Cache
  • Higher associativity gives better utilization of the cache space, fewer block conflicts, and a lower miss rate: fully associative has the lowest miss rate, direct mapped the highest.
  • Higher associativity also increases implementation complexity and access latency.
  • Most processors use direct mapped, two-way set associative, or four-way set associative caches.

  14. Q3: Which block should be replaced on a miss?
  • Easy for direct mapped
  • Set associative or fully associative:
  • Random
  • LRU (Least Recently Used)

  Miss rates by cache size, associativity, and replacement policy:
              2-way            4-way            8-way
    Size      LRU    Random    LRU    Random    LRU    Random
    16 KB     5.2%   5.7%      4.7%   5.3%      4.4%   5.0%
    64 KB     1.9%   2.0%      1.5%   1.7%      1.4%   1.5%
    256 KB    1.15%  1.17%     1.13%  1.13%     1.12%  1.12%
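For a 2-way set, true LRU needs only one bit per set; a hypothetical C sketch:

    #include <stdbool.h>

    /* 2-way LRU: one bit per set records which way was used last.
       On a hit or fill, mark that way most recent; on a miss, evict
       the other way. Set count (16) is illustrative. */
    static bool way1_is_lru[16];

    int  victim(int set)          { return way1_is_lru[set] ? 1 : 0; }
    void touch(int set, int way)  { way1_is_lru[set] = (way == 0); }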

  15. Q4: What happens on a write?
  • Write through (WT): the information is written both to the block in the cache and to the block in the lower-level memory.
  • WT is always combined with write buffers so the processor doesn't wait for the lower-level memory
  • Write back (WB): the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced.
  • Needs a dirty bit: is the block clean or dirty?

  16. Write Buffer for Write Through
  • A write buffer is needed between the cache and memory (processor -> cache and write buffer; write buffer -> DRAM)
  • Processor: writes data into the cache and the write buffer
  • Memory controller: writes the contents of the buffer to memory
  • The write buffer is just a FIFO:
  • Typical number of entries: 4
  • Works fine if: store frequency (w.r.t. time) << 1 / DRAM write cycle
  • Memory system designer's nightmare:
  • Store frequency (w.r.t. time) approaches 1 / DRAM write cycle
  • Write buffer saturation
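A minimal C sketch of such a 4-entry FIFO write buffer (names are my own; a real buffer would also merge writes and be snooped on reads):

    #include <stdint.h>
    #include <stdbool.h>

    /* The processor enqueues stores and keeps going; the memory
       controller drains entries at DRAM speed. If stores arrive faster
       than DRAM retires them, enqueue fails and the CPU must stall
       (write buffer saturation). */
    #define WB_ENTRIES 4

    struct wb_entry { uint32_t addr, data; };
    static struct wb_entry buf[WB_ENTRIES];
    static int head, tail, count;

    bool wb_enqueue(uint32_t addr, uint32_t data)   /* processor side */
    {
        if (count == WB_ENTRIES) return false;      /* full: stall */
        buf[tail] = (struct wb_entry){ addr, data };
        tail = (tail + 1) % WB_ENTRIES;
        count++;
        return true;
    }

    bool wb_dequeue(struct wb_entry *e)             /* memory-controller side */
    {
        if (count == 0) return false;
        *e = buf[head];
        head = (head + 1) % WB_ENTRIES;
        count--;
        return true;
    }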

  17. Write-miss Policy: Write Allocate versus Not Allocate
  • Write allocate: on a write miss, the block is first read from main memory into the cache, and the write then proceeds as a write hit.
  • Usually paired with write back
  • Write not allocate (write around): on a write miss, only the block in the lower-level memory is modified; the block is not brought into the cache.
  • Usually paired with write through

  18. Cache history
  • Caches were introduced (commercially) more than 30 years ago in the IBM 360/85
  • There was already a processor-memory gap
  • Oblivious to the ISA
  • Caches were organization, not architecture
  • Many different organizations
  • direct-mapped, set-associative, skewed-associative, sector, decoupled sector, etc.
  • Caches are ubiquitous
  • On-chip, off-chip
  • But also disk caches, web caches, trace caches, etc.
  • Multilevel cache hierarchies
  • With inclusion or exclusion
  • 4+ levels

  19. Cache history
  • Cache exposed to the ISA
  • Prefetch, Fence, Purge etc.
  • Cache exposed to the compiler
  • Code and data placement
  • Cache exposed to the O.S.
  • Page coloring
  • Many different write policies
  • copy-back, write-through, fetch-on-write, write-around, write-allocate etc.

  20. Cache history
  • Numerous cache assists, for example:
  • For storage: write buffers, victim caches, temporal/spatial caches
  • For overlap: lockup-free (non-blocking) caches
  • For latency reduction: prefetch
  • For better cache utilization: bypass mechanisms, dynamic line sizes
  • etc.

  21. Caches and Parallelism
  • Cache coherence
  • Directory schemes
  • Snoopy protocols
  • Synchronization
  • Test-and-test-and-set
  • Load-linked / store-conditional
  • Models of memory consistency
  • TCC, LogTM

  22. When were the 2K papers being written?
  • A few facts:
  • 1980 textbook: < 10 pages on caches (2%)
  • 1996 textbook: > 120 pages on caches (20%)
  • Smith survey (1982)
  • About 40 references on caches
  • Uhlig and Mudge survey on trace-driven simulation (1997)
  • About 25 references specific to cache performance only
  • Many more on tools for performance etc.

  23. Cache research vs. time
  [Figure: number of cache papers per year; the largest count (14) coincides with the first conference session devoted to caches.]

  24. Present Latency Solutions and Limitations

  25. The Memory Bandwidth Problem
  • It's expensive!
  • Often ignored
  • Processor-centric optimizations bridge the latency gap but lead to memory-bandwidth problems:
  • Prefetching
  • Speculation
  • Multithreading
  • All of these hide latency. Can we always just trade bandwidth for latency?

  26. Present Bandwidth Solutions
  • Wider/faster connections to memory
  • Rambus DRAM
  • Use higher signaling rates on existing pins
  • Use more pins for the memory interface
  • Larger on-chip caches
  • Fewer requests to DRAM
  • Only effective if larger caches improve hit rate
  • Traffic-efficient requests
  • Only request what you need
  • Caches are "guessing" that you might need adjacent data
  • Compression?

  27. Present Bandwidth Solutions
  • More efficient on-chip caches
  • Only 1/20 - 1/3 of the data in a cache is live
  • Again, caches are "guessing" what will be used again
  • Spatial vs. temporal vs. no locality
  • Logic/DRAM integration
  • Put the memory on the processor
  • On-chip bandwidth is cheaper than pin bandwidth
  • You will still probably have external DRAM as well
  • Memory-centric architectures
  • "Smart" memory (PIM)
  • Put processing elements wherever there is memory

  28. Metrics for Memory-System Performance
  Review: the CPU performance equation when memory-stall cycles are ignored:
    CPU time = (clock cycles to execute the program) x (clock cycle time) = IC x CPI x CCT
    where IC = instruction count, CPI = clocks per instruction, CCT = clock cycle time.
  (1) Average memory access time (A.M.A.T.):
    A.M.A.T. = (Hit rate x Hit time) + (Miss rate x Miss time)
             = (Hit rate x Hit time) + (1 - Hit rate) x (Hit time + Miss penalty)
             = Hit time + (Miss rate x Miss penalty)
  (2) The CPU performance equation for a CPU with a memory hierarchy: see the next slide.

  29. Metrics for Memory-System Performance
  First expanded form:
    CPU time = (CPU execution clock cycles + Memory stall clock cycles) x CCT
    Memory stall clock cycles
      = Reads x Read miss rate x Read miss penalty + Writes x Write miss rate x Write miss penalty
      = Memory accesses x Miss rate x Miss penalty
  Second expanded form:
    CPU time = IC x (CPI_execution + Memory accesses per instruction x Miss rate x Miss penalty) x CCT
             = IC x (CPI_execution + Misses per instruction x Miss penalty) x CCT
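A worked example of the second expanded form as a C helper (all inputs below are invented for illustration):

    /* CPU time = IC * (CPI_exec + accesses/instr * miss rate * miss penalty) * CCT */
    double cpu_time(double ic, double cpi_exec, double accesses_per_instr,
                    double miss_rate, double miss_penalty, double cct)
    {
        double mem_stall_cpi = accesses_per_instr * miss_rate * miss_penalty;
        return ic * (cpi_exec + mem_stall_cpi) * cct;
    }
    /* e.g. IC = 1e9, CPI_exec = 1.0, 1.3 accesses/instr, 2% miss rate,
       50-cycle penalty, 1 ns cycle:
       1e9 * (1.0 + 1.3 * 0.02 * 50) * 1e-9 = 2.3 s -- memory stalls more
       than double the execution time. */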

  30. Improving Cache Performance: 3 general options
  Average memory access time = Hit time + (Miss rate x Miss penalty) = (Hit rate x Hit time) + (Miss rate x Miss time)
  1. Reduce the miss rate,
  2. Reduce the miss penalty, or
  3. Reduce the time to hit in the cache.

  31. A Modern Memory Hierarchy
  • From the processor (control + datapath) outward: registers (~1 ns, 100s of bytes), on-chip cache and second-level cache (SRAM, ~10s of ns, KBs), main memory (DRAM, ~100s of ns, MBs), secondary storage (disk, ~10,000,000 ns = 10s of ms, GBs), tertiary storage (disk/tape, ~10s of sec, TBs).
  • By taking advantage of the principle of locality:
  • Present the user with as much memory as is available in the cheapest technology.
  • Provide access at the speed offered by the fastest technology.

  32. Recall: Levels of the Memory Hierarchy (upper level = faster, lower level = larger)
  • Registers: 100s bytes, <10s ns; staged as 1-8 byte instruction operands; managed by the program/compiler
  • Cache: K bytes, 10-100 ns, $0.01-0.001/bit; staged as 8-128 byte blocks; managed by the cache controller
  • Main memory: M bytes, 100 ns - 1 us, $0.01-0.001; staged as 512 B - 4 KB pages; managed by the OS
  • Disk: G bytes, ms, 10^-4 - 10^-3 cents; staged as Mbyte files; managed by the user/operator
  • Tape: infinite, sec-min, 10^-6 cents

  33. How is the hierarchy managed?
  • Registers <-> Memory
  • by compiler (programmer?)
  • Cache <-> Memory
  • by the hardware
  • Memory <-> Disks
  • by the hardware and operating system (virtual memory)
  • by the programmer (files)

  34. What is virtual memory?
  • Virtual memory: treat main memory as a cache for disk.
  • Typical page size: 1K - 8K bytes.
  • A page table maps the "virtual pages" of the program's virtual address space to "physical pages" (frames) of main memory (the physical address space).
  • Virtual address format: virtual page number | page offset. Physical address format: physical page number | page offset.
  • A page-table base register locates the page table in memory; the virtual page number indexes it, and each entry holds a valid bit (V), access rights, and the physical address (PA).
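A minimal C sketch of one-level translation (the 4 KB page size, table size, and PTE layout are assumptions for illustration, not from the slide):

    #include <stdint.h>
    #include <stdbool.h>

    #define PAGE_SHIFT 12                    /* assumed 4 KB pages */
    #define NUM_PAGES  1024                  /* assumed table size */

    struct pte { bool valid; uint32_t frame; /* access rights omitted */ };
    static struct pte page_table[NUM_PAGES];

    bool translate(uint32_t va, uint32_t *pa)
    {
        uint32_t vpn    = va >> PAGE_SHIFT;
        uint32_t offset = va & ((1u << PAGE_SHIFT) - 1);

        if (vpn >= NUM_PAGES || !page_table[vpn].valid)
            return false;                    /* page fault: trap to OS */
        *pa = (page_table[vpn].frame << PAGE_SHIFT) | offset;
        return true;
    }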

  35. Virtual memory vs. cache
  • Unit of access
  • A "block" in a cache; a "page" in virtual memory
  • Replacement on a miss
  • Handled by hardware for a cache; by the OS for virtual memory
  • Capacity
  • The virtual-memory size is determined by the processor's address width; cache capacity is independent of the address width
  • The next level of storage
  • Main memory is the next level below the cache; secondary storage is not only the next level below main memory but also holds the file system

  36. Address mapping (one map per process!)
  • The processor issues a virtual address a in the name space V = {0, 1, ..., n-1} (the virtual address space); main memory holds the physical address space M = {0, 1, ..., m-1}, with n > m; secondary storage backs the rest.
  • The address-translation mechanism applies the mapping function: MAP(a) = a' when the data at virtual address a is present in M at physical address a'; MAP(a) = 0 when the data at virtual address a is not in M.
  • So the mapping is V -> M U {0}; a reference to a missing item raises a page fault, and the fault handling and address mapping are done by the OS.

  37. Benefits of virtual memory
  • Translation: simplifies program loading
  • Provides a large, uniform programming space on top of a small physical address space
  • Each of many threads can run with only some of its memory blocks allocated
  • Only the most important parts of a program (its "working set") must reside in memory
  • Protection: managed automatically; the programmer does no storage management
  • Specific protections can be attached to individual pages (e.g., read-only, invisible to user programs)
  • Keeps threads (or processes) from interfering with one another
  • Keeps user programs from interfering with the kernel
  • Guards against viruses and malicious programs
  • Sharing: multiple processes can share main memory
  • The same physical page can be mapped to several users ("shared memory")

  38. Design questions for a virtual memory system
  • Where can a new block be placed in main memory?
  • Page placement policy: a page may be placed anywhere in main memory (fully associative)
  • If a page is in main memory, how is it found?
  • Page identification: TLB / page table or segment table
  • Which page is replaced on a page fault?
  • Replacement policy: reference bits / LRU algorithm
  • What write policy is used?
  • Write policy: write back

  39. Paging: address mapping
  • The unit of mapping, and also the unit of transfer between virtual and physical memory, is a 1K page: virtual page 0 starts at 0, page 1 at 1024, ..., page 31 at 31744, and the physical frames of main memory are laid out the same way.
  • A virtual address splits into a virtual page number and a 10-bit displacement.
  • The page-table base register plus the virtual page number index into the page table, which is located in physical memory; the entry supplies V (valid), access rights, and the physical frame. The physical memory address is then formed from the frame and the displacement (actually, concatenation is more likely than addition).

  40. Large address spaces: two-level page tables
  • 32-bit address split 10 | 10 | 12: P1 index, P2 index, page offset
  • 4 GB virtual address space with 4 KB pages and 4-byte PTEs
  • First-level page table: 4 KB of PTEs (1K entries, each pointing to a second-level table)
  • Second-level page tables: 4 MB of PTEs in total
  • What about a 48-64 bit address space?
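A C sketch of the 10/10/12 walk (the PTE encoding, with 0 meaning "not present", is an assumption for illustration):

    #include <stdint.h>

    /* P1 indexes the first-level table, P2 the second-level table,
       and the low 12 bits are the page offset. */
    static uint32_t *level1[1024];        /* each entry points to a 1K-PTE table */

    uint32_t translate2(uint32_t va)
    {
        uint32_t p1     = (va >> 22) & 0x3FF;
        uint32_t p2     = (va >> 12) & 0x3FF;
        uint32_t offset =  va        & 0xFFF;

        uint32_t *l2 = level1[p1];
        if (l2 == 0 || l2[p2] == 0)
            return 0;                     /* not mapped: page fault */
        return (l2[p2] << 12) | offset;   /* frame number || offset */
    }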

  41. Virtual Address and a Cache: Step backward???
  • The CPU's virtual address must be translated to a physical address before the cache and main memory can be probed, so virtual memory seems to be really slow:
  • We have to access the page table in memory on every access, even on cache hits!
  • Worse, if the translation is not completely in memory, we may need to go to disk before hitting in the cache!
  • Solution: cache the page table!
  • Keep track of the most common translations and place them in a "Translation Lookaside Buffer" (TLB)

  42. Making address translation practical: the TLB
  • A Translation Lookaside Buffer (TLB) is a cache of recent translations: the virtual page number is looked up and replaced by a physical frame number, while the page offset passes through unchanged.
  • The page table lives in memory; the TLB is on-chip.

  43. TLB organization: include protection
  • The TLB is usually organized as a fully associative cache
  • Lookup is by virtual address; it returns the physical address plus other info
  • Dirty => page modified (Y/N)? Ref => page touched (Y/N)? Valid => TLB entry valid (Y/N)? Access => read? write? ASID => which user?

    Virtual Address   Physical Address   Dirty   Ref   Valid   Access   ASID
    0xFA00            0x0003             Y       N     Y       R/W      34
    0x0040            0x0010             N       Y     Y       R        0
    0x0041            0x0011             N       Y     Y       R        0
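A C sketch of a fully associative lookup using the fields from the table (the entry count is illustrative, and hardware compares all entries at once rather than looping):

    #include <stdint.h>
    #include <stdbool.h>

    #define TLB_ENTRIES 64                   /* illustrative */

    struct tlb_entry {
        bool     valid, dirty, ref;
        uint8_t  access;                     /* read/write permission bits */
        uint16_t asid;                       /* which address space (user) */
        uint32_t vpn, pfn;
    };
    static struct tlb_entry tlb[TLB_ENTRIES];

    bool tlb_lookup(uint32_t vpn, uint16_t asid, uint32_t *pfn)
    {
        for (int i = 0; i < TLB_ENTRIES; i++)    /* parallel CAM search in HW */
            if (tlb[i].valid && tlb[i].vpn == vpn && tlb[i].asid == asid) {
                tlb[i].ref = true;               /* page touched */
                *pfn = tlb[i].pfn;
                return true;
            }
        return false;                            /* TLB miss: walk page table */
    }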

  44. Handling a page fault
  • Page fault: the page is not in main memory
  • Detection: done by hardware
  • Handling: trap to the OS, which completes the fault handling:
  • Choose a page to evict (possibly writing it back to secondary storage)
  • Load the faulting page from secondary storage
  • Schedule some other program onto the processor meanwhile
  • Later (when the page has come back from disk):
  • Update the page table
  • Resume the program!
  • What is in the page fault handler?
  • See an OS textbook
  • What can HW do to help it do a good job?

  45. Reducing translation time further
  • As described, TLB lookup is in series with cache lookup: the virtual page number goes through the TLB (checking the valid bit and access rights) to yield the physical page number, which is then combined with the offset to probe the cache.
  • Machines with TLBs go one step further: they overlap TLB lookup with cache access.
  • This works because the lower bits of the result (the page offset) are available early.
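The overlap is legal only when every cache-index and block-offset bit falls inside the untranslated page offset; a small C check (sizes illustrative):

    #include <stdbool.h>

    /* TLB and cache can be probed in parallel only when all cache index +
       block offset bits come from the page offset, i.e.
       cache_size / associativity <= page_size. */
    bool can_overlap(unsigned cache_size, unsigned assoc, unsigned page_size)
    {
        return cache_size / assoc <= page_size;
    }
    /* e.g. an 8 KB direct-mapped cache with 4 KB pages cannot overlap
       (8192 > 4096), but an 8 KB 2-way cache can (4096 <= 4096). */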

  46. Summary #1/5: Control and Pipelining
  • Control via state machines and microprogramming
  • Pipelining just overlaps tasks; easy if tasks are independent
  • Speedup <= pipeline depth; if the ideal CPI is 1, speedup approaches the pipeline depth
  • Hazards limit performance on computers:
  • Structural: need more HW resources
  • Data (RAW, WAR, WAW): need forwarding, compiler scheduling
  • Control: delayed branch, prediction

  47. Summary #2/5: Caches
  • The Principle of Locality:
  • Programs access a relatively small portion of the address space at any instant of time.
  • Temporal locality: locality in time
  • Spatial locality: locality in space
  • Three major categories of cache misses:
  • Compulsory misses: sad facts of life. Example: cold-start misses.
  • Capacity misses: increase cache size
  • Conflict misses: increase cache size and/or associativity. Nightmare scenario: the ping-pong effect!
  • Write policy:
  • Write through: needs a write buffer. Nightmare: write-buffer saturation
  • Write back: control can be complex

  48. Summary #3/5: The Cache Design Space
  • Several interacting dimensions
  • cache size
  • block size
  • associativity
  • replacement policy
  • write-through vs write-back
  • write allocation
  • The optimal choice is a compromise
  • depends on access characteristics
  • workload
  • use (I-cache, D-cache, TLB)
  • depends on technology / cost
  • Simplicity often wins
  [Figure: qualitative design-space curves; moving each factor (cache size, associativity, block size) from less to more helps up to a point and then hurts.]

  49. Summary #4/5: TLB, Virtual Memory
  • Caches, TLBs, and virtual memory are all understood by examining how they deal with four questions:
  • 1) Where can a block be placed?
  • 2) How is a block found?
  • 3) What block is replaced on a miss?
  • 4) How are writes handled?
  • Page tables map virtual addresses to physical addresses
  • TLBs are important for fast translation
  • TLB misses are significant in processor performance
  • funny times, as most systems can't access all of the 2nd-level cache without TLB misses!

  50. Summary #5/5: Memory Hierarchy
  • Virtual memory was controversial at the time: can SW automatically manage 64 KB across many programs?
  • 1000X DRAM growth removed the controversy
  • Today VM allows many processes to share a single memory without having to swap all processes to disk; today VM protection is more important than the memory hierarchy
  • Today CPU time is a function of (ops, cache misses) vs. just f(ops): what does this mean for compilers, data structures, algorithms?
