

  1. CMPUT429/CMPE382 Winter 2001 Topic 6: Main Memory and Virtual Memory (Adapted from David A. Patterson’s CS252, Spring 2001 Lecture Slides)

  2. Main Memory Background • Performance of Main Memory: • Latency: Cache Miss Penalty • Access Time: time between request and word arrives • Cycle Time: time between requests • Bandwidth: I/O & Large Block Miss Penalty (L2) • Main Memory is DRAM: Dynamic Random Access Memory • Dynamic since it needs to be refreshed periodically (8 ms, 1% of time) • Addresses divided into 2 halves (Memory as a 2D matrix): • RAS or Row Access Strobe • CAS or Column Access Strobe • Cache uses SRAM: Static Random Access Memory • No refresh (6 transistors/bit vs. 1 transistor/bit) • Size: DRAM/SRAM ≈ 4-8 • Cost/Cycle time: SRAM/DRAM ≈ 8-16

  3. Fast Memory Systems: DRAM specific • Multiple CAS accesses: several names (page mode) • Extended Data Out (EDO): 30% faster in page mode • New DRAMs to address gap; what will they cost, will they survive? • RAMBUS: startup company; reinvent DRAM interface • Each Chip a module vs. slice of memory • Short bus between CPU and chips • Does own refresh • Variable amount of data returned • 1 byte / 2 ns (500 MB/s per chip) • 20% increase in DRAM area • Synchronous DRAM: 2 banks on chip, a clock signal to DRAM, transfer synchronous to system clock (66 - 150 MHz) • Intel claims RAMBUS Direct (16 b wide) is future PC memory? • Possibly not true! Intel to drop RAMBUS? • Niche memory or main memory? • e.g., Video RAM for frame buffers, DRAM + fast serial output

  4. Main Memory Organizations • Simple: • CPU, Cache, Bus, Memory same width (32 or 64 bits) • Wide: • CPU/Mux 1 word; Mux/Cache, Bus, Memory N words (Alpha: 64 bits & 256 bits; UltraSPARC 512) • Interleaved: • CPU, Cache, Bus 1 word: Memory N Modules (4 Modules); example is word interleaved

  5. Main Memory Performance • Timing model (word size is 32 bits) • 1 cycle to send address • 6 cycles access time • 1 cycle to send data • Cache Block is 4 words • Simple M.P. = 4 x (1+6+1) = 32 • Wide M.P. = 1 + 6 + 1 = 8 • Interleaved M.P. = 1 + 6 + 4x1 = 11
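The three miss penalties above are simple arithmetic on the slide's timing model. The following is a minimal sketch (not from the slides) that just reproduces that arithmetic; the cycle counts (1 to send the address, 6 to access, 1 to transfer a word) and the 4-word block are the slide's assumptions.

  #include <stdio.h>

  /* Miss-penalty arithmetic for the three memory organizations:
   * 1 cycle to send the address, 6 cycles of access time, 1 cycle to
   * transfer a word, with a 4-word cache block. */
  int main(void) {
      const int addr = 1, access = 6, xfer = 1, block_words = 4;

      int simple      = block_words * (addr + access + xfer);  /* one word at a time  */
      int wide        = addr + access + xfer;                  /* whole block at once */
      int interleaved = addr + access + block_words * xfer;    /* overlapped accesses */

      printf("simple      = %d cycles\n", simple);      /* 32 */
      printf("wide        = %d cycles\n", wide);         /* 8  */
      printf("interleaved = %d cycles\n", interleaved);  /* 11 */
      return 0;
  }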

  6. Independent Memory Banks • Memory banks for independent accesses vs. faster sequential accesses • Multiprocessor • I/O • CPU with Hit under n Misses, Non-blocking Cache • Superbank: all memory active on one block transfer (or Bank) • Bank: portion within a superbank that is word interleaved (or Subbank) • (Address fields, high to low: Superbank Number | Superbank Offset, where the Superbank Offset splits into Bank Number | Bank Offset)

  7. Independent Memory Banks • How many banks? number of banks ≥ number of clock cycles to access a word in a bank • For sequential accesses, otherwise will return to original bank before it has next word ready • Increasing DRAM capacity => fewer chips => harder to have enough banks

  8. Avoiding Bank Conflicts • Lots of banks

  int x[256][512];
  for (j = 0; j < 512; j = j+1)
    for (i = 0; i < 256; i = i+1)
      x[i][j] = 2 * x[i][j];

  • Even with 128 banks, since 512 is a multiple of 128, conflict on word accesses • Software: loop interchange or declaring array not power of 2 (“array padding”), as sketched below • Hardware: use a prime number of banks • bank number = address mod number of banks • address within bank = address / number of words in bank
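The two software fixes mentioned above look roughly like the following sketch (illustrative only; the array dimensions come from the slide, while the choice of 513 columns for padding is an example of "not a power of 2").

  /* Fix 1: loop interchange -- walk each row sequentially so consecutive
   * accesses fall in consecutive banks instead of hammering one bank. */
  void scale_interchanged(int x[256][512]) {
      for (int i = 0; i < 256; i = i + 1)
          for (int j = 0; j < 512; j = j + 1)
              x[i][j] = 2 * x[i][j];
  }

  /* Fix 2: array padding -- declare 513 columns (not a power of 2) so the
   * column stride is no longer a multiple of the number of banks. */
  int y[256][513];
  void scale_padded(void) {
      for (int j = 0; j < 512; j = j + 1)       /* padding column left unused */
          for (int i = 0; i < 256; i = i + 1)
              y[i][j] = 2 * y[i][j];
  }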

  9. Finding Bank Number and Address within a bank • Problem: We want to determine the number of banks, Nb, to use, and the number of words to store in each bank, Wb, such that: • given a word address x, it is easy to find the bank where x will be found, B(x), and the address of x within the bank, A(x) • for any address x, B(x) and A(x) are unique • the number of bank conflicts is minimized

  10. Finding Bank Number and Address within a bank Solution: We will use the following relation to determine the bank number for x, B(x), and the address of x within the bank, A(x): B(x) = x MOD Nb A(x) = x MOD Wb and we will choose Nb and Wb to be co-prime, i.e., there is no prime number that is a factor of both Nb and Wb (this condition is satisfied if we choose Nb to be a prime number that is equal to an integer power of two minus 1). We can then use the Chinese Remainder Theorem (see page 436, and exercise 5.10) to show that the pair (B(x), A(x)) is always unique.
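This mapping is easy to check exhaustively. The sketch below (not from the slides) uses the Nb = 3, Wb = 8 configuration of the next slide and verifies that every word address gets a unique (bank, address-within-bank) pair, as the Chinese Remainder Theorem guarantees.

  #include <stdio.h>

  /* Modulo interleaving: B(x) = x mod Nb, A(x) = x mod Wb, with Nb = 3 banks
   * and Wb = 8 words per bank (co-prime), as in the example that follows. */
  #define NB 3
  #define WB 8

  int bank(int x)         { return x % NB; }
  int addr_in_bank(int x) { return x % WB; }

  int main(void) {
      int seen[NB][WB] = {{0}};
      for (int x = 0; x < NB * WB; x++) {
          int b = bank(x), a = addr_in_bank(x);
          if (seen[b][a]) { printf("collision at %d\n", x); return 1; }
          seen[b][a] = 1;
          printf("x=%2d -> bank %d, addr %d\n", x, b, a);
      }
      printf("all %d mappings unique\n", NB * WB);
      return 0;
  }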

  11. Fast Bank Number Example: Values for B(x) and A(x) for a system with 3 banks, Nb = 3, and 8 words per bank, Wb = 8. Comparison between Sequential Interleaving and Modulo Interleaving.

                          Seq. Interleaved      Modulo Interleaved
    Bank Number:           0    1    2           0    1    2
    Address within Bank:
      0                    0    1    2           0   16    8
      1                    3    4    5           9    1   17
      2                    6    7    8          18   10    2
      3                    9   10   11           3   19   11
      4                   12   13   14          12    4   20
      5                   15   16   17          21   13    5
      6                   18   19   20           6   22   14
      7                   21   22   23          15    7   23

  12. DRAMs per PC over Time • Number of DRAM chips needed for the minimum PC memory size, by DRAM generation:

    Minimum              DRAM Generation
    Memory Size   ‘86    ‘89    ‘92    ‘96    ‘99    ‘02
                  1 Mb   4 Mb   16 Mb  64 Mb  256 Mb 1 Gb
      4 MB         32      8
      8 MB                16      4
     16 MB                 8      2
     32 MB                        4      1
     64 MB                        8      2
    128 MB                               4      1
    256 MB                               8      2

  13. Main Memory Summary • Wider Memory • Interleaved Memory: for sequential or independent accesses • Avoiding bank conflicts: SW & HW • DRAM specific optimizations: page mode & Specialty DRAM • Need Error correction

  14. Virtual Memory The memory is divided and portions are assigned to different processes. Each process has the “illusion” of accessing the entire memory space. When a portion is not available for a processor, a “magic hand” goes to the disk, finds the missing data, and places it in the memory (like the cache line replacement mechanism). Problem: Multiple processes access the same memory address. A processor accesses a memory address with no RAM memory mapped to it. Solution: “Virtualize” the address space used by the processes (and by the processor). Complication: Must translate the address used in every memory access. Must protect the data that belongs to one process from access by another process.

  15. Virtual to Physical Mapping (Figure 5.36: virtual pages A, B, C, and D occupy a contiguous virtual address space at 0, 4K, 8K, and 12K; they are mapped to non-contiguous page frames in the physical main memory, with page D residing on disk.)

  16. Virtual Memory (VM) Terminology • VM Page or Segment ↔ Cache block or line • Page Fault or Address Fault ↔ Cache miss • Memory mapping or address translation is the process of converting the virtual address produced by the processor into the physical address used to access the main memory. • In VM, the replacement policy is controlled by software. • The unit of exchange for VM can have fixed size (pages) or variable size (segments). Some new machines use paged segments.

  17. Address Translation (Fig. 5.40: the virtual address is split into a virtual page number and a page offset; the virtual page number indexes the page table, and the resulting physical page frame is combined with the page offset to access main memory.)

  18. The four memory-hierarchy questions for Virtual Memory Q1: Where can a block be placed in main memory? Pages or segments can be placed anywhere in main memory. Q2: Where is a block found if it is in main memory? A page table contains the mapping from virtual to physical addresses (see the sketch below). Inverted page tables reduce the size of the page table. A translation lookaside buffer (TLB) is used to cache the most recently used translations. Q3: Which block should be replaced on a virtual memory miss? The least recently used (LRU) page is replaced. Q4: What happens on a write? The write policy is always write-back.
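As a concrete illustration of Q2, here is a minimal sketch (not from the slides) of a single-level page-table lookup. The 8 KB page size matches the 13-bit page offset used on the Alpha slide that follows; the pte_t layout and the function name are illustrative only.

  #include <stdint.h>

  #define PAGE_SHIFT 13
  #define PAGE_SIZE  (1u << PAGE_SHIFT)   /* 8 KB pages */

  typedef struct {
      uint32_t frame;    /* physical page-frame number */
      unsigned valid:1;  /* 0 => page fault (page is on disk) */
  } pte_t;

  /* Translate a virtual address; returns 0 and sets *pa on success,
   * -1 on a page fault (the OS would then bring the page in from disk). */
  int translate(const pte_t *page_table, uint64_t va, uint64_t *pa) {
      uint64_t vpn    = va >> PAGE_SHIFT;       /* virtual page number */
      uint64_t offset = va & (PAGE_SIZE - 1);   /* page offset         */
      if (!page_table[vpn].valid)
          return -1;                            /* page fault          */
      *pa = ((uint64_t)page_table[vpn].frame << PAGE_SHIFT) | offset;
      return 0;
  }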

  19. Fast Address Translation (Figure: a 32-entry, fully associative TLB; each entry holds V <1>, R <2>, W <2>, Tag <30>, and Phys. Addr. <21> fields; the virtual address is split into a page-frame address <30> (high-order bits) and a page offset <13> (low-order 13 bits), and a 32:1 mux selects the matching physical page frame.) 1-2: Send virtual address to all tags. 2: Check access type for violation of protection. 3: Use matching tag as a mux selector. 4: Combine page offset with physical page frame to form physical address.
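The steps above amount to the following lookup, shown here as a software sketch (not from the slides). The field widths — 13-bit page offset, 30-bit virtual tag, 21-bit physical frame, 32 entries — come from the slide; the struct layout and the simple read/write permission check are simplified assumptions.

  #include <stdint.h>

  #define TLB_ENTRIES 32
  #define OFFSET_BITS 13

  typedef struct {
      uint64_t tag;      /* virtual page-frame address, <30> */
      uint32_t pfn;      /* physical page frame,        <21> */
      unsigned valid:1, readable:1, writable:1;
  } tlb_entry_t;

  /* Returns 0 and sets *pa on a TLB hit with sufficient permission,
   * -1 on a miss or protection violation (handled in software on Alpha). */
  int tlb_lookup(const tlb_entry_t tlb[TLB_ENTRIES],
                 uint64_t va, int is_write, uint64_t *pa) {
      uint64_t tag    = va >> OFFSET_BITS;               /* steps 1-2: compare   */
      uint64_t offset = va & ((1u << OFFSET_BITS) - 1);  /* against all tags     */
      for (int i = 0; i < TLB_ENTRIES; i++) {
          if (tlb[i].valid && tlb[i].tag == tag) {       /* step 3: matching tag */
              if (is_write ? !tlb[i].writable : !tlb[i].readable)
                  return -1;                             /* step 2: protection   */
              *pa = ((uint64_t)tlb[i].pfn << OFFSET_BITS) | offset;  /* step 4 */
              return 0;
          }
      }
      return -1;                                         /* TLB miss */
  }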

  20. Protecting Processes (minimum protection system) Use a pair of registers, base and bound, to check if the address is within an allowed interval: base ≤ address ≤ bound. The hardware must implement at least two modes of operation. The operating system runs in the supervisor mode and regular application programs run in the user mode. The base, the bound, the user/supervisor mode bit, and the exception enable/disable bit can only be changed when the processor is running in the supervisor mode.
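The base-and-bound check itself is tiny; the sketch below (illustrative, not from the slides) models it in C, with a stub standing in for the protection exception the hardware would raise.

  #include <stdio.h>
  #include <stdint.h>
  #include <stdlib.h>

  /* An access is legal only if base <= address <= bound; otherwise the
   * hardware raises an exception (modelled here by a stub that aborts). */
  static void protection_fault(uint64_t addr) {
      fprintf(stderr, "protection fault at 0x%llx\n", (unsigned long long)addr);
      exit(1);
  }

  static uint64_t checked_access(uint64_t addr, uint64_t base, uint64_t bound) {
      if (addr < base || addr > bound)
          protection_fault(addr);  /* would trap to the OS in supervisor mode */
      return addr;                 /* address is within the allowed interval  */
  }

  int main(void) {
      printf("ok: 0x%llx\n",
             (unsigned long long)checked_access(0x2000, 0x1000, 0x3000));
      checked_access(0x4000, 0x1000, 0x3000);   /* out of bounds -> fault */
      return 0;
  }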

  21. 2. Fast hits by Avoiding Address Translation (Figure: three cache organizations. Conventional Organization: the CPU sends the virtual address (VA) to the translation buffer (TB), and the resulting physical address (PA) is used to access the cache and memory. Virtually Addressed Cache: the cache is accessed with the VA and holds VA tags; translation happens only on a miss, which creates the synonym problem. Overlapped organization: the cache access is overlapped with the VA translation, which requires the cache index to remain invariant across translation; an L2 cache is then accessed with the PA.)

  22. 2. Fast hits by Avoiding Address Translation • Send virtual address to cache? Called Virtually Addressed Cache or just Virtual Cache vs. Physical Cache • Every time process is switched, logically must flush the cache; otherwise get false hits • Cost is time to flush + “compulsory” misses from empty cache • Dealing with aliases (sometimes called synonyms): two different virtual addresses map to same physical address • I/O must interact with cache, so need virtual address • Solution to aliases • HW guarantees that aliases agree in the bits that cover the index field; with a direct-mapped cache, they must then be unique; called page coloring • Solution to cache flush • Add a process-identifier tag that identifies the process as well as the address within the process: can’t get a hit if wrong process

  23. 2. Fast Cache Hits by Avoiding Translation: Index with Physical Portion of Address • If index is physical part of address, can start tag access in parallel with translation so that can compare to physical tag • Limits cache to page size: what if we want bigger caches and to use the same trick? • Higher associativity moves barrier to right • Page coloring • (Address fields: Page Address | Page Offset, with the cache’s Address Tag | Index | Block Offset aligned beneath them, the Index and Block Offset falling inside the Page Offset)
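The "limits cache to page size" point, and how associativity moves the barrier, follows from a one-line formula: if the index and block offset must fit inside the page offset, then cache size ≤ page size × associativity. The sketch below (illustrative; the 8 KB page and the associativities tried are example values) just evaluates that bound.

  #include <stdio.h>

  /* Largest physically indexed cache whose index + block offset still fit
   * inside the page offset: max cache size = page size * associativity. */
  int main(void) {
      const unsigned page_size = 8 * 1024;   /* 8 KB pages */
      for (unsigned assoc = 1; assoc <= 8; assoc *= 2)
          printf("%u-way: max physically indexed cache = %u KB\n",
                 assoc, page_size * assoc / 1024);
      return 0;
  }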

  24. 3: Fast Hits by Pipelining Cache. Case Study: MIPS R4000 • 8 Stage Pipeline: • IF–first half of fetching of instruction; PC selection happens here as well as initiation of instruction cache access. • IS–second half of access to instruction cache. • RF–instruction decode and register fetch, hazard checking and also instruction cache hit detection. • EX–execution, which includes effective address calculation, ALU operation, and branch target computation and condition evaluation. • DF–data fetch, first half of access to data cache. • DS–second half of access to data cache. • TC–tag check, determine whether the data cache access hit. • WB–write back for loads and register-register operations. • What is impact on Load delay? • Need 2 instructions between a load and its use!

  25. Case Study: MIPS R4000 (Pipeline overlap diagrams omitted.) • TWO Cycle Load Latency • THREE Cycle Branch Latency (conditions evaluated during EX phase) • Delay slot plus two stalls • Branch likely cancels delay slot if not taken

  26. R4000 Performance • Not ideal CPI of 1: • Load stalls (1 or 2 clock cycles) • Branch stalls (2 cycles + unfilled slots) • FP result stalls: RAW data hazard (latency) • FP structural stalls: Not enough FP hardware (parallelism)

  27. What is the Impact of What You’ve Learned About Caches? • 1960-1985: Speed = ƒ(no. operations) • 1990: • Pipelined Execution & Fast Clock Rate • Out-of-Order execution • Superscalar Instruction Issue • 1998: Speed = ƒ(non-cached memory accesses) • What does this mean for Compilers? Operating Systems? Algorithms? Data Structures?

  28. Alpha 21064 • Separate Instr & Data TLB & Caches • TLBs fully associative • TLB updates in SW (“Priv Arch Libr”) • Caches 8KB direct mapped, write thru • Critical 8 bytes first • Prefetch instr. stream buffer • 2 MB L2 cache, direct mapped, WB (off-chip) • 256 bit path to main memory, 4 x 64-bit modules • Victim Buffer: to give read priority over write • 4 entry write buffer between D$ & L2$

  29. Alpha Memory Performance: Miss Rates of SPEC92 (Chart: miss rates for the 8K instruction cache, 8K data cache, and 2M L2 cache; representative values across SPEC92 programs are I$ miss = 6%, D$ miss = 32%, L2 miss = 10%; I$ miss = 2%, D$ miss = 13%, L2 miss = 0.6%; and I$ miss = 1%, D$ miss = 21%, L2 miss = 0.3%.)

  30. Alpha CPI Components • Instruction stall: branch mispredict (green in chart); Data cache (blue); Instruction cache (yellow); L2$ (pink) • Other: compute + reg conflicts, structural conflicts

  31. Pitfall: Predicting Cache Performance from Different Prog. (ISA, compiler, ...) • 4KB Data cache miss rate 8%, 12%, or 28%? • 1KB Instr cache miss rate 0%, 3%, or 10%? • Alpha vs. MIPS for 8KB Data $: 17% vs. 10% • Why 2X Alpha v. MIPS? (Chart: D$ and I$ miss rates for tomcatv, gcc, and espresso.)

  32. Cache Optimization Summary

    Technique                          MR   MP   HT   Complexity
    Larger Block Size                  +    –         0
    Higher Associativity               +         –    1
    Victim Caches                      +              2
    Pseudo-Associative Caches          +              2
    HW Prefetching of Instr/Data       +              2
    Compiler Controlled Prefetching    +              3
    Compiler Reduce Misses             +              0
    Priority to Read Misses                 +         1
    Early Restart & Critical Word 1st       +         2
    Non-Blocking Caches                     +         3
    Second Level Caches                     +         2
    Better memory system                    +         3
    Small & Simple Caches              –         +    0
    Avoiding Address Translation                 +    2
    Pipelining Caches                            +    2

  (MR = miss rate, MP = miss penalty, HT = hit time)
