Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Presentation Transcript


  1. Chapter 5 Large and Fast: Exploiting Memory Hierarchy

  2. Computer System: [Diagram] A processor (with registers and a cache) connects over a memory-I/O bus to main memory and to I/O controllers for a display, a network, and disks.

  3. Memory Technology (§5.1 Introduction) • Static RAM (SRAM): 0.5ns – 2.5ns, $2000 – $5000 per GB • Dynamic RAM (DRAM): 50ns – 70ns, $20 – $75 per GB • Magnetic disk: 5ms – 20ms, $0.20 – $2 per GB • Flash memory: 25μs – 250μs, $0.25 per GB

  4. Static RAM (SRAM) • Fast • ~4 nsec access time • Persistent • as long as power is supplied • no refresh required • Expensive • 6 transistors/bit • Stable • High immunity to noise and environmental disturbances • Technology for caches

  5. Anatomy of an SRAM Cell: [Diagram] A 6-transistor cell with two stable configurations (storing 0 or 1). Terminology: the bit lines (b and b’) carry data; the word line is used for addressing. • Write: 1. set the bit lines to the new data value (b’ is set to the opposite of b); 2. raise the word line to “high”: this sets the cell to the new state (which may involve flipping it relative to the old state). • Read: 1. set both bit lines high; 2. set the word line high; 3. see which bit line goes low.

  6. Example SRAM Configuration (16 x 8): [Diagram] An address decoder takes address bits A0–A3 and asserts one of 16 word lines W0–W15; the selected row of memory cells drives bit-line pairs b7/b7’ … b0/b0’ into sense/write amplifiers, which connect to input/output lines d7 … d0 under control of the R/W signal.

  7. Dynamic RAM (DRAM) • Slower than SRAM • access time ~60 ns • Non-persistent • every row must be accessed every ~1 ms (refreshed) • Cheaper than SRAM • 1 transistor/bit • Fragile • electrical noise, light, radiation • Workhorse memory technology

  8. Anatomy of a DRAM Cell: [Diagram] An access transistor, gated by the word line, connects a storage-node capacitor (Cnode) to the bit line (capacitance CBL). Writing: raise the word line and drive the bit line to V, charging the storage node. Reading: raise the word line and sense the small bit-line voltage change, which scales as Cnode / CBL.

  9. Addressing Arrays with Bits • Array size: R rows, R = 2^r; C columns, C = 2^c; N = R × C bits of memory • Addressing: addresses are n bits, where N = 2^n (so n = r + c) • row(address) = address / C (the leftmost r bits of the address) • col(address) = address % C (the rightmost c bits of the address) • Example: R = 2, C = 4; addresses 000–011 fill row 0, columns 0–3, and 100–111 fill row 1; so address 6 (binary 110) is at row 1, col 2
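
A minimal C sketch of the row/column arithmetic above, using the slide's example values (the program itself is illustrative):

```c
#include <stdio.h>

/* Split an n-bit address into a row index (leftmost r bits) and a
   column index (rightmost c bits), for R = 2^r rows, C = 2^c columns. */
int main(void) {
    const unsigned R = 2, C = 4;   /* slide's example: r = 1, c = 2 */
    unsigned address = 6;          /* binary 110 */

    unsigned row = address / C;    /* leftmost r bits  -> 1 */
    unsigned col = address % C;    /* rightmost c bits -> 2 */

    printf("address %u -> row %u, col %u\n", address, row, col);
    return 0;
}
```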

  10. Example 2-Level Decode DRAM (64K x 1): [Diagram] A 256 × 256 cell array. The 16-bit address is provided in two 8-bit chunks on pins A7–A0: the row address latch captures the first chunk on RAS, and a row decoder selects one of 256 rows; the column address latch captures the second chunk on CAS, and the column latch and decoder select one of 256 columns through the sense/write amps, with R/W’ steering data between Din and Dout.
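
The same split for this 64K×1 part, where the 16-bit address travels over the 8 address pins in two chunks (a sketch; the RAS/CAS strobe timing itself is not modeled):

```c
#include <stdint.h>
#include <stdio.h>

int main(void) {
    uint16_t addr = 0xABCD;                 /* full 16-bit DRAM address */
    uint8_t  row  = (uint8_t)(addr >> 8);   /* driven on A7-A0, latched by RAS */
    uint8_t  col  = (uint8_t)(addr & 0xFF); /* driven on A7-A0, latched by CAS */
    printf("row 0x%02X, col 0x%02X\n", row, col);
    return 0;
}
```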

  11. DRAM Operation • Row Address (~50ns) • Set Row address on address lines & strobe RAS • Entire row read & stored in column latches • Contents of row of memory cells destroyed • Column Address (~10ns) • Set Column address on address lines & strobe CAS • Access selected bit • READ: transfer from selected column latch to Dout • WRITE: Set selected column latch to Din • Rewrite (~30ns) • Write back entire row

  12. Observations About DRAMs • Timing • Access time (= 60ns) < cycle time (= 90ns) • Need to rewrite row • Must Refresh Periodically • Perform complete memory cycle for each row • Approximately once every 1ms • Handled in background by memory controller • Inefficient Way to Get a Single Bit • Effectively read entire row
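
A quick check of what that refresh requirement costs, assuming the 256-row organization from slide 10 and the 90 ns cycle time above (a back-of-the-envelope sketch, not a datasheet figure):

```c
#include <stdio.h>

int main(void) {
    const double rows = 256, cycle_ns = 90, period_ns = 1e6; /* 1 ms */
    double busy_ns = rows * cycle_ns;   /* time spent refreshing per period */
    printf("refresh overhead: %.1f%% of memory bandwidth\n",
           100.0 * busy_ns / period_ns);            /* prints ~2.3% */
    return 0;
}
```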

  13. Enhanced Performance DRAMs • Conventional access: Row + Col (RAS CAS RAS CAS ...) • Page mode (burst mode): Row + a series of columns (RAS CAS CAS CAS ...), giving successive bits • Double data rate (DDR) DRAM: transfer on rising and falling clock edges • Quad data rate (QDR) DRAM: separate DDR inputs and outputs (used in communication devices) • Timing: row access time 50ns, column access time 10ns, cycle time 90ns, page-mode cycle time 25ns
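
A worked comparison using the timing table above, under the plausible accounting that the first access pays a full 90 ns cycle and each later same-row access pays only the 25 ns page-mode cycle:

```c
#include <stdio.h>

int main(void) {
    const double cycle = 90, page_cycle = 25;  /* ns, from the table above */
    int n = 4;                                 /* bits read from one row */
    double conventional = n * cycle;                    /* RAS CAS RAS CAS ... */
    double page_mode    = cycle + (n - 1) * page_cycle; /* RAS CAS CAS CAS ... */
    printf("conventional: %.0f ns, page mode: %.0f ns\n",
           conventional, page_mode);           /* 360 ns vs 165 ns */
    return 0;
}
```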

  14. Internal Organization of DRAM • Dual inline memory modules (DIMMs) contain 4–16 DRAM chips and are 8 bytes wide.

  15. Memory performance improvement

  16. Improving DRAM Performance • Multiple accesses to the same row (using the row buffer as a cache) • Synchronous DRAM (SDRAM): a clock added to the DRAM interface • Burst mode with critical word first • Double data rate (DDR): transfers data on both edges of the clock • Multiple banks on each DRAM device (up to 8 banks in DDR3)

  17. Disk Storage (§6.3) • Nonvolatile, rotating magnetic storage

  18. Magnetic Disks • The disk surface spins at 3600–7200 RPM. • The surface consists of a set of concentric magnetized rings called tracks; each track is divided into sectors. • The read/write head floats over the disk surface and moves back and forth on an arm from track to track.

  19. Disk Sectors and Access • Each sector records • Sector ID • Data (512 bytes, 4096 bytes proposed) • Error correcting code (ECC) • Used to hide defects and recording errors • Synchronization fields and gaps • Access to a sector involves • Queuing delay if other accesses are pending • Seek: move the heads • Rotational latency • Data transfer • Controller overhead

  20. Disk Capacity (18GB example) • Number of platters: 12 • Surfaces / platter: 2 • Number of tracks: 6,962 • Sectors / track: 213 • Bytes / sector: 512 • Total bytes: 18,221,948,928
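
The total follows directly from multiplying the parameters; a small C check:

```c
#include <stdio.h>

int main(void) {
    long long platters = 12, surfaces = 2, tracks = 6962,
              sectors = 213, bytes = 512;
    long long total = platters * surfaces * tracks * sectors * bytes;
    printf("total bytes: %lld\n", total);   /* 18,221,948,928 */
    return 0;
}
```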

  21. Disk Operation • Operation • Read or write complete sector • Seek • Position head over proper track • Typically 6–9 ms • Rotational Latency • Wait until desired sector passes under head • Worst case: one complete rotation (at 10,025 RPM, about 6 ms) • Read or Write Bits • Transfer rate depends on # bits per track and rotational speed • E.g., 213 × 512 bytes per track @ 10,025 RPM ≈ 18 MB/sec • Modern disks have external transfer rates of up to 80 MB/sec • DRAM caches on disk help sustain these higher rates
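
Both numbers on this slide can be reproduced from the rotation speed and track capacity; a small C sketch:

```c
#include <stdio.h>

int main(void) {
    const double rpm = 10025, sectors = 213, bytes_per_sector = 512;
    double rev_per_s = rpm / 60.0;
    double worst_rot_ms = 1000.0 / rev_per_s;           /* one full rotation */
    double rate_mb_s = sectors * bytes_per_sector * rev_per_s / 1e6;
    printf("worst-case rotational latency: %.1f ms\n", worst_rot_ms); /* ~6.0 */
    printf("transfer rate: %.1f MB/s\n", rate_mb_s);                  /* ~18.2 */
    return 0;
}
```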

  22. Disk Performance • Getting First Byte • Seek + Rotational latency = 7,000 – 19,000 µsec • Getting Successive Bytes • ~ 0.06 µsec each • roughly 100,000 times faster than getting the first byte! • Optimizing Performance: • Large block transfers are more efficient • processor is interrupted when transfer completes

  23. Magnetic Disk Technology • Seagate ST-12550N Barracuda 2 Disk • Linear density: 52,187 bits per inch (BPI) • Bit spacing: 0.5 microns • Track density: 3,047 tracks per inch (TPI) • Track spacing: 8.3 microns • Total tracks: 2,707 • Rotational speed: 7200 RPM • Average linear speed: 86.4 kilometers / hour • Head floating height: 0.13 microns

  24. Flash Storage (§6.4) • Nonvolatile semiconductor storage • 100× – 1000× faster than disk • Smaller, lower power, more robust • But more $/GB (between disk and DRAM)

  25. Flash Types • NOR flash: bit cell like a NOR gate • Random read/write access • Used for instruction memory in embedded systems • NAND flash: bit cell like a NAND gate • Denser (bits/area), but block-at-a-time access • Cheaper per GB • Used for USB keys, media storage, … • Flash bits wear out after 1000s of accesses • Not suitable for direct RAM or disk replacement • Wear leveling: remap data to less-used blocks (see the sketch below)
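
A toy sketch of the wear-leveling idea, with an invented logical-to-physical map and write counters (real flash translation layers are far more elaborate):

```c
#include <stdio.h>

#define BLOCKS 8

/* Toy wear-leveling map: logical block -> physical block, plus a
   per-physical-block write counter. Copying of old data and garbage
   collection are omitted; this only shows the remapping idea. */
static int      map[BLOCKS] = {0, 1, 2, 3, 4, 5, 6, 7};
static unsigned wear[BLOCKS];

static int least_worn(void) {
    int best = 0;
    for (int p = 1; p < BLOCKS; p++)
        if (wear[p] < wear[best]) best = p;
    return best;
}

void write_block(int logical) {
    int target = least_worn();    /* steer the write to a lightly used block */
    map[logical] = target;
    wear[target]++;
}

int main(void) {
    for (int i = 0; i < 100; i++) write_block(0);   /* one hot logical block */
    for (int p = 0; p < BLOCKS; p++)
        printf("phys %d: %u writes\n", p, wear[p]); /* roughly even spread */
    return 0;
}
```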

  26. Flash Memory: Floating-Gate Based • The floating gate is isolated. • Floating gate not charged: the cell functions like a normal MOSFET. • Floating gate charged: the charge shields the channel region from the control gate and prevents the formation of a channel between source and drain.

  27. Floating Gate set/reset • Charging / Discharging Floating Gate: • Channel Hot Electron Injection • Fowler-Nordheim Tunneling • Both approaches use high voltages for operations

  28. NOR Array • Reading: assert a single word line; with the source lines asserted, reading the bit line gives the contents of the cell. • Erasure: set sources to 12V, set word lines to ground. • Write: set sources to 12V, set word line to −5V.

  29. Virtual Memory (§5.4) • Use main memory as a “cache” for secondary (disk) storage • Managed jointly by CPU hardware and the operating system (OS) • Programs share main memory • Each gets a private virtual address space holding its frequently used code and data • Protected from other programs • CPU and OS translate virtual addresses to physical addresses • VM “block” is called a page • VM translation “miss” is called a page fault

  30. Address Translation • Fixed-size pages (e.g., 4K)
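
With fixed 4K pages, translation only replaces the upper address bits; the page offset passes through unchanged. A minimal sketch (the address value is arbitrary):

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_BITS 12            /* 4 KB pages, as on the slide */

int main(void) {
    uint32_t vaddr  = 0x0040307F;
    uint32_t vpn    = vaddr >> PAGE_BITS;              /* virtual page number */
    uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1); /* unchanged by translation */
    printf("vpn 0x%X, offset 0x%X\n", vpn, offset);    /* vpn 0x403, offset 0x7F */
    return 0;
}
```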

  31. Virtual Memory Design • On a page fault, the page must be fetched from disk • Handled by OS software • A smart replacement algorithm (LRU) is used • Try to minimize page fault rate • Fully associative placement • Large page size: 4–16 KB (32–64 KB recently) • Write-back mechanism is used

  32. Page Tables • Stores placement information • Array of page table entries, indexed by virtual page number • Page table register in CPU points to page table in physical memory • Each program has its own PT & PT register • If page is present in memory • PTE stores the physical page number • Plus other status bits (referenced, dirty, …) • If page is not present • PTE refers to location on disk (swap space)
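
A sketch of a page table entry and the lookup described above, with a hypothetical PTE layout (real formats vary by architecture):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define PAGE_BITS 12              /* 4 KB pages */

/* Hypothetical PTE layout for illustration. */
typedef struct {
    unsigned valid : 1;           /* page present in physical memory?         */
    unsigned dirty : 1;           /* page written since it was loaded?        */
    unsigned ref   : 1;           /* referenced recently? (used on slide 37)  */
    uint32_t ppn;                 /* physical page number if valid; otherwise
                                     interpreted as a swap-space location     */
} pte_t;

/* Index the page table with the virtual page number, as the slide says. */
uint32_t translate(const pte_t *page_table, uint32_t vaddr) {
    pte_t pte = page_table[vaddr >> PAGE_BITS];
    if (!pte.valid) {                        /* page fault: the OS must fetch */
        fprintf(stderr, "page fault\n");     /* the page from swap space      */
        exit(1);
    }
    return (pte.ppn << PAGE_BITS) | (vaddr & ((1u << PAGE_BITS) - 1));
}

int main(void) {
    pte_t pt[16] = {0};
    pt[3] = (pte_t){ .valid = 1, .ppn = 0x42 };
    printf("0x%08X\n", translate(pt, 0x00003ABC));   /* -> 0x00042ABC */
    return 0;
}
```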

  33. Swap Space • Swap space is used as a backup memory space for active processes. • The OS creates a process’s swap space when the process starts and places all of its pages there. • Inactive pages in memory are moved to the swap space on hard drives, which have a slower access time than physical memory. • Swap space can be a dedicated swap partition (recommended), a swap file, or a combination of swap partitions and swap files.

  34. Translation Using a Page Table: [Figure]

  35. Mapping Pages to Storage: [Figure]

  36. Reducing Page Table Size • Use multi-level page tables • The upper level acts as a segment table • Each upper-level entry addresses multiple pages (a segment) • Inverted page table • Contains entries only for pages present in main memory • Requires a hashing function to find the PTE (see the sketch below)
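
A sketch of the inverted-page-table lookup, with an invented hash function and linear probing (real designs differ):

```c
#include <stdint.h>
#include <stdio.h>

#define IPT_SIZE 1024             /* roughly one entry per physical page */

typedef struct { int valid; uint16_t pid; uint32_t vpn, ppn; } ipt_entry_t;
static ipt_entry_t ipt[IPT_SIZE];

/* Hash of (process ID, virtual page number); constants are illustrative. */
static uint32_t slot(uint16_t pid, uint32_t vpn) {
    return (vpn * 2654435761u ^ pid) % IPT_SIZE;
}

/* Returns 1 and sets *ppn on a hit; 0 means the page is not resident. */
int ipt_lookup(uint16_t pid, uint32_t vpn, uint32_t *ppn) {
    for (uint32_t i = 0, h = slot(pid, vpn); i < IPT_SIZE; i++) {
        ipt_entry_t *e = &ipt[(h + i) % IPT_SIZE];  /* linear probing */
        if (!e->valid) return 0;                    /* not present: page fault */
        if (e->pid == pid && e->vpn == vpn) { *ppn = e->ppn; return 1; }
    }
    return 0;
}

int main(void) {
    uint32_t ppn = 0;
    ipt[slot(7, 0x123)] = (ipt_entry_t){1, 7, 0x123, 0x55};
    int hit = ipt_lookup(7, 0x123, &ppn);
    printf("hit=%d ppn=0x%X\n", hit, ppn);          /* hit=1 ppn=0x55 */
    return 0;
}
```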

  37. Replacement and Writes • To reduce page fault rate, prefer least-recently used (LRU) replacement • Reference bit (aka use bit) in PTE set to 1 on access to page • Periodically cleared to 0 by OS • A page with reference bit = 0 has not been used recently • Disk writes take millions of cycles • Write a whole page at once, not individual locations • Write-through is impractical • Use write-back • Dirty bit in PTE set when page is written
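
A sketch of the reference-bit approximation to LRU described above (the table layout is illustrative):

```c
#define NPAGES 1024

typedef struct { unsigned valid : 1, ref : 1, dirty : 1; } pte_bits_t;
static pte_bits_t pt[NPAGES];

/* Run periodically by the OS: clear all reference bits. */
void clear_ref_bits(void) {
    for (int i = 0; i < NPAGES; i++) pt[i].ref = 0;
}

/* On replacement, prefer a page whose reference bit is still 0,
   i.e., one not accessed since the last clearing pass. */
int pick_victim(void) {
    for (int i = 0; i < NPAGES; i++)
        if (pt[i].valid && !pt[i].ref) return i;
    return 0;   /* fallback: every resident page was recently used */
}
/* If pt[victim].dirty is set, the page must be written back to disk
   before its frame is reused (write-back policy). */
```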

  38. Fast Translation Using a TLB • Address translation would appear to require extra memory references • One to access the PTE • Then the actual memory access • But access to page tables has good locality • So use a fast cache of PTEs within the CPU • Called a Translation Look-aside Buffer (TLB) • Typical: 16–512 PTEs, 0.5–1 cycle for hit, 10–100 cycles for miss, 0.01%–1% miss rate
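
A software model of a small fully associative TLB, to make the lookup concrete (sizes and fields are illustrative; real hardware compares all entries in parallel):

```c
#include <stdint.h>
#include <stdio.h>

#define TLB_ENTRIES 16
#define PAGE_BITS   12                       /* 4 KB pages */

typedef struct { int valid; uint32_t vpn, ppn; } tlb_entry_t;
static tlb_entry_t tlb[TLB_ENTRIES];

/* Returns 1 on a hit. On a miss, the PTE must be fetched from the
   page table in memory (10-100 cycles) and the TLB refilled. */
int tlb_lookup(uint32_t vaddr, uint32_t *paddr) {
    uint32_t vpn = vaddr >> PAGE_BITS;
    for (int i = 0; i < TLB_ENTRIES; i++)    /* parallel compare in hardware */
        if (tlb[i].valid && tlb[i].vpn == vpn) {
            *paddr = (tlb[i].ppn << PAGE_BITS)
                   | (vaddr & ((1u << PAGE_BITS) - 1));
            return 1;
        }
    return 0;                                /* TLB miss */
}

int main(void) {
    uint32_t paddr = 0;
    tlb[0] = (tlb_entry_t){1, 0x00403, 0x00012};
    int hit = tlb_lookup(0x0040307F, &paddr);
    printf("hit=%d paddr=0x%08X\n", hit, paddr);  /* hit=1 paddr=0x0001207F */
    return 0;
}
```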

  39. Fast Translation Using a TLB: [Figure]

  40. TLB Misses • If page is in memory • Load the PTE from memory and retry • Could be handled in hardware • Can get complex for more complicated page tables • Or in software • Raise a special exception, with optimized handler • If page is not in memory (page fault) • OS handles fetching the page and updating the page table • Then restart the faulting instruction

  41. Page Fault Handler • Use faulting virtual address to find PTE • Locate page on disk • Choose page to replace • If dirty, write to disk first • Read page into memory and update page table • Make process runnable again • Restart from faulting instruction
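
The handler steps, written as C-style pseudocode; every helper here is a hypothetical name standing in for the corresponding OS service:

```c
#include <stdint.h>

/* Hypothetical OS services, named to mirror the steps above. */
typedef struct pte pte_t;
typedef uint64_t disk_loc_t;
extern pte_t     *walk_page_table(uint32_t vaddr);
extern disk_loc_t pte_disk_location(pte_t *pte);
extern int        choose_victim_page(void);
extern int        page_is_dirty(int page);
extern void       write_page_to_disk(int page);
extern void       read_page_from_disk(disk_loc_t d, int page);
extern void       update_page_table(pte_t *pte, int page);
extern void       mark_process_runnable(void);

void page_fault(uint32_t bad_vaddr) {
    pte_t *pte   = walk_page_table(bad_vaddr); /* faulting address -> PTE  */
    disk_loc_t d = pte_disk_location(pte);     /* locate page on disk      */
    int victim   = choose_victim_page();       /* choose page to replace   */
    if (page_is_dirty(victim))
        write_page_to_disk(victim);            /* if dirty, write first    */
    read_page_from_disk(d, victim);            /* read page into memory    */
    update_page_table(pte, victim);            /* update page table        */
    mark_process_runnable();                   /* process runs again; the
                                                  faulting instruction is
                                                  restarted on return      */
}
```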

  42. TLB and Cache Interaction • If cache tag uses physical address • Need to translate before cache lookup • Alternative: use virtual address tag • Complications due to aliasing • Different virtual addresses for shared physical address

  43. Memory Protection • Different tasks can share parts of their virtual address spaces • But need to protect against errant access • Requires OS assistance • Hardware support for OS protection • Privileged supervisor mode (aka kernel mode) • Privileged instructions • Page tables and other state information only accessible in supervisor mode • System call exception (e.g., syscall in MIPS)

  44. Memory Protection • Page table entry contains access-rights information • Hardware enforces this protection (trap into OS if a violation occurs) • Example page tables:
      Process i:  VP 0: Read Yes, Write Yes, Physical Addr PP 9
                  VP 1: Read Yes, Write Yes, Physical Addr PP 4
                  VP 2: Read No,  Write No,  Physical Addr XXXXXXX
      Process j:  VP 0: Read Yes, Write Yes, Physical Addr PP 6
                  VP 1: Read Yes, Write No,  Physical Addr PP 9
                  VP 2: Read No,  Write No,  Physical Addr XXXXXXX
      (PP 9 is shared: writable by process i, read-only for process j)
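
A sketch of the check the hardware applies on each access using those rights bits (types and names invented for illustration):

```c
#include <stdint.h>

typedef struct { unsigned can_read : 1, can_write : 1; uint32_t ppn; } prot_pte_t;

extern void trap_to_os(void);   /* protection-violation exception handler */

uint32_t checked_translate(prot_pte_t pte, uint32_t offset, int is_write) {
    if (is_write ? !pte.can_write : !pte.can_read)
        trap_to_os();                 /* e.g. any access to process i's VP 2,
                                         or a write to process j's VP 1 */
    return (pte.ppn << 12) | offset;  /* access permitted: translate */
}
```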

  45. Virtual Machines (§5.6) • Host computer emulates guest operating system and machine resources • Improved isolation of multiple guests • Avoids security and reliability problems • Aids sharing of resources • Virtualization has some performance impact • Feasible with modern high-performance computers • Examples • IBM VM/370 (1970s technology!) • VMWare • Microsoft Virtual PC

  46. Virtual Machine Monitor • Maps virtual resources to physical resources • Memory, I/O devices, CPUs • Guest code runs on native machine in user mode • Traps to VMM on privileged instructions and access to protected resources • Guest OS may be different from host OS • VMM handles real I/O devices • Emulates generic virtual I/O devices for guest • Uses time sharing for physical I/O access

  47. Example: Timer Virtualization • In native machine, on timer interrupt • OS suspends current process, handles interrupt, selects and resumes next process • With Virtual Machine Monitor • VMM suspends current VM, handles interrupt, selects and resumes next VM • If a VM requires timer interrupts • VMM emulates a virtual timer • Emulates interrupt for VM when physical timer interrupt occurs

  48. Instruction Set Support • User and System modes • Privileged instructions only available in system mode • Trap to system if executed in user mode • All physical resources only accessible using privileged instructions • Including page tables, interrupt controls, I/O registers • Renaissance of virtualization support • Current ISAs (e.g., x86) are adapting
