

1. Administration
• Midterm on Thursday Oct 28. Covers material through 10/21.
• Histogram of grades for HW#1 posted on newsgroup.
• Sample problem set (and solutions) on pipelining are posted on the web page.
• Last year's practice exam and last year's midterm (with solutions) are on the web site (under "Exams").

2. Main Memory Background
• Performance of Main Memory:
  • Latency: cache miss penalty
    • Access Time: time between the request and the arrival of the word
    • Cycle Time: minimum time between requests
  • Bandwidth: I/O & large-block miss penalty (L2)
• Main Memory is DRAM: Dynamic Random Access Memory
  • Dynamic since it needs to be refreshed periodically (every 8 ms; ~1% of the time)
  • Addresses divided into 2 halves (memory as a 2D matrix):
    • RAS or Row Access Strobe
    • CAS or Column Access Strobe
• Cache uses SRAM: Static Random Access Memory
  • No refresh (6 transistors/bit vs. 1 transistor/bit)
  • Size: DRAM/SRAM ~ 4-8; Cost & Cycle time: SRAM/DRAM ~ 8-16

3. DRAM Organization
• Row and column addresses are sent separately because pins/packaging are expensive, so the address bus is half-width (see the sketch below).
• RAS (Row Access Strobe) typically comes first.
• Some DRAMs allow multiple CAS accesses for the same RAS (page mode).
• Refresh: a read wipes out the data, so it must be written back after every read.
• Refresh: periodic; read each row every 8 ms. Cost is O(sqrt(capacity)), since the number of rows grows as the square root of capacity.
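
A minimal sketch of the half-width address bus, assuming an illustrative 22-bit cell address (the widths here are assumptions, not from the slide):

```python
# Hypothetical 4 Mbit x1 DRAM: 2^22 cells, so a 22-bit cell address,
# sent as two 11-bit halves over the same 11 pins (RAS half first).
ADDR_BITS = 22
HALF = ADDR_BITS // 2

def split_address(addr: int) -> tuple[int, int]:
    """Split a cell address into the (row, column) halves the chip sees."""
    row = addr >> HALF                # upper half, latched on RAS
    col = addr & ((1 << HALF) - 1)    # lower half, latched on CAS
    return row, col

# Page mode: a second access to the same row needs only a new CAS.
r0, c0 = split_address(0x12345)
r1, c1 = split_address(0x12346)
assert r0 == r1 and c0 != c1          # same row, different column
```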

4. 4 Key DRAM Timing Parameters
• tRAC: minimum time from RAS falling to valid data output.
  • Quoted as the speed of a DRAM when you buy it, since it is what appears on the purchase sheet.
  • A typical 4 Mbit DRAM has tRAC = 60 ns.
• tRC: minimum time from the start of one row access to the start of the next.
  • tRC = 110 ns for a 4 Mbit DRAM with tRAC = 60 ns.
• tCAC: minimum time from CAS falling to valid data output.
  • 15 ns for a 4 Mbit DRAM with tRAC = 60 ns.
• tPC: minimum time from the start of one column access to the start of the next.
  • 35 ns for a 4 Mbit DRAM with tRAC = 60 ns.

5-10. Example: Memory Performance, Single Access
• Row Access: 60 ns; Row Cycle: 110 ns
• Column Access: 15 ns; Column Cycle: 35 ns
• [Timing diagram, built up across slides 5-10: a RAS produces data after 60 ns, but the next RAS cannot start until the full 110 ns row cycle has elapsed ("The Bad News"). A CAS then delivers data 15 ns after the row is open, but must wait out the 35 ns column cycle before the next CAS can begin.]

11-12. Example: Memory Performance, Multiple Accesses
• Same parameters: Row Access 60 ns; Row Cycle 110 ns; Column Access 15 ns; Column Cycle 35 ns.
• [Timing diagram: after a single RAS (60 ns), page mode issues CAS after CAS, each returning data in 15 ns at a 35 ns column cycle, with no new RAS in between ("The Good News"). But refresh is a RAS...]

13. DRAM Performance
• A 60 ns (tRAC) DRAM can:
  • perform a row access only every 110 ns (tRC);
  • perform a column access (tCAC) in 15 ns, but the time between column accesses is at least 35 ns (tPC).
    • In practice, external address delays and bus turnaround make it 40 to 50 ns.
• These times do not include the time to drive the addresses off the microprocessor, nor the memory controller overhead!
(The sketch below rechecks these rates.)
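
As a sanity check on these numbers, a small sketch comparing random-access and page-mode timing with the example's parameters (the accounting of the first word at tRAC is an assumption of this toy):

```python
tRAC, tRC = 60, 110    # row access / row cycle, ns
tCAC, tPC = 15, 35     # column access / column cycle, ns

def random_access_time(n_words: int) -> int:
    """Every word opens a new row: one full row cycle per word."""
    return n_words * tRC

def page_mode_time(n_words: int) -> int:
    """One RAS, then back-to-back CAS accesses within the open row:
    first word valid at tRAC, each later word one column cycle apart."""
    return tRAC + (n_words - 1) * tPC

print(random_access_time(4))   # 4 * 110 = 440 ns
print(page_mode_time(4))       # 60 + 3 * 35 = 165 ns
```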

14. DRAM Trends
• DRAMs: capacity +60%/yr, cost -30%/yr
• 2.5X cells/area, 1.5X die size in ~3 years
• A '98 DRAM fab line costs $2B
• Commodity, second-source industry => high volume, low profit, conservative
  • Order of importance: 1) Cost/bit 2) Capacity
  • First RAMBUS: 10X BW, +30% cost => little impact
• Gigabit DRAM will take over the market

15. Main Memory Performance
• Simple:
  • CPU, Cache, Bus, Memory all the same width (32 or 64 bits)
• Wide:
  • CPU/Mux 1 word; Mux/Cache, Bus, Memory N words (Alpha: 64 bits & 256 bits; UltraSPARC: 512)
• Interleaved:
  • CPU, Cache, Bus 1 word; Memory N modules (4 modules); example is word-interleaved (logically "wide").

16. Why not have wider memory?
• Pins and packaging are expensive.
• The CPU accesses a word at a time, so a multiplexer lands in the critical path.
• The unit of expansion grows.
• ECC: must read the full ECC block on every write to a portion of the block.

17. Main Memory Performance
• Timing model (word size is 32 bits):
  • 1 cycle to send the address,
  • 6 cycles access time, 1 cycle to send data
• Cache block is 4 words
• Simple M.P. = 4 x (1 + 6 + 1) = 32
• Wide M.P. = 1 + 6 + 1 = 8
• Interleaved M.P. = 1 + 6 + 4x1 = 11
(M.P. = miss penalty, in cycles; see the sketch below.)
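
The three miss penalties, recomputed as a sketch straight from the slide's timing model:

```python
SEND_ADDR, ACCESS, SEND_WORD = 1, 6, 1   # cycles, per the timing model
BLOCK_WORDS = 4                          # cache block is 4 words

# Simple: one word at a time, each fully serialized.
simple = BLOCK_WORDS * (SEND_ADDR + ACCESS + SEND_WORD)      # 4 * 8 = 32

# Wide: the whole block is fetched and shipped in one go.
wide = SEND_ADDR + ACCESS + SEND_WORD                        # 8

# Interleaved: the 4 banks overlap their accesses, but share a
# one-word bus, so only the transfers serialize.
interleaved = SEND_ADDR + ACCESS + BLOCK_WORDS * SEND_WORD   # 11

print(simple, wide, interleaved)   # 32 8 11
```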

18. Main Memory Performance
• Timing model (word size is 32 bits):
  • 1 cycle to send the address,
  • 6 cycles access time, 2 (or more) cycles to send data
• Cache block is 4 words
• Interleaved M.P. = 1 + 6 + 4x1 = 11
• Independent reads or writes: no need to wait the full latency, as long as the next operation is to a different bank.

19. Independent Memory Banks
• Memory banks for independent accesses vs. faster sequential accesses:
  • Multiprocessor
  • I/O
  • CPU with hit under n misses, non-blocking cache
• Superbank: all memory active on one block transfer (or Bank)
• Bank: portion within a superbank that is word-interleaved (or Subbank)
• [Diagram: memory divided into superbanks, each made up of word-interleaved banks.]

20. Independent Memory Banks
• How many banks? number of banks >= number of clocks to access a word in a bank
  • For sequential accesses; otherwise the stream returns to the original bank before it has the next word ready (see the sketch below).
• Increasing DRAM capacity => fewer chips => harder to have many banks
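
A sketch that simulates this rule of thumb: issue one sequential word per clock to interleaved banks and look for the first conflict (the word count is an illustrative assumption):

```python
def first_conflict(n_banks: int, access_clocks: int, n_words: int = 64):
    """Word i goes to bank i % n_banks at clock i; return the first word
    that finds its bank still busy, or None if the stream never stalls."""
    busy_until = [0] * n_banks
    for i in range(n_words):
        bank = i % n_banks
        if busy_until[bank] > i:        # came back before the bank was ready
            return i
        busy_until[bank] = i + access_clocks
    return None

assert first_conflict(8, 8) is None     # banks >= access clocks: no stall
assert first_conflict(4, 8) == 4        # too few banks: stall at word 4
```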

21. Minimum Memory Size: DRAMs per PC over Time

  Minimum       DRAM Generation
  Memory Size   '86     '89     '92     '96     '99     '02
                1 Mb    4 Mb    16 Mb   64 Mb   256 Mb  1 Gb
  4 MB          32      8
  8 MB                  16      4
  16 MB                         8       2
  32 MB                                 4       1
  64 MB                                 8       2
  128 MB                                        4       1
  256 MB                                        8       2

(The sketch below verifies the entries.)
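
Each table entry is just memory size divided by chip capacity; a short sketch to check:

```python
def drams_per_pc(memory_mbytes: int, chip_mbits: int) -> int:
    """Number of DRAM chips needed: total bits / bits per chip."""
    return (memory_mbytes * 8) // chip_mbits

assert drams_per_pc(4, 1) == 32        # '86: 4 MB from 1 Mb chips
assert drams_per_pc(32, 64) == 4       # '96: 32 MB from 64 Mb chips
assert drams_per_pc(256, 1024) == 2    # '02: 256 MB from 1 Gb chips
```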

22. Fast Memory Systems: DRAM-specific
• Multiple CAS accesses: goes by several names (page mode)
  • Extended Data Out (EDO): 30% faster in page mode
• New DRAMs to address the gap:
  • RAMBUS: reinvents the DRAM interface
    • Each chip is a module vs. a slice of memory
    • Short bus between CPU and chips
    • Does its own refresh
    • Variable amount of data returned
    • 1 byte / 2 ns (500 MB/s per chip)
  • Synchronous DRAM: 2 banks on chip, a clock signal to the DRAM, transfers synchronous to the system clock (66-150 MHz)
• RAMBUS was first seen as a niche (e.g. video memory), now poised to become a standard.

23. Main Memory Organization: DRAM/Disk interface

24. Four Questions for Memory Hierarchy Designers
• Q1: Where can a block be placed in the upper level? (Block placement)
• Q2: How is a block found if it is in the upper level? (Block identification)
• Q3: Which block should be replaced on a miss? (Block replacement)
• Q4: What happens on a write? (Write strategy)

25. Four Questions Applied to Virtual Memory
• Q1: Where can a block be placed in the upper level? (Block placement: fully associative vs. page coloring)
• Q2: How is a block found if it is in the upper level? (Block identification: translation & lookup, page tables and the TLB)
• Q3: Which block should be replaced on a miss? (Block replacement: random vs. LRU)
• Q4: What happens on a write? (Write strategy: copy-on-write, protection, etc.)
• Q?: protection? demand-load vs. prefetch? fixed vs. variable size? unit of transfer vs. frame size. (Software)

26. Paging Organization
• Size of the information blocks that are transferred from secondary to main storage (M).
• Virtual and physical address spaces are partitioned into blocks of equal size: page frames (in memory) and pages (on disk).
• [Diagram: registers <- cache <- main memory (page frames) <- disk (pages).]

27. Addresses in Virtual Memory
• 3 addresses to consider:
  • Physical address: where in main memory the frame is stored.
  • Virtual address: a logical address, relative to a process/name space/page table.
  • Disk address: where on disk the page is stored.
• Disk addresses can be either physical (specifying cylinder, block, etc.) or indirect (another level of naming, even a file system or segment).
• Virtual addresses logically include a process_id concatenated to an n-bit address.

28. Virtual Address Space and Physical Address Space Sizes
• From the point of view of the hierarchy, the disk has more capacity than DRAM. BUT, this does not mean that the virtual address space must be bigger than the physical address space.
• Virtual addresses provide protection and a naming mechanism.
• A long, long time ago, some machines had a physical address space bigger than the virtual address space, and more core than virtual address space (multiple processes in memory at the same time).

29. Address Map
• V = {0, 1, ..., n-1}: virtual address space
• M = {0, 1, ..., m-1}: physical address space
• MAP: V -> M ∪ {0}: address mapping function
  • MAP(a) = a' if data at virtual address a is present at physical address a', a' in M
  • MAP(a) = 0 if data at virtual address a is not present in M (a missing-item fault)
• Usually n > m (n = m and n < m have occurred historically).
• [Diagram: the processor (name space V) presents virtual address a to the address-translation mechanism; on a hit, physical address a' goes to main memory; on a missing-item fault, the fault handler invokes the OS, which performs the transfer of the page from secondary memory to main memory.]
(A sketch of MAP follows below.)
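
MAP as a sketch in Python; the dict-based page table and the use of None as the fault value are assumptions of this toy:

```python
PAGE = 1024
page_table = {0: 7168, 1: 2048}    # resident virtual pages -> frame addresses

def MAP(a: int):
    """Return physical address a' if virtual address a is resident in M,
    else None: the missing-item fault the OS must service from disk."""
    vpage, disp = divmod(a, PAGE)
    frame_addr = page_table.get(vpage)
    if frame_addr is None:
        return None                # fault handler / OS transfer happens here
    return frame_addr + disp       # a' in M

assert MAP(5) == 7173              # page 0 lives at physical 7168
assert MAP(3 * PAGE) is None       # page 3 not present -> fault
```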

30. Paging Organization
• Page size is 1K; the page is the unit of mapping and also the unit of transfer from virtual to physical memory.
• Virtual memory: pages 0, 1, ..., 31 at virtual addresses 0, 1024, ..., 31744.
• Physical memory: frames 0, 1, ..., 7 at physical addresses 0, 1024, ..., 7168.
• Address mapping: VA = page no. + 10-bit displacement.
  • The page number indexes the page table (Page Table Base Reg + index); the entry holds a valid bit (V), Access Rights, and the PA of the frame.
  • The frame and displacement form the physical memory address; actually, concatenation is more likely than addition.
  • The page table itself is located in physical memory.
(A translation sketch follows below.)
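
A sketch of this translation path with the 10-bit displacement and concatenation; the page-table-entry layout as a Python tuple is an assumption:

```python
DISP_BITS = 10                     # 1K pages, as on the slide
DISP_MASK = (1 << DISP_BITS) - 1

# Page table entry: (valid bit, access rights, frame number) -- toy layout.
page_table = [(1, "rw", 7), (1, "r", 3), (0, None, None)]

def translate(va: int) -> int:
    """Index the page table with the page number, then form the physical
    address by concatenating the frame number with the displacement."""
    page_no = va >> DISP_BITS
    disp = va & DISP_MASK
    valid, rights, frame = page_table[page_no]
    if not valid:
        raise RuntimeError("page fault")       # OS loads the page
    return (frame << DISP_BITS) | disp         # concatenation, not addition

assert translate(0x005) == 7168 + 5            # page 0 -> frame 7 at 7168
```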

31. Virtually Addressed Caches Revisited
• [Diagram, "Virtual Address and a Cache": the CPU issues a VA, translation produces a PA, the cache is checked; a hit returns data, a miss goes to main memory.]
• It takes an extra memory access to translate a VA to a PA. This makes cache access very expensive, and this is the "innermost loop" that you want to go as fast as possible.
• ASIDE: Why access the cache with the PA at all? VA caches have a problem!
  • Synonym/alias problem: two different virtual addresses map to the same physical address => two different cache entries holding data for the same physical address!
  • On an update: must update all cache entries with the same physical address, or memory becomes inconsistent.
  • Determining this requires significant hardware: essentially an associative lookup on the physical-address tags to see if you have multiple hits.
  • Or a software-enforced alias boundary: aliases must agree in their low-order address bits up to the cache size (same lsbs of VA & PA below the cache size).
(A toy demonstration of the alias problem follows below.)
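
A toy demonstration of the synonym problem: two virtual addresses, one physical location, and a virtually indexed cache left holding two inconsistent copies (all names here are illustrative):

```python
PAGE = 1024
page_table = {0: 5, 4: 5}          # vpages 0 and 4 alias physical frame 5

vcache = {}                        # virtually addressed: indexed by VA only

def cached_write(va: int, value: int):
    vcache[va] = value             # no translation before the cache

va1 = 0 * PAGE + 0x10              # both name the same physical byte
va2 = 4 * PAGE + 0x10
cached_write(va1, 111)
cached_write(va2, 222)

# Two live entries for one physical address: the update through va2
# left the copy under va1 stale -- exactly the inconsistency above.
assert vcache[va1] != vcache[va2]
```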

32. TLBs
• A way to speed up translation is to use a special cache of recently used page-table entries. This has many names, but the most frequently used is Translation Lookaside Buffer, or TLB.
• TLB entry fields: Virtual Address, Physical Address, Dirty, Ref, Valid, Access.
• Really just a cache on the page-table mappings.
• TLB access time is comparable to cache access time (much less than main memory access time). A sketch follows below.
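
A sketch of a TLB as a small, LRU-replaced cache sitting in front of the page table (the walk function is a stand-in, and the entry count is an assumption):

```python
from collections import OrderedDict

class TLB:
    """Tiny fully associative, LRU-replaced cache of page-table entries."""
    def __init__(self, n_entries: int = 64):
        self.n_entries = n_entries
        self.entries = OrderedDict()          # vpage -> frame

    def lookup(self, vpage: int) -> int:
        if vpage in self.entries:             # hit: no memory access needed
            self.entries.move_to_end(vpage)
            return self.entries[vpage]
        frame = page_table_walk(vpage)        # miss: walk the page table
        if len(self.entries) >= self.n_entries:
            self.entries.popitem(last=False)  # evict least recently used
        self.entries[vpage] = frame
        return frame

def page_table_walk(vpage: int) -> int:
    return vpage + 100                        # stand-in for the real walk

tlb = TLB(n_entries=2)
assert tlb.lookup(7) == 107                   # miss, then fill
assert tlb.lookup(7) == 107                   # hit
```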
