1 / 18

Lecture 13 Main Memory

Lecture 13 Main Memory. Computer Architecture COE 501. Who Cares About the Memory Hierarchy?. Processor-DRAM Memory Gap (latency). µProc 60%/yr. (2X/1.5yr). 1000. CPU. “Moore’s Law”. 100. Processor-Memory Performance Gap: (grows 50% / year). Performance. 10. DRAM 9%/yr.

jamil
Download Presentation

Lecture 13 Main Memory

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Lecture 13Main Memory Computer Architecture COE 501

  2. Who Cares About the Memory Hierarchy? Processor-DRAM Memory Gap (latency) µProc 60%/yr. (2X/1.5yr) 1000 CPU “Moore’s Law” 100 Processor-Memory Performance Gap:(grows 50% / year) Performance 10 DRAM 9%/yr. (2X/10 yrs) DRAM 1 1980 1981 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 2000 Time

  3. Impact on Performance • Suppose a processor executes at • Clock Rate = 200 MHz (5 ns per cycle) • CPI = 1.1 • 50% arith/logic, 30% ld/st, 20% control • Suppose that 10% of memory operations get 50 cycle miss penalty • CPI = ideal CPI + average stalls per instruction = 1.1(cyc) +( 0.30 (datamops/ins) x 0.10 (miss/datamop) x 50 (cycle/miss) ) = 1.1 cycle + 1.5 cycle = 2. 6 • 58 % of the time the processor is stalled waiting for memory! • a 1% instruction miss rate would add an additional 0.5 cycles to the CPI!

  4. Main Memory Background • Performance of Main Memory: • Latency: Cache Miss Penalty • Access Time: time between request and when desired word arrives • Cycle Time: time between requests • Bandwidth: I/O & Large Block Miss Penalty (L2) • Main Memory is DRAM: Dynamic Random Access Memory • Dynamic since needs to be refreshed periodically (8 ms) • Addresses divided into 2 halves: • RAS or Row Access Strobe • CAS or Column Access Strobe • Cache uses SRAM: Static Random Access Memory • No refresh (6 transistors/bit vs. 1 transistor/bit for DRAM) • Address not divided • Size: DRAM/SRAM ­ 4-8, Cost & Cycle time: SRAM/DRAM ­ 8-16

  5. Wr Driver & Precharger Wr Driver & Precharger Wr Driver & Precharger Wr Driver & Precharger - + - + - + - + - + - + - + - + Sense Amp Sense Amp Sense Amp Sense Amp Typical SRAM Organization: 16-word x 4-bit Din 3 Din 2 Din 1 Din 0 WrEn Precharge A0 Word 0 SRAM Cell SRAM Cell SRAM Cell SRAM Cell A1 Address Decoder A2 Word 1 SRAM Cell SRAM Cell SRAM Cell SRAM Cell A3 : : : : Word 15 SRAM Cell SRAM Cell SRAM Cell SRAM Cell Q: Which is longer: word line or bit line? Dout 3 Dout 2 Dout 1 Dout 0

  6. A N 2 words N x M bit SRAM WE_L OE_L D M Logic Diagram of a Typical SRAM • Write Enable is usually active low (WE_L) • Bi-directional data • A new control signal, output enable (OE_L) is needed • WE_L is asserted (Low), OE_L is disasserted (High) • D serves as the data input pin • WE_L is disasserted (High), OE_L is asserted (Low) • D is the data output pin • Both WE_L and OE_L are asserted: • Result is unknown. Don’t do that!!!

  7. Classical DRAM Organization (square) bit (data) lines r o w d e c o d e r Each intersection represents a 1-T DRAM Cell RAM Cell Array word (row) select Column Selector & I/O Circuits row address Column Address • Row and Column Address together: • Select 1 bit a time data

  8. Logic Diagram of a Typical DRAM RAS_L CAS_L WE_L OE_L A 256K x 8 DRAM D 9 8 • Control Signals (RAS_L, CAS_L, WE_L, OE_L) are all active low • Din and Dout are combined (D): • WE_L is asserted (Low), OE_L is disasserted (High) • D serves as the data input pin • WE_L is disasserted (High), OE_L is asserted (Low) • D is the data output pin • Row and column addresses share the same pins (A) • RAS_L goes low: Pins A are latched in as row address • CAS_L goes low: Pins A are latched in as column address • RAS/CAS edge-sensitive

  9. Main Memory Performance • Wide: • CPU/Mux 1 word; Mux/Cache, Bus, Memory N words (Alpha: 64 bits & 256 bits) • Interleaved: • CPU, Cache, Bus 1 word: Memory N Modules(4 Modules); example is word interleaved • Simple: • CPU, Cache, Bus, Memory same width (32 bits)

  10. Main Memory Performance Cycle Time Access Time Time • DRAM (Read/Write) Cycle Time >> DRAM (Read/Write) Access Time • ­ 2:1; why? • DRAM (Read/Write) Cycle Time : • How frequent can you initiate an access? • Analogy: A little kid can only ask his father for money on Saturday • DRAM (Read/Write) Access Time: • How quickly will you get what you want once you initiate an access? • Analogy: As soon as he asks, his father will give him the money • DRAM Bandwidth Limitation analogy: • What happens if he runs out of money on Wednesday?

  11. Increasing Bandwidth - Interleaving Access Pattern without Interleaving: CPU Memory D1 available Start Access for D1 Start Access for D2 Memory Bank 0 Access Pattern with 4-way Interleaving: Memory Bank 1 CPU Memory Bank 2 Memory Bank 3 Access Bank 1 Access Bank 0 Access Bank 2 Access Bank 3 We can Access Bank 0 again

  12. address address address address 0 1 4 5 2 3 9 8 6 7 12 13 10 11 14 15 Bank 1 Bank 0 Bank 2 Bank 3 Main Memory Performance(Figure 5.32, pg. 432) • How long does it take to send 4 words? • Timing model • 1 cycle to send address, • 6 cycles to access data + 1 cycle to send data • Cache Block is 4 words • Simple Mem. = 4 x (1+6+1) = 32 • Wide Mem. = 1 + 6 + 1 = 8 • Interleaved Mem. = 1 + 6 + 4x1 = 11

  13. Wider vs. Interleaved Memory • Wider memory • Cost for the wider connection • Need a multiplexor between cache and CPU • May lead to significantly larger/more expensive memory • Intervleaved memory • Send address to several banks simultaneously • Want Number of banks > Number of clocks to access bank • As the size of memory chips increases, it becomes difficult to interleave memory • Difficult to expand memory by small amounts • May not work well for non-sequential accesses

  14. Avoiding Bank Conflicts • Lots of banks int x[256][512]; for (j = 0; j < 512; j = j+1) for (i = 0; i < 256; i = i+1) x[i][j] = 2 * x[i][j]; • Even with 128 banks their are conflicts, since 512 is an even multiple of 128 • SW: loop interchange or declaring array not power of 2 • HW: Prime number of banks • bank number = address mod number of banks • address within bank = address / number of banks • modulo & divide per memory access? • address within bank = address mod number words in bank (3, 7, 31)

  15. N cols RAS_L Page Mode DRAM: Motivation Column Address • Regular DRAM Organization: • N rows x N column x M-bit • Read & Write M-bit at a time • Each M-bit access requiresa RAS / CAS cycle • Fast Page Mode DRAM • N x M “register” to save a row DRAM Row Address N rows M bits M-bit Output 1st M-bit Access 2nd M-bit Access CAS_L A Row Address Col Address Junk Row Address Col Address Junk

  16. N cols Fast Page Mode Operation Column Address • Fast Page Mode DRAM • N x M “SRAM” to save a row • After a row is read into the register • Only CAS is needed to access other M-bit blocks on that row • RAS_L remains asserted while CAS_L is toggled DRAM Row Address N rows N x M “SRAM” M bits M-bit Output 1st M-bit Access 2nd M-bit 3rd M-bit 4th M-bit RAS_L CAS_L A Row Address Col Address Col Address Col Address Col Address

  17. Fast Memory Systems: DRAM specific • Multiple column accesses: page mode • DRAM buffers a row of bits inside the DRAM for column access (e.g., 16 Kbits row for 64 Mbits DRAM) • Allow repeated column access without another row access • 64 Mbit DRAM: cycle time = 90 ns, optimized access = 25 ns • New DRAMs: what will they cost, will they survive? • Synchronous DRAM: Provide a clock signal to DRAM, transfer synchronous to system clock • RAMBUS: startup company; reinvent DRAM interface • Each chip a module vs. slice of memory • Short bus between CPU and chips • Does own refresh • Variable amount of data returned • 1 byte / 2 ns (500 MB/s per chip) • Niche memory or main memory? • e.g., Video RAM for frame buffers, DRAM + fast serial output

  18. Main Memory Summary • Wider Memory • Interleaved Memory: for sequential or independent accesses • Avoiding bank conflicts: SW & HW • DRAM specific optimizations: page mode & Specialty DRAM

More Related