Lecture 15 Main Memory




CS510 Computer Architectures


Main Memory Background

  • Performance of Main Memory:

    • Latency: determines the cache miss penalty

      • Access Time (AT): time between the request and the arrival of the word

      • Cycle Time (CT): minimum time between successive requests

    • Bandwidth: matters for I/O and for the miss penalty of large blocks (L2)

  • Main Memory, organized as a 2D matrix, is DRAM:

    • Dynamic, since it needs to be refreshed periodically (every 8 ms)

      • AT and CT differ: AT < CT

    • Addresses are divided into 2 halves, which are multiplexed onto the memory's address pins:

      • RAS or Row Access Strobe

      • CAS or Column Access Strobe

  • Cache uses SRAM:

    • No refresh (6 transistors/bit vs. 1 transistor/bit for DRAM)

      • No difference between AT and CT: AT = CT

    • Address is not divided
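The address multiplexing described above can be illustrated with a small sketch. It assumes, purely for illustration, that the upper half of the address is the row (latched by RAS) and the lower half is the column (latched by CAS); the type `dram_addr` and the function `split_dram_address` are hypothetical names, not part of the lecture.

```c
#include <assert.h>
#include <stdint.h>

/* The two halves of a DRAM cell address (hypothetical type). */
typedef struct {
    uint32_t row;     /* sent first, latched by RAS */
    uint32_t column;  /* sent second, latched by CAS */
} dram_addr;

/* Split addr into two halves of half_bits bits each.
   Assumption: the upper half is the row, the lower half the column. */
dram_addr split_dram_address(uint32_t addr, unsigned half_bits) {
    assert(half_bits > 0 && half_bits <= 16);
    uint32_t mask = (1u << half_bits) - 1u;
    dram_addr out;
    out.row = (addr >> half_bits) & mask;
    out.column = addr & mask;
    return out;
}
```

With 8-bit halves, for example, address 0xABCD would be sent as row 0xAB followed by column 0xCD over the same 8 address pins.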


Main Memory Background

  • Capacity: DRAM/SRAM ≈ 4 to 8

  • Cost and cycle time: SRAM/DRAM ≈ 8 to 16

  • DRAM capacity: grows 4 times every 3 years, or about 60% per year

  • RAS access time: improves about 7% per year


Main Memory Organization

[Figure: three organizations, each showing CPU, cache, bus, and memory: 1-word-wide memory; wide memory, with a MUX between the CPU and the cache; and interleaved memory, with banks 0 to 3.]

  • Simple:

    • CPU, cache, bus, and memory are all the same width (32 bits)

  • Wide:

    • CPU to MUX is 1 word; MUX to cache, bus, and memory are N words (Alpha: 64 bits and 256 bits)

  • Interleaved:

    • CPU, cache, and bus are 1 word wide; memory has N modules (4 in the figure), shown word-interleaved


Main Memory Performance

Word addresses in a 4-way word-interleaved memory:

Bank 0   Bank 1   Bank 2   Bank 3
   0        1        2        3
   4        5        6        7
   8        9       10       11
  12       13       14       15

Timing model

  • 1 cycle to send the address,

  • 6 cycles of access time,

  • 1 cycle to send a word of data

  • Block access time (miss penalty, M.P.)

    • Assuming the cache block is 4 words

Simple M.P. = 4 x (1 + 6 + 1) = 32 cycles

Wide M.P. = 1 + 6 + 1 = 8 cycles

Interleaved M.P. = 1 + 6 + (4 x 1) = 11 cycles
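The three miss penalties can be reproduced with a short sketch of the timing model. The function names are hypothetical, and the interleaved case assumes there are at least as many banks as words in the block, so that all bank accesses overlap.

```c
#include <assert.h>

/* Cycle counts taken from the timing model above. */
enum { SEND_ADDR = 1, ACCESS = 6, SEND_DATA = 1 };

/* 1-word-wide memory: every word pays the full address/access/transfer cost. */
int simple_miss_penalty(int block_words) {
    assert(block_words > 0);
    return block_words * (SEND_ADDR + ACCESS + SEND_DATA);
}

/* Wide memory: bus and memory are width_words wide, so each
   address/access/transfer sequence moves width_words words at once. */
int wide_miss_penalty(int block_words, int width_words) {
    assert(block_words > 0 && width_words > 0);
    int transfers = (block_words + width_words - 1) / width_words;
    return transfers * (SEND_ADDR + ACCESS + SEND_DATA);
}

/* Interleaved memory: one address is sent, the banks access in parallel,
   then the words return one per cycle over the 1-word bus.
   Assumption: number of banks >= block_words. */
int interleaved_miss_penalty(int block_words) {
    assert(block_words > 0);
    return SEND_ADDR + ACCESS + block_words * SEND_DATA;
}
```

For a 4-word block, `simple_miss_penalty(4)`, `wide_miss_penalty(4, 4)`, and `interleaved_miss_penalty(4)` give the 32, 8, and 11 cycles listed above.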


Technique for Higher BW: 1. Wider Main Memory

  • Alpha AXP 21064: 256-bit-wide L2, memory bus, and memory

  • Drawbacks

    • expandability

      • doubling the width requires doubling the minimum capacity increment

    • bus width

      • a multiplexer is needed to select the desired word from a block

    • error correction: keep separate error correction for every 32 bits

      • otherwise, on a WRITE: read the block -> modify the word -> calculate the new ECC -> store


Technique for Higher BW: 2. Interleaved Memory

Interleaved Memory and Wide Memory

  • Consider the following description of a machine and its cache performance:

    • memory bus width = 1 word = 32 bits

    • memory accesses per instruction = 1.2

    • cache miss penalty = 8 cycles (1 + 6 + 1)

    • average CPI (ignoring cache misses) = 2

    • miss rate by block size:

      Block size (words)   1   2   4
      Miss rate (%)        3   2   1

  • What is the improvement in performance over the base machine (block size = 1 word) from 2-way and 4-way interleaving versus doubling the width of the memory and the bus?


    Interleaved Memory

    Answer

    • Effective CPI = base CPI + (memory references/instruction x miss rate x miss penalty)

      • 2 + (1.2 x (0.03 for 1-word blocks, 0.02 for 2-word blocks, or 0.01 for 4-word blocks) x miss penalty)

    • CPI for the base machine (BM): simple memory, 1-word blocks

      • 2 + (1.2 x 0.03 x 8) = 2.288

    • 2-word blocks

      • 32-bit bus and memory, no interleaving: 2 + (1.2 x 0.02 x (2 x 8)) = 2.384, slower than BM

      • 32-bit bus and memory, interleaving: 2 + (1.2 x 0.02 x (1 + 6 + (2 x 1))) = 2.216, faster than BM

      • 64-bit bus and memory, no interleaving: 2 + (1.2 x 0.02 x 8) = 2.192, faster than BM

    • 4-word blocks

      • 32-bit bus and memory, no interleaving: 2 + (1.2 x 0.01 x (4 x 8)) = 2.384, slower than BM

      • 32-bit bus and memory, interleaving: 2 + (1.2 x 0.01 x (1 + 6 + (4 x 1))) = 2.132, faster than every 2-word option

      • 64-bit bus and memory, no interleaving: 2 + (1.2 x 0.01 x (2 x 8)) = 2.192, same as the 2-word, 64-bit case
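The CPI figures above all come from one formula, which can be checked with a minimal sketch. The function name `effective_cpi` and its parameter names are hypothetical; the miss penalty is passed in as a cycle count so that the sketch stays independent of any particular memory organization.

```c
#include <assert.h>

/* Effective CPI = base CPI + references/instruction x miss rate x miss penalty.
   miss_rate is a fraction (0.03 for 3%), miss_penalty_cycles is in clock cycles. */
double effective_cpi(double base_cpi, double refs_per_instr,
                     double miss_rate, double miss_penalty_cycles) {
    assert(miss_rate >= 0.0 && miss_rate <= 1.0);
    return base_cpi + refs_per_instr * miss_rate * miss_penalty_cycles;
}
```

For instance, `effective_cpi(2.0, 1.2, 0.03, 8)` corresponds to the base machine (2.288), and `effective_cpi(2.0, 1.2, 0.01, 11)` to 4-word blocks with interleaving (2.132), up to floating-point rounding.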


    Technique for Higher BW: 3. Independent Memory Banks

    [Figure: address fields. The address is split into a superbank number and a superbank offset; the superbank offset is split into a bank number and a bank offset.]

    • Interleaved memory: faster sequential accesses

    • Independent memory banks: faster independent accesses

    • Interleaving raises bandwidth for sequential accesses by placing sequential addresses in sequential banks; all banks share the address lines

    • Memory banks for independent accesses: each bank has its own bank controller and separate address lines

      • e.g., 1 bank for I/O, 1 bank for cache reads, 1 bank for cache writes

      • If 1 controller controls all the banks, it can provide fast access for only one operation at a time

      • Independent banks also benefit miss-under-miss in non-blocking caches

    • Superbank: all memory banks active on one block transfer

    • Bank: portion within a superbank that is word-interleaved


    Independent Memory Banks

    • How many banks?

      • For sequential accesses, the goal is for a new bank to deliver a word on each clock

      • For sequential accesses, number of banks ≥ number of clocks to access a word in a bank

      • Otherwise, the access stream returns to the original bank before it has the next word ready

    • Increasing the capacity of a DRAM chip => fewer chips are needed to build a memory system of the same capacity => harder to have many banks
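One rough way to see the bank-count rule is to simulate a sequential access stream and count the stalls. The sketch below is a simplified model, not the lecture's: it assumes word-interleaved banks, at most one access issued per clock, and a bank that stays busy for `bank_busy` clocks after an access starts. The name `stall_cycles` and the `MAX_BANKS` limit are hypothetical.

```c
#include <assert.h>

#define MAX_BANKS 64  /* arbitrary limit for this sketch */

/* Count the clocks lost waiting for a busy bank while issuing
   n_words sequential word accesses to word-interleaved banks. */
long stall_cycles(int n_words, int banks, int bank_busy) {
    assert(n_words >= 0 && bank_busy > 0);
    assert(banks > 0 && banks <= MAX_BANKS);
    long free_at[MAX_BANKS] = {0};  /* clock at which each bank is ready again */
    long clock = 0, stalls = 0;
    for (int k = 0; k < n_words; k++) {
        int b = k % banks;          /* word k lives in bank k MOD banks */
        if (free_at[b] > clock) {   /* returned to a bank that is still busy */
            stalls += free_at[b] - clock;
            clock = free_at[b];
        }
        free_at[b] = clock + bank_busy;
        clock += 1;                 /* at most one new access per clock */
    }
    return stalls;
}
```

Under this model, a 6-clock bank access needs at least 6 banks: `stall_cycles(16, 6, 6)` and `stall_cycles(16, 8, 6)` report no stalls, while `stall_cycles(16, 4, 6)` does, because the stream wraps around to bank 0 after only 4 clocks.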


    Technique for Higher BW: 4. Avoiding Bank Conflicts

    Even with a lot of banks, certain regular access patterns still cause bank conflicts

    - e.g., storing a 256 x 512 array in 128 banks and processing it by column (512 is an exact multiple of 128)

    [Figure: the elements of x, indexed (row, column) from 0,0 to 255,511, laid out across the banks.]

    int x[256][512];
    int i, j;

    /* Column processing: the inner loop walks down one column */
    for (j = 0; j < 512; j = j+1)
        for (i = 0; i < 256; i = i+1)
            x[i][j] = 2 * x[i][j];

    The inner loop processes a column, which causes bank conflicts

    All elements of a column are in the same bank
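The conflict can be checked numerically with a small sketch. It assumes the array is stored in row-major order (as in C), that each element is one word, and that banks are word-interleaved (bank = word address MOD number of banks). The helper names `bank_of` and `banks_touched_by_column` are hypothetical.

```c
#include <assert.h>

/* Bank holding element x[i][j] of a row-major array with `cols` columns.
   Assumption: one word per element, bank = word address MOD banks. */
int bank_of(int i, int j, int cols, int banks) {
    assert(cols > 0 && banks > 0);
    return (i * cols + j) % banks;
}

/* Number of distinct banks touched while walking down column j. */
int banks_touched_by_column(int rows, int cols, int banks, int j) {
    enum { MAX_BANKS = 1024 };  /* arbitrary limit for this sketch */
    assert(banks > 0 && banks <= MAX_BANKS);
    char seen[MAX_BANKS] = {0};
    int distinct = 0;
    for (int i = 0; i < rows; i++) {
        int b = bank_of(i, j, cols, banks);
        if (!seen[b]) {
            seen[b] = 1;
            distinct++;
        }
    }
    return distinct;
}
```

With 512 columns and 128 banks, a whole column maps to a single bank, so `banks_touched_by_column(256, 512, 128, 0)` is 1. With a row length of 513 words instead, the same walk spreads over all 128 banks.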


    Avoiding Bank Conflicts

    • SW approaches

      • Loop interchange, to avoid repeatedly accessing the same bank

      • Declare the array with a dimension that is not a power of 2 (the number of banks is a power of 2), so that the elements of a column are spread across different banks

    • HW: Prime number of banks

      • bank number = address MOD number of banks

      • address within bank = address / number of banks

        • To avoid a divide on every memory access:

          address within bank = address MOD number of words in bank (e.g., 3 = 31 MOD 7)

      • Number of banks? Words per bank?

        • The MOD operations are easy if both are powers of 2
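The two software approaches listed above can be sketched on the 256 x 512 example. This is an illustration only: the function names and the `PADDED_COLS` constant are hypothetical, and one extra column is just one possible way to make the row length a non-power of 2.

```c
#include <assert.h>
#include <stddef.h>

#define ROWS 256
#define COLS 512
#define PADDED_COLS (COLS + 1)  /* 513 words per row: not a multiple of 128 */

/* Loop interchange: the inner loop now walks along a row, so consecutive
   accesses go to consecutive word addresses (and consecutive banks). */
void double_by_rows(int x[ROWS][COLS]) {
    assert(x != NULL);
    for (int i = 0; i < ROWS; i++)
        for (int j = 0; j < COLS; j++)
            x[i][j] = 2 * x[i][j];
}

/* Padding: column processing is kept, but each row carries one unused
   extra element, so the elements of a column fall in different banks. */
void double_by_columns_padded(int x[ROWS][PADDED_COLS]) {
    assert(x != NULL);
    for (int j = 0; j < COLS; j++)       /* the padding column is skipped */
        for (int i = 0; i < ROWS; i++)
            x[i][j] = 2 * x[i][j];
}
```

Both functions compute the same result as the original column-processing loop; only the order of memory accesses, or the layout of the array in memory, changes.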


    Fast Bank Number

    Chinese Remainder Theorem

    As long as two sets of integers ai and bi follow these rules:

      bi = x MOD ai,   0 ≤ bi < ai,   0 ≤ x < a0 × a1 × a2 × ...

    and ai and aj are co-prime if i ≠ j, then the integer x has only one solution (unambiguous mapping):

    • bank number = b0 = x MOD a0

      • number of banks = a0 (= 3 in the example), 0 ≤ b0 < a0

    • address within a bank = b1 = x MOD a1

      • size of a bank = a1 (= 8 in the example)

    • N words have addresses 0 to N-1

      • prime number of banks (3)

      • words per bank is a power of 2 (8)


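    Fast Bank Numbers

    Addresses stored in 3 banks of 8 words each:

                     Seq. Interleaved       Modulo Interleaved
    Bank Number:      0     1     2          0     1     2
    Addr in Bank
         0            0     1     2          0    16     8
         1            3     4     5          9     1    17
         2            6     7     8         18    10     2
         3            9    10    11          3    19    11
         4           12    13    14         12     4    20
         5           15    16    17         21    13     5
         6           18    19    20          6    22    14
         7           21    22    23         15     7    23

    Example: Address = 5

    Bank # = 5 MOD 3 = 2

    Seq. interleaved: address in bank = 5 / 3 = 1

    Modulo interleaved: address in bank = 5 MOD 8 = 5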
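The modulo-interleaved mapping in the table can be expressed as a short sketch, together with a brute-force check that the mapping is unambiguous. The names `bank_location`, `modulo_interleave`, and `mapping_is_unique` are hypothetical, and the `MAX_WORDS` limit exists only to keep the check simple.

```c
#include <assert.h>

typedef struct {
    unsigned bank;    /* b0 = x MOD number of banks */
    unsigned offset;  /* b1 = x MOD words per bank  */
} bank_location;

/* Modulo interleaving: both fields are plain remainders, no divide needed. */
bank_location modulo_interleave(unsigned address, unsigned num_banks,
                                unsigned words_per_bank) {
    assert(num_banks > 0 && words_per_bank > 0);
    assert(address < num_banks * words_per_bank);
    bank_location loc;
    loc.bank = address % num_banks;
    loc.offset = address % words_per_bank;  /* a bit mask if a power of 2 */
    return loc;
}

/* Returns 1 if every address 0..N-1 lands in a different (bank, offset) slot,
   which the Chinese Remainder Theorem guarantees for co-prime sizes. */
int mapping_is_unique(unsigned num_banks, unsigned words_per_bank) {
    enum { MAX_WORDS = 4096 };  /* arbitrary limit for this sketch */
    unsigned n = num_banks * words_per_bank;
    assert(n > 0 && n <= MAX_WORDS);
    char used[MAX_WORDS] = {0};
    for (unsigned x = 0; x < n; x++) {
        bank_location loc = modulo_interleave(x, num_banks, words_per_bank);
        unsigned slot = loc.bank * words_per_bank + loc.offset;
        if (used[slot])
            return 0;  /* two addresses collided in the same slot */
        used[slot] = 1;
    }
    return 1;
}
```

For address 5 with 3 banks of 8 words, this gives bank 2 and offset 5, matching the table. With 4 banks of 8 words the sizes are not co-prime and the check fails: addresses 0 and 8 would both land in bank 0 at offset 0.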


    Technique for Higher BW: 5. DRAM Specific Interleaving

    • DRAM access: Row Access (RAS) and Column Access (CAS)

    • Multiple column accesses to the row buffer after a single RAS: goes by several names (e.g., page mode)

      • 64 Mbit DRAM: cycle time = 100 ns, page mode = 20 ns

    • New DRAMs to address the CPU-DRAM speed gap; what will they cost, and will they survive?

      • Synchronous DRAM: provides a clock signal to the DRAM, so transfers are synchronous to the system clock

      • RAMBUS: a startup company that reinvented the DRAM interface

        • Each chip acts as a module rather than a slice of memory (or bank)

        • Short bus between CPU and chips

        • Does its own refresh

        • Variable amount of data returned

        • 1 byte / 2 ns (500 MB/s per chip)

    • Niche memory only, or main memory?

      • e.g., Video RAM for frame buffers: DRAM + fast serial output


    Main Memory Summary

    • Wider Memory: for independent access

    • Interleaved Memory: for sequential or independent accesses

    • Avoiding bank conflicts: SW & HW

    • DRAM specific optimizations: page mode & Specialty DRAM
