
Lecture 14 Advanced Memory Systems



Presentation Transcript


  1. Lecture 14: Advanced Memory Systems

  2. Lecture 14: Advanced Memory Systems
  In this lecture, we will study:
  • Multiple-module memory systems
  • Memory address interleaving
  • Associative memory systems
  • Hierarchical storage systems
  • Cache memory systems
  • Virtual memory systems

  3. Multiple-Module Memory and Address Interleaving

  4. Memory Characteristics
  [Figure: a memory module with MAR, address decoder, cell array, sense amplifiers, and MBR]
  Memory access is a sequence of events:
  [1] Decode the address
  [2] Access the cell array and sense/amplify the signals
  [3] Send the data to the MBR
  If there are 3 such memory modules in the memory system, each module can be made to perform a different phase of memory access at the same time, resulting in a faster memory.

  5. Multiple-Module Memory
  [Figure: three memory modules U0, U1, U2, each with its own MAR and MBR, connected to a shared memory address bus and memory data bus]
  A 3-module memory can be made to operate so that while U0 decodes an address, U1 accesses its cell array, and U2 sends data to its MBR.
  Such a mode of operation is possible only if addresses are interleaved.

  6. Address Interleaving
  [Figure: address 1001100XX, where the low-order bits XX = 00, 01, 10, 11 select the module that decodes the address, accesses its cell array, and sends the data to the MBR]
  • For a 4-module memory system with modules U0, U1, U2, U3, addresses are called "interleaved" if consecutive addresses are directed to different modules, such as U0, U1, U2, U3, U0, U1, U2, U3, U0, and so on.
  • Interleaved addresses give the best performance.
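
As a rough illustration of low-order interleaving (not taken from the slides), here is a minimal C sketch in which the two least significant bits of a word address select one of four modules, so consecutive addresses fall on consecutive modules; the function names are made up for this example.

```c
#include <stdio.h>

/* Low-order interleaving for a 4-module memory: the two least
 * significant address bits select the module, so consecutive
 * addresses are directed to consecutive modules. */
#define NUM_MODULES 4

unsigned module_of(unsigned addr)            { return addr % NUM_MODULES; }
unsigned offset_within_module(unsigned addr) { return addr / NUM_MODULES; }

int main(void) {
    for (unsigned addr = 0; addr < 8; addr++)
        printf("address %u -> module U%u, internal address %u\n",
               addr, module_of(addr), offset_within_module(addr));
    return 0;
}
```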

  7. Address Interleaving and the Instruction Pipeline
  Example: consider a GPR computer architecture and R-M instruction execution in a 4-stage pipeline:
  • IF stage: instruction fetch and PC update (memory access)
  • IP stage: instruction decode, calculation of the EA, register fetch
  • OF stage: memory operand fetch, or data store (memory access)
  • EX stage: perform the operation and store the result in a register

  8. 2-Module Memory and Instruction Execution Pipeline
  [Timing diagram: clocks 0-7; the IF stage alternates between modules U0 (IF0, IF2, IF4, IF6) and U1 (IF1, IF3, IF5, IF7); the IP stage runs IP0-IP5; the OF stage alternates between modules V0 (OF0, OF2, OF4) and V1 (OF1, OF3); the EX stage runs EX0-EX2]
  • Separate instruction (U) and data (V) memories
  • Each memory consists of 2 modules (U0, U1 and V0, V1, respectively), making the effective memory access time half the access time of each module
  • Memory access time = 2 clock periods
  • 2-module memory effective access time = 1 clock period

  9. Multi-Module Memory and Instruction Execution Pipeline
  [Timing diagram: clocks 0-7; the IF and OF stages share the four modules, with U0 handling IF0, OF0, IF4, OF4; U1 handling IF1, OF1, IF5; U2 handling IF2, OF2, IF6; U3 handling IF3, OF3, IF7; the IP stage runs IP0-IP5 and the EX stage runs EX0-EX2]
  • 4-module memory, with instructions and data in the same memory

  10. Multi-Module Memory Performance
  • In practice, multi-module memory does not perform as well as shown in the previous slides.
  • When the addresses generated for memory accesses are completely sequential (a completely interleaved sequence of addresses), it gives the ideal performance.
  • It can give good performance for vector processing, since vector elements are stored at sequential addresses.
  • However, in practical programs this rarely happens:
  • Data are frequently processed in random order.
  • Programs usually run sequentially, but branches, conditional branches, procedure calls/returns, loops, etc. prevent purely sequential instruction address generation.
  • Statistics: for an n-module memory, the performance improvement is about the square root of n.

  11. Associative Memory (Content Addressable Memory)

  12. Associative Memory
  [Figure: a memory word; all words are searched in parallel for those that match the given value]
  • Memory that is accessed by the contents of the stored information: Content Addressable Memory (CAM)
  • Example: find the words that contain a given value, so that the remainder of the information in those words can be accessed
  • Very useful for table search operations
  • Search modes: bit-serial or bit-parallel
  • Match criteria: =, >, ≥, <, ≤, ...
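
A minimal software model of a CAM equality search, assuming masked comparison against a search-data register and a mask register as in the figure; real CAM hardware compares every word in parallel in one cycle, while this sketch simply loops. All names are illustrative.

```c
#include <stdint.h>
#include <stdio.h>

#define WORDS 8

/* Software model of a CAM equality search: a word matches when the
 * bits selected by the mask register equal the search data register. */
void cam_search(const uint32_t mem[WORDS], uint32_t search, uint32_t mask,
                int match[WORDS]) {
    for (int i = 0; i < WORDS; i++)
        match[i] = ((mem[i] & mask) == (search & mask));
}

int main(void) {
    uint32_t mem[WORDS] = {0x12AB, 0x34AB, 0x56CD, 0x78AB, 0, 1, 2, 3};
    int match[WORDS];
    cam_search(mem, 0x00AB, 0x00FF, match);   /* match on low byte == 0xAB */
    for (int i = 0; i < WORDS; i++)
        if (match[i]) printf("word %d matches\n", i);
    return 0;
}
```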

  13. Organization of CAM
  [Figure: search data register C (bit Cj), mask register M (bit Mj), match indicators I (bit Ii), and words W0, W1, W2, ..., Wi, ..., Wn-1 with cell bij]

  14. Hierarchical Storage

  15. Storage Hierarchy
  • A computer system contains several storage systems M0, M1, M2, ..., Mn-1 with the following characteristics:
  Access time: t0 < t1 < t2 < ... < tn-1
  Capacity: S0 < S1 < S2 < ... < Sn-1
  Cost/bit: Q0 > Q1 > Q2 > ... > Qn-1
  • The combined memory system is a hierarchical storage system that appears to have:
  Access time: t0, Capacity: Sn-1, Cost/bit: Qn-1
  • Example: Cache - Main Memory - Secondary Storage (Disk)
  Access time: tC, Capacity: SD, Cost/bit: QD

  16. Hierarchical Storage
  In a Main Memory - Secondary Storage hierarchy, the speed of a program is limited by the access times of:
  • Main Memory: the executing program and its data are accessed from main memory
  • Secondary Storage: not all the information needed by the program can be stored in main memory
  • Cache: a storage faster than main memory, placed on top of main memory to relieve its access-time limitation
  Resulting hierarchy (the memory access time looks faster): Access time: tC (< tM), Capacity: SM, Cost/bit: QM
  • Virtual storage system: a larger-capacity, low-cost secondary storage placed under the main memory
  Resulting hierarchy (the memory capacity looks larger): Access time: tM, Capacity: SS, Cost/bit: QS

  17. Cache Storage

  18. Cache Storage System
  • Locality of reference: during execution of a program, it is highly probable that recently executed instructions and recently used data will be used again soon.
  • Temporal locality: recently used information will probably be reused.
  • Spatial locality: adjacent information will probably be accessed within a short time.
  • Organization of information:
  • Both the cache and main memory are organized into blocks of the same size, each a group of contiguous words.
  • Information is exchanged between the cache and main memory in units of a block.

  19. Cache Storage System Operation
  [Figure: the program issues a memory address; on a hit the cache is accessed directly, on a miss a victim block is decided and block B is exchanged between the cache and main memory]

  20. Mapping of Memory Blocks into Cache Blocks
  • Direct mapping: cache block number = (MM block number) MOD (number of cache blocks). Each MM block is associated with a unique cache block.
  • Set-associative mapping: cache set number = (MM block number) MOD (number of sets in the cache). Each MM block is associated with a unique cache set, which consists of n blocks, where n is the set size.
  • Fully associative mapping: each MM block can occupy any cache block.
  • Example: cache of 6 blocks and 3 sets; MM block 8 = ?
  Direct: 8 MOD 6 = 2; Set associative: 8 MOD 3 = set 2; Fully associative: any of blocks 0 ~ 5
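
The two MOD rules translate directly into code. This sketch reproduces the slide's example (6 cache blocks, 3 sets, MM block 8); the constants and function names are purely illustrative.

```c
#include <stdio.h>

/* Cache placement rules from the slide (6 cache blocks, 3 sets). */
#define CACHE_BLOCKS 6
#define CACHE_SETS   3

unsigned direct_block(unsigned mm_block) { return mm_block % CACHE_BLOCKS; }
unsigned set_of_block(unsigned mm_block) { return mm_block % CACHE_SETS; }
/* Fully associative: any of the CACHE_BLOCKS blocks may hold it. */

int main(void) {
    unsigned b = 8;
    printf("direct mapping:    MM block %u -> cache block %u\n", b, direct_block(b));
    printf("set associative:   MM block %u -> set %u\n", b, set_of_block(b));
    printf("fully associative: MM block %u -> any of blocks 0..%d\n", b, CACHE_BLOCKS - 1);
    return 0;
}
```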

  21. Mapping of Memory Blocks into Cache Blocks
  [Figure: memory blocks 0-17 mapped onto a 6-block cache under direct mapping (block 8 -> cache block 8 MOD 6 = 2), fully associative mapping (any cache block), and set-associative mapping with set size = 2 (set associativity = 2)]

  22. Main Memory Address
  [Figure: a main memory address divided into a block address (TAG, SET) and a word address]
  • A main memory address consists of a block address and a word address.
  • The block address consists of a TAG and a SET address.
  • A block of data in the cache is stored along with its TAG.
  • Example: 1 Mbyte main memory, 1 word = 4 bytes, 8 Kbyte cache, set associativity = 2, block size = 32 words
  Word address: 1 Mbyte -> 20-bit byte address; 1 word = 4 bytes -> 2-bit byte offset; word address within a block -> 2 + log2 32 = 7 bits
  SET address: 8 Kbyte cache -> 64 blocks -> 32 sets -> SET = 5 bits
  TAG address: 20 - (7 + 5) = 8 bits
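
A worked bit-field split for the slide's example parameters; the helper names are made up, but the widths (7-bit offset, 5-bit SET, 8-bit TAG) follow the arithmetic above.

```c
#include <stdio.h>

/* Field widths for the slide's example:
 *   1 Mbyte memory -> 20-bit byte address
 *   block = 32 words x 4 bytes = 128 bytes -> 7-bit offset
 *   8 Kbyte cache / 128-byte blocks = 64 blocks; 2-way -> 32 sets -> 5-bit SET
 *   TAG = 20 - 7 - 5 = 8 bits                                                 */
#define OFFSET_BITS 7
#define SET_BITS    5

unsigned offset_of(unsigned addr) { return addr & ((1u << OFFSET_BITS) - 1); }
unsigned set_of(unsigned addr)    { return (addr >> OFFSET_BITS) & ((1u << SET_BITS) - 1); }
unsigned tag_of(unsigned addr)    { return addr >> (OFFSET_BITS + SET_BITS); }

int main(void) {
    unsigned addr = 0xABCDE;      /* any 20-bit byte address */
    printf("TAG=0x%X SET=%u OFFSET=%u\n", tag_of(addr), set_of(addr), offset_of(addr));
    return 0;
}
```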

  23. Direct Mapping Cache
  [Figure: the CPU address is split into TAG, SET, and Word fields; SET selects a cache block, the stored TAG is compared with the address TAG, and on a hit a MUX selects the word]
  • In direct mapping, SET = block number.
  • Access the cache with SET to read the stored TAG.
  • If Address(TAG) = Cache(TAG), it is a hit; otherwise it is a miss.
  • If hit, check V: if V = 1, read the block into the MUX to select the word; otherwise wait until V = 1.
  • If miss, find a victim block for replacement.
  • V: validity bit. Even on a hit, V may not be 1: the cache is loaded a block at a time, so while a block is being loaded some of its words may be valid and some may not.

  24. Set Associative Mapping Cache
  [Figure: the address (TAG, SET, WD) indexes two directories D0, D1 and two cache ways C0, C1; the stored TAGs are compared in parallel to produce Hit 0 or Hit 1, and a MUX selects the data]
  • Set associativity = 2.
  • SET is sent to both the directory and the cache.
  • The directory determines hit 0, hit 1, or miss.
  • On a hit, the concatenation of SET and WD accesses the word in the corresponding cache way if V = 1.

  25. 4 Algorithms for Cache Miss
  Upon a miss, we need four decisions:
  • Selecting a victim block (2 cases):
  If there is a free block in the cache, load the block from memory and update the cache directory.
  If there is no free block, select a victim block in the cache to exchange with the block on demand, and update the cache directory.
  • Retiring the victim block.
  • When can we access the information in the demanded block? From memory, or only after it is loaded into the cache?
  • On memory writes, do we have to bring the block into the cache? Write into memory directly, or bring the block into the cache and then write?

  26. (1) Selecting a Victim Block: Simple Cases
  • For a direct mapping cache this is straightforward, since the cache block that must receive the block on demand is unique.
  • In set associative (and fully associative) mapping, the candidate victim block is not unique; the number of candidates equals the set size.
  • If the set size is small, say 2, selection is simple: based on temporal locality, the block that was not referenced last becomes the victim.
  • In that case, a 1-bit last-reference (L) bit can be attached to each cache block in the directory.
  • If the set size is greater than 2 but not large, a few bits can still be attached to each block in the directory to record its access history.

  27. Selecting a Victim Block: More Complex Cases
  • In general, cache performance improves as the cache becomes larger and as the set size becomes larger.
  • When the set size is large, the victim must be selected from a large number of candidate blocks.
  • Random: hashing is most popular for a fully associative mapping cache. If there are n candidate victims, apply a hashing function to the address to obtain one of the n. The algorithm is simple to implement in hardware, but it does not consider locality of reference.
  • LRU (least recently used), known to give the best performance: LRU selects the least recently used block among the candidates as the victim. A counter is needed for each block in the cache directory; the counters are reset periodically, and whenever a memory access hits a block, its counter advances by 1. Among the candidate blocks, the one whose counter holds the smallest number becomes the victim.
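
A sketch of LRU victim selection within one set using the per-block counters described above; the directory structure and field names are assumptions made for illustration, not a real directory format.

```c
#include <limits.h>
#include <stdio.h>

/* Per-block bookkeeping kept in the cache directory (illustrative). */
struct dir_entry {
    int valid;
    unsigned tag;
    unsigned use_count;   /* advanced on every hit, reset periodically */
};

/* Pick the LRU victim among the 'ways' candidate blocks of one set:
 * prefer a free (invalid) block; otherwise take the smallest counter. */
int select_victim(const struct dir_entry set[], int ways) {
    int victim = 0;
    unsigned best = UINT_MAX;
    for (int i = 0; i < ways; i++) {
        if (!set[i].valid)
            return i;                  /* free block: no replacement needed */
        if (set[i].use_count < best) {
            best = set[i].use_count;
            victim = i;
        }
    }
    return victim;
}

int main(void) {
    struct dir_entry set[2] = { {1, 0xAB, 7}, {1, 0xCD, 3} };
    printf("victim way = %d\n", select_victim(set, 2));   /* prints 1 */
    return 0;
}
```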

  28. (2) Retiring a Victim Block
  [Figure: the victim block in the cache and its original copy in memory]
  • If there were no write operations on the victim block, the information in the cache block and its original copy in memory are identical: the victim block is clean. There is no need to write it back to memory; simply overwrite the victim block with the demanded block.
  • If there were write operations on the victim block, the original copy in memory must be made up to date: the victim block is dirty (contaminated) and must be written back to memory.

  29. (3) Time to Access Information
  [Figure: CPU - Cache - Memory data paths for the two options]
  • Load the cache, then access it: accesses are always made to the cache. The miss penalty is large, since all the words in the block must be loaded before a word can be accessed, but the control is straightforward.
  • Access memory while loading the cache: the first access to the missing block is made to memory, and subsequent accesses go to the cache. The miss penalty is low, one memory access time.

  30. (4) Write Policy
  [Figure: CPU - Cache - Memory write paths for the two policies]
  • Write through (store through): on a hit, write both the cache and memory. On a write miss, either load the missing block and write both the cache and memory (write allocate), or do not load the cache and write only memory (no write allocate, "write wherever"). Cache and memory blocks are kept identical, so victim block retirement never needs a write-back (WB).
  • Write back: the write operation always takes place in the cache, so victim block retirement may require a WB operation; if the victim block is clean, no WB is needed. A "dirty bit" is included in the directory to record that information in the block was changed. Cache and memory blocks may hold inconsistent information.
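
A deliberately tiny model (one cache line and a 16-word memory) contrasting a write-through store with a write-back store; it is a sketch of the policies described above under these simplifying assumptions, not a full cache simulator.

```c
#include <stdbool.h>
#include <stdio.h>

#define MEM_WORDS 16
static unsigned memory[MEM_WORDS];
static struct { bool valid, dirty; unsigned addr, data; } line;   /* one cache line */

static bool hit(unsigned addr)  { return line.valid && line.addr == addr; }
static void fill(unsigned addr) { line.valid = true; line.dirty = false;
                                  line.addr = addr; line.data = memory[addr]; }

/* Write-through with optional write allocate: memory is always updated,
 * so the cache line never becomes dirty. */
void store_write_through(unsigned addr, unsigned data, bool write_allocate) {
    if (hit(addr)) line.data = data;
    else if (write_allocate) { fill(addr); line.data = data; }
    memory[addr] = data;                       /* memory always updated */
}

/* Write-back: only the cache is written and the line is marked dirty;
 * memory is updated later, when the dirty line is retired. */
void store_write_back(unsigned addr, unsigned data) {
    if (!hit(addr)) {
        if (line.valid && line.dirty) memory[line.addr] = line.data;  /* retire victim */
        fill(addr);
    }
    line.data = data;
    line.dirty = true;
}

int main(void) {
    store_write_through(3, 42, true);
    printf("write-through: memory[3]=%u dirty=%d\n", memory[3], line.dirty);
    store_write_back(5, 99);
    printf("write-back:    memory[5]=%u dirty=%d\n", memory[5], line.dirty);
    return 0;
}
```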

  31. Cache Storage Performance
  h: hit rate = (number of hits) / (number of memory references); m: miss rate = 1 - h
  Tc: cache access time, Tm: memory access time, Tcm: cache storage system access time
  Access memory while loading cache: Tcm = h x Tc + (1 - h) x Tm
  Load cache then access cache: Tcm = Tc + (1 - h) x Tm
  T: average instruction execution time: T = B + (1 - h) x r x (Tm - Tc), where B is the instruction execution time with an ideal cache and r is the average number of memory references per instruction
  Sc: CPU speed: Sc = 1/T = 1/[B + (1 - h) x r x (Tm - Tc)]
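
The formulas above translate directly into code; the numbers used in main are made up purely for illustration.

```c
#include <stdio.h>

/* Access time when memory is accessed while the cache is being loaded. */
double tcm_overlap(double h, double tc, double tm)    { return h * tc + (1.0 - h) * tm; }

/* Access time when the cache is loaded first and then accessed. */
double tcm_load_first(double h, double tc, double tm) { return tc + (1.0 - h) * tm; }

/* Average instruction time: T = B + (1 - h) * r * (Tm - Tc). */
double avg_instr_time(double B, double h, double r, double tc, double tm) {
    return B + (1.0 - h) * r * (tm - tc);
}

int main(void) {
    /* Illustrative numbers only: 95% hit rate, 1 ns cache, 10 ns memory. */
    double h = 0.95, tc = 1.0, tm = 10.0;
    double T = avg_instr_time(2.0, h, 1.3, tc, tm);   /* B = 2 ns, r = 1.3 */
    printf("Tcm (overlap)    = %.2f ns\n", tcm_overlap(h, tc, tm));
    printf("Tcm (load first) = %.2f ns\n", tcm_load_first(h, tc, tm));
    printf("T = %.2f ns, speed Sc = %.3f instructions/ns\n", T, 1.0 / T);
    return 0;
}
```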

  32. Hierarchical Cache
  [Figure: CPU - L1 - L2 - Memory]
  • L1 - L2 cache: go to the L1 cache and the L2 cache at the same time.
  • Hit in L1: speed is the same as a conventional cache.
  • Miss in L1: wait for L2 to respond. If L2 hits, access L2: slower than a conventional cache hit, but faster than a conventional cache miss.
  • If L2 misses, perform a block exchange: find the victim block in L1 and write it into L2 (this may require finding another victim block in L2 to write back to memory), then load the demanded block from memory into L1 and access L1. This may be slower than a conventional cache miss.

  33. Virtual Storage

  34. Virtual Storage System
  [Figure: the virtual storage space (virtual addresses, pages 0, 1, 2, ...) is mapped onto the real storage space of memory (real addresses, pages 0 ... 36 ...); pages A, B, C reside partly in memory and partly in secondary storage]

  35. Storage Organization
  [Figure: memory blocks 0 .. m-1 and secondary storage blocks 0 .. n-1]
  Main memory and secondary storage are organized into blocks, and information is exchanged between memory and secondary storage in units of a block. This block can be:
  - Page (fixed number of contiguous words)
  - Segment (variable number of contiguous words)
  - Paged segment (variable number of contiguous pages)

  36. Paged/Segmented Memory
  [Figure: a paged system divides memory into pages P0-P7; a segmented system divides it into segments S0-S3; a paged segment system divides segments S0-S3 into pages P0-P7]

  37. Virtual Storage System Operation
  [Figure: the program's virtual address (virtual page address, virtual word address) indexes the page table; on a hit the real page address is concatenated with the word address to form the real address, on a page fault a victim page is decided and a page is replaced from secondary storage]

  38. Virtual and Real Address
  [Figure: real address = real page address (bits 19-12) + real byte address (bits 11-0); virtual address = virtual page address (bits 31-12) + virtual byte address (bits 11-0)]
  • Main memory: 1 Mbyte; 1 page: 1K words; 1 word: 4 bytes; 32-bit virtual address. Find the real address and virtual address organizations.
  • Real address: byte address within a page: 12 bits [1 word = 4 bytes (2 bits), 1 page = 1K words (10 bits)]. Page address: number of memory pages = 2^20 / 2^12 = 2^8, so 8 bits.
  • Virtual address: the virtual byte address is the same as the real byte address (12 bits); the virtual page address is 32 - 12 = 20 bits.
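
A worked version of the same arithmetic; the constants follow the slide (4-Kbyte pages, 1-Mbyte memory, 32-bit virtual addresses) and the helper names are illustrative.

```c
#include <stdio.h>

/* Slide parameters: page = 1K words x 4 bytes = 4 Kbytes -> 12-bit byte
 * offset; 1 Mbyte memory -> 2^20 / 2^12 = 2^8 real pages (8-bit real page
 * address); 32-bit virtual address -> 32 - 12 = 20-bit virtual page address. */
#define OFFSET_BITS 12

unsigned page_of(unsigned addr)   { return addr >> OFFSET_BITS; }
unsigned offset_of(unsigned addr) { return addr & ((1u << OFFSET_BITS) - 1); }

int main(void) {
    unsigned vaddr = 0x12345678;
    printf("virtual page 0x%05X, byte offset 0x%03X\n",
           page_of(vaddr), offset_of(vaddr));
    printf("real pages in a 1 Mbyte memory: %u\n", 1u << (20 - OFFSET_BITS));
    return 0;
}
```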

  39. Address Translation Using the PT
  [Figure: the CPU issues a virtual address VA = (VPA, VWA); the page table PT entry (V, RPA) translates VPA to RPA, giving the real address RA = (RPA, RWA) sent to memory]

  40. Translation Look-Aside Buffer
  [Figure: the virtual address (VPA, VWA) is presented to both the TLB and the page table PT; the resulting RPA is combined with the word address to form the real address RA = (RPA, RWA)]
  Three cases:
  • Hit in both PT and TLB: RPA is obtained quickly from the TLB.
  • Hit in PT, miss in TLB: RPA is obtained from the PT, and this page is recorded in the TLB.
  • Miss in both PT and TLB: page fault; perform page replacement and record the new page in both the PT and the TLB.
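
A sketch of the three cases, assuming a toy linear-scan TLB and a stubbed page-fault handler; real TLBs are associative hardware, and the structure layouts and names here are assumptions made only for illustration.

```c
#include <stdbool.h>
#include <stdio.h>

#define TLB_ENTRIES 8
#define VIRT_PAGES  1024

struct tlb_entry { bool valid; unsigned vpa, rpa; };
struct pt_entry  { bool valid; unsigned rpa; };

static struct tlb_entry tlb[TLB_ENTRIES];
static struct pt_entry  page_table[VIRT_PAGES];

/* Stub page-fault handler for this sketch: pretend the page was loaded into
 * real page vpa % 256. A real handler would pick a victim page, retire it if
 * dirty, and load the page from secondary storage. */
static unsigned handle_page_fault(unsigned vpa) { return vpa % 256; }

static void tlb_record(unsigned vpa, unsigned rpa) {
    static unsigned next;                          /* simple round-robin fill */
    tlb[next++ % TLB_ENTRIES] = (struct tlb_entry){ true, vpa, rpa };
}

unsigned translate(unsigned vpa) {
    for (int i = 0; i < TLB_ENTRIES; i++)          /* case 1: TLB hit, fast path */
        if (tlb[i].valid && tlb[i].vpa == vpa)
            return tlb[i].rpa;

    if (page_table[vpa].valid) {                   /* case 2: PT hit, TLB miss */
        tlb_record(vpa, page_table[vpa].rpa);
        return page_table[vpa].rpa;
    }

    unsigned rpa = handle_page_fault(vpa);         /* case 3: page fault */
    page_table[vpa].valid = true;                  /* record in the PT ...     */
    page_table[vpa].rpa = rpa;
    tlb_record(vpa, rpa);                          /* ... and in the TLB       */
    return rpa;
}

int main(void) {
    printf("first access:  RPA %u\n", translate(42));   /* page fault */
    printf("second access: RPA %u\n", translate(42));   /* TLB hit    */
    return 0;
}
```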

  41. Page Table Entry
  [Figure: entry fields V, RPA, and other information]
  • The size of the page table is equal to the number of pages in the memory.
  • For each page in the page table:
  V: validity bit (1 bit); 1: the virtual page is in memory, 0: the virtual page is not in memory.
  RPA: translated real page address.
  Other: page mode (2 bits), use bit (1 bit or a counter), dirty bit (1 bit), age counter.
  Page mode: 00 page is FREE, 01 page is IN-USE, 10 page is LOCKED-IN, 11 page is IN-TRANSITION.
  Age counter: periodically incremented.
  Use bit (use counter): set to 1 (or the counter incremented) on either a read or a write access; periodically reset to 0.
  Dirty bit: set to 1 on a write access.
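
One possible C rendering of such an entry; the bit widths (for example an 8-bit RPA, matching the earlier 1-Mbyte example) and the packing are illustrative assumptions, not a defined machine format.

```c
/* One page-table entry with the fields listed on the slide; widths and
 * packing are illustrative only. */
enum page_mode { PAGE_FREE = 0, PAGE_IN_USE = 1,
                 PAGE_LOCKED_IN = 2, PAGE_IN_TRANSITION = 3 };

struct page_table_entry {
    unsigned valid : 1;    /* 1: virtual page is in memory            */
    unsigned rpa   : 8;    /* translated real page address            */
    unsigned mode  : 2;    /* enum page_mode                          */
    unsigned use   : 1;    /* set on read/write, reset periodically   */
    unsigned dirty : 1;    /* set on write access                     */
    unsigned age   : 8;    /* age counter, incremented periodically   */
};
```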

  42. Actions on Page Fault
  • Page replacement: decide a victim page. Since fully associative mapping is used in the virtual storage system, the victim page must be found from anywhere in memory (not within a set), which is more complicated than in a cache storage system.
  • Retire the victim page to secondary storage: a dirty victim page needs to be written back to secondary storage, but a clean victim page does not; simply overwrite the victim page with the demanded page.
  • Load the demanded page into memory: nothing special, simply load it where the victim page was.

  43. Page Replacement Algorithms
  • All replacement algorithms favor clean pages, because a clean victim page does not need to be written back during replacement.
  • Random: simplest to implement, using a hashing function.
  • FIFO: an age counter is needed for each PT entry to record the age of the page since it was loaded into memory. Temporal locality is honored.
  • LRU: a use counter is needed for each PT entry to record the number of accesses to the page within a predefined time period. The counter is periodically reset to 0 so it reflects recent use. The page with the smallest count represents the least usage and becomes the victim. Both temporal and spatial localities are honored.

  44. Page Replacement Algorithms
  • LFU: a use counter is needed in the PT to represent the usage of the page since it was loaded into memory. The page with the smallest counter value becomes the victim. Spatial locality is honored.
  • Working set: the small set of pages that a program can run with, without causing severe page faults, is called the "working set". A member of the working set does not become a victim page. This is very difficult to implement unless there is a very powerful compiler or preprocessor.

  45. Page Replacement Algorithms: Second Chance
  • Using both the use bit and the dirty bit, try to find a page with (U, D) = (0, 0) as the victim page.
  • If there is no such page, go through all the pages; if there are pages with (U, D) = (1, 0) or (1, 1), change them to (0, 0) and (0, 1), respectively. A page with (U, D) = (0, 1) may also be changed to (0, 0), recording somewhere that it is dirty.
  • Then go through the pages once again looking for a page with (0, 0), favoring a page that is not dirty.
  • Both temporal and spatial localities are honored.
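
A sketch of the second-chance scan over the (U, D) bits described above; the struct fields and the extra needs_writeback flag are assumptions made for this illustration.

```c
#include <stdbool.h>
#include <stdio.h>

struct page { bool use, dirty, needs_writeback; };

/* Second-chance victim selection: look for (U, D) = (0, 0); if none,
 * downgrade (1, 0) -> (0, 0) and (1, 1) -> (0, 1), optionally (0, 1) -> (0, 0)
 * while remembering the page is dirty, then scan again preferring a clean page. */
int second_chance(struct page pages[], int n) {
    for (int i = 0; i < n; i++)                    /* first pass: (0, 0) */
        if (!pages[i].use && !pages[i].dirty)
            return i;

    for (int i = 0; i < n; i++) {                  /* give second chances */
        if (pages[i].use) {
            pages[i].use = false;                  /* (1, D) -> (0, D)    */
        } else if (pages[i].dirty) {
            pages[i].dirty = false;                /* (0, 1) -> (0, 0) ...        */
            pages[i].needs_writeback = true;       /* ... but remember it is dirty */
        }
    }

    for (int i = 0; i < n; i++)                    /* second pass: prefer clean */
        if (!pages[i].use && !pages[i].dirty)
            return i;

    for (int i = 0; i < n; i++)                    /* no clean page: take a dirty one */
        if (!pages[i].use)
            return i;
    return 0;                                      /* fallback */
}

int main(void) {
    struct page pages[4] = { {1, 1, 0}, {1, 0, 0}, {0, 1, 0}, {1, 1, 0} };
    printf("victim page = %d\n", second_chance(pages, 4));   /* prints 1 */
    return 0;
}
```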
