
More solutions


Presentation Transcript


  1. More solutions • Superpipelining – more stages • means more than the 5 pipeline stages we have seen so far • Dynamic pipeline scheduling • change the order of executing instructions to fill gaps where possible (instead of inserting bubbles) • Superscalar – performing things in parallel • Performing two instructions simultaneously. This means fetching two instructions together, decoding them at the same time (which requires more read and write ports in the GPR file), executing them together, etc., i.e., almost doubling the hardware

  2. Memory Hierarchy Typical 1998 figures: • SRAM (used for the cache): 2–25 ns, $100 to $250 per MByte • DRAM (used for main memory): 60–120 ns, $5 to $10 per MByte • Magnetic disk: 10 to 20 million ns, $0.10 to $0.20 per MByte Users want fast and inexpensive memory. The solution: a memory hierarchy. A memory hierarchy in which the faster but smaller part is "close" to the CPU and used most of the time, and in which the slower but larger part is "far" from the CPU, gives us the illusion of having a fast, large, inexpensive memory.

  3. Locality • Temporal locality: • If we accessed a certain address, the chances are high that we will access it again shortly. For data this is so since we probably update it; for instructions it is so since we tend to use loops. • Spatial locality: • If we accessed a certain address, the chances are high that we will access its neighbors. • For instructions this is so due to the sequential nature of programs. For data it is so since we use groups of variables such as arrays. • So, let's keep recently used data and instructions in a fast memory (i.e., close to the CPU). This memory is called the cache.
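As a rough illustration (not from the original slides), the hypothetical C loop below reuses the same few variables on every iteration (temporal locality) and walks an array sequentially (spatial locality); the array name and size are invented for the example.

  /* Hypothetical example of locality; array name and size are invented. */
  #include <stdio.h>
  #define N 1024

  int main(void) {
      int a[N];
      for (int i = 0; i < N; i++)    /* consecutive addresses: spatial locality */
          a[i] = i;

      int sum = 0;                   /* 'sum' and 'i' are touched on every      */
      for (int i = 0; i < N; i++)    /* iteration: temporal locality            */
          sum += a[i];               /* a[i] is a neighbor of a[i-1]: spatial   */
      printf("%d\n", sum);
      return 0;
  }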

  4. The cache principle The important terms Hit: a successful search for info in the cache. If it is in the cache, we have a hit and we continue executing the instructions. Miss: an unsuccessful search for info in the cache. If it is not in the cache, we have a miss and we have to bring the requested data up from the slower memory one level down in the hierarchy. Until then, we must stall the pipeline! Block: the basic unit that is loaded into the cache when a miss occurs. The minimal block size is a single word.

  5. Direct Mapped Cache

  6. Direct Mapped Cache: block = 1 word, size of cache = 16 words (2^n blocks in general)

  7. One possible arrangement for a MIPS cache: [Figure: a direct-mapped cache for 32-bit addresses, showing bit positions. The address is split into a 20-bit tag (bits 31–12), a 10-bit index (bits 11–2) and a 2-bit byte offset (bits 1–0). The index selects one of 1024 entries (0–1023), each holding a valid bit, a 20-bit tag and a 32-bit data word; Hit is asserted when the entry is valid and its stored tag matches the address tag.]

  8. Another possibility for MIPS (the actual DECStation 3100):

  9. For any 32-bit-address CPU: [Figure: the general direct-mapped arrangement with 2^n one-word blocks. The address is split into a (30−n)-bit tag (bits 31 down to n+2), an n-bit index (bits n+1 down to 2) and a 2-bit byte offset (bits 1–0). The index selects one of 2^n entries (0 to 2^n−1), each holding a valid bit, a (30−n)-bit tag and a 32-bit data word; Hit is asserted when the entry is valid and its tag matches.]
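A minimal C sketch of this lookup, modeling the cache in software; the structure names and the N_INDEX_BITS parameter are invented for the illustration, not taken from the slides.

  /* Direct-mapped, one-word-per-block cache lookup (illustrative model only). */
  #include <stdint.h>
  #include <stdbool.h>

  #define N_INDEX_BITS 10                      /* 2^10 = 1024 blocks            */
  #define N_BLOCKS     (1u << N_INDEX_BITS)

  typedef struct {
      bool     valid;
      uint32_t tag;                            /* upper 30 - n address bits     */
      uint32_t data;                           /* one 32-bit word               */
  } cache_line_t;

  static cache_line_t cache[N_BLOCKS];

  /* Returns true on a hit and places the word in *word. */
  bool cache_lookup(uint32_t addr, uint32_t *word) {
      uint32_t index = (addr >> 2) & (N_BLOCKS - 1);   /* address bits n+1..2   */
      uint32_t tag   = addr >> (2 + N_INDEX_BITS);     /* address bits 31..n+2  */

      if (cache[index].valid && cache[index].tag == tag) {
          *word = cache[index].data;                   /* hit                   */
          return true;
      }
      return false;       /* miss: fetch the word from memory, fill the line,
                             set the valid bit and the tag, then retry          */
  }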

  10. Handling writes: • Write-through • Anything we write is written both to the cache and to the memory (we are still discussing a single-word block) • Write-through usually uses a write buffer • Since writing to the slower memory takes too much time, we use an intermediate buffer. It absorbs the write "bursts" of the program and slowly but surely writes them to the memory. (If the buffer gets full, we must stall the CPU) • Write-back • Another method is to copy the cache block into the memory only when the block is replaced by another block. This is called write-back or copy-back.
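A minimal sketch of the two policies, assuming the same kind of invented software cache model as above; dram[] plays the slower memory (or the eventual destination of the write buffer).

  /* Write-through vs. write-back for a one-word-block cache (illustration only). */
  #include <stdint.h>
  #include <stdbool.h>

  #define IDX_BITS 10
  #define BLOCKS   (1u << IDX_BITS)

  typedef struct { bool valid, dirty; uint32_t tag, data; } line_t;
  static line_t   c[BLOCKS];
  static uint32_t dram[1u << 20];                        /* simulated main memory */

  static void memory_write(uint32_t addr, uint32_t word) {  /* slow write (or a   */
      dram[(addr >> 2) & ((1u << 20) - 1)] = word;          /* write-buffer push) */
  }

  /* Write-through: update the cache and the memory on every store. */
  void write_through(uint32_t addr, uint32_t word) {
      uint32_t i = (addr >> 2) & (BLOCKS - 1);
      c[i] = (line_t){ .valid = true, .tag = addr >> (2 + IDX_BITS), .data = word };
      memory_write(addr, word);
  }

  /* Write-back: update only the cache; copy a dirty block to memory
     when it is about to be replaced by a block with a different tag. */
  void write_back(uint32_t addr, uint32_t word) {
      uint32_t i   = (addr >> 2) & (BLOCKS - 1);
      uint32_t tag = addr >> (2 + IDX_BITS);
      if (c[i].valid && c[i].dirty && c[i].tag != tag) {
          uint32_t old = (c[i].tag << (2 + IDX_BITS)) | (i << 2);
          memory_write(old, c[i].data);                  /* copy-back on eviction */
      }
      c[i] = (line_t){ .valid = true, .dirty = true, .tag = tag, .data = word };
  }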

  11. The block size The block does not have to be a single word. When we increase the size of the cache blocks, we improve the hit rate since we reduce the misses thanks to the spatial locality of the program (mainly) but also of the data (e.g., in image processing). Here is a comparison of the miss rate of two programs with single-word vs. 4-word blocks:

  12. Direct Mapped Cache: block = 1 word, size of cache = 16 words

  13. Direct Mapped Cache: block = 4 words, size of cache = 16 words (1 block = 4 words). This is still called a direct-mapped cache since each block in the memory is mapped directly to a single block in the cache.

  14. A 4-word-block direct-mapped implementation: [Figure: the 32-bit address is split into a 16-bit tag (bits 31–16), a 12-bit index (bits 15–4), a 2-bit block offset (bits 3–2) and a 2-bit byte offset (bits 1–0). The index selects one of 4K entries, each holding a valid bit, a 16-bit tag and a 128-bit (4-word) data field; a multiplexor driven by the block offset selects the requested 32-bit word, and Hit is asserted on a valid tag match.] When we have more than a single word in a block, the storage efficiency is slightly higher since we keep one tag per block instead of one per word. On the other hand, we slow the cache down somewhat since we add multiplexors. Anyhow, this is not the main issue; the issue is reducing the miss rate.

  15. A 2^m-word-block implementation: [Figure: the 32-bit address (bits 31..0) is split into a (30−n−m)-bit tag, an n-bit index, an m-bit block offset and a 2-bit byte offset inside a word. The index selects one of 2^n entries, each holding a valid bit, a (30−n−m)-bit tag and a 32·2^m-bit data field; a multiplexor driven by the m block-offset bits selects the requested 32-bit word, and Hit is asserted on a valid tag match.]
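The same kind of software sketch as before, extended to 2^m-word blocks; the block offset simply indexes into the block array, playing the role of the multiplexor (all names and the m, n values are invented).

  /* Direct-mapped lookup with 2^m-word blocks (illustrative model only). */
  #include <stdint.h>
  #include <stdbool.h>

  #define M 2                               /* 2^2 = 4 words per block   */
  #define N 12                              /* 2^12 = 4K blocks          */
  #define WORDS_PER_BLOCK (1u << M)
  #define NUM_BLOCKS      (1u << N)

  typedef struct {
      bool     valid;
      uint32_t tag;                         /* 30 - n - m address bits   */
      uint32_t data[WORDS_PER_BLOCK];       /* the whole block           */
  } block_t;

  static block_t cache[NUM_BLOCKS];

  bool lookup(uint32_t addr, uint32_t *word) {
      uint32_t block_off = (addr >> 2) & (WORDS_PER_BLOCK - 1);
      uint32_t index     = (addr >> (2 + M)) & (NUM_BLOCKS - 1);
      uint32_t tag       =  addr >> (2 + M + N);

      if (cache[index].valid && cache[index].tag == tag) {
          *word = cache[index].data[block_off];   /* array index = the "mux"    */
          return true;
      }
      return false;                               /* miss: load the whole block */
  }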

  16. Block size and miss rate: When we increase the size of the block, the miss rate, especially for instructions, is reduced. However, if we leave the cache size as is, we will get to a situation where there are too few blocks, so we have to replace them even before we have taken advantage of the locality, i.e., before we have used the entire block. That will increase the miss rate (this explains the rising right-hand side of the miss-rate graphs on the original slide).

  17. Block size and write: When we have more than a single word in a block, then when we write (a single word) into a block, we must first read the entire block from the memory (unless it is already in the cache) and only then write to the cache and to the memory. If we already had the block in the cache, the process is exactly as it was for a single-word-block cache. Separate instruction and data caches Note that usually we have separate instruction and data caches. Having a single cache for both would give some flexibility, since we sometimes have more room for data, but the 2 separate caches have twice the bandwidth, i.e., we can read from both at the same time (2 times faster). That is why most CPUs use separate instruction and data caches.

  18. Block size and read: When we have more than a single word in a block, we need to wait longer to read the entire block. There are techniques to start writing into the cache as soon as possible. The other approach is to design the memory so that reading is faster, especially when reading consecutive addresses. This is done by reading several words in parallel.

  19. Faster CPUs need better caches It is shown in the book (section 7.3, pp. 565–567) that when we improve the CPU (shorten the CPI or the CK period) but leave the cache as is, the relative weight of the miss penalty increases. This means that we need better caches for faster CPUs. Better means we should reduce the miss rate and reduce the miss penalty. Reducing the miss rate This is done by giving the cache more flexibility in keeping data. So far we allowed a memory block to be mapped to only a single block in the cache. We called it a direct-mapped cache. There is no flexibility here. The most flexible scheme is one where a block can be stored in any of the cache blocks. That way, we can keep frequently used blocks that would otherwise always compete for the same cache block in a direct-mapped implementation. Such a flexible scheme is called a fully associative cache. In a fully associative cache the tag must be compared against all cache entries. We also have a compromise called an "N-way set associative" cache. Here each memory block is mapped to one of a set of N blocks in the cache. Note that for caches with more than 1 possible mapping, we must employ some replacement policy (LRU or random are used).
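A minimal C sketch of an N-way set-associative lookup with an LRU-style replacement choice; the sizes, names and the timestamp-based LRU bookkeeping are invented for the illustration.

  /* N-way set-associative lookup with simple LRU replacement (illustration only). */
  #include <stdint.h>
  #include <stdbool.h>

  #define WAYS     4
  #define SET_BITS 8                              /* 2^8 = 256 sets      */
  #define SETS     (1u << SET_BITS)

  typedef struct { bool valid; uint32_t tag, data, last_used; } way_t;
  static way_t    cache[SETS][WAYS];
  static uint32_t now;                            /* crude LRU timestamp */

  bool sa_lookup(uint32_t addr, uint32_t *word) {
      uint32_t set = (addr >> 2) & (SETS - 1);
      uint32_t tag =  addr >> (2 + SET_BITS);

      for (int w = 0; w < WAYS; w++)              /* compare the tag in every way */
          if (cache[set][w].valid && cache[set][w].tag == tag) {
              cache[set][w].last_used = ++now;    /* mark as most recently used   */
              *word = cache[set][w].data;
              return true;
          }

      /* Miss: replace the least recently used way of this set
         (invalid ways start with last_used == 0, so they are chosen first). */
      int victim = 0;
      for (int w = 1; w < WAYS; w++)
          if (cache[set][w].last_used < cache[set][victim].last_used)
              victim = w;
      cache[set][victim] = (way_t){ .valid = true, .tag = tag,
                                    .last_used = ++now };  /* data comes from memory */
      return false;
  }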

  20. Direct Mapped Cache: block = 4 words, size of cache = 16 words (1 block = 4 words)

  21. 2-way set associative Cache: block = 4 words, size of cache = 32 words (1 block = 4 words; each set holds 2 blocks; in general, N·2^n blocks of 2^m words)

  22. A 4-way set associative cache Here the block size is 1 word. We see that we actually have 4 "regular" caches plus a multiplexor.

  23. A 2-way set associative cache: [Figure: the 32-bit address is split into a (30−n−m)-bit tag, an n-bit index, an m-bit block offset and a byte offset inside a word. Two identical banks, each with 2^n entries holding a valid bit, a (30−n−m)-bit tag and a 32·2^m-bit data block, are accessed in parallel; each bank compares its stored tag with the address tag and selects a word with its own multiplexor, producing Hit1/Data1 and Hit2/Data2, and a final multiplexor driven by the hit signals chooses the output.] Here the block size is 2^m words. We see that we actually have 2 caches plus a multiplexor.

  24. Fully associative Cache: block = 4 words, size of cache = 32 words (1 block = 4 words; the 8 blocks are numbered 1–8 and a memory block can be placed in any of them)

  25. A fully associative cache: [Figure: the address is split into a (30−m)-bit tag, an m-bit block offset and a 2-bit byte offset. Each of the N blocks stores a valid bit, a (30−m)-bit tag and 2^m 32-bit words; the address tag is compared with all stored tags in parallel, and the matching block's word, selected by the block offset, drives Data and hit.] Here the block size is 2^m words. We see that we have only N blocks.

  26. Searching for address 12 (marked) in 3 types of caches: [Figure: direct mapped – blocks #0–7; set associative – sets #0–3 with 2 blocks per set; fully associative – a single pool of 8 blocks; each entry holds a tag and data. Block 12 can only go to cache block 12 mod 8 = 4 in the direct-mapped cache, to either block of set 12 mod 4 = 0 in the 2-way set-associative cache, and to any block in the fully associative cache, so the search compares 1, 2 or 8 tags respectively.] Suppose we have 2^k words in the cache. Direct mapped: 1·2^n blocks (n = k). N-way set associative: N·2^n blocks (N = 2^(k−n)). Fully associative: N = 2^k.
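A tiny hedged C example of the three placement rules, using the figure's parameters (8 one-word blocks, 2 ways per set) purely for illustration:

  /* Where can memory block 12 be placed? (parameters taken from the figure) */
  #include <stdio.h>

  int main(void) {
      int block = 12, cache_blocks = 8, ways = 2;
      int sets = cache_blocks / ways;                       /* 4 sets */

      printf("direct mapped     : only cache block %d\n", block % cache_blocks); /* 4 */
      printf("2-way set assoc.  : either way of set %d\n", block % sets);        /* 0 */
      printf("fully associative : any of the %d blocks\n", cache_blocks);
      return 0;
  }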

  27. Faster CPUs need better caches Better means we should reduce the miss rate. For that we used a 4-way set associative cache. Better also means reducing the miss penalty. Reducing the miss penalty This is done by using a 2-level cache. The first cache is on the same chip as the CPU; actually it is a part of the CPU. It is very fast (1–2 ns, i.e., less than 1 CK cycle), it is small, its block is also small, so it can be 4-way set associative. The level-2 cache is off the chip, about 10 times slower, but still about 10 times faster than the main memory (DRAM). It has a larger block and is almost always 2-way set associative or direct mapped. It is mainly aimed at reducing the read penalty. Analyzing such caches is complicated; usually simulations are required. An optimal single-level cache is usually larger and slower than the level-1 cache and faster and smaller than the level-2 cache. Note that usually we have separate instruction and data caches.
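The standard way to quantify this (not spelled out on the slide, but consistent with it) is the average memory access time; with two cache levels the L2 cache shrinks the effective miss penalty seen by L1. The numbers below are invented just to show the shape of the calculation.

  AMAT = hit time + miss rate × miss penalty
  AMAT (two levels) = t(L1) + miss rate(L1) × [ t(L2) + local miss rate(L2) × t(memory) ]

  Example with made-up values: t(L1) = 1 cycle, miss rate(L1) = 5%,
  t(L2) = 10 cycles, local miss rate(L2) = 20%, t(memory) = 100 cycles:
  AMAT = 1 + 0.05 × (10 + 0.2 × 100) = 1 + 0.05 × 30 = 2.5 cycles.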

  28. Virtual Memory Virtual Memory (VM) enables having a small physical memory (the actual memory in the computer) while the program uses a much larger virtual memory. In principle, the idea is similar to that of the cache. Here the physical memory (DRAM) of the processor keeps the data used right now, while the disk keeps the entire program. The blocks here are called pages (4KB–16KB, or even up to 64KB today). A miss in VM is called a page fault. There are 2 reasons to use VM: 1) To have a larger virtual address space than the actual physical memory available. 2) To enable several programs, or processes, to run simultaneously, each "thinking" it has the same virtual addresses (since we cannot compile them knowing the real addresses to be used later). So we map the virtual address created by the processor (PC or ALUOut) to a physical address. We need an address translation.

  29. Address translation The translation is simple. We use the LSBs to point at the address inside a page and the rest of the bits, the MSBs, to point at a "virtual" page. The translation should replace the virtual page number with a physical page number having a smaller number of bits. This means that the physical memory is smaller than the virtual memory, and so we will have to load and store pages whenever required. Before VM, the programmer was responsible for loading and replacing "overlays" of code or data. VM takes this burden away. By the way, using pages and "relocating" the code and the data every time they are loaded into memory also enables better usage of memory: large contiguous areas are not required.

  30. Address translation The translation is done by a table called the page table. We have such a table, residing in the main memory, for each process. A special register, the page table register, points at the start of the table. When switching programs, i.e., switching to another process, we change the contents of that register so it points to the appropriate page table. [To switch a process also means storing all the registers, including the PC, of the current process and retrieving those of the process we want to switch to. This is done by the Operating System every now and then according to some predetermined rule.] We need to have a valid bit, same as in caches, which tells whether the page is valid or not. In VM we have fully associative placement of pages in the physical memory, to reduce the chance of a page fault. We also apply sophisticated algorithms for replacement of pages. Since the read/write time (from/to disk) is very long, we use a s/w mechanism instead of the h/w used in caches. Also, we use a write-back scheme and not write-through.
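A minimal C sketch of a one-level page-table translation with a valid bit; the sizes, names and the stand-in page-fault handler are invented for the illustration.

  /* Virtual-to-physical translation through a one-level page table (sketch). */
  #include <stdint.h>
  #include <stdbool.h>

  #define PAGE_BITS 12                       /* 4 KB pages                     */
  #define VPN_BITS  20                       /* 32-bit virtual addresses       */

  typedef struct {
      bool     valid;                        /* page currently in DRAM?        */
      bool     dirty;                        /* written since it was loaded?   */
      uint32_t ppn;                          /* physical page number           */
  } pte_t;

  /* One entry per virtual page; the page table register would point here.
     (A flat table like this is big, which is why real tables are kept small.) */
  static pte_t page_table[1u << VPN_BITS];

  /* Stand-in for the OS handler that would bring the page in from disk. */
  static uint32_t handle_page_fault(uint32_t vpn) { return vpn & 0xFF; }

  uint32_t translate(uint32_t vaddr) {
      uint32_t vpn    = vaddr >> PAGE_BITS;
      uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);

      if (!page_table[vpn].valid) {                      /* page fault          */
          page_table[vpn].ppn   = handle_page_fault(vpn);
          page_table[vpn].valid = true;
      }
      return (page_table[vpn].ppn << PAGE_BITS) | offset;
  }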

  31. The page table The operating system (OS) creates a copy of all the pages of a process on the disk. It loads the requested pages into the physical memory and keeps track of which page is loaded and which is not. The page table can be used to point at the pages on the disk: if the valid bit is on, the table entry holds the physical page address; if the valid bit is off, it holds the page's disk address. When a page fault occurs and all of the physical memory is used, the OS must choose which page to replace. LRU is often used. However, to simplify things, a "use" bit or "reference" bit is set by h/w every time a page is accessed. Every now and then these bits are cleared by the OS. So, according to these bits, the OS can decide which pages have a higher chance of being used and keep them in memory. The page table could be very big, so there are techniques to keep it small. We do not prepare room for all possible virtual addresses, but add an entry whenever a new page is requested. We sometimes have a page table with two parts: one for the heap, growing upwards, and one for the stack, growing downwards. Some OSs use hashing to translate between the virtual page number and the page table entry. Sometimes the page table itself is allowed to be paged. Note that every access to the memory is made of two reads: first we read the physical page address from the page table, then we can perform the real read.

  32. TLB Note that every access to the memory is made of two reads: first we read the physical page address from the page table, then we can perform the real read. In order to avoid that, we use a special cache for address translation. It is called a "Translation-Lookaside Buffer" (TLB). It is a small cache (32–4096 entries) with blocks of 1 or 2 page addresses, with a very fast hit time (less than 1/2 a CK cycle, to leave enough time for getting the data according to the address coming out of the TLB) and a small miss rate (0.01%–1%). A TLB miss causes a delay of 10–30 CK cycles to access the real page table and update the TLB. What about writes? Whenever we write to a page in the physical memory, we must set a bit in the TLB (and eventually, when the entry is replaced in the TLB, in the page table). This bit is called the "dirty" bit. When a "dirty" page is removed from the physical memory, it should be copied to the disk to replace the old, un-updated page that was originally on the disk. If the dirty bit is off, no copy is required since the original page is untouched.
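A minimal C sketch of a small fully associative TLB sitting in front of the page table; the sizes, names and the round-robin replacement are invented, and translate() refers to the page-table walk sketched above.

  /* A small fully associative TLB in front of the page table (sketch only). */
  #include <stdint.h>
  #include <stdbool.h>

  #define TLB_ENTRIES 64
  #define PAGE_BITS   12

  typedef struct { bool valid, dirty; uint32_t vpn, ppn; } tlb_entry_t;
  static tlb_entry_t tlb[TLB_ENTRIES];
  static unsigned    next_victim;                 /* trivial round-robin choice */

  uint32_t translate(uint32_t vaddr);             /* page-table walk from above */

  uint32_t translate_fast(uint32_t vaddr, bool is_write) {
      uint32_t vpn    = vaddr >> PAGE_BITS;
      uint32_t offset = vaddr & ((1u << PAGE_BITS) - 1);

      for (int i = 0; i < TLB_ENTRIES; i++)
          if (tlb[i].valid && tlb[i].vpn == vpn) {        /* TLB hit            */
              if (is_write) tlb[i].dirty = true;          /* set the dirty bit  */
              return (tlb[i].ppn << PAGE_BITS) | offset;
          }

      /* TLB miss (10-30 CK cycles): walk the page table, then keep the entry. */
      uint32_t paddr = translate(vaddr);
      tlb[next_victim] = (tlb_entry_t){ .valid = true, .dirty = is_write,
                                        .vpn = vpn, .ppn = paddr >> PAGE_BITS };
      next_victim = (next_victim + 1) % TLB_ENTRIES;
      return paddr;
  }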

  33. TLB and Cache together So here is the complete picture: the CPU generates a virtual address (the PC during fetch, or ALUOut during a lw or sw instruction). The bits go directly to the TLB. If there is a hit, the output of the TLB provides the physical page address. We combine these lines with the LSBs of the virtual address and use the resulting physical address to access the memory. This address is connected to the cache. If a cache hit is detected, the data immediately appears at the output of the cache. All of this takes less than a CK cycle, so we can use the data at the next rising edge of the CK.
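Putting the invented sketches above together, one memory read would flow roughly as below. This is only a software illustration of the ordering; the real hardware overlaps these steps inside a single CK cycle.

  /* Rough model of one memory read: translation, then cache lookup (sketch). */
  #include <stdint.h>
  #include <stdbool.h>

  uint32_t translate_fast(uint32_t vaddr, bool is_write);  /* TLB + page table  */
  bool     lookup(uint32_t paddr, uint32_t *word);         /* cache lookup      */
  uint32_t read_block_from_memory(uint32_t paddr);         /* miss handling     */

  uint32_t memory_read(uint32_t vaddr) {
      uint32_t paddr = translate_fast(vaddr, false);  /* virtual -> physical     */
      uint32_t word;
      if (lookup(paddr, &word))                       /* cache hit: use the data */
          return word;
      return read_block_from_memory(paddr);           /* cache miss: stall the
                                                          pipeline and refill    */
  }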

  34. Protection While handling a page fault we can detect that a program is trying to access a virtual page that is not defined. A regular process cannot be allowed to access the page table itself, i.e., to read and write the page table; only kernel (OS) processes can do that. There can also be restrictions on writing to certain pages. All this can be achieved with special bits in the TLB (a kernel bit, a write-access bit, etc.). Any violation should cause an exception that will be handled by the OS. In some OSs and CPUs, not all pages have the same size. We then use the term segment instead of page. In such a case we need h/w support that detects that the CPU is trying to access an address beyond the limit of the segment.

  35. End of caches & VM
