CS 152 Computer Architecture & Engineering

Presentation Transcript


  1. CS 152 Computer Architecture & Engineering
     Section 7, Spring 2010
     Andrew Waterman
     University of California, Berkeley

  2. Mystery Die

  3. Mystery Die

  4. Mystery Die
     RISC II: 41K transistors, 4-micron NMOS @ 12 MHz
     2.2x faster than the VAX-11/780 (1500 TTL chips @ 5 MHz)

  5. Agenda
     • Quiz 2 Post-Mortem
     • Mean: 53.1
     • Standard Deviation: 9.0

  6. Quiz 2, Q1
     • N = 1024. Store/load miss rate for a 4KB 2-way cache w/LRU replacement?
     • LRU => no conflicts between loads and stores
     • Loads are unit-stride with no reuse
     • All load misses are compulsory => load miss rate 1/8
     • Every store touches a new line (stride N) and each line is evicted before it can be reused, so all stores miss (capacity misses)
     for(i = 0; i < N; i++)
       for(j = 0; j < N; j++)
         B[j*N+i] = A[i*N+j];
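
A minimal simulation sketch (not from the slides) makes these counts checkable: a 4KB, 2-way, 32-byte-line cache with true LRU, driven by the loop above. The base addresses of A and B are assumptions; any line-aligned, non-overlapping placement should reproduce the load miss rate of 1/8 and the 100% store miss rate argued above.

    #include <stdio.h>
    #include <stdint.h>

    #define N    1024
    #define SETS 64            /* 4KB / (32B line * 2 ways) */
    #define LINE 32

    static uint64_t tags[SETS][2];
    static int valid[SETS][2];
    static int lru[SETS];      /* lru[s]: way to evict next in set s */

    static int cache_access(uint64_t addr) {   /* returns 1 on a miss */
        uint64_t line = addr / LINE;
        int s = line % SETS;
        uint64_t t = line / SETS;
        for (int w = 0; w < 2; w++)
            if (valid[s][w] && tags[s][w] == t) {
                lru[s] = !w;                   /* other way is now LRU */
                return 0;
            }
        int w = lru[s];                        /* miss: fill the LRU way */
        tags[s][w] = t;
        valid[s][w] = 1;
        lru[s] = !w;
        return 1;
    }

    int main(void) {
        uint64_t baseA = 0, baseB = 1ull << 24;   /* assumed placements */
        long loadMiss = 0, storeMiss = 0;
        for (long i = 0; i < N; i++)
            for (long j = 0; j < N; j++) {
                loadMiss  += cache_access(baseA + 4 * (i * N + j));  /* load A[i*N+j]  */
                storeMiss += cache_access(baseB + 4 * (j * N + i));  /* store B[j*N+i] */
            }
        printf("load miss rate %.3f, store miss rate %.3f\n",
               loadMiss / (double)((long)N * N), storeMiss / (double)((long)N * N));
        return 0;
    }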

  7. Quiz 2, Q1
     • What about FIFO replacement?
     • Stores and loads could now conflict. When?
     • Stores always use set (i/8) % 64
     • Loads always use set (j/8) % 64
     • Conflicts occur when these are equal
     for(i = 0; i < N; i++)
       for(j = 0; j < N; j++)
         B[j*N+i] = A[i*N+j];
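
A FIFO variant of the same sketch (reusing SETS, LINE, tags, and valid from the snippet above) differs only in that the eviction pointer does not update on hits, which is exactly why a resident A line can be evicted by a later B store once (i/8) % 64 == (j/8) % 64:

    static int fifo[SETS];     /* fifo[s]: way that was filled least recently */

    static int cache_access_fifo(uint64_t addr) {   /* returns 1 on a miss */
        uint64_t line = addr / LINE;
        int s = line % SETS;
        uint64_t t = line / SETS;
        for (int w = 0; w < 2; w++)
            if (valid[s][w] && tags[s][w] == t)
                return 0;      /* hit: FIFO state does NOT change */
        int w = fifo[s];       /* miss: evict the oldest fill */
        tags[s][w] = t;
        valid[s][w] = 1;
        fifo[s] = !w;
        return 1;
    }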

  9. Quiz 2, Q1
     • Is Write-Allocate a good idea for this code?
     for(i = 0; i < N; i++)
       for(j = 0; j < N; j++)
         B[j*N+i] = A[i*N+j];

  10. Quiz 2, Q1
     • Is Write-Allocate a good idea for this code?
     • No: on every store miss, 32 bytes of data are read into the cache and then discarded
     for(i = 0; i < N; i++)
       for(j = 0; j < N; j++)
         B[j*N+i] = A[i*N+j];

  11. Quiz 2, Q1
     • Is Write-Back a good idea for this code?
     for(i = 0; i < N; i++)
       for(j = 0; j < N; j++)
         B[j*N+i] = A[i*N+j];

  12. Quiz 2, Q1
     • Is Write-Back a good idea for this code?
     • Combined with Write-Allocate, bad: each 4-byte store fetches a 32-byte line and later writes 32 bytes back, 64 bytes of traffic in total
     • Otherwise OK, except that the Write-Through alternative had a write buffer, which dramatically reduces the miss penalty
     for(i = 0; i < N; i++)
       for(j = 0; j < N; j++)
         B[j*N+i] = A[i*N+j];
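
Spelling out the traffic arithmetic per 4-byte store under each policy (32-byte lines, as above; this is just the bullet's numbers made explicit):

    #include <stdio.h>

    int main(void) {
        int line = 32, word = 4;
        /* write-back + write-allocate: fill the line, then write it back */
        printf("WB + write-allocate: %d B fill + %d B writeback = %d B per store miss\n",
               line, line, 2 * line);
        /* write-through + no-allocate: only the word itself goes to memory */
        printf("WT + no-allocate:    %d B per store (absorbed by the write buffer)\n",
               word);
        return 0;
    }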

  13. Quiz 2, Q1
     • If the cache were fully associative, how could we improve the code’s performance?
     for(i = 0; i < N; i++)
       for(j = 0; j < N; j++)
         B[j*N+i] = A[i*N+j];

  14. Quiz 2, Q1
     • If the cache were fully associative, how could we improve the code’s performance?
     • Block the transpose
     • Full associativity makes this easier; lots of solutions
     • Here’s one; let BLK = 8, the number of words in a cache line (the slide calls the block size B, which collides with the matrix name)
     for(i = 0; i < N; i+=BLK)
       for(j = 0; j < N; j++)
         for(k = 0; k < BLK; k++)
           B[j*N+(i+k)] = A[(i+k)*N+j];
     (original loop for reference:)
     for(i = 0; i < N; i++)
       for(j = 0; j < N; j++)
         B[j*N+i] = A[i*N+j];
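
Below is a runnable version of the blocked transpose, a sketch with the block size renamed to BLK so it no longer collides with the matrix B; the allocation and verification code is added here and is not from the slide:

    #include <stdio.h>
    #include <stdlib.h>

    #define N   1024
    #define BLK 8                       /* words per 32-byte cache line */

    int main(void) {
        int *A = malloc((size_t)N * N * sizeof *A);
        int *B = malloc((size_t)N * N * sizeof *B);
        if (!A || !B) return 1;
        for (long k = 0; k < (long)N * N; k++)
            A[k] = (int)k;

        for (int i = 0; i < N; i += BLK)        /* blocked transpose */
            for (int j = 0; j < N; j++)
                for (int k = 0; k < BLK; k++)
                    B[j * N + (i + k)] = A[(i + k) * N + j];

        for (long i = 0; i < N; i++)            /* verify B == A^T */
            for (long j = 0; j < N; j++)
                if (B[j * N + i] != A[i * N + j]) {
                    puts("mismatch");
                    return 1;
                }
        puts("transpose OK");
        free(A);
        free(B);
        return 0;
    }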

  15. Quiz 2, Q1
     • What about TLB misses?
     • 4KB pages, 1024-entry DM TLB
     • Compulsory misses first:
       2 matrices * (1024^2 words) / (1024 words/page) = 2048
     for(i = 0; i < N; i++)
       for(j = 0; j < N; j++)
         B[j*N+i] = A[i*N+j];

  16. Quiz 2, Q1
     • What about TLB misses?
     • 4KB pages, 1024-entry DM TLB
     • Now consider some iteration i, 0 ≤ i < N-1
     • After iteration i, TLB[i] holds A’s page i, and TLB[k] holds B’s page k for k ≠ i
     • During iteration i+1, the store to B’s page i misses (its entry still holds A’s page i)
     • The store to B’s page i+1 also misses, kicking out A’s page i+1
     • The next load to A’s page i+1 then misses
     • ~3 conflicts/iteration
     • ~3072 conflict + 2048 compulsory misses total
     for(i = 0; i < N; i++)
       for(j = 0; j < N; j++)
         B[j*N+i] = A[i*N+j];
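
A minimal direct-mapped TLB sketch over the same loop supports this count. The 4MB-aligned placement of A and B is an assumption that makes page i of each matrix collide in TLB entry i, the case the slide analyzes; the 3-conflicts-per-iteration figure ignores the first and last iterations, so the simulated total lands slightly under 5120:

    #include <stdio.h>
    #include <stdint.h>

    #define N       1024
    #define ENTRIES 1024               /* direct-mapped TLB */
    #define WPP     1024               /* words per 4KB page */

    static int64_t vpn_in[ENTRIES];

    static long touch(int64_t vpn) {   /* returns 1 on a TLB miss */
        int e = vpn % ENTRIES;
        if (vpn_in[e] == vpn)
            return 0;
        vpn_in[e] = vpn;
        return 1;
    }

    int main(void) {
        for (int e = 0; e < ENTRIES; e++)
            vpn_in[e] = -1;
        int64_t baseA = 0, baseB = 4096;       /* VPNs of 4MB-aligned regions */
        long misses = 0;
        for (long i = 0; i < N; i++)
            for (long j = 0; j < N; j++) {
                misses += touch(baseA + (i * N + j) / WPP);   /* load A[i*N+j]  */
                misses += touch(baseB + (j * N + i) / WPP);   /* store B[j*N+i] */
            }
        printf("TLB misses: %ld (slide's estimate: 2048 + 3072 = 5120)\n", misses);
        return 0;
    }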

  17. Quiz 2, Q2
     • Basic idea of microtags: SA caches put the tag check on the critical path (data-out)
     • Reduce the critical path by using a subset of the tag bits to select the way
     • In this cache, microtag check -> data-out remains the critical path, but is 1/6 faster
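
A hedged sketch of the way-selection idea, with illustrative field widths and structures (not the quiz's actual design): a narrow microtag compare picks the data-out way early, while the full tag compare completes off the critical path to confirm the hit.

    #define WAYS      4
    #define UTAG_BITS 8
    #define UTAG_MASK ((1u << UTAG_BITS) - 1)

    typedef struct { unsigned valid, tag; } Way;

    /* Microtags are kept unique within a set, so at most one way can match
       the narrow compare; data-out is driven from it immediately. */
    int select_way(const Way set[WAYS], unsigned tag) {
        for (int w = 0; w < WAYS; w++)
            if (set[w].valid && (set[w].tag & UTAG_MASK) == (tag & UTAG_MASK))
                return w;   /* full tag check still confirms the hit later */
        return -1;          /* no microtag match: guaranteed miss */
    }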

  18. Quiz 2, Q2
     • AMAT = hit time + miss rate * miss penalty
     • Hit time is not multiplied by the hit rate
     • You have to pay the hit time even on a miss
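
An illustrative AMAT calculation (the numbers are made up, not taken from the quiz); note that the hit-time term is paid on every access, hit or miss:

    #include <stdio.h>

    int main(void) {
        double hit_time = 1.0;        /* cycles, paid on every access */
        double miss_rate = 0.05;
        double miss_penalty = 40.0;   /* cycles */
        double amat = hit_time + miss_rate * miss_penalty;
        printf("AMAT = %.1f + %.2f * %.0f = %.1f cycles\n",
               hit_time, miss_rate, miss_penalty, amat);
        return 0;
    }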

  19. Quiz 2, Q2
     • Microtag uniqueness affects conflict misses
     • They increase compared to 4-way SA
     • But still far fewer than direct-mapped
     • Otherwise, why would we build a microtagged cache? Just use DM

  20. Quiz 2, Q2
     • The aliasing question was unintentionally tricky: microtags are a red herring
     • The aliasing problem is the same as for any virtually-indexed, physically-tagged cache with index + offset bits ≤ page-offset bits
     • Aliases always map to the same set, which would be fine for DM, but with SA they can live in different ways

  21. Quiz 2, Q2
     • The aliasing question was unintentionally tricky: microtags are a red herring
     • The aliasing problem is the same as for any virtually-indexed, physically-tagged cache with index + offset bits ≤ page-offset bits
     • Simple fix: on a miss, you already have the physical tag of the missing line and the physical tags of every line in the set
     • Iff there’s a match, there’s an alias
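
A sketch of that fix with illustrative structures (the Line type and its fields are assumptions): on a miss the translation has already produced the physical tag, so comparing it against every way of the virtually-indexed set detects any alias.

    typedef struct { int valid; unsigned ptag; } Line;   /* illustrative */

    /* Returns the way holding an alias of the missing line, or -1 if the
       fill can proceed; 'set' is the virtually-indexed set just probed. */
    int find_alias(const Line *set, int nways, unsigned ptag) {
        for (int w = 0; w < nways; w++)
            if (set[w].valid && set[w].ptag == ptag)
                return w;   /* same physical line living in another way */
        return -1;
    }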

  22. Quiz 2, Q3
     • 2x associativity (capacity & line size constant)
       • Increases hit time due to data-out muxing
       • Reduces conflict misses
     • Halving line size (associativity & #sets constant)
       • Reduces hit time (capacity goes down)
       • Increases miss rate (same reason: less capacity)
       • Reduces miss penalty (shorter lines, less to fetch)

  23. Quiz 2, Q3
     • Physical -> virtual cache
       • Reduces hit time (removing the TLB from the hit path is the only real reason to do this)
       • Effect on miss rate is ambiguous:
         • More misses from aliases
         • More misses on context switches without ASIDs
         • Fewer misses due to address-space contiguity
       • Increases miss penalty, because the TLB lookup moves to the miss path and anti-aliasing adds work

  24. Quiz 2, Q3
     • Write buffer
       • Reduces both the store miss penalty and the store hit time
     • HW prefetching
       • The prefetcher isn’t on the hit path, so no effect on hit time
       • Reduces miss rate (the main reason to do it)
       • A hit in the prefetch buffer counts as a “slow hit”, not a miss
       • Reduces miss penalty (prefetches can already be in flight when a miss occurs)
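
To illustrate the "slow hit" accounting, here is a toy next-line prefetcher with a one-entry prefetch buffer; everything about it, sizes included, is an illustrative assumption. A demand access that finds its line in the prefetch buffer is counted as a slow hit rather than a miss:

    #include <stdio.h>
    #include <stdint.h>

    enum { FAST_HIT, SLOW_HIT, MISS };
    #define LINES 64                   /* toy direct-mapped cache */

    static int64_t cache[LINES];       /* holds line numbers, -1 = empty */
    static int64_t pf_line = -1;       /* one-entry prefetch buffer */

    static int access_line(int64_t line) {
        if (cache[line % LINES] == line)
            return FAST_HIT;
        int kind = (line == pf_line) ? SLOW_HIT : MISS;
        cache[line % LINES] = line;    /* install the line on either path */
        pf_line = line + 1;            /* launch a next-line prefetch */
        return kind;
    }

    int main(void) {
        for (int i = 0; i < LINES; i++)
            cache[i] = -1;
        int counts[3] = {0, 0, 0};
        for (int64_t line = 1000; line < 1016; line++)   /* unit-stride sweep */
            counts[access_line(line)]++;
        printf("fast hits %d, slow hits %d, misses %d\n",
               counts[FAST_HIT], counts[SLOW_HIT], counts[MISS]);
        return 0;
    }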
