
Impulse Project DARPA Review – July 2000




Presentation Transcript


  1. Impulse Project
  DARPA Review – July 2000
  University of Utah and University of Massachusetts at Amherst

  2. Technology Trends
  • Disturbing trends (for a memory architect):
    • Memory gap widening (CPUs improving 60%/year, DRAM only 7%/year)
    • Internal CPU parallelism is escalating
    • Emerging applications with poor locality (multimedia, databases, …)
    • Cache sizes growing much faster than TLB reach
    • Ugly CPIs: Perl and Sites, OSDI 1996
  • Possible solutions:
    • Bigger, deeper cache hierarchies
    • Better latency-tolerating CPU features (non-blocking caches, out-of-order execution, …)
    • Migrate computation to the DRAMs
    • Let software control how data is managed (Impulse)

  3. Simple Example Problem
  • Sum of the diagonal elements of a dense matrix:
    for (i = 0; i < n; i++)
        sum += A[i][i];
  • Problems:
    • Wasted bus bandwidth
    • Low cache utilization
    • Low cache hit ratio
  [Figure: cache, memory bus, memory controller, physical memory]
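The waste the slide describes can be made concrete with a back-of-envelope model. The cache-line and element sizes below are assumptions for illustration, not figures from the slides:

```python
# Model of the slide's problem: each A[i][i] access fetches a whole cache
# line but uses only one element, because the row stride far exceeds the
# line size. Assumed parameters: 128-byte lines, 8-byte doubles.
LINE_BYTES = 128
ELEM_BYTES = 8

def diag_traffic(n):
    """Bytes moved over the bus vs. bytes actually used by sum += A[i][i]."""
    fetched = n * LINE_BYTES   # every diagonal access misses, pulling a full line
    used = n * ELEM_BYTES      # only one element per line is consumed
    return fetched, used

fetched, used = diag_traffic(1024)
print(fetched // used)  # -> 16: only 1/16 of the fetched data is used
```

Under these assumptions, 15 of every 16 bytes crossing the bus are wasted, which is exactly the bandwidth and cache-utilization problem the next slide's remapping removes.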

  4. The Impulse Idea
  • What if software could do the following?
    Create diag[*] corresponding to A[*][*]
    for (i = 0; i < n; i++)
        sum += diag[i];
  • Improvements:
    • No wasted bus bandwidth
    • Better cache utilization
    • Higher cache and TLB hit ratios
  [Figure: cache, memory bus, memory controller, physical memory]

  5. How? Add an Extra Level of Mapping
  • Shadow address: an "unused" physical address
  • The MC maps shadow addresses to physical addresses
  • Applications configure the MC through the OS
  [Figure: virtual space -> MMU/TLB -> physical space, split into real physical memory and a shadow address space handled by the Impulse MC]
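A toy model of the extra mapping level may help. This sketch is an assumption-laden simplification, not the real Impulse MC logic: it maps an offset in a shadow diag[] region to the physical address of the corresponding diagonal element of an n x n matrix of 8-byte elements:

```python
# Hypothetical model of the MC's shadow-to-physical translation for the
# diagonal example: shadow offset i*8 (diag[i]) maps to the physical
# address of A[i][i] in a row-major n x n matrix at matrix_base.
ELEM = 8  # assumed 8-byte matrix elements

def shadow_to_physical(shadow_off, matrix_base, n):
    i = shadow_off // ELEM                     # which diag[] element is accessed
    return matrix_base + (i * n + i) * ELEM    # physical address of A[i][i]

# diag[2] of a 1024 x 1024 matrix based at (assumed) physical 0x100000:
addr = shadow_to_physical(2 * ELEM, 0x100000, 1024)
```

The CPU and its caches see only the dense shadow addresses; the scatter/gather arithmetic happens once, at the memory controller.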

  6. Address Translations
  [Figure: address translation paths. Conventional system: virtual memory -> MMU/TLB -> physical memory (page-grained). Impulse system: virtual memory -> MMU/TLB -> shadow memory -> pseudo-virtual memory -> physical memory (word-grained), shown for the diagonal example.]

  7. Impulse Features
  • Base-stride scatter/gather of data
    • Walk columns or diagonals efficiently
    • Remap matrix tiles to contiguous memory without copying
  • Indirection-vector accesses
    • Static vectors (e.g., perform A[index[i]] efficiently)
    • Dynamic cache-line assembly
  • Page remapping
    • Create superpages from disjoint base pages
    • No-copy page coloring
  • Aggressive controller-based prefetching
    • Prefetch data from the DRAMs (sequential and pointer-directed)

  8. Exploiting Impulse
  Setup:
  • Application asks the OS to set up a remapping
  • OS allocates a free shadow configuration register
    • sets up a dense "page table" that points to the target data
    • downloads the address of this page table to the configuration register
  • OS allocates free shadow and virtual address space
    • maps application virtual addresses to shadow physical addresses
    • returns the virtual address of the remapped data to the application
  • Set up the TLB translation (VA to shadow)
  Use:
  • Application accesses the (dense) remapped data
  • Remapped addresses pass through the MC-TLB; fine-grained remapping (if any)
  • DRAM scheduler "collects" the data
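The setup/use split above can be sketched as a small software model. Every name here is invented for illustration; the real OS/MC interface and its register formats differ:

```python
# Hypothetical model of the Impulse setup/use flow. The "OS" builds a
# dense page table of target locations and points a free shadow
# configuration register at it; the "application" then issues dense
# accesses that the MC translates and gathers.
class MemoryController:
    def __init__(self):
        self.config_regs = {}              # shadow base -> remap table

    def configure(self, shadow_base, page_table):
        # OS downloads the page table's location into a configuration register.
        self.config_regs[shadow_base] = page_table

    def load(self, shadow_addr, shadow_base, memory):
        # MC-TLB lookup: translate the shadow address through the table,
        # then gather the word from real (word-addressed, for simplicity) memory.
        table = self.config_regs[shadow_base]
        return memory[table[shadow_addr - shadow_base]]

# Setup: remap every 5th word of "physical memory" into a dense shadow view.
memory = list(range(100))                  # stand-in for physical memory
table = {i: 5 * i for i in range(20)}      # dense shadow slot -> physical word
mc = MemoryController()
mc.configure(0x8000, table)                # 0x8000: assumed free shadow base

# Use: dense application accesses hit the remapped, strided data.
dense = [mc.load(0x8000 + i, 0x8000, memory) for i in range(4)]
# dense == [0, 5, 10, 15]
```

The point of the split is that the expensive part (building and downloading the table) happens once at setup; steady-state accesses are ordinary dense loads.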

  9. Architecture Overview

  10. Benchmarks
  • Fine-grained remapping benchmarks
    • Conjugate gradient (core of the DARPA vision benchmark)
    • Ray tracing
  • Page-grained remapping benchmarks
    • SPEC95 (dynamic superpage promotion)
    • Compress (no-copy page coloring)
  • Prefetching benchmarks
    • SPECint95 suite (3-15% performance improvement)
    • Synthetic tree microbenchmarks

  11. Conjugate Gradient
  • Store the logically sparse matrix A using the Yale storage scheme:
    • Data stores the non-zero elements (much larger than P)
    • Row[i] indicates where the i-th row begins in Data
    • Column[i] is the column number of Data[i]
  [Figure: B = A x P example showing the Row, Column, and Data arrays]
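The Yale (compressed sparse row) layout described above can be shown in a small runnable sketch. The concrete 3x3 matrix is a made-up example, not the one pictured on the slide:

```python
# Sparse matrix-vector product B = A @ P in the Yale/CSR layout:
# Data holds the non-zeros, Column[j] gives Data[j]'s column, and
# Row[i]..Row[i+1] delimits row i's slice of Data.
def csr_matvec(Row, Column, Data, P):
    n = len(Row) - 1
    B = [0.0] * n
    for i in range(n):
        for j in range(Row[i], Row[i + 1]):
            B[i] += Data[j] * P[Column[j]]
    return B

# Example: A = [[1, 0, 2], [0, 3, 0], [4, 0, 5]]
Row = [0, 2, 3, 5]
Column = [0, 2, 1, 0, 2]
Data = [1.0, 2.0, 3.0, 4.0, 5.0]
B = csr_matvec(Row, Column, Data, [1.0, 1.0, 1.0])
# B == [3.0, 3.0, 9.0]
```

Note the access pattern: Data and Column stream sequentially, but P is accessed indirectly through Column, which is exactly the behavior the next slide optimizes.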

  12. Optimizing Conjugate Gradient
  Original code:
    for i = 0 to n-1 do
        sum = 0;
        for j = Row[i] to Row[i+1]-1 do
            sum += Data[j] * P[Col[j]];
        b[i] = sum;
  Issues:
  • Data and Col are large streams
  • P is reusable, but forced out of the cache
  • Poor L1 cache hit rates
  • Interference in the L2 cache
  Optimized code:
    Pi = remap_indirect(P, Col, n, …);
    for i = 0 to n-1 do
        sum = 0;
        for j = Row[i] to Row[i+1]-1 do
            sum += Data[j] * Pi[j];
        b[i] = sum;
  • Indirect accesses P[Col[j]] turned into sequential streaming accesses
  • No reuse of P now
  • Side effect: eliminates accesses to Col
  • Significant improvement in hit rates (both L1 and TLB)
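What remap_indirect buys can be checked with a sketch. Here the Impulse remapping is modeled as an explicit up-front gather (on real Impulse hardware the MC assembles Pi on the fly, with no extra copy in DRAM); the data values are invented for the test:

```python
# Model of remap_indirect: Pi[j] = P[Col[j]], performed by the memory
# controller rather than the CPU's inner loop.
def remap_indirect_model(P, Col):
    return [P[c] for c in Col]

def cg_inner_original(Row, Col, Data, P):
    b = []
    for i in range(len(Row) - 1):
        s = 0.0
        for j in range(Row[i], Row[i + 1]):
            s += Data[j] * P[Col[j]]      # indirect access through Col
        b.append(s)
    return b

def cg_inner_optimized(Row, Data, Pi):
    b = []
    for i in range(len(Row) - 1):
        s = 0.0
        for j in range(Row[i], Row[i + 1]):
            s += Data[j] * Pi[j]          # sequential streaming access, no Col
        b.append(s)
    return b

Row, Col = [0, 2, 3, 5], [0, 2, 1, 0, 2]
Data, P = [1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 2.0, 3.0]
Pi = remap_indirect_model(P, Col)
assert cg_inner_original(Row, Col, Data, P) == cg_inner_optimized(Row, Data, Pi)
```

The optimized loop touches only two sequential streams (Data and Pi) instead of two streams plus a scattered P, which is where the L1 and TLB hit-rate gains come from.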

  13. Conjugate Gradient Results
  • Significant improvement in effective cache locality

  14. Volume Rendering: Ray Tracing
  • Problem: ray traversals are "random" memory accesses
  • Solution: calculate the addresses along rays as an "indirection vector"; access rays via an Impulse-remapped data structure

  15. Volume Rendering Results
  • A: rays follow the natural memory layout (X axis)
  • B: rays perpendicular to the natural memory layout (Z axis)

  16. Coarse-Grained Remappings
  • Page-grained remapping: aggressive use of synthetic superpages
    • Modified the kernel TLB miss handler to detect pages responsible for frequent TLB misses
    • Create superpages by page-grained remapping on the memory controller
    • No copying, so promotion can be far more aggressive
  • No-copy page coloring
    • Problem: conflicts in the physically indexed L2 cache
    • Normal solution: copy to non-conflicting pages
    • Impulse solution: remap to non-conflicting pages

  17. Shadow-Backed Superpages
  • SPECint95 improves 5-20%
  • The MTLB increases the effective reach of the CPU TLB
  • Superpage large and multiple arrays at compile time, at allocation time (cheapest), or dynamically
  [Figure: virtual, shadow, and physical address columns showing disjoint physical pages remapped into one contiguous shadow superpage]

  18. MMC-Based Prefetching
  • Idea: prefetch data from the DRAMs into SRAM on the MMC
  • Misprediction penalties significantly reduced
    • No conflict misses due to cache capacity limitations
    • No wasted system bus bandwidth
  • Exploits "free" DRAM bandwidth at the MMC level
    • Higher aggregate DRAM bandwidth than cache or bus bandwidth
  • Reduces latency of accesses that hit in the prefetch cache

  19. Pointer-Based Microbenchmarks
  • Random walk down a tree with N children per node
    • Vary the number of children from 1 (linked list) to 3 (trinary tree)
  • Baseline: compiler-directed prefetching
  • Impulse: MMC prefetches the next nodes in the tree (1-ahead)
    • Allocate nodes in the shadow region
    • Tell the MMC which offsets represent pointers
  [Figure: tree with Root and Child1 … ChildN at each level]
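The microbenchmark and the 1-ahead policy can be sketched in software. This is only a model of the idea (the real MMC chases pointers in hardware using the registered offsets), and the tree shape and depth are chosen arbitrarily:

```python
# Model of the pointer-prefetching microbenchmark: a random walk down an
# N-ary tree, with the MMC's 1-ahead policy simulated by prefetching all
# children of the current node before the next step is chosen.
import random

class Node:
    def __init__(self, children=None):
        self.children = children or []

def random_walk(root, prefetch_cache):
    node, hits, visits = root, 0, 0
    while node.children:
        for c in node.children:              # MMC prefetches next nodes (1-ahead)
            prefetch_cache.add(id(c))
        node = random.choice(node.children)  # "random" pointer chase
        visits += 1
        hits += id(node) in prefetch_cache   # next node already on the MMC
    return hits, visits

def build(depth, fanout=3):
    """Build a trinary tree of the given depth."""
    return Node([] if depth == 0 else
                [build(depth - 1, fanout) for _ in range(fanout)])

hits, visits = random_walk(build(4), set())
# hits == visits: 1-ahead prefetching covers every step of the walk
```

Because every child is prefetched before the walk chooses one, even a truly random descent hits the prefetch cache on each step, which is why this workload defeats compiler-directed prefetching but suits the MMC.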

  20. Pointer Prefetching Results
  • P1(N): singly-linked list, no prefetching
  • P3(C): triply-linked list, compiler-directed prefetching
  • P#(I): Impulse MMC-directed prefetching

  21. Prototyping Status
  • Four-stage prototype strategy:
    • I: Slow conventional MMC
    • II: Fast conventional MMC
    • III: Impulse on an FPGA
    • IV: Impulse in an ASIC
  • Current status:
    • Stage I complete (pictured)
    • Stage II imminent (final testing)
    • Stage III underway (3/01)
    • Stage IV next year (12/01)

  22. Summary
  • Impulse benefits:
    • Higher memory bus utilization
    • Higher cache utilization
    • Turns sparse memory operations into dense ones
  • Range of optimizations:
    • Fine-grained data remapping
    • Page-grained data remapping
    • Memory-based prefetching
  • Impact:
    • Performance increase for a small increase in cost
    • Does not require changes to CPUs, caches, or DRAMs

  23. Questions? http://www.cs.utah.edu/impulse
