
Rough Schedule


Presentation Transcript


  1. Rough Schedule • 1:30-2:15 IRAM overview • 2:15-3:00 ISTORE overview • break • 3:15-3:30 Financial • 4:00-5:00 Future

  2. IRAM Hardware and Software Kathy Yelick Computer Science Division UC Berkeley

  3. Intelligent RAM: IRAM [Figure: block diagrams labeled Logic fab, Proc, $, L2$, Bus, DRAM, I/O, DRAM fab] Microprocessor & DRAM on a single chip: • 10X capacity vs. DRAM • on-chip memory latency 5-10X, bandwidth 50-100X • improve energy efficiency 2X-4X (no off-chip bus) • serial I/O 5-10X v. buses • smaller board area/volume IRAM advantages extend to: • a single chip system • a building block for larger systems

  4. VIRAM: System on a Chip [Figure: floorplan labeled CPU+$, 4 Vector Pipes/Lanes, Xbar, and two Memory blocks (64 Mbits / 8 MBytes each)] • 0.18 um EDL process • 16 MB DRAM, 8 banks • MIPS Scalar core and caches @ 200 MHz • 4 64-bit vector unit pipelines @ 200 MHz • 17x17 mm, 2 Watts target • 25.6 GB/s memory (6.4 GB/s per direction and per Xbar) • 0.8 Gflops (64-bit), 6.4 GOPs (16-bit)
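The peak figures on this slide can be cross-checked with a little arithmetic. The sketch below is only a consistency check, not the authors' derivation. It assumes that each 64-bit lane splits into four 16-bit subwords, that a 16-bit multiply-add is counted as two operations while a 64-bit floating-point operation is counted as one, and that each crossbar moves 256 bits per cycle per direction (the datapath width quoted on the permutation slides). All names are illustrative.

```python
# Consistency check of the slide's peak numbers under stated assumptions.
CLOCK_HZ = 200e6   # 200 MHz, from the slide
LANES = 4          # four vector pipelines, from the slide
LANE_BITS = 64     # 64-bit lanes, from the slide


def peak_ops(elem_bits, ops_per_element):
    """Peak operations per second for a given element width.

    ops_per_element is an assumption: 1 for a plain operation,
    2 if a multiply-add is counted as two operations.
    """
    subwords_per_lane = LANE_BITS // elem_bits
    return LANES * subwords_per_lane * ops_per_element * CLOCK_HZ


gflops_64 = peak_ops(64, 1) / 1e9   # assumed: one flop per lane per cycle
gops_16 = peak_ops(16, 2) / 1e9     # assumed: multiply-add counted as 2 ops

XBAR_BITS = 256                     # assumed crossbar width per direction
gb_per_direction = XBAR_BITS / 8 * CLOCK_HZ / 1e9
gb_total = gb_per_direction * 2 * 2  # 2 directions x 2 crossbars

print(gflops_64, gops_16, gb_per_direction, gb_total)
```

Under these assumptions the computed values line up with the 0.8 Gflops, 6.4 GOPs, 6.4 GB/s, and 25.6 GB/s quoted above; a different operation-counting convention would change the first two.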

  5. IRAM Chip Update • IBM supplying embedded DRAM/Logic (100%) • Agreement in place and technology files available • MIPS supplying scalar core (100%) • MIPS processor, caches, TLB • MIT supplying FPU (100%) • VIRAM-1 Tape-out scheduled for late-2000 • Simplifications • Floating point • Network Interface

  6. VIRAM-1 Chip Design Status • MIPS scalar core: synthesizable RTL code received from MIPS; cache RAMs to be compiled for IBM technology • FPU: RTL code almost complete • Vector unit: RTL models for sub-blocks developed, currently being integrated and tested; control logic to be compiled for IBM technology; full-custom layout for multipliers/adders developed, layout for shifters to be developed • Memory system: synthesizable model for DRAM controllers done, to be integrated with IBM DRAM macros; full-custom layout for crossbar under development • Testing infrastructure: environment developed for automatic test & validation; directed tests for single/multiple instruction groups developed; random instruction sequence generator developed

  7. IRAM Architecture Update • ISA mostly frozen since 6/99 • Changes in 2H 99 for better fixed-point model and some instructions for short vectors (auto increment and in-register permutations) • Minor changes in 1H 00 to address new co-processor interface in MIPS core • ISA manual publicly available • http://www.cs.berkeley.edu • Suite of simulators actively used • vsim-isa (functional) • Major rewrite underway for new scalar processor • All UCB code • vsim-p (performance), vsim-db (debugger), vsim-sync (memory synchronization)

  8. IRAM Compiler Status [Figure: compiler structure labeled Frontends (C, C++, Fortran), Vectorizer (PDGCS), Code Generators (C90, IRAM)] • Retarget of Cray backend • Steps in compiler development • Build MIPS backend (done) • Build VIRAM backend for vectorized loops (done) • Instruction scheduling for VIRAM-1 (works, but could be improved) • Insertion of memory barriers (using Cray strategy, improving) • Optimizations for short loops (reduce overhead) • Feedback results to Cray, new version from Cray (ongoing)

  9. IRAM Compiler Update • Study of compiler quality using 100 “Dongarra loops” • 70 vectorized • Average 10x reduction in dynamic instruction count • Average vector length of 42 • 30 did not vectorize, usually due to a dependence • Some reductions missed • Vector version of math libraries (sin, cos, etc.) needed • Some failed due to bugs in benchmark • Identified 2 specific areas for improvements in loop overhead • Use VL and MVL more carefully • Use auto-increment instruction more extensively
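The loop-overhead point about VL and MVL relates to strip-mining: a long loop is processed in chunks of at most MVL (maximum vector length) elements, with the vector length VL set for each chunk. Below is a minimal Python sketch of that idea, not the IRAM compiler's actual output. The MVL value of 32 and the function name `saxpy_strip_mined` are assumptions made for illustration, and list slices stand in for vector instructions.

```python
MVL = 32  # hypothetical maximum vector length, chosen only for illustration


def saxpy_strip_mined(a, x, y, mvl=MVL):
    """Compute a*x + y in strips of at most mvl elements.

    Returns the result and the number of strips, which is a rough proxy
    for the per-strip loop overhead the slide wants to reduce.
    """
    n = len(x)
    out = list(y)
    i = 0
    strips = 0
    while i < n:
        vl = min(mvl, n - i)  # set VL for this strip (last strip may be short)
        # One "vector instruction" worth of work over vl elements.
        out[i:i + vl] = [a * xi + yi
                         for xi, yi in zip(x[i:i + vl], out[i:i + vl])]
        i += vl               # address update (what auto-increment would fold in)
        strips += 1
    return out, strips


result, strips = saxpy_strip_mined(2, list(range(100)), [1] * 100)
```

With 100 elements and an assumed MVL of 32, the loop runs four strips, the last one with VL = 4. Fewer, fuller strips mean less scalar bookkeeping per element.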

  10. Compiled Applications Update • Applications using compiler • Speech processing under development • Developed new small-memory algorithm for speech processing • Uses some existing kernels (FFT and MM) • Vector search algorithm is most challenging • DIS image understanding application under development • Compiles, but does not yet vectorize well • Singular Value Decomposition • Better than 2 VLIW machines (TI C67 and TM 1100) • Challenging BLAS-1,2 work well on IRAM because of memory BW • Kernels • SAXPY, MVM, etc. • Will include DIS stress-marks

  11. (10n x n SVD, rank 10) (From Herman, Loo, Tang, CS252 project)

  12. Hand-Coded Applications Update • Image processing kernels (old FPU model) • Note BLAS-2 performance

  13. Problem: General Element Permutation • Hardware for a full vector permutation instruction (128 16b elements, 256b datapath) • Datapath: 16 x 16 (x 16b) crossbar; scales by O(N^2) • Control: 16 16-to-1 multiplexors; scales by O(N*logN) • Time/energy wasted on wide vector register file port

  14. Simple Vector Permutations • Simple steps of butterfly permutations • A register provides the butterfly radix • Separate instructions for moving elements to left/right • Sufficient semantics for • Fast reductions of vector registers (dot products) • Fast FFT kernels
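A minimal sketch of how butterfly-style steps can give a fast reduction, assuming a power-of-two vector length. At each step the block of elements at offset `radix` is moved left by `radix` positions and added to the lower block, so N elements are summed in log2(N) steps. The function name `butterfly_reduce` is hypothetical; this models only the data movement and is not the VIRAM instruction semantics.

```python
def butterfly_reduce(v):
    """Sum a vector using log2(N) move-left-and-add steps.

    Assumes len(v) is a power of two. Returns (sum, number_of_steps).
    """
    n = len(v)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    work = list(v)
    radix = n // 2
    steps = 0
    while radix >= 1:
        # "Move left by radix": bring elements radix..2*radix-1 down to 0..radix-1.
        moved = work[radix:2 * radix]
        # Element-wise add onto the lower block.
        work[:radix] = [a + b for a, b in zip(work[:radix], moved)]
        radix //= 2
        steps += 1
    return work[0], steps


# Dot product: element-wise multiply, then reduce.
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1, 1, 1, 1, 2, 2, 2, 2]
dot, steps = butterfly_reduce([a * b for a, b in zip(x, y)])
```

The same halving pattern, with the radix supplied per step, is the shape of data exchange an FFT stage needs, which is why the slide lists both uses.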

  15. Hardware for Simple Permutations • Hardware for 128 16b elements, 256b datapath • Datapath: 2 buses, 8 tristate drivers, 4 multiplexors, 4 shifters (by 0, 16b, 32b only); scales by O(N) • Control: 6 control cases; scales by O(N) • Other benefits • Consecutive result elements written together • Buses used only for small radices

  16. FFT: Uses In-Register Permutations [Chart; surviving series label: Without in-register permutations]

  17. Summary • IRAM takes advantage of high on-chip bandwidth • BLAS-2 performance confirms this • Vector IRAM ISA utilizes this bandwidth • Unit, strided, and indexed memory access patterns supported • Exploits fine-grained parallelism, even with pointer chasing • Compiler • Well-understood compiler model, semi-automatic • Still some work on code generation quality • Application benchmarks • Compiled and hand-coded • Include FFT, SVD, MVM, sparse MVM, and other kernels used in image and signal processing
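The three memory access patterns named in the summary can be illustrated with a small sketch. Memory is modeled as a flat Python list and the function names `load_unit`, `load_strided`, and `load_indexed` are hypothetical; they are not VIRAM mnemonics.

```python
def load_unit(mem, base, vl):
    """Unit-stride load: vl consecutive elements starting at base."""
    return [mem[base + i] for i in range(vl)]


def load_strided(mem, base, stride, vl):
    """Strided load: vl elements spaced stride apart."""
    return [mem[base + i * stride] for i in range(vl)]


def load_indexed(mem, base, indices):
    """Indexed load (gather): one element per entry of an index vector."""
    return [mem[base + j] for j in indices]


# A 4x4 row-major matrix stored in flat memory.
mem = list(range(16))
row1 = load_unit(mem, 4, 4)            # one row: consecutive addresses
col1 = load_strided(mem, 1, 4, 4)      # one column: stride equals row length
picked = load_indexed(mem, 0, [3, 0, 10])  # arbitrary elements, as in sparse MVM
```

Dense row access maps to unit stride, column access to a fixed stride, and sparse matrix-vector multiply to indexed loads driven by the column-index array.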

  18. IRAM as Building Block for ISTORE • System-on-a-chip enables computer, memory, redundant network interfaces without significantly increasing size of disk • Target for + 5-7 years: • building block: 2006 MicroDrive integrated with IRAM • 9GB disk, 50 MB/sec disk (projected) • connected via crossbar switch • O(10) Gflops • 10,000+ nodes fit into one rack!
