
IRAM and ISTORE Projects



  1. IRAM and ISTORE Projects Aaron Brown, James Beck, Rich Fromm, Joe Gebis, Paul Harvey, Adam Janin, Dave Judd, Kimberly Keeton, Christoforos Kozyrakis, David Martin, Rich Martin, Thinh Nguyen, David Oppenheimer, Steve Pope, Randi Thomas, Noah Treuhaft, Sam Williams, John Kubiatowicz, Kathy Yelick, and David Patterson http://iram.cs.berkeley.edu/[istore] Fall 1999 DIS DARPA Meeting

  2. ISTORE Hardware Vision • System-on-a-chip enables computer, memory, and redundant network interfaces without significantly increasing the size of the disk • Target for 5-7 years out: • building block: 2006 MicroDrive integrated with IRAM • 9 GB disk, 50 MB/sec from disk • connected via crossbar switch • 10,000+ nodes fit into one rack!

  3. VIRAM: System on a Chip • Prototype scheduled for tape-out 1H 2000 • 0.18 µm embedded DRAM/logic process • 16 MB DRAM, 8 banks • MIPS scalar core and caches @ 200 MHz • 4 64-bit vector unit pipelines (lanes) @ 200 MHz • 4 100 MB/s parallel I/O lines • 17x17 mm, 2 Watts • 25.6 GB/s memory bandwidth (6.4 GB/s per direction and per Xbar) • 1.6 Gflops (64-bit), 6.4 GOPs (16-bit) [Die diagram: CPU+$, 4 vector pipes/lanes, Xbar, I/O, and two memory halves of 64 Mbits / 8 MBytes each]

  4. Intelligent PDA (2003?) • Pilot PDA + Gameboy, cell phone, AM/FM radio, timer, camera, TV remote, garage door opener, ... + wireless data (WWW) + speech and vision recognition + voice output for conversations • Speech control • Vision to see, scan documents, read bar codes, ...

  5. IRAM Update • IBM to supply embedded DRAM/logic (99%) • DRAM macro added to 0.18-micron logic process • DRAM specs under NDA; final agreement soon • Sandcraft to supply scalar core • 64-bit MIPS embedded processor, caches, TLB, FPU • Test chip received from LG Semicon • ISA manual and simulator complete • better fixed-point model and instructions • better support for short vectors • auto-increment memory addressing • instructions for in-register reductions & butterfly permutations • VIRAM-1 tape-out scheduled for 1H 2000 • Writing Verilog for control now • Layout of multiplier and register file nearly complete

  6. IRAM Update • Vectorizing Compiler for VIRAM • preliminary version complete using SUIF • retargeting CRAY/SGI compiler • Scalar codegen validated on commercial suite (~100 tests) • Debug and test of vector instructions underway • Scheduling and memory barriers leverage Cray SV2 work • Speech & video applications & media library underway • Benchmarking results

  7. VIRAM-1 block diagram

  8. Microarchitecture configuration • 2 arithmetic units • both execute integer operations • one executes FP operations • 4 64-bit datapaths (lanes) per unit • 2 flag processing units • for conditional execution and speculation support • 1 load-store unit • optimized for strides 1, 2, 3, and 4 • 4 addresses/cycle for indexed and strided operations • decoupled indexed and strided stores • Memory system • 8 DRAM banks • 256-bit synchronous interface • 1 sub-bank per bank • 16 MBytes total capacity • Peak performance • 3.2 GOPS64, 12.8 GOPS16 (w. madd) • 1.6 GOPS64, 6.4 GOPS16 (w/o madd) • 0.8 GFLOPS64, 3.2 GFLOPS32 (w. madd) • 6.4 GByte/s memory bandwidth
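
As a sanity check, the integer peaks follow from lanes x integer units x clock, with a multiply-add counted as two operations and 16-bit subwords giving 4x per 64-bit lane. A back-of-envelope sketch, not project code:

```c
#include <stdio.h>

/* Back-of-envelope check of the integer peak rates quoted above. */
int main(void) {
    double clk = 200e6;            /* 200 MHz clock            */
    int lanes = 4, int_units = 2;  /* 64-bit lanes, integer ALUs */

    double gops64 = lanes * int_units * clk / 1e9;        /* 1.6  */
    printf("GOPS64 (no madd): %.1f\n", gops64);
    printf("GOPS64 (w/ madd): %.1f\n", gops64 * 2);       /* 3.2  */
    printf("GOPS16 (no madd): %.1f\n", gops64 * 4);       /* 6.4, 4 subwords/lane */
    printf("GOPS16 (w/ madd): %.1f\n", gops64 * 4 * 2);   /* 12.8 */
    return 0;
}
```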

  9. Media Kernel Performance

  10. Baseline system comparison • All numbers in cycles/pixel • MMX and VIS results assume all data in L1 cache

  11. IRAM/VSUIF Decryption (IDEA) • IDEA decryption operates on 16-bit integers • Compiled with IRAM/VSUIF • Note scalability in both number of lanes and data width • Some hand optimizations (unrolling) will be automated by the Cray compiler [Chart axes: # lanes, virtual processor width]
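
For context, the reason IDEA vectorizes so cleanly is that its core operation is multiplication modulo 2^16+1 on 16-bit values. A standard scalar C rendering of that operation (an illustration, not the VSUIF-compiled code):

```c
#include <stdint.h>

/* IDEA multiplication mod 2^16+1, where an operand of 0 encodes
   65536.  The low-minus-high fold reduces mod 2^16+1 without a
   division, keeping everything in 16/32-bit integer arithmetic. */
uint16_t idea_mul(uint16_t a, uint16_t b) {
    if (a == 0) return (uint16_t)(0x10001u - b);
    if (b == 0) return (uint16_t)(0x10001u - a);
    uint32_t p  = (uint32_t)a * b;
    uint16_t lo = (uint16_t)p, hi = (uint16_t)(p >> 16);
    return (uint16_t)(lo - hi + (lo < hi));   /* fold mod 2^16+1 */
}
```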

  12. 1D FFT on IRAM • FFT study on IRAM • bit-reversal time included; cost hidden using indexed store • Faster than DSPs on floating-point (32-bit) FFTs • CRI Pathfinder does 24-bit fixed point, 1K points in 28 usec (2 Watts without SRAM)

  13. 3D FFT on ISTORE 2006 • Performance of large 3D FFTs depends on two factors • speed of the 1D FFT on a single node (next slide) • network bandwidth for “transposing” data • 1.3 Tflop FFT possible with 1K IRAM nodes, if network bisection bandwidth scales (!)
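
The headline figure is just the per-node 1D FFT rate times the node count, valid only under the stated bisection-bandwidth assumption (per-node rate of roughly 1.3-1.4 Gflops, consistent with slide 17):

$$ R_{3D} \approx N_{\mathrm{nodes}} \cdot R_{1D} \approx 1024 \times 1.3\ \mathrm{Gflop/s} \approx 1.3\ \mathrm{Tflop/s} $$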

  14. Scaling to 10K Processors • IRAM + micro-disk offer huge scaling opportunities • Still many hard system problems (SAM: Scalability, Availability, Maintainability) • Scalability • Dynamic scaling with plug-and-play components • Scalable performance, gracefully down as well as up • Machines become heterogeneous in performance at scale • Availability • 24x7 databases without human intervention • Discrete vs. continuous model of machine being up • Maintainability • 42% of system failures are due to administrative errors • self-monitoring, tuning, and repair

  15. ISTORE-1: Hardware for SAM • Hardware: plug-and-play intelligent devices with self-monitoring, diagnostics, and fault-injection hardware • intelligence used to collect and filter monitoring data • diagnostics and fault injection enhance robustness • networked to create a scalable shared-nothing cluster • Scheduled for 4Q 99 and 1Q 2000 • Intelligent Disk “Brick”: disk in half-height canister, portable-PC processor (Pentium II+), DRAM, redundant NICs (4 100 Mb/s links), diagnostic processor • Intelligent Chassis: 80 nodes, 8 per tray • 2 levels of switches: 20 100 Mb/s, 2 1 Gb/s • Environment monitoring: UPS, redundant PS, fans, heat and vibration sensors, ...

  16. ISTORE Software Approach • Two-pronged approach to providing reliability: 1) reactive self-maintenance: dynamic reaction to exceptional system events • self-diagnosing, self-monitoring hardware • software monitoring and problem detection • automatic reaction to detected problems 2) proactive self-maintenance: continuous online self-testing and self-analysis • automatic characterization of system components • in situ fault injection, self-testing, and scrubbing to detect flaky hardware components and to exercise rarely-taken application code paths before they’re used

  17. ISTORE Applications • Storage-intensive, reliable services for ISTORE-1 • infrastructure for “thin clients,” e.g., PDAs • web services • databases, including decision-support • Scalable memory-intensive computations for ISTORE in 2006 • DIS benchmarks • 3D FFT • 1.4 Gflops on IRAM nodes • Electromagnetic scattering (MoM) • Sparse matrix/vector multiply 500/250 Mflops on IRAM nodes • RT-STAP • QR Decomposition currently in use as test case for compiler • Performance estimates through IRAM simulation + model

  18. Performance Heterogeneity • System performance limited by the weakest link • NOW Sort experience: performance heterogeneity is the norm • disks: inner vs. outer track (50%), fragmentation • processors: load (1.5-5x) and heat • Virtual Streams: dynamically off-load I/O work from slower disks to faster ones

  19. ISTORE-1: Prototype Hardware • Hardware: plug-and-play intelligent devices with self-monitoring, diagnostics, and fault-injection hardware • intelligence used to collect and filter monitoring data • diagnostics and fault injection enhance robustness • networked to create a scalable shared-nothing cluster • Scheduled for 4Q 99 and 1Q 2000 • Intelligent Disk “Brick”: disk in half-height canister, portable-PC processor (Pentium II+), DRAM, redundant NICs (4 100 Mb/s links), diagnostic processor • Intelligent Chassis: 80 nodes, 8 per tray • 2 levels of switches: 20 100 Mb/s, 2 1 Gb/s • Environment monitoring: UPS, redundant PS, fans, heat and vibration sensors, ...

  20. ISTORE Brick Block Diagram [Diagram: Mobile Pentium II module with CPU, North Bridge, South Bridge, DRAM (256 MB), and BIOS; SCSI disk (18 GB); 4x100 Mb/s Ethernets on PCI; Super I/O; dual UART; diagnostic processor with Flash, RAM, and RTC on a diagnostic net for monitor & control] • Sensors for heat and vibration • Control over power to individual nodes

  21. Conclusion • IRAM attractive for two Post-PC applications because of low power, small size, and high memory bandwidth • Mobile consumer electronic devices • Scalable infrastructure • IRAM benchmarking result: faster than DSPs • ISTORE: hardware/software architecture for single-purpose, introspective storage • Scaling systems requires • new continuous models of availability • performance not limited by the weakest link • self-* systems to reduce human interaction

  22. Backup Slides

  23. ISTORE-1 System Layout [Layout diagram: eight brick shelves]

  24. V-IRAM1: 0.18 µm, Fast Logic, 200 MHz, 1.6 GFLOPS (64b) / 6.4 GOPS (16b) / 32 MB [Block diagram: 2-way superscalar scalar processor with 16K I-cache, 16K D-cache, and instruction queue; vector registers feeding +, x, and ÷ vector units configurable as 4 x 64, 8 x 32, or 16 x 16; load/store unit; memory crossbar switch connecting rows of memory banks (M) over 4 x 64 datapaths; I/O lines at 100 MB each]

  25. Fixed-point multiply-add model • Same basic model, different set of instructions • fixed-point: multiply & shift & round, shift right & round, shift left & saturate • integer saturated arithmetic: add or sub & saturate • added multiply-add instruction for improved performance and energy consumption [Datapath diagram: multiply n-bit half-words x and y, round and shift the product right by n/2, add w, and saturate to produce the n-bit result a]
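
A scalar C sketch of that datapath (a minimal illustration; the operand width, rounding point, and saturation bounds here are assumptions, not taken from the VIRAM ISA manual):

```c
#include <stdint.h>

/* Fixed-point multiply-add sketch for 16-bit elements: multiply,
   shift the 32-bit product right by `shift` with rounding, add the
   accumulator w, and saturate back to 16 bits (shift >= 1 assumed). */
static int16_t fx_madd16(int16_t x, int16_t y, int16_t w, int shift) {
    int32_t p = (int32_t)x * (int32_t)y;        /* full product  */
    p = (p + (1 << (shift - 1))) >> shift;      /* shift & round */
    int32_t s = p + w;                          /* accumulate    */
    if (s > INT16_MAX) s = INT16_MAX;           /* saturate high */
    if (s < INT16_MIN) s = INT16_MIN;           /* saturate low  */
    return (int16_t)s;
}
```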

  26. Other ISA modifications • Auto-increment loads/stores • a vector load/store can post-increment its base address • added base (16), stride (8), and increment (8) registers • necessary for applications with short vectors or scaled-up implementations • Butterfly permutation instructions • perform step of a butterfly permutation within a vector register • used for FFT and reduction operations • Miscellaneous instructions added • min and max instructions (integer and FP) • FP reciprocal and reciprocal square root
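
To illustrate where auto-increment addressing pays off, here is the kind of strided loop in question (plain C; the vectorizer, not the programmer, would emit the auto-increment vector forms):

```c
#include <stdint.h>

/* A strided copy that strip-mines into many vector loads.  With
   auto-increment addressing, each strided vector load bumps its own
   base register by VL*stride, instead of costing a scalar address
   update per strip -- the win is largest for short vectors. */
void copy_strided(const int16_t *src, int16_t *dst, long n, long stride) {
    for (long i = 0; i < n; i++)
        dst[i] = src[i * stride];
}
```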

  27. Major architecture updates • Integer arithmetic units support multiply-add instructions • 1 load-store unit • complexity vs. benefit • Optimized for strides 2, 3, and 4 • useful for complex arithmetic and image-processing functions • Decoupled strided and indexed stores • memory stalls due to bank conflicts do not stall the arithmetic pipelines • allows scheduling of independent arithmetic operations in parallel with stores that experience many stalls • implemented with address, not data, buffering • currently examining a similar optimization for loads

  28. Micro-kernel results: simulated systems • Note: simulations performed with 2 load-store units and without decoupled stores or optimizations for strides 2, 3, and 4

  29. Micro-kernels • Vectorization and scheduling performed manually

  30. Scaled system results • Near-linear speedup for all applications apart from iDCT • iDCT bottlenecks • large number of bank conflicts • only 4 addresses/cycle for strided accesses

  31. iDCT scaling with sub-banks • Sub-banks reduce bank conflicts and increase performance • Alternative (but not as effective) ways to reduce conflicts: • different memory layout • different address interleaving schemes
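
A toy model of the conflict problem (bank count taken from slide 8; interleave granularity and sub-bank count are assumptions, not the VIRAM memory controller): the bank is a simple function of the address, so a stride that is a multiple of the bank period hits one bank on every access unless sub-banks or a different interleave absorb it.

```c
/* Toy address-to-bank mapping.  With 8 banks and a 32-byte
   interleave unit, the bank pattern repeats every 256 bytes; a
   256-byte stride (e.g., walking a column of a 256-byte-pitch
   array) therefore hits the same bank on every access.  Sub-banks
   add a second level over which such accesses can be overlapped. */
#define NBANKS    8
#define GRAIN     32u   /* bytes per interleave unit (assumed) */
#define NSUBBANKS 4     /* sub-banks per bank (assumed)        */

unsigned bank_of(unsigned addr)    { return (addr / GRAIN) % NBANKS; }
unsigned subbank_of(unsigned addr) { return (addr / GRAIN / NBANKS) % NSUBBANKS; }
```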

  32. Compiling for VIRAM • Long-term success of DIS technology depends on simple programming model, i.e., a compiler • Needs to handle significant class of applications • IRAM: multimedia, graphics, speech and image processing • ISTORE: databases, signal processing, other DIS benchmarks • Needs to utilize hardware features for performance • IRAM: vectorization • ISTORE: scalability of shared-nothing programming model

  33. IRAM Compilers • IRAM/Cray vectorizing compiler [Judd] • Production compiler • Used on the T90 and C90, as well as the T3D and T3E • Being ported (by SGI/Cray) to the SV2 architecture • Has C, C++, and Fortran front-ends (focus on C) • Extensive vectorization capability • outer-loop vectorization, scatter/gather, short loops, … • VIRAM port is under way • IRAM/VSUIF vectorizing compiler [Krashinsky] • Based on VSUIF from Corinna Lee’s group at Toronto, which is based on MachineSUIF from Mike Smith’s group at Harvard, which is in turn based on the SUIF compiler from Monica Lam’s group at Stanford • A “research” compiler, not intended for compiling large, complex applications • Working since 5/99

  34. IRAM/Cray Compiler Status [Diagram: C, C++, and Fortran frontends feed the PDGCS vectorizer, which feeds code generators for the C90 and IRAM] • MIPS backend developed this year • Validated using a commercial test suite for code generation • Vector backend recently started • Testing with simulator under way • Leveraging from Cray • Automatic vectorization

  35. VIRAM/VSUIF Matrix-Vector Multiply • VIRAM/VSUIF does reasonably well on long loops • 256x256 single-precision matrix • Compare to 1600 Mflop/s (peak without multadd) • Note BLAS-2 (little reuse) • ~350 Mflop/s on Power3 and EV6 • Problems specific to VSUIF • hand strip-mining results in short loops • reductions • no multadd support [Chart: mvm and vmm performance]
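
For reference, the measured kernel is the classic BLAS-2 loop nest (a plain C sketch): the inner loop is the long vector loop, and its accumulation is exactly the reduction VSUIF handles poorly.

```c
/* y = A*x, single precision, row-major A.  The j loop vectorizes as
   a long vector loop ending in a sum reduction; vmm is the same
   nest with the loops interchanged. */
void mvm(int n, const float *A, const float *x, float *y) {
    for (int i = 0; i < n; i++) {
        float sum = 0.0f;
        for (int j = 0; j < n; j++)
            sum += A[i * n + j] * x[j];   /* reduction over j */
        y[i] = sum;
    }
}
```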

  36. Reactive Self-Maintenance • ISTORE defines a layered system model for monitoring and reaction [Layer diagram, top to bottom: reaction mechanisms, provided by the application; the ISTORE API; coordination of reaction, policies, problem detection, and SW monitoring, provided by the ISTORE runtime system; self-monitoring hardware] • ISTORE API defines interface between runtime system and app reaction mechanisms • Policies define system’s monitoring, detection, and reaction behavior

  37. Proactive Self-Maintenance • Continuous online self-testing of HW and SW • detects flaky, failing, or buggy components via: • fault injection: triggering hardware and software error handling paths to verify their integrity/existence • stress testing: pushing HW/SW components past normal operating parameters • scrubbing: periodic restoration of potentially “decaying” hardware or software state • automates preventive maintenance • Dynamic HW/SW component characterization • used to adapt to heterogeneous hardware and behavior of application software components
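
A minimal sketch of the scrubbing idea (illustrative C, not ISTORE code): periodically re-verify stored state against a recorded checksum so latent decay surfaces during maintenance rather than during use.

```c
#include <stdint.h>
#include <stddef.h>

/* Toy scrubber: recompute a checksum over a region and compare it
   with the value recorded when the data was written.  A mismatch
   flags "decaying" state for repair before it is actually needed. */
static uint32_t checksum(const uint8_t *p, size_t n) {
    uint32_t s = 0;
    while (n--) s = s * 31u + *p++;
    return s;
}

int scrub_ok(const uint8_t *region, size_t n, uint32_t recorded) {
    return checksum(region, n) == recorded;   /* 0 => schedule repair */
}
```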

  38. ISTORE-0 Prototype and Plans • ISTORE-0: testbed for early experimentation with ISTORE research ideas • Hardware: cluster of 6 PCs • intended to model ISTORE-1 using COTS components • nodes interconnected using ISTORE-1 network fabric • custom fault-injection hardware on subset of nodes • Initial research plans • runtime system software • fault injection • scalability, availability, maintainability benchmarking • applications: block storage server, database, FFT

  39. Runtime System Software • Demonstrate simple policy-driven adaptation • within context of a single OS and application • software monitoring information collected and processed in realtime • e.g., health & performance parameters of OS, application • problem detection and coordination of reaction • controlled by a stock set of configurable policies • application-level adaptation mechanisms • invoked to implement reaction • Use experience to inform ISTORE API design • Investigate reinforcement learning as a technique to infer appropriate reactions from goals
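
A skeleton of such a policy-driven monitor/react loop (all names and the threshold are hypothetical):

```c
/* Monitor -> detect -> react skeleton.  sample() stands in for the
   realtime monitoring feed; offload_io() stands in for an
   application-level adaptation mechanism triggered by a stock,
   configurable policy. */
struct metrics { double cpu_load; double disk_latency_ms; };

void policy_loop(struct metrics (*sample)(void), void (*offload_io)(void)) {
    for (;;) {
        struct metrics m = sample();
        if (m.disk_latency_ms > 50.0)   /* hypothetical policy threshold */
            offload_io();
        /* sleep between samples elided */
    }
}
```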

  40. Record-breaking performance is not the common case • NOW-Sort records demonstrate peak performance • But perturb just 1 of 8 nodes and...

  41. Virtual Streams: Dynamic load balancing for I/O [Diagram: processes reach disks through Virtual Streams software and a per-disk arbiter] • Replicas of data serve as second sources • Maintain a notion of each process’s progress • Arbitrate use of disks to ensure equal progress • The right behavior, but what mechanism?

  42. Graduated Declustering: A Virtual Streams implementation • Clients send progress, servers schedule in response [Diagram: 4 clients read from 4 servers holding replicated data. Before slowdown, every server delivers B (B/2 to each of two clients) and every client receives B. After Server1 slows to B/2, the lost bandwidth is made up from the replicas (per-stream shares shift to B/4, 3B/8, 5B/8, ...), and every client still receives 7B/8]
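
One plausible C rendering of the arbitration rule (a sketch, not the NOW implementation): direct each next read to the replica that has delivered the least so far, so a slow disk's deficit is spread across its mirrors and all clients progress together.

```c
#include <stddef.h>

/* Progress-based replica selection: read next from the replica that
   has served the fewest bytes, draining faster disks harder. */
struct replica { double bytes_served; int alive; };

size_t pick_replica(const struct replica *r, size_t n) {
    size_t best = 0;
    for (size_t i = 1; i < n; i++)
        if (r[i].alive &&
            (!r[best].alive || r[i].bytes_served < r[best].bytes_served))
            best = i;
    return best;
}
```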

  43. Read Performance: Multiple Slow Disks

  44. Storage Priorities: Research vs. Users • Traditional research priorities: 1) Performance 1') Cost 3) Scalability 4) Availability 5) Maintainability • ISTORE priorities: 1) Maintainability 2) Availability 3) Scalability 4) Performance 5) Cost • Performance, cost, and scalability are easy to measure; availability and maintainability are hard to measure

  45. Intelligent Storage Project Goals • ISTORE: a hardware/software architecture for building scalable, self-maintaining storage • An introspective system: it monitors itself and acts on its observations • Self-maintenance: does not rely on administrators to configure, monitor, or tune the system

  46. Self-maintenance • Failure management • devices must fail fast without interrupting service • predict failures and initiate replacement • failures should not require immediate human intervention • System upgrades and scaling • new hardware automatically incorporated without interruption • new devices immediately improve performance or repair failures • Performance management • system must adapt to changes in workload or access patterns

  47. ISTORE-I: 2H99 • Intelligent disk • Portable PC Hardware: Pentium II, DRAM • Low Profile SCSI Disk (9 to 18 GB) • 4 100-Mbit/s Ethernet links per node • Placed inside Half-height canister • Monitor Processor/path to power off components? • Intelligent Chassis • 64 nodes: 8 enclosures, 8 nodes/enclosure • 64 x 4 or 256 Ethernet ports • 2 levels of Ethernet switches: 14 small, 2 large • Small: 20 100-Mbit/s + 2 1-Gbit; Large: 25 1-Gbit • Just for prototype; crossbar chips for real system • Enclosure sensing, UPS, redundant PS, fans, ...

  48. Disk Limit • Continued advance in capacity (60%/yr) and bandwidth (40%/yr) • Slow improvement in seek, rotation (8%/yr) • Time to read whole disk: 1990: 4 minutes sequentially, 6 hours randomly (1 sector/seek); 1999: 35 minutes sequentially, 1 week(!) randomly • Does the 3.5” form factor make sense in 5-7 years?
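
The 1999 row reproduces from plausible drive parameters of the day (assumed here: an 18 GB disk, ~9 MB/s sustained transfer, and ~12 ms per random 512-byte sector including seek and rotation):

```c
#include <stdio.h>

/* Back-of-envelope check of "time to read the whole disk" for 1999
   (all drive parameters assumed, see lead-in above). */
int main(void) {
    double capacity = 18e9;     /* bytes                   */
    double seq_bw   = 9e6;      /* bytes/s, sustained      */
    double sector   = 512.0;    /* bytes per random access */
    double t_access = 0.012;    /* s per seek + rotation   */

    printf("sequential: %.0f minutes\n", capacity / seq_bw / 60.0);
    printf("random:     %.1f days (~1 week)\n",
           capacity / sector * t_access / 86400.0);
    return 0;
}
```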

  49. Related Work • ISTORE adds to several recent research efforts • Active Disks, NASD (UCSB, CMU) • Network service appliances (NetApp, Snap!, Qube, ...) • High availability systems (Compaq/Tandem, ...) • Adaptive systems (HP AutoRAID, M/S AutoAdmin, M/S Millennium) • Plug-and-play system construction (Jini, PC Plug&Play, ...)

  50. Other (Potential) Benefits of ISTORE • Scalability: add processing power, memory, and network bandwidth as disks are added • Smaller footprint vs. traditional server/disk • Less power • embedded processors vs. servers • spin down idle disks? • For decision-support or web-service applications, potentially better performance than traditional servers
