
Berkeley NOW Project


Presentation Transcript


  1. Berkeley NOW Project David E. Culler culler@cs.berkeley.edu http://now.cs.berkeley.edu/ Sun Visit May 1, 1998

  2. Project Goals • Make a fundamental change in how we design and construct large-scale systems • market reality: • 50%/year performance growth => cannot allow 1-2 year engineering lag • technological opportunity: • single-chip “Killer Switch” => fast, scalable communication • Highly integrated building-wide system • Explore novel system design concepts in this new “cluster” paradigm

  3. Remember the “Killer Micro” • [Chart: Linpack peak performance over time] • Technology change in all markets • At many levels: Arch, Compiler, OS, Application

  4. Another Technological Revolution • The “Killer Switch” • single chip building block for scalable networks • high bandwidth • low latency • very reliable • if it’s not unplugged => System Area Networks

  5. One Example: Myrinet • 8 bidirectional ports of 160 MB/s each way • < 500 ns routing delay • Simple - just moves the bits • Detects connectivity and deadlock • Tomorrow: gigabit Ethernet?

  6. Potential: Snap together large systems • incremental scalability • time / cost to market • independent failure => availability • [Chart: node performance in a large system vs. engineering lag time]

  7. Opportunity: Rethink O.S. Design • Remote memory and processor are closer to you than your own disks! • Networking Stacks ? • Virtual Memory ? • File system design ?

  8. Example: Traditional File System • Server resources at a premium • Client resources poorly utilized • [Diagram: clients, each with a local private file cache, connect over a fast channel (HPPI) to a server holding the global shared file cache and RAID disk storage; the server is the bottleneck] • Expensive • Complex • Non-Scalable • Single point of failure

  9. Truly Distributed File System • [Diagram: every node's processor and file cache sit on a scalable low-latency communication network; cluster caching spans the local caches, with network RAID striping across the nodes' disks] • VM: page to remote memory • G = Node Comm BW / Disk BW
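To make the G ratio concrete, here is a hedged back-of-the-envelope figure; the per-disk bandwidth is an assumed round number for illustration (not a value from the slides), while the 160 MB/s link rate comes from the Myrinet slide:

    G = \frac{\text{Node Comm BW}}{\text{Disk BW}} \approx \frac{160\ \text{MB/s}}{10\ \text{MB/s}} = 16

Under that assumption, one node's network link could in principle absorb the striped bandwidth of roughly 16 remote disks before the link, rather than the disks, becomes the limit.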

  10. Fast Communication Challenge • Fast processors and fast networks • The time is spent in crossing between them • [Diagram: on each node of the “Killer Platform”, communication software sits above network interface hardware on the path to the “Killer Switch”; time scales along the path are marked in ns, µs, and ms]

  11. Opening: Intelligent Network Interfaces • Dedicated processing power and storage embedded in the network interface • An I/O card today • Tomorrow on chip? • [Diagram: Sun Ultra 170 nodes, each with processor, cache, and memory, attach a Myricom NIC with its own memory to the I/O bus (S-Bus, 50 MB/s), which connects to the 160 MB/s Myricom network]

  12. Our Attack: Active Messages • Request / Reply small active messages (RPC) • Bulk-Transfer (store & get) • Highly optimized communication layer on a range of HW • [Diagram: a request message runs a request handler at the receiver; its reply runs a reply handler back at the sender]
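The request/reply handler structure on this slide can be sketched in a few lines of C. This is only a hedged illustration of the model, not the Berkeley AM API: the names (endpoint_t, am_send, the handler table) are invented here, and delivery is simulated by a direct call so the example is self-contained.

    /* Hedged, self-contained sketch of the Active Messages request/reply
     * style. Every name here is invented for illustration; a real layer
     * would dispatch handlers on message arrival at the NIC or poll loop. */
    #include <stdio.h>

    #define MAX_HANDLERS    8
    #define REQUEST_HANDLER 0
    #define REPLY_HANDLER   1

    typedef struct endpoint endpoint_t;
    /* A handler receives the endpoint to reply to (NULL for a reply) and
     * one small word of argument, mirroring the small-message model. */
    typedef void (*am_handler_t)(endpoint_t *reply_to, int arg);

    struct endpoint {
        const char  *name;
        am_handler_t handlers[MAX_HANDLERS];   /* handler table at this endpoint */
    };

    /* "Send" a small active message: on arrival the named handler runs
     * at the destination endpoint. */
    static void am_send(endpoint_t *dst, int handler_ix, int arg, endpoint_t *reply_to) {
        dst->handlers[handler_ix](reply_to, arg);
    }

    /* Runs at the requester when the reply arrives. */
    static void reply_handler(endpoint_t *reply_to, int value) {
        (void)reply_to;
        printf("reply handler: result = %d\n", value);
    }

    /* Runs at the server when the request arrives: do a little work and
     * immediately fire a reply back at the requester's reply handler. */
    static void request_handler(endpoint_t *reply_to, int arg) {
        printf("request handler: arg = %d\n", arg);
        am_send(reply_to, REPLY_HANDLER, arg * 2, NULL);
    }

    int main(void) {
        endpoint_t client = { "client", { NULL } };
        endpoint_t server = { "server", { NULL } };
        server.handlers[REQUEST_HANDLER] = request_handler;
        client.handlers[REPLY_HANDLER]   = reply_handler;

        /* RPC-style round trip: the request runs a handler there,
         * the reply runs a handler here. */
        am_send(&server, REQUEST_HANDLER, 21, &client);
        return 0;
    }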

  13. NOW System Architecture • [Layered diagram, top to bottom: Parallel Apps and Large Seq. Apps; programming interfaces (Sockets, Split-C, MPI, HPF, vSM); Global Layer UNIX providing Process Migration, Distributed Files, Network RAM, and Resource Management; UNIX workstations, each with communication software over network interface hardware; Fast Commercial Switch (Myrinet)]

  14. Outline • Introduction to the NOW project • Quick tour of the NOW lab • Important new system design concepts • Conclusions • Future Directions

  15. First HP/fddi Prototype • FDDI on the HP/735 graphics bus • First fast message layer on an unreliable network

  16. SparcStation ATM NOW • ATM was going to take over the world • The original Inktomi; today: www.hotbot.com

  17. 100 node Ultra/Myrinet NOW

  18. Massive Cheap Storage • Basic unit: 2 PCs double-ending four SCSI chains • Currently serving Fine Art at http://www.thinker.org/imagebase/

  19. Cluster of SMPs (CLUMPS) • Four Sun E5000s • 8 processors • 3 Myricom NICs • Multiprocessor, Multi-NIC, Multi-Protocol

  20. Information Servers • Basic Storage Unit: • Ultra 2, 300 GB RAID, 800 GB tape stacker, ATM • scalable backup/restore • Dedicated Info Servers • web, • security, • mail, … • VLANs project into dept.

  21. What’s Different about Clusters? • Commodity parts? • Communications Packaging? • Incremental Scalability? • Independent Failure? • Intelligent Network Interfaces? • Complete System on every node • virtual memory • scheduler • files • ...

  22. Three important system design aspects • Virtual Networks • Implicit co-scheduling • Scalable File Transfer

  23. Communication Performance => Direct Network Access • [Chart: latency and 1/BW components of message cost] • LogP: Latency, Overhead, and Bandwidth • Active Messages: lean layer supporting programming models
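As a reminder of what the LogP parameters charge (this expansion is standard for the model rather than something spelled out on the slide): L is the network latency, o the per-message send/receive overhead on the processor, g the gap between messages (the reciprocal of per-processor bandwidth), and P the number of processors. A small message then costs roughly

    T_{\text{one-way}} \approx o_{\text{send}} + L + o_{\text{recv}} = L + 2o, \qquad T_{\text{round-trip}} \approx 2(L + 2o)

while sustained bandwidth is limited to one message per gap g.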

  24. Example: NAS Parallel Benchmarks • Better node performance than the Cray T3D • Better scalability than the IBM SP-2

  25. General purpose requirements • Many timeshared processes • each with direct, protected access • User and system • Client/Server, Parallel clients, parallel servers • they grow, shrink, handle node failures • Multiple packages in a process • each may have own internal communication layer • Use communication as easily as memory

  26. Virtual Networks • Endpoint abstracts the notion of “attached to the network” • Virtual network is a collection of endpoints that can name each other. • Many processes on a node can each have many endpoints, each with its own protection domain.

  27. How are they managed? • How do you get direct hardware access for performance with a large space of logical resources? • Just like virtual memory • active portion of large logical space is bound to physical resources • [Diagram: endpoints of processes 1..n live in host memory; the network interface binds the active ones into NIC memory]
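The virtual-memory analogy can be sketched as a tiny endpoint-binding loop. Everything below is an assumption made for illustration (the slot counts, the round-robin replacement, the function names); it is not the NOW driver, just the idea of a few physical NIC slots backing many logical endpoints.

    /* Hedged sketch: many logical endpoints, few physical NIC slots,
     * managed by analogy with virtual memory page replacement. */
    #include <stdio.h>
    #include <string.h>

    #define NIC_SLOTS      2     /* physical endpoint slots on the NIC */
    #define MAX_ENDPOINTS  8     /* logical endpoints across processes */

    typedef struct {
        int id;
        int resident;            /* bound to a NIC slot right now?     */
        int slot;                /* which slot, if resident            */
    } endpoint_t;

    static endpoint_t eps[MAX_ENDPOINTS];
    static int slot_owner[NIC_SLOTS];   /* endpoint id per slot, -1 if free */
    static int victim = 0;              /* trivial round-robin replacement  */

    /* Touch an endpoint before communicating on it. If it is not resident,
     * evict some other endpoint's state to host memory and bind this one,
     * just as a VM system binds an active page to a physical frame. */
    static void ep_bind(int id) {
        if (eps[id].resident) return;          /* fast path: already on NIC */
        int s = victim;
        victim = (victim + 1) % NIC_SLOTS;
        if (slot_owner[s] >= 0) {              /* evict the current owner */
            eps[slot_owner[s]].resident = 0;
            printf("evict endpoint %d from slot %d\n", slot_owner[s], s);
        }
        slot_owner[s] = id;
        eps[id].resident = 1;
        eps[id].slot = s;
        printf("bind  endpoint %d to slot %d\n", id, s);
    }

    int main(void) {
        memset(slot_owner, -1, sizeof slot_owner);
        for (int i = 0; i < MAX_ENDPOINTS; i++) eps[i].id = i;

        /* The active working set of endpoints stays bound; others fault in. */
        int refs[] = { 0, 1, 0, 1, 5, 0, 5, 1 };
        for (int i = 0; i < 8; i++) ep_bind(refs[i]);
        return 0;
    }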

  28. Solaris System Abstractions • Segment Driver • manages portions of an address space • Device Driver • manages an I/O device • => Virtual Network Driver

  29. Virtualization is not expensive

  30. Bursty Communication among many virtual networks • [Chart: message-burst/work pattern among client and server processes on multiple virtual networks]

  31. Sustain high BW with many VN

  32. Perspective on Virtual Networks • Networking abstractions are vertical stacks • new function => new layer • poke through for performance • Virtual Networks provide a horizontal abstraction • basis for building new, fast services

  33. Beyond the Personal Supercomputer • Able to timeshare parallel programs • with fast, protected communication • Mix with sequential and interactive jobs • Use fast communication in OS subsystems • parallel file system, network virtual memory, … • Nodes have powerful, local OS scheduler • Problem: local schedulers do not know to run parallel jobs in parallel

  34. Local Scheduling • Local schedulers act independently • no global control • Program waits while trying to communicate with peers that are not running • 10 - 100x slowdowns for fine-grain programs! => need coordinated scheduling

  35. Traditional Solution: Gang Scheduling • Global context switch according to precomputed schedule • Inflexible, inefficient, fault prone

  36. Novel Solution: Implicit Coscheduling • [Diagram: explicit gang scheduling (GS) contrasted with independent local schedulers (LS) running applications (A)] • Coordinate schedulers using only the communication in the program • very easy to build • potentially very robust to component failures • inherently “service on-demand” • scalable • Local service component can evolve.

  37. Why it works • [Timeline diagram: on workstations WS 1-4, Job A issues a request while Job B competes for the CPU; a prompt response keeps the requester spinning, a late one sends it to sleep] • Infer non-local state from local observations • React to maintain coordination • fast response => partner scheduled => spin • delayed response => partner not scheduled => block
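The observation/action rule on this slide is essentially a two-phase "spin, then block" wait. The sketch below is a hedged illustration only: the spin threshold, the stubbed message check, and the blocking mechanism are assumptions for the example, not the NOW implementation.

    /* Hedged sketch of spin-then-block waiting for implicit coscheduling. */
    #define _POSIX_C_SOURCE 200112L
    #include <stdio.h>
    #include <time.h>

    /* Placeholder for "has the reply arrived yet?"; a real system would
     * poll the network endpoint. Stubbed to 0 so the example is standalone. */
    static int reply_arrived(void) { return 0; }

    /* Block until woken by a message; stubbed with a short sleep here. */
    static void block_until_message(void) {
        struct timespec ts = { 0, 2 * 1000 * 1000 };  /* 2 ms */
        nanosleep(&ts, NULL);
    }

    /* Wait for a reply: spin for roughly one expected round-trip, because
     * a fast response implies the partner is scheduled; if the response is
     * delayed, conclude the partner is not running and give up the CPU. */
    static void wait_for_reply(long spin_ns) {
        struct timespec start, now;
        clock_gettime(CLOCK_MONOTONIC, &start);
        for (;;) {
            if (reply_arrived()) return;           /* fast response: keep spinning */
            clock_gettime(CLOCK_MONOTONIC, &now);
            long elapsed = (now.tv_sec - start.tv_sec) * 1000000000L
                         + (now.tv_nsec - start.tv_nsec);
            if (elapsed > spin_ns) {               /* delayed response: partner   */
                printf("no reply after %ld ns: blocking\n", elapsed);
                block_until_message();             /* not scheduled, so block     */
                return;
            }
        }
    }

    int main(void) {
        wait_for_reply(50 * 1000);   /* spin ~50 us (assumed round-trip bound) */
        return 0;
    }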

  38. Example: Synthetic Programs • Range of granularity and load imbalance • [Chart: spin wait shows 10x slowdown]

  39. Implicit Coordination • Surprisingly effective • real programs • range of workloads • simple and robust • Opens many new research questions • fairness • How broadly can implicit coordination be applied in the design of cluster subsystems?

  40. A look at Serious File I/O • Traditional I/O system vs. NOW I/O system • [Diagram: a single processor-memory node with attached storage vs. many processor-memory (P-M) nodes, each with its own disks] • Benchmark problem: sort a large number of 100-byte records with 10-byte keys • start on disk, end on disk • accessible as files (use the file system) • Datamation sort: 1 million records • Minute sort: quantity sorted in a minute
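For scale, the Datamation data volume follows directly from the record count and size:

    10^6\ \text{records} \times 100\ \text{B/record} = 100\ \text{MB}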

  41. World-Record Disk-to-Disk Sort • Sustain 500 MB/s disk bandwidth and 1,000 MB/s network bandwidth

  42. Key Implementation Techniques • Performance isolation: highly tuned local disk-to-disk sort • manage local memory • manage disk striping • memory-mapped I/O with madvise, buffering • manage overlap with threads • Efficient communication • completely hidden under disk I/O • competes for I/O bus bandwidth • Self-tuning software • probe available memory, disk bandwidth, trade-offs
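As a hedged illustration of the memory-mapped I/O bullet, the sketch below maps a file and advises the VM system of sequential access. The file name and the record scan are assumptions for the example; the sort's real buffering, striping, and thread overlap are not shown.

    /* Hedged sketch: memory-mapped sequential scan of 100-byte records
     * using mmap + madvise(MADV_SEQUENTIAL). */
    #define _DEFAULT_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        const char *path = "records.dat";          /* hypothetical input file */
        int fd = open(path, O_RDONLY);
        if (fd < 0) { perror("open"); return 1; }

        struct stat st;
        if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

        /* Map the whole file and tell the VM system we will read it
         * sequentially, so it can read ahead and drop pages behind us. */
        char *base = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }
        if (madvise(base, st.st_size, MADV_SEQUENTIAL) < 0) perror("madvise");

        /* Stream over 100-byte records, touching the 10-byte key of each. */
        const size_t rec = 100, key = 10;
        unsigned long checksum = 0;
        for (off_t off = 0; off + (off_t)rec <= st.st_size; off += rec)
            for (size_t k = 0; k < key; k++)
                checksum += (unsigned char)base[off + k];

        printf("scanned %lld records, key checksum %lu\n",
               (long long)(st.st_size / (off_t)rec), checksum);
        munmap(base, st.st_size);
        close(fd);
        return 0;
    }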

  43. Towards a Cluster File System • Remote disk system built on a virtual network • [Diagram: a client linked with RDlib talks to the remote-disk (RD) server over active messages]

  44. Conclusions • Fast, simple Cluster Area Networks are a technological breakthrough • Complete system on every node makes clusters a very powerful architecture. • Extend the system globally • virtual memory systems, • schedulers, • file systems, ... • Efficient communication enables new solutions to classic systems challenges.

  45. Millennium Computational Community • [Diagram: departmental clusters (Business, SIMS, BMRC, Chemistry, C.S., E.E., Biology, Astro, NERSC, M.E., Physics, N.E., Math, IEOR, Transport, Economy, C.E., MSME) linked by Gigabit Ethernet]

  46. Millennium PC Clumps • Inexpensive, easy to manage Cluster • Replicated in many departments • Prototype for very large PC cluster

  47. Proactive Infrastructure • Information appliances • Stationary desktops • Scalable servers
