
NOW and Beyond


Presentation Transcript


1. NOW and Beyond
Workshop on Clusters and Computational Grids for Scientific Computing
David E. Culler, Computer Science Division, Univ. of California, Berkeley
http://now.cs.berkeley.edu/

2. NOW Project Goals
• Make a fundamental change in how we design and construct large-scale systems
• Market reality: 50%/year performance growth => cannot allow a 1-2 year engineering lag
• Technological opportunity: the single-chip “Killer Switch” => fast, scalable communication
• Highly integrated, building-wide system
• Explore novel system design concepts in this new “cluster” paradigm

3. Berkeley NOW
• 100 Sun UltraSPARCs, 200 disks
• Myrinet SAN: 160 MB/s
• Fast communication: AM, MPI, ...
• Ethernet/ATM switched external network
• Global OS
• Self-configuring

4. Landmarks
• Top 500 Linpack performance list
• MPI and NPB performance on par with MPPs
• RSA 40-bit key challenge
• World-leading external sort
• Inktomi search engine
• NPACI resource site

5. Taking Stock
• Surprising successes
  • virtual networks
  • implicit co-scheduling
  • reactive I/O
  • service-based applications
  • automatic network mapping
• Surprising failures
  • global system layer
  • xFS file system
• New directions for Millennium
  • paranoid construction
  • computational economy
  • smart clients

6. Fast Communication
• Fast communication on clusters is obtained through direct access to the network, as on MPPs
• The challenge is to make this general purpose: the system implementation should not dictate how it can be used

7. Virtual Networks
• An endpoint abstracts the notion of being “attached to the network”
• A virtual network is a collection of endpoints that can name each other
• Many processes on a node can each have many endpoints, each with its own protection domain

8. How are they managed?
• How do you get direct hardware access for performance with a large space of logical resources?
• Just like virtual memory: the active portion of a large logical space is bound to physical resources
(Figure: host memory holding endpoints for processes 1..n, with a NIC memory and processor behind the network interface)

9. Network Interface Support
• The NIC has endpoint frames
• It services active endpoints
• It signals misses to the driver, using a system endpoint
(Figure: NIC frames 0..7 with transmit/receive queues; endpoint misses are raised to the driver)
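To illustrate the virtual-memory analogy of slides 8-9, here is a minimal sketch of binding a large logical endpoint space onto a small set of NIC frames on demand, evicting a victim on a miss. This is not the NOW/AM-II code; all names, sizes, and the LRU policy are illustrative assumptions.

/* Hypothetical sketch: bind logical endpoints to a few NIC frames on demand,
 * analogous to virtual memory. Names and sizes are illustrative only. */
#include <stdio.h>

#define NUM_FRAMES    8           /* physical endpoint frames on the NIC */
#define NUM_ENDPOINTS 256         /* large logical endpoint space        */

typedef struct {
    int bound_frame;              /* -1 if not resident on the NIC       */
    unsigned long last_use;       /* for a simple LRU victim choice      */
} endpoint_t;

static endpoint_t eps[NUM_ENDPOINTS];
static int frame_owner[NUM_FRAMES];   /* which endpoint owns each frame  */
static unsigned long clock_ticks;

/* Return the frame holding endpoint ep, binding it on a miss. */
static int endpoint_frame(int ep)
{
    if (eps[ep].bound_frame < 0) {            /* miss: driver gets involved */
        int victim = 0;
        for (int f = 1; f < NUM_FRAMES; f++)  /* pick the LRU frame         */
            if (eps[frame_owner[f]].last_use < eps[frame_owner[victim]].last_use)
                victim = f;
        eps[frame_owner[victim]].bound_frame = -1;    /* evict the victim   */
        frame_owner[victim] = ep;
        eps[ep].bound_frame = victim;
        printf("miss: endpoint %d bound to frame %d\n", ep, victim);
    }
    eps[ep].last_use = ++clock_ticks;
    return eps[ep].bound_frame;
}

int main(void)
{
    for (int i = 0; i < NUM_ENDPOINTS; i++) eps[i].bound_frame = -1;
    for (int f = 0; f < NUM_FRAMES; f++) { frame_owner[f] = f; eps[f].bound_frame = f; }
    endpoint_frame(3);     /* hit: already resident       */
    endpoint_frame(42);    /* miss: evicts an idle frame  */
    return 0;
}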

10. Communication under Load
(Figure: message-burst workload exchanged among clients and servers)
• Use of networking resources adapts to demand

11. Implicit Coscheduling
• Problem: parallel programs are designed to run in parallel => huge slowdowns with purely local scheduling
  • gang scheduling is rigid, fault-prone, and complex
• Coordinate schedulers implicitly using the communication already in the program
  • very easy to build, robust to component failures
  • inherently “service on demand”, scalable
  • the local scheduling component can evolve
(Figure: per-node stacks of GS, LS, and application (A) components)

12. Why it works
• Infer non-local state from local observations
• React to maintain coordination

  observation        implication              action
  fast response      partner scheduled        spin
  delayed response   partner not scheduled    block

(Figure: jobs A and B time-sliced across workstations WS 1-4, showing request/response, spin, and sleep phases)
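A minimal sketch of the spin-then-block decision in the table above: spin for roughly one expected round-trip, and block if the reply is late. The poll_reply() and block_until_reply() primitives and the threshold are assumptions for illustration, not the NOW implementation.

/* Hypothetical sketch of two-phase waiting for implicit coscheduling:
 * spin for about one round-trip time; if the reply is late, assume the
 * partner is not scheduled and block instead of burning the CPU. */
#include <stdbool.h>
#include <time.h>

#define SPIN_NSEC (50 * 1000)        /* ~expected round-trip time (illustrative) */

extern bool poll_reply(void);        /* assumed: true once the reply arrived     */
extern void block_until_reply(void); /* assumed: sleep until the NIC wakes us    */

void wait_for_reply(void)
{
    struct timespec start, now;
    clock_gettime(CLOCK_MONOTONIC, &start);

    for (;;) {
        if (poll_reply())
            return;                        /* fast response: partner scheduled   */
        clock_gettime(CLOCK_MONOTONIC, &now);
        long elapsed = (now.tv_sec - start.tv_sec) * 1000000000L
                     + (now.tv_nsec - start.tv_nsec);
        if (elapsed > SPIN_NSEC)
            break;                         /* delayed response: give up the CPU  */
    }
    block_until_reply();                   /* partner likely descheduled: block  */
}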

13. Example
• Range of granularity and load imbalance
• Spin-wait alone: 10x slowdown

14. I/O Lessons from NOW-Sort
• A complete system on every node is a powerful basis for data-intensive computing
  • complete disk subsystem
  • independent file systems
  • MMAP (not read), MADVISE
  • full OS => threads
• Remote I/O (with fast communication) provides the same bandwidth as local I/O
• I/O performance is very temperamental
  • variations in disk speeds
  • variations within a disk
  • variations in processing, interrupts, messaging, ...
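A minimal sketch of the mmap/madvise pattern noted above: map an input run and hint sequential access rather than copying it in with read(). The path and the checksum pass are illustrative stand-ins, not NOW-Sort code.

/* Hypothetical sketch: map a sort run and hint sequential access,
 * rather than copying it in with read(). Error handling abbreviated. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
    const char *path = "/tmp/run0.dat";        /* illustrative input run */
    int fd = open(path, O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) < 0) { perror("fstat"); return 1; }

    char *buf = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    /* Tell the VM system we will scan the run front to back. */
    madvise(buf, st.st_size, MADV_SEQUENTIAL);

    long checksum = 0;
    for (off_t i = 0; i < st.st_size; i++)     /* stand-in for the sort pass */
        checksum += (unsigned char)buf[i];
    printf("bytes=%lld checksum=%ld\n", (long long)st.st_size, checksum);

    munmap(buf, st.st_size);
    close(fd);
    return 0;
}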

15. Reactive I/O
• Loosen data semantics, e.g., an unordered bag of records
• Build flows from producers (e.g., disks) to consumers (e.g., summation)
• Flow data to where it can be consumed
(Figure: static parallel aggregation pairs each disk (D) with a fixed aggregator (A); adaptive parallel aggregation routes records through a distributed queue to whichever aggregator can consume them)
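A minimal pthreads sketch of the “flow data to where it can be consumed” idea: producers push records into one shared queue and each consumer pulls at whatever rate it can sustain, so a faster consumer naturally takes a larger share. This illustrates the concept only; names, sizes, and the artificial slow consumer are assumptions.

/* Hypothetical sketch: a producer (disk) pushes records into a shared queue;
 * consumers (aggregators) pull at their own pace. */
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

#define QSIZE 64
#define NRECORDS 1000

static int queue[QSIZE];
static int head, tail, count, done;
static long consumed[2];                  /* records taken by each consumer */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_full  = PTHREAD_COND_INITIALIZER;

static void *producer(void *arg)
{
    for (int i = 0; i < NRECORDS; i++) {
        pthread_mutex_lock(&lock);
        while (count == QSIZE) pthread_cond_wait(&not_full, &lock);
        queue[tail] = i; tail = (tail + 1) % QSIZE; count++;
        pthread_cond_signal(&not_empty);
        pthread_mutex_unlock(&lock);
    }
    pthread_mutex_lock(&lock);
    done = 1;
    pthread_cond_broadcast(&not_empty);
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void *consumer(void *arg)
{
    int id = *(int *)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (count == 0 && !done) pthread_cond_wait(&not_empty, &lock);
        if (count == 0 && done) { pthread_mutex_unlock(&lock); return NULL; }
        head = (head + 1) % QSIZE; count--;   /* take one record (payload omitted) */
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&lock);
        consumed[id]++;
        if (id == 1) usleep(100);             /* consumer 1 stands in for a slow node */
    }
}

int main(void)
{
    pthread_t p, c0, c1;
    int id0 = 0, id1 = 1;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c0, NULL, consumer, &id0);
    pthread_create(&c1, NULL, consumer, &id1);
    pthread_join(p, NULL); pthread_join(c0, NULL); pthread_join(c1, NULL);
    printf("fast consumer: %ld records, slow consumer: %ld records\n",
           consumed[0], consumed[1]);
    return 0;
}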

16. Performance Scaling
• Allows more data to go to the faster consumer

17. Service-Based Applications
• The application provides services to clients
• It grows and shrinks according to demand, availability, and faults
(Figure: Transcend transcoding proxy — service requests flow through front-end service threads to caches, a user-profile database, and a manager running on physical processors)

18. On the other hand
• GLUnix
  • offered much that was not available elsewhere: interactive use, load balancing, (partial) transparency, ...
  • straightforward master-slave architecture
  • millions of jobs served, reasonable scalability, flexible partitioning
  • crash-prone, inscrutable, unaware, ...
• xFS
  • very sophisticated cooperative caching + network RAID
  • integrated at the vnode layer
  • never robust enough for real use
• Both are hard, outstanding problems

19. Lessons
• The strength of clusters comes from
  • complete, independent components
  • incremental scalability (up and down)
  • nodal isolation
• Performance heterogeneity and change are fundamental
• Subsystems and applications need to be reactive and self-tuning
• Local intelligence + simple, flexible composition

20. Millennium
• Campus-wide cluster of clusters
• PC-based (Solaris/x86 and NT)
• Distributed ownership and control
• Testbed for computational science and Internet systems
(Figure: departments across campus — Business, SIMS, BMRC, Chemistry, C.S., E.E., Biology, Astro, NERSC, M.E., Physics, N.E., Math, IEOR, Transport, Economy, C.E., MSME — linked by Gigabit Ethernet)

21. Paranoid Construction
• What must work for RSH, DCOM, RMI, read, ...?
• It takes a page of C to safely read a line from a socket!
=> carefully controlled set of cluster system operations
=> non-blocking with timeout and full error checking, even if that requires a watcher thread
=> optimistic, with fail-over of the implementation
=> global capability at the physical level
=> indirection used for transparency must track the fault envelope, not just provide a mapping
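To give the flavor of the “page of C” point, here is a hedged sketch of reading one line from a socket with a deadline, handling EINTR, EOF, timeouts, and short reads explicitly. It is an illustration under those assumptions, not the Millennium code.

/* Hypothetical sketch: read one '\n'-terminated line from a socket with a
 * deadline. Byte-at-a-time reads keep the example short. */
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/* Returns line length, 0 on clean EOF before any data, -1 on error/timeout. */
ssize_t read_line_timeout(int fd, char *buf, size_t len, int timeout_ms)
{
    size_t used = 0;
    while (used + 1 < len) {
        struct pollfd pfd = { .fd = fd, .events = POLLIN };
        int r = poll(&pfd, 1, timeout_ms);
        if (r == 0) { errno = ETIMEDOUT; return -1; }    /* peer too slow      */
        if (r < 0)  { if (errno == EINTR) continue; return -1; }

        char c;
        ssize_t n = read(fd, &c, 1);
        if (n < 0) { if (errno == EINTR) continue; return -1; }
        if (n == 0) {                                    /* connection closed  */
            if (used == 0) return 0;
            break;                                       /* partial last line  */
        }
        if (c == '\n') break;
        buf[used++] = c;
    }
    buf[used] = '\0';
    return (ssize_t)used;
}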

22. Computational Economy Approach
• The system has a supply of various resources
• Demand for resources is revealed in price (distinct from the cost of acquiring the resources)
• Each user has a unique assessment of value
• A client agent negotiates for system resources on the user's behalf
  • submits requests, receives bids, or participates in auctions
  • selects the resources of highest value at least cost

23. Advantages of the Approach
• Decentralized load balancing
  • according to the user's perception of importance, not the system's
  • adapts to system and workload changes
• Creates an incentive to adopt efficient modes of use
  • maintain resources in usable form
  • avoid excessive usage when resources are needed by others
  • exploit under-utilized resources
  • maximize flexibility (e.g., migratable, restartable applications)
• Establishes user-to-user feedback on resource usage
  • basis for an exchange rate across resources
• Powerful framework for system design
  • natural for the client to be watchful, proactive, and wary
  • generalizes from resources to services
  • rich body of theory ready for application

24. Resource Allocation
• The traditional approach allocates requests to resources to optimize some system utility function
  • e.g., put work on the least-loaded node, the most free memory, the shortest queue, ...
• The economic approach views each user as having a distinct utility function
  • e.g., two users can exchange resources and both be happier!
(Figure: an allocator fed by a stream of (partial, delayed, or incomplete) resource status information and a stream of (incomplete) client requests)
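A toy numeric illustration of the “exchange and both be happier” point, with made-up utility functions: one user values CPU more, the other values memory more, so swapping part of their allocations raises both utilities.

/* Toy illustration (made-up numbers): two users with different utility
 * functions over (CPU share, memory share). A swap raises both utilities,
 * which no single system-wide metric would capture. */
#include <stdio.h>

/* User 1 cares mostly about CPU, user 2 mostly about memory. */
static double u1(double cpu, double mem) { return 0.8 * cpu + 0.2 * mem; }
static double u2(double cpu, double mem) { return 0.2 * cpu + 0.8 * mem; }

int main(void)
{
    /* Before: each user holds half of each resource. */
    printf("before: u1=%.2f u2=%.2f\n", u1(0.5, 0.5), u2(0.5, 0.5));
    /* After: user 1 trades memory for CPU, user 2 does the opposite. */
    printf("after:  u1=%.2f u2=%.2f\n", u1(0.8, 0.2), u2(0.2, 0.8));
    return 0;
}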

25. Pricing and all that
• What's the value of a CPU-minute, a MB-second, a GB-day?
• Many iterative market schemes: raise the price until load drops
• Auctions avoid setting a price
  • A Vickrey (second-price, sealed-bid) auction causes resources to go where they are most valued, at the lowest price
  • It is in each bidder's self-interest to reveal their true utility function!
• Small problem: auctions are awkward for most real allocation problems
• Big problem: people (and their surrogates) don't know what value to place on computation and storage!
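A minimal sketch of second-price sealed-bid (Vickrey) winner determination for a single resource slot; the bids are made up. The winner pays the second-highest bid, which is what makes truthful bidding the dominant strategy.

/* Hypothetical sketch: Vickrey (second-price sealed-bid) allocation of one
 * resource slot. */
#include <stdio.h>

int main(void)
{
    double bids[] = { 3.0, 7.5, 5.2, 6.9 };   /* sealed bids (illustrative) */
    int n = 4, winner = 0;
    double second = 0.0;

    for (int i = 1; i < n; i++)               /* highest bidder wins        */
        if (bids[i] > bids[winner]) winner = i;
    for (int i = 0; i < n; i++)               /* but pays the second price  */
        if (i != winner && bids[i] > second) second = bids[i];

    printf("bidder %d wins and pays the second price %.2f\n", winner, second);
    return 0;
}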

26. Smart Clients
• Adopt the NT view that “everything is two-tier, at least”
• The UI stays on the desktop and interacts with computation “in the cluster of clusters” via distributed objects
• A single-system image is provided by a wrapper
• The client can provide complete functionality
  • resource discovery, load balancing
  • requesting remote execution service
• Flexible applications will monitor availability and adapt
• Higher-level services offer 3-tier optimization
  • directory service, membership, parallel startup
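A minimal sketch of client-side resource discovery and load balancing in that spirit: the client consults a directory, picks the least-loaded server that answered, and falls back to local execution when nothing does. The directory_lookup() helper and all names are assumptions for illustration.

/* Hypothetical sketch of a "smart client": choose the least-loaded server
 * reported by a directory, and fall back to running locally if none answers. */
#include <stdio.h>

typedef struct {
    const char *host;
    double load;      /* reported load average; < 0 means "did not answer" */
} server_info;

/* Assumed helper: probe the directory and fill in current loads. */
extern int directory_lookup(server_info *servers, int max);

int choose_server(const server_info *servers, int n)
{
    int best = -1;
    for (int i = 0; i < n; i++) {
        if (servers[i].load < 0) continue;            /* unreachable: skip   */
        if (best < 0 || servers[i].load < servers[best].load)
            best = i;                                 /* least-loaded so far */
    }
    return best;                                      /* -1 => run locally   */
}

void submit_job(const char *job)
{
    server_info servers[16];
    int n = directory_lookup(servers, 16);
    int pick = choose_server(servers, n);
    if (pick < 0)
        printf("no server available, running %s locally\n", job);
    else
        printf("running %s on %s (load %.2f)\n", job,
               servers[pick].host, servers[pick].load);
}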

27. Everything is a service
• Load balancing, brokering, replication, directories
  => they need to be cost-effective, or the client will fall back to “self support”
  => if they are cost-effective, competitors may arise
• Useful applications should be packaged as services
  • their value may be greater than the cost of the resources they consume

28. Conclusions
• We have the building blocks for very interesting clustered systems
  • fast communication, authentication, directories, distributed object models
• Transparency and uniform access are convenient, but...
• It is time to focus on exploiting the new characteristics of these systems in novel ways
• We need to get serious about availability
• Agility (wary, reactive, adaptive) is fundamental
• Gronky “F77 + MPI and no I/O” codes will seriously hold us back
• We need to provide a better framework for cluster applications
