Evolution of Distributed Systems: NOW and Condor Projects in Collective Computing

1. Distributed Systems Early Examples

2. Projects NOW � a Network Of WorkstationsUniversity of California, BerkelyTerminated about 1997 after demonstrating the feasibility of their approach Condor � University of Wisconsin-MadisonStarted about 1988Is now an ongoing, world-wide system of shared computing clusters.

3. NOW: a Network of Workstationshttp://now.cs.berkeley.edu/ Scale: A NOW system consists of a building-wide collection of machines providing memory, disks, and processors. Basic Ideas Use idle CPU cycles for parallel processing on clusters of workstations Use memories as disk cache to break the I/O bottleneck (slow disk access times) Share the resources over fast LANs

4. NOW �Opportunities�: Memory Network RAM: fast networks, high bandwidth make it reasonable to page across the network. Instead of paging out to slow disks, send over fast networks to RAM in an idle machine Cooperative file caching: improve performance by using neetwork RAM as a very large file cache Shared files can be fetched from another client�s memory rather than server�s disk Active clients can extend their disk cache size by using memory of idle clients.

5. NOW Opportunities: �RAWD�(Redundant Arrays of Workstation Disks) RAID systems provide fast performance by connecting arrays of small disks. By reading/ writing data in parallel, throughput is increased. Instead of a hardware RAID, build software version by writing data across the work stations in the network Especially useful for parallel programs running on separate machines in the network

6. NOW Opportunities: �Parallel Computing� Harnessing the power of multiple idle workstations in a NOW can support high- performance parallel applications. NOW principles: avoid going to disk by using RAM on other network nodes (assumes network faster than disk) Further speedup may be achieved by parallelizing the computation and striping the data to multiple disks. allow user processes to access the network directly rather than going through the operating system

7. Berkeley NOW Features GLUnix (Global Layer UNIX) is a layer on top of UNIX OS�s running on the workstations Applications running on GLUnix have a protected virtual operating system layer which catches UNIX system calls and translates them into GLUnix calls. Serverless Network File System � xFS Avoids central server bottleneck Cooperative file system (basically, peer-to-peer)

8. Summary Successful, in their opinion. Ran for several years in the late 90s on the Berkeley CS system Key enabling technologies Scalable, high performance network Fast access to the network for user processes Global operating system layer to support system resources as a true shared pool.

9. CONDORhttp://research.cs.wisc.edu/condor/ Goal: ��to develop, implement, deploy, and evaluate mechanisms and policies that support High Throughput Computing (HTC) on large collections of distributively owned computing resources� HTC computing � �� problems that require weeks or months of computation to solve. �this type of research need a computing environment that delivers large amounts of computational power over a long period of time.� Compare to High Performance computing (HPC) which ��delivers a tremendous amount of power over a short period of time.�

10. Overview Condor can be used to manage computing clusters. It is designed to take advantage of idle machines Condor lets users submit many jobs at the same time. Result: tremendous amounts of computation with very little user intervention. No need to rewrite code - just link to Condor libraries

11. Features Checkpoints: save complete state of the process Critical for programs that run for long periods of time to recover from crashes or to vacate machines whose user has returned, or for process migration due to other reasons. Remote system calls: data resides on the home machine and system calls are directed there. Provides protection for host machines.

12. Features Jobs can run anywhere in the cluster (which can be a physical cluster, a virtual cluster, or even a single machine) Different machines have different capabilities; when submitting a job Condor uses can specify the kind of machine they wish to run on. When sets of jobs are submitted it�s possible to define dependencies; i.e., �don�t run Job 3 until jobs 1 and 2 have completed.�

13. Slides From a talk by Myron Livny, �The Principles and Power of Distributed Computing�,International Winter School on Grid Computing 2010.http://research.cs.wisc.edu/condor/talks.html Livny is a professor at the U of Wisconsin-Madison where he heads the Condor Project, and other grid/distributed computing projects or centers.

14. Condor Daemons Title unknown, by Hans Holbein the Younger, from Historiarum Veteris Testamenti icones, 1543

15. Condor Daemons

16. Condor today http://research.cs.wisc.edu/condor/map/

Evolution of Distributed Systems: NOW and Condor Projects in Collective Computing

Evolution of Distributed Systems: NOW and Condor Projects in Collective Computing

Presentation Transcript

Distributed Systems

Distributed Systems

Distributed Systems

Distributed Systems

Distributed Systems

Distributed Systems

Distributed Systems

Distributed Systems

Distributed systems

Distributed Systems

Distributed Systems

Distributed Systems Course Distributed Multimedia Systems

Distributed Systems Course Distributed File Systems

Distributed Systems

Distributed Systems

Distributed Systems

Distributed Systems

Distributed Systems Course Distributed File Systems

Distributed Systems

Distributed Systems