Introduction to Condor

Good morning! • Thank you for having me! • I am: • Alain Roy • Computer Science Ph.D. in Quality of Service, with Globus Project • Working with the Condor Project

Condor Tutorials • Today (Sunday) 10:00-12:30 • A general introduction to Condor • Monday 17:00-19:00 • Using and administering Condor • Tuesday 17:00-19:00 • Using Condor on the Grid

A General Introduction to Condor

The Condor Project (Established 1985) Distributed Computing research performed by a team of about 30 faculty, full time staff, and students who: • face software engineering challenges in a Unix and Windows environment, • are involved in national and international collaborations, • actively interact with users, • maintain and support a distributed production environment, • and educate and train students.

A Multifaceted Project • Harnessing clusters—opportunistic and dedicated (Condor) • Job management for Grid applications (Condor-G, DaPSched) • Fabric management for Grid resources (Condor, GlideIns, NeST) • Distributed I/O technology (PFS, Kangaroo, NeST) • Job-flow management (DAGMan, Condor) • Distributed monitoring and management (HawkEye) • Technology for Distributed Systems (ClassAD, MW)

Harnessing Computers • We have more than 300 pools with more than 8500 CPUs worldwide. • We have more than 1800 CPUs in 10 pools on our campus. • Established a “complete” production environment for the UW CMS group • Adopted by the “real world” (Galileo, Maxtor, Micron, Oracle, Tigr, … )

The Grid … • Close collaboration and coordination with the Globus Project: joint development, adoption of common protocols, technology exchange, … • Partner in major national Grid R&D² (Research, Development and Deployment) efforts (GriPhyN, iVDGL, IPG, TeraGrid) • Close collaboration with Grid projects in Europe (EDG, GridLab, e-Science)

[Diagram: three layers: User/Application, Grid, Fabric (processing, storage, communication)]

[Diagram: User/Application, Grid, and Fabric (processing, storage, communication) layers, labeled with Condor, Globus Toolkit, and Condor]

distributed I/O … • Close collaboration with the Scientific Data Management Group at LBL. • Provide management services for distributed data storage resources • Provide management and scheduling services for Data Placement jobs (DaPs) • Effective, secure and flexible remote I/O capabilities • Exception handling

job flow management … • Adoption of Directed Acyclic Graphs (DAGs) as a common job flow abstraction. • Adoption of DAGMan as an effective solution to job flow management.

For the Rest of Today • Condor • Condor and the Grid • Related Technologies • DAGMan • ClassAds • Master-Worker • NeST • DaP Scheduler • HawkEye • Today: Just the “Big Picture”

What is Condor? • Condor converts collections of distributively owned workstations and dedicated clusters into a distributed high-throughput computing facility. • Run lots of jobs over a long period of time, • Not a short burst of “high-performance” • Condor manages both machines and jobs with ClassAd Matchmaking to keep everyone happy

Condor Takes Care of You • Condor does whatever it takes to run your jobs, even if some machines… • Crash (or are disconnected) • Run out of disk space • Don’t have your software installed • Are frequently needed by others • Are far away & managed by someone else

What is Unique about Condor? • ClassAds • Transparent checkpoint/restart • Remote system calls • Works in heterogeneous clusters • Clusters can be: • Dedicated • Opportunistic

What’s Condor Good For? • Managing a large number of jobs • You specify the jobs in a file and submit them to Condor, which runs them all and sends you email when they complete • Mechanisms to help you manage huge numbers of jobs (1000’s), all the data, etc. • Condor can handle inter-job dependencies (DAGMan)

What’s Condor Good For? (cont’d) • Robustness • Checkpointing allows guaranteed forward progress of your jobs, even jobs that run for weeks before completion • If an execute machine crashes, you only lose work done since the last checkpoint • Condor maintains a persistent job queue - if the submit machine crashes, Condor will recover • (Story)
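The "you only lose work done since the last checkpoint" point can be illustrated with a toy, application-level sketch. This is not how Condor does it (Condor's checkpointing is transparent to the program); the function name, file name, and step counts below are all assumptions made for illustration.

```python
# Toy illustration of checkpoint/restart. This is application-level code,
# NOT Condor's transparent mechanism. All names here are hypothetical.
import json
import os
import tempfile

def run(total_steps, checkpoint_every, path, crash_at=None):
    """Sum 0..total_steps-1, saving progress every `checkpoint_every` steps."""
    # Resume from the last checkpoint if one exists.
    if os.path.exists(path):
        with open(path) as f:
            state = json.load(f)
    else:
        state = {"step": 0, "total": 0}

    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated machine crash")
        state["total"] += state["step"]
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            with open(path, "w") as f:
                json.dump(state, f)
    return state["total"]

path = os.path.join(tempfile.mkdtemp(), "job.ckpt")
try:
    run(1000, checkpoint_every=100, path=path, crash_at=250)
except RuntimeError:
    pass  # the "execute machine" went away at step 250

with open(path) as f:
    resumed_from = json.load(f)["step"]  # last checkpoint before the crash

result = run(1000, checkpoint_every=100, path=path)
print("resumed from step", resumed_from, "and finished with", result)
```

In this sketch the crash at step 250 costs only the 50 steps done after the checkpoint at step 200; the second call picks up from there instead of starting over.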

What’s Condor Good For? (cont’d) • Giving your job the agility to access more computing resources • Checkpointing allows your job to run on “opportunistic resources” (not dedicated) • Checkpointing also provides “migration” - if a machine is no longer available, move! • With remote system calls, run on systems which do not share a filesystem - You don’t even need an account on a machine where your job executes

Other Condor features • Implement your policy on when the jobs can run on your workstation • Implement your policy on the execution order of the jobs • Keep a log of your job activities

A Condor Pool In Action

A Bit of Condor Philosophy • Condor brings more computing to everyone • A small-time scientist can make an opportunistic pool with 10 machines, and get 10 times as much computing done. • A large collaboration can use Condor to control its dedicated pool with hundreds of machines.

The Condor Idea Computing power is everywhere, we try to make it usable by anyone.

Meet Frieda. She is a scientist. But she has a big problem.

Frieda’s Application … Simulate the behavior of F(x,y,z) for 20 values of x, 10 values of y and 3 values of z (20*10*3 = 600 combinations) • F takes on the average 3 hours to compute on a “typical” workstation (total = 1800 hours) • F requires a “moderate” (128MB) amount of memory • F performs “moderate” I/O - (x,y,z) is 5 MB and F(x,y,z) is 50 MB
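The sizing on this slide can be checked with a few lines of Python. This is a back-of-the-envelope sketch only: the variable names are ours, and it assumes every run takes exactly the 3-hour average.

```python
# Back-of-the-envelope sizing for Frieda's parameter sweep.
# All names are illustrative; the numbers come from the slide.
from itertools import product

x_values, y_values, z_values = range(20), range(10), range(3)
combinations = list(product(x_values, y_values, z_values))

hours_per_run = 3            # average on a "typical" workstation
input_mb, output_mb = 5, 50  # size of (x,y,z) and of F(x,y,z)

n_jobs = len(combinations)
total_hours = n_jobs * hours_per_run
total_input_mb = n_jobs * input_mb
total_output_mb = n_jobs * output_mb

print(n_jobs, "jobs")
print(total_hours, "CPU hours, or", total_hours / 24, "days on one machine")
print(total_input_mb, "MB of input,", total_output_mb, "MB of output")
```

On a single workstation that is 75 days of continuous computing, which is why Frieda goes looking for help.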

I have 600 simulations to run. Where can I get help?

Install a Personal Condor!

Installing Condor • Download Condor for your operating system • Available as a free download from http://www.cs.wisc.edu/condor • Not labelled as “Personal” Condor, just “Condor”. • Available for most Unix platforms and Windows NT

So Frieda Installs Personal Condor on her machine… • What do we mean by a “Personal” Condor? • Condor on your own workstation, no root access required, no system administrator intervention needed—easy to set up. • So after installation, Frieda submits her jobs to her Personal Condor…

Personal Condor?! What’s the benefit of a Condor “Pool” with just one user and one machine?

Your Personal Condor will ... • Keep an eye on your jobs and will keep you posted on their progress • Keep a log of your job activities • Add fault tolerance to your jobs • Implement your policy on when the jobs can run on your workstation

Frieda is happy until… She realizes she needs to run a post-analysis on each job, after it completes.

Condor DAGMan • Directed Acyclic Graph Manager • DAGMan allows you to specify the dependencies between your Condor jobs, so it can manage them automatically for you. • (e.g., “Don’t run job “B” until job “A” has completed successfully.”)

What is a DAG? • A DAG is the data structure used by DAGMan to represent these dependencies. • Each job is a “node” in the DAG. • Each node can have any number of “parent” or “children” nodes – as long as there are no loops! [Diagram: nodes Job A, Job B, Job C, Job D]
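As a rough sketch of the "no loops" rule, the Python below describes four nodes and checks that the graph is acyclic, then renders it with DAGMan's JOB and PARENT ... CHILD keywords. The diamond shape (A before B and C, both before D) and the `*.sub` file names are assumptions for illustration; the slide only names the four jobs.

```python
# Sketch: describe a small DAG and check that it has no loops.
# The node names follow the slide; the diamond shape and the *.sub file
# names are assumptions made for illustration.
from graphlib import TopologicalSorter, CycleError  # Python 3.9+

# Map each node to the set of its parents.
parents = {
    "A": set(),
    "B": {"A"},
    "C": {"A"},
    "D": {"B", "C"},
}

def is_acyclic(parent_map):
    """Return True if the dependency graph contains no loops."""
    try:
        TopologicalSorter(parent_map).prepare()
        return True
    except CycleError:
        return False

def dag_file_text(parent_map):
    """Render the graph using the JOB and PARENT ... CHILD keywords."""
    lines = [f"JOB {node} {node.lower()}.sub" for node in sorted(parent_map)]
    for child in sorted(parent_map):
        for parent in sorted(parent_map[child]):
            lines.append(f"PARENT {parent} CHILD {child}")
    return "\n".join(lines)

print(is_acyclic(parents))
print(dag_file_text(parents))
```

A graph such as `{"A": {"B"}, "B": {"A"}}` would fail the check, since each job would be waiting on the other.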

Running a DAG • DAGMan acts as a “meta-scheduler”, managing the submission of your jobs to Condor based on the DAG dependencies. [Diagram: .dag File, DAGMan, Condor Job Queue; nodes A, B, C, D]

Running a DAG (cont’d) • DAGMan holds & submits jobs to Condor at the appropriate times. [Diagram: DAGMan, Condor Job Queue; nodes A, B, C, D]

Running a DAG (cont’d) • In case of a job failure, DAGMan continues until it can no longer make progress, and then creates a “rescue” file with the current state of the DAG. [Diagram: DAGMan, Condor Job Queue, Rescue File; nodes A, B, D, and a failed node marked X]

Recovering a DAG • Once the failed job is ready to be re-run, the rescue file can be used to restore the prior state of the DAG. [Diagram: DAGMan, Condor Job Queue, Rescue File; nodes A, B, C, D]

Recovering a DAG (cont’d) • Once that job completes, DAGMan will continue the DAG as if the failure never happened. [Diagram: DAGMan, Condor Job Queue; nodes A, B, C, D]

Finishing a DAG • Once the DAG is complete, the DAGMan job itself is finished, and exits. [Diagram: DAGMan, Condor Job Queue; nodes A, B, C, D]
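The run, fail, rescue, and recover sequence on the last few slides can be modeled with a toy scheduler. This is a sketch of the idea only, not DAGMan's implementation or its rescue file format; the diamond-shaped DAG, the choice of C as the failing job, and all function names are assumptions.

```python
# Toy model of running a DAG, hitting a failure, and recovering.
# Not DAGMan's real implementation; all names are hypothetical.
PARENTS = {"A": set(), "B": {"A"}, "C": {"A"}, "D": {"B", "C"}}  # assumed diamond

def run_dag(parents, run_job, already_done=frozenset()):
    """Run every node whose parents all succeeded; return the set of done nodes."""
    done = set(already_done)
    failed = set()
    progress = True
    while progress:                      # stop when nothing more can run
        progress = False
        for node in sorted(parents):
            if node in done or node in failed:
                continue
            if parents[node] <= done:    # all parents finished successfully
                if run_job(node):
                    done.add(node)
                else:
                    failed.add(node)
                progress = True
    return done

def make_runner(broken, log):
    """Build a fake job runner that fails for the nodes listed in `broken`."""
    def run_job(node):
        log.append(node)
        return node not in broken
    return run_job

first_log, second_log = [], []
rescue = run_dag(PARENTS, make_runner({"C"}, first_log))   # C fails this time
final = run_dag(PARENTS, make_runner(set(), second_log), already_done=rescue)

print("first attempt ran", first_log, "and completed", sorted(rescue))
print("recovery ran", second_log, "and completed", sorted(final))
```

The set returned by the first call plays the role of the rescue file: the second call skips A and B and runs only C and then D.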

Frieda wants more… • She decides to use the graduate students’ computers when the students aren’t using them, and get done sooner. • In exchange, they can use the Condor pool too.

Frieda’s Condor pool… • Frieda’s Computer: Central Manager • Graduate Students’ Desktop Computers

Frieda’s Pool is Flexible • Since Frieda is a professor, her jobs are preferred. • Frieda doesn’t always have jobs, so now the graduate students have access to more computing power. • Frieda’s pool has enabled more work to be done by everyone.

How does this work? • Frieda submits a job. Condor makes a ClassAd and gives it to the Central Manager: • Owner = “Frieda” • MemoryUsed = 40M • ImageSize = 20M • Requirements = (Opsys == “Linux” && Memory > MemoryUsed) • Central Manager collects machine ClassAds: • Memory = 128M • Requirements = (ImageSize < 50M) • Rank = (Owner == “Frieda”) • Central Manager finds best match
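A minimal sketch of the two-sided match described above, with plain Python dicts and functions standing in for ClassAds. Real ClassAds are written in their own expression language, so this is only an analogy. The student's job, the oversized job, and the machine's `OpSys` value are assumptions added for illustration; the other values come from the slide.

```python
# Toy matchmaker: dicts and functions stand in for ClassAds.
# student_job, big_job, and the machine's OpSys value are hypothetical.
def job_requirements(job, machine):
    """The job's side: Opsys == "Linux" && Memory > MemoryUsed."""
    return machine["OpSys"] == "Linux" and machine["Memory"] > job["MemoryUsed"]

def make_job(owner, image_size_mb):
    return {
        "Owner": owner,
        "MemoryUsed": 40,            # MB
        "ImageSize": image_size_mb,  # MB
        "Requirements": job_requirements,
    }

frieda_job = make_job("Frieda", 20)
student_job = make_job("Student", 20)
big_job = make_job("Frieda", 80)     # too large for the machine below

machine = {
    "OpSys": "Linux",                # assumed; not listed on the slide
    "Memory": 128,                   # MB
    "Requirements": lambda job, machine: job["ImageSize"] < 50,
    "Rank": lambda job, machine: int(job["Owner"] == "Frieda"),
}

def matches(job, machine):
    """A match needs BOTH sides' Requirements to hold."""
    return job["Requirements"](job, machine) and machine["Requirements"](job, machine)

def best_job_for(machine, jobs):
    """Among compatible jobs, pick the one the machine ranks highest."""
    compatible = [j for j in jobs if matches(j, machine)]
    return max(compatible, key=lambda j: machine["Rank"](j, machine), default=None)

chosen = best_job_for(machine, [student_job, frieda_job])
print(chosen["Owner"])
```

Both jobs satisfy the requirements on each side, so the machine's Rank decides, and Frieda's job is chosen. `big_job` never matches because its image size violates the machine's Requirements.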

After a match is found • Central Manager tells both parties about the match • Frieda’s computer and the remote computer cooperate to run Frieda’s job.

Lots of flexibility • Machines can: • Only run jobs when they have been idle for at least 15 minutes, or always run them. • Kick off jobs when someone starts using the computer, or never kick them off. • Jobs can: • Require or prefer certain machines • Use checkpointing, remote I/O, etc…
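The machine-side choices above amount to two small decisions, sketched here in Python. The 15-minute threshold is from the slide; the function and parameter names are hypothetical, and in Condor itself such policies are expressed in the machine's configuration rather than in Python.

```python
# Sketch of a machine owner's policy. Names are hypothetical; only the
# 15-minute idle threshold comes from the slide.
IDLE_THRESHOLD_MINUTES = 15

def may_start_job(idle_minutes, always_run=False):
    """Start a job if the owner allows it always, or the machine is idle enough."""
    return always_run or idle_minutes >= IDLE_THRESHOLD_MINUTES

def must_evict_job(owner_is_active, never_evict=False):
    """Kick the job off when someone starts using the machine, unless disabled."""
    return owner_is_active and not never_evict

print(may_start_job(idle_minutes=5))                    # desktop in use recently
print(may_start_job(idle_minutes=20))                   # idle long enough
print(may_start_job(idle_minutes=0, always_run=True))   # dedicated-style policy
print(must_evict_job(owner_is_active=True))
```

A desktop would typically keep both defaults, while a dedicated machine corresponds to `always_run=True` and `never_evict=True`.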

Happy Day! Frieda’s organization purchased a Beowulf Cluster! • Other scientists in her department have realized the power of Condor and want to share it. • The Beowulf cluster and the graduate student computers can be part of a single Condor pool.

Frieda’s Condor pool… • Graduate Students’ Desktop Computers • Frieda’s Computer: Central Manager • Beowulf Cluster

Frieda’s Big Condor Pool • Jobs can prefer to run in the Beowulf cluster by using “Rank”. • Jobs can run just on “appropriate machines” based on: • Memory, disk space, software, etc. • The Beowulf cluster is dedicated. • The student computers are still useful. • Everyone’s computing power is increased.

Frieda collaborates… • She wants to share her Condor pool with scientists from another lab.
