
Job Submission

Job Submission. Andrew Pangborn & Myles Maxfield. Rochester Institute of Technology. 01/19/09. Service Oriented Cyberinfrastructure Lab, http://grid.rit.edu.


Presentation Transcript


  1. Job Submission
     Andrew Pangborn & Myles Maxfield
     Rochester Institute of Technology
     01/19/09
     Service Oriented Cyberinfrastructure Lab, http://grid.rit.edu

  2. The Grid
     • Virtual organizations spanning multiple administrative domains
     • Different organizations and administrators
     • Different hardware
     • Different queuing systems
     • How do we make sense of it all?

  3. The Problem
     • At one end are computing resources (the grid fabric) managed by batch queuing systems and middleware
     • At the other end are end-users and their jobs/applications
     • Need software and protocols for submitting jobs to the computing resources
     • Also want to be able to monitor jobs after submission and efficiently schedule them to achieve high throughput

  4. Grid Architecture
     [Figure: grid architecture diagram, labeled "Job Submission". Image from Ian Foster paper (The Anatomy of the Grid).]

  5. Batch Queuing Systems
     • Submitting a job directly to the batch queuing system
     • One or more queues
     • Priorities
     • Two common architectures
       • Client/server
       • Dynamic offloading
     • User credential (delegation)
     • Jobs have states (e.g. Pending, Running)
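
The queue, priority, and job-state ideas on this slide can be illustrated with a small Python sketch. It is a toy model only, not how any real batch system is implemented. The names `Job`, `BatchQueue`, and `JobState` are hypothetical, the `Completed` state is added for illustration, and "higher number means higher priority" is an assumption.

```python
import heapq
from dataclasses import dataclass
from enum import Enum
from itertools import count


class JobState(Enum):
    PENDING = "Pending"
    RUNNING = "Running"
    COMPLETED = "Completed"  # assumed extra state, not named on the slide


@dataclass
class Job:
    name: str
    priority: int  # assumption: a higher value is dispatched sooner
    state: JobState = JobState.PENDING


class BatchQueue:
    """A single queue that dispatches the highest-priority pending job first."""

    def __init__(self):
        self._heap = []
        self._ticket = count()  # tie-breaker that preserves submission order

    def submit(self, job):
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-job.priority, next(self._ticket), job))

    def dispatch(self):
        """Pop the next job and mark it Running; return None if the queue is empty."""
        if not self._heap:
            return None
        _, _, job = heapq.heappop(self._heap)
        job.state = JobState.RUNNING
        return job
```

Jobs with equal priority leave the queue in submission order, which is one common policy; real schedulers typically add fair-share and resource-fit rules on top.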

  6. Batch Queuing Systems
     • Important examples:
       • Portable Batch System
       • TORQUE
       • Xgrid
       • Sun Grid Engine
       • Load Sharing Facility
       • Condor

  7. Portable Batch System (PBS)
     • Originally developed for NASA
     • Client/server architecture
       • Server: pbs_server
       • Client (execution daemon on each compute node): pbs_mom
     • Works with MPI with built-in shell script variables

  8. PBS Example

     litherum@gras:~$ cat test.sh
     #!/bin/sh
     #testpbs
     echo This is a test
     echo today is `date`
     echo This is `hostname`
     echo The current working directory is `pwd`
     ls -alF /home
     uptime

  9. PBS Example

     litherum@gras:~$ qsub test.sh
     6.gras.carrion.rit.edu
     litherum@gras:~$ qstat
     Job id                    Name             User            Time Use S Queue
     ------------------------- ---------------- --------------- -------- - -----
     6.gras                    test.sh          litherum        00:00:00 C batch
     litherum@gras:~$ cat test.sh.o6
     This is a test
     today is Sat Jan 17 18:20:20 EST 2009
     This is carrion02
     The current working directory is /home/litherum
     total 20
     drwxr-xr-x 31 litherum litherum 4096 Jan 17 18:19 litherum/
      18:20:20 up 131 days, 21:20, 0 users, load average: 0.00, 0.00, 0.00
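
In the session above, qsub prints the identifier `6.gras.carrion.rit.edu` and the job's standard output lands in `test.sh.o6`. The Python sketch below shows how a wrapper script might derive those file names from the identifier. The function names are hypothetical, and the `<name>.o<sequence>` and `<name>.e<sequence>` pattern is assumed to be the site default (it can be overridden at submission time).

```python
def parse_job_id(job_id):
    """Split 'sequence.server' into (sequence_number, server_name)."""
    sequence, _, server = job_id.strip().partition(".")
    if not sequence.isdigit() or not server:
        raise ValueError(f"unexpected job identifier: {job_id!r}")
    return int(sequence), server


def default_output_files(job_name, job_id):
    """Return the assumed default (stdout, stderr) file names for a job."""
    sequence, _ = parse_job_id(job_id)
    return f"{job_name}.o{sequence}", f"{job_name}.e{sequence}"


if __name__ == "__main__":
    print(default_output_files("test.sh", "6.gras.carrion.rit.edu"))
```

The sketch deliberately rejects identifiers it does not recognize (for example array-job forms) instead of guessing.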

  10. Torque
      • Built on top of PBS (a fork of the OpenPBS code base)
      • Supports reservations, where you can reserve specific resources for specific times
      • Supports partitions, where you can partition a cluster into smaller sub-clusters

  11. Torque

      litherum@gras:~$ showq
      ACTIVE JOBS--------------------
      JOBNAME            USERNAME      STATE  PROC   REMAINING            STARTTIME

           0 Active Jobs       0 of    4 Processors Active (0.00%)
                               0 of    2 Nodes Active      (0.00%)

      IDLE JOBS----------------------
      JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

      0 Idle Jobs

      BLOCKED JOBS----------------
      JOBNAME            USERNAME      STATE  PROC     WCLIMIT            QUEUETIME

      Total Jobs: 0   Active Jobs: 0   Idle Jobs: 0   Blocked Jobs: 0

  12. Xgrid
      • Apple
      • Essentially the same as Condor
      • GUI! =)
      • Client/server model
      [Figure: Xgrid admin tool screenshot, http://upload.wikimedia.org/wikipedia/en/6/62/XgridAdminTool.jpg]

  13. Sun Grid Engine
      • Open source, like everything new Sun puts out
      • Supports
        • Reservations
        • Job dependencies
        • Checkpointing
        • Multiple scheduling algorithms
      • Web interface
      • Professional!

  14. Middleware
      • These queuing systems are hard to use
      • There may be many systems employed in a given grid
      • Wouldn’t it be nice if all this were unified in a single implementation?
      • Middleware that handles job submission in a virtual organization, across resources spread throughout multiple administrative domains, would be useful!

  15. Condor: a tool for pooling and “scavenging” computing resources and distributing jobs
      • Similar to a batch queuing system [2]
        • Job management
        • Scheduling policy
        • Priority scheme
        • Resource monitoring
        • Resource management
      • Also focuses on high throughput and “opportunistic computing” [2]
        • Utilize computing resources whenever they are available
      Condor image from: http://www.cs.wisc.edu/condor/

  16. Condor Universes [1]
      • Standard
        • Checkpointing, fault tolerance
        • Link job against Condor libraries
      • Vanilla
        • Simpler, can run ordinary, unmodified binaries (do not need to be “condor compiled”)
        • No support for partial execution or job relocation
      • Others
        • PVM
        • MPI
        • Java

  17. Condor Submission File Example [1]

      #hello.sub
      #condor job file example
      Universe = Vanilla
      Executable = hello
      Output = hello.out
      Input = hello.in
      Error = hello.err
      Log = hello.log
      Queue
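
Submit files like the one above are plain `key = value` lines followed by a `Queue` statement, so they are easy to generate from a script. The following Python sketch renders one from a dictionary. The function name `render_submit_description` is hypothetical, and the sketch performs no validation of the keys it is given.

```python
def render_submit_description(settings, queue_count=1):
    """Render key/value settings as Condor submit-file text ending in Queue."""
    lines = [f"{key} = {value}" for key, value in settings.items()]
    # "Queue" submits one job; "Queue N" submits N copies.
    lines.append("Queue" if queue_count == 1 else f"Queue {queue_count}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    hello = {
        "Universe": "Vanilla",
        "Executable": "hello",
        "Output": "hello.out",
        "Input": "hello.in",
        "Error": "hello.err",
        "Log": "hello.log",
    }
    print(render_submit_description(hello), end="")
```

The printed text could be saved as hello.sub and passed to condor_submit.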

  18. Some Condor Commands [5]
      • condor_submit <job_file.sub>
        • Submit a Condor job
      • condor_q
        • View the Condor job queue
      • condor_status
        • View the status of the machines in the Condor pool
      • condor_compile
        • Re-links jobs for use in the standard universe

  19. Condor job structures
      Programming models for larger scale jobs using the Condor agent
      • Master-Worker
        • Single master process coordinates all the independent tasks
        • Collects results as workers finish, distributes new jobs to workers
      • DAG (Directed Acyclic Graph)
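
The master-worker pattern on this slide can be shown in miniature with Python's standard library. This is a local, thread-based analogy only, not Condor's own master-worker machinery; `run_master` is a hypothetical name, and tasks are assumed to be hashable and independent of each other.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_master(tasks, worker, max_workers=4):
    """Hand independent tasks to a pool of workers and collect results as they finish."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # The master submits every task; the pool gives an idle worker
        # the next queued task as soon as it finishes the previous one.
        futures = {pool.submit(worker, task): task for task in tasks}
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    return results


if __name__ == "__main__":
    print(run_master(range(5), lambda n: n * n))
```

Because results are gathered in completion order, a slow task does not hold up the collection of faster ones.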

  20. GRAM [4]
      • Globus Resource Allocation Manager (GRAM)
        • Resource allocation
        • Process creation
        • Monitoring
        • Management
      • Maps requests expressed in a Resource Specification Language (RSL) into commands to local schedulers and computers

  21. GRAM
      • Pluggable!
      • Can’t make up their mind how to describe jobs
      • Will submit jobs to:
        • Condor
        • LSF
        • PBS/Torque
        • ???
      • Unified interface, identifier for which cluster/service to use
      • Job submission file
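
The "pluggable" idea, one interface plus an identifier that selects the local scheduler, can be sketched as a lookup table of adapters. The Python below is purely illustrative and does not reflect GRAM's actual adapter code. The names `ADAPTERS` and `local_command` are hypothetical, and only the basic `qsub <script>` and `condor_submit <file>` invocations shown elsewhere in these slides are used.

```python
def pbs_command(job_file):
    """Command line for handing a script to PBS/Torque."""
    return ["qsub", job_file]


def condor_command(job_file):
    """Command line for handing a submit file to Condor."""
    return ["condor_submit", job_file]


# Hypothetical registry: scheduler identifier -> adapter function.
ADAPTERS = {
    "PBS": pbs_command,
    "Condor": condor_command,
}


def local_command(scheduler, job_file):
    """Map a scheduler identifier and a job file to a local command line."""
    try:
        adapter = ADAPTERS[scheduler]
    except KeyError:
        raise ValueError(f"no adapter registered for {scheduler!r}") from None
    return adapter(job_file)


if __name__ == "__main__":
    print(local_command("PBS", "test.sh"))
```

Supporting another scheduler then means registering one more function, without changing the callers.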

  22. GRAM Example

      maxfield@tg-login1:~> globusrun-ws -submit -factory https://tg-login.ornl.teragrid.org:8444/wsrf/services/ManagedJobFactoryService -factory-type PBS -streaming -job-command /bin/hostname
      Delegating user credentials...Done.
      Submitting job...Done.
      Job ID: uuid:89538014-e4f2-11dd-81df-0010180bb4e6
      Termination time: 01/18/2009 23:57 GMT
      Current job state: Pending
      Current job state: Active
      tg-c15
      Current job state: CleanUp-Hold
      Current job state: CleanUp
      Current job state: Done
      Destroying job...Done.
      Cleaning up any delegated credentials...Done.
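
The `Current job state:` lines in this output trace the job's life cycle. A monitoring script could extract that sequence from captured output, as in the Python sketch below. The function name `job_states` is hypothetical, and the exact line format is assumed from this one transcript, not from a specification.

```python
STATE_PREFIX = "Current job state: "


def job_states(output):
    """Return the job states in the order they were reported."""
    states = []
    for line in output.splitlines():
        line = line.strip()
        if line.startswith(STATE_PREFIX):
            states.append(line[len(STATE_PREFIX):])
    return states


if __name__ == "__main__":
    captured = "\n".join([
        "Submitting job...Done.",
        "Current job state: Pending",
        "Current job state: Active",
        "tg-c15",
        "Current job state: Done",
    ])
    print(job_states(captured))
```

Lines that do not start with the prefix, such as the streamed job output `tg-c15`, are ignored.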

  23. GRAM Input Example

      <job>
        <executable>/bin/echo</executable>
        <argument>this is an example string </argument>
        <argument>Globus was here</argument>
        <stdout>${GLOBUS_USER_HOME}/stdout</stdout>
        <stderr>${GLOBUS_USER_HOME}/stderr</stderr>
      </job>

      http://www.globus.org/toolkit/docs/4.2/4.2.1/execution/gram4/user/#gram4-user-usagescenarios-jdd
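
A job description like this one can be generated programmatically, which avoids hand-written XML and escapes special characters automatically. The Python sketch below uses only the element names that appear on the slide (`job`, `executable`, `argument`, `stdout`, `stderr`). The function name `build_job_description` is hypothetical, and the sketch does not validate against the real GRAM schema.

```python
import xml.etree.ElementTree as ET


def build_job_description(executable, arguments=(), stdout=None, stderr=None):
    """Build a <job> document as a string from the given fields."""
    job = ET.Element("job")
    ET.SubElement(job, "executable").text = executable
    for argument in arguments:
        ET.SubElement(job, "argument").text = argument
    if stdout is not None:
        ET.SubElement(job, "stdout").text = stdout
    if stderr is not None:
        ET.SubElement(job, "stderr").text = stderr
    return ET.tostring(job, encoding="unicode")


if __name__ == "__main__":
    print(build_job_description(
        "/bin/echo",
        ["this is an example string ", "Globus was here"],
        stdout="${GLOBUS_USER_HOME}/stdout",
        stderr="${GLOBUS_USER_HOME}/stderr",
    ))
```

The output is a single unindented line; it carries the same elements as the hand-written example.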

  24. Condor-G [4]
      • Condor-G is a Globus-enabled version of the Condor scheduler.
      • It uses Globus to handle inter-organizational problems like:
        • Security
        • Resource management for supercomputers
        • Executable staging
      • The same Condor tools that access local resources are now able to use the Globus protocols to access resources at multiple sites.
      • It communicates with these resources and transfers files to and from these resources using Globus mechanisms, such as:
        • GSI for security
        • GRAM protocol for job submission
        • GASS for file transfer
      • Condor-G can be used to submit jobs to systems managed by Globus.
      • Globus tools can be used to submit jobs to systems managed by Condor.

  25. Condor-G

  26. Using Condor-G
      • Set universe = globus in the submit file
      • Also need to specify the Globus scheduler hostname, for example: globusscheduler = example.org/jobmanager
      • Still use the condor_submit command
      • TeraGrid Condor-G example here:
        • http://www.teragrid.org/userinfo/jobs/condorg.php

  27. UNICORE
      • Alternative to Globus
      • Primarily used in Europe
      • Uses web services, similar to GT4
      • GUI
      • Abstract Job Objects
      • User -> Server -> Virtual Site
      • X.509 and SSL

  28. UNICORE GUI

  29. Upperware
      • Abstract Job Objects? Workflows? What is all this nonsense?!
      • Scientist (primary user) doesn’t care about this stuff
      • Shouldn’t have to deal with writing XML description files or creating a complicated workflow
      • Simply let them run their program

  30. GridShell
      • Unified command line interface
      • Defer to resident experts

  31. References
      [1] Getting started with Condor. http://www.linuxjournal.com/node/9058/print
      [2] Thain, D., Tannenbaum, T., & Livny, M. (2005). Distributed computing in practice: the Condor experience.
      [3] http://grid.rit.edu/seminar/lib/exe/fetch.php/users:jeremy_espenshade:condorjobsubmission.ppt
      [4] http://iag.iucc.ac.il/presentations/front2.ppt
      [5] http://www.cs.wisc.edu/condor/manual/v7.2/
      [6] http://www.globus.org/toolkit/docs/4.2/4.2.1/execution/gram4/user/#gram4-user-usagescenarios-jdd
      [7] http://upload.wikimedia.org/wikipedia/en/6/62/XgridAdminTool.jpg
      [8] Wikipedia
      [9] http://www.isgtw.org/images/Rudolph_expert_client_screenshot2.jpg
      [10] http://upload.wikimedia.org/wikipedia/commons/a/a4/Double_curvature_steel_lattice_Shell_by_Shukhov_in_Vyksa_1897_shell.jpg
