
  1. Job Submission
     Andrew Pangborn & Myles Maxfield
     Rochester Institute of Technology
     01/19/09
     Service Oriented Cyberinfrastructure Lab, http://grid.rit.edu

  2. The Grid
     • Virtual organizations spanning multiple administrative domains
     • Different organizations and administrators
     • Different hardware
     • Different queuing systems
     • How do we make sense of it all?

  3. The Problem
     • At one end are computing resources (the grid fabric) managed by batch queuing systems and middleware
     • At the other end are end-users and their jobs/applications
     • Need software and protocols for submitting jobs to the computing resources
     • Also want to monitor jobs after submission and schedule them efficiently to achieve high throughput

  4. Grid Architecture
     [Figure: grid architecture diagram, labeled "Job Submission". Image from Ian Foster's paper "The Anatomy of the Grid".]

  5. Batch Queuing Systems
     • Submitting a job directly to the batch queuing system
       • One or more queues
       • Priorities
     • Two common architectures
       • Client/server
       • Dynamic offloading
     • User credential (delegation)
     • Jobs have states (e.g. Pending, Running)
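
The ideas on this slide (a queue, priorities, job states) can be sketched as a toy in-memory scheduler. This is only an illustration, not how any real batch system is implemented; the names `BatchQueue`, `Job`, and `JobState` are hypothetical.

```python
import heapq
import itertools
from dataclasses import dataclass
from enum import Enum


class JobState(Enum):
    PENDING = "Pending"
    RUNNING = "Running"


@dataclass
class Job:
    name: str
    priority: int = 0
    state: JobState = JobState.PENDING


class BatchQueue:
    """Toy single queue: highest priority first, FIFO within one priority."""

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()

    def submit(self, name, priority=0):
        job = Job(name=name, priority=priority)
        # heapq is a min-heap, so negate the priority; the counter keeps
        # submission order among jobs that share a priority.
        heapq.heappush(self._heap, (-priority, next(self._counter), job))
        return job

    def run_next(self):
        """Pop the next job and mark it Running; return None if the queue is empty."""
        if not self._heap:
            return None
        _, _, job = heapq.heappop(self._heap)
        job.state = JobState.RUNNING
        return job
```

A job submitted with a higher `priority` value leaves the queue first; two jobs with equal priority leave in submission order.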

  6. Batch Queuing Systems
     • Important examples:
       • Portable Batch System
       • TORQUE
       • Xgrid
       • Sun Grid Engine
       • Load Sharing Facility
       • Condor

  7. Portable Batch System (PBS)
     • Originally developed for NASA
     • Client/server architecture
       • Server: pbs_server
       • Client (execution daemon on each compute node): pbs_mom
     • Works with MPI through built-in environment variables available to the job script (e.g. PBS_NODEFILE)

  8. PBS Example
     litherum@gras:~$ cat test.sh
     #!/bin/sh
     #testpbs
     echo This is a test
     echo today is `date`
     echo This is `hostname`
     echo The current working directory is `pwd`
     ls -alF /home
     uptime

  9. PBS Example
     litherum@gras:~$ qsub test.sh
     6.gras.carrion.rit.edu
     litherum@gras:~$ qstat
     Job id                    Name             User            Time Use S Queue
     ------------------------- ---------------- --------------- -------- - -----
     6.gras                    test.sh          litherum        00:00:00 C batch
     litherum@gras:~$ cat test.sh.o6
     This is a test
     today is Sat Jan 17 18:20:20 EST 2009
     This is carrion02
     The current working directory is /home/litherum
     total 20
     drwxr-xr-x 31 litherum litherum 4096 Jan 17 18:19 litherum/
     18:20:20 up 131 days, 21:20, 0 users, load average: 0.00, 0.00, 0.00
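
When monitoring jobs from a script, the default `qstat` table shown on this slide can be turned into records. The sketch below assumes the six-column layout from the slide and that no field contains spaces; `parse_qstat` and `STATE_NAMES` are hypothetical helper names, and the state table lists only a few common codes.

```python
# A few common single-letter job states (not an exhaustive list).
STATE_NAMES = {"Q": "queued", "R": "running", "C": "completed", "H": "held", "E": "exiting"}


def parse_qstat(text):
    """Parse default `qstat` output into a list of dicts, one per job."""
    jobs = []
    seen_separator = False
    for line in text.splitlines():
        if line.strip().startswith("---"):
            seen_separator = True  # rows after the dashed line are jobs
            continue
        if not seen_separator or not line.strip():
            continue
        fields = line.split()
        if len(fields) != 6:
            continue  # skip rows this simple parser cannot interpret
        job_id, name, user, time_use, state, queue = fields
        jobs.append({
            "job_id": job_id,
            "name": name,
            "user": user,
            "time_use": time_use,
            "state": state,
            "state_name": STATE_NAMES.get(state, "unknown"),
            "queue": queue,
        })
    return jobs


SAMPLE = """\
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
6.gras                    test.sh          litherum        00:00:00 C batch
"""

jobs = parse_qstat(SAMPLE)
```

In practice the text would come from running `qstat` (for example with `subprocess.run`), which is left out here so the sketch runs without a PBS installation.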

  10. Torque
     • Built on top of PBS (a fork of OpenPBS)
     • Supports reservations, where you can reserve specific resources for specific times
     • Supports partitions, where you can partition a cluster into smaller sub-clusters

  11. Torque
     (showq is provided by the Maui/Moab scheduler typically paired with Torque)
     litherum@gras:~$ showq
     ACTIVE JOBS--------------------
     JOBNAME   USERNAME   STATE   PROC   REMAINING   STARTTIME
     0 Active Jobs   0 of 4 Processors Active (0.00%)
                     0 of 2 Nodes Active (0.00%)
     IDLE JOBS----------------------
     JOBNAME   USERNAME   STATE   PROC   WCLIMIT   QUEUETIME
     0 Idle Jobs
     BLOCKED JOBS----------------
     JOBNAME   USERNAME   STATE   PROC   WCLIMIT   QUEUETIME
     Total Jobs: 0   Active Jobs: 0   Idle Jobs: 0   Blocked Jobs: 0

  12. Xgrid
     • Developed by Apple
     • Essentially the same as Condor
     • GUI! =)
     • Client/server model
     [Image: Xgrid admin tool, http://upload.wikimedia.org/wikipedia/en/6/62/XgridAdminTool.jpg]

  13. Sun Grid Engine
     • Open source, like everything new Sun puts out
     • Supports
       • Reservations
       • Job dependencies
       • Checkpointing
       • Multiple scheduling algorithms
     • Web interface
     • Professional!

  14. Middleware
     • These queuing systems are hard to use
     • There may be many systems employed in a given grid
     • Wouldn’t it be nice if all this were unified in a single implementation?
     • Middleware that handles job submission in a virtual organization, across resources spread over multiple administrative domains, would be useful!

  15. Condor
     • A tool for pooling and “scavenging” computing resources and distributing jobs
     • Similar to a batch queuing system [2]
       • Job management
       • Scheduling policy
       • Priority scheme
       • Resource monitoring
       • Resource management
     • Also focuses on high throughput and “opportunistic computing” [2]
       • Utilize computing resources whenever they are available
     Condor image from: http://www.cs.wisc.edu/condor/

  16. Condor Universes [1]
     • Standard
       • Checkpointing, fault tolerance
       • Link job against Condor libraries
     • Vanilla
       • Simpler, can run unmodified binaries (do not need to be “condor compiled”)
       • No support for partial execution or job relocation
     • Others
       • PVM
       • MPI
       • Java

  17. Condor Submission File Example [1]
     #hello.sub
     #condor job file example
     Universe = Vanilla
     Executable = hello
     Output = hello.out
     Input = hello.in
     Error = hello.err
     Log = hello.log
     Queue
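
When many similar jobs are needed, a submit description like the one on this slide can be generated rather than written by hand. A minimal sketch follows; `make_submit_description` is a hypothetical helper, and it relies on submit command names being case-insensitive, so it writes them in lowercase.

```python
def make_submit_description(executable, universe="vanilla", count=1, **commands):
    """Return the text of a Condor submit description file.

    Extra keyword arguments (output, input, error, log, ...) are written
    as `name = value` lines in the order they were passed.
    """
    lines = [f"universe = {universe}", f"executable = {executable}"]
    for name, value in commands.items():
        lines.append(f"{name} = {value}")
    # `queue` submits one job; `queue N` submits N copies.
    lines.append("queue" if count == 1 else f"queue {count}")
    return "\n".join(lines) + "\n"


hello_sub = make_submit_description(
    "hello",
    output="hello.out",
    input="hello.in",
    error="hello.err",
    log="hello.log",
)
```

Writing `hello_sub` to a file such as `hello.sub` gives something that can be passed to `condor_submit`.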

  18. Some Condor Commands [5]
     • condor_submit <job_file.sub>
       • Submit a Condor job
     • condor_q
       • View the Condor job queue
     • condor_status
       • Check the status of machines (resources) in the Condor pool
     • condor_compile
       • Re-links jobs for use in the standard universe

  19. Condor job structures
     Programming models for larger scale jobs using the Condor agent
     • Master-Worker
       • Single master process coordinates all the independent tasks
       • Collects results as workers finish, distributes new jobs to workers
     • DAG (Directed Acyclic Graph)
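
Both models can be illustrated in a few lines: a master hands out jobs to a pool of workers, collects results as they finish, and only releases a job once everything it depends on in the DAG is done. This is a local sketch using Python threads (Python 3.9+ for `graphlib`), not Condor itself; `run_dag` and the job names are hypothetical.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from graphlib import TopologicalSorter


def run_dag(dependencies, task, max_workers=2):
    """Run `task(name)` for every job, respecting the dependency graph.

    `dependencies` maps a job name to the set of jobs it depends on.
    Returns the job names in the order they finished.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    finished = []
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        running = {}
        while sorter.is_active():
            # Master: hand every job whose dependencies are met to a worker.
            for name in sorter.get_ready():
                running[pool.submit(task, name)] = name
            # Master: collect results as workers finish.
            done, _ = wait(list(running), return_when=FIRST_COMPLETED)
            for future in done:
                name = running.pop(future)
                future.result()  # re-raise any exception from the worker
                finished.append(name)
                sorter.done(name)  # may unlock dependent jobs
    return finished


# Diamond-shaped DAG: B and C depend on A, D depends on both.
order = run_dag({"B": {"A"}, "C": {"A"}, "D": {"B", "C"}}, task=lambda name: name)
```

B and C may finish in either order, but A always comes first and D always comes last.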

  20. GRAM [4]
     • Globus Resource Allocation Manager (GRAM)
       • Resource allocation
       • Process creation
       • Monitoring
       • Management
     • Maps requests expressed in a Resource Specification Language (RSL) into commands to local schedulers and computers

  21. GRAM
     • Pluggable!
     • Can’t make up their mind how to describe jobs
     • Will submit jobs to:
       • Condor
       • LSF
       • PBS/Torque
       • ???
     • Unified interface, with an identifier for which cluster/service to use
     • Job submission file
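
The "unified interface plus an identifier for the local scheduler" idea can be sketched as a small dispatch table. This is not how GRAM's scheduler adapters are actually written; `SUBMIT_COMMANDS` and `build_submit_command` are hypothetical, and real invocations of `qsub`, `condor_submit`, and `bsub` take different inputs and options than this simplification suggests.

```python
# Hypothetical mapping from a scheduler identifier to its submit command.
SUBMIT_COMMANDS = {
    "PBS": ["qsub"],
    "Condor": ["condor_submit"],
    "LSF": ["bsub"],
}


def build_submit_command(factory_type, job_file):
    """Return the argument list that would submit `job_file` locally."""
    try:
        base = SUBMIT_COMMANDS[factory_type]
    except KeyError:
        raise ValueError(f"no adapter for scheduler type {factory_type!r}") from None
    return base + [job_file]


pbs_command = build_submit_command("PBS", "test.sh")
```

Supporting another queuing system then means adding one entry to the table, which is the sense in which such a design is pluggable.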

  22. GRAM Example
     maxfield@tg-login1:~> globusrun-ws -submit -factory https://tg-login.ornl.teragrid.org:8444/wsrf/services/ManagedJobFactoryService -factory-type PBS -streaming -job-command /bin/hostname
     Delegating user credentials...Done.
     Submitting job...Done.
     Job ID: uuid:89538014-e4f2-11dd-81df-0010180bb4e6
     Termination time: 01/18/2009 23:57 GMT
     Current job state: Pending
     Current job state: Active
     tg-c15
     Current job state: CleanUp-Hold
     Current job state: CleanUp
     Current job state: Done
     Destroying job...Done.
     Cleaning up any delegated credentials...Done.
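
The job's life cycle can be read off the "Current job state:" lines in the streamed output. A small sketch that extracts the state sequence from captured text follows; `extract_states` is a hypothetical helper, and the sample is copied from the session on this slide.

```python
import re

# Matches lines such as "Current job state: Active".
STATE_RE = re.compile(r"^Current job state: (\S+)[ \t]*$", re.MULTILINE)


def extract_states(output):
    """Return the job states in the order they were reported."""
    return STATE_RE.findall(output)


SAMPLE_OUTPUT = """\
Delegating user credentials...Done.
Submitting job...Done.
Job ID: uuid:89538014-e4f2-11dd-81df-0010180bb4e6
Termination time: 01/18/2009 23:57 GMT
Current job state: Pending
Current job state: Active
tg-c15
Current job state: CleanUp-Hold
Current job state: CleanUp
Current job state: Done
Destroying job...Done.
Cleaning up any delegated credentials...Done.
"""

states = extract_states(SAMPLE_OUTPUT)
```

Lines that are not state reports, such as the job's own output (`tg-c15`), are ignored.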

  23. GRAM Input Example
     <job>
       <executable>/bin/echo</executable>
       <argument>this is an example string </argument>
       <argument>Globus was here</argument>
       <stdout>${GLOBUS_USER_HOME}/stdout</stdout>
       <stderr>${GLOBUS_USER_HOME}/stderr</stderr>
     </job>
     http://www.globus.org/toolkit/docs/4.2/4.2.1/execution/gram4/user/#gram4-user-usagescenarios-jdd
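
A job description with this shape can also be produced programmatically. The sketch below uses Python's standard `xml.etree.ElementTree` and covers only the elements shown on this slide; `build_job_description` is a hypothetical helper, not part of the Globus Toolkit.

```python
import xml.etree.ElementTree as ET


def build_job_description(executable, arguments=(), stdout=None, stderr=None):
    """Return a <job> description as an XML string."""
    job = ET.Element("job")
    ET.SubElement(job, "executable").text = executable
    for argument in arguments:
        ET.SubElement(job, "argument").text = argument
    if stdout is not None:
        ET.SubElement(job, "stdout").text = stdout
    if stderr is not None:
        ET.SubElement(job, "stderr").text = stderr
    return ET.tostring(job, encoding="unicode")


job_xml = build_job_description(
    "/bin/echo",
    arguments=["this is an example string ", "Globus was here"],
    stdout="${GLOBUS_USER_HOME}/stdout",
    stderr="${GLOBUS_USER_HOME}/stderr",
)
```

Building the document through an XML library rather than string concatenation means special characters in arguments are escaped automatically.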

  24. Condor-G [4]
     • Condor-G is a Globus-enabled version of the Condor scheduler.
     • It uses Globus to handle inter-organizational problems like:
       • Security
       • Resource management for supercomputers
       • Executable staging
     • The same Condor tools that access local resources are now able to use the Globus protocols to access resources at multiple sites.
     • It communicates with these resources and transfers files to and from them using Globus mechanisms, such as:
       • GSI for security
       • GRAM protocol for job submission
       • GASS for file transfer
     • Condor-G can be used to submit jobs to systems managed by Globus.
     • Globus tools can be used to submit jobs to systems managed by Condor.

  25. Condor-G
     [Figure: Condor-G]

  26. Using Condor-G
     • Set universe = globus in the submit file
     • Also need to specify the Globus scheduler hostname, for example: globusscheduler = example.org/jobmanager
     • Still use the condor_submit command
     • TeraGrid Condor-G example here:
       • http://www.teragrid.org/userinfo/jobs/condorg.php

  27. UNICORE
     • Alternative to Globus
     • Primarily used in Europe
     • Uses web services, similar to GT4
     • GUI
     • Abstract Job Objects
     • User -> Server -> Virtual Site
     • X.509 and SSL

  28. UNICORE GUI
     [Figure: UNICORE GUI]

  29. Upperware
     • Abstract Job Objects? Workflows? What is all this nonsense?!
     • Scientist (primary user) doesn’t care about this stuff
     • Shouldn’t have to deal with writing XML description files or creating a complicated workflow
     • Simply let them run their program

  30. GridShell
     • Unified command line interface
     • Defer to resident experts

  31. References
     [1] Getting started with Condor. http://www.linuxjournal.com/node/9058/print
     [2] Thain, D., Tannenbaum, T., & Livny, M. (2005). Distributed computing in practice: the Condor experience.
     [3] http://grid.rit.edu/seminar/lib/exe/fetch.php/users:jeremy_espenshade:condorjobsubmission.ppt
     [4] http://iag.iucc.ac.il/presentations/front2.ppt
     [5] http://www.cs.wisc.edu/condor/manual/v7.2/
     [6] http://www.globus.org/toolkit/docs/4.2/4.2.1/execution/gram4/user/#gram4-user-usagescenarios-jdd
     [7] http://upload.wikimedia.org/wikipedia/en/6/62/XgridAdminTool.jpg
     [8] Wikipedia
     [9] http://www.isgtw.org/images/Rudolph_expert_client_screenshot2.jpg
     [10] http://upload.wikimedia.org/wikipedia/commons/a/a4/Double_curvature_steel_lattice_Shell_by_Shukhov_in_Vyksa_1897_shell.jpg