
OGF 19 Condor Software Forum Condor-G


alena


  1. OGF 19 Condor Software Forum Condor-G

  2. What Is It? • Condor-G is a specialization of Condor. It is also known as the “grid universe”. • Condor-G speaks many different job management protocols. • Condor-G benefits from all the wonderful Condor features, like a real job queue.

  3. Grid Fault-Tolerance • Condor-G does whatever it takes to run your jobs, even if … • Your local machine crashes • The grid service is temporarily unavailable • The network goes down

  4. Remote Resource Access: Globus [Diagram labels: Organization A, “globusrun myjob …”, Globus GRAM Protocol, Globus JobManager, fork(), Organization B]

  5. Globus [Diagram labels, same as slide 4: Organization A, “globusrun myjob …”, Globus GRAM Protocol, Globus JobManager, fork(), Organization B]

  6. Globus + Condor [Diagram labels: Organization A, “globusrun myjob …”, Globus GRAM Protocol, Globus JobManager, Submit to Condor, Condor Pool, Organization B]

  7. Globus + Condor [Diagram labels: Organization A, “globusrun …”, Globus GRAM Protocol, Globus JobManager, Submit to Condor, Condor Pool, Organization B]

  8. Condor-G + Globus + Condor [Diagram labels: Organization A, Condor-G with queued jobs myjob1 myjob2 myjob3 myjob4 myjob5 …, Globus GRAM Protocol, Globus JobManager, Submit to Condor, Condor Pool, Organization B]

  9. Condor-G Fault-Tolerance: Lost Contact with Remote Jobmanager • Can we contact gatekeeper? Yes: jobmanager crashed. No: retry until we can talk to gatekeeper again… • Can we reconnect to jobmanager? No: machine crashed or job completed. Yes: network was down. • Restart jobmanager • Has job completed? No: is job still running? Yes: update queue.
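
The flowchart on slide 9 can be sketched as a small decision function. The transcript does not preserve the arrows, so the order of the checks below is one plausible reading, not the confirmed Condor-G logic. The three probe callables and the `Action` names are hypothetical stand-ins introduced for illustration; they are not Condor-G functions.

```python
from enum import Enum


class Action(Enum):
    RESUME_MONITORING = "network was down; keep monitoring the job"
    RETRY_GATEKEEPER = "retry until we can talk to the gatekeeper again"
    UPDATE_QUEUE = "job completed; update the local queue"
    CHECK_STILL_RUNNING = "job not completed; check whether it is still running"


def recover(can_reconnect_jobmanager, can_contact_gatekeeper, job_completed):
    """Pick a recovery action after contact with the remote jobmanager is lost.

    Each argument is a zero-argument callable returning bool
    (hypothetical probes, assumed for this sketch).
    """
    if can_reconnect_jobmanager():
        # The jobmanager answers again, so only the network was down.
        return Action.RESUME_MONITORING
    if not can_contact_gatekeeper():
        # The remote machine itself is unreachable; nothing to do but retry.
        return Action.RETRY_GATEKEEPER
    # Gatekeeper is reachable, so the jobmanager crashed or exited.
    # A real implementation would restart the jobmanager here (assumed step)
    # before asking it about the job.
    if job_completed():
        return Action.UPDATE_QUEUE
    return Action.CHECK_STILL_RUNNING
```

Passing the probes as callables keeps the sketch free of any network code, so the same function can be exercised with simple lambdas.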

  10. Just to be fair… • The gatekeeper doesn’t have to submit to a Condor pool. • It could be PBS, LSF, Sun Grid Engine… • Condor-G will work fine whatever the remote batch system is.

  11. Other Condor-G Features • Other Grid Protocols: works with WS-GRAM, NorduGrid, Unicore • Credential Management: pull refreshed credentials from MyProxy; push refreshed credentials to remote systems • Job Scheduling: use Matchmaking to select resources for jobs • GlideIn: allows late binding of resources and job checkpoint/migration

  12. Condor-G [Diagram labels: Job Description (Job ClassAd), Condor-G, HTTPS, WSRF, GT2 [.1|2|4], GT4, Condor, PBS/LSF, NorduGrid, Unicore]

  13. Pre-WS GRAM • Submit file:
        grid_resource = gt2 \
            foo.edu/jobmanager-pbs
        globus_rsl = (queue=long)\
            (condor_submit=(universe java))

  14. OGSA GRAM • Submit file:
        grid_resource = gt3 http://foo.edu/\
            ogsa/services/base/gram/\
            PBSManagedJobFactoryService
        globus_rsl = (queue=long)\
            (condor_submit=(universe java))
      • Museum mode

  15. WS GRAM • Submit file:
        grid_resource = gt4 foo.edu PBS
        globus_xml = <queue>long</queue>

  16. NorduGrid • Submit file:
        grid_resource = nordugrid foo.edu
        nordugrid_rsl = (queue=long)

  17. Unicore • Submit file:
        grid_resource = unicore usite.org vsite
        keystore_file = keystore
        keystore_passphrase_file = keystore.pw
        keystore_alias = my cert

  18. Condor • Submit file:
        grid_resource = condor schedd.foo.edu \
            cm.foo.edu
        remote_universe = java

  19. PBS • Submit file:
        grid_resource = pbs

  20. LSF • Submit file:
        grid_resource = lsf
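
Slides 13 to 20 differ mainly in the shape of the grid_resource line. As a rough illustration, the helper below assembles that line for each grid type and checks the number of arguments against what the slides show. The function name `grid_resource_line` and the argument counts are assumptions drawn only from these slides; this is a sketch, not part of Condor.

```python
# Number of arguments after the grid type, as they appear on slides 13-20.
EXPECTED_ARGS = {
    "gt2": 1,        # host/jobmanager contact, e.g. foo.edu/jobmanager-pbs
    "gt3": 1,        # service URL
    "gt4": 2,        # host and scheduler type, e.g. foo.edu PBS
    "nordugrid": 1,  # host
    "unicore": 2,    # usite and vsite
    "condor": 2,     # remote schedd and its central manager
    "pbs": 0,
    "lsf": 0,
}


def grid_resource_line(grid_type, *args):
    """Return a 'grid_resource = ...' line for a submit description file."""
    if grid_type not in EXPECTED_ARGS:
        raise ValueError(f"unknown grid type: {grid_type}")
    if len(args) != EXPECTED_ARGS[grid_type]:
        raise ValueError(
            f"{grid_type} expects {EXPECTED_ARGS[grid_type]} argument(s), "
            f"got {len(args)}"
        )
    return " ".join(["grid_resource =", grid_type, *args])
```

For example, `grid_resource_line("condor", "schedd.foo.edu", "cm.foo.edu")` reproduces the line on slide 18 without the line continuation.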

  21. Grid Universe Fault-Tolerance: Credential Management • Authentication in many grid protocols is done with limited-lifetime X509 proxies • Proxy may expire before jobs finish executing • Condor can put jobs on hold and email user to refresh proxy • Condor can automatically retrieve new proxies from MyProxy • When the proxy is refreshed, Condor forwards it to the jobs

  22. MyProxy • Submit file:
        MyProxyHost = foo.edu:12345
        MyProxyServerDN = /DC=org/DC=doegrids…
        MyProxyCredentialName = proxy_file
        MyProxyRefreshThreshold = 240 #mins
        MyProxyNewProxyLifetime = 12 #hrs
        MyProxyPassword = password
      • Or give password on command line:
        condor_submit -p password submit.desc
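
The credential handling on slides 21 and 22 amounts to a threshold check on the proxy's remaining lifetime. The sketch below assumes that a proxy is refreshed once its remaining lifetime drops to the refresh threshold (taken in minutes, following the slide's comment), and that without MyProxy the job is held only when the proxy has actually expired. Both rules, and the function name `proxy_action`, are assumptions for illustration rather than documented Condor behavior.

```python
def proxy_action(minutes_left, refresh_threshold_min=240, have_myproxy=True):
    """Decide what to do with a job's X509 proxy.

    minutes_left: remaining proxy lifetime in minutes.
    refresh_threshold_min: mirrors MyProxyRefreshThreshold from the slide
        (interpreted here in minutes, as the slide's comment suggests).
    have_myproxy: whether a MyProxy server is configured for the job.
    """
    if have_myproxy and minutes_left <= refresh_threshold_min:
        # Fetch a new proxy from MyProxy and forward it to the job.
        return "refresh"
    if minutes_left <= 0:
        # No automatic refresh available: hold the job and email the user.
        return "hold"
    return "ok"
```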

  23. Condor-G Matchmaking • Use Condor-G matchmaking with grid universe jobs • Allows Condor-G to dynamically assign computing jobs to grid sites • An example of lazy planning

  24. Condor-G Matchmaking, cont. • Normally a grid universe job must specify the site in the submit description file via the “grid_resource” attribute like so:
        Executable = foo
        Universe = grid
        Grid_Resource = gt2 \
            beak.cs.wisc.edu/jobmanager-pbs
        queue

  25. Condor-G Matchmaking, cont. • With matchmaking, grid universe jobs can use requirements and rank:
        Executable = foo
        Universe = grid
        Grid_Resource = $$(ResourceName)
        Requirements = arch == LINUX
        Rank = NumberOfNodes * random()
        Queue
      • The $$(x) syntax inserts information from the target ClassAd when a match is made.
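
To make the requirements, rank, and $$(x) steps concrete, here is a minimal sketch of the idea using plain Python dicts in place of ClassAds. It is not the Condor matchmaker: `match` and `expand` are hypothetical helpers, the rank is deterministic (no random() factor), and resource-side requirements are ignored.

```python
import re


def match(job, resources):
    """Return the resource ad that satisfies the job's requirements and
    has the highest rank, or None if no resource qualifies."""
    candidates = [ad for ad in resources if job["requirements"](ad)]
    if not candidates:
        return None
    return max(candidates, key=job["rank"])


def expand(template, target_ad):
    """Replace each $$(Name) in template with target_ad[Name]."""
    return re.sub(
        r"\$\$\((\w+)\)",
        lambda m: str(target_ad[m.group(1)]),
        template,
    )


# Hypothetical resource ads; attribute names follow slides 25 and 27.
resources = [
    {"Arch": "LINUX", "NumberOfNodes": 300, "ResourceName": "gt4 foo.edu PBS"},
    {"Arch": "LINUX", "NumberOfNodes": 50, "ResourceName": "gt2 bar.edu/jobmanager-pbs"},
]
job = {
    "requirements": lambda ad: ad["Arch"] == "LINUX",
    "rank": lambda ad: ad["NumberOfNodes"],
}
best = match(job, resources)
grid_resource = expand("Grid_Resource = $$(ResourceName)", best)
```

The substitution happens only after a match is chosen, which is what lets the job leave its target site unspecified until then.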

  26. Condor-G Matchmaking, cont. • Where do these target ClassAds representing Globus gatekeepers come from? Several options: • Simple script on gatekeeper publishes an ad via condor_advertise command-line utility (method used by D0 JIM, USCMS) • Program to query Globus MDS and convert information into ClassAd (method used by EDG) • Run HawkEye with appropriate plugins on the gatekeeper • For explanation of Condor-G matchmaking setup for USCMS, see http://www.cs.wisc.edu/condor/USCMS_matchmaking.html

  27. Condor-G Matchmaking: Creating the Resource Ad • Machine Ad:
        MyType = "Machine"
        TargetType = "Job"
        Name = "foo.edu"
        Machine = "foo.edu"
        ResourceName = "gt4 foo.edu PBS"
        UpdateSequenceNumber = 4
        Requirements = TARGET.JobUniverse == 9 && \
            CurMatches < 10
        CurMatches = 0
        NumberOfNodes = 300
        Rank = 0.0
        CurrentRank = 0.0
        WantAdRevaluate = True

  28. Condor-G Matchmaking: Creating the Resource Ad • Advertising a resource:
        condor_advertise UPDATE_STARTD_AD \
            ad-file
      • Call periodically • Use unix time for UpdateSequenceNumber
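
A script that publishes the ad from slide 27 mostly needs to regenerate the text with a fresh UpdateSequenceNumber. The sketch below builds that text with the current unix time; `resource_ad` is a hypothetical helper, and writing the result to a file and running condor_advertise on it periodically is left to the caller.

```python
import time


def resource_ad(name, resource_name, number_of_nodes, now=None):
    """Build the text of a machine ad like the one on slide 27.

    now: unix timestamp to use as UpdateSequenceNumber; defaults to the
    current time, so each periodic update gets a larger number.
    """
    seq = int(time.time() if now is None else now)
    lines = [
        'MyType = "Machine"',
        'TargetType = "Job"',
        f'Name = "{name}"',
        f'Machine = "{name}"',
        f'ResourceName = "{resource_name}"',
        f"UpdateSequenceNumber = {seq}",
        "Requirements = TARGET.JobUniverse == 9 && CurMatches < 10",
        "CurMatches = 0",
        f"NumberOfNodes = {number_of_nodes}",
        "Rank = 0.0",
        "CurrentRank = 0.0",
        "WantAdRevaluate = True",
    ]
    return "\n".join(lines) + "\n"
```

The `now` parameter exists only so the output can be made reproducible; a periodic job would normally omit it.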

  29. But Wait, There’s More… • What if you want to run standard universe jobs on grid resources? • For matchmaking and dynamic scheduling of jobs • For job checkpointing and migration • For remote system calls • What if you don’t want to send a job to a site until the moment the job will start running (late binding)?

  30. One Solution: Condor-G GlideIn • You can use the Grid Universe to run Condor daemons on grid resources • When the resources run these GlideIn jobs, they will temporarily join your Condor Pool • You can then submit Standard, Vanilla, PVM, or MPI Universe jobs and they will be matched and run on the grid resources

  31. [Diagram labels: your workstation, personal Condor, 600 Condor jobs, Globus Grid, glide-in jobs, LSF, PBS, Condor, Condor Pool, Friendly Condor Pool]

  32. GlideIn Concerns • What if a grid resource kills my GlideIn job? • That resource will disappear from your pool and your jobs will be rescheduled on other machines • Standard universe jobs will resume from their last checkpoint as usual • What if all my jobs are completed before a GlideIn job runs? • If a GlideIn Condor daemon is not matched with a job in 10 minutes, it terminates, freeing the resource
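
The idle shutdown rule on slide 32 can be written as a single check. This is a minimal sketch, assuming the daemon tracks when it started and whether it has been matched; `glidein_should_exit` and its parameters are hypothetical names, and only the 10-minute limit comes from the slide.

```python
def glidein_should_exit(started_at, now, matched, max_idle_seconds=600):
    """Return True when an unmatched GlideIn daemon has waited too long.

    started_at, now: unix timestamps in seconds.
    matched: True once the daemon has been matched with a job.
    max_idle_seconds: 10 minutes, as stated on the slide.
    """
    if matched:
        return False
    return (now - started_at) >= max_idle_seconds
```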

  33. Condor [Diagram labels: condor_submit, schedd (Job caretaker), matchmaker, startd (Runs job)]

  34. Condor-G [Diagram labels: condor_submit, schedd (Job caretaker), gridmanager, gahp, Globus gatekeeper, PBS or LSF]

  35. Condor-C [Diagram labels: condor_submit, schedd (Job caretaker), gridmanager, condor-gahp, schedd, matchmaker, startd]

  36. Condor-C to non-Condor [Diagram labels: condor_submit, schedd (Job caretaker), gridmanager, condor-gahp, schedd, gridmanager, pbs/lsf-gahp, PBS or LSF]

  37. Gliding in Condor-C [Diagram labels: condor_submit, schedd (Job caretaker), gridmanager, gahp, condor-gahp, Globus gatekeeper, schedd, gridmanager, pbs/lsf-gahp, PBS or LSF; steps: 1. Glide-in, 2. Submit jobs]

  38. Matchmaking with Condor-C • In all of these examples, Condor-C went to a specific remote schedd • This is not required: you can do matchmaking

  39. Matchmaking with Condor-C [Diagram labels: condor_submit, schedd (Job caretaker), gridmanager, condor-gahp, matchmaker, schedd, …, submit job]
