
What’s New in Condor-G


  1. What’s New in Condor-G

  2. Outline • What is Condor-G • Released New Features • In Development

  3. What Is Condor-G • Use Condor to run jobs on the Grid • Uses Globus Toolkit • GRAM (submit a remote job) • GASS (transfer job’s files) • Two components • Globus Universe • GlideIn

  4. Globus Universe • Run a job on a Grid resource • Features • Job management • Fault tolerance • Credential management • Roughly equivalent to the vanilla universe

  5. How It Works • [Diagram: Condor-G, Grid Resource, Schedd, LSF]

  6. How It Works • [Diagram: 600 Globus jobs, Condor-G, Grid Resource, Schedd, LSF]

  7. How It Works • [Diagram: 600 Globus jobs, Condor-G, Grid Resource, Schedd, GridManager, LSF]

  8. How It Works • [Diagram: 600 Globus jobs, Condor-G, Grid Resource, Schedd, GridManager, JobManager, LSF]

  9. How It Works • [Diagram: 600 Globus jobs, Condor-G, Grid Resource, Schedd, GridManager, JobManager, LSF, User Job]

  10. GlideIn • Run the Condor daemons on Grid resources as user jobs • Create your own personal Condor pool from temporarily-acquired Grid resources • Brings the full power of Condor to the Grid

  11. [Diagram: Globus Grid, LSF, PBS, Condor, Condor-G]

  12. [Diagram: Globus Grid, 600 Condor jobs, LSF, PBS, Condor, Condor-G]

  13. [Diagram: Globus Grid, Condor-G, 600 Condor jobs, LSF, PBS, Condor]

  14. [Diagram: Globus Grid, Condor-G, 600 Condor jobs, LSF, PBS, glide-ins, Condor]

  15. [Diagram: Globus Grid, Condor-G, 600 Condor jobs, LSF, PBS, glide-ins, Condor]

  16. [Diagram: Globus Grid, Condor-G, 600 Condor jobs, LSF, PBS, glide-ins, Condor]

  17. [Diagram: Globus Grid, Condor-G, 600 Condor jobs, LSF, PBS, glide-ins, Condor]

  18. Released New Features • Stuff we’ve added in the past year • Released and ready for use in Condor 6.6

  19. Globus ASCII Helper Protocol (GAHP) • Encapsulates Globus libraries in separate process • Simple ASCII protocol • Easy for legacy applications to use Globus when they can’t link directly with the libraries
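
The idea of a simple ASCII protocol in front of a helper can be illustrated with a toy example. The sketch below is not the GAHP command set: the commands `ECHO`, `ADD`, and `QUIT` and the `S`/`E` reply prefixes are hypothetical, chosen only to show a line-oriented request/response loop that a client could drive over pipes without linking against the helper's libraries.

```python
import io


def serve(requests, replies):
    """Toy helper loop: read one ASCII command per line, write one reply per line.

    Command names and reply prefixes are hypothetical, not real GAHP commands.
    """
    for raw in requests:
        parts = raw.strip().split(" ")
        command, args = parts[0].upper(), parts[1:]
        if command == "ECHO":
            replies.write("S " + " ".join(args) + "\n")
        elif command == "ADD":
            try:
                replies.write("S " + str(sum(int(a) for a in args)) + "\n")
            except ValueError:
                replies.write("E bad-argument\n")
        elif command == "QUIT":
            replies.write("S\n")
            break  # stop reading further requests
        else:
            replies.write("E unknown-command\n")


def run_session(lines):
    """Feed a list of command lines to the helper and collect its reply lines."""
    out = io.StringIO()
    serve(io.StringIO("".join(line + "\n" for line in lines)), out)
    return out.getvalue().splitlines()


if __name__ == "__main__":
    for reply in run_session(["ECHO hello world", "ADD 2 3", "FROB", "QUIT"]):
        print(reply)
```

In a real deployment the two streams would be the helper process's standard input and output; in-memory streams are used here only to keep the sketch self-contained.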

  20. How It Works - GAHP Condor-G Grid Resources JobManager Schedd GridManager JobManager GAHP Client JobManager GAHP Server

  21. File Staging • Arbitrary input and output files can be staged to and from execution site • Same syntax as other universes • Limitation • Output files must be explicitly named

  22. File Staging (cont) • Input, Output, and Error can be URLs • Files will be transferred directly to and from execution site • Output and Error can be staged or streamed
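
As a rough illustration of the URL case, the sketch below classifies a job's Input, Output, and Error values as either URLs (transferred directly to or from the execution site) or plain file names (handled through the submit machine). The function names, the labels `direct` and `via-submit-machine`, and the example host are assumptions for illustration, not Condor-G internals.

```python
from urllib.parse import urlparse


def is_url(value):
    """Treat a value as a URL only if it has both a scheme and a host part."""
    parsed = urlparse(value)
    return bool(parsed.scheme) and bool(parsed.netloc)


def plan_transfers(job):
    """Return a hypothetical transfer plan for the job's standard files."""
    plan = {}
    for key in ("Input", "Output", "Error"):
        value = job.get(key)
        if value is None:
            continue  # attribute not set for this job
        plan[key] = "direct" if is_url(value) else "via-submit-machine"
    return plan


if __name__ == "__main__":
    job = {
        "Input": "gsiftp://data.example.org/in.dat",  # hypothetical URL
        "Output": "out.txt",
    }
    print(plan_transfers(job))
```

Note that this check treats host-less forms such as `file:///tmp/x` as plain names; a real implementation would need its own rule for those.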

  23. Credential Refresh • Renewed credentials are used by Condor-G and forwarded to the execution site automatically • No processes need to be restarted

  24. Better Credential Management • One GridManager process can handle multiple credential files with same subject • More efficient when you want to have different credential lifetimes for different jobs

  25. Grid Match-Making • Globus jobs matched with Globus resources by the Condor match-maker using ClassAds • Current limitation • User/admin must create resource ads
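
A minimal sketch of the matching idea, assuming resource ads are plain dictionaries and the job supplies a requirements predicate and a rank function. The attribute names (`Arch`, `OpSys`, `Mflops`, `gatekeeper_url`) are borrowed from the job ad example later in the talk; the resource values and host names are made up, and the real match-maker evaluates ClassAd expressions rather than Python callables.

```python
def match(job, resources):
    """Return the resource ad that satisfies the job's requirements
    and has the highest rank, or None if nothing qualifies."""
    candidates = [ad for ad in resources if job["requirements"](ad)]
    if not candidates:
        return None
    return max(candidates, key=job["rank"])


# Hypothetical, hand-written resource ads.
RESOURCES = [
    {"gatekeeper_url": "gk-a.example.org", "Arch": "INTEL", "OpSys": "LINUX", "Mflops": 120},
    {"gatekeeper_url": "gk-b.example.org", "Arch": "INTEL", "OpSys": "SOLARIS", "Mflops": 300},
    {"gatekeeper_url": "gk-c.example.org", "Arch": "INTEL", "OpSys": "LINUX", "Mflops": 250},
]

JOB = {
    "requirements": lambda ad: ad["Arch"] == "INTEL" and ad["OpSys"] == "LINUX",
    "rank": lambda ad: ad["Mflops"],
}

if __name__ == "__main__":
    best = match(JOB, RESOURCES)
    # The matched ad's gatekeeper_url would become the job's contact string.
    print(best["gatekeeper_url"] if best else "no match")
```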

  26. Fault Tolerance • Condor-G does its best to automatically recover from failures • User can guide decisions with job policy expressions • Periodic Release • GlobusResubmit • Rematch

  27. PeriodicRelease Expression • Condor-G puts problematic jobs on hold • This expression tells Condor-G when to release and retry such jobs

  28. GlobusResubmit Expression • Tells Condor-G when a problematic job submission should be abandoned • When this expression becomes true • Best effort is made to clean up current job submission • New job submission is attempted

  29. Rematch Expression • Tells Condor-G when a problematic resource should be abandoned • Evaluated when GlobusResubmit evaluates to true • When this expression becomes true • Best effort is made to clean up current job submission • Job is rematched

  30. Job Ad Example
      GlobusContactString = TARGET.gatekeeper_url
      Requirements = TARGET.Arch == "LINUX" && TARGET.OpSys == "LINUX"
      Rank = TARGET.Mflops
      PeriodicRelease = ((NumMatches < 10) && ((CurrentTime-EnteredCurrentStatus) > 600))
      GlobusResubmit = NumSystemHolds >= NumMatches
      Rematch = True
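
To make the three policy expressions in the job ad example easier to read, the sketch below restates them as Python functions over a dictionary of job attributes. It only mirrors the expressions as written, plus the rule from the Rematch slide that Rematch is evaluated when GlobusResubmit is true; it is not how Condor-G evaluates ClassAds, and the sample attribute values are invented.

```python
def periodic_release(ad, now):
    """((NumMatches < 10) && ((CurrentTime - EnteredCurrentStatus) > 600))"""
    return ad["NumMatches"] < 10 and (now - ad["EnteredCurrentStatus"]) > 600


def globus_resubmit(ad):
    """NumSystemHolds >= NumMatches"""
    return ad["NumSystemHolds"] >= ad["NumMatches"]


def rematch(ad):
    """Rematch = True"""
    return True


def evaluate_policy(ad, now):
    """Evaluate the three expressions; Rematch is only consulted
    when GlobusResubmit is true (None means 'not evaluated')."""
    resubmit = globus_resubmit(ad)
    return {
        "PeriodicRelease": periodic_release(ad, now),
        "GlobusResubmit": resubmit,
        "Rematch": rematch(ad) if resubmit else None,
    }


if __name__ == "__main__":
    # Invented sample values: held for 700 seconds, matched twice, held once.
    ad = {"NumMatches": 2, "NumSystemHolds": 1, "EnteredCurrentStatus": 1000}
    print(evaluate_policy(ad, now=1700))
```

Read this way, the example releases a held job once it has been held for more than ten minutes, as long as it has been matched fewer than ten times.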

  31. Hardening • Regular testing on the CMS testbed with real applications • Many bugs and integration issues found and fixed • Hostile Environment

  32. Hostile Environment • Full disks • Machine crashes • File server lock-ups • Network outages • Power outages

  33. One CMS Dataset Run • 300 jobs • Last fall • ~50 (16%) of the jobs stalled and required human recovery • Multiple service restarts (20 daemon crashes over 6 hours) • Now • 0 jobs stalled • 0 service restarts

  34. Integration Work • Dozens of Condor-G improvements and bug fixes • Over 40 Globus “bugzilla” incidents, many with patches • Globus 2.2.4 has 21 “Advisories” as of 4/11/04 • Use latest version of both

  35. Scalability • Submitting several hundred jobs produced high load on server • Machine became unresponsive • We saw a load average of 1000 at one point • Caused by Globus JobManager processes

  36. Grid Manager Monitor Agent • New tool Condor-G can use to reduce this load • Efficient job status polling program • Allows Condor-G to shut down JobManager processes when they’re not needed

  37. Load Reduced • 400 jobs (/bin/sleep 900) • Without Grid Monitor • 42 hours to complete • Peak load average of 610 • With Grid Monitor • 40 minutes • Peak load average of 104
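
The size of the improvement follows directly from the figures on this slide. The short calculation below uses only those four numbers; the helper names are arbitrary.

```python
def speedup(before_minutes, after_minutes):
    """How many times faster the run completed."""
    return before_minutes / after_minutes


def percent_reduction(before, after):
    """Relative drop from 'before' to 'after', in percent."""
    return 100.0 * (before - after) / before


# Figures from the slide: 400 jobs of /bin/sleep 900.
without_monitor_minutes = 42 * 60   # 42 hours
with_monitor_minutes = 40           # 40 minutes

time_speedup = speedup(without_monitor_minutes, with_monitor_minutes)
load_drop = percent_reduction(610, 104)  # peak load averages

if __name__ == "__main__":
    print(f"completion time: {time_speedup:.0f}x faster")
    print(f"peak load average: {load_drop:.0f}% lower")
```

That is a 63-fold reduction in completion time and a drop of roughly 83% in peak load average.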

  38. Miscellaneous Stuff • Email notification on job completion • Port range restrictions • Problem jobs put on hold

  39. In Development • Stuff we’re currently working on • Will be released sometime in the next year

  40. Job Policy Expressions • PeriodicHold • PeriodicRemove • OnExitHold • OnExitRemove

  41. Improved GlideIn • MDS use optional • User specifies necessary information • Automatic setup • GlideIn job transfers and installs binaries if needed • Binaries can come from submit machine

  42. New Job Types • Submit jobs directly to other schedulers (not through Globus) • Why? • Richer interface semantics • Not supported by Globus

  43. NorduGrid • Grid batch system designed by Nordic countries • Globus GRAM didn’t offer necessary semantics • Client control of file staging • Automatic cleanup of abandoned jobs

  44. Oracle • Oracle DBMS supports a job queue • Run this query in 5 hours • Run this query every Monday • Condor can add more management features

  45. Generic Job Interface • Re-arrange GridManager to allow easy addition of new job types • Define appropriate interface • Plug-ins for new job types?

  46. Globus Toolkit 3.0 • OGSA (Open Grid Services Architecture) • Submit jobs to GT3 sites • Grid Service client interface to Condor-G

  47. Miscellaneous • Condor-G for Windows • MyProxy credential management • URLs for executable, staged files

  48. Thank You! • Questions? • Also… • Condor-G & Globus Q/A session • Wednesday, 9am-12pm, room TBA • E-mail condor-admin@cs.wisc.edu
