Learn about the new features released in Condor-G, how it functions, and its benefits in managing jobs on the Grid. Understand the job management, fault tolerance, and credential management aspects of Condor-G. Explore its components and compatibility with the Globus Toolkit. Discover how Condor-G brings the full power of Condor to the Grid environment. Find out about job scheduling, resource utilization, and the easy integration of legacy applications with Globus. Stay informed about the latest improvements and bug fixes in Condor-G and Globus, ensuring efficient and secure job execution on the Grid.
Outline • What is Condor-G • Released New Features • In Development
What Is Condor-G • Use Condor to run jobs on the Grid • Uses Globus Toolkit • GRAM (submit a remote job) • GASS (transfer job’s files) • Two components • Globus Universe • GlideIn
Globus Universe • Run a job on a Grid resource • Features • Job management • Fault tolerance • Credential management • Roughly equivalent to the vanilla universe
How It Works (diagram) • Condor-G: Schedd (600 Globus jobs), GridManager • Grid Resource: JobManager, LSF, User Job
GlideIn • Run the Condor daemons on Grid resources as user jobs • Create your own personal Condor pool from temporarily-acquired Grid resources • Brings the full power of Condor to the Grid
GlideIn (diagram) • Condor-G: 600 Condor jobs • Globus Grid: LSF, PBS, Condor • glide-ins running on the Grid resources
Released New Features • Stuff we’ve added in the past year • Released and ready for use in Condor 6.6
Globus ASCII Helper Protocol (GAHP) • Encapsulates Globus libraries in separate process • Simple ASCII protocol • Easy for legacy applications to use Globus when they can’t link directly with the libraries
How It Works - GAHP (diagram) • Condor-G: Schedd, GridManager, GAHP Client, GAHP Server • Grid Resources: JobManager (three shown)
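The idea of a simple line-oriented ASCII helper can be sketched as follows. This is a toy illustration in Python, not the actual GAHP command set: the command names `VERSION` and `ECHO`, the `S`/`E` reply prefixes, and both function names are assumptions made up for this example. A real helper would run as a separate process and wrap the Globus libraries; here the "server" is an ordinary function so the sketch stays self-contained.

```python
def handle_line(line: str) -> str:
    """Toy helper side: parse one ASCII command line, return one ASCII reply line.

    Commands and reply prefixes are hypothetical, chosen only for illustration.
    """
    parts = line.strip().split(" ")
    command, args = parts[0].upper(), parts[1:]
    if command == "VERSION":
        return "S toy-helper 0.1"
    if command == "ECHO" and args:
        return "S " + " ".join(args)
    return "E unknown-or-malformed-command"


def request(command: str, *args: str) -> str:
    """Toy client side: format a request line, hand it to the helper, return the reply."""
    line = " ".join((command,) + args) + "\n"
    return handle_line(line)


if __name__ == "__main__":
    print(request("VERSION"))
    print(request("ECHO", "hello", "grid"))
    print(request("SUBMIT"))
```

Because both directions are plain text lines, a caller that cannot link against the underlying libraries only needs to be able to write and read strings.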
File Staging • Arbitrary input and output files can be staged to and from execution site • Same syntax as other universes • Limitation • Output files must be explicitly named
File Staging (cont) • Input, Output, and Error can be URLs • Files will be transferred directly to and from execution site • Output and Error can be staged or streamed
Credential Refresh • Renewed credentials are used by Condor-G and forwarded to the execution site automatically • No processes need to be restarted
Better Credential Management • One GridManager process can handle multiple credential files with same subject • More efficient when you want to have different credential lifetimes for different jobs
Grid Match-Making • Globus jobs matched with Globus resources by the Condor match-maker using ClassAds • Current limitation • User/admin must create resource ads
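The matching step can be sketched in Python. This is a minimal model under stated assumptions, not the Condor match-maker itself: ads are plain dicts, the requirements and rank expressions are ordinary Python callables rather than ClassAd expressions, and every attribute name and value below (`name`, `opsys`, `mflops`, the site names) is hypothetical.

```python
def match(job, resources):
    """Return the resource ad the job ranks highest among those where
    both sides' requirements hold, or None if nothing matches."""
    candidates = [
        r for r in resources
        if job["requirements"](r) and r["requirements"](job)
    ]
    if not candidates:
        return None
    return max(candidates, key=job["rank"])


# Hypothetical, hand-written resource ads (the slide notes that the
# user or admin currently has to create these).
resources = [
    {"name": "site-a", "opsys": "LINUX", "mflops": 300, "requirements": lambda job: True},
    {"name": "site-b", "opsys": "LINUX", "mflops": 450, "requirements": lambda job: True},
    {"name": "site-c", "opsys": "SOLARIS", "mflops": 900, "requirements": lambda job: True},
]

# Hypothetical job ad: needs Linux, prefers faster machines.
job = {
    "owner": "alice",
    "requirements": lambda r: r["opsys"] == "LINUX",
    "rank": lambda r: r["mflops"],
}

if __name__ == "__main__":
    best = match(job, resources)
    print(best["name"])  # site-b: fastest resource that satisfies the requirements
```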
Fault Tolerance • Condor-G does its best to automatically recover from failures • User can guide decisions with job policy expressions • PeriodicRelease • GlobusResubmit • Rematch
PeriodicRelease Expression • Condor-G puts problematic jobs on hold • This expression tells Condor-G when to release and retry such jobs
GlobusResubmit Expression • Tells Condor-G when a problematic job submission should be abandoned • When this expression becomes true • Best effort is made to clean up current job submission • New job submission is attempted
Rematch Expression • Tells Condor-G when a problematic resource should be abandoned • Evaluated when GlobusResubmit evaluates to true • When this expression becomes true • Best effort is made to clean up current job submission • Job is rematched
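The way the three expressions interact can be sketched as a small decision function. This is an illustrative Python model, not Condor-G's implementation: the function name, the action labels, and in particular the order in which GlobusResubmit and PeriodicRelease are checked are assumptions; the slides only state that Rematch is evaluated when GlobusResubmit evaluates to true.

```python
def next_action(on_hold, periodic_release, globus_resubmit, rematch):
    """Pick the next step for a problematic job from already-evaluated expressions.

    Assumption: GlobusResubmit is checked before PeriodicRelease.
    """
    if globus_resubmit:
        # The current submission is abandoned (after best-effort cleanup).
        # Rematch decides whether the resource is abandoned as well.
        return "rematch" if rematch else "resubmit"
    if on_hold and periodic_release:
        # The job is released from hold and retried.
        return "release"
    return "wait"


if __name__ == "__main__":
    print(next_action(on_hold=True, periodic_release=True, globus_resubmit=False, rematch=True))
    print(next_action(on_hold=True, periodic_release=False, globus_resubmit=True, rematch=False))
    print(next_action(on_hold=True, periodic_release=False, globus_resubmit=True, rematch=True))
```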
Job Ad Example
GlobusContactString = TARGET.gatekeeper_url
Requirements = TARGET.Arch == "LINUX" && TARGET.OpSys == "LINUX"
Rank = TARGET.Mflops
PeriodicRelease = ((NumMatches < 10) && ((CurrentTime-EnteredCurrentStatus) > 600))
GlobusResubmit = NumSystemHolds >= NumMatches
Rematch = True
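To see how the policy expressions in this ad behave, they can be hand-translated into Python. This is only a sketch: it is not a ClassAd evaluator, it ignores ClassAd handling of undefined attributes, and the attribute values in the sample ad are made up for illustration.

```python
def periodic_release(ad, now):
    """PeriodicRelease: fewer than 10 matches so far and more than 600 seconds in the current status."""
    return ad["NumMatches"] < 10 and (now - ad["EnteredCurrentStatus"]) > 600


def globus_resubmit(ad):
    """GlobusResubmit: at least as many system holds as matches."""
    return ad["NumSystemHolds"] >= ad["NumMatches"]


# Illustrative values only; times are in seconds.
ad = {"NumMatches": 2, "NumSystemHolds": 1, "EnteredCurrentStatus": 1000}

if __name__ == "__main__":
    print(periodic_release(ad, now=1500))  # False: only 500 seconds in the current status
    print(periodic_release(ad, now=1700))  # True: 700 seconds, and NumMatches is below 10
    print(globus_resubmit(ad))             # False: 1 system hold against 2 matches
```

Read this way, the example ad retries a held job after ten minutes, gives up on a submission once the system has held the job as many times as it has been matched, and always rematches when it gives up, since Rematch is True.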
Hardening • Regular testing on the CMS testbed with real applications • Many bugs and integration issues found and fixed • Hostile Environment
Hostile Environment • Full disks • Machine crashes • File server lock-ups • Network outages • Power outages
One CMS Dataset Run • 300 jobs • Last fall • ~50 (16%) of the jobs stalled and required human recovery • Multiple service restarts (20 daemon crashes over 6 hours) • Now • 0 jobs stalled • 0 service restarts
Integration Work • Dozens of Condor-G improvements and bug fixes • Over 40 Globus “bugzilla” incidents, many with patches • Globus 2.2.4 has 21 “Advisories” as of 4/11/04 • Use the latest versions of both
Scalability • Submitting several hundred jobs produced high load on server • Machine became unresponsive • We saw a load average of 1000 at one point • Caused by Globus JobManager processes
Grid Manager Monitor Agent • New tool Condor-G can use to reduce this load • Efficient job status polling program • Allows Condor-G to shut down JobManager processes when they’re not needed
Load Reduced • 400 jobs (/bin/sleep 900) • Without Grid Monitor • 42 hours to complete • Peak load average of 610 • With Grid Monitor • 40 minutes • Peak load average of 104
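The size of the improvement follows directly from the figures on this slide. A short Python calculation, using only those numbers:

```python
# Figures from the slide: 400 jobs of /bin/sleep 900.
without_monitor_minutes = 42 * 60   # 42 hours to complete
with_monitor_minutes = 40           # 40 minutes to complete

speedup = without_monitor_minutes / with_monitor_minutes
load_ratio = 610 / 104              # peak load average without vs. with the Grid Monitor

if __name__ == "__main__":
    print(f"{speedup:.0f}x faster, peak load reduced {load_ratio:.1f}x")
    # prints "63x faster, peak load reduced 5.9x"
```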
Miscellaneous Stuff • Email notification on job completion • Port range restrictions • Problem jobs put on hold
In Development • Stuff we’re currently working on • Will be released sometime in the next year
Job Policy Expressions • PeriodicHold • PeriodicRemove • OnExitHold • OnExitRemove
Improved GlideIn • MDS use optional • User specifies necessary information • Automatic setup • GlideIn job transfers and installs binaries if needed • Binaries can come from submit machine
New Job Types • Submit jobs directly to other schedulers (not through Globus) • Why? • Richer interface semantics • Not supported by Globus
NorduGrid • Grid batch system designed by Nordic countries • Globus GRAM didn’t offer necessary semantics • Client control of file staging • Automatic cleanup of abandoned jobs
Oracle • Oracle DBMS supports a job queue • Run this query in 5 hours • Run this query every Monday • Condor can add more management features
Generic Job Interface • Re-arrange GridManager to allow easy addition of new job types • Define appropriate interface • Plug-ins for new job types?
Globus Toolkit 3.0 • OGSA (Open Grid Services Architecture) • Submit jobs to GT3 sites • Grid Service client interface to Condor-G
Miscellaneous • Condor-G for Windows • MyProxy credential management • URLs for executable, staged files
Thank You! • Questions? • Also… • Condor-G & Globus Q/A session • Wednesday, 9am-12pm, room TBA • E-mail condor-admin@cs.wisc.edu