
Condor-G: A Computation Management Agent for Multi-Institutional Grids

James Frey, Todd Tannenbaum, Miron Livny, Ian Foster, Steven Tuecke


Presentation Transcript


  1. Condor-G: A Computation Management Agent for Multi-Institutional Grids
     James Frey, Todd Tannenbaum, Miron Livny, Ian Foster, Steven Tuecke

  2. Overview of Condor-G
     • Leverages:
       • Security, resource discovery, and resource access in multi-domain environments, provided by the Globus Toolkit
       • Management of computation and harnessing of resources within a single administrative domain, provided by Condor
     • Condor-G:
       • Allows users to harness multi-domain resources as if they all belong to one personal domain

  3. Challenges for Building and Managing a Multi-Site Computation
     • Different sites have different
       • Policies for security and resource usage
       • Schedulers
       • Hardware
       • Operating systems
       • File systems
     • User may have limited knowledge of resources at other sites
     • Failures can occur remotely
     • Monitoring is challenging

  4. Condor-G Approach
     • Separation of concerns among the following
       1. Remote Resource Access
          • Require remote resources to speak standard protocols for discovery and management
          • Use protocols defined by Globus Toolkit
       2. Computation Management
          • Introduce user computation management agent
          • Responsible for resource discovery, job submission, job management, error recovery
          • From the Condor system
       3. Remote Execution Environment
          • Use of mobile sandboxing technology to create tailored execution environment on remote node
          • From the Condor system

  5. Grid Protocols Used for Remote Resource Access
     • GSI (security)
     • GRAM (remote submission of computational requests)
       • Security
       • Two-phase commit (added by Condor team)
         • Provides exactly-once execution semantics
         • Client requests for resources include sequence numbers
         • Client receives response from resource and sends a commit message to indicate execution can begin
       • Fault tolerance (added by Condor team)
         • GRAM implementation stores information about active jobs in stable storage on the client side
         • Retrieves state if GRAM server crashes and restarts
     • MDS-2 (GRRP and GRIP protocols)
     • GASS (Global Access to Secondary Storage)
       • Obsolete, replaced by GridFTP
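The two-phase commit bullets above can be illustrated with a toy model. This is a minimal sketch, not the GRAM wire protocol: `ToyResource`, `request`, and `commit` are hypothetical names, and the assumption is only that the resource deduplicates requests by sequence number and starts a job once, on commit.

```python
from dataclasses import dataclass, field


@dataclass
class ToyResource:
    """Hypothetical stand-in for a GRAM-like server (not the real GRAM API)."""
    pending: dict = field(default_factory=dict)    # seq -> job awaiting commit
    committed: set = field(default_factory=set)    # sequence numbers already started
    started: list = field(default_factory=list)    # jobs that actually began running

    def request(self, seq, job):
        # A retransmitted request carries the same sequence number,
        # so it is recognized as a duplicate and creates no second job.
        if seq not in self.committed:
            self.pending.setdefault(seq, job)
        return seq

    def commit(self, seq):
        # Execution begins only on commit, and at most once per sequence number.
        if seq in self.committed:
            return True          # duplicate commit: job already started
        job = self.pending.pop(seq, None)
        if job is None:
            return False         # commit for a request the resource never saw
        self.committed.add(seq)
        self.started.append(job)
        return True


resource = ToyResource()
# The client retries the request, e.g. because the first reply was lost ...
resource.request(1, "simulate.sh")
resource.request(1, "simulate.sh")
# ... and retries the commit as well.
resource.commit(1)
resource.commit(1)
```

Despite the retries, `resource.started` holds the job exactly once, which is the property the slide calls exactly-once execution semantics.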

  6. Computation Management: The Condor-G Agent
     • User interface
       • API and command-line tools
         • Submit jobs
         • Query job status
         • Cancel job
         • Be informed of job termination or problems via callbacks
         • Get job logs

  7. Computation Management: The Condor-G Agent (Continued)
     • Supporting remote execution
       • Agent executes user computations on remote resources on the user’s behalf
       • Stages job’s standard I/O and executable using GASS (now GridFTP)
       • Submits job to remote machine using GRAM
       • Monitors job status and remote failures using GRAM
       • Authenticates all requests using GSI
       • Resubmits failed jobs
       • Communicates with user concerning errors
       • Records state about computation on stable storage to support restart in event of agent failure

  8. Condor-G Agent Implementation
     • Scheduler responds to user request
       • Creates new GridManager daemon to submit and manage jobs
       • One GridManager process handles all jobs for a single user
       • GridManager terminates when all of that user’s jobs are complete
     • Each GridManager job submission request results in creation of one Globus JobManager daemon
       • One Gatekeeper at remote site, one JobManager per job submission
       • JobManagers communicate with GridManager to transfer job executables and perform I/O
       • JobManager submits jobs to execution site’s local scheduler
       • Updates on job status are sent by JobManager to GridManager and then to Condor-G Scheduler
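The process structure on this slide (one GridManager per user, one JobManager per submission) can be sketched as a small in-memory model. The class and method names below are illustrative assumptions, not Condor-G's actual code; real daemons are separate processes that communicate over the network.

```python
import itertools

_jobmanager_ids = itertools.count(1)  # hypothetical id source for JobManagers


class GridManager:
    """Toy per-user manager; holds one JobManager handle per submitted job."""

    def __init__(self, user):
        self.user = user
        self.job_managers = {}  # job name -> JobManager id

    def submit(self, job):
        # Each submission results in exactly one remote JobManager.
        self.job_managers[job] = next(_jobmanager_ids)

    def finish(self, job):
        self.job_managers.pop(job, None)


class Scheduler:
    """Toy scheduler that creates GridManagers on demand."""

    def __init__(self):
        self.grid_managers = {}  # user -> GridManager

    def submit(self, user, job):
        # One GridManager handles all jobs for a single user.
        manager = self.grid_managers.setdefault(user, GridManager(user))
        manager.submit(job)

    def job_finished(self, user, job):
        manager = self.grid_managers[user]
        manager.finish(job)
        # The GridManager terminates once the user has no jobs left.
        if not manager.job_managers:
            del self.grid_managers[user]


scheduler = Scheduler()
scheduler.submit("alice", "a1")
scheduler.submit("alice", "a2")
scheduler.submit("bob", "b1")
scheduler.job_finished("bob", "b1")
```

After these calls only the GridManager for `alice` remains, holding two JobManager entries.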

  9. Failures Tolerated by Condor-G
     1. Crash of Globus JobManager
     2. Crash of machine that manages remote resource (i.e., hosts Gatekeeper and JobManager)
     3. Crash of machine on which GridManager is executing (or crash of GridManager itself)
     4. Failures in network connecting the two machines
     Failure detection
     • Detected by GridManager, which periodically probes all its JobManagers
     • If a JobManager doesn’t respond, GridManager probes the Gatekeeper
       • If Gatekeeper responds, then GridManager knows JobManager has crashed
       • If not, either remote resource has crashed or there is a network failure (the two can’t be distinguished)
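The failure-detection rule above is a two-step probe, which can be written as a short decision function. This is a sketch under stated assumptions: `probe_jobmanager` and `probe_gatekeeper` are hypothetical callables that return `True` when the remote daemon answers, and the `Diagnosis` names are invented for illustration.

```python
from enum import Enum


class Diagnosis(Enum):
    HEALTHY = "JobManager responded"
    JOBMANAGER_CRASHED = "Gatekeeper responded but JobManager did not"
    RESOURCE_OR_NETWORK_DOWN = "neither responded; cause cannot be distinguished"


def diagnose(probe_jobmanager, probe_gatekeeper):
    """Classify a failure the way the slide describes.

    Both arguments are zero-argument callables (assumed for this sketch)
    returning True if the probed daemon responded.
    """
    if probe_jobmanager():
        return Diagnosis.HEALTHY
    # JobManager is silent: ask the Gatekeeper on the same machine.
    if probe_gatekeeper():
        return Diagnosis.JOBMANAGER_CRASHED
    # No answer at all: remote crash and network failure look identical.
    return Diagnosis.RESOURCE_OR_NETWORK_DOWN


result = diagnose(lambda: False, lambda: True)
```

Here `result` is `Diagnosis.JOBMANAGER_CRASHED`, the case in which the GridManager would start a replacement JobManager.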

  10. Failure Recovery
      • If only JobManager crashed
        • GridManager attempts to start new JobManager to resume watching job
      • If no contact with remote machine, GridManager waits until it can reestablish contact
        • Then attempts to reconnect to JobManager
        • JobManager may have crashed or exited normally because job completed during a network failure
        • If it fails to connect to a running JobManager, it starts a new JobManager
        • New JobManager will resume watching job or may tell GridManager that the job completed successfully
      • To protect against local failure, state for submitted jobs is stored persistently in Condor-G scheduler’s job queue
        • GridManager can recover from local crash
        • Reconnects to any JobManagers running at time of crash
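The last point, keeping job state in a persistent queue so a restarted GridManager can reconnect, can be sketched with a file-backed dictionary. This is an illustrative model only: the JSON file, the `JobQueue` class, and the contact strings are assumptions, not the format Condor-G uses.

```python
import json
import os
import tempfile


class JobQueue:
    """Toy persistent job queue: job id -> JobManager contact string."""

    def __init__(self, path):
        self.path = path
        self.jobs = {}
        # On startup, reload whatever survived a previous crash.
        if os.path.exists(path):
            with open(path) as f:
                self.jobs = json.load(f)

    def record(self, job_id, contact):
        self.jobs[job_id] = contact
        self._flush()

    def remove(self, job_id):
        self.jobs.pop(job_id, None)
        self._flush()

    def _flush(self):
        # Write to a temporary file and rename it into place, so a crash
        # mid-write never leaves a half-written queue behind.
        tmp_path = self.path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(self.jobs, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, self.path)


queue_path = os.path.join(tempfile.mkdtemp(), "job_queue.json")
queue = JobQueue(queue_path)
queue.record("job-1", "jobmanager-contact-1")
queue.record("job-2", "jobmanager-contact-2")
queue.remove("job-2")

# Simulate a local crash and restart: a fresh object reads the same file.
recovered = JobQueue(queue_path)
```

The `recovered` queue lists the contacts a restarted GridManager would try to reconnect to.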

  11. Condor-G Agent and Credential Management
      • GSI proxy credential used by Condor-G agent to authenticate with remote resources on user’s behalf
        • Short-lived credential
      • Long-running Condor-G computations must deal with credential expiration
        • Condor-G agent periodically analyses credentials for all users with queued jobs
        • If expired or near expiration, agent places job in hold state and informs user
        • Need to forward refreshed credentials to any remote sites that are running computations
      • MyProxy: lets user store long-lived proxy credential on secure server
        • Remote services acting on behalf of user can obtain short-lived proxies from MyProxy
        • Condor-G refreshes credentials from MyProxy server
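The periodic credential check can be sketched as a pure function over the job queue. The one-hour `margin` for "near expiration", the function name, and the data layout are assumptions made for illustration; the slide does not give the actual threshold.

```python
from datetime import datetime, timedelta


def jobs_to_hold(queued_jobs, proxy_expiry, now, margin=timedelta(hours=1)):
    """Return ids of queued jobs whose owner's proxy is expired or near expiry.

    queued_jobs:  list of (job_id, user) pairs
    proxy_expiry: dict mapping user -> datetime at which the proxy expires
    margin:       hypothetical "near expiration" window (assumed, not from the slides)
    """
    return [
        job_id
        for job_id, user in queued_jobs
        if proxy_expiry[user] - now <= margin
    ]


now = datetime(2001, 8, 1, 12, 0)
proxy_expiry = {
    "alice": datetime(2001, 8, 1, 12, 30),  # expires in 30 minutes
    "bob": datetime(2001, 8, 1, 20, 0),     # expires in 8 hours
    "carol": datetime(2001, 8, 1, 11, 0),   # already expired
}
queued_jobs = [(1, "alice"), (2, "bob"), (3, "carol"), (4, "alice")]
held = jobs_to_hold(queued_jobs, proxy_expiry, now)
```

Jobs 1, 3, and 4 would be placed on hold here; in the flow the slide describes, the agent would then inform the user or refresh the proxy from MyProxy.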

  12. Resource Discovery and Scheduling
      • How does Condor-G agent determine where to execute user jobs?
        • Initial implementation: user-supplied list of GRAM servers
        • More sophisticated: resource broker that combines information about user authorization, application requirements, resource status (from MDS)
      • Condor Matchmaker (next week)
        • Describe resource capabilities and job requirements using “Classified Ads” (ClassAds)
        • Matchmaker finds compatible ClassAds
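The matchmaking idea, where a job and a resource each publish an ad and a match requires both sides to accept the other, can be sketched with plain dictionaries. This is not the ClassAd language: the attribute names and the use of Python lambdas for requirements are stand-ins chosen for illustration.

```python
def symmetric_match(job_ad, machine_ads):
    """Return the name of the first machine compatible with the job, or None.

    A match is symmetric: the job's requirements must accept the machine,
    and the machine's requirements must accept the job.
    """
    for machine_ad in machine_ads:
        if job_ad["requirements"](machine_ad) and machine_ad["requirements"](job_ad):
            return machine_ad["name"]
    return None


# Hypothetical resource ads; attribute names are illustrative only.
machines = [
    {"name": "site-a", "os": "LINUX", "memory_mb": 256,
     "requirements": lambda job: job["owner"] != "guest"},
    {"name": "site-b", "os": "LINUX", "memory_mb": 1024,
     "requirements": lambda job: True},
]

# Hypothetical job ad: needs Linux and at least 512 MB of memory.
job = {
    "owner": "alice",
    "requirements": lambda m: m["os"] == "LINUX" and m["memory_mb"] >= 512,
}

chosen = symmetric_match(job, machines)
```

`site-a` is rejected by the job (too little memory), so the job lands on `site-b`.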

  13. Glide-In Mechanism
      • What happens when a job executes on a remote platform where
        • Required files are not available?
        • Local policy may not permit access to local file system?
        • Local policy may impose restrictions on running time of a job?
      • Mobile sandboxing
        • On the remote computer, start not a user job but a daemon process that does the following:
          • Advertises its availability to Condor Collector, which provides info about available resources to scheduler
          • Matches locally queued jobs with resources advertised by daemons and remotely executes them
          • Runs each user task in a “sandbox”, using system call trapping technologies to redirect system calls issued by the task back to originating system (increases portability and protects local system)

  14. Glide-In Mechanism (cont.)
      • Periodically checkpoints job to another location
      • Migrates the job to another location if requested to do so
      • These functions are the same as those of any computer participating in a Condor pool
      • In Condor-G, these daemon processes are started by GRAM rather than by the user
      • Condor-G Glide-in uses Grid protocols to dynamically create a personal Condor pool out of Grid resources by “gliding in” Condor daemons to the remote resources
      • Implementation:
        • Initial GlideIn executable is a portable shell script
        • Uses GridFTP to retrieve Condor executables from a central repository, so individual users don’t store binaries for all potential architectures
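The glide-in flow on the two slides above (daemons advertise themselves to a collector, and locally queued jobs are then dispatched to them) can be sketched as follows. `Collector`, `advertise`, and `dispatch` are hypothetical names for this toy model; it ignores sandboxing, checkpointing, and the GRAM step that launches the daemons.

```python
from collections import deque


class Collector:
    """Toy collector: remembers which glided-in daemons are available."""

    def __init__(self):
        self.available = deque()

    def advertise(self, daemon_name):
        # A daemon started on a remote resource announces its availability.
        self.available.append(daemon_name)


def dispatch(queued_jobs, collector):
    """Pair locally queued jobs with advertised daemons, first come first served."""
    assignments = {}
    while queued_jobs and collector.available:
        job = queued_jobs.popleft()
        assignments[job] = collector.available.popleft()
    return assignments


collector = Collector()
collector.advertise("glidein@site-a")
collector.advertise("glidein@site-b")

queued_jobs = deque(["job-1", "job-2", "job-3"])
assignments = dispatch(queued_jobs, collector)
```

Two jobs are assigned and `job-3` stays queued until another daemon advertises, which mirrors how the personal Condor pool grows as more glide-ins start.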
