Condor-G: A Computation Management Agent for Multi-Institutional Grids

James Frey, Todd Tannenbaum, Miron Livny, Ian Foster, Steven Tuecke

Overview of Condor-G
  • Leverages:
    • Security, resource discovery, resource access in multi-domain environments provided by Globus Toolkit
    • Management of computation, harnessing of resources within a single administrative domain provided by Condor
  • Condor-G:
    • Allows users to harness multi-domain resources as if they belong to one personal domain


Challenges for Building and Managing a Multi-Site Computation
  • Different sites have different
    • Policies for security and resource usage
    • Schedulers
    • Hardware
    • Operating systems
    • File systems
  • User may have limited knowledge of resources at other sites
  • Failures can occur remotely
  • Monitoring is challenging


Condor-G Approach
  • Separation of concerns among the following

1. Remote Resource Access

    • Require remote resources to speak standard protocols for discovery and management
    • Use protocols defined by Globus Toolkit

2. Computation Management

    • Introduce user computation management agent
    • Responsible for resource discovery, job submission, job management, error recovery
    • From the Condor system

3. Remote Execution Environment

    • Use of mobile sandboxing technology to create tailored execution environment on remote node
    • From the Condor system


Grid Protocols Used for Remote Resource Access
  • GSI (Security)
  • GRAM (remote submission of computational requests)
    • Security
    • Two-phase commit (added by Condor team)
      • provides exactly-once execution semantics
      • client requests for resources include sequence numbers
      • client receives response from resource and sends a commit message to indicate execution can begin
    • Fault tolerance (added by Condor team)
      • GRAM implementation stores information about active jobs in stable storage on the client side
      • retrieves state if GRAM server crashes and restarts
  • MDS-2 (GRRP and GRIP protocols)
  • GASS (Global Access to Secondary Storage)
    • Obsolete, replaced by GridFTP
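
The two-phase commit described above can be illustrated with a toy resource-side handler. This is a minimal sketch, not the GRAM implementation: the class `GramServerSketch` and its methods `request` and `commit` are hypothetical names, and the job is just a string. It shows how sequence numbers let the server ignore retransmitted requests, and how holding execution until the commit message arrives gives exactly-once start semantics.

```python
class GramServerSketch:
    """Toy resource-side handler (hypothetical; not the real GRAM code)."""

    def __init__(self):
        self.pending = {}     # seq -> job description awaiting commit
        self.started = {}     # seq -> job id of a job that has begun
        self.next_job_id = 0

    def request(self, seq, job):
        # A retransmitted request with a known sequence number is
        # acknowledged again but never recorded a second time.
        if seq not in self.pending and seq not in self.started:
            self.pending[seq] = job
        return ("ack", seq)

    def commit(self, seq):
        # Execution starts only on commit, and at most once per seq.
        if seq in self.started:
            return self.started[seq]
        if seq not in self.pending:
            raise KeyError(f"no request with sequence number {seq}")
        self.pending.pop(seq)
        self.next_job_id += 1
        self.started[seq] = self.next_job_id
        return self.started[seq]


server = GramServerSketch()
server.request(7, "run simulate.exe")
server.request(7, "run simulate.exe")   # client retries after a lost response
first = server.commit(7)
second = server.commit(7)               # duplicate commit returns the same job
```

Because both the request and the commit are idempotent for a given sequence number, the client can safely retry either message after a timeout.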


Computation Management: The Condor-G Agent
  • User interface
    • API and command line tool
    • Submit jobs
    • Query job status
    • Cancel job
    • Be informed of job termination or problems via callbacks
    • Get job logs


Computation Management: The Condor-G Agent (Continued)
  • Supporting remote execution
  • Agent executes user computations on remote resources on the user’s behalf
    • Stages job’s standard I/O and executable using GASS (now GridFTP)
    • Submits job to remote machine using GRAM
    • Monitors job status and remote failures using GRAM
    • Authenticates all requests using GSI
    • Resubmits failed jobs
    • Communicates with user concerning errors
    • Records state about computation on stable storage to support restart in event of agent failure
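
Recording state on stable storage so that a restarted agent can pick up where it left off might look like the following. This is a sketch under stated assumptions: the JSON file format, the file name `agent_state.json`, and the helpers `save_state` and `load_state` are all hypothetical and are not how Condor-G stores its state. The point is the pattern of writing to a temporary file and renaming it, so a crash mid-write does not corrupt the saved state.

```python
import json
import os
import tempfile


def save_state(path, jobs):
    """Write the job table to a temp file, then swap it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(jobs, f)
        f.flush()
        os.fsync(f.fileno())   # push the data to disk before renaming
    os.replace(tmp, path)      # readers see the old file or the new one


def load_state(path):
    """Return the saved job table, or an empty one on first start."""
    if not os.path.exists(path):
        return {}
    with open(path) as f:
        return json.load(f)


state_dir = tempfile.mkdtemp()
state_file = os.path.join(state_dir, "agent_state.json")
jobs = {"job-1": {"status": "submitted", "contact": "site-a"}}
save_state(state_file, jobs)
restored = load_state(state_file)   # what a restarted agent would read
```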


Condor-G Agent Implementation
  • Scheduler responds to user request
  • Creates new GridManager daemon to submit and manage jobs
    • One process handles all jobs for a single user
    • Terminates when all jobs are complete
  • Each GridManager job submission request results in the creation of one Globus JobManager daemon
    • one Gatekeeper at the remote site, one JobManager per job submission
    • JobManagers communicate with the GridManager to transfer job executables and perform I/O
  • The JobManager submits jobs to the execution site’s local scheduler
  • Updates on job status are sent by the JobManager to the GridManager and then to the Condor-G Scheduler


Failures Tolerated by Condor-G

1. Crash of Globus JobManager

2. Crash of machine that manages remote resource (i.e., hosts Gatekeeper and JobManager)

3. Crash of machine on which GridManager is executing (or crash of GridManager)

4. Failures in network connecting two machines

Failure detection

  • Detected by GridManager, which periodically probes all its JobManagers
  • If a JobManager doesn’t respond, the GridManager probes the Gatekeeper
    • if Gatekeeper responds, then knows JobManager has crashed
    • if not, knows remote resource has crashed OR there is a network failure (can’t distinguish)
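
The detection logic above reduces to two probes in sequence. A minimal sketch, assuming each probe is a callable that returns `True` when the remote component answers; the names `Diagnosis`, `diagnose`, `probe_jobmanager`, and `probe_gatekeeper` are hypothetical and introduced only for illustration:

```python
from enum import Enum


class Diagnosis(Enum):
    HEALTHY = "JobManager is responding"
    JOBMANAGER_CRASHED = "JobManager crashed"
    REMOTE_OR_NETWORK = "remote resource crashed or network failed"


def diagnose(probe_jobmanager, probe_gatekeeper):
    """Classify a failure the way the GridManager is described as doing."""
    if probe_jobmanager():
        return Diagnosis.HEALTHY
    # The JobManager is silent; ask the Gatekeeper on the same machine.
    if probe_gatekeeper():
        return Diagnosis.JOBMANAGER_CRASHED
    # Neither answered: a machine crash and a network failure look
    # identical from the submitting side.
    return Diagnosis.REMOTE_OR_NETWORK


result = diagnose(lambda: False, lambda: True)
```

The Gatekeeper is only probed when the JobManager is silent, which matches the order given on the slide.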


Failure Recovery
  • If only JobManager crashed
    • GridManager attempts to start new JobManager to resume watching job
  • If no contact with remote machine, GridManager waits until it can reestablish contact
    • Then attempts to reconnect to JobManager
      • JobManager may have crashed or exited normally because job completed during a network failure
    • If it fails to connect to a running JobManager, it starts a new JobManager
      • JobManager will resume watching job or may tell GridManager that the job completed successfully
  • To protect against local failure, state for submitted jobs is stored persistently in Condor-G scheduler’s job queue
    • GridManager can recover from local crash
    • Reconnect to any JobManagers running at time of crash
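
The recovery steps for a lost remote machine can be sketched as a wait-then-reconnect loop. This is an illustration only: `recover` and its callable parameters `contact_ok`, `reconnect`, and `start_new_jobmanager` are hypothetical stand-ins for the real probes and GRAM calls, and the polling interval is an assumed parameter.

```python
import time


def recover(contact_ok, reconnect, start_new_jobmanager,
            poll_seconds=30.0, sleep=time.sleep):
    """Wait for the remote machine, then reattach to or restart a JobManager."""
    # Block until contact with the remote machine is reestablished.
    while not contact_ok():
        sleep(poll_seconds)
    # The old JobManager may still be running and watching the job.
    if reconnect():
        return "reconnected"
    # It crashed, or exited because the job finished during the outage;
    # a fresh JobManager resumes watching or reports completion.
    start_new_jobmanager()
    return "restarted"


answers = iter([False, False, True])    # two failed probes, then contact
started = []
outcome = recover(
    contact_ok=lambda: next(answers),
    reconnect=lambda: False,
    start_new_jobmanager=lambda: started.append("jobmanager"),
    sleep=lambda seconds: None,         # skip real waiting in this example
)
```

When only the JobManager has crashed, `contact_ok` succeeds immediately and the function goes straight to starting a replacement.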


Condor-G Agent and Credential Management
  • GSI Proxy Credential used by Condor-G agent to authenticate with remote resources on user’s behalf
  • Short-lived credential
  • Long-running Condor-G computations must deal with credential expiration
  • Condor-G agent periodically analyses credentials for all users with queued jobs
  • If expired or near expiration, agent places job in hold state and informs user
  • Need to forward refreshed credentials to any remote sites that are running computations
  • MyProxy: lets user store long-lived proxy credential on secure server
    • Remote services acting on behalf of user can obtain short-lived proxies from MyProxy
    • Condor-G refreshes credentials from MyProxy server
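
The periodic credential check can be sketched as a scan over queued jobs. This is a minimal illustration, assuming the agent already knows each user's proxy expiry time; the function `jobs_to_hold`, the dictionaries it takes, and the one-hour `margin` for "near expiration" are hypothetical choices, not values from Condor-G.

```python
from datetime import datetime, timedelta, timezone


def jobs_to_hold(jobs, proxy_expiry, now, margin=timedelta(hours=1)):
    """Return ids of queued jobs whose owner's proxy is expired or close to it."""
    held = []
    for job_id, owner in jobs.items():
        # A negative remaining lifetime means the proxy already expired.
        if proxy_expiry[owner] - now <= margin:
            held.append(job_id)
    return held


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
proxy_expiry = {
    "alice": now + timedelta(hours=10),    # plenty of lifetime left
    "bob": now + timedelta(minutes=20),    # near expiration
    "carol": now - timedelta(hours=2),     # already expired
}
queued = {"job-1": "alice", "job-2": "bob", "job-3": "carol"}
held = jobs_to_hold(queued, proxy_expiry, now)
```

In the flow described above, the jobs returned here would be placed on hold and their owners notified, or the agent would fetch a fresh short-lived proxy from MyProxy instead.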


Resource Discovery and Scheduling
  • How does Condor-G agent determine where to execute user jobs?
  • Initial implementation: user-supplied list of GRAM servers
  • More sophisticated: resource broker that combines information about user authorization, application requirements, resource status (from MDS)
  • Condor Matchmaker (next week)
    • Describe resource capabilities and job requirements using “Classified Ads”
    • Matchmaker finds compatible ClassAds
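
The matchmaking idea can be illustrated with plain dictionaries. This is a deliberately simplified sketch: real ClassAds use their own expression language, whereas here each ad's `requirements` is a Python callable, and the names `matches` and `find_matches` and the attributes `arch`, `memory_mb`, and `owner` are assumptions made for the example. A job and a machine are compatible when each side's requirements accept the other's ad.

```python
def matches(job_ad, machine_ad):
    """Both sides must accept each other for the ads to be compatible."""
    return (job_ad["requirements"](machine_ad)
            and machine_ad["requirements"](job_ad))


def find_matches(job_ad, machine_ads):
    """Return the names of all machines compatible with the job."""
    return [m["name"] for m in machine_ads if matches(job_ad, m)]


job = {
    "owner": "alice",
    "requirements": lambda m: m["arch"] == "x86_64" and m["memory_mb"] >= 2048,
}
machines = [
    {"name": "site-a", "arch": "x86_64", "memory_mb": 4096,
     "requirements": lambda j: True},
    {"name": "site-b", "arch": "x86_64", "memory_mb": 1024,   # too little memory
     "requirements": lambda j: True},
    {"name": "site-c", "arch": "x86_64", "memory_mb": 8192,   # refuses this owner
     "requirements": lambda j: j["owner"] != "alice"},
]
compatible = find_matches(job, machines)
```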


Glide-In Mechanism
  • What happens when a job executes on a remote platform where
    • required files are not available
    • local policy may not permit access to local file system
    • local policy may impose restrictions on running time of a job
  • Mobile sandboxing
    • Start a daemon process, rather than a user job, on the remote computer. The daemon does the following:
    • Advertises its availability to the Condor Collector, which provides information about available resources to the scheduler
    • Lets Condor-G match locally queued jobs with the resources advertised by these daemons and execute them remotely
    • Runs each user task in a “sandbox”, using system call trapping technologies to redirect system calls issued by the task back to the originating system (increases portability and protects the local system)


Glide-In Mechanism (cont.)
  • Periodically checkpoints job to another location
  • Migrates the job to another location if requested to do so
  • These functions are the same as those performed by any computer participating in a Condor pool
  • In Condor-G, these daemon processes are started by GRAM rather than by the user
  • Condor-G Glide-in uses Grid protocols to dynamically create a personal Condor pool out of Grid resources by “gliding in” Condor daemons to the remote resources
  • Implementation:
    • initial GlideIn executable is a portable shell script
    • uses GridFTP to retrieve Condor executables from a central repository, so individual users don’t store binaries for all potential architectures