Condor-G: A Computation Management Agent for Multi-Institutional Grids

James Frey, Todd Tannenbaum, Miron Livny, Ian Foster, Steven Tuecke. Reporter: Fu-Jiun Lu, 2009/12/17.

Presentation Transcript


  1. Condor-G: A Computation Management Agent for Multi-Institutional Grids James Frey, Todd Tannenbaum, Miron Livny, Ian Foster, Steven Tuecke Reporter: Fu-Jiun Lu 2009/12/17

  2. Abstract We present the Condor-G system, which leverages software from Globus and Condor to allow users to harness multi-domain resources as if they all belong to one personal domain. We describe the structure of Condor-G and how it handles job management, resource selection, security, and fault tolerance.

  3. Outline • Introduction • Large-scale sharing of computational resources • Grid protocol overview • GSI, GRAM, MDS, GASS • Computation management: Condor-G agent • "GlideIn" mechanism • Conclusion

  4. Introduction • Users want to be able to discover, acquire, and reliably manage computational resources dynamically, in the course of their everyday activities. • They do not want to be bothered with the location of these resources, the mechanisms required to use them, keeping track of the status of computational tasks running on these resources, or reacting to failure. • They do care about how long their tasks are likely to run and how much these tasks will cost.

  5. Introduction (Cont.) • We combine the inter-domain resource management protocols of the Globus Toolkit and the intra-domain resource management methods of Condor to allow the user to harness multi-domain resources as if they all belong to one personal domain.

  6. Large-scale sharing of computational resources • Different sites may feature different authentication and authorization mechanisms, schedulers, operating systems, etc. • The user has little knowledge of the characteristics of resources at remote sites, and no easy means of obtaining this information. • Due to the distributed nature of the multi-site computing environment, computers, networks, and subcomputations can fail in various ways. • Keeping track of the status of different elements of a computation involves tedious bookkeeping, especially in the presence of failures and dependencies among subcomputations.

  7. Large-scale sharing of computational resources (Cont.) • Remote resource access • Computation management • Remote execution environment

  8. Grid security infrastructure (GSI) • This authentication and authorization system makes it possible to authenticate a user just once, using public key infrastructure (PKI) mechanisms to verify a user-supplied “Grid credential.”

  9. GRAM protocol and implementation • GSI security mechanisms: used in all operations to authenticate the requestor and for authorization. • Two-phase commit: each request from a client is accompanied by a unique sequence number, which is also included in the associated response. • Resource-side fault tolerance: a resource may often contain multiple processors, with specialized "interface" machines running the GRAM server.
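The role of the sequence number in the two-phase commit can be illustrated with a small sketch. This is not the GRAM wire protocol or the Globus API; `ToyJobServer`, `submit`, and `commit` are hypothetical names, and the sketch assumes that a client that misses a response simply resends the request with the same sequence number.

```python
from dataclasses import dataclass, field


@dataclass
class ToyJobServer:
    """Hypothetical stand-in for a GRAM-like server; not the Globus API."""
    pending: dict = field(default_factory=dict)   # seq -> job awaiting commit
    started: list = field(default_factory=list)   # seq numbers of started jobs

    def submit(self, seq: int, job: str) -> int:
        # Phase 1: record the request. A repeated sequence number is treated
        # as a retransmission, so no second job is created.
        if seq not in self.pending and seq not in self.started:
            self.pending[seq] = job
        return seq  # the response echoes the sequence number

    def commit(self, seq: int) -> bool:
        # Phase 2: start the job only once the client confirms the response.
        if seq in self.pending:
            del self.pending[seq]
            self.started.append(seq)
        return seq in self.started


server = ToyJobServer()
server.submit(1, "sim.exe")   # assume this response is lost in transit
server.submit(1, "sim.exe")   # client retries with the same sequence number
server.commit(1)
server.commit(1)              # a repeated commit is also harmless
print(server.started)         # prints [1]
```

Because both phases are keyed on the sequence number, resending either message after a lost reply leaves exactly one started job in this toy model.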

  10. MDS protocols and implementation • GRRP (Grid Resource Registration Protocol): a resource uses GRRP to notify other entities that it is part of the Grid. • GRIP (Grid Resource Information Protocol): those entities can then use GRIP to obtain information about resource status.
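The split between registration and inquiry can be pictured with a toy directory. This is only an illustration of the two roles, not the MDS implementation; `ToyDirectory`, `register`, `inquire`, and the host name are hypothetical.

```python
from typing import Callable, Dict, List


class ToyDirectory:
    """Hypothetical index service; names are illustrative, not the MDS API."""

    def __init__(self) -> None:
        self._providers: Dict[str, Callable[[], dict]] = {}

    def register(self, resource: str, status_provider: Callable[[], dict]) -> None:
        # GRRP-like step: the resource announces that it is part of the Grid.
        self._providers[resource] = status_provider

    def known_resources(self) -> List[str]:
        return sorted(self._providers)

    def inquire(self, resource: str) -> dict:
        # GRIP-like step: ask a registered resource for its current status.
        return self._providers[resource]()


directory = ToyDirectory()
directory.register("cluster-a.example.org", lambda: {"free_cpus": 12})
status = directory.inquire("cluster-a.example.org")
```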

  11. GASS • The GASS service provides mechanisms for transferring data to and from a remote HTTP, FTP, or GASS server.

  12. Computation management: The Condor-G agent • User interface: the user can • submit jobs, indicating an executable name, input/output files, and arguments; • query a job's status, or cancel the job; • be informed of job termination or problems, via callbacks or asynchronous mechanisms such as email; • obtain access to detailed logs, providing a complete history of their jobs' execution.

  13. Condor-G

  14. Supporting remote execution • Stages a job's standard I/O and executable using GASS. • Submits a job to a remote machine using the revised GRAM job request protocol. • Subsequently monitors job status and recovers from remote failures using the revised GRAM protocol and GRAM callbacks and status calls. • Authenticates all requests via GSI mechanisms.

  15. Condor-G is built to tolerate four types of failure • Crash of the Globus JobManager • Crash of the machine that manages the remote resource • Crash of the machine on which the GridManager is executing • Failures in the network connecting the two machines.
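One way to tell the remote failure cases apart is by probing, sketched below. The sketch assumes the GridManager first probes the JobManager and, if that fails, probes the GRAM server on the same machine; the slide itself does not spell out the detection procedure, and `Diagnosis` and `diagnose` are hypothetical names. A crash of the machine running the GridManager is a local failure and is outside this sketch.

```python
from enum import Enum


class Diagnosis(Enum):
    HEALTHY = "job manager is responding"
    JOBMANAGER_CRASHED = "start a new job manager"
    RESOURCE_UNREACHABLE = "machine crash or network failure: wait and retry"


def diagnose(jobmanager_responds: bool, gram_server_responds: bool) -> Diagnosis:
    """Classify a remote failure from two probe results (illustrative only)."""
    if jobmanager_responds:
        return Diagnosis.HEALTHY
    if gram_server_responds:
        # The machine is up and reachable, so only the JobManager died.
        return Diagnosis.JOBMANAGER_CRASHED
    # A crashed machine and a broken network look the same from here.
    return Diagnosis.RESOURCE_UNREACHABLE
```

In this model a machine crash and a network failure cannot be distinguished from the submitting side, so both map to the same retry action.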

  16. Credential management • Condor-G periodically analyzes the credentials for all users with currently queued jobs. • To reduce user hassle in dealing with expired credentials, Condor-G could be enhanced to work with a system like "MyProxy."
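The periodic credential check can be sketched as a scan over users who still have queued jobs. This is a minimal illustration, not Condor-G code; `users_to_notify`, the one-hour warning window, and the user names are assumptions made for the example.

```python
from datetime import datetime, timedelta


def users_to_notify(expiry_by_user, users_with_queued_jobs, now,
                    warn_window=timedelta(hours=1)):
    """Return users whose credential has expired or expires within warn_window."""
    flagged = []
    for user in sorted(users_with_queued_jobs):
        remaining = expiry_by_user[user] - now
        if remaining <= warn_window:
            flagged.append(user)
    return flagged


now = datetime(2009, 12, 17, 12, 0)
expiry = {
    "alice": now + timedelta(hours=8),     # plenty of time left
    "bob": now + timedelta(minutes=20),    # about to expire
    "carol": now - timedelta(minutes=5),   # already expired
    "dave": now - timedelta(days=1),       # expired, but has no queued jobs
}
flagged = users_to_notify(expiry, {"alice", "bob", "carol"}, now)
print(flagged)  # prints ['bob', 'carol']
```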

  17. Resource discovery and scheduling • Available resources can be ranked by user preferences such as allocation cost and expected start or completion time. • One promising approach to constructing such a resource broker is to use the Condor Matchmaking framework to implement the brokering algorithm.
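Ranking by user preference can be sketched as a filter followed by a sort. This is plain Python, not the Condor Matchmaking framework or its ClassAd language; the field names, the cost ceiling, and the choice to prefer earlier completion before lower cost are all assumptions for illustration.

```python
def rank_resources(resources, max_cost):
    """Keep resources within budget, then order by completion time, then cost."""
    eligible = [r for r in resources if r["cost"] <= max_cost]
    return sorted(eligible, key=lambda r: (r["expected_completion_min"], r["cost"]))


candidates = [
    {"name": "site-a", "cost": 5.0, "expected_completion_min": 90},
    {"name": "site-b", "cost": 2.0, "expected_completion_min": 120},
    {"name": "site-c", "cost": 9.0, "expected_completion_min": 30},
]
ranked = [r["name"] for r in rank_resources(candidates, max_cost=6.0)]
print(ranked)  # prints ['site-a', 'site-b']
```

Here site-c would finish first but is excluded by the cost limit, which shows how a hard requirement and a soft preference interact.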

  18. GlideIn mechanism • Condor-G uses standard Condor mechanisms to match locally queued jobs with the resources advertised. • It runs each user task received in a "sandbox," using system call trapping technologies provided by Condor. • It periodically checkpoints the job to another location, which allows the job to be migrated to another location.

  19. Conclusion • Condor-G mechanisms complement this work by addressing issues of uniform remote access, failure, credential expiry, etc. Condor-G could potentially be used as a backend for an application-level scheduling system.
