
  1. Condor: High Throughput Computing System. Sean Blackbourn, Teresia Djunaedi

  2. Outline • What is Condor? • Fault Tolerance (MW & DAGMan) • Resource Discovery (Matchmaking) • Job Deployment (Universe) • Communication (Remote System Calls) • Applications • Contributions • Critique

  3. What is Condor? • High Throughput Computing • “Deliver large amounts of processing capacity over long periods of time.” (HTCondor) • Developed at the University of Wisconsin around 1983. • Goal: utilize as many idle resources as possible in order to increase overall throughput. • Renamed to HTCondor in 2012.

  4. Condor Philosophy • Let communities grow naturally. • Plan accordingly, but don’t be overly concerned about choosing a perfect match. • Let the owner retain control. • Lend expertise to the research community while integrating knowledge from other sources. • Build on top of previous research.

  5. Core Condor Components • Matchmaker (central manager) • Problem Solver (Master-Worker, DAGMan) • Agent (schedd), which queues the User's Jobs • Resource (startd) • Shadow (shadow) on the submit side and Sandbox (starter) on the execute side, which together run the Job

  6. Master-Worker • The master manages a set of user-defined tasks and a pool of workers, and matches tasks to workers. • Workers can leave at any time during the computation. • Machines can arrive at any time and suspend/resume the computation. • The state of the computation is checkpointed at a user-defined frequency.

  7. DAGMan (Directed Acyclic Graph Manager) • A meta-scheduler for Condor jobs. • A node's jobs do not start until all of its parents have finished. • Each node requires its own HTCondor submit description file. • Responsible for scheduling, recovery, and reporting; see the sketch below.
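
A DAG is described in a plain-text file that points each node at its own submit description file and lists the parent/child ordering. A minimal, hypothetical sketch (file and node names are illustrative, not from the slides):

    # diamond.dag -- hypothetical DAGMan input file
    JOB A a.sub
    JOB B b.sub
    JOB C c.sub
    JOB D d.sub
    # B and C wait for A; D waits for both B and C
    PARENT A CHILD B C
    PARENT B C CHILD D
    # recovery: resubmit node D up to two times if it fails
    RETRY D 2

Submitted with condor_submit_dag diamond.dag, DAGMan itself runs as a Condor job and resubmits or resumes nodes according to its recovery rules.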

  8. Resource Discovery • Jobs are submitted to an Agent (schedd), which is responsible for remembering jobs and enforcing the user's policies. • The Agent must find a Resource (startd) capable of executing the job; the Resource enforces the machine owner's (execution-site) policies. • Agents and Resources are matched by a Matchmaker, which enforces community policies.
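
In practice a job reaches the agent as a submit description file handed to condor_submit. A minimal, hypothetical example (executable name, resource limits, and rank expression are illustrative):

    # job.sub -- hypothetical submit description file
    universe     = vanilla
    executable   = analyze
    arguments    = input.dat
    output       = analyze.out
    error        = analyze.err
    log          = analyze.log
    # becomes the Requirements/Rank attributes of the job's ClassAd
    requirements = (OpSys == "LINUX") && (Memory >= 1024)
    rank         = Mips
    queue 1

Running condor_submit job.sub places the job in the schedd's queue, and the matchmaking cycle described on the next slide takes over.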

  9. Matchmaking • Step 1: Agents and resources advertise themselves to the matchmaker. • Step 2: The matchmaker finds potential matches and informs the respective candidates. • Step 3: The agent and resource contact each other to confirm the match.
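
The advertised ClassAds can be inspected from the command line (the machine name below is illustrative):

    condor_status               # summary of machine (resource) ads known to the matchmaker
    condor_status -long node01  # the full ClassAd of one machine
    condor_q                    # job ads queued at the local agent (schedd)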

  10. ClassAds • Agents and resources advertise themselves using schema-free classified advertisements (ClassAds). • ClassAd attributes use a three-valued logic, in which expressions may evaluate to true, false, or undefined. • The matchmaking algorithm gives special weight to two attributes: • Requirements - the conditions for an acceptable match. • Rank - an arbitrary number used to choose among the acceptable matches.

  11. ClassAd Example
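
The example on the original slide was an image; the following is a hedged reconstruction of what such a pair of ClassAds typically looks like (owner, host name, and numeric values are illustrative):

    # Job ClassAd, supplied by the agent
    MyType       = "Job"
    Owner        = "user1"
    Cmd          = "analyze"
    Requirements = (Arch == "X86_64") && (OpSys == "LINUX") && (Memory >= 1024)
    Rank         = KFlops

    # Machine ClassAd, advertised by the resource
    MyType       = "Machine"
    Name         = "slot1@node01.example.edu"
    Arch         = "X86_64"
    OpSys        = "LINUX"
    Memory       = 4096
    Requirements = (LoadAvg < 0.3)
    Rank         = 0

A match requires each ad's Requirements expression to evaluate to true against the other ad; Rank is then used to order the acceptable candidates.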

  12. Gateway Flocking • Retain existing community policies enforced by established matchmakers • Not necessarily bidirectional • Transparent to participants - allow cross-pool matches between adjacent pools • Prevents a user from joining multiple communities • Complex

  13. Direct Flocking • Jobs are not tied to a single community; they may execute wherever resources are available. • An agent may report itself to multiple matchmakers. • Only benefits the users who take the initiative. • Easier for users to understand and deploy; see the configuration sketch below.
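
In current HTCondor, direct flocking is driven by configuration on the submitting side and in the pool being flocked to; a hedged sketch (host names are illustrative, and the exact settings should be checked against the HTCondor manual):

    # condor_config on the submit machine: central managers of other pools to try
    FLOCK_TO   = cm.otherpool.example.edu

    # condor_config in the other pool: submit machines allowed to flock in
    FLOCK_FROM = submit.ourlab.example.edu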

  14. Gliding • Allows a user to build a personal Condor pool on top of remote resources. • Condor daemons are submitted to those resources via the Globus GRAM protocol; once running, the resources join the user's pool.

  15. Job Deployment Once the agent and resource have agreed on a match, two major components are needed: • Shadow - Represents the user; supplies the resource with everything it needs to run the job (executable, arguments, input data). • Sandbox - Gives the job a safe execution environment with the resources it needs, and protects the machine from a malicious or malfunctioning job.

  16. Split Execution • A matched shadow and sandbox together form a universe (e.g., the standard universe). • I/O is handled through Secure RPC. • The Condor C library converts the job's local system calls into remote procedure calls back to the shadow. • Both the sandbox and the Condor library must get the shadow's permission before making decisions.
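
For the standard universe this split is set up at link time: the program is relinked with condor_compile so the Condor C library intercepts its system calls, and the submit description file selects the universe. A brief, hypothetical sketch (program name is illustrative):

    # relink the program against the Condor remote-system-call library:
    #   condor_compile gcc -o analyze analyze.c
    # then submit it into the standard universe:
    universe   = standard
    executable = analyze
    queue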

  17. Two-Phase Open The message sequence from the slide: (1) the job asks the sandbox to open 'alpha'; (2) the sandbox asks the shadow, "Where is file 'alpha'?"; (3) the shadow replies with the access method and true name, compress:remote:/data/newalpha.gz; (4) the sandbox opens '/data/newalpha.gz' accordingly; (5) success is returned to the sandbox; (6) and finally to the job.

  18. Applications • Scientific community research • DreamWorks Animation - rendering farms • C.O.R.E. Digital Pictures

  19. Contributions • Clearly outlines the philosophies, goals, and main focal points of HTCondor. • Provides case studies that offer insight into how Condor has been used to increase productivity and efficiency. • Offers performance analysis on real-world problems, such as NUG30 (an estimated 10+ years of sequential computation completed in about one week).

  20. Critique Drawbacks: • Security - prone to attacks. • Current applications do not extend far beyond the scientific research community. Suggestions: • Include more performance comparisons with similar systems, such as Globus, Legion, and PVM. • Include more tutorials to ease the steep learning curve.

  21. Questions?
