1 / 105

Condor Administration Paradyn-Condor Week UW Campus March 2002

Condor Administration Paradyn-Condor Week UW Campus March 2002. Outline. Other sources of Information User Priorities Policy Expressions Life-cycle of a job – submit to complete Daemons – what they do and require Startd states and activities Useful admin commands

Download Presentation

Condor Administration Paradyn-Condor Week UW Campus March 2002

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Condor Administration Paradyn-Condor WeekUW CampusMarch 2002

  2. Outline • Other sources of Information • User Priorities • Policy Expressions • Life-cycle of a job – submit to complete • Daemons – what they do and require • Startd states and activities • Useful admin commands • Authorization and Authentication • General Security Comments/Worries

  3. Outline, cont. • Installation Layout • Contrib Modules • Walk-thru of UW-Madison’s condor_config files

  4. Other Sources • Condor Manual • Condor Web Site • “How to Build a Beowulf Cluster on Linux” by Thomas Sterling, MIT Press, published in 2001 • Email to condor-admin@cs.wisc.edu

  5. User Priorities • Command condor_userprio • How it all works • About nice_user • Config file Settings: • Priority_Halflife, Default_Prio_Factor, Nice_User_Prio_Factor, Remote_Prio_Factor, Account_local_Domain

  6. Introduction to Condor’s Configuration Files • Condor’s configuration is a concatenation of multiple files, in order - definitions in later files overwrites previous definitions • Layout and purpose of the different files: • Global config file • Other shared files • Local config file • Root config file (optional)

  7. Global Config File • All shared settings across your entire pool • Found either in file pointed to with the CONDOR_CONFIG environment variable, /etc/condor/condor_config, or the home directory of the “condor” user • Most settings can be in this file • Only works as a “global” file if it is on a shared file system

  8. Other shared files • You can configure a number of other shared config files: • files to hold common settings to make it easier to maintain (for example, all policy expressions, which we’ll see later) • platform-specific config files

  9. Local config file • Any machine-specific settings • local policy settings for a given owner • different daemons to run (for example, on the Central Manager!) • Can either be on the local disk of each machine, or have separate files in a shared directory, each named by hostname

  10. Root config file (optional) • You can specify a “root” config file, which is always processed after all other files • This allows root to specify certain settings which cannot be changed by another user (like the path to the Condor daemons) • Only useful if daemons are started as root but someone else has access to edit Condor’s config files

  11. Basic syntax • # is a comment • A “\” at the end of a line is a line-continuation, so both lines are treated as one big entry • All names are case insensitive • “Macros” have the form: • Attribute_Name = value • You reference other macros with: • A = $(B)

  12. Policy ExpressionsBack to Frieda

  13. Policy Configuration I am adding nodes to the Cluster… but the Engineering Department has priority on these nodes. (Boss Fat Cat)

  14. The Machine (Startd) Policy Expressions START – When is this machine willing to start a job RANK - Job Preferences SUSPEND - When to suspend a job CONTINUE - When to continue a suspended job PREEMPT – When to nicely stop running a job KILL - When to immediately kill a preempting job

  15. Freida’s Current Settings START = True RANK = SUSPEND = False CONTINUE = PREEMPT = False KILL = False

  16. Freida’s New Settings for the Chemistry nodes START = True RANK = Department == “Chemistry” SUSPEND = False CONTINUE = PREEMPT = False KILL = False

  17. Submit file with Custom Attribute Executable = charm-run Universe = standard +Department = Chemistry queue

  18. What if “Department” not specified? START = True RANK = Department =!= UNDEFINED && Department == “Chemistry” SUSPEND = False CONTINUE = PREEMPT = False KILL = False

  19. Another example START = True RANK = Department =!= UNDEFINED && ((Department == “Chemistry”)*2 + Department == “Physics”) SUSPEND = False CONTINUE = PREEMPT = False KILL = False

  20. Policy Configuration, cont The Cluster is fine. But not the desktop machines. Condor can only use the desktops when they would otherwise be idle. (Boss Fat Cat)

  21. So Frieda decides she wants the desktops to: • START jobs when their has been no activity on the keyboard/mouse for 5 minutes and the load average is low • SUSPEND jobs as soon as activity is detected • PREEMPT jobs if the activity continues for 5 minutes or more • KILL jobs if they take more than 5 minutes to preempt

  22. Macros in the Config File NonCondorLoadAvg = (LoadAvg - CondorLoadAvg) BackgroundLoad = 0.3 HighLoad = 0.5 KeyboardBusy = (KeyboardIdle < 10) CPU_Busy = ($(NonCondorLoadAvg) >= $(HighLoad)) MachineBusy = ($(CPU_Busy) || $(KeyboardBusy)) ActivityTimer = (CurrentTime - EnteredCurrentActivity)

  23. Desktop Machine Policy START = $(CPU_Idle) && KeyboardIdle > 300 SUSPEND = $(MachineBusy) CONTINUE = $(CPU_Idle) && KeyboardIdle > 120 PREEMPT = (Activity == "Suspended") && $(ActivityTimer) > 300 KILL = $(ActivityTimer) > 300

  24. Policy Review • Users submitting jobs can specify Requirements and Rank expressions • Administrators can specify Startd Policy expressions individually for each machine (Start,Suspend,etc) • Expressions can use any job or machine ClassAd attribute • Custom attributes easily added • Bottom Line: Enforce almost any policy!

  25. Additional Policy Parameters • WANT_SUSPEND • WANT_VACATE

  26. True True True True True False START Road Map of the Policy Expressions WANT SUSPEND SUSPEND = Expression PREEMPT = Activity WANT VACATE False Vacating KILL True Killing

  27. Negotiator Policy Expressions • PREEMPTION_REQUIREMENTS • PREEMPTION_RANK Examples: PREEMPTION_REQUIREMENTS = $(StateTimer) > (1 * $(HOUR)) && RemoteUserPrio > SubmittorPrio * 1.2 PREEMPTION_RANK = (RemoteUserPrio * 1000000) - ImageSize

  28. The Condor Daemons • condor_master (controls everything else) • condor_startd (executing jobs) • condor_starter (helper for starting jobs) • condor_schedd (submitting jobs) • condor_shadow (submit-side helper) • condor_collector (only on Central Manager) • condor_negotiator (only on CM) • You only have to run the daemon(s) for the service(s) you want to provide

  29. condor_master • Starts up all other Condor daemons • If there are any problems and a daemon exists, it restarts the daemon and sends email to the administrator • Checks the time stamps on the binaries it is configured to spawn, and if new binaries appear, the master will gracefully shutdown the currently running version and start the new version

  30. condor_master (cont’d) • Provides access to many remote administration commands: • condor_reconfig • condor_restart, condor_off, condor_on • Default server for many other commands: • condor_config_val, etc. • Periodically runs condor_preen to clean up any files Condor might have left on the machine (the rest of the daemons clean up after themselves, as well)

  31. condor_startd • Represents a machine to the Condor pool • Enforces the wishes of the machine owner (the owner’s “policy”) • Responsible for starting, suspending, and stopping jobs • Spawns the appropriate condor_starter, depending on the type of job • Provides other administrative commands: (for example, condor_vacate)

  32. condor_starter • Spawned by the condor_startd to handle all the details of starting and managing the job (for example, transferring the job’s binary to the executing machine or sending back exit status) • On SMP machines, you get one condor_starter per CPU • For PVM jobs, the starter also spawns a PVM daemon (condor_pvmd)

  33. condor_schedd • Represents users to the Condor pool • Maintains persistent queue of jobs • Queue is not strictly FIFO (priority based) • Responsible for contacting available machines and spawning waiting jobs • Services most user commands: • condor_submit • condor_rm • condor_q

  34. condor_shadow • Represents the job on the submit machine • Services requests from “standard” jobs for “remote system calls”, including all file I/O • Is responsible for making decisions on behalf of the job (for example, where to store the checkpoint file) • There will be one condor_shadow process running on your submit machine for each currently running Condor job

  35. condor_shadow (cont’d) • The shadow doesn’t put much load on your submit machine: • Almost always blocked waiting for requests from the job or doing I/O • Relatively small memory footprint • Still, you can limit the impact of the shadows on a given submit machine: • They can be started by Condor with a “nice-level” that you configure (renice) • Can put a limit on the total number of shadows running on a machine

  36. condor_collector • Collects information from all other Condor daemons in the pool • Each daemon sends a periodic update called a “ClassAd” to the collector • Services queries for information: • Queries from other Condor daemons • Queries from users (condor_status)

  37. condor_negotiator • Performs “matchmaking” in Condor • Gets information from the collector about all available machines and all idle jobs • Tries to match jobs with machines that will serve them • Both the job and the machine must satisfy each other’s requirements (this is called “2-way matching”) • Handles User Priorities

  38. Execute-Only Execute-Only Submit-Only Regular Node Regular Node Central Manager = Process Spawned negotiator collector schedd schedd schedd schedd master master master master master master startd startd startd startd startd Layout of a General Condor Pool = ClassAd Communication Pathway

  39. Job Startup Startd Schedd Starter Customer Job Shadow Condor Syscall Lib Submit

  40. PREEMPTING UNCLAIMED CLAIMED OWNER begin MATCHED Machine States

  41. Useful Admin Commands

  42. Viewing things with condor_status • condor_status has lots of different options to display various kinds of info • Supports “-constraint” so you can only view ClassAds that match an expression you specify • Supports “-format” so you can get the data in whatever form you want (very useful for writing scripts) • View any kind of daemon ClassAd

  43. Viewing things with condor_q • View the job queue • The “-long” option is useful to see the entire ClassAd for a given job • Also supports the “-constraint” option • Can view job queues on remote machines with the “-name” option

  44. Looking at condor_q -analyze • You specify a job or set of jobs you want to analyze • condor_q will try to figure out why the job isn’t running • The output is not as user-friendly as we’d like (though we’re working on it) • Good at finding errors in Requirements expressions set by users

  45. Host/IP Security in Condor • You can configure each machine in your pool to allow or deny certain actions from different groups of machines: • “read” access - querying information • condor_status, condor_q, etc • “write” access - updating information • condor_submit, adding a node to the pool, etc • “administrator” access • condor_on, off, reconfig, restart... • “owner” access • Things a machine owner can do (vacate)

  46. Setting up Host/IP-address Security in Condor (part 1) • To configure, you list what hosts are allowed or denied to perform each action • If you list hosts that are allowed, everything else is denied • If you list hosts that are denied, everything else is allowed • If you list both, only hosts that are listed in “allow” but not in “deny” are allowed

  47. Setting up Host/IP-address Security in Condor (part 2) • There are many possibilities for specifying which hosts are allowed or denied: • Host names, domain names • IP addresses, subnets • Wildcards • ‘*’ can be used anywhere (once) in a host name (for example, “infn-corsi*.corsi.infn.it) • ‘*’ can be used at the end of any IP address (e.g. “128.105.101.*” or “128.105.*”)

  48. Setting up Host/IP-address Security in Condor (part 3) • Can define values that effect all daemons: • HOSTALLOW_WRITE, HOSTDENY_READ, HOSTALLOW_ADMINISTRATOR, etc. • Can define daemon-specific settings: • HOSTALLOW_READ_SCHEDD, HOSTDENY_WRITE_COLLECTOR, etc. • Write access doesn’t automatically provide read access: you must grant both!

  49. Example Host/IP Security Settings HOSTALLOW_WRITE = *.infn.it HOSTALLOW_ADMINISTRATOR = infn-corsi1*, \ $(CONDOR_HOST), axpb07.bo.infn.it, \ $(FULL_HOSTNAME) HOSTDENY_ADMINISTRATOR = infn-corsi15 HOSTDENY_READ = *.gov, *.mil HOSTDENY_ADMINISTRATOR_NEGOTIATOR = *

  50. New Security Features in v6.3 • AUTHENTICATION_METHODS • Kerberos, GSI (X.509 certs), FS, NTSSPI • Strong Encryption • Demo/BoF in 3397

More Related