
What’s new in Condor? Condor Week 2006




Presentation Transcript


  1. What’s new in Condor? Condor Week 2006

  2. So Todd… where is v6.8? Well, v6.7 has been a challenge…

  3. Around since the 80’s

  4. Around since the 80’s 80’s Mullet Boy

  5. 100 people surveyed! Favorite “ility” ?

  6. 100 people surveyed! Favorite “ility” ? Deployability!

  7. Existing Ports
     • Digital UNIX 4.0 : Alpha
     • AIX 5.2 (clipped) : PowerPC
     • Tru64 5.1 (clipped) : Alpha
     • HP UNIX 10.20 : PA RISC
     • HP UNIX 11.00 (clipped using hpux10.20 32 bit) : PA RISC
     • Irix 6.5 (clipped) : SGI
     • Linux 2.4.x (glibc 2.2) - Red Hat 7.1, 7.2, 7.3 (clipped) : Alpha
     • Linux 2.4.x (glibc 2.2) - Red Hat 7.1, 7.2, 7.3 : Intel x86
     • Linux 2.4.x (glibc 2.2) - Red Hat 8 : Intel x86
     • Linux 2.4.x (glibc 2.3) - Red Hat 9 : Intel x86
     • Enterprise Server 8.1 : Intel Itanium
     • Solaris 8 : Sparc
     • Solaris 9 : Sparc
     • Microsoft Windows 2000 or XP (clipped) : Intel x86
     CondorWeek 2005

  8. New Ports Sigh…
     • Introduced in v6.6.x
       • MacOSX (“clipped”) PowerPC
       • Debian Linux 3.1 Intel x86
       • Fedora Core 1 Intel x86
       • Red Hat Enterprise Linux 3 Intel x86
       • SuSE Linux Enterprise Server 8.1 Intel Itanium
     • Introduced in v6.7.x
       • AIX 5.1 (“clipped”) PowerPC
       • Fedora Core 2 on x86
       • Fedora Core 3 on x86
       • SuSE 8.0 (“clipped”) on AMD64
       • Solaris 10 (“clipped”) on Sparc
       • Scientific Linux (Release 303) on x86
     • Still to be introduced in v6.7.x (before v6.8.0)
       • HPUX 11i 64-bit pa-risc
       • RHEL 4 on x86
       • “native” 64 bit AMD Linux
     “Psilord”, the Condor porting doctor. Talk to him in person tomorrow.

  9. Porting Table • See http://www.cs.wisc.edu/condor/porting/port_table.html • Highlights • Almost every 32-bit Linux flavor as “full” • Every other Unix, MacOS and Windows available as “clipped” • Solaris 10 and HP-UX 11.x now “clipped” • FreeBSD 4 contribution from Yahoo!, added 5 and 6 • X86_64 Linux: “full” running in the lab

  10. Backfill Jobs • Execute machines will run a locally staged executable when otherwise idle. • Currently designed for BOINC.
      # Turn on backfill functionality, and use BOINC
      ENABLE_BACKFILL = TRUE
      BACKFILL_SYSTEM = BOINC
      # Spawn a backfill job if we've been Unclaimed for more than 5 minutes
      START_BACKFILL = $(StateTimer) > (5 * $(MINUTE))
      # Evict a backfill job if the machine is busy (based on keyboard
      # activity or cpu load)
      EVICT_BACKFILL = $(MachineBusy)
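The two policy expressions on the backfill slide can be read as a pair of yes/no decisions. The sketch below restates them in Python purely for illustration. The function names, the explicit "Unclaimed" check, and the keyboard and load thresholds standing in for $(MachineBusy) are assumptions, not Condor's actual definitions.

```python
MINUTE = 60  # seconds, mirroring the $(MINUTE) macro used on the slide


def start_backfill(state, seconds_in_state, threshold=5 * MINUTE):
    """Mirror START_BACKFILL: spawn only after more than 5 minutes Unclaimed.

    The state check is an assumption made explicit here; the slide's
    expression only compares the state timer.
    """
    return state == "Unclaimed" and seconds_in_state > threshold


def evict_backfill(keyboard_idle_seconds, load_avg,
                   idle_limit=15 * MINUTE, load_limit=0.3):
    """Stand-in for $(MachineBusy): recent keyboard activity or high CPU load.

    Both limits are hypothetical values chosen for the example.
    """
    return keyboard_idle_seconds < idle_limit or load_avg > load_limit
```

For example, `start_backfill("Unclaimed", 301)` is true while `start_backfill("Unclaimed", 300)` is not, because the slide's comparison is strictly greater-than.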

  11. Joining Condor’s Einstein@Home Compute Team • If you’re running BOINC backfill jobs in Condor and want to use your cycles to help another UW project, please join the Einstein@Home computation • Join the “Condor Backfill” team: • http://einstein.phys.uwm.edu/team_display.php?teamid=5994 • http://einstein.phys.uwm.edu/create_account_form.php?teamid=5994

  12. More “deployability” • “Personal” Condor Support on Win32 • LocalSystem not required • MSI installer on Win32 (thanks Micron!) • New tools: condor_cold_start and condor_cold_stop • Safe, dynamic Condor service deployment. More info @ Research BOF, 9am, Rm 219

  13. 100 people surveyed! Favorite “ility” ?

  14. 100 people surveyed! Favorite “ility” ? Availability!

  15. Condor with Firewalls and NATs: GCB in v6.8.0! [Diagram labels: Client app and Server app, each on a GCB layer over TCP/IP, joined through a Relay point; the GCB layer is marked “translate” for the listen, accept, and connect calls]

  16. Job Progress continues if connection is interrupted • For Vanilla, Java, and Grid universe jobs, Condor now supports reestablishing the connection between the submitting and executing machines. • Covers a network outage between the execute and submit machines • Covers a restart of the submit machine • Grid Universe was tricky… • To take advantage of this feature, put the following line into your job’s submit description file (it sets the job’s JobLeaseDuration attribute):
      job_lease_duration = <N seconds>
      For example:
      job_lease_duration = 1200
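The slide does not spell out how the lease works internally. A minimal sketch of the general lease idea, assuming the execute side keeps the job running as long as the lease has not run out and that each successful contact renews it, might look like this. The class and method names are hypothetical.

```python
class JobLease:
    """Toy model of a renewable lease, with times in plain seconds."""

    def __init__(self, duration_seconds, granted_at):
        self.duration = duration_seconds
        self.renewed_at = granted_at

    def renew(self, now):
        # Assumed behavior: any successful contact with the submit side
        # restarts the clock.
        self.renewed_at = now

    def is_valid(self, now):
        # The job may keep running while the lease has not expired.
        return now - self.renewed_at <= self.duration
```

With a duration of 1200 seconds, as in the slide's example, an outage shorter than 20 minutes would leave the lease valid, so the job could continue and reconnect afterwards.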

  17. Job Progress continues if submit machine fails • Condor can now support a submit machine “hot spare” (schedd failover) • If your submit machine A is down for longer than N minutes, a second machine B can take over • Requires shared filesystem between machines A and B

  18. Central Manager Failover • Condor Central Manager has two services • condor_collector • Now a list of collectors is supported • condor_negotiator (matchmaker) • If it fails, an election process picks another to take over • Accounting state is periodically replicated • Contributed technology from Technion
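One way a client could make use of a list of collectors is to try each in turn and take the first answer. The sketch below illustrates that pattern only. The slide does not say how Condor orders or selects collectors, so the in-order fallback, the function names, and the fake collectors are all assumptions.

```python
def query_first_available(collectors, query):
    """Try each (name, ask) pair in order; return the first successful answer."""
    errors = []
    for name, ask in collectors:
        try:
            return name, ask(query)
        except ConnectionError as exc:
            errors.append((name, str(exc)))
    raise ConnectionError(f"no collector reachable: {errors}")


# Hypothetical stand-ins for real collector connections.
def down_collector(query):
    raise ConnectionError("connection refused")


def up_collector(query):
    return [{"Name": "node1", "Memory": 1024}]


COLLECTORS = [("cm1.example.org", down_collector),
              ("cm2.example.org", up_collector)]
```

Calling `query_first_available(COLLECTORS, "Memory>512")` skips the unreachable first entry and returns the answer from the second.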

  19. Reliability, cont. • Time shifts • Quill • Closing windows of vulnerability

  20. 100 people surveyed! Favorite “ility” ?

  21. 100 people surveyed! Favorite “ility” ? Lightweight?

  22. 100 people surveyed! Favorite “ility” ? X Lightweight?

  23. 100 people surveyed! Favorite “ility” ?

  24. 100 people surveyed! Favorite “ility” ? Functionality!

  25. Security • Common Authentication Methods between Condor on Unix and Win32 • Kerberos 1.4 • Additional hopeful benefit: Authentication against MS Active Directory! • SSL • Password (shared secret) • Starter only runs known executables • More powerful, unified map file(s) • GSI credentials delegated

  26. With Condor on Win32, it would be nice if … • My jobs could access my files just like the condor_shadow can • I didn’t have to tie my execute machines to a single account • I didn’t have to run condor_store_cred from every machine where my credential is needed (thank you Optena)

  27. The Windows CredD • A centralized repository for user passwords
      C:\>condor_store_cred add
      Account: gquinn@CROW
      Enter password:
      Operation succeeded.
      [Diagram labels: “store password” <password> sent to the credd, which holds the stored passwords (myp4sswd, y0urs)]

  28. The Windows CredD • Submit machines can use the CredD to impersonate the user in the shadow [Diagram labels: schedd, shadow, and credd; “fetch password” request, <password> reply; stored passwords myp4sswd, y0urs]

  29. The Windows CredD • Execute machines can use the CredD to run jobs as the submitting user! [Diagram labels: starter, condor_exec.exe, and credd; “fetch password” request, <password> reply; stored passwords myp4sswd, y0urs]

  30. Running Jobs as Submitting User
      • In submit file:
        Run_job_as_owner = true
      • In config file on submit and execute nodes:
        CREDD_HOST = vault.cs.wisc.edu
        STARTER_ALLOW_RUNAS_OWNER = True
        CREDD_CACHE_LOCALLY = True

  31. Some Condor APIs • Command Line tools • condor_submit, condor_q, etc • -format, -constraint, -xml • Condor Perl Module • Chirp • Checkpoint Library API • MW --- improved! • DRMAA (Works w/ Win32, on SourceForge) • Condor Grid ASCII Helper Protocol (GAHP) • Web Service Interface

  32. DRMAA • Distributed Resource Management Application API (DRMAA) • GGF Working Group • An API specification for the submission and control of jobs to one or more Distributed Resource Management (DRM) systems • An API with C and Java bindings • not a protocol • Scope • Does: job submission, monitoring, control, final status • Does not: file staging, reservations, security, …

  33. Condor GAHP • The Condor GAHP is a relatively low-level protocol based on simple ASCII messages through stdin and stdout • Supports a rich feature set including two-phase commits, transactions, and optional asynchronous notification of events

  34. GAHP, cont
      Example:
      R: $GahpVersion: 1.0.0 Nov 26 2001 NCSA\ CoG\ Gahpd $
      S: GRAM_PING 100 vulture.cs.wisc.edu/fork
      R: E
      S: RESULTS
      R: E
      S: COMMANDS
      R: S COMMANDS GRAM_JOB_CANCEL GRAM_JOB_REQUEST GRAM_JOB_SIGNAL GRAM_JOB_STATUS GRAM_PING INITIALIZE_FROM_FILE QUIT RESULTS VERSION
      S: VERSION
      R: S $GahpVersion: 1.0.0 Nov 26 2001 NCSA\ CoG\ Gahpd $
      S: INITIALIZE_FROM_FILE /tmp/grid_proxy_554523.txt
      R: S
      S: GRAM_PING 100 vulture.cs.wisc.edu/fork
      R: S
      S: RESULTS
      R: S 0
      S: RESULTS
      R: S 1
      R: 100 0
      S: QUIT
      R: S
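The transcript above suggests that fields on a GAHP line are separated by single spaces and that a backslash makes the next character literal, as in "NCSA\ CoG\ Gahpd". Under that assumption, a small field splitter can be sketched as follows. The function name is hypothetical and this is not taken from any GAHP implementation.

```python
def split_gahp_line(line):
    """Split a line on unescaped spaces; a backslash escapes the next character."""
    fields, current, escaped = [], [], False
    for ch in line:
        if escaped:
            current.append(ch)   # keep the escaped character literally
            escaped = False
        elif ch == "\\":
            escaped = True       # drop the backslash, escape what follows
        elif ch == " ":
            fields.append("".join(current))
            current = []
        else:
            current.append(ch)
    fields.append("".join(current))
    return fields
```

Applied to the version reply from the example, the escaped spaces keep "NCSA CoG Gahpd" together as one field, while a reply such as `S 0` splits into the status and the count.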

  35. Web Service Interfaces • SOAP over http or https to the Condor daemons • Use any language or platform (where you can find a decent SOAP library) • Functionality Exposed in current release • Submit jobs • Retrieve job output • Remove/hold/release jobs • Query machine status (fetch ads from collector) • Query job status (fetch ads from the schedd)

  36. Getting machine status via SOAP (in Java with Axis)
      locator = new CondorCollectorLocator();
      collector = locator.getcondorCollector(new URL("http://machine:port"));
      ads = collector.queryStartdAds("Memory>512");
      Because we give you WSDL information you don’t have to write any of these functions.

  37. More Functionality changes.. • FINALLY, clean/consistent cross-platform quoting rules for arguments and environment variables (see condor_submit man page) • Schedd can run HawkEye modules, just like the Startd • Enables monitoring on the submit machine • condor_history : now faster than a snail, and cleans up droppings. • DeferralTime, DeferralWindow • Coordinated starts • BIND_ALL_INTERFACES in config file • WANT_REMOTE_IO in job ClassAd

  38. ClassAd Functions in Condor! • Conditionals • IfThenElse(condition,then,else) • String functions • Strcat(), strcmp(), toUpper(), etc. • StringList functions • Example of a “string list” (CSV style) • Mylist = “Joe, Jon, Jeff, Jim, Jake” • StrListContains(), StrListAppend(), StrListRemove(), etc. • Others • Regular expressions, arithmetic, etc…
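To make the “string list” idea concrete, here is a rough Python model of the functions named on the slide, using the slide's own example list. These are plain Python stand-ins, not Condor's ClassAd implementation. The snake_case names, the whitespace trimming, and the case-sensitive matching are assumptions.

```python
def _items(string_list):
    """Split a comma-separated string list and trim surrounding whitespace."""
    return [item.strip() for item in string_list.split(",") if item.strip()]


def str_list_contains(string_list, name):
    return name in _items(string_list)


def str_list_append(string_list, name):
    return ", ".join(_items(string_list) + [name])


def str_list_remove(string_list, name):
    return ", ".join(item for item in _items(string_list) if item != name)


def if_then_else(condition, then_value, else_value):
    """Python analogue of the IfThenElse(condition,then,else) conditional."""
    return then_value if condition else else_value


MYLIST = "Joe, Jon, Jeff, Jim, Jake"
```

With this model, `str_list_contains(MYLIST, "Jeff")` is true, and removing "Jon" yields "Joe, Jeff, Jim, Jake".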

  39. Accounting Groups and Group Quota Support • Account Group (w/ CORE Feature Animation) • Account Group Quota (inspiration CDF @ Fermi) • Sample Problem: Cluster w/ 500 nodes, Chemistry Dept purchased 100 of them, Chemistry users must always be able to use them • Could use Machine Rank… • but this ties to specific machines • Or could use new group support • Each group can be given a quota in config file • Job ads can specify group membership • Group quotas are satisfied first • Accounting by user and by group
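The sample problem on the group quota slide can be worked through with a simplified allocation model: groups are served first, up to their quota, and whatever is left goes to everyone else. The sketch below is only an illustration of “group quotas are satisfied first”. The function, the "<none>" bucket for ungrouped jobs, and the rule that a group's demand beyond its quota is ignored are assumptions, not the negotiator's real algorithm.

```python
def allocate(total_slots, group_quotas, group_demand, other_demand):
    """Give each group min(quota, demand) first, then the rest to other jobs."""
    allocation = {}
    remaining = total_slots
    for group, quota in group_quotas.items():
        granted = min(quota, group_demand.get(group, 0), remaining)
        allocation[group] = granted
        remaining -= granted
    # Jobs that name no group share whatever is left over.
    allocation["<none>"] = min(other_demand, remaining)
    return allocation
```

With 500 nodes and a Chemistry quota of 100, Chemistry jobs get their 100 nodes no matter how much other demand exists, and no specific machines are reserved for them.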

  40. 100 people surveyed! Favorite “ility” ?

  41. 100 people surveyed! Favorite “ility” ? Universability!

  42. Grid Universe • With new Grid Universe, always specify a ‘gridtype’. So the old “globus” Universe is now declared as:
      universe = grid
      gridtype = gt2
      • Other gridtypes? • GT2 (Globus Toolkit 2) • GT3 (Globus Toolkit 3.2) • GT4 (Globus Toolkit 3.9.5+) • UNICORE • Nordugrid • PBS (OpenPBS, PBSPro – technology from INFN) • LSF (Platform LSF – technology from INFN) • CONDOR (thanks gLite!) [Slide labels: ‘Condor-G’, ‘Condor-C’]

  43. Other Grid Universe improvements • Condor-G has support for credential refresh via the MyProxy Online Credential Management in NMI http://grid.ncsa.uiuc.edu/myproxy (both GT2 and GT4) • GT4 : we start a GridFTP server behind the scenes • GridFTP server bundled w/ Condor nowadays • Some functionality present in Condor-G added to Condor-C • Forwarding of refreshed credentials (EGEE) • GSI authentication support • Cleaner ClassAd representation (URL)

  44. Parallel Universe • Replaces the “MPI” universe • Allows running arbitrary programs that need to gang-schedule multiple machines • MPICH, LAM, … • FT-MPICH (Seoul National Univ) • Great for testing environments

  45. Hey Jobs! We’re watching you! • Local Universe • Just like Scheduler Universe, but there is a condor_starter • All advantages of the starter [Diagram labels: Submit side with schedd, starter, and job; Execute side with startd, starter, and job; caption “Hey, job, behave or else!”]

  46. 100 people surveyed! Favorite “ility” ?

  47. 100 people surveyed! Favorite “ility” ? Scalability!

  48. Faster Negotiation • SIGNIFICANT_ATTRIBUTES determined automatically • Job attributes AutoClusterId and AutoClusterAttributes • Rounding of Attributes • Schedd uses non-blocking TCP connects to the startd • Negotiator caching • Collector Forks for queries • More coming…
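The auto-clustering and rounding bullets describe a grouping idea: jobs whose significant attributes match can be treated as one unit during negotiation, and rounding values makes more jobs match. A minimal sketch of that idea follows. The function names, the round-up rule, the sample jobs, and the step of 128 are assumptions made for illustration, not Condor's actual behavior.

```python
import math
from collections import defaultdict


def round_up(value, step):
    """Round value up to the next multiple of step."""
    return int(math.ceil(value / step) * step)


def auto_cluster(jobs, significant_attrs, rounding=None):
    """Group job ids by their (optionally rounded) significant attribute values."""
    rounding = rounding or {}
    clusters = defaultdict(list)
    for job in jobs:
        key = tuple(
            round_up(job[attr], rounding[attr]) if attr in rounding else job[attr]
            for attr in significant_attrs
        )
        clusters[key].append(job["id"])
    return dict(clusters)


# Hypothetical jobs: two differ only slightly in their memory request.
JOBS = [
    {"id": 1, "memory": 500, "opsys": "LINUX"},
    {"id": 2, "memory": 510, "opsys": "LINUX"},
    {"id": 3, "memory": 900, "opsys": "LINUX"},
]
```

Without rounding, the three jobs form three clusters. Rounding memory up to multiples of 128 places jobs 1 and 2 in the same cluster, so a matchmaker would only need to consider two distinct requests.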
