
Condor Usage at Brookhaven National Lab



  1. Condor Usage at Brookhaven National Lab Alexander Withers (talk given by Tony Chan) RHIC Computing Facility Condor Week - March 15, 2005

  2. About Brookhaven National Lab • One of a handful of Laboratories supported and managed by the U.S. gov’t through DOE. • Multi-disciplinary Lab with 2,700+ employees, Physics being the largest department. • Physics Dept. has its own computing division (30+ FTEs) to support physics (HEP) projects. • RHIC (nuclear) and ATLAS (HEP) are the largest projects currently being supported.

  3. Computing Facility Resources • Full service facility: central/distributed storage capacity, large Linux Farm, robotic system for data storage, data backup, etc. • 6+ PB permanent tape storage capacity. • 500+ TB central/distributed disk storage capacity. • 1.4 million SpecInt2000 aggregate computing power in Linux Farm.

  4. History of Condor at Brookhaven • First looked at Condor in 2003 as a replacement for LSF and in-house batch software. • Installed 6.4.7 in August 2003. • Upgraded to 6.6.0 in February 2004. • Upgraded to 6.6.6 (with 6.7.0 startd binary) in August 2004. • User base grew from 12 (April 2004) to 50+ (March 2005).

  5. The Rise in Condor Usage

  6. The Rise in Condor Usage

  7. Condor Cluster Usage

  8. BNL’s modified Condorview

  9. Overview of Computing Resources • Total of 2750 CPUs (growing to 3400+ in 2005). • Two central managers with one acting as a backup. • Three specialized submit machines which handle ~600 simultaneous jobs each on average. • 131 of the execute nodes can also act as submission nodes. • One monitoring/Condorview server.

  10. Overview of Computing Resources, cont. • Six GLOBUS gateway machines for remote job submission. • Most machines run SL-3.0.2 on the x86 platform, some still using RH 7.3. • Running 6.6.6 with 6.7.0 startd binary to take advantage of multiple VM feature.

  11. Overview of Configuration • Computing resources divided into 6 pools. • Two configuration models: • Split pool resources into two parts and restrict which jobs can run in each part. • More complex version of the Bologna Batch System. • A pool uses one or both of these models. • Some pools employ user priority preemption. • Use “drop queue” method to fill fast machines first. • Have tools to easily reconfigure nodes. • All jobs use vanilla universe (no checkpointing).
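The “drop queue” behavior of filling fast machines first can be sketched with a standard negotiator ranking knob. This is a minimal, hypothetical configuration, not BNL’s actual settings; the talk does not say how their method was implemented:

```
# Hypothetical negotiator config sketch (not BNL's actual settings).
# Rank candidate machines by their benchmark speed before the job's
# own RANK expression is considered, so faster machines fill first.
NEGOTIATOR_PRE_JOB_RANK = Mips
```

Mips is the startd’s integer benchmark attribute; KFlops would work the same way for floating-point-bound workloads.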

  12. Two Part Model • Nodes are assigned one of two tasks irrespective of Condor: analysis or reconstruction. • Within Condor, a node advertises itself as either an analysis node or a reconstruction node. • A job must advertise itself in the same manner to match with an appropriate node. • Only certain users may run reconstruction jobs but anyone can run an analysis job.
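The two-part matching described above can be sketched with a custom startd attribute and a matching job requirement. The attribute names (NodeType, JobType) are illustrative, not the ones BNL actually used:

```
# startd config (per node) -- advertise the node's assigned task:
NodeType     = "reconstruction"          # or "analysis"
STARTD_EXPRS = $(STARTD_EXPRS), NodeType
# Only accept jobs that advertise the matching type; a check on
# TARGET.Owner could be added here to restrict reconstruction
# jobs to certain users:
START        = (TARGET.JobType =?= MY.NodeType)

# submit file -- the job advertises its type and requires a
# node of that type:
#   +JobType     = "reconstruction"
#   requirements = (NodeType == "reconstruction")
```

The =?= (“meta-equals”) operator is used so that jobs missing the attribute fail the match instead of evaluating to UNDEFINED.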

  13. Analysis/Reconstruction [diagram: machines ranked from Slow (Group 1) to Fast (Group 5), each with vm1/vm2; an example Reconstruction Job requiring group <= 2 matches only the slower groups] • No suspension • No preemption • Will start a job if CPU is free

  14. A More Complex Version of the Bologna Model • Dual-CPU nodes, each configured with 8 VMs. • 2 VMs per CPU. • Only two jobs running at a time. • Four job categories, each with its own priority. • A high priority VM will suspend a random VM of lower priority. • The random aspect is to prevent the same VM from always getting suspended.
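A heavily simplified sketch of tying job categories to VM pairs, using the pre-7.0 VirtualMachineID machine attribute. Only two of the four tiers are shown, JobCategory is a hypothetical job attribute, and the cross-VM suspension logic (a VM seeing its peers’ state, plus the random victim choice) required extra machinery not shown here:

```
# Advertise 8 VMs on each dual-CPU node (old pre-7.0 naming):
NUM_VIRTUAL_MACHINES = 8

# Hypothetical job attribute set via "+JobCategory" in the submit
# file; each VM pair accepts only its own tier (two tiers shown
# for brevity -- BNL used four):
START = ( (VirtualMachineID >= 5 && TARGET.JobCategory =?= "high") || \
          (VirtualMachineID <= 4 && TARGET.JobCategory =?= "mc") )

# The SUSPEND expression on low-tier VMs would fire while a
# higher-tier job runs on the same machine; expressing that
# peer-VM visibility is the part this sketch omits.
```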

  15. [diagram: four priority tiers of VM pairs — High (vm7/vm8), Medium (vm5/vm6), Low (vm3/vm4), MC (vm1/vm2) — on Analysis/Reconstruction machines ranked from Slow (Group 1) to Fast (Group 5); an example medium-priority (vm5/vm6) Reconstruction Job requires group == 3] • Low priority VMs suspended • No preemption • Will start a job if CPU is free or is of higher priority

  16. Issues We've Had to Deal With • Tuned parameters to alleviate scalability problems: • MATCH_TIMEOUT • MAX_CLAIM_ALIVES_MISSED • Panasas (a proprietary file system) creates kernel threads with whitespace in the process name, which broke an fscanf in procapi.C; Panasas fixed the bug. • High-volume users can dominate the pool; partially solved with PREEMPTION_REQUIREMENTS.
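The knobs named above are standard startd/negotiator settings. A sketch of how they might be tuned; the values are illustrative, not BNL’s:

```
# Scalability tuning (illustrative values, in seconds / counts):
MATCH_TIMEOUT           = 300   # how long the startd waits for the
                                # matched schedd to claim it
MAX_CLAIM_ALIVES_MISSED = 6     # keep-alives the startd may miss
                                # before dropping the claim

# Curb pool-dominating users: only preempt a running job when the
# incoming user's fair-share priority is clearly better (lower is
# better; RemoteUserPrio is the priority of the user currently on
# the machine). The 1.2 factor is illustrative:
PREEMPTION_REQUIREMENTS = RemoteUserPrio > SubmittorPrio * 1.2
```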

  17. Issues We’ve Had to Deal With, cont. • DAGMan problems (latency, termination) → switched from DAGMan to plain Condor. • Created our own ClassAds and JobAds to build batch queues and handy management tools (i.e., our own version of condor_off). • Modified Condorview to meet our accounting & monitoring requirements.

  18. Issues Not Yet Resolved • Need a job ClassAd attribute giving the user's primary group → better control over cluster usage. • Transfer output files for debugging when a job is evicted. • Need an option to force the schedd to release its claim after each job. • Allow the schedd to set a mandatory periodic_remove policy → avoid manual cleanup.
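Per job, the cleanup policy asked for above can already be expressed in a submit file; the slide’s request is the ability to enforce it schedd-wide. A sketch, with an illustrative 24-hour threshold:

```
# Submit-file sketch: automatically remove a job that has sat in
# the held state (JobStatus == 5) for more than a day, instead of
# cleaning it up by hand:
periodic_remove = (JobStatus == 5) && ((CurrentTime - EnteredCurrentStatus) > 86400)
```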

  19. Issues Not Yet Resolved, cont. • The shadow seems to make a large number of NIS calls; possible problem with caching → address shadows in the vanilla universe? • Need Kerberos support to comply with security mandates. • Interested in Condor on Demand (COD), but lack of functionality prevents wider usage. • Need more (and effective) cluster management tools → does condor_off work?

  20. Near-Term Plans & Summary • Waiting for the 6.8.x series (late 2005?) before upgrading. • Scalability concerns as usage rises. • High availability more critical as usage rises. • Integration of BNL Condor pools with external pools, but concerned about security. • Need some of the functionality listed above for a meaningful upgrade and to improve cluster management capability.
