A Year of HTCondor at the RAL Tier-1

Ian Collier, Andrew Lahiff

STFC Rutherford Appleton Laboratory

HEPiX Spring 2014 Workshop

Outline
  • Overview of HTCondor at RAL

  • Computing elements

  • Multi-core jobs

  • Monitoring

Introduction
  • RAL is a Tier-1 for all 4 LHC experiments

    • In terms of Tier-1 computing requirements, RAL provides

      • 2% ALICE

      • 13% ATLAS

      • 8% CMS

      • 32% LHCb

    • Also support ~12 non-LHC experiments, including non-HEP

  • Computing resources

    • 784 worker nodes, over 14K cores

    • Generally have 40-60K jobs submitted per day

  • Torque / Maui had been used for many years

    • Many issues

    • Severity & number of problems increased as size of farm increased

    • In 2012 decided it was time to start investigating moving to a new batch system

Choosing a new batch system

  • Considered, tested & eventually rejected the following

    • LSF, Univa Grid Engine*

      • Requirement: avoid commercial products unless absolutely necessary

    • Open source Grid Engines

      • Competing products; unclear which has a long-term future

      • Communities appear less active than HTCondor & SLURM

      • Existing Tier-1s running Grid Engine using the commercial version

    • Torque 4 / Maui

      • Maui problematic

      • Torque 4 seems less scalable than alternatives (but better than Torque 2)

    • SLURM

      • Carried out extensive testing & comparison with HTCondor

      • Found that for our use case

        • Very fragile, easy to break

        • Unable to get reliably working above 6000 running jobs

* Only tested open source Grid Engine, not Univa Grid Engine

Choosing a new batch system

  • HTCondor chosen as replacement for Torque/Maui

    • Has the features we require

    • Seems very stable

    • Easily able to run 16,000 simultaneous jobs

      • Didn’t do any tuning – it “just worked”

      • Have since tested > 30,000 running jobs

    • Is more customizable than all other batch systems

Migration to HTCondor

  • Strategy

    • Start with a small test pool

    • Gain experience & slowly move resources from Torque / Maui

  • Migration

    2012 Aug Started evaluating alternatives to Torque / Maui

    (LSF, Grid Engine, Torque 4, HTCondor, SLURM)

    2013 Jun Began testing HTCondor with ATLAS & CMS

    ~1000 cores from old WNs beyond MoU commitments

    2013 Aug Choice of HTCondor approved by management

    2013 Sep HTCondor declared production service

    Moved 50% of pledged CPU resources to HTCondor

    2013 Nov Migrated remaining resources to HTCondor

Experience so far

  • Experience

    • Very stable operation

      • Generally just ignore the batch system & everything works fine

      • Staff don’t need to spend all their time fire-fighting problems

    • No changes needed as the HTCondor pool increased in size from ~1000 to ~14000 cores

    • Job start rate much higher than Torque / Maui even when throttled

      • Farm utilization much better

    • Very good support


  • A few issues found, but fixed quickly by developers

    • Found job submission hung when one of an HA pair of central managers was down

      • Fixed & released in 8.0.2

    • Found problem affecting HTCondor-G job submission to ARC with HTCondor as LRMS

      • Fixed & released in 8.0.5

    • Experienced jobs dying 2 hours after network break between CEs and WNs

      • Fixed & released in 8.1.4

Computing elements

  • All job submission to RAL is via the Grid

    • No local users

  • Currently have 5 CEs

    • 2 CREAM CEs

    • 3 ARC CEs

  • CREAM doesn’t currently support HTCondor

    • We developed the missing functionality ourselves

    • Will feed this back so that it can be included in an official release

  • ARC support for HTCondor is better

    • But didn’t originally handle partitionable slots, passing CPU/memory requirements to HTCondor, …

    • We wrote lots of patches, all included in the recent 4.1.0 release

  • Will make it easier for more European sites to move to HTCondor

Computing elements

  • ARC CE experience

    • Have run almost 9 million jobs so far across our 3 ARC CEs

    • Generally ignore them and they “just work”

    • VOs

      • ATLAS & CMS fine from the beginning

      • LHCb added ability to submit to ARC CEs to DIRAC

        • Seem to be ready to move entirely to ARC

      • ALICE not yet able to submit to ARC

        • They have said they will work on this

      • Non-LHC VOs

        • Some use DIRAC, which now can submit to ARC

        • Others use EMI WMS, which can submit to ARC

  • CREAM CE status

    • Plan to phase-out CREAM CEs this year

HTCondor & ARC in the UK

  • Since the RAL Tier-1 migrated, other sites in the UK have started moving to HTCondor and/or ARC

    • RAL T2: HTCondor + ARC (in production)

    • Bristol: HTCondor + ARC (in production)

    • Oxford: HTCondor + ARC (small pool in production, migration in progress)

    • Durham: SLURM + ARC

    • Glasgow: testing HTCondor + ARC

    • Liverpool: testing HTCondor

  • 7 more sites considering moving to HTCondor or SLURM

  • Configuration management: community effort

    • The Tier-2s using HTCondor and ARC have been sharing Puppet modules

Multi-core jobs

  • Current situation

    • ATLAS have been running multi-core jobs at RAL since November

    • CMS started submitting multi-core jobs in early May

    • Interest so far only for multi-core jobs, not whole-node jobs

      • Only 8-core jobs

  • Our aims

    • Fully dynamic

      • No manual partitioning of resources

    • Number of running multi-core jobs determined by fairshares

Getting multi-core jobs to work

  • Job submission

    • Haven’t setup dedicated multi-core queues

    • VO has to request how many cores they want in their JDL, e.g.
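
The JDL snippet shown on the original slide is not preserved in this transcript; a typical multi-core request looks like the following (a hedged example — the exact attributes depend on the CE type, and the values shown are illustrative):

```
# CREAM JDL: request 8 cores on a single node
CpuNumber = 8;
SMPGranularity = 8;

# ARC xRSL equivalent
(count = 8)
(countpernode = 8)
```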


  • Worker nodes configured to use partitionable slots

    • Resources of each WN (CPU, memory, …) divided up as necessary amongst jobs
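
Partitionable slots are standard HTCondor configuration; a minimal worker-node sketch (stock knobs, not RAL's exact settings):

```
# One partitionable slot owning all of the node's resources;
# dynamic slots are carved off it to match each job's request
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = 100%
SLOT_TYPE_1_PARTITIONABLE = True
```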

  • Setup multi-core groups & associated fairshares

    • HTCondor configured to assign multi-core jobs to the appropriate groups

  • Adjusted the order in which the negotiator considers groups

    • Consider multi-core groups before single core groups

      • 8 free cores are “expensive” to obtain, so try not to lose them to single core jobs too quickly
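
A sketch of the negotiator-side configuration implied by the bullets above (the group names and the exact expression are assumptions; GROUP_SORT_EXPR is the standard knob for ordering group consideration):

```
# Separate accounting groups for multi-core jobs, with fairshares
GROUP_NAMES = group_ATLAS, group_ATLAS_MULTICORE, group_CMS, group_CMS_MULTICORE
GROUP_ACCEPT_SURPLUS = True

# Consider multi-core groups before single-core groups, so freshly
# drained blocks of 8 cores are not lost to single-core jobs
GROUP_SORT_EXPR = ifThenElse(regexp("MULTICORE", AccountingGroup), 1, 2)
```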

Getting multi-core jobs to work

  • If lots of single-core jobs are idle & running, how does a multi-core job start?

    • By default it probably won’t

  • condor_defrag daemon

    • Finds WNs to drain, triggers draining & cancels draining as required

    • Configuration changes from default:

      • Drain 8 cores only, not whole WNs

      • Pick WNs to drain based on how many cores they have that can be freed up

        • E.g. getting 8 free CPUs by draining a full 32-core WN is generally faster than draining a full 8-core WN

    • Demand for multi-core jobs not known by condor_defrag

      • Setup simple cron to adjust number of concurrent draining WNs based on demand

        • If many idle multi-core jobs but few running, drain aggressively

        • Otherwise very little draining
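
Since condor_defrag has no view of queue demand, the cron described above tunes DEFRAG_MAX_CONCURRENT_DRAINING from the idle vs. running multi-core job counts. A minimal sketch of that decision logic (the thresholds and the helper function are assumptions, not RAL's actual script):

```python
def draining_limit(idle_mc, running_mc, aggressive=20, gentle=1):
    """Pick a DEFRAG_MAX_CONCURRENT_DRAINING value: drain aggressively
    only when many multi-core jobs are idle but few are running."""
    if idle_mc > 10 and running_mc < idle_mc:
        return aggressive
    return gentle

# The cron would push the result to the defrag daemon, e.g. via
# condor_config_val -rset followed by condor_reconfig (assumed mechanism).
print(draining_limit(idle_mc=100, running_mc=5))   # -> 20 (demand unmet, drain hard)
print(draining_limit(idle_mc=2, running_mc=200))   # -> 1 (demand met, barely drain)
```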


  • Effect of changing the way WNs to drain are selected

    • No change in the number of concurrent draining machines

    • Rate of increase in the number of running multi-core jobs much higher

Running multi-core jobs


  • Recent ATLAS activity

Running & idle multi-core jobs

Gaps in submission by ATLAS result in loss of multi-core slots.

Number of WNs running multi-core jobs & draining WNs

  • Significantly reduced CPU wastage due to the cron

  • Aggressive draining: 3% waste

  • Less-aggressive draining: < 1% waste

Worker node health check

  • Startd cron

    • Checks for problems on worker nodes

      • Disk full or read-only

      • CVMFS

      • Swap

    • Prevents jobs from starting in the event of problems

      • If problem with ATLAS CVMFS, then only prevents ATLAS jobs from starting

    • Information about problems made available in machine ClassAds

      • Can easily identify WNs with problems, e.g.

        # condor_status -constraint 'NODE_STATUS =!= "All_OK"' -autoformat Machine NODE_STATUS

        lcg0980.gridpp.rl.ac.uk Problem: CVMFS for alice.cern.ch

        lcg0981.gridpp.rl.ac.uk Problem: CVMFS for cms.cern.ch Problem: CVMFS for lhcb.cern.ch

        lcg1069.gridpp.rl.ac.uk Problem: CVMFS for cms.cern.ch

        lcg1070.gridpp.rl.ac.uk Problem: CVMFS for cms.cern.ch

        lcg1197.gridpp.rl.ac.uk Problem: CVMFS for cms.cern.ch

        lcg1675.gridpp.rl.ac.uk Problem: Swap in use, less than 25% free
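
The check above uses HTCondor's standard startd cron mechanism; a configuration sketch (the script path, job name, period and the NODE_IS_HEALTHY attribute are assumptions; NODE_STATUS is the attribute shown in the output above):

```
# Run a health-check script periodically on each WN and publish its
# output (NODE_STATUS etc.) into the machine ClassAd
STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) HEALTHCHECK
STARTD_CRON_HEALTHCHECK_EXECUTABLE = /usr/local/bin/wn-healthcheck
STARTD_CRON_HEALTHCHECK_PERIOD = 10m

# Refuse to start jobs unless the node reports healthy
START = $(START) && (NODE_IS_HEALTHY =?= True)
```

The per-VO behaviour described above (an ATLAS CVMFS failure blocking only ATLAS jobs) would need a START expression that also inspects the job's VO attribute.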

Worker node health check

  • Can also put this data into Ganglia

    • RAL tests new CVMFS releases

      • Therefore it’s important for us to detect increases in CVMFS problems

    • Generally have only small numbers of WNs with issues

    • Example: a user’s “problematic” jobs affected CVMFS on many WNs

Jobs monitoring

  • CASTOR team at RAL have been testing Elasticsearch

    • Why not try using it with HTCondor?

  • The ELK stack

    • Logstash: parses log files

    • Elasticsearch: search & analyze data in real-time

    • Kibana: data visualization

  • Hardware setup

    • Test cluster of 13 servers (old diskservers & worker nodes)

      • But 3 servers could handle 16 GB of CASTOR logs per day

  • Adding HTCondor

    • Wrote config file for Logstash to enable history files to be parsed

    • Add Logstash to machines running schedds
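
A sketch of what such a Logstash configuration might look like (the path, record-separator pattern and output settings are assumptions, not RAL's actual config; HTCondor history files store one "Attribute = value" ClassAd per job, with records delimited by "*** ..." offset lines):

```
input {
  file {
    # Assumed schedd history location
    path => "/var/lib/condor/spool/history"
    codec => multiline {
      # Accumulate attribute lines until the "*** Offset = ..."
      # line that terminates each job record
      pattern => "^\*\*\*"
      negate => true
      what => "next"
    }
  }
}
filter {
  # Split each record into ClassAd attribute/value pairs
  # (keys and values may need trimming of surrounding whitespace)
  kv { field_split => "\n" value_split => "=" }
}
output {
  elasticsearch { }
}
```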

HTCondor history files





Jobs monitoring

  • Can see full job ClassAds

Jobs monitoring

  • Custom plots

    • E.g. completed jobs by schedd

Jobs monitoring

  • Custom dashboards

Jobs monitoring

  • Benefits

    • Easy to setup

      • Took less than a day to setup the initial cluster

    • Seems to be able to handle the load from HTCondor

      • For us (so far): < 1 GB, < 100K documents per day

    • Arbitrary queries

      • Seem faster than using native HTCondor commands (condor_history)

    • Horizontal scaling

      • Need more capacity? Just add more nodes

Summary
  • Due to scalability problems with Torque/Maui, migrated to HTCondor last year

  • We are happy with the choice we made based on our requirements

    • Confident that the functionality & scalability of HTCondor will meet our needs for the foreseeable future

  • Multi-core jobs working well

    • Looking forward to ATLAS and CMS running multi-core jobs at the same time

Future plans

  • HTCondor

    • Phase in cgroups onto WNs

  • Integration with private cloud

    • When production cloud is ready, want to be able to expand the batch system into the cloud

    • Using condor_rooster for provisioning resources

      • HEPiX Fall 2013: http://indico.cern.ch/event/214784/session/9/contribution/205

  • Monitoring

    • Move Elasticsearch into production

    • Try sending all HTCondor & ARC CE log files to Elasticsearch

      • E.g. could easily find information about a particular job from any log file
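
The cgroups roll-out mentioned above uses standard condor_starter knobs; a sketch (typical values, not confirmed RAL settings):

```
# Track and limit job resource usage via Linux cgroups
BASE_CGROUP = htcondor
# "soft" lets jobs exceed their requested memory when the node has spare RAM
CGROUP_MEMORY_LIMIT_POLICY = soft
```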

Future plans

  • Ceph

    • Have setup a 1.8 PB test Ceph storage system

    • Accessible from some WNs using Ceph FS

    • Setting up an ARC CE with shared filesystem (Ceph)

      • ATLAS testing with arcControlTower

        • Pulls jobs from PanDA, pushes jobs to ARC CEs

        • Unlike the normal pilot concept, jobs can have more precise resource requirements specified

      • Input files pre-staged & cached by ARC on Ceph