A Year of HTCondor at the RAL Tier-1

Ian Collier, Andrew Lahiff

STFC Rutherford Appleton Laboratory

HEPiX Spring 2014 Workshop

Outline
  • Overview of HTCondor at RAL
  • Computing elements
  • Multi-core jobs
  • Monitoring
Introduction
  • RAL is a Tier-1 for all 4 LHC experiments
    • In terms of Tier-1 computing requirements, RAL provides
      • 2% ALICE
      • 13% ATLAS
      • 8% CMS
      • 32% LHCb
    • Also support ~12 non-LHC experiments, including non-HEP
  • Computing resources
    • 784 worker nodes, over 14K cores
    • Generally have 40-60K jobs submitted per day
  • Torque / Maui had been used for many years
    • Many issues
    • Severity & number of problems increased as size of farm increased
    • In 2012 decided it was time to start investigating moving to a new batch system
Choosing a new batch system
  • Considered, tested & eventually rejected the following
    • LSF, Univa Grid Engine*
      • Requirement: avoid commercial products unless absolutely necessary
    • Open source Grid Engines
      • Competing products; not clear which has a long-term future
      • Communities appear less active than those of HTCondor & SLURM
      • Existing Tier-1s running Grid Engine use the commercial version
    • Torque 4 / Maui
      • Maui problematic
      • Torque 4 seems less scalable than alternatives (but better than Torque 2)
    • SLURM
      • Carried out extensive testing & comparison with HTCondor
      • Found that for our use case
        • Very fragile, easy to break
        • Unable to get it working reliably above 6,000 running jobs

* Only tested open source Grid Engine, not Univa Grid Engine

Choosing a new batch system
  • HTCondor chosen as replacement for Torque/Maui
    • Has the features we require
    • Seems very stable
    • Easily able to run 16,000 simultaneous jobs
      • Didn’t do any tuning – it “just worked”
      • Have since tested > 30,000 running jobs
    • More customizable than the other batch systems we considered
Migration to HTCondor
  • Strategy
    • Start with a small test pool
    • Gain experience & slowly move resources from Torque / Maui
  • Migration

    • 2012 Aug: Started evaluating alternatives to Torque / Maui
      (LSF, Grid Engine, Torque 4, HTCondor, SLURM)
    • 2013 Jun: Began testing HTCondor with ATLAS & CMS
      (~1000 cores from old WNs beyond MoU commitments)
    • 2013 Aug: Choice of HTCondor approved by management
    • 2013 Sep: HTCondor declared a production service;
      moved 50% of pledged CPU resources to HTCondor
    • 2013 Nov: Migrated remaining resources to HTCondor

Experience so far
  • Experience
    • Very stable operation
      • Generally just ignore the batch system & everything works fine
      • Staff don’t need to spend all their time fire-fighting problems
    • No changes needed as the HTCondor pool increased in size from ~1000 to ~14000 cores
    • Job start rate much higher than Torque / Maui even when throttled
      • Farm utilization much better
    • Very good support
Problems
  • A few issues found, but fixed quickly by developers
    • Found that job submission hung when one of an HA pair of central managers was down
      • Fixed & released in 8.0.2
    • Found problem affecting HTCondor-G job submission to ARC with HTCondor as LRMS
      • Fixed & released in 8.0.5
    • Experienced jobs dying 2 hours after network break between CEs and WNs
      • Fixed & released in 8.1.4
Computing elements
  • All job submission to RAL is via the Grid
    • No local users
  • Currently have 5 CEs
    • 2 CREAM CEs
    • 3 ARC CEs
  • CREAM doesn’t currently support HTCondor
    • We developed the missing functionality ourselves
    • Will feed this back so that it can be included in an official release
  • ARC is a better fit
    • But didn't originally handle partitionable slots, passing CPU/memory requirements to HTCondor, …
    • We wrote lots of patches, all included in the recent 4.1.0 release
  • This will make it easier for more European sites to move to HTCondor
Computing elements
  • ARC CE experience
    • Have run almost 9 million jobs so far across our 3 ARC CEs
    • Generally ignore them and they “just work”
    • VOs
      • ATLAS & CMS fine from the beginning
      • LHCb added ability to submit to ARC CEs to DIRAC
        • Seem to be ready to move entirely to ARC
      • ALICE not yet able to submit to ARC
        • They have said they will work on this
      • Non-LHC VOs
        • Some use DIRAC, which now can submit to ARC
        • Others use EMI WMS, which can submit to ARC
  • CREAM CE status
    • Plan to phase-out CREAM CEs this year
HTCondor & ARC in the UK
  • Since the RAL Tier-1 migrated, other sites in the UK have started moving to HTCondor and/or ARC
    • RAL T2: HTCondor + ARC (in production)
    • Bristol: HTCondor + ARC (in production)
    • Oxford: HTCondor + ARC (small pool in production, migration in progress)
    • Durham: SLURM + ARC
    • Glasgow: testing HTCondor + ARC
    • Liverpool: testing HTCondor
  • 7 more sites considering moving to HTCondor or SLURM
  • Configuration management: community effort
    • The Tier-2s using HTCondor and ARC have been sharing Puppet modules
Multi-core jobs
  • Current situation
    • ATLAS have been running multi-core jobs at RAL since November
    • CMS started submitting multi-core jobs in early May
    • Interest so far only for multi-core jobs, not whole-node jobs
      • Only 8-core jobs
  • Our aims
    • Fully dynamic
      • No manual partitioning of resources
    • Number of running multi-core jobs determined by fairshares
Getting multi-core jobs to work
  • Job submission
    • Haven’t setup dedicated multi-core queues
    • VO has to request how many cores they want in their JDL, e.g.

(count=8)

  • Worker nodes configured to use partitionable slots
    • Resources of each WN (CPU, memory, …) divided up as necessary amongst jobs
  • Set up multi-core groups & associated fairshares (a configuration sketch follows this list)
    • HTCondor configured to assign multi-core jobs to the appropriate groups
  • Adjusted the order in which the negotiator considers groups
    • Consider multi-core groups before single core groups
      • 8 free cores are “expensive” to obtain, so try not to lose them to single core jobs too quickly
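
A minimal configuration sketch of the above, assuming flat group names, illustrative quota values and a regexp-based sort expression; none of these values are RAL's actual production settings:

  # Worker nodes: a single partitionable slot owning all of the machine's resources
  NUM_SLOTS = 1
  NUM_SLOTS_TYPE_1 = 1
  SLOT_TYPE_1 = cpus=100%, mem=100%, disk=100%, swap=100%
  SLOT_TYPE_1_PARTITIONABLE = True

  # Central manager: separate accounting groups for multi-core work, with
  # fairshares expressed as dynamic quotas (jobs are mapped into these groups
  # elsewhere, e.g. by setting AccountingGroup in the job ad at submission)
  GROUP_NAMES = group_atlas, group_atlas_multicore, group_cms, group_cms_multicore
  GROUP_QUOTA_DYNAMIC_group_atlas           = 0.10
  GROUP_QUOTA_DYNAMIC_group_atlas_multicore = 0.05
  GROUP_ACCEPT_SURPLUS = True

  # Negotiate for multi-core groups before single-core groups
  # (groups with lower sort values are considered first)
  GROUP_SORT_EXPR = ifThenElse(AccountingGroup =?= "<none>", 3.4e38, \
                    ifThenElse(regexp("multicore", AccountingGroup), 1, 2))

Putting the multi-core groups first in the negotiation order is what protects freshly drained cores from being grabbed immediately by single-core jobs.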
Getting multi-core jobs to work
  • If lots of single-core jobs are idle & running, how does a multi-core job start?
    • By default it probably won’t
  • condor_defrag daemon
    • Finds WNs to drain, triggers draining & cancels draining as required
      • Configuration changes from the defaults (see the sketch after this list):
        • Drain 8 cores only, not whole WNs
      • Pick WNs to drain based on how many cores they have that can be freed up
        • E.g. getting 8 free CPUs by draining a full 32-core WN is generally faster than draining a full 8-core WN
    • Demand for multi-core jobs not known by condor_defrag
      • Set up a simple cron to adjust the number of concurrently draining WNs based on demand
        • If many idle multi-core jobs but few running, drain aggressively
        • Otherwise very little draining
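
A sketch of the corresponding condor_defrag settings; the values and the rank expression are illustrative assumptions, not RAL's production configuration:

  # Run the defrag daemon on the central manager
  DAEMON_LIST = $(DAEMON_LIST) DEFRAG

  # Treat a WN as "whole" (and stop draining it) once 8 cores are free,
  # rather than requiring the entire machine to be empty
  DEFRAG_WHOLE_MACHINE_EXPR = Cpus >= 8

  # Prefer WNs from which 8 free cores can be obtained fastest, e.g. machines
  # with many cores or with most of the 8 already free (illustrative expression)
  DEFRAG_RANK = ifThenElse(Cpus >= 8, -10, (TotalCpus - Cpus) / (8.0 - Cpus))

  # Limits that the cron raises or lowers (e.g. via condor_config_val -rset
  # followed by condor_reconfig) depending on idle vs running multi-core demand
  DEFRAG_MAX_CONCURRENT_DRAINING = 10
  DEFRAG_DRAINING_MACHINES_PER_HOUR = 20
  DEFRAG_INTERVAL = 300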
Results
  • Effect of changing the way WNs to drain are selected
    • No change in the number of concurrent draining machines
    • Rate of increase in the number of running multi-core jobs much higher

(Plot: running multi-core jobs over time)

Results
  • Recent ATLAS activity

(Plot: running & idle multi-core jobs; gaps in submission by ATLAS result in a loss of multi-core slots)

(Plot: number of WNs running multi-core jobs & draining WNs)

  • Significantly reduced CPU wastage due to the cron
    • Aggressive draining: 3% waste
    • Less-aggressive draining: < 1% waste
Worker node health check
  • Startd cron (a configuration sketch follows the example output below)
    • Checks for problems on worker nodes
      • Disk full or read-only
      • CVMFS
      • Swap
    • Prevents jobs from starting in the event of problems
      • If problem with ATLAS CVMFS, then only prevents ATLAS jobs from starting
    • Information about problems made available in machine ClassAds
      • Can easily identify WNs with problems, e.g.

# condor_status -constraint 'NODE_STATUS =!= "All_OK"' -autoformat Machine NODE_STATUS

lcg0980.gridpp.rl.ac.uk Problem: CVMFS for alice.cern.ch

lcg0981.gridpp.rl.ac.uk Problem: CVMFS for cms.cern.ch Problem: CVMFS for lhcb.cern.ch

lcg1069.gridpp.rl.ac.uk Problem: CVMFS for cms.cern.ch

lcg1070.gridpp.rl.ac.uk Problem: CVMFS for cms.cern.ch

lcg1197.gridpp.rl.ac.uk Problem: CVMFS for cms.cern.ch

lcg1675.gridpp.rl.ac.uk Problem: Swap in use, less than 25% free
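
A minimal sketch of how such a health check could be wired up via a startd cron, assuming a hypothetical script /usr/local/libexec/wn_health_check that prints machine ClassAd attributes such as NODE_STATUS; the name and period are illustrative, not RAL's actual configuration:

  # Run the health-check script periodically on each WN and merge its output
  # (lines such as  NODE_STATUS = "All_OK"  or  NODE_STATUS = "Problem: ...")
  # into the machine ClassAd
  STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) WNHEALTH
  STARTD_CRON_WNHEALTH_EXECUTABLE = /usr/local/libexec/wn_health_check
  STARTD_CRON_WNHEALTH_PERIOD = 10m
  STARTD_CRON_WNHEALTH_MODE = Periodic

  # Only start jobs on nodes reporting themselves healthy
  # (RAL's real policy is finer-grained, e.g. blocking only ATLAS jobs when
  #  only the ATLAS CVMFS repository has a problem)
  START = $(START) && (NODE_STATUS =?= "All_OK")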

Worker node health check
  • Can also put this data into Ganglia
    • RAL tests new CVMFS releases
      • Therefore it's important for us to detect increases in CVMFS problems
    • Generally have only small numbers of WNs with issues
    • Example: a user's "problematic" jobs affected CVMFS on many WNs
Jobs monitoring
  • CASTOR team at RAL have been testing Elasticsearch
    • Why not try using it with HTCondor?
  • The ELK stack
    • Logstash: parses log files
    • Elasticsearch: search & analyze data in real-time
    • Kibana: data visualization
  • Hardware setup
    • Test cluster of 13 servers (old diskservers & worker nodes)
      • But 3 servers could handle 16 GB of CASTOR logs per day
  • Adding HTCondor
    • Wrote config file for Logstash to enable history files to be parsed
    • Add Logstash to machines running schedds

HTCondor history files → Logstash → Elasticsearch → Kibana

Jobs monitoring
  • Can see full job ClassAds
Jobs monitoring
  • Custom plots
    • E.g. completed jobs by schedd
Jobs monitoring
  • Custom dashboards
Jobs monitoring
  • Benefits
    • Easy to set up
      • Took less than a day to set up the initial cluster
    • Seems to be able to handle the load from HTCondor
      • For us (so far): < 1 GB, < 100K documents per day
    • Arbitrary queries
      • Seem faster than using native HTCondor commands (condor_history)
    • Horizontal scaling
      • Need more capacity? Just add more nodes
Summary
  • Due to scalability problems with Torque/Maui, migrated to HTCondor last year
  • We are happy with the choice we made based on our requirements
    • Confident that the functionality & scalability of HTCondor will meet our needs for the foreseeable future
  • Multi-core jobs working well
    • Looking forward to ATLAS and CMS running multi-core jobs at the same time
Future plans
  • HTCondor
    • Phase in cgroups on the WNs (see the sketch at the end of this slide)
  • Integration with private cloud
    • When production cloud is ready, want to be able to expand the batch system into the cloud
    • Using condor_rooster for provisioning resources
      • HEPiX Fall 2013: http://indico.cern.ch/event/214784/session/9/contribution/205
  • Monitoring
    • Move Elasticsearch into production
    • Try sending all HTCondor & ARC CE log files to Elasticsearch
      • E.g. could easily find information about a particular job from any log file
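
A minimal sketch of what enabling cgroups on the WNs could look like, assuming HTCondor was built with cgroup support and an htcondor cgroup is mounted on the node (e.g. via cgconfig on SL6); the values are assumptions, not a finalized RAL configuration:

  # Place each job's processes under the 'htcondor' cgroup for tracking and limits
  BASE_CGROUP = htcondor

  # Enforce memory limits softly (jobs may exceed their request while free RAM exists)
  CGROUP_MEMORY_LIMIT_POLICY = soft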
Future plans
  • Ceph
    • Have set up a 1.8 PB test Ceph storage system
    • Accessible from some WNs using Ceph FS
    • Setting up an ARC CE with shared filesystem (Ceph)
      • ATLAS testing with arcControlTower
        • Pulls jobs from PanDA, pushes jobs to ARC CEs
        • Unlike the normal pilot concept, jobs can have more precise resource requirements specified
      • Input files pre-staged & cached by ARC on Ceph