
Containers & Other Technology at Tier-1: Overview & Future

This presentation provides an overview of worker nodes and containers at RAL, including past, present, and future container cluster managers and other technologies. It explores the history of isolating jobs in batch systems and discusses the use of Docker and Singularity for running jobs in containers on worker nodes. The limitations and advantages of each technology are discussed.


Presentation Transcript


  1. Containers and other technology at the Tier-1 Andrew Lahiff 7th April 2017, GridPP38 - Sussex

  2. Overview • Worker nodes & containers at RAL • past, present & future • Container cluster managers • Other technology

  3. Introduction • Long history of trying to isolate jobs in our batch system • protect the machine from jobs • protect the jobs from the machine • protect one job from another • Back in the days of Torque, things were very simple • jobs ran with different uids • jobs which used too much memory were killed • jobs which used too much CPU or wall time were killed

  4. Introduction • Since migrating to HTCondor, Linux kernel functionality has improved our ability to isolate jobs • cgroups for resource limits & monitoring, ensuring processes can’t escape the batch system • CPU, memory, ... • PID namespaces • processes in a job can’t see any other processes on the host • mount namespaces • /tmp, /var/tmp inside each job is unique • This has (mostly) worked well
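
  To make the bullet points above concrete, here is a minimal, hypothetical condor_config sketch of the kind of isolation settings involved; knob names should be checked against the HTCondor version in use:

      # Hypothetical condor_config excerpt for a worker node
      # Track and limit each job's CPU and memory via cgroups
      BASE_CGROUP = htcondor
      CGROUP_MEMORY_LIMIT_POLICY = hard

      # Give each job its own PID namespace, so its processes
      # cannot see any other processes on the host
      USE_PID_NAMESPACES = True

      # Give each job a private /tmp and /var/tmp via a mount namespace
      MOUNT_UNDER_SCRATCH = /tmp,/var/tmp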

  5. Introduction • Limitation: all jobs use the same root filesystem as the host • this strongly ties the jobs to the host OS • SL6 host: can only run SL6 jobs • software/OS dictated by LHC experiments • Possible solution? HTCondor’s named chroot functionality • specify a directory containing an alternative root filesystem • Problem • difficult to create the environments • never really took off • Successfully tested at RAL with CMS jobs in early 2015 • SL6 jobs running on SL7 machines

  6. HTCondor Docker universe • Docker universe • By default HTCondor checks whether Docker is installed • HTCondor runs each job in a Docker container • History • Introduced in HTCondor 8.3.6 in June 2015 • Successfully ran LHC jobs at RAL in 2015 • jobs in SL6 containers on SL7 worker nodes • Lots of bug fixes & improvements made • Nebraska Tier 2 migrated fully to the Docker universe in summer 2016
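
  For illustration only (not the RAL configuration), a minimal HTCondor submit description using the Docker universe looks roughly like this; the image name and scripts are placeholders:

      # Hypothetical Docker-universe submit file
      universe       = docker
      docker_image   = centos:6        # placeholder image
      executable     = run_job.sh      # placeholder wrapper script
      output         = job.out
      error          = job.err
      log            = job.log
      request_cpus   = 1
      request_memory = 2000
      queue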

  7. HTCondor Docker universe • containers run as pool account users, not root • users don’t have access to the Docker daemon at all • no way for users to specify arbitrary images via the Grid • [Diagram: a worker node where HTCondor drives the Docker engine, running each job in its own container]

  8. HTCondor Docker universe • Running jobs in containers on worker nodes now much more important for us • Echo • in order to get the best performance we want to run an xrootd gateway on every worker node • this requires SL7 on worker nodes now

  9. Worker nodes • First step: move to SL7 but with as few changes as possible • many things (CVMFS, config) bind-mounted into the containers • [Diagram: an SL6 worker node alongside an SL7 worker node running jobs in a CentOS 6 image; HTCondor, machine job features, CVMFS, /etc/grid-security, /etc/arc, /etc/<vo>, grid config files, VO-dependency RPMs and glexec appear on both, bind-mounted into the containers on SL7]
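
  As a sketch of how host directories end up inside every Docker-universe container, HTCondor lets the site declare named volumes in the startd configuration; the names below are illustrative and the exact knobs and semantics should be checked against the local HTCondor version:

      # Hypothetical condor_config excerpt: bind-mount host paths into each job
      DOCKER_VOLUMES            = CVMFS, GRIDSEC, ARC
      DOCKER_VOLUME_DIR_CVMFS   = /cvmfs
      DOCKER_VOLUME_DIR_GRIDSEC = /etc/grid-security
      DOCKER_VOLUME_DIR_ARC     = /etc/arc
      DOCKER_MOUNT_VOLUMES      = CVMFS, GRIDSEC, ARC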

  10. CVMFS • There are a few options, e.g. • static CVMFS mounts, bind mount from host • CERN’s CVMFS Docker volume plugin • autofs • bind mount from host, using shared mount propagation • We’re using autofs • for multi-VO sites this seems the most sensible choice • discovered a problem when restarting autofs on SL7 • https://sft.its.cern.ch/jira/browse/CVM-1200 • a workaround is available & is included in CVMFS 2.3.5
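
  The key detail with the autofs option is mount propagation: repositories that autofs mounts on the host after a container has started must still become visible inside it. A minimal illustration with plain docker run (the Docker universe arranges the equivalent); paths and image are placeholders:

      # On the host: make the autofs-managed /cvmfs a shared mount so that
      # later automounts propagate into running containers
      mount --make-shared /cvmfs

      # Bind-mount /cvmfs into the container with shared propagation
      # (requires a Docker version that supports propagation flags on -v)
      docker run --rm -v /cvmfs:/cvmfs:shared centos:6 ls /cvmfs/atlas.cern.ch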

  11. It’s complicated • With latest Docker for RHEL7 • default storage driver is OverlayFS: using a standard XFS filesystem • get lots of kernel errors • host eventually dies (future Docker releases will refuse to run in this situation) • with device-mapper storage driver • bugs in the RHEL7.3 kernel (it’s old!) result in occasional problems deleting containers • We’re using OverlayFS with an XFS partition formatted correctly (“ftype=1”) • no problems so far
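
  A quick way to check whether the filesystem backing Docker has the d_type support that OverlayFS needs, and how to format a partition correctly when provisioning (the device name is just an example):

      # Want to see "ftype=1" in the naming section of the output
      xfs_info /var/lib/docker | grep ftype

      # When (re)creating the partition, enable ftype at mkfs time;
      # it cannot be changed on an existing filesystem
      mkfs.xfs -n ftype=1 /dev/sdb1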

  12. Worker nodes • Using the Docker universe, the pilot jobs are isolated from the host • but what about the payload jobs? • [Diagram: a worker node running a container that holds the pilot job and its payloads]

  13. Containers & unprivileged users • Docker engine • daemon runs as root • need root access to run containers • Many tools have been developed to run containers on batch systems as unprivileged users • Shifter (NERSC) • Singularity (LBL) • udocker (INDIGO - DataCloud) • bdocker (INDIGO – DataCloud, upcoming INDIGO-2 release) • WLCG has settled on Singularity • also very popular in US HPC sites

  14. Singularity • What it does • allows a user to run a process (as the same user) in a specified environment • provides • file isolation • process isolation • How does this compare to Docker? Docker has more features, including: • more namespaces • cgroups for resource monitoring & limiting • CPU, memory, swap, disk IO, ... • network isolation • Linux capabilities

  15. Singularity • Experiments can run Singularity containers themselves • E.g. the CMS model: • payload jobs cannot see other processes on the host or even processes from the pilot • payload jobs cannot see any files from the pilot • [Diagram: a worker node running a Docker container with the pilot job, which starts each payload in its own Singularity container]
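
  As a rough sketch of this model (the image path, bind mounts and flags are placeholders, and exact options vary between Singularity versions), a pilot running inside its Docker container can start each payload with something like:

      # Run the payload as the same unprivileged user, in an SL6 image,
      # with its own PID namespace and only explicitly bound directories visible
      singularity exec --contain --pid \
          -B /cvmfs -B /path/to/payload-workdir \
          /cvmfs/some-repo/images/sl6.img \
          ./run_payload.sh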

  16. Computing elements • My view (ARC CE, HTCondor CE) • experiments use CEs to acquire & provision resources • e.g. ATLAS & CMS can request CPUs & memory as needed • keep just a single queue per CE • could specify OS using RTE in XRSL, e.g. • (runtimeenvironment=ENV/OS/EL6) • (runtimeenvironment=ENV/OS/EL7) • DIRAC has a gliteWMS-style view of the Grid • possibly will need to set up dedicated CEs for EL7
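
  For context, such an RTE request sits inside an ordinary ARC job description; a minimal, hypothetical XRSL sketch (all values are placeholders):

      &(executable="run.sh")
       (jobname="el7-test")
       (runtimeenvironment="ENV/OS/EL7")
       (count=1)
       (memory=2000)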

  17. Monitoring & traceability • Greater visibility into what each job is doing • including networking • see what processes are running in each job (without relying on uids) • resource usage metrics

  18. Example CMS job • [screenshot of monitoring data for an example CMS job]

  19. Current status • ~30% batch farm has been migrated to SL7 • Have run jobs from • all LHC experiments • other VOs (ENMR, ILC, SNO+, Pheno, LSST, ...)

  20. Plans • Things in progress or planned (short/medium term) • Ceph xrootd gateways on worker nodes • also Ceph xrootd proxies on worker nodes • configure CEs to provide access to EL7 environments • provide Singularity in EL7 environments • CMS can migrate from glexec to Singularity at RAL • decommission pool accounts on worker nodes • they serve no purpose at all, per-slot users are far simpler • automated rolling reboots • use the etcd distributed key-value store for coordinating reboots (see the sketch below) • all worker nodes will drain & reboot themselves when necessary while maintaining MoU commitments
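
  Purely as an illustration of the reboot-coordination idea (not the RAL implementation), each drained worker node could try to take a short-lived lock in etcd before rebooting, so that only a bounded number of nodes are ever down at once; this assumes an etcd v2-style etcdctl:

      # Hypothetical reboot-coordination snippet run after draining completes.
      # "mk" creates the key only if it does not already exist, so it acts as a lock;
      # the TTL makes the lock expire even if this node never comes back.
      if etcdctl mk /rolling-reboot/lock "$(hostname)" --ttl 1800; then
          systemctl reboot
      else
          echo "another worker node holds the reboot lock; will retry later"
      fi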

  21. Further in the future • So now we have the ability to • run either SL6 or SL7 jobs from LHC experiments • run jobs with other (Linux) OSs • What if • someone wants to run SLURM MPI jobs? • someone wants to run Spark jobs? • we need more hypervisors for the cloud? • Move away from dedicated HTCondor worker nodes • more flexible, generic nodes • if needed as an HTCondor worker node, a scheduler can run all the appropriate containers • including HTCondor, CVMFS, ...

  22. Activities • Container cluster managers • Using Mesos as a platform for multiple computing activities & running services • Using Kubernetes as an abstraction across multiple clouds • In both cases • nodes don’t have any grid middleware installed • but can run grid worker nodes as needed • CVMFS is difficult currently • containers usually have private mount namespaces • therefore CVMFS from one container not visible anywhere else

  23. Kubernetes • RCUK Cloud Working Group Pilot Project investigating • portability between on-prem resources & public clouds • portability between multiple public clouds • What are we doing that’s different to previous work in HEP? • previous work all involved different methods of provisioning VMs using cloud APIs • all the major public clouds have different APIs • get locked-in to specific clouds • a lot of work to move to a different cloud • instead we’re using Kubernetes as an abstraction layer • only worry about the Kubernetes API • public clouds generally provide instant Kubernetes clusters

  24. Kubernetes • How Kubernetes is being used to run LHC jobs • create a pool of pilots which scales automatically depending on how much work is available (“vacuum” model) • squids for CVMFS & Frontier • Created by a single command, e.g. kubectl create -f atlas.yaml • [Diagram: a custom controller and horizontal pod autoscaler managing a pool of pilot pods, a squid replication controller behind a service with a stable VIP, and a proxy-renewal cron job]
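
  A much-simplified, hypothetical excerpt of what such a manifest can contain (image names and numbers are placeholders; the real atlas.yaml also includes the custom controller, proxy-renewal and squid pieces, and scales on the amount of work available rather than plain CPU usage):

      apiVersion: v1
      kind: Service                  # stable virtual IP in front of the squids
      metadata:
        name: squid
      spec:
        selector:
          app: squid
        ports:
        - port: 3128
      ---
      apiVersion: extensions/v1beta1
      kind: Deployment               # pool of pilot pods
      metadata:
        name: atlas-pilot
      spec:
        replicas: 2
        template:
          metadata:
            labels:
              app: atlas-pilot
          spec:
            containers:
            - name: pilot
              image: example/atlas-pilot:latest   # placeholder image
              resources:
                requests:
                  cpu: "1"
                  memory: 2Gi
      ---
      apiVersion: autoscaling/v1
      kind: HorizontalPodAutoscaler  # scales the pilot pool automatically
      metadata:
        name: atlas-pilot
      spec:
        scaleTargetRef:
          apiVersion: extensions/v1beta1
          kind: Deployment
          name: atlas-pilot
        minReplicas: 1
        maxReplicas: 100
        targetCPUUtilizationPercentage: 60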

  25. Kubernetes • So far • have successfully run CMS jobs at RAL, Google, AWS, Azure • have successfully run ATLAS & LHCb jobs at RAL • Ongoing work • A RAL Azure site is being set up within ATLAS • Azure blob storage has been added to RAL Dynafed • An FTS3 instance has been set up on Azure • Aim is to run ATLAS jobs on Azure • up to ~5000 concurrent cores • using Azure blob storage via Dynafed • Thanks to Google & Amazon for credits & to Microsoft for an Azure Research Award

  26. Other technology

  27. Storing container images • Images stored in a private Docker registry • no reliance on external services (including Amazon S3) • using Swift storage backend • two services • registry • auth server (authentication, ACLs, ...) • [Diagram: the Docker registry and its auth server in front of Ceph gateways exposing the Swift API]
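
  For illustration, the open-source Docker registry is pointed at Swift-compatible storage with a few lines in its config.yml; the endpoint and credentials below are placeholders:

      # Hypothetical registry config.yml excerpt using the Swift storage driver
      version: 0.1
      storage:
        swift:
          authurl: https://ceph-gw.example.ac.uk/auth/v1.0   # placeholder Swift endpoint
          username: registry-user                            # placeholder credentials
          password: secret
          container: docker-registry
      http:
        addr: :5000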

  28. Logstash • Increasing usage of Filebeat has led to a proliferation of VMs running Logstash • Started running multiple Logstash instances in containers on 2 machines as a first trial at consolidation • [Diagram: many Filebeat senders feeding Logstash instances, consolidated as containers on two machines]

  29. Load balancers • Using HAProxy & Keepalived as HA load balancers in front of some services • FTS for over a year • site & top BDII since January • Dynafed, OpenStack • Was particularly useful in hiding a recent Hyper-V incident from users • CERN likely to put HAProxy in front of their FTS instances soon
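
  A minimal sketch of the pattern (addresses, ports and names are placeholders): HAProxy balances requests across the backend service instances, while Keepalived moves a virtual IP between the two load-balancer hosts so a failure is invisible to clients:

      # Hypothetical haproxy.cfg excerpt (TCP passthrough to TLS backends)
      frontend fts_front
          mode tcp
          bind 10.0.0.100:8446             # virtual IP managed by Keepalived
          default_backend fts_back

      backend fts_back
          mode tcp
          balance roundrobin
          server fts01 10.0.1.11:8446 check
          server fts02 10.0.1.12:8446 check

      # Hypothetical keepalived.conf excerpt (on the primary load balancer)
      vrrp_instance VI_1 {
          state MASTER
          interface eth0
          virtual_router_id 51
          priority 101
          virtual_ipaddress {
              10.0.0.100
          }
      }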

  30. Monitoring infrastructure • Moving away from Ganglia to more modern & flexible tools • Telegraf (metrics collection) • InfluxDB (time series database) • Grafana (visualisation) • 3 InfluxDB instances • general services & head nodes • Ceph (Echo) • worker nodes • Currently have over 800 hosts sending metrics to InfluxDB
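
  As an illustration of how little per-host configuration this pipeline needs (the URL and database name are placeholders), a Telegraf agent ships basic host metrics to InfluxDB with something like:

      # Hypothetical /etc/telegraf/telegraf.conf excerpt
      [[outputs.influxdb]]
        urls = ["http://influxdb.example.ac.uk:8086"]   # placeholder endpoint
        database = "worker_nodes"                       # placeholder database

      # Standard host metrics
      [[inputs.cpu]]
      [[inputs.mem]]
      [[inputs.disk]]
      [[inputs.net]]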

  31. Summary • We’re currently migrating to the HTCondor Docker universe • jobs no longer depend on the OS version or software installed on worker nodes • gain a lot of flexibility • Also likely to provide Singularity • give experiments the possibility to run containers within their pilot jobs
