
Containers & Other Technology at Tier-1: Overview & Future

This presentation provides an overview of worker nodes and containers at RAL, including past, present, and future container cluster managers and other technologies. It explores the history of isolating jobs in batch systems and discusses the use of Docker and Singularity for running jobs in containers on worker nodes. The limitations and advantages of each technology are discussed.


Presentation Transcript


  1. Containers and other technology at the Tier-1 Andrew Lahiff 7th April 2017, GridPP38 - Sussex

  2. Overview • Worker nodes & containers at RAL • past, present & future • Container cluster managers • Other technology

  3. Introduction • Long history of trying to isolate jobs in our batch system • protect the machine from jobs • protect the jobs from the machine • protect one job from another • Back in the days of Torque, things were very simple • jobs ran with different uids • jobs which used too much memory were killed • jobs which used too much CPU or wall time were killed

  4. Introduction • Since migrating to HTCondor, Linux kernel functionality has improved our ability to isolate jobs • cgroups for resource limits & monitoring, ensuring processes can’t escape the batch system • CPU, memory, ... • PID namespaces • processes in a job can’t see any other processes on the host • mount namespaces • /tmp, /var/tmp inside each job is unique • This has (mostly) worked well
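
  To make the bullet points above concrete, here is a minimal, hypothetical condor_config sketch of the kind of isolation settings involved; knob names should be checked against the HTCondor version in use:

      # Hypothetical condor_config excerpt for a worker node
      # Track and limit each job's CPU and memory via cgroups
      BASE_CGROUP = htcondor
      CGROUP_MEMORY_LIMIT_POLICY = hard

      # Give each job its own PID namespace, so its processes
      # cannot see any other processes on the host
      USE_PID_NAMESPACES = True

      # Give each job a private /tmp and /var/tmp via a mount namespace
      MOUNT_UNDER_SCRATCH = /tmp,/var/tmp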

  5. Introduction • Limitation: all jobs use the same root filesystem as the host • this strongly ties the jobs to the host OS • SL6 host: can only run SL6 jobs • software/OS dictated by LHC experiments • Possible solution? HTCondor’s named chroot functionality • specify a directory containing an alternative root filesystem • Problem • difficult to create the environments • never really took off • Successfully tested at RAL with CMS jobs in early 2015 • SL6 jobs running on SL7 machines

  6. HTCondor Docker universe • Docker universe • By default HTCondor checks whether Docker is installed • HTCondor runs each job in a Docker container • History • Introduced in HTCondor 8.3.6 in June 2015 • Successfully ran LHC jobs at RAL in 2015 • jobs in SL6 containers on SL7 worker nodes • Lots of bug fixes & improvements made • Nebraska Tier 2 migrated fully to the Docker universe in summer 2016
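
  For illustration only (not the RAL configuration), a minimal HTCondor submit description using the Docker universe looks roughly like this; the image name and scripts are placeholders:

      # Hypothetical Docker-universe submit file
      universe       = docker
      docker_image   = centos:6        # placeholder image
      executable     = run_job.sh      # placeholder wrapper script
      output         = job.out
      error          = job.err
      log            = job.log
      request_cpus   = 1
      request_memory = 2000
      queue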

  7. HTCondor Docker universe • containers run as pool account users, not root • users don’t have access to the Docker daemon at all • no way for users to specify arbitrary images via the Grid • [Diagram: a worker node where HTCondor drives the Docker engine, running each job in its own container]

  8. HTCondor Docker universe • Running jobs in containers on worker nodes now much more important for us • Echo • in order to get the best performance we want to run an xrootd gateway on every worker node • this requires SL7 on worker nodes now

  9. Worker nodes • First step: move to SL7 but with as few changes as possible • many things (CVMFS, config) bind-mounted into the containers • [Diagram: an SL6 worker node alongside an SL7 worker node running jobs in a CentOS 6 image; HTCondor, machine job features, CVMFS, /etc/grid-security, /etc/arc, /etc/<vo>, grid config files, VO-dependency RPMs and glexec appear on both, bind-mounted into the containers on SL7]
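
  As a sketch of how host directories end up inside every Docker-universe container, HTCondor lets the site declare named volumes in the startd configuration; the names below are illustrative and the exact knobs and semantics should be checked against the local HTCondor version:

      # Hypothetical condor_config excerpt: bind-mount host paths into each job
      DOCKER_VOLUMES            = CVMFS, GRIDSEC, ARC
      DOCKER_VOLUME_DIR_CVMFS   = /cvmfs
      DOCKER_VOLUME_DIR_GRIDSEC = /etc/grid-security
      DOCKER_VOLUME_DIR_ARC     = /etc/arc
      DOCKER_MOUNT_VOLUMES      = CVMFS, GRIDSEC, ARC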

  10. CVMFS • There are a few options, e.g. • static CVMFS mounts, bind mount from host • CERN’s CVMFS Docker volume plugin • autofs • bind mount from host, using shared mount propagation • We’re using autofs • for multi-VO sites this seems the most sensible choice • discovered a problem when restarting autofs on SL7 • https://sft.its.cern.ch/jira/browse/CVM-1200 • a workaround is available & is included in CVMFS 2.3.5
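
  The key detail with the autofs option is mount propagation: repositories that autofs mounts on the host after a container has started must still become visible inside it. A minimal illustration with plain docker run (the Docker universe arranges the equivalent); paths and image are placeholders:

      # On the host: make the autofs-managed /cvmfs a shared mount so that
      # later automounts propagate into running containers
      mount --make-shared /cvmfs

      # Bind-mount /cvmfs into the container with shared propagation
      # (requires a Docker version that supports propagation flags on -v)
      docker run --rm -v /cvmfs:/cvmfs:shared centos:6 ls /cvmfs/atlas.cern.ch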

  11. It’s complicated • With latest Docker for RHEL7 • default storage driver is OverlayFS: using a standard XFS filesystem • get lots of kernel errors • host eventually dies (future Docker releases will refuse to run in this situation) • with device-mapper storage driver • bugs in the RHEL7.3 kernel (it’s old!) result in occasional problems deleting containers • We’re using OverlayFS with an XFS partition formatted correctly (“ftype=1”) • no problems so far
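
  A quick way to check whether the filesystem backing Docker has the d_type support that OverlayFS needs, and how to format a partition correctly when provisioning (the device name is just an example):

      # Want to see "ftype=1" in the naming section of the output
      xfs_info /var/lib/docker | grep ftype

      # When (re)creating the partition, enable ftype at mkfs time;
      # it cannot be changed on an existing filesystem
      mkfs.xfs -n ftype=1 /dev/sdb1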

  12. Worker nodes • Using the Docker universe, the pilot jobs are isolated from the host • but what about the payload jobs? • [Diagram: a worker node running a container that holds the pilot job and its payloads]

  13. Containers & unprivileged users • Docker engine • daemon runs as root • need root access to run containers • Many tools have been developed to run containers on batch systems as unprivileged users • Shifter (NERSC) • Singularity (LBL) • udocker (INDIGO - DataCloud) • bdocker (INDIGO – DataCloud, upcoming INDIGO-2 release) • WLCG has settled on Singularity • also very popular in US HPC sites

  14. Singularity • What it does • allows a user to run a process (as the same user) in a specified environment • provides • file isolation • process isolation • How does this compare to Docker? Docker has more features, including: • more namespaces • cgroups for resource monitoring & limiting • CPU, memory, swap, disk IO, ... • network isolation • Linux capabilities

  15. Singularity • Experiments can run Singularity containers themselves • E.g. the CMS model: • payload jobs cannot see other processes on the host or even processes from the pilot • payload jobs cannot see any files from the pilot • [Diagram: a worker node running a Docker container with the pilot job, which starts each payload in its own Singularity container]
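
  As a rough sketch of this model (the image path, bind mounts and flags are placeholders, and exact options vary between Singularity versions), a pilot running inside its Docker container can start each payload with something like:

      # Run the payload as the same unprivileged user, in an SL6 image,
      # with its own PID namespace and only explicitly bound directories visible
      singularity exec --contain --pid \
          -B /cvmfs -B /path/to/payload-workdir \
          /cvmfs/some-repo/images/sl6.img \
          ./run_payload.sh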

  16. Computing elements • My view (ARC CE, HTCondor CE) • experiments use CEs to acquire & provision resources • e.g. ATLAS & CMS can request CPUs & memory as needed • keep just a single queue per CE • could specify OS using RTE in XRSL, e.g. • (runtimeenvironment=ENV/OS/EL6) • (runtimeenvironment=ENV/OS/EL7) • DIRAC has a gliteWMS-style view of the Grid • possibly will need to set up dedicated CEs for EL7
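
  For context, such an RTE request sits inside an ordinary ARC job description; a minimal, hypothetical XRSL sketch (all values are placeholders):

      &(executable="run.sh")
       (jobname="el7-test")
       (runtimeenvironment="ENV/OS/EL7")
       (count=1)
       (memory=2000)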

  17. Monitoring & traceability • Greater visibility into what each job is doing • including networking • see what processes are running in each job (without relying on uids) • resource usage metrics

  18. Example CMS job • [screenshot of monitoring data for an example CMS job]

  19. Current status • ~30% batch farm has been migrated to SL7 • Have run jobs from • all LHC experiments • other VOs (ENMR, ILC, SNO+, Pheno, LSST, ...)

  20. Plans • Things in progress or planned (short/medium term) • Ceph xrootd gateways on worker nodes • also Ceph xrootd proxies on worker nodes • configure CEs to provide access to EL7 environments • provide Singularity in EL7 environments • CMS can migrate from glexec to Singularity at RAL • decommission pool accounts on worker nodes • they serve no purpose at all, per-slot users are far simpler • automated rolling reboots • use the etcd distributed key-value store for coordinating reboots (see the sketch below) • all worker nodes will drain & reboot themselves when necessary while maintaining MoU commitments
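
  Purely as an illustration of the reboot-coordination idea (not the RAL implementation), each drained worker node could try to take a short-lived lock in etcd before rebooting, so that only a bounded number of nodes are ever down at once; this assumes an etcd v2-style etcdctl:

      # Hypothetical reboot-coordination snippet run after draining completes.
      # "mk" creates the key only if it does not already exist, so it acts as a lock;
      # the TTL makes the lock expire even if this node never comes back.
      if etcdctl mk /rolling-reboot/lock "$(hostname)" --ttl 1800; then
          systemctl reboot
      else
          echo "another worker node holds the reboot lock; will retry later"
      fi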

  21. Further in the future • So now we have the ability to • run either SL6 or SL7 jobs from LHC experiments • run jobs with other (Linux) OSs • What if • someone wants to run SLURM MPI jobs? • someone wants to run Spark jobs? • we need more hypervisors for the cloud? • Move away from dedicated HTCondor worker nodes • more flexible, generic nodes • if needed as an HTCondor worker node, a scheduler can run all the appropriate containers • including HTCondor, CVMFS, ...

  22. Activities • Container cluster managers • Using Mesos as a platform for multiple computing activities & running services • Using Kubernetes as an abstraction across multiple clouds • In both cases • nodes don’t have any grid middleware installed • but can run grid worker nodes as needed • CVMFS is difficult currently • containers usually have private mount namespaces • therefore CVMFS from one container not visible anywhere else

  23. Kubernetes • RCUK Cloud Working Group Pilot Project investigating • portability between on-prem resources & public clouds • portability between multiple public clouds • What are we doing that’s different to previous work in HEP? • previous work all involved different methods of provisioning VMs using cloud APIs • all the major public clouds have different APIs • get locked-in to specific clouds • a lot of work to move to a different cloud • instead we’re using Kubernetes as an abstraction layer • only worry about the Kubernetes API • public clouds generally provide instant Kubernetes clusters

  24. Kubernetes • How Kubernetes is being used to run LHC jobs • create a pool of pilots which scales automatically depending on how much work is available (“vacuum” model) • squids for CVMFS & Frontier • Created by a single command, e.g. kubectl create -f atlas.yaml • [Diagram: a custom controller and horizontal pod autoscaler managing a pool of pilot pods, a squid replication controller behind a service with a stable VIP, and a proxy-renewal cron job]
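
  A much-simplified, hypothetical excerpt of what such a manifest can contain (image names and numbers are placeholders; the real atlas.yaml also includes the custom controller, proxy-renewal and squid pieces, and scales on the amount of work available rather than plain CPU usage):

      apiVersion: v1
      kind: Service                  # stable virtual IP in front of the squids
      metadata:
        name: squid
      spec:
        selector:
          app: squid
        ports:
        - port: 3128
      ---
      apiVersion: extensions/v1beta1
      kind: Deployment               # pool of pilot pods
      metadata:
        name: atlas-pilot
      spec:
        replicas: 2
        template:
          metadata:
            labels:
              app: atlas-pilot
          spec:
            containers:
            - name: pilot
              image: example/atlas-pilot:latest   # placeholder image
              resources:
                requests:
                  cpu: "1"
                  memory: 2Gi
      ---
      apiVersion: autoscaling/v1
      kind: HorizontalPodAutoscaler  # scales the pilot pool automatically
      metadata:
        name: atlas-pilot
      spec:
        scaleTargetRef:
          apiVersion: extensions/v1beta1
          kind: Deployment
          name: atlas-pilot
        minReplicas: 1
        maxReplicas: 100
        targetCPUUtilizationPercentage: 60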

  25. Kubernetes • So far • have successfully run CMS jobs at RAL, Google, AWS, Azure • have successfully run ATLAS & LHCb jobs at RAL • Ongoing work • A RAL Azure site is being set up within ATLAS • Azure blob storage has been added to RAL Dynafed • An FTS3 instance has been set up on Azure • Aim is to run ATLAS jobs on Azure • up to ~5000 concurrent cores • using Azure blob storage via Dynafed • Thanks to Google & Amazon for credits & to Microsoft for an Azure Research Award

  26. Other technology

  27. Storing container images • Images stored in a private Docker registry • no reliance on external services (including Amazon S3) • using Swift storage backend • two services • registry • auth server (authentication, ACLs, ...) • [Diagram: the Docker registry and its auth server in front of Ceph gateways exposing the Swift API]
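
  For illustration, the open-source Docker registry is pointed at Swift-compatible storage with a few lines in its config.yml; the endpoint and credentials below are placeholders:

      # Hypothetical registry config.yml excerpt using the Swift storage driver
      version: 0.1
      storage:
        swift:
          authurl: https://ceph-gw.example.ac.uk/auth/v1.0   # placeholder Swift endpoint
          username: registry-user                            # placeholder credentials
          password: secret
          container: docker-registry
      http:
        addr: :5000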

  28. Logstash • Increasing usage of Filebeat has led to a proliferation of VMs running Logstash • Started running multiple Logstash instances in containers on 2 machines as a first trial at consolidation • [Diagram: many Filebeat senders feeding Logstash instances, consolidated as containers on two machines]

  29. Load balancers • Using HAProxy & Keepalived as HA load balancers in front of some services • FTS for over a year • site & top BDII since January • Dynafed, OpenStack • Was particularly useful in hiding a recent Hyper-V incident from users • CERN likely to put HAProxy in front of their FTS instances soon
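
  A minimal sketch of the pattern (addresses, ports and names are placeholders): HAProxy balances requests across the backend service instances, while Keepalived moves a virtual IP between the two load-balancer hosts so a failure is invisible to clients:

      # Hypothetical haproxy.cfg excerpt (TCP passthrough to TLS backends)
      frontend fts_front
          mode tcp
          bind 10.0.0.100:8446             # virtual IP managed by Keepalived
          default_backend fts_back

      backend fts_back
          mode tcp
          balance roundrobin
          server fts01 10.0.1.11:8446 check
          server fts02 10.0.1.12:8446 check

      # Hypothetical keepalived.conf excerpt (on the primary load balancer)
      vrrp_instance VI_1 {
          state MASTER
          interface eth0
          virtual_router_id 51
          priority 101
          virtual_ipaddress {
              10.0.0.100
          }
      }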

  30. Monitoring infrastructure • Moving away from Ganglia to more modern & flexible tools • Telegraf (metrics collection) • InfluxDB (time series database) • Grafana (visualisation) • 3 InfluxDB instances • general services & head nodes • Ceph (Echo) • worker nodes • Currently have over 800 hosts sending metrics to InfluxDB
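
  As an illustration of how little per-host configuration this pipeline needs (the URL and database name are placeholders), a Telegraf agent ships basic host metrics to InfluxDB with something like:

      # Hypothetical /etc/telegraf/telegraf.conf excerpt
      [[outputs.influxdb]]
        urls = ["http://influxdb.example.ac.uk:8086"]   # placeholder endpoint
        database = "worker_nodes"                       # placeholder database

      # Standard host metrics
      [[inputs.cpu]]
      [[inputs.mem]]
      [[inputs.disk]]
      [[inputs.net]]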

  31. Summary • We’re currently migrating to the HTCondor Docker universe • jobs no longer depend on the OS version or software installed on worker nodes • gain a lot of flexibility • Also likely to provide Singularity • give experiments the possibility to run containers within their pilot jobs
