
STFC Cloud

Explore the background, features, and design considerations of OpenStack and OpenNebula for building the STFC Cloud. Learn about capacity, multi-tenancy, high availability, private networks, VXLAN performance, hardware, and use cases.


Presentation Transcript


  1. STFC Cloud Alexander Dibbo

  2. Contents • Background • OpenNebula • OpenStack • Design Considerations • Capacity • Users

  3. Background • Started as a graduate project using StratusLab • Funded to set up an OpenNebula-based cloud • Started evaluating and deploying OpenStack in 2016

  4. Why run a cloud? • Provide Self Service VMs to SCD and wider STFC • Underpin Horizon 2020 projects • To support SCD’s Facilities program • To give “easy” access to computing for new user communities (GridPP and UKT0 goal)

  5. OpenNebula • Running stably for 3 years • Works well for individual users • Tricky to use programmatically • “Small” close-knit community • Should be decommissioned this year

  6. OpenStack • Very large community • Very flexible • Complicated • Momentum in scientific communities • Strong API • Preexisting integrations • Jenkins, Grid Engine, LSF etc. • Running for 18 months • Already used for some production services

  7. OpenStack Design Considerations • Multi tenancy • Multiple user communities internally and externally • Highly Available • Services should be as highly available as possible • Flexible • We want to accommodate all reasonable requests

  8. Highly Available • OpenStack services should be highly available where possible • Ceph RBD is used for VM images

  9. OpenStack Services • Multiple instances of each OpenStack service are behind HAProxy load balancers
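
As a rough illustration of what "behind HAProxy" means in practice, the sketch below probes a load-balanced API endpoint and the individual controller instances behind it. All hostnames and the port are placeholder assumptions, not the real STFC Cloud endpoints.

```python
# Minimal sketch: probe a load-balanced Keystone endpoint (the HAProxy VIP)
# and the individual controller instances behind it. Hostnames and the port
# are hypothetical placeholders, not the real STFC Cloud endpoints.
import requests

VIP = "https://openstack.example.ac.uk:5000/v3"        # HAProxy front end (assumed)
BACKENDS = [
    "https://controller1.example.ac.uk:5000/v3",        # Keystone instances (assumed)
    "https://controller2.example.ac.uk:5000/v3",
    "https://controller3.example.ac.uk:5000/v3",
]

for url in [VIP] + BACKENDS:
    try:
        resp = requests.get(url, timeout=5)
        print(f"{url} -> HTTP {resp.status_code}")       # Keystone answers on /v3 with version info
    except requests.RequestException as exc:
        print(f"{url} -> unreachable ({exc})")
```

Because each service runs as multiple instances in the pool, any single controller can drop out without the load-balanced endpoint becoming unavailable.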

  10. Ceph RBD • A replicated Ceph cluster called SIRIUS provides block storage for VMs and Volumes • 3x Replication • Optimised for lower latency
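
A minimal sketch of how a VM disk or volume ends up as an RBD image in a replicated pool, using the standard Ceph Python bindings. The conffile path, pool name and image size are assumptions for illustration, not details from the slide; the 3x protection comes from the pool's replication setting, not from this code.

```python
# Minimal sketch using the Ceph Python bindings (python3-rados / python3-rbd).
# Conffile path, pool name ("vms") and image size are assumptions.
import rados
import rbd

cluster = rados.Rados(conffile="/etc/ceph/ceph.conf")   # connect with the default client config
cluster.connect()
try:
    ioctx = cluster.open_ioctx("vms")                   # hypothetical RBD pool backing VM disks
    try:
        rbd.RBD().create(ioctx, "demo-volume", 10 * 1024**3)  # 10 GiB image, replicated 3x by the pool
        print("created demo-volume in pool 'vms'")
    finally:
        ioctx.close()
finally:
    cluster.shutdown()
```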

  11. Multi Tenancy • Projects (Tenants) need to be isolated • From each other • From STFC site network • Security Groups • VXLAN private project networks • Brings its own problems
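
To make the isolation model concrete, here is a hedged openstacksdk sketch of what a per-project setup might look like: a tenant network (realised by Neutron as a VXLAN overlay) plus a security group that only admits SSH from a placeholder range. The cloud entry, names and CIDRs are illustrative assumptions.

```python
# Hedged sketch with openstacksdk: one private tenant network plus a security
# group allowing SSH only from a placeholder range. The cloud entry, names
# and CIDRs are assumptions for illustration.
import openstack

conn = openstack.connect(cloud="stfc-example")        # clouds.yaml entry (assumed)

net = conn.network.create_network(name="project-net")          # realised as a VXLAN overlay by Neutron
subnet = conn.network.create_subnet(
    network_id=net.id, name="project-subnet",
    ip_version=4, cidr="192.168.10.0/24",                       # placeholder tenant range
)

sg = conn.network.create_security_group(
    name="ssh-only", description="Allow SSH from a trusted range only")
conn.network.create_security_group_rule(
    security_group_id=sg.id, direction="ingress", ethertype="IPv4",
    protocol="tcp", port_range_min=22, port_range_max=22,
    remote_ip_prefix="192.0.2.0/24",                            # placeholder trusted network
)
```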

  12. Private Networks • Virtual machines connect to a private network • VXLAN is used to tunnel these networks across hypervisors • Ingress and Egress is via a virtual router with NAT • Distributed Virtual Routing is used to minimise this bottleneck – every hypervisor runs a limited version of the virtual router agent.
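
Following on from the previous sketch, a tenant network only reaches the outside world through a virtual router doing NAT; where DVR is enabled, the router can be created as distributed so most routing happens on the hypervisors themselves. The external network and subnet names below are assumptions.

```python
# Hedged sketch: attach the tenant subnet from the previous example to a
# NAT-ing virtual router. is_distributed=True asks Neutron for a DVR router
# (this typically needs admin rights and DVR enabled cloud-wide); "external"
# is an assumed provider network name.
import openstack

conn = openstack.connect(cloud="stfc-example")            # clouds.yaml entry (assumed)
ext_net = conn.network.find_network("external")           # assumed external/provider network
subnet = conn.network.find_subnet("project-subnet")       # subnet from the previous sketch

router = conn.network.create_router(
    name="project-router",
    is_distributed=True,                                   # DVR: routing agent on every hypervisor
    external_gateway_info={"network_id": ext_net.id},      # egress/ingress via NAT on this gateway
)
conn.network.add_interface_to_router(router, subnet_id=subnet.id)
```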

  13. VXLAN • VXLAN by default has significant overheads • VXLAN performance is ~10% of line rate • Tuning memory pages, CPU allocation, mainline kernel • Performance is ~40% of line rate • Hardware offload • VXLAN offload to NIC gives ~80% of line rate • High Performance Routed network + EVPN • 99+% of line rate
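
The jump from roughly 40% to roughly 80% of line rate relies on the NIC performing VXLAN (UDP tunnel) segmentation offload. A quick, hedged way to check whether a given card advertises that feature is shown below; the interface name is a placeholder.

```python
# Minimal sketch: ask ethtool whether a NIC advertises UDP-tunnel (VXLAN)
# segmentation offload. The interface name is a placeholder.
import subprocess

IFACE = "eth0"   # hypothetical VXLAN/tunnel-facing interface

features = subprocess.run(
    ["ethtool", "-k", IFACE], capture_output=True, text=True, check=True
).stdout
for line in features.splitlines():
    if "udp_tnl" in line:            # e.g. "tx-udp_tnl-segmentation: on"
        print(line.strip())
```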

  14. Cloud Network

  15. Flexible • Availability Zones across site • 1st will be in ISIS soon • GPU support • AAI • APIs • Nova, EC2, OCCI • Design decisions shouldn’t preclude anything • Pet VMs
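
As an example of how availability zones surface to users, the sketch below boots an instance into a named zone via the openstacksdk cloud layer. The cloud entry, image, flavour, network and zone name are all placeholder assumptions (the slide only says the first zone is planned for ISIS).

```python
# Hedged sketch: launch a VM into a specific availability zone using the
# openstacksdk "cloud" layer. Cloud entry, image, flavour, network and zone
# name are placeholder assumptions, not the site's real values.
import openstack

conn = openstack.connect(cloud="stfc-example")

server = conn.create_server(
    name="az-demo",
    image="ubuntu-22.04",              # assumed image name
    flavor="m1.small",                 # assumed flavour name
    network="project-net",             # assumed tenant network
    availability_zone="isis-az",       # assumed zone name
    wait=True,
)
print(server.status)
```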

  16. AAI – EGI CheckIn - Horizon

  17. AAI – EGI CheckIn – Horizon 2

  18. AAI – EGI CheckIn

  19. AAI – Google Login

  20. AAI – EGI CheckIn – Almost • Not quite working completely at RAL yet

  21. Hardware • 2014 • 28 hypervisors – 2x8C/16T, 128GB • 30 storage nodes – 8x4TB (1 used for OS) • 2015 (ISIS funded) • 10 hypervisors – 2x8C/16T, 128GB • 12 storage nodes – 12x4TB (1 used for OS) • 2016 (ISIS funded) • 10 hypervisors – 2x8C/16T, 128GB, 2 Nvidia Quadro K620s • 10 storage nodes – 12x4TB (1 used for OS) • 2017 • 108 hypervisors – 2x8C/16T, 96GB (UKT0 funded) • 12 storage nodes – 12x4TB disk + 1x3.6TB PCIe SSD + 2 OS disks (SCD funded)

  22. Hardware Available • Once the 2017 hardware is deployed • ~5000 logical cores • 20 Nvidia Quadro K620s • ~2PB raw storage (~660TB usable)
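
A quick sanity check of these headline figures against the hardware list on the previous slide, assuming 32 logical cores per 2x8C/16T hypervisor and the quoted 3x Ceph replication:

```python
# Rough check of the headline capacity figures, using the per-year hypervisor
# counts from the previous slide and 32 logical cores per 2x8C/16T node.
hypervisors = {"2014": 28, "2015": 10, "2016": 10, "2017": 108}
print(sum(hypervisors.values()) * 32)   # 4992 -> "~5000 logical cores"

raw_storage_tb = 2000                   # "~2PB raw storage"
print(raw_storage_tb / 3)               # ~666 TB -> "~660TB usable" with 3x replication
```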

  23. Use cases – January 2017 • Development and Testing • DLS TopCat sprint development server • LSF, ICAT and IJP Development and Testing • Tier 1 – Grid services (including CVMFS) development and testing • Tier 1 – GridFTP development environment • Testing for Indigo Datacloud project • Development hosts for Quattor Releasing • Development work supporting APEL • EUDAT – federating object stores • CICT – Testing software packages before deploying into production e.g. moodle and lime survey • Repository development for ePubs and eData • Building and Releasing • build and integration hosts for Quattor Releasing • Building software for the CA • Building APEL packages for release • Testing/Production work • CCP4-DAAS, IDAAS – User interface machines to other department resources. • Testing Aquilon sandboxes and personalities • CEDA – data access server • EUDAT.EU - Hosting a prototype graph database • Nagios monitoring system for the Database team • Dashboard and database for the database team’s backup testing framework • Blender Render Farm – Visualisation for Supercomputing conference

  24. User Communities • SCD • Self service VMs • Some programmatic use • Tier1 • Bursting the batch farm • ISIS • Data-Analysis-as-a-Service • SESC Build Service • Jenkins • CLF – OCTOPUS • Diamond/Xchem • Cloud bursting Diamond GridEngine • Xchem data processing using OpenShift • WLCG Datalake project • Quattor Nightlies • West-Life

  25. Links • Data-Analysis-as-a-Service • https://indico.cern.ch/event/713848/contributions/2932932/attachments/1617174/2570708/ukt0_wp2_software_infrastructure.pdf • https://indico.cern.ch/event/713848/contributions/2933001/attachments/1617640/2571686/IDAaaS-Overview.pdf

  26. Any Questions? Alexander.dibbo@stfc.ac.uk
