
In-depth monitoring for OpenStack services



  1. In-depth monitoring for OpenStack services — George Mihaiescu, Senior Cloud Architect; Jared Baker, Cloud Specialist

  2. The infrastructure team
  George Mihaiescu
  • Cloud architect for the Cancer Genome Collaboratory
  • 7 years of OpenStack experience
  • First deployment - Cactus
  • First conference - Boston 2011
  • OpenStack speaker at the Barcelona, Boston and Vancouver conferences
  Jared Baker
  • Cloud specialist for the Cancer Genome Collaboratory
  • 2 years of OpenStack experience
  • 10 years of MSP experience
  • First deployment - Liberty
  • First conference (and speaker) - Boston 2017

  3. Ontario Institute for Cancer Research (OICR)
  • Largest cancer research institute in Canada, funded by the government of Ontario
  • Together with its collaborators and partners, supports more than 1,700 researchers, clinician scientists, research staff and trainees

  4. Cancer Genome Collaboratory - project goals and motivation
  • Cloud computing environment built for biomedical research by OICR and funded by Government of Canada grants
  • Enables large-scale cancer research on the world's largest cancer genome dataset, currently produced by the International Cancer Genome Consortium (ICGC)
  • Built entirely with open-source software such as OpenStack and Ceph
  • Compute infrastructure goal: 3,000 cores and 15 PB of storage
  • A system for cost recovery

  5. No frills design
  • Use high-density commodity hardware to reduce physical footprint and related overhead
  • Use open-source software and tools
  • Prefer copper over fiber for network connectivity
  • Spend 100% of the hardware budget on the infrastructure that supports cancer research, not on licenses or "nice to have" features

  6. Hardware architecture – compute nodes

  7. Hardware architecture – Ceph storage nodes

  8. OpenStack controllers
  • Three controllers in HA configuration (2 x 6-core CPUs, 128 GB RAM, 6 x 200 GB Intel S3700 SSD drives)
  • Separate partitions for the OS, Ceph mon and MySQL
  • HAProxy (SSL termination with ECC certificates) and Keepalived (see the sketch after this list)
  • 4 x 10 GbE bonded interfaces, 802.3ad, layer 3+4 hash
  • Neutron + GRE, HA routers, no DVR
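
  A minimal sketch of the HAProxy piece, assuming hypothetical controller IPs (10.0.0.11-13), a Keepalived-managed VIP of 10.0.0.10, and the default Keystone port - not the exact production config:

  frontend keystone_public
      mode http
      # SSL terminates here, on the Keepalived VIP
      bind 10.0.0.10:5000 ssl crt /etc/haproxy/certs/api.pem
      default_backend keystone_backend

  backend keystone_backend
      mode http
      balance source
      # HTTP health check against the Keystone v3 root
      option httpchk GET /v3
      server ctrl1 10.0.0.11:5000 check inter 2000 rise 2 fall 3
      server ctrl2 10.0.0.12:5000 check inter 2000 rise 2 fall 3
      server ctrl3 10.0.0.13:5000 check inter 2000 rise 2 fall 3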

  9. Networking
  • Ruckus ICX 7750-48C top-of-rack switches configured in a stacked ring topology
  • 6 x 40 Gb Twinax cables between the racks, providing 240 Gbps of non-blocking redundant connectivity (2:1 oversubscription ratio)

  10. Capacity vs. extreme performance

  11. Upgrades

  12. Software – entirely open source

  13. Rally – end-to-end tests
  A Rally test runs every hour and performs an end-to-end check:
  • starts a VM
  • assigns a floating IP
  • connects over SSH
  • pings an external host five times
  We alert if the check fails, takes too long to complete, or packet loss is greater than 40%. The check sends runtime info to Graphite for long-term graphing, and Grafana alerts us if the average runtime is above a threshold. A sketch of such a task follows.
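
  A minimal sketch of a task along these lines, using the rally-openstack VMTasks.boot_runcommand_delete scenario; the flavor, image, network and ping target below are hypothetical placeholders:

  VMTasks.boot_runcommand_delete:
    - args:
        flavor:
          name: "m1.small"              # hypothetical flavor
        image:
          name: "Ubuntu 16.04"          # hypothetical image
        floating_network: "public"      # hypothetical external network
        username: "ubuntu"
        command:
          interpreter: "/bin/sh"
          script_inline: "ping -c 5 8.8.8.8"
      runner:
        type: "constant"
        times: 1
        concurrency: 1
      context:
        users:
          tenants: 1
          users_per_tenant: 1
        network: {}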

  14. Rally – RBD volume performance test
  Another Rally check monitors RBD volume (Ceph-based) write performance over time (a sketch of the write script follows the list):
  • it boots an instance from a volume
  • it assigns a floating IP
  • it connects over SSH
  • it runs a script that writes a 10 GB file three times
  • it captures the average I/O throughput at the end
  • it sends throughput info to Graphite for long-term graphing
  • it alerts if the average runtime is above the threshold
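
  A minimal sketch of the in-guest write script, assuming the volume is mounted at /mnt/volume and a hypothetical Graphite host; the real script may differ:

  #!/bin/sh
  # Write a 10 GB file three times and average the throughput (MB/s).
  TOTAL=0
  for i in 1 2 3; do
      START=$(date +%s)
      dd if=/dev/zero of=/mnt/volume/testfile bs=1M count=10240 oflag=direct
      END=$(date +%s)
      TOTAL=$((TOTAL + 10240 / (END - START)))
  done
  # Ship the average to Graphite's plaintext port (hypothetical host/metric):
  echo "rally.rbd.write_mbps $((TOTAL / 3)) $(date +%s)" | nc -q0 graphite.example.org 2003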

  15. Rally – custom checks

  16. Rally smoke tests & load tests

  17. Zabbix and Grafana

  18. Zabbix and Grafana

  19. Dockerized monitoring stack
  We run a number of tools in containers:
  • Sflowtrend
  • Prometheus
  • Graphite
  • Collectd
  • Grafana
  • Ceph_exporter
  • Elasticsearch
  • Logstash
  • Kibana
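
  For illustration, a minimal docker-compose sketch covering three of these; image names and ports are the public defaults, not necessarily our exact setup:

  version: "3"
  services:
    prometheus:
      image: prom/prometheus
      ports:
        - "9090:9090"
    grafana:
      image: grafana/grafana
      ports:
        - "3000:3000"
    ceph_exporter:
      image: digitalocean/ceph_exporter
      ports:
        - "9128:9128"
      volumes:
        # read-only access to the Ceph config and keyring
        - /etc/ceph:/etc/ceph:ro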

  20. Ceph monitoring – IOPS

  21. Ceph monitoring – performance & integrity

  22. OpenStack capacity usage

  23. Sflowtrend

  24. Zabbix
  • 200+ hosts
  • 38,000+ items
  • 15,000+ triggers
  • Performant
  • Reliable
  • Customizable
  https://github.com/CancerCollaboratory/infrastructure

  25. Zabbix
  The Zabbix agent (client) monitors:
  • CPU
  • Disk I/O
  • Memory
  • Filesystem
  • Security
  • Services running
  • HW RAID card
  • Fans, temperature, power supply status
  • PDU power usage

  26. Zabbix
  Custom checks:
  • When security updates are available
  • When new cloud images are released
  • Number of IPs banned by fail2ban
  • Iptables rules across all controllers are in sync
  • Open vSwitch ports tagged with VLAN 4095 (bad)
  • Number of Cinder volumes != number of RBD volumes (compared with the commands below)
  • Aggregate memory use per process type (e.g. nova-api, radosgw, etc.)
  • Compute nodes have the "neutron-openvswi-sg-chain"

  # List volume IDs present in Cinder or in the Ceph "volumes" pool but not
  # in both (">" truncates the scratch file on the first write):
  openstack volume list --all -f value -c ID > /tmp/rbdcindervolcompare
  rbd -p volumes ls | sed "s/volume-//" >> /tmp/rbdcindervolcompare
  sort /tmp/rbdcindervolcompare | uniq -u
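
  One way to wire a check like this into the agent is a UserParameter; the script path and key name below are hypothetical:

  # /etc/zabbix/zabbix_agentd.d/cinder_rbd.conf
  # Returns the number of mismatched volume IDs; trigger when != 0.
  UserParameter=cinder.rbd.mismatch,/usr/local/bin/check_cinder_rbd.sh | wc -l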

  27. Zabbix
  OpenStack APIs - multiple checks per API (example item keys after the list):
  • Is the process running?
  • Is the port listening?
  • Internal checks (from each controller)
  • External checks (from the monitoring server)
  • Memory usage aggregated per process type
  • Response time, number and type of API calls
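
  The first two checks map naturally onto built-in Zabbix agent item keys, for example (ports shown are the usual OpenStack defaults):

  proc.num[nova-api]           (how many nova-api processes are running)
  net.tcp.listen[8774]         (is the Nova API port listening locally?)
  net.tcp.service[http,,5000]  (does the Keystone port answer an HTTP probe?)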

  28. Zabbix – OpenStack services memory usage

  29. Zabbix – Neutron router traffic

  30. Zabbix
  Capacity planning:
  • Total/used vCPU
  • Total/used vRAM
  • Number of instances
  • Internet traffic

  31. Zabbix alerting
  • Alerts to Slack, email, etc.
  • Very configurable
  • Multiple channels
  • Emojis!
  https://github.com/ericoc/zabbix-slack-alertscript

  32. ELK & Filebeat
  Embracing the chaos of logs:
  • Powerful search
  • Fast
  • Meaningful visualizations
  • Great documentation

  33. ELK & Filebeat – dashboards to suit your needs

  34. ELK & Filebeat
  Filebeat tags - tagging at the source:

  - type: log
    paths:
      - /var/log/glance/*.log
    tags: ["glance", "openstack"]
  - type: log
    paths:
      - /var/log/heat/*.log
    tags: ["heat", "openstack"]
    exclude_lines: ['DEBUG']

  Kibana can then search and filter by these tags.

  35. ELK & Filebeat – monitoring the OpenStack dashboard

  36. ELK & Filebeat
  Alerting on a log entry such as:

  [Mon Apr 23 14:18:34 2018] [pid 1736219] Login successful for user "admin", remote address 64.231.26.191

  Logstash output plugin:

  if [source] == "/var/log/apache2/access.log" and [login_status] =~ "successful" and ([user] =~ "admin") and !([clientip] =~ "206.108.177.10") {
    email {
      from => "alert-server@domain.com"
      subject => "ALERT! Openstack Admin account was logged in from outside expected IP space"
      to => "administrators@domain.com"
      via => "smtp"
      body => "ALERT! Openstack Admin account was logged in from outside expected IP space: %{message}"
      port => "587"
      address => "smtp.domain.com"
      username => "username"
      password => "password"
      authentication => "plain"
      use_tls => true
    }
  }

  37. Monitoring for users
  https://github.com/CancerCollaboratory/webstatus-update

  38. Lessons learned
  • If something needs to be running, test it
  • Be generous with your specs for the monitoring and control plane (more RAM and CPU than you think you will need)
  • Monitor RAM usage aggregated per process type
  • If your router is supposed to run in HA, verify there is ONLY ONE active agent and ONLY ONE standby
  • Have a check for the metadata NAT rule inside the router's namespace (see the sketch below)
  • It's possible to run a stable and performant OpenStack cluster with few but qualified resources, as long as you carefully design it and choose the most stable (and absolutely needed) OpenStack projects and configurations
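
  A minimal sketch of those two router checks, assuming the usual qrouter- namespace naming and the 169.254.169.254 metadata redirect; the router UUID is a placeholder:

  # Exactly one L3 agent should report ha_state "active" for an HA router:
  neutron l3-agent-list-hosting-router <router-uuid>

  # The metadata NAT (REDIRECT) rule should exist inside the router namespace:
  ip netns exec qrouter-<router-uuid> iptables -t nat -S | grep 169.254.169.254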

  39. Future plans
  • Add Magnum and Octavia, possibly Trove
  • Slowly migrate to container-based control plane OpenStack services, mainly for ease of upgrade
  • Build a bioinformatics SaaS solution, making the infrastructure easier to use for less experienced cancer researchers

  40. Thank you
  • Discovery Frontiers: Advancing Big Data Science in Genomics Research program (grant no. RGPGR/448167-2013, "The Cancer Genome Collaboratory")
  • Natural Sciences and Engineering Research Council (NSERC) of Canada
  • Canadian Institutes of Health Research (CIHR)
  • Genome Canada
  • Canada Foundation for Innovation (CFI)
  • Ontario Research Fund of the Ministry of Research, Innovation and Science

  41. Contact
  Questions?
  George Mihaiescu - george.mihaiescu@oicr.on.ca
  Jared Baker - jared.baker@oicr.on.ca
  GitHub repo: https://github.com/CancerCollaboratory/infrastructure.git
