1 / 32

CloudStack Scalability Testing, Development, Results, and Futures

CloudStack Scalability Testing, Development, Results, and Futures. Anthony Xu Apache CloudStack contributor. Apache CloudStack : a project in incubation. Secure, multi-tenant cloud orchestration platform Turnkey platform for delivering IaaS clouds Hypervisor agnostic

agalia
Download Presentation

CloudStack Scalability Testing, Development, Results, and Futures

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. CloudStack Scalability Testing, Development, Results, and Futures Anthony Xu Apache CloudStack contributor

  2. Apache CloudStack: a project in incubation • Secure, multi-tenant cloud orchestration platform • Turnkey platform for delivering IaaS clouds • Hypervisor agnostic • Highly scalable, secure and open • Complete Self-service portal • Open source, open standards • Deploys on premise

  3. Manage hosts, create VMs, virtual disks, virtual networks, meter usage, …. Admin Internet Management Server Cluster Router Primary MySQL Load Balancer Backup MySQL L3 Core Switch Top of Rack Switch … Object Storage Servers … … … … Availability Zone 1 Pod 3 Pod 1 Pod 2 Pod N

  4. Thinking about cloud orchestration at scale • Host management • Capacity management • What host to use to deploy a new VM • Failure handling • Security group propagation • Set a goal

  5. We can’t afford this as our QA lab

  6. Simulator enables scale testing Mgmt. Server Zone Simulator User API Mgmt. Server Load Balancer Admin API Mgmt. Server Mgmt. Server MySQL MySQL

  7. Environment 2 cores, 4 with Hyper Threading. 2.2 GHz Xeon. 16 GB RAM. 12 GB JVM Heap. Mgmt. Server Single spinning disk, later single SSD. 32 GB RAM. MySQL 5.5. Zone Simulator User API Mgmt. Server Load Balancer Admin API Mgmt. Server Mgmt. Server MySQL MySQL

  8. Allocator performance is awful with 1000 hosts • Two minutes to decide which host to use for a new VM! • Computing capacity for every pod repeatedly • Fixed that, but still 12 seconds to decide • Use host tags, down to 2 seconds • Major changes required to improve further • In 2.2.0, store capacity info in DB, skip pod altogether • Harness the power of SQL select and all is well

  9. Polling doesn’t scale FALSE? TRUE? Sometimes, it is good enough

  10. Host management • Check host state via TCP connection • Check every minute • 30,000 checks per minute, 500 per second • But they take 10 seconds, so 5000 in parallel • Not using async I/O so 5000 threads required… • Single JVM can support 5000+ threads so this is concerning but may not be the limiting factor

  11. Host management • What is the maximum feasible JVM heap size? • Some people use heaps with hundreds of GB • Commercial tools can help, but cost • We decided to stay below 20 GB (GC concerns) • How much CPU is required for background processing?

  12. CPU Utilization. 400% is maximum CPU utilization while deploying 30,000 VMs on 30,000 hosts 20,000 5000 5000 Idle Time

  13. Deploy time from 25,000 to 30,000 VMs Seconds to deploy VM number: 25,000 plus X

  14. Problem: agent load balancing • Management servers start/stop/fail/crash • How do newly started Management Servers get agents / work? • When a Management Server exits, how do others pick up its load? • When new hosts are added how is the load distributed?

  15. Common use case timings at scale • 30,000 hosts and 4 Management Servers • 4 Management Servers running, 1 fails: 10 minutes to redistribute 7500 agents • 3 Management Servers running, add a fourth: 40 minutes to redistribute load evenly • 0 Management Servers running, start all 4 simultaneously: 16 minutes to connect to all 30,000 hosts IMPORTANT

  16. Understanding security groups Web VM DB VM Web VM Web Security Group DB Security Group Web VM Web VM DB VM … … … Web VM Web VM Ingress Rule: Allow VMs in Web Security Group access to VMs in DB Security Group on Port 3306

  17. L3 isolation with distributed firewalls Tenant 1 VM 1 10.1.0.2 Public IP address 65.37.141.11 65.37.141.24 65.37.141.36 65.37.141.80 Public Internet 10.1.0.1 Tenant 2 VM 1 Pod 1 L2 Switch 10.1.0.3 Tenant 1 VM 2 10.1.0.4 … 10.1.8.1 Pod 2 L2 Switch L3 Core 10.1.16.1 Load Balancer Pod 3 L2 Switch …

  18. L3 isolation with distributed firewalls Tenant 1 VM 1 10.1.0.2 Public IP address 65.37.141.11 65.37.141.24 65.37.141.36 65.37.141.80 Public Internet 10.1.0.1 Tenant 2 VM 1 Pod 1 L2 Switch 10.1.0.3 Tenant 1 VM 2 10.1.0.4 … 10.1.8.1 Pod 2 L2 Switch L3 Core 10.1.16.1 Load Balancer Pod 3 L2 Switch … Tenant 1 VM 3 10.1.16.47 Tenant 1 VM 4 10.1.16.85

  19. L3 isolation with distributed firewalls Tenant 1 VM 1 10.1.0.2 Public IP address 65.37.141.11 65.37.141.24 65.37.141.36 65.37.141.80 Public Internet 10.1.0.1 Tenant 2 VM 1 Pod 1 L2 Switch 10.1.0.3 Tenant 1 VM 2 10.1.0.4 … 10.1.8.1 Pod 2 L2 Switch L3 Core Tenant 2 VM 2 10.1.16.12 10.1.16.1 Load Balancer Pod 3 L2 Switch Tenant 2 VM 3 10.1.16.21 … Tenant 1 VM 3 10.1.16.47 Tenant 1 VM 4 10.1.16.85

  20. 1 Firewall per Virtual Machine

  21. One million firewalls? VM … VM VM VM … VM VM VM VM … VM … … VM … VM VM VM VM VM VM VM VM VM VM … VM … … VM VM VM VM VM VM … VM VM … VM … … VM VM VM VM VM VM VM … VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM … … … … … … … … … … … … … … … … … … … … … … … … … … … VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM VM

  22. Orchestrating hundreds of thousands of firewalls • Well-known software scaling techniques • Message queues • Consistency tradeoffs • Idempotent configuration & retries • CloudStack uses • Special purpose queues • Optimized for large security groups • Eventual consistency for rule updates

  23. Problem: firewall rules explosion in dom0 • Allow Security Group {Web} on TCP port 3060 -A FORWARD -m tcp –p tcp –dport 3060 –src 10.1.16.31 – j ACCEPT -A FORWARD -m tcp –p tcp –dport 3060 –src 10.1.45.112 – j ACCEPT -A FORWARD -m tcp –p tcp –dport 3060 –src 10.1.189.5 – j ACCEPT -A FORWARD -m tcp –p tcp –dport 3060 –src 10.21.9.77 – j ACCEPT … Performance suffers for large security groups

  24. Problem: firewall rules explosion in dom0 Fix with ipsets: ipset –N web_sgiptreemap ipset –A web_sg 10.1.16.31 ipset –A web_sg 10.1.16.112 ipset –A web_sg 10.1.189.5 ipset –A web_sg 10.21.9.77 -A FORWARD –p tcp –m tcp –dport 3060 –m set –match-set web_sgsrc -j ACCEPT … See also http://daemonkeeper.net/781/mass-blocking-ip-addresses-with-ipset/

  25. Security group propagation time Seconds to fully synced Number of VMs in security group

  26. Problem: database connection management • Scale testing resulted in several “too many open connections” errors from MySQL • Common problem: holding open connections while doing long-running operations • Took some code clean up and refactoring • No longer an issue • 10,000 connections are OK • CloudStack is far below that

  27. DB connections per MS while deploying 30,000 VMs 5,000 5,000 Number of DB connections 20,000 Time

  28. Other considerations (beyond control plane) • Network design and devices • Object store scalability • Per-host and cluster scalability • Storage • Understand your workload

  29. Future work • Improve simulator accuracy • Publish results of advanced network (VLAN) testing • Verify assumption of VM density not impacting scale

  30. More information and joining the project Project web site: http://incubator.apache.org/projects/cloudstack.html Mailing lists: cloudstack-dev-subscribe@incubator.apache.org cloudstack-users-subscribe@incubator.apache.org Scalability study: http://wiki.cloudstack.org/pages/viewpage.action?pageId=14320020

  31. Q&A

More Related