Control Plane Architectures: Design Solutions Shane Gibson – Cloud Infrastructure Architect

Control Plane Architectures: Design Solutions Shane Gibson – Cloud Infrastructure Architect ZeroStack, Inc. - https://zerostack.com/ OpenStack Summit - Boston, MA- May 11, 2017

QR Code Why take pix? Just use the QR Code! https://www.slideshare.net/ShaneGibson3/openstack-control-plane-architectures-design-solutions

IMPORTANT LEGAL STUFF Lorem ipsum dolor sit amet, consectetur adipiscing elit. Mauris venenatis posuere odio vel auctor. Fusce non turpis nec lorem varius dictum. Nulla felis neque, convallis a congue sed, molestie vel lectus. Aenean hendrerit metus non nunc commodo sodales. Etiam ac erat at massa tincidunt lobortis at et ligula. Fusce sed lorem tellus. Suspendisse potenti. Ut dignissim suscipit aliquet. Donec luctus pulvinar lectus quis condimentum. Etiam sed iaculis nunc, sed blandit magna. Fusce mattis nisl nec dapibus luctus. Proin a augue facilisis, vehicula mauris non, cursus augue. Mauris tristique, justo vitae tempor tincidunt, metus ligula ornare tellus, at condimentum ex diam nec odio. Sed porttitor ultrices libero sed efficitur. Cras sem diam, eleifend sit amet dui eu, pretium cursus fermentum nibh. Donec tincidunt cursus enim a varius. Nam placerat eu nunc id rutrum. Praesent ullamcorper fringilla eros, vitae rutrum elit consectetur in. Aliquam eu tempus dui, a feugiat nulla. Quisque laoreet imperd ex, in facilisis elit tristique a. Nunc ante felis, faucibus at semper nec, consequat commodo magna.

About Shane Gibson Shane Gibson serves as the Cloud Infrastructure Architect for ZeroStack, Inc., which is a private cloud solutions company. There he is responsible for the architecture, implementation, and management of the internal cloud platform that drives the SaaS and Cloud Portal that power the ZeroStack solution. Previously, he served as Sr. Principal Infrastructure Architect at Symantec for the Cloud Platform Engineering (CPE) team. He was responsible for the infrastructure design of the underlying platforms, operating systems, tools, and application stack that enable the OpenStack clusters within the CPE group. In previous roles, Shane has served as a Systems Architect, Network Architect, Security Architect, Unix Systems Administrator, Mainframe Operator, and Mainframe Hardware Specialist, and has also served in the United States Marine Corps. In his "spare" time, he loves to do anything on two wheels: motorcycling, mountain biking, road biking, cyclocross, etc.

Agenda what we'll be talking about (and not) problem statement needs analysis solutions summary questions thank you references

what we'll be talking about (and not) …

What we'll be talking about • Short definition of what "Control Plane" means • Short definition of what "Data Plane" means • How much Control Plane do you need? • Briefly discuss general HA design solutions • Introduce four design architectures: Standalone (seriously!); Active/passive; Fully Redundant, separate control plane; Distributed, embedded control plane • Discuss the architecture of these design solutions

What we won't be talking about • Things that aren't OpenStack • Ancillary services (e.g. AD/LDAP behind Keystone) • Server load balancer architectures (they're key to HA!): ok, we'll talk about them a bit … • Specifics of Network Controller architecture • Container Orchestration Engine (COE) HA • Physical infrastructure (e.g. power, cooling, etc.) • Complex DB setups (sharding, multisite …) • Multi-site Control Plane • Storage HA architecture (Ceph, Swift, etc.)

Control Plane Definition control plane The control plane is the management traffic responsible for sending signaling and commands, examples: give me a token so I can do something create port, network, router instantiate/terminate an instance Sort of like a Drill Sergeant: instructs recruits (data plane) signals and commands Ref: 1

Data Plane Definition data plane The data plane is all of the bits and bytes moving around related to doing the work as instructed by the control plane: actually instantiating the instance east/west traffic between VMs, north/south traffic in and out of your cloud Kind of like these poor Recruits stand at attention, pass out at attention !! Ref: 1

problem statement

Problem Statement man, this devstack is easy !! • So you've completed a PoC … like what you see … • Need to build a shiny new cloud • From PoC to production: what architecture do you need? • Understand your needs • Match your needs to a design • Overbuilding is just as dangerous as underbuilding • But keep in mind: you may need, want, or be forced to scale • Your control plane needs to grow with your cloud

needs analysis

Needs Analysis Understanding how much reliability you need is critical to determining an appropriate CP architecture Quantify how available your platform needs to be Be honest … can you live with a 95% available CP? How about 98% ? Do you *need* 99.9%? Can you afford to build, staff, support, and maintain 99.999%? Complexity adds cost, time, and significant risk

Needs Analysis - how much is enough Downtime, based on percentage of availability: • 365.243 days per year (leap year, baby!) • 52.178 weeks per year • 30.437 days per month • 4.348 weeks per month calculations source: http://uptime.is/
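The downtime arithmetic behind this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the same average period lengths listed above (365.243 days per year, and so on); the function and constant names are hypothetical and not taken from the talk or from uptime.is.

```python
# Average period lengths in hours, using the figures quoted on the slide.
PERIOD_HOURS = {
    "year": 365.243 * 24,
    "month": 30.437 * 24,
    "week": 7 * 24,
    "day": 24,
}


def downtime_hours(availability_pct: float, period: str = "year") -> float:
    """Return the downtime (in hours) allowed by an availability target."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return PERIOD_HOURS[period] * (1 - availability_pct / 100)


if __name__ == "__main__":
    # Print the allowed downtime per year for the targets discussed earlier.
    for target in (95, 98, 99.9, 99.999):
        print(f"{target:>7}% -> {downtime_hours(target):10.3f} hours/year")
```

Changing the `period` argument gives the per-month, per-week, or per-day budget, which makes it easy to see how quickly the budget shrinks with each extra nine.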

Needs Analysis • To match your uptime/downtime threshold • Understand business use of your platform • Survey your user groups to determine what applications they will be using, and how critical they are • Determine how much talent (be honest) you have to build, or can buy (hire or rent), for the platform you need… • *You* might be a rock star, but you need a dedicated and competent team to tend to a complex HA solution • A well-tended single server solution *may* outperform a poorly managed, highly complex one • performance, of course, notwithstanding …

Needs Analysis: match uptime to solution A complete (bogus?) guideline • Uptime bands, highest to lowest: 99.99+ %; 99.5 to 99.99 %; 98 to 99.5 %; 95 to 98 % • Solutions, highest to lowest: Active/Active or Distributed; Active/Passive; Standalone

Needs Analysis • How much capacity (compute, memory, storage, etc) do you need for your control plane services? • Great resource/data: • URL: https://docs.openstack.org/developer/performance-docs/test_results/ • Example Control Plane resource consumption for: • 6 nodes • 200 nodes • 400 nodes • 1000 nodes

patterns - basics of availability designs

HA Design Solutions - single system with hardware redundancy • Server (redundant hardware subsystems) • Typically located in a datacenter (or datacenter-like) location with redundant power, network, cooling, etc. • Capacity / scaling is going to be your bugaboo (you can only scale "up" so much) • Suggest building in service LB from the beginning

HA Design Solutions - active/passive • Either bare metal or virtualized / containerized workloads • Externally replicated, service is unaware: e.g. use of load balancer and pacemaker + DRBD • Service-based replication example: mysql replication [Diagram: a VIP in front of two nodes, each running svc A, svc B, and svc C, with mysql (active) on one and mysql (standby) on the other; mysql replication and replicated data (e.g. DRBD) link the two]

HA Design Solutions - clustered • The application maintains and controls cluster replication, leader election, and take-overs [Diagram: three cluster members, one leader and two followers, each running services A, B, and C]
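As a rough illustration of application-controlled leader election, here is a toy Python sketch. It assumes a deterministic rule (the lowest member ID among reachable members wins, and only when a majority is reachable); real clustered services typically rely on consensus protocols, and the function name and rule here are hypothetical.

```python
from typing import Iterable, Optional


def elect_leader(members: Iterable[str], reachable: Iterable[str]) -> Optional[str]:
    """Pick a leader from the reachable members, or None without a majority.

    members:   every configured cluster member
    reachable: members this node can currently reach (including itself)
    """
    all_members = set(members)
    alive = all_members & set(reachable)
    quorum = len(all_members) // 2 + 1  # strict majority of configured members
    if len(alive) < quorum:
        # Refuse to elect: a minority partition must not take over.
        return None
    return min(alive)  # toy rule: lowest ID among live members


if __name__ == "__main__":
    cluster = {"A", "B", "C"}
    print(elect_leader(cluster, {"A", "B", "C"}))  # all members healthy
    print(elect_leader(cluster, {"B", "C"}))       # A lost, majority remains
    print(elect_leader(cluster, {"C"}))            # isolated member
```

The refusal to elect without a majority is what keeps two partitions from both promoting a leader at the same time.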

HA Design Solutions - virtualized services • Implement simple hypervisors (e.g. just bare KVM) or implement a small OpenStack cluster (caution !!) • A lot of interesting containerized CP solutions are maturing [Diagram: hypervisor 1, hypervisor 2, and hypervisor 3, each hosting VIP A, VIP B, a VM for service A, and a VM for service B]

HA Design Solutions - distributed services • Embed a VM or container in each hypervisor of your cluster which is responsible for service orchestration tasks [Diagram: hypervisor A, hypervisor B, and hypervisor C, labeled controller service A, controller service B, and controller service C; each shows VIP A / B, orch. data, services, and a set of VMs]

solutions - as applied to control plane

Solutions: Overview • Standalone (yes, still!) • Active/Passive: one master, one standby system • Active/Active Cluster: multiple members in cluster; leader election system / quorum protocols; or use of load balancers for singleton systems • Distributed System: embedded across the cluster

Solutions: how the control plane correlates Remember our Drill Sergeant? [Diagram labels: control plane, control plane, data plane] Ref: 3 Ref: 1

Solutions: how the control plane correlates How this might look in racks … [Diagram: racks with top-of-rack switches tor01 through tor12]

Solutions: standalone Standalone doesn't have to mean "prone to failure" • Redundant power supplies (with redundant feeds) • Redundant NICs: separate LOM + PCIe (or 2x PCIe) • Hardware RAID based storage • Redundant Top-of-Rack (bonded NICs) • In an environmentally controlled facility: cooling, power, electrical, etc. • You would be surprised how fault tolerant a single, well-designed system can be… • Can only "scale up" so much before you have to "scale out" Edgar Magana of Workday: OpenStack HA, or not HA not HA - Level 4 Ballroom G at 5:30pm

Solutions: standalone Server (redundant hardware subsystems)

Solutions: how much can HA/Reliability cost you ? Have you ever heard of the Jepsen tests or articles? Check out "The Network is Reliable" [Ref: 2] (Kyle Kingsbury): it just might chill your blood …

Solutions: active/passive Ok, maybe standalone doesn't cut it for you … Active/Passive utilizes a service to monitor the main (active) service, and then execute a coup if trouble is detected… for example: STONITH (Shoot The Other Node In The Head)

Solutions: active/passive • Most of the services aren't aware of the fact they have a "shadow partner" … • Utilize various tools to monitor services, and initiate a take-over if the primary/active service fails: keepalived, pacemaker, corosync, STONITH, etc. • Data is usually replicated outside of the application's knowledge: DRBD (Distributed Replicated Block Device), which is very stable, has been around a LONG time, and is actively maintained and supported; xNBD/bNBD; SAN based replication; Ceph RBD (replica of 2), GlusterFS, etc. • Or … "simply" via database replication
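The monitor-and-take-over pattern described above can be sketched as a small watchdog loop. This is a simplified illustration only, not how keepalived or pacemaker are implemented; the `FailoverMonitor` class, its callbacks, and the three-strikes threshold are all hypothetical, and a real deployment would normally fence the failed node (STONITH) before promoting the standby.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FailoverMonitor:
    check_active: Callable[[], bool]     # hypothetical health probe of the active node
    promote_standby: Callable[[], None]  # hypothetical take-over action
    max_failures: int = 3                # consecutive failed probes before take-over
    failures: int = 0
    failed_over: bool = False

    def tick(self) -> bool:
        """Run one health check; return True if this tick triggered a take-over."""
        if self.failed_over:
            return False  # already running on the standby
        if self.check_active():
            self.failures = 0  # a healthy probe resets the counter
            return False
        self.failures += 1
        if self.failures >= self.max_failures:
            self.promote_standby()
            self.failed_over = True
            return True
        return False


if __name__ == "__main__":
    # Simulated probe results: healthy once, then three failures in a row.
    probe_results = iter([True, False, False, False])
    monitor = FailoverMonitor(
        check_active=lambda: next(probe_results),
        promote_standby=lambda: print("standby promoted"),
    )
    for _ in range(4):
        monitor.tick()
```

Requiring several consecutive failures is one way to avoid a take-over on a single dropped probe; the right threshold depends on how costly a false failover is for the service.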

Solutions: active/passive • Primary mechanism is Service LB with a watchdog of some type • Let distributed services (e.g. rabbitmq and mysql) replicate natively • Shared storage for things like configurations, backing instances, etc.

Solutions: fully redundant So you've decided you're "all in" • Fully Redundant requires very careful consideration • Complex HA and Reliability solutions have their own baggage that just might cost you more than you bargained for • But if you need to drive towards the 99% and better uptime… • Each service requires its own treatment in terms of architecture … but there are common threads

Solutions: fully redundant - virtualized Like active/passive - but we now scale 3, 5, etc… (odd numbers for proper quorum) of fully active members
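The "odd numbers for proper quorum" point can be checked with a short calculation. This sketch assumes a simple majority quorum; the function names are illustrative.

```python
def quorum_size(members: int) -> int:
    """Smallest strict majority of a cluster with the given member count."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """How many members can fail while a majority is still available."""
    return members - quorum_size(members)


if __name__ == "__main__":
    # Compare odd and even cluster sizes side by side.
    for n in range(1, 8):
        print(f"{n} members: quorum {quorum_size(n)}, "
              f"tolerates {failures_tolerated(n)} failure(s)")
```

Under this rule a 4-member cluster tolerates the same single failure as a 3-member one, which is why the extra even-numbered member adds cost without adding fault tolerance.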

Solutions: fully redundant - containerized New alternatives emerging around COE models for managing your Control Plane services. Kubernetes Example: Kubernetes Master with HA – One of many proposed HA models

Solutions: fully redundant - containerized [Diagram: three kubernetes masterN nodes above the Kubernetes Worker Nodes worker 1, worker 2, and worker 3, each running a kubelet; service containers shown include mysql, neutron, glance, nova, cinder, ...etc...]

Solutions: distributed Big departure from the traditional model. With distributed (embedded) clusters, there are some special considerations necessary: • Be very careful of the "noisy neighbor" problem causing your control plane grief: see "Quantifying the Noisy Neighbor Problem" by ZeroStack from the Austin 2016 Summit • Designing the algorithms for placing and managing your control plane systems in the cluster can be very complex • Need a distributed state/service orchestration piece (e.g. etcd, consul, serf, atomix, zookeeper)
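To give a feel for why placement gets complex, here is a deliberately naive placement sketch in Python. It assumes two rules that are not from the talk: skip hypervisors above a load threshold (a crude guard against noisy neighbors) and put at most one controller in each rack. The `Host` type, the `place_controllers` function, and the 0.7 threshold are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Host:
    name: str
    rack: str    # assumed failure domain
    load: float  # assumed utilization, 0.0 (idle) to 1.0 (saturated)


def place_controllers(hosts: Sequence[Host], replicas: int = 3,
                      max_load: float = 0.7) -> List[Host]:
    """Choose hosts for control plane replicas: least loaded first, one per rack."""
    # Drop busy hosts, then prefer the least loaded (name breaks ties).
    candidates = sorted(
        (h for h in hosts if h.load <= max_load),
        key=lambda h: (h.load, h.name),
    )
    chosen: List[Host] = []
    used_racks = set()
    for host in candidates:
        if host.rack in used_racks:
            continue  # keep replicas in separate racks
        chosen.append(host)
        used_racks.add(host.rack)
        if len(chosen) == replicas:
            return chosen
    raise ValueError("not enough suitable hosts in distinct racks")


if __name__ == "__main__":
    inventory = [
        Host("hv1", "r1", 0.2), Host("hv2", "r1", 0.1),
        Host("hv3", "r2", 0.5), Host("hv4", "r3", 0.9),
        Host("hv5", "r3", 0.6),
    ]
    print([h.name for h in place_controllers(inventory)])
```

Even this toy version ignores rebalancing when load shifts, healing after a host failure, and resource guarantees for the controllers themselves, which is where most of the real complexity lives.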

Solutions: distributed or ... Consider a COE (container orchestration engine) to manage the placement and healing properties of your CP: • Still a relatively young solution with potential pitfalls • Can utilize this model with Fully Redundant or Distributed models • Consider tight QoS controls (e.g. namespaces and cgroups) for service guarantees if using Distributed

Solutions: distributed - four node cluster

Solutions: distributed When you have a CP that dynamically does this, auto-heals, deals with noisy neighbors, and can scale on demand …

QUESTIONS ?

We are hiring!! Check us out on the thingy called the "web", at: https://www.zerostack.com/careers/

THANK YOU! Shane Gibson shane@zerostack.com

References
[1] CartoonStock License Agreement: https://www.cartoonstock.com/licenseagreement.asp
[2] "The network is reliable" (Kyle Kingsbury and Peter Bailis): https://aphyr.com/posts/288-the-network-is-reliable
[3] OpenStack Operators Guide: http://docs.openstack.org/openstack-ops/content/example_architecture.html#example_archs_conclusion
