Architecting for High Availability on Azure Infrastructure

Architecting for High Availability on Azure Infrastructure. Igal Figlin, Azure Compute; Ziv Rafalovich, Azure Compute; Dave Beus, Adobe; Jayan Kandathil, Adobe. BRK3336, BRK3363. Agenda & Objectives. Topics: VM Resiliency, High Availability and Disaster Recovery; Resiliency improvements


Presentation Transcript


  1. Architecting for High Availability on Azure Infrastructure. Igal Figlin, Azure Compute; Ziv Rafalovich, Azure Compute; Dave Beus, Adobe; Jayan Kandathil, Adobe. BRK3336, BRK3363

  2. Agenda & Objectives
     Topics: VM Resiliency, High Availability and Disaster Recovery
     • Resiliency improvements
     • Managing Azure with AI/ML
     • Achieving High Availability for your application
     • High Availability with Adobe
     • Azure Site Recovery – What’s new
     Key Takeaways:
     • The right service architecture can dramatically increase availability and end-user satisfaction
     • Azure is focused on offering the right scenarios for reliable and performant implementation of VM-based services

  3. Achieving Business Continuity: From mission-critical applications to backup
     • High Availability: Redundancy with fault isolation. Data is replicated to a minimum of one additional location.
     • Disaster Recovery: Asynchronous replication across regions. Multiple regions within data residency boundaries.
     • Backup: Point-in-time copies of your data and configurations. Store backups for redundancy purposes with data residency.
     [Diagram: Region 1 and Region 2 inside a data residency boundary]

  4. Most Comprehensive Resiliency and SLA
     • Legacy Apps: Single VM. Fault isolation: none. VM SLA 99.9%. Protection with Premium Storage.
     • Intra-DC Isolation: Availability Sets & VM Scale Sets. Fault isolation: compute racks & storage stamps. VM SLA 99.95%. Protection against failures within datacenters.
     • Regional Availability: Single VM & VM Scale Set. Fault isolation: Availability Zones. VM SLA 99.99%. Protection from entire datacenter failures.
     • Global Scale: Region pairs (54 regions). Fault isolation: regions (100+ miles apart). Protection from disaster with data residency compliance.

  5. VM Resiliency

  6. From Average Availability to AIR
     • 99.999% availability is not enough: it still allows about 26 sec of downtime per month (1 reboot per VM?)
     • AIR = Annual Interruption Rate (incl. Planned Maintenance): the chance of a VM rebooting in a year
     • 1 / AIR = mean time to fail for a VM (in years)
     • Differentiating between types of impact
     Continued Improvement:
     • Delivered: 3x reduction since March
     • Planning: 6x reduction by December
     • Internal project “Rainier”: AIR “pollution” improvements
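
The downtime figure on this slide can be checked with a few lines of arithmetic. The sketch below is not from the talk: it assumes a 30-day month, treats AIR as the expected number of interruptions per VM per year, and uses illustrative function names.

```python
# Illustrative arithmetic only; the 30-day month is an assumption.
SECONDS_PER_MONTH = 30 * 24 * 60 * 60


def monthly_downtime_seconds(availability_percent: float) -> float:
    """Downtime budget per 30-day month for a given availability percentage."""
    return SECONDS_PER_MONTH * (1 - availability_percent / 100)


def mean_years_to_interruption(air: float) -> float:
    """Mean time to fail in years, using the slide's 1 / AIR relation.

    `air` is assumed to be the expected interruptions per VM per year.
    """
    if air <= 0:
        raise ValueError("AIR must be positive")
    return 1 / air


for availability in (99.9, 99.95, 99.99, 99.999):
    seconds = monthly_downtime_seconds(availability)
    print(f"{availability}% -> {seconds:.1f} s of downtime per month")
```

At 99.999% the budget comes out at roughly 26 seconds per month, which is about the length of a single short VM pause or reboot. That is why the slide argues for counting interruptions (AIR) rather than only averaging availability.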

  7. Analyzing VM Downtime
     Unexpected/unpredicted downtime
     • Hardware or software crash, or platform failure
     • Automatic service healing for impacted VMs
     • Fault isolation domains: Server, Rack, Data Center, Availability Zone, Region
     Unplanned hardware maintenance events
     • Trigger: Azure predicts that the hardware or platform is about to fail
     • Use Live Migration to evict the node (if possible)
     • Otherwise, heal the VM onto a new node (reboot)
     • Coming: when Live Migration can’t complete (e.g. specific HW failure type), allow the customer to trigger healing
     Planned maintenance events
     • Impact-less maintenance: control plane components
     • Reboot-less maintenance: pausing a VM to apply maintenance to the underlying hosting environment
     • Maintenance requiring reboots: platform maintenance (getting more and more rare) and HW decommissioning

  8. How does the update mechanism work?
     [Flow diagram. Steps: update with zero impact if possible; choose the least impactful in-place update; Live Migration; opportunistic migration; offer a self-maintenance window of 30 days. Branch labels: “if not feasible”; “if not feasible and host reboot required”; “if not feasible and VM is extremely sensitive to updates”.]
     In-place platform update: June/July 2018, L1TF (Foreshadow) vulnerability mitigation

  9. In-Place vs Live Migration in Azure
     In-Place Migration
     • Secure & fast: fast end-to-end in-place update; minimal coordination
     • Predictable: not dependent on customer payload
     • Safe: continuous deployment pipeline with health feedback and machine-learned baseline
     • High VM eligibility for different flavors
     • Low impact + diverse: many flavors to minimize observable impact; impact is unobservable in most cases. Exceptions can average a 13 sec CPU pause (improvements in progress).
     Live Migration
     • Allows recovery from failures at the node or rack level
     • Supports platform changes which can’t be done with in-place migration (e.g. hardware component change)
     • Impact depends on customer traffic and patterns; some VMs might be too big to move
     • Supported on all standard VMs (exceptions: G, M, N* and some H series; work in progress to reduce these)
     • Low impact: CPU pause averages 1.7 sec for premium storage, 2.8 sec for standard storage
     [Diagram: Node A and Node B]

  10. Planned Maintenance evolution in 2018
      Reduce Impact
      • Jan 2018: (Spectre/Meltdown) fleet reboot
      • Spring 2018: reboot-less security rollout
      • 2018: reduced downtime with Live and In-Place Migration (for standard VM sizes)
      Provide Control
      • Jan 2018: controlled reboot maintenance experience (“Planned Maintenance”)
      • Aug 2018: added support for VMSS
      • Aug 2018: added support for HW decommissioning
      Communicate
      • Improved email coverage (email validation)
      • Maintenance dashboard and alerts supporting multiple resource types
      • In-VM notification (see next)
      [Diagram labels: Impact-less, Live Migration, In-Place Migration, VM Reboots, Email Notification, Notification only, Maintenance dashboard, Control over reboots, Scheduled Events, Maintenance Sensitive]

  11. How Is Azure Minimizing Downtime with Azure AI?

  12. Case Study for ML-Driven Availability: Disk Failure Prediction
      Online Prediction and Customer Protection
      • Goal: minimize VM reboots due to disk failures by triggering Live Migration (moving VMs to a healthy node with only a few seconds of blackout time)
      • Pipeline: offline training (Cosmos + TLC), online prediction, marking bad nodes, live-migrating the workload
      • Reference: “Improving Service Availability of Cloud Systems by Predicting Disk Error”, USENIX ATC 2018
      [Diagram: Azure clusters with nodes N1 and N2 hosting VMs]

  13. Reminder - Azure Scheduled Events: Reacting to maintenance events... before they happen
      • Upcoming maintenance events surfaced from within your VM to improve availability
      • A local endpoint with a simple REST API
      • Visibility into upcoming events across all offers: VMs, cloud service / Availability Set / VMSS
      • A NotBefore time (10-15 minutes notification)
      • Acknowledge completion to expedite
      Potential use cases
      • Graceful shutdown: save state, drain node, suspend jobs
      • Proactive failover: faster failover (skip detection)
      • Adjust thresholds: avoid failover in the case of VM-preserving maintenance
      Cover all maintenance scenarios
      • Platform initiated
      • In-place low-impact maintenance & Live Migration
      • Interactive user calls (e.g. restart a VM)
      • New: predictable hardware failures
      Request:
          curl -H Metadata:true "http://169.254.169.254/metadata/scheduledevents?api-version=2017-08-01"
      Response shape:
          {
            "DocumentIncarnation": {IncarnationID},
            "Events": [
              {
                "EventId": {eventID},
                "EventType": "Reboot" | "Redeploy" | "Freeze",
                "ResourceType": "VirtualMachine",
                "Resources": [{resourceName}],
                "EventStatus": "Scheduled" | "Started",
                "NotBefore": {timeInUTC}
              }
            ]
          }
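
As a rough illustration of consuming this endpoint from inside a VM, the sketch below fetches the document and picks out events that have not started yet. It is an assumption-laden example rather than the presenters' code: the function names and the choice of which event types count as disruptive are hypothetical, and only the URL, the `Metadata:true` header and the field names come from the slide.

```python
import json
import urllib.request

# Endpoint and header as shown on the slide; reachable only from inside an Azure VM.
ENDPOINT = "http://169.254.169.254/metadata/scheduledevents?api-version=2017-08-01"


def fetch_scheduled_events(timeout: float = 5.0) -> dict:
    """Fetch the scheduled-events document from the local metadata endpoint."""
    request = urllib.request.Request(ENDPOINT, headers={"Metadata": "true"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def pending_disruptive_events(document: dict,
                              disruptive_types=("Reboot", "Redeploy")) -> list:
    """Return events that are still scheduled and whose type is in disruptive_types.

    Treating only Reboot and Redeploy as disruptive is an illustrative choice;
    a latency-sensitive workload might also include "Freeze".
    """
    return [
        event
        for event in document.get("Events", [])
        if event.get("EventStatus") == "Scheduled"
        and event.get("EventType") in disruptive_types
    ]
```

A polling loop would call `fetch_scheduled_events()` periodically and, for each event returned by `pending_disruptive_events`, drain the node or trigger failover before the `NotBefore` time.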

  14. Scheduled Events – 2018 Updates
      Opt-out from user-initiated operations
      • You can test your response to scheduled events by issuing VM operations (restart, redeploy)
      • However, if you use those operations to run your business (e.g. scale), you don’t want to wait the extra 15 minutes
      • POST to the scheduled events endpoint with the following body:
            { "Configuration": { "UserEvents": true | false } }
      Longer duration for full size VMs
      • Provide longer notification time to full node size VMs
      • Preview starting soon

  15. Demo: Scheduled Events & Serverless

  16. React to Azure Scheduled Events from outside the VM
      Goal
      • Forward Azure Scheduled Events from within the VM using Event Grid
      Use Cases
      • Forward notification (alert)
      • Monitoring, logging, auditing
      • Proactive failover
      • Turn on a stand-by server
      • Take the machine off the LB
      Approach
      • ScheduledEventsExtension: VM extension for Azure Virtual Machines using Serverless https://github.com/zivraf/ScheduledEvents
      • Custom script to deploy the extension https://github.com/zivraf/ScheduledEvents/tree/master/setup/linux
      • Build your own serverless application

  17. High Availability in Azure

  18. Most comprehensive resiliency and SLA
      • Legacy Apps: Single VM. VM SLA 99.9%. Protection with Premium Storage.
      • Intra-DC Isolation: Availability Sets & VM Scale Sets. Fault isolation: compute racks & storage stamps. VM SLA 99.95%. Protection against failures within datacenters.
      • Regional Availability: Single VM & VM Scale Set. Fault isolation: Availability Zones. VM SLA 99.99%. Protection from entire datacenter failures.
      • Global Scale: Region pairs (54 regions). Fault isolation: regions (100+ miles apart). Protection from disaster with data residency compliance.

  19. “Poor Man’s” High Availability - Scenario
      For cases where running two VMs is just too expensive:
      • Prepare 2 VMs in an Availability Set
      • Keep one of them turned off
      • When you plan to update one of the VMs, turn on the stand-by VM
      • Turn it off again when done
      Gaps: time to recover, potential for data loss, update coordination
      Value: easy and better than nothing

  20. Intra-DC HA with Availability Sets & VM Scale Sets (*)
      Managed Availability Sets
      • A single deployment spanning fault isolation boundaries
      • Platform-provided fault domains (FD)
      • User controls the FD count
      • Compute and storage alignment
      • Note: not a collocation constraint
      (*) VMSS support
      • Disable scaling beyond 100 instances
      • Exclude low-priority VMSS
      • Coming soon: storage FD alignment
      [Diagram: an Availability Set with VMs in FD0, FD1 and FD2; managed storage accounts 1, 2 and 3 on Storage FD0, FD1 and FD2. Disks on separate storage FDs & aligned with VM FDs.]

  21. South Central US, 4 September 2018 • Electrical storm, power sags & swells • [Within 30 minutes] Electrical storm, power sags & swells (different datacenter) • Temperature increase & “graceful shutdown” to protect infrastructure • Bringing the datacenter back online, storage recovery • Services impacted due to dependencies • Service failover resulting in throttling

  22. Availability Zones
      Protect applications from DC-level failures.
      • An Availability Zone is made up of one or more physical datacenters.
      • Each AZ is equipped with independent power, network and cooling.
      Enable low-latency synchronous replication.
      • Inter-AZ latency diameter (VM-to-VM round trip) of <2 ms.
      • A minimum of three AZs in every supported region.
      Available in the following regions, with more coming soon:
      • Central US
      • West US 2
      • West Europe
      • North Europe
      • France Central
      • East US 2 (Preview)
      • Southeast Asia (Preview)
      [Diagram: a Region containing Zone 1, Zone 2 and Zone 3]

  23. Zone-aware services: User-configured and automatically replicated options
      • Zonal services: user configured
      • Zone-redundant services: platform replicates across three zones
      Services shown: VPN Gateway, Load Balancer Standard (Zone Redundant), Managed Disks, Virtual Machines, Service Bus, Application Gateway, Zone Redundant Storage (ZRS), ExpressRoute, Event Hubs, Virtual Machine Scale Set

  24. ARM template – Zonal VM
      Add the Managed Disk Resource:
          {
            "apiVersion": "2017-03-30",
            "type": "Microsoft.Compute/disks",
            "name": "myManagedDataDisk",
            "location": "[resourceGroup().location]",
            "zones": ["1"],
            "properties": {
              "creationData": { "createOption": "Empty" },
              "accountType": "[parameters('storageAccountType')]",
              "diskSizeGB": 64
            }
          }
      Add the VIP Resource:
          {
            "apiVersion": "2017-08-01",
            "type": "Microsoft.Network/publicIPAddresses",
            "name": "[variables('publicIPAddressName')]",
            "location": "[resourceGroup().location]",
            "sku": { "name": "Standard" },
            "properties": {
              "publicIPAllocationMethod": "Static",
              "dnsSettings": {
                "domainNameLabel": "[parameters('dnsLabelPrefix')]"
              }
            }
          }
      Add the Compute Resource:
          {
            "apiVersion": "2017-03-30",
            "type": "Microsoft.Compute/virtualMachines",
            "name": "[variables('vmName')]",
            "location": "[resourceGroup().location]",
            "zones": ["1"],
            "dependsOn": [ ... ],
            "properties": {
              "hardwareProfile": { "vmSize": "[parameters('vmSize')]" },
              "osProfile": { ... }
            }
          }

  25. ARM template – Zone-redundant services
      Zone-redundant VMSS:
          {
            "apiVersion": "2017-03-30",
            "type": "Microsoft.Compute/virtualMachineScaleSets",
            "name": "[parameters('vmssName')]",
            "zones": ["1", "2", "3"],
            "location": "[resourceGroup().location]",
            "dependsOn": [ ... ],
            "sku": { ... },
            "properties": { ... }
          }
      Zone-redundant SQLDB:
          {
            "apiVersion": "2014-04-01",
            "type": "Microsoft.Sql/servers",
            "name": "[variables('sqlServerName')]",
            "location": "[resourceGroup().location]",
            "zoneRedundant": "true",
            "properties": { ... }
          }
      Zone-redundant LB:
          {
            "apiVersion": "2017-08-01",
            "type": "Microsoft.Network/loadBalancers",
            "name": "[variables('loadBalancerName')]",
            "location": "[resourceGroup().location]",
            "sku": { "name": "Standard" }
          }

  26. Your HA Strategy - Evaluating the options
      • Go global when: data replication enables it; latency allows
      • Use Availability Zones when: the region supports them; latency allows
      • Otherwise, rely on fault domains
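
The decision order on this slide can be written down as a small function. This is only one reading of the slide: the `Workload` fields and the returned labels are hypothetical names introduced for illustration, and real decisions would weigh cost and compliance as well.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    """Hypothetical inputs mirroring the conditions named on the slide."""
    replication_supports_multi_region: bool
    tolerates_cross_region_latency: bool
    region_supports_zones: bool
    tolerates_cross_zone_latency: bool


def choose_ha_strategy(workload: Workload) -> str:
    """Apply the slide's order: global first, then zones, then fault domains."""
    if (workload.replication_supports_multi_region
            and workload.tolerates_cross_region_latency):
        return "multi-region"
    if workload.region_supports_zones and workload.tolerates_cross_zone_latency:
        return "availability-zones"
    return "fault-domains"
```

For example, a workload whose data layer cannot replicate across regions but which runs in a zone-enabled region and tolerates the inter-zone round trip would land on Availability Zones.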

  27. Test your readiness – Chaos Scenarios
      Identify your chaos scenarios
      • Power loss scenarios using the VM API: VM, FD, AZ
      • Network disconnect using NSGs: NIC, Subnet
      For example: https://github.com/raj-ganapathy-msft/AzureFI

  28. Adobe Experience Manager (AEM) & Adobe Sign: High Availability in Azure Use Cases. Dave Beus, Jayan Kandathil. BRK3336

  29. Adobe Experience Manager (AEM) on Azure Availability Zones (AZs)

  30. What is AEM Adobe’s enterprise-class [Web Content], [Asset] and [Forms] management software Three separate workloads (tiers) Scales horizontally as farms of compute nodes Dispatcher (“Content Caching”) Publisher (“Content Publishing”) Author (“Content Authoring”) • Serve content from memory • Scales horizontally with Publishers • Serve content from disk • Scales horizontally • Create or upload new content or change existing content • Replicate “activated” content to Publishers

  31. Technology Stack
      • Dispatcher (“Content Caching”): Red Hat Enterprise Linux 7.4; Apache HTTP Server 2.4
      • Author (“Content Authoring”): Red Hat Enterprise Linux 7.4; OSGi Java Runtime (Apache Felix)
      • Publisher (“Content Publishing”): Red Hat Enterprise Linux 7.4; OSGi Java Runtime (Apache Felix)

  32. AEM on Azure • ~One year of experience running PROD workloads • 30+ customers • 500+ VMs • 1,000+ Managed Disks • Active in 10+ Azure Regions • Most popular? [West US 2], followed by [West Europe]

  33. AEM Offerings - by Application Availability SLA
      • 99.50%: Single-Region, No AZ
      • 99.90%: Single-Region, Multi-AZ
      • 99.95%: Multi-Region, Multi-AZ

  34. 99.90% SLA (Content Delivery): Single-Region, Multi-AZ

  35. 99.95% SLA (Content Delivery): Multi-Region, Multi-AZ

  36. Health Checks
      • Traffic Manager health checks should be lightweight
      • App Gateway health checks should be application-aware (lesson learned from an AZ-down drill)
        • PINGs, [200 OK] from a login dialog, etc. are not enough
        • App Gateway sent traffic to a Dispatcher when its Publisher was not yet ready
      • Get the application owner to create the health check: Login; Invoke User GUI Layer; Invoke API Layer
      • Pay attention to timeouts and DNS TTLs (1 minute)
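
One way to picture an application-aware health check is a handler that reports healthy only when every dependent layer answers. The sketch below is a generic illustration, not Adobe's implementation: the check names and the 200/503 convention are assumptions.

```python
from typing import Callable, Dict, Tuple


def run_health_checks(
    checks: Dict[str, Callable[[], bool]]
) -> Tuple[int, Dict[str, bool]]:
    """Run every named check and return (HTTP status, per-check results).

    Returns 200 only if at least one check is configured and all of them pass;
    a check that raises is recorded as failed rather than crashing the probe.
    """
    results: Dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    status = 200 if results and all(results.values()) else 503
    return status, results
```

The probe endpoint behind the load balancer would call this with checks such as a login round trip, a GUI-layer request and an API-layer request (all hypothetical here), so that a Dispatcher whose Publisher is not ready reports 503 and is taken out of rotation.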

  37. Adobe Sign on Azure Availability Zones

  38. Adobe Sign – What is it?
      Adobe Sign is Adobe’s e-signature service that lets you replace paper and ink signature processes with fully automated electronic signature workflows
      Sign Components
      • Compute (VMSS) based components (20+ VM Scale Sets)
        • Web and App (Apache & Tomcat)
        • Task Workers (each a VMSS): Virus Scan; CutyCapt (Conversion to PDF); Adobe & Office (Conversion to PDF); Raster; CDS (Certified Document Service)
        • MySQL Database Store
      • Key infrastructure and PaaS based components
        • CloudHSM (KeyVault)
        • File Store (Azure Block BLOB)
        • Azure Private DNS
        • Azure Standard Load Balancer

  39. Adobe Sign – High Availability Architecture (Simplified)
      [Diagram: in Region 1, User 1 and User 2 connect over HTTPS to a zone-redundant Load Balancer spanning Availability Zones 1, 2 and 3. Tiers: Web Server (Apache HTTP) on a zone-redundant VMSS; App Server (Apache Tomcat) on a zone-redundant VMSS, reached over HTTP; Database (MySQL) on a zone-redundant VMSS over TCP 3306, with one master (read/write) and two slaves (read only) using Tungsten replication; Azure BLOB Storage on Zone Redundant Storage with synchronous writes, accessed over HTTPS for BLOB gets/puts.]

  40. Adobe Sign – AZ Failure Process (Simplified)
      [Diagram: the Region 1 architecture during an Availability Zone failure. Zone-redundant Load Balancer across Availability Zones 1, 2 and 3; Web Server (Apache HTTP), App Server (Apache Tomcat) and Database (MySQL: master read/write, slaves read only, TCP 3306, Tungsten replication) each on a zone-redundant VMSS; Azure BLOB Storage on Zone Redundant Storage with synchronous writes.]

  41. Key Take-Aways • Verify your Azure services and regions support AZs • Tune your health checks • Make sure your health checks are application-aware • Recovery should be automatic • Test failure and recovery designs • Over provision # of VMs in VMSS or use autoscale rules • Use ZRS Storage for HA and GRS/RAGRS Storage for DR

  42. Disaster Recovery in Azure

  43. High Availability Vs. Disaster Recovery • Introducing RPO and RTO • Setting RTO = 0 and RPO = 0 may not be a business requirement • It could also be expensive and/or performance impacting • Outages may impact the entire region • Hurricane, Tsunami, and earthquakes • Manmade disasters • You wish to be able to recover from a remote location (100s miles apart) • You may be required to keep your data residency

  44. Azure Site Recovery: The Complete Migration & Disaster Recovery
      [Diagram labels: Private cloud to Azure; Azure to Azure; Any Cloud; Physical, VMware, Hyper-V; Any OS: Linux, Windows]

  45. Azure Site Recovery – Key capabilities and considerations
      • Automated protection and replication
      • Continuous log-based replication
      • Best-in-class RPO and RTO
      • No-impact DR drills with Test Failover
      • Orchestrated Recovery Plans for Disaster Recovery
      • Centralized monitoring and alerting
      • Failback support

  46. Announcing: Azure Site Recovery (ASR)
      • Cross-Subscription DR
        • Ability to isolate DR resources
        • Help in managing billing and access control
      • DR for Encrypted VMs
        • Support for VMs using Azure Disk Encryption (ADE)
        • Simplified key replication across regions
      • DR for VMs in Availability Zones (Preview)
        • Leverage both levels of resiliency
        • Retain your application HA across the DR site

  47. FastTrack for Azure: Build Azure solutions quickly and confidently
      • Customer Benefits
        • Direct assistance from Azure engineers and program managers
        • Use proven practices and tools from real customer experiences
        • Accelerated deployment to full production of Azure solutions
      • Discovery: Validate project scope, vision, requirements and assess architectural needs
      • Solution Enablement: Guidance on solution architecture design using proven practices and design principles, in addition to providing advice and assistance to facilitate solution PoC and dev/test environments
      • Deployment: Support in-house customer and/or partner led deployment of Azure solution
      • Continuous Partnership: Provide periodic check-ins and address additional workloads and deployment needs
      Learn more: FastTrack for Azure booth @ Deployment area near the Expo center landmark • Azure.com/FastTrack

  48. Please evaluate this session. Your feedback is important to us! Please evaluate this session through MyEvaluations on the mobile app or website. Download the app: https://aka.ms/ignite.mobileApp Go to the website: https://myignite.techcommunity.microsoft.com/evaluations

  49. Key Takeaways: • The right service architecture can dramatically increase availability and end-user satisfaction • Azure is focused on offering the right scenarios for reliable and performant implementation of VM-based services Questions?
