Microsoft and Cloud Computing             [10 minutes] Introduction to Windows Azure      [35 minutes] Research Applic - PowerPoint PPT Presentation
Presentation Transcript

  1. Windows Azure for Research
Roger Barga & Jared Jackson
Contributors include Nelson Araujo, Dennis Gannon and Wei Lu
Cloud Computing Futures Group, Microsoft Research

  2. Presentation Outline
• Microsoft and Cloud Computing [10 minutes]
• Introduction to Windows Azure [35 minutes]
• Research Applications on Azure, demos [10 minutes]
• How They Were Built [15 minutes]
• A Closer Look at Azure [15 minutes]
• Cloud Research Engagement Initiative [5 minutes]
• Q&A [*]

  3. Microsoft and Cloud Computing

  4. Science 2020 “In the last two decades advances in computing technology, from processing speed to network capacity and the Internet, have revolutionized the way scientists work. From sequencing genomes to monitoring the Earth's climate, many recent scientific advances would not have been possible without a parallel increase in computing power - and with revolutionary technologies such as the quantum computer edging towards reality, what will the relationship between computing and science bring us over the next 15 years?”

  5. Sapir–Whorf: Context and Research
Sapir–Whorf Hypothesis (SWH)
• Language influences the habitual thought of its speakers
Scientific computing analog
• Available systems shape research agendas
Consider some past examples
• Cray-1 and vector computing
• VAX 11/780 and UNIX
• Workstations and Ethernet
• PCs and the web
• Inexpensive clusters and Grids
Today's examples
• Multicore, sensors, clouds and services …
What lessons can we draw?

  6. The Pull of Economics …
Moore's "Law" favored consumer commodities
• Economics drove enormous improvements
• Specialized processors and mainframes faltered
• The commodity software industry was born
Today's economics
• Manycore processors/accelerators
• Software as a service / cloud computing
• Multidisciplinary data analysis and fusion
This is driving change in research and technical computing
• Just as "killer micros" and inexpensive clusters did
[Diagram: heterogeneous manycore die with low-power (LPIA) x86 cores, out-of-order x86 cores, per-core caches, GPUs, DRAM controllers, PCIe and a network-on-chip (NoC)]

  7. Clouds are built on Data Centers
• Range in size from "edge" facilities to megascale
• Economies of scale
• Approximate costs for a small center (1,000 servers) and a larger, 100K-server center
Each data center is 11.5 times the size of a football field

  8. Containers: Separating Concerns

  9. Microsoft Advances in DC Deployment
Conquering complexity
• Building racks of servers and complex cooling systems separately is not efficient
• Package and deploy into bigger units (JITD)

  10. Data Centers and HPC Clusters – select comparisons • Node and system architectures • Communication fabric • Storage systems • Reliability and resilience • Programming model and services

  11. Data Centers and HPC Clusters – select comparisons
• Node and system architectures
  • Node architectures are indistinguishable: Intel Nehalem, AMD Barcelona or Shanghai, multiple processors, a big chunk of memory per node
• Communication fabric
• Storage systems
• Reliability and resilience
• Programming model and services

  12. Data Centers and HPC Clusters – select comparisons • Node and system architectures • Communication fabric

  13. Data Centers and HPC Clusters – select comparisons
• Node and system architectures
• Communication fabric
• Storage systems
  • HPC: local scratch or non-existent; secondary is SAN or PFS; PB tertiary storage
  • DC: TB local storage; secondary is JBOD; tertiary is non-existent

  14. Data Centers and HPC Clusters – select comparisons
• Node and system architectures
• Communication fabric
• Storage systems
• Reliability and resilience
  • HPC: periodic checkpoints, rollback and resume in response to failures; MTBF approaching zero, checkpoint frequency increasing, I/O demand intolerable
  • DC: loosely consistent models, designed to transparently recover from failures

  15. Data Centers and HPC Clusters – select comparisons • Node and system architectures • Communication fabric • Storage systems • Reliability and resilience • Programming model and services

  16. Platform Extension to Cloud is a Continuum

  17. Windows Azure in a Nutshell

  18. A bunch of machines in a data center
The Azure Fabric Controller (FC), a highly-available service, owns this hardware.

  19. FC Installs An Optimized Hypervisor

  20. FC Installs A Host Virtual Machine (VM)

  21. FC then Installs the Guest VM

  22. Up to 7 of Them to be Exact

  23. Each VM Has…
At minimum:
• CPU: 1.5–1.7 GHz x64
• Memory: 1.7 GB
• Network: 100+ Mbps
• Local storage: 500 GB
Up to:
• CPU: 8 cores
• Memory: 14.2 GB
• Local storage: 2+ TB

  24. FC Then Installs the Azure Platform Compute Storage

  25. Windows Azure Compute Service – a closer look
[Diagram: the load balancer routes HTTP to a Web Role (IIS hosting ASP.NET, WCF, etc.); work flows on to a Worker Role (main() { … }); each role instance runs in a VM with an agent, managed by the fabric]

  26. Suggested Application Model – using queues for reliable messaging
1) The Web Role (ASP.NET, WCF, etc.) receives work
2) The Web Role puts the work in a queue
3) A Worker Role (main() { … }) gets work from the queue
4) The Worker Role does the work
To scale, add more of either role.
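The numbered flow above can be sketched locally using Python's standard-library queue as a stand-in for an Azure queue. This is a simulation of the pattern only, not the Azure Storage API; the roles are plain functions and threads:

```python
import queue
import threading

# Local stand-in for an Azure queue: the web role enqueues work items,
# worker roles dequeue and process them independently.
work_queue = queue.Queue()
results = []
results_lock = threading.Lock()

def web_role(items):
    """1) Receive work, 2) put each item in the queue."""
    for item in items:
        work_queue.put(item)

def worker_role():
    """3) Get work from the queue, 4) do the work."""
    while True:
        try:
            item = work_queue.get(timeout=0.5)
        except queue.Empty:
            return  # queue drained: this worker exits
        with results_lock:
            results.append(item * item)  # placeholder "work"
        work_queue.task_done()

web_role(range(10))
# Scaling out is just starting more worker threads (role instances).
workers = [threading.Thread(target=worker_role) for _ in range(3)]
for w in workers:
    w.start()
for w in workers:
    w.join()

print(sorted(results))  # squares of 0..9
```

Because the queue decouples the two roles, either side can be scaled independently, which is exactly the property the slide is advertising.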

  27. Scalable, Fault Tolerant Applications
• Queues are the application glue
  • Decouple parts of the application, making them easier to scale independently
  • Resource allocation: different priority queues and backend servers
  • Mask faults in worker roles (reliable messaging)
• Use inter-role communication for performance
  • TCP communication between role instances
  • Define your ports in the service model

  28. Storage
[Diagram: Blob, Queue and Table storage exposed through a REST API behind the load balancer]

  29. Azure Storage Service – a closer look
[Diagram: applications reach Blobs, Drives, Tables and Queues over HTTP; storage sits alongside compute on the fabric]

  30. Windows Azure Storage – points of interest
Storage types
• Blobs: simple interface for storing named files along with metadata for the file
• Drives: durable NTFS volumes
• Tables: entity-based storage; not relational – entities contain a set of properties
• Queues: reliable message-based communication
Access
• Data is exposed via .NET and RESTful interfaces
• Data can be accessed by Windows Azure apps, other cloud applications, or on-premise applications
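Since the data is exposed over REST, any language can talk to it. The sketch below only *builds* a Put Blob request rather than sending one; the account, container and blob names are hypothetical, and a real request additionally needs an Authorization header signed with the storage account's shared key:

```python
# Sketch of constructing (not sending) a REST request to store a block blob.
# ACCOUNT, container and blob names are hypothetical; a real request must
# also carry a SharedKey-signed Authorization header and an x-ms-version.
ACCOUNT = "myaccount"  # hypothetical storage account name

def blob_put_request(container, blob_name, body):
    """Return the URL and headers for a block-blob PUT."""
    url = f"https://{ACCOUNT}.blob.core.windows.net/{container}/{blob_name}"
    headers = {
        "x-ms-blob-type": "BlockBlob",      # store as a simple block blob
        "Content-Length": str(len(body)),   # size of the uploaded bytes
    }
    return url, headers

url, headers = blob_put_request("genomes", "ecoli.fasta", b"ACGT")
print(url)
```

The same named-file-plus-metadata model shows through here: the container/blob pair is the name, and any extra metadata would travel as additional `x-ms-meta-*` headers.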

  31. Development Environment
[Diagram: develop locally against the Development Fabric and Development Storage, under source control; the application works locally before deployment]

  32. In the Cloud
[Diagram: the application that works locally is deployed to the cloud, where it runs in staging]

  33. Windows Azure Platform Basics – what's the 'value add'?
Provide a platform that is scalable and available
• Services are always running; rolling upgrades/downgrades
• Failure of any node is expected; state has to be replicated
• Failure of a role (app code) is expected; automatic recovery
• Services can grow to be large; provide state management that scales automatically
• Handle dynamic configuration changes due to load or failure
• Manage data center hardware: from CPU cores, nodes and racks to network infrastructure and load balancers

  34. Windows Azure Compute Fabric – Fabric Controller
• Owns all data center hardware
• Uses inventory to host services
• Deploys applications to free resources
• Maintains the health of those applications
• Maintains the health of the hardware
• Manages the service life cycle starting from bare metal

  35. Windows Azure Compute Fabric – Fault Domains
Purpose: avoid single points of failure
• A fault domain is a unit of failure, e.g. a compute node or a rack of machines
• The system considers fault domains when allocating service roles
• The service owner assigns the number required by each role
• Example: 10 front-ends, allocated across 2 fault domains
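The slide's example (10 front-ends across 2 fault domains) can be sketched as a simple round-robin placement. This is a minimal illustration of the allocation policy, not the Fabric Controller's actual algorithm:

```python
# Round-robin placement of role instances across fault domains.
# A sketch of the policy only; names like "frontend-N" are hypothetical.
def allocate(instance_count, fault_domain_count):
    """Spread instances evenly so no single domain failure takes out the role."""
    placement = {fd: [] for fd in range(fault_domain_count)}
    for i in range(instance_count):
        placement[i % fault_domain_count].append(f"frontend-{i}")
    return placement

placement = allocate(10, 2)  # the slide's example
sizes = {fd: len(roles) for fd, roles in placement.items()}
print(sizes)  # 5 instances land in each of the 2 fault domains
```

Losing either rack then removes at most half of the front-ends, which is the point of spreading the allocation.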

  36. Windows Azure Compute Fabric – Update Domains
Purpose: ensure the service stays up while undergoing an update
• An update domain is a unit of software/configuration update, e.g. a set of nodes updated together
• Used when rolling forward or backward
• The developer assigns the number required by each role
• Example: 10 front-ends, allocated across 5 update domains
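The rolling-update idea can be sketched as walking the update domains one at a time, so only a fraction of instances is ever offline. A sketch under stated assumptions, not FC internals; the instance names are hypothetical:

```python
# Rolling upgrade one update domain at a time: while one domain's batch
# is being upgraded, instances in all other domains keep serving traffic.
def rolling_update(placement):
    """placement: dict update_domain -> list of instance names."""
    for ud in sorted(placement):
        batch = placement[ud]
        for instance in batch:
            pass  # stop / upgrade / restart this instance would go here
        yield ud, batch

# The slide's example: 10 front-ends across 5 update domains (2 per domain).
instances = {ud: [f"web-{ud}-{i}" for i in range(2)] for ud in range(5)}
upgraded_order = [ud for ud, _ in rolling_update(instances)]
print(upgraded_order)
```

At any step at most 2 of the 10 instances (one domain's worth) are down, so the service as a whole stays up throughout the update.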

  37. Windows Azure Compute Fabric – Push-button Deployment
Step 1: Allocate nodes, across fault domains and update domains
Step 2: Place OS and role images on nodes
Step 3: Configure settings
Step 4: Start roles
Step 5: Configure load balancers
Step 6: Maintain the desired number of roles
• Failed roles are automatically restarted
• Node failure results in new nodes being automatically allocated

  38. Windows Azure Compute Fabric – The FC Keeps Your Service Running
The FC monitors the health of roles
• It detects if a role dies; a role can also indicate that it is unhealthy
• The current state of the node is updated appropriately
• The state machine kicks in again to drive the service back to its goal state
The FC monitors the health of the host
• If a node goes offline, the FC will try to recover it
• If a failed node can't be recovered, the FC migrates role instances to a new node: a suitable replacement location is found and existing role instances are notified of the change
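The "drive back to the goal state" behavior is a reconciliation loop: compare observed role health against the desired instance count and act on the difference. A simplified sketch with hypothetical names; the real FC also handles node recovery and instance migration:

```python
# Goal-state reconciliation sketch: restart failed roles and allocate
# replacements until the observed state matches the desired count.
def reconcile(desired_count, roles):
    """roles: dict of instance name -> 'healthy' or 'failed'.
    Mutates roles toward the goal state and returns the actions taken."""
    actions = []
    for name, state in roles.items():
        if state == "failed":
            roles[name] = "healthy"          # restart the role in place
            actions.append(("restart", name))
    missing = desired_count - len(roles)     # e.g. instances lost with a node
    for i in range(missing):
        name = f"replacement-{i}"            # hypothetical new instance
        roles[name] = "healthy"
        actions.append(("allocate", name))
    return actions

roles = {"web-0": "healthy", "web-1": "failed"}  # one instance died, one lost
actions = reconcile(3, roles)
print(actions)
```

Running the same function again is a no-op once the goal state is reached, which is what makes a state-machine-driven controller safe to re-trigger on every health event.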

  39. Windows Azure – Key Takeaways
Cloud services have specific design considerations
• Always on, distributed state, large scale, fault tolerance
• Scalable infrastructure demands a scalable architecture: stateless roles and durable queues
Windows Azure frees service developers from many platform issues
Windows Azure manages both services and servers

  40. Cloud Research Engagement

  41. Azure Applications
Demonstrating scientific research applications in the cloud:
• AzureBLAST – finding similarities in genetic sequences
• Azure Ocean – rich client visualization with cloud-based data computation
• Azure MODIS – imagery analysis from satellite photos
• PhyloD – finding relationships in phylogenetic trees

  42. AzureBLAST Demonstration

  43. Azure Ocean Demonstration

  44. Azure MODIS Overview
Two satellites:
• Terra, "EOS AM", launched 12/1999; descending, equator crossing at 10:30 AM
• Aqua, "EOS PM", launched 5/2002; ascending, equator crossing at 1:30 PM
Near-polar orbits, day/night mode, ~2300 km swath
L0 (raw) and L1 (calibrated) data held at the Goddard DAAC
L2 and L3 products made by a collection of different algorithms provided by a number of different researchers

  45. Azure MODIS
[Diagram: the AzureMODIS service – a web role portal feeds a download queue; data flows through a Data Collection stage, a Reprojection stage, a Derivation Reduction stage and an Analysis Reduction stage to produce research results]

  46. PhyloD Overview
• Statistical tool used to analyze DNA of HIV from large studies of infected patients
• Developed by Microsoft Research and has been highly impactful [cover of PLoS Biology, November 2008]
• Small but important group of researchers: hundreds of HIV and HepC researchers actively use it, and thousands of research communities rely on its results
• Typical job: 10–20 CPU hours; extreme jobs require 1K–2K CPU hours
• Requires a large number of test runs for a given job (1–10M tests)
• Highly compressed data per job (~100 KB per job)

  47. AzureBLAST – Looking Deeper
Step 1: Staging
• Compress the required data (the local sequence database)
• Upload it to the Azure store
• Deploy worker roles; each role's Init() function downloads and decompresses the data to local disk, alongside the deployed BLAST executable

  48. AzureBLAST – Looking Deeper
Step 2: Partitioning a Job
• The web role receives the user's input
• A single partitioning worker role splits the input into partitions in Azure storage and enqueues one queue message per partition
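The partitioning step can be sketched as splitting the input into fixed-size chunks and producing one message per chunk. The message layout and names here are hypothetical, and the queue is simulated with a plain list rather than Azure Queue storage:

```python
# Sketch of Step 2: a single partitioning worker splits the user's input
# into fixed-size partitions and emits one queue message per partition.
def partition_job(sequences, partition_size):
    """Return one message (dict) per partition of the input sequences."""
    messages = []
    for start in range(0, len(sequences), partition_size):
        part = sequences[start:start + partition_size]
        messages.append({
            "partition_id": len(messages),  # which slice this message covers
            "sequences": part,              # in Azure this would be a blob ref
        })
    return messages

job = [f"seq-{i}" for i in range(10)]        # hypothetical input sequences
queue_messages = partition_job(job, 4)       # partitions of up to 4 sequences
print([len(m["sequences"]) for m in queue_messages])  # [4, 4, 2]
```

Each BLAST-ready worker in the next step can then claim one message at a time, so partition size directly controls the grain of parallelism.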

  49. AzureBLAST – Looking Deeper
Step 3: Doing the Work
• BLAST-ready worker roles pick up queue messages and read their input partitions from Azure storage
• BLAST output and logs are written back to Azure storage

  50. AzureBLAST – Some good results