Managing Scale and Complexity of Next Generation HPC Systems and Clouds

Presentation Transcript
  1. Managing Scale and Complexity of Next Generation HPC Systems and Clouds
Peter ffoulkes, Vice President of Marketing
April 2011

  2. The World’s Most Capable Computing Systems Are Powered by Moab
• The world’s largest HPC system, the No. 2-ranked Jaguar, with over 18,500 nodes, 224,000 cores, and a speed of 1.75 petaflop/s
• Half of the top 10 systems
• Over one third of the top 50 systems (17 systems)
• 38% of the compute cores in the top 100 systems
Source: Nov 2010 rankings from www.Top500.org

  3. Oak Ridge National Laboratory
• Jaguar, the second most capable HPC system in the world, running at 1.759 petaflop/s
• 18,686 nodes, 224,256 processing cores, 300 TB of memory
• Diversity of users was severely limiting system workload-management capability
Moab resolved Jaguar’s workload-management problems: it increased system utilization, decreased downtime, and allowed more control over resources.

  4. What’s Next…
• Tsubame 2.0: 2,816 six-core CPUs (16,896 cores) combined with 4,224 of NVIDIA’s Tesla M2050 (448-core) general-purpose GPUs; a dual-rail, non-blocking fabric employing two Voltaire 40 Gb/s InfiniBand connections on each node
• Tianhe-1A: 3,600 nodes with 14,336 six-core CPUs (about 86,000 general-purpose cores) and 7,168 (448-core) GPUs; 160 Gbit/s Galaxy interconnect developed in China

  5. Managing Scale and Complexity
Moab 6.0:
• A new command communication architecture that delivers a 100-fold increase in internal communications throughput
• Support for the most commonly used Moab commands and for grids deploying multiple Moab instances, dramatically increasing the manageability of complex supercomputing environments
• Support for hybrid installations deploying GPGPU technologies in conjunction with TORQUE 2.5.4

  6. Managing GPGPUs
Moab 6.0 and TORQUE 2.5.4:
• Specify GPGPUs in the same manner as CPUs
• GPGPUs are requested as a defined resource
• Applications receive indexed GPGPU information about which GPGPU(s) to access
• Moab’s intelligent scheduling ensures GPGPUs never get oversubscribed
• GPGPU usage is recorded in utilization reports
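The request model above can be illustrated with a TORQUE-style batch script. This is a hedged sketch: the `gpus=` resource syntax and the `$PBS_GPUFILE` device listing are how TORQUE 2.5-era releases exposed GPU requests, but the job name, counts, and the application binary below are illustrative.

```shell
#PBS -N gpgpu-job
#PBS -l nodes=1:ppn=2:gpus=1   # GPUs requested as a defined resource,
                               # in the same manner as CPU cores
#PBS -l walltime=01:00:00

# TORQUE lists the allocated devices (one "<host>-gpu<index>" line each)
# in the file named by $PBS_GPUFILE, giving the application the indexed
# GPGPU information described on the slide.
cat "$PBS_GPUFILE"

./my_cuda_app                  # hypothetical application binary
```

Because the scheduler tracks these indexed allocations per node, two jobs are never handed the same device on the same host, which is what "never oversubscribed" means in practice.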

  7. Managing Scale and Complexity
Moab 6.0:
• New on-demand dynamic provisioning and management capabilities that support both virtual and physical resources, including VM migration for load balancing, workload packing, and consolidation
• Idle-resource management to deliver increased utilization, efficiency, and energy conservation for HPC and enterprise cloud deployments
• Improved administration and reporting, including new parameterized administration functions; enhanced limits for event, group, and account management; and new formats for job and reservation event reporting
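The "workload packing and consolidation" idea can be sketched as a bin-packing pass: migrate VMs so that demand concentrates on as few physical hosts as possible, freeing the remaining hosts to be powered down. This is a minimal first-fit-decreasing illustration, not Moab's actual placement algorithm; the function name, the VM demands, and the 16-core host capacity are hypothetical.

```python
def pack_vms(vm_demands, host_capacity):
    """First-fit-decreasing packing: place each VM (largest CPU demand
    first) on the first host with room, opening a new host only when
    none fits.  Returns a list of hosts, each a list of VM demands."""
    hosts = []
    for demand in sorted(vm_demands, reverse=True):
        for host in hosts:
            if sum(host) + demand <= host_capacity:
                host.append(demand)
                break
        else:
            hosts.append([demand])  # power on (or keep on) one more host
    return hosts

# Six VMs that would naively occupy six hosts fit on two 16-core hosts;
# the idle hosts can then be powered off or repurposed.
placement = pack_vms([8, 6, 5, 4, 4, 3], host_capacity=16)
print(len(placement))  # → 2
```

Real consolidation must also weigh migration cost, memory and I/O demand, and anti-affinity rules, but the core packing decision has this shape.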

  8. Managing Scale and Complexity
Moab Viewpoint 2.0 HPC-as-a-service and HPC cloud capability:
• Creation, management, and status reporting of reservations and job queues for HPC and batch workloads and system maintenance
• On-demand dynamic management of VMs and physical nodes
• Increased scalability to support management of tens of thousands of nodes and hundreds of thousands of VMs
• Flexible security options at installation, including built-in security, Single Sign-On (SSO), or Lightweight Directory Access Protocol (LDAP) models
• Service-based administration and reporting for easy access and management of HPC and cloud resources

  9. University of Cambridge: COSMOS
Overview
COSMOS has expanded:
• New SGI Altix UV1000: six-core Nehalem EX chips, 768 cores, 2 TB of global shared memory
• Existing SGI Altix 4700: 920 cores and 2.5 TB RAM
• Both compute systems are supported by 64 TB of high-performance storage
Challenge
Managing both cluster-based workloads and SMP shared-memory workloads in the same environment

  10. University of Birmingham
Overview
The University of Birmingham’s 1,500-core cluster runs a mixed workload, from many (often hundreds of) short single-core parameter-sweep jobs to massively parallel multi-core computations, some running for over a week.
Challenges
The workload varies, especially across the year, and keeping the whole cluster powered up during less busy periods wastes power. A power-management system must be aware of the scheduled as well as the active workload, so that resources are always available when required without power being wasted.
Solution
Moab Adaptive HPC Suite™
Results
An annual saving of about 10% of current power costs, amounting to £50,000, from powering off nodes that are not in use. Further savings from ancillary supplies, especially the air conditioning in the datacentre, are expected.
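The power-management requirement described above (a policy aware of the scheduled as well as the active workload) can be sketched as a simple predicate over each node's state. This is an illustrative model, not the Moab Adaptive HPC Suite's actual logic; the node fields and the idle/boot-time thresholds are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Node:
    running_jobs: int                # jobs currently executing on this node
    idle_minutes: int                # how long the node has had no work
    next_reservation_minutes: float  # minutes until the next scheduled job

def can_power_off(node, min_idle=30, boot_time=10):
    """A node may be powered off only if it is running nothing, has been
    idle past a threshold (to avoid thrashing), and can be rebooted
    before its next scheduled workload arrives, so that resources are
    always available when required."""
    return (node.running_jobs == 0
            and node.idle_minutes >= min_idle
            and node.next_reservation_minutes > boot_time)

busy  = Node(running_jobs=2, idle_minutes=0,  next_reservation_minutes=5)
quiet = Node(running_jobs=0, idle_minutes=45, next_reservation_minutes=120)
print(can_power_off(busy), can_power_off(quiet))  # False True
```

The key point is the third condition: a scheduler that looks only at active jobs would power off a node minutes before a reserved job needs it, which is exactly the failure mode the slide's requirement rules out.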

  11. Is the Future of Computing Clear, or Is It Obscured by Clouds?
What’s in a cloud: vapor-ware or silver lining?
National Institute of Standards and Technology: “Cloud computing is a model for enabling convenient, on-demand network access to a shared pool of configurable computing resources (e.g., networks, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction. This cloud model promotes availability and is composed of five essential characteristics, three service models, and four deployment models.”
• Essential Characteristics: on-demand self-service, broad network access, resource pooling, rapid elasticity, measured service
• Service Models: Cloud Software as a Service (SaaS), Cloud Platform as a Service (PaaS), Cloud Infrastructure as a Service (IaaS)
• Deployment Models: private cloud, community cloud, public cloud, hybrid cloud
Note: Cloud software takes full advantage of the cloud paradigm by being service oriented with a focus on statelessness, low coupling, modularity, and semantic interoperability.
The NIST Definition of Cloud Computing: http://csrc.nist.gov/groups/SNS/cloud-computing/cloud-def-v15.doc

  12. Three Essential Cloud Characteristics
• Agile: delivers business services rapidly, efficiently, and successfully
• Automated: eliminates human error, enables scaling and capacity, reduces management complexity and cost
• Adaptive: anticipates and adapts intelligently to dynamic business service needs and conditions

  13. SciNet—University of Toronto
Solution
• Energy-aware, stateless, on-demand multi-OS provisioning
• Moab Adaptive HPC Suite™ and xCAT provisioning software
• 4,000-server supercomputer system
• 30,000 Intel Xeon 5500 cores, with a theoretical peak of 306 TFlops
Results
A state-of-the-art data center that saves enough energy to power more than 700 homes yearly. On-demand provisioning allows users to make their OS choice part of their automated job template; SciNet always has several different flavors of Linux running simultaneously.
“Why should we pay for cooling when it’s so cold outside? Toronto is pretty cold for at least half of the year. We could have bought a humongous pile of cheap x86 boxes but couldn’t power, maintain or operate them in any logical way.”
Dr. Daniel Gruner, PhD, chief technology officer of software for SciNet

  14. A Global Bank Based in the USA
• Who: A top-3 financial services company
• What: Moab Automation Intelligence Manager will manage 80-90% of workloads (up to 10,000+ applications on more than 100,000 servers across more than 10 datacenters)
• Use Case: IaaS, PaaS, and AaaS, using a workload-driven Cloud 2.0
• Objective: Increase agility, reduce risk, and save over $1 billion in 3 years