Challenges to High Productivity Computing Systems and Networks

The 3rd International Conference on Emerging Ubiquitous Systems and Pervasive Networks, Amman, Jordan, October 10-13, 2011. Challenges to High Productivity Computing Systems and Networks. Mohammad Malkawi, Dean of Engineering, Jadara University

Presentation Transcript

The 3rd International Conference on Emerging Ubiquitous Systems and Pervasive Networks, Amman, Jordan, October 10-13, 2011

Challenges to High Productivity Computing Systems and Networks

Mohammad Malkawi

Dean of Engineering,

Jadara University



  • High Productivity Computing Systems (HPCS) - The Big Picture

  • The Challenges


  • Cray Cascade

  • SUN Hero Program

  • Cloud Computing

HPCS: The Big Picture

  • Manufacture and deliver a petaflop-class computer

    • Complex architecture

    • High performance

    • Easier to program

    • Easier to use

HPCS Goals

  • Productivity

    • Reduce code development time

  • Processing power

    • Floating point & integer arithmetic

  • Memory

    • Large size, high bandwidth & low latency

  • Interconnection

    • Large bisection bandwidth

HPCS Challenges

  • High Effective Bandwidth

    • High bandwidth/low latency memory systems

  • Balanced System Architecture

    • Processors, memory, interconnects, programming environments

  • Robustness

    • Hardware and software reliability

    • Compute through failure

    • Intrusion identification and resistance techniques.

HPCS Challenges

  • Performance Measurement and Prediction

    • New class of metrics and benchmarks to measure and predict performance of system architecture and applications software

  • Scalability

    • Adapt and optimize to changing workload and user requirements; e.g., multiple programming models, selectable machine abstractions, and configurable software/hardware architectures

Productivity Challenges

  • Quantify productivity for code development and production

  • Identify characteristics of

    • Application codes

    • Workflow

    • Bottlenecks and obstacles

    • Lessons learned so that decisions by the productivity team and the vendors are based on real data rather than anecdotal data

Did Not Learn the Lessons

Figure 2: Defect Arrival Rate for R8, R9 and R10

Productivity Dilemma - 1

  • Diminishing productivity is alarming

    • Coding

    • Debugging

    • Optimizing

    • Modifying

    • Over-provisioning hardware

    • Running high-end applications

Productivity Dilemma - 2

  • Not long ago, a computational scientist could personally write, debug and optimize code to run on a leadership class high performance computing system without the help of others.

  • Today, programming a cluster of machines is significantly more difficult than traditional programming, and the scale of the machines and problems has increased more than 1,000 times.

Productivity Dilemma - 3

  • Owning and running high-end computational facilities for nuclear research, seismic modeling, gene sequencing or business intelligence takes a sizeable investment in staffing, procurement and operations.

  • Applications achieve 5 to 10 percent of the theoretical peak performance of the system.

  • Applications must be restarted from scratch every time a hardware or software failure interrupts the job.

HPCS Trends: Productivity Crisis

High Productivity Computing

Scaling the Program Without Scaling the Programmer

Bandwidth enables productivity and allows for simpler programming environments and systems with greater fault tolerance

Language Challenges

  • MPI is a fairly low-level programming interface (a message-passing library, not a language)

    • Reliable, predictable and works.

    • Extends Fortran, C and C++ through library bindings

  • New languages with higher level of abstraction

  • Improve legacy applications

  • Scale to Petascale levels

    • SUN – Fortress

    • IBM - X10

    • Cray – Chapel

    • OpenMP

Global View Programming Model

  • Global View programs present a single, global view of the program's data structures.

  • Begin with a single main thread.

  • Parallel execution then spreads out dynamically as work becomes available.
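As a rough illustration of the global-view idea, the Python sketch below starts in a single main thread, keeps one logical array, and lets parallel tasks spread over it as workers become available. The names `global_view_scale`, `workers` and `chunk` are hypothetical, and Python threads merely stand in for the parallel resources of a real HPCS language; this is an analogy, not Chapel, X10 or Fortress code.

```python
from concurrent.futures import ThreadPoolExecutor


def global_view_scale(data, factor, workers=4, chunk=3):
    """Scale every element of one logical array, in place.

    The caller sees a single data structure; how the index space is
    cut into tasks is an internal detail (hypothetical parameters).
    """
    def scale_range(start, stop):
        for i in range(start, stop):
            data[i] *= factor

    # The single main thread creates the work; the pool runs it in parallel.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [
            pool.submit(scale_range, start, min(start + chunk, len(data)))
            for start in range(0, len(data), chunk)
        ]
        for f in futures:
            f.result()  # re-raise any worker exception in the main thread
    return data


values = list(range(10))
global_view_scale(values, 2)
```

The index ranges handed to the workers are disjoint, so no locking is needed; the caller never sees the decomposition.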

Unprecedented Performance Leap

  • Performance targets require aggressive improvements in system parameters traditionally ignored by the "Linpack" benchmark.

  • Improve system performance under the most demanding benchmarks (GUPS)

  • Determine whether general applications will be written or modified to benefit from these features.
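GUPS (giga-updates per second) stresses random memory access rather than floating-point throughput. A much simplified, single-threaded sketch of such an update loop is shown below. The table size, the update count and the pseudo-random generator (Python's `random.Random`) are assumptions chosen for illustration and do not follow the official benchmark rules.

```python
import random
import time


def random_updates(table, n_updates, seed=1):
    """XOR pseudo-random values into pseudo-random table slots."""
    rng = random.Random(seed)
    mask = len(table) - 1  # table length must be a power of two
    for _ in range(n_updates):
        value = rng.getrandbits(64)
        table[value & mask] ^= value
    return table


size = 1 << 12                      # illustrative table size (power of two)
table = list(range(size))
start = time.perf_counter()
random_updates(table, 4 * size)
elapsed = time.perf_counter() - start
# Update rate in giga-updates per second (tiny for pure Python).
gups = (4 * size) / elapsed / 1e9
```

Because each update is an XOR, replaying the same update stream restores the original table, which gives a cheap correctness check for a kernel whose access pattern is deliberately cache-hostile.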

Trade-offs


  • Portability versus innovations

  • Abstractions vs. difficulty of programming and performance overhead

  • Shared memory versus message passing

Cost of Petascale Computing

  • Require petabytes of memory

  • On the order of 10^6 processors

  • Hundreds of petabytes of disk storage for capacity and bandwidth.

  • Power consumption and cost for DRAM and disks (tens of megawatts)

  • Operational cost

The DARPA HPCS Program

  • First major program to devote effort to making high-end computers more user-friendly

    • Mask the difficulty of developing and running codes on HPCS

    • Mask the challenge of getting good performance for a general code

    • Fast, large, and low latency RAM

    • Fast processing

    • Quantitative measure of productivity

IBM HPCS Example


IBM HPCS Program – PERCS 2011

  • Productive, Easy-to-use, Reliable Computing System

  • Rich programming environment

    • Develop new applications and maintain existing ones.

    • Support existing programming models and languages

    • Scalability to the peta-level

  • Automate performance tuning tasks

  • Rich graphical interfaces

  • Automate monitoring and recovery tasks

  • Fewer system administrators to handle larger systems more effectively

IBM Blue Gene – HPCS Base

IBM Approach - Hardware

  • Innovative processor chip design that leverages the POWER processor server line.

  • Lower Soft Error Rates (SER)

  • Reduce the latency of memory accesses by placing the processors close to large memory arrays.

  • Multiple chip configuration to suit different workloads.

IBM Approach - Software

  • Large set of tools integrated into a modern, user-friendly programming environment.

  • Support legacy programming models and languages (MPI, OpenMP, C, C++, Fortran, etc.)

  • Support emerging ones (PGAS)

  • Design a new experimental programming language called X10.

X10 Features

  • Designed for parallel processing from the ground up.

  • Falls under the Partitioned Global Address Space (PGAS) category

  • Balance between a high-level abstraction and exposing the topology of the system

  • Asynchronous interactions among the parallel threads

  • Avoid the blocking synchronization style
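The asynchronous, non-blocking style can be loosely illustrated in Python with `asyncio`. This is an analogy only, not X10 code: `fetch_remote` is a hypothetical stand-in for work running elsewhere in the machine, and `asyncio.gather` plays the role of waiting once for all spawned activities.

```python
import asyncio


async def fetch_remote(place, delay):
    """Hypothetical stand-in for an activity running somewhere else."""
    await asyncio.sleep(delay)  # yields control instead of blocking
    return place * place


async def main():
    # Launch all activities first, then wait once for all of them,
    # so no activity blocks while another is in flight.
    tasks = [asyncio.create_task(fetch_remote(p, 0.01)) for p in range(4)]
    return await asyncio.gather(*tasks)


results = asyncio.run(main())
```

The four activities overlap in time, so the total wait is close to one delay rather than four.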

Cray HPCS Example


Multiple Processing Technologies

  • In high performance computing: one size does not fit all

    • Heterogeneous computing using custom processing technologies.

  • Performance achieved via deeper pipelining and more complex microarchitectures

  • Introduction of multi-core processors:

    • Further stresses processor-memory balance issues

    • Drives up the number of processors required to solve large problems

Specialized Computing Technologies

  • Vector processing and field programmable gate arrays (FPGAs)

    • Ability to extract more performance out of the transistors on a chip with less control overhead.

    • Allow higher processor performance, with lower power

    • Reduce the number of processors required to solve a given problem

    • Vector processors tolerate memory latency extremely well

Specialized Computing Technologies

  • Multithreading improves latency tolerance

  • Cascade design will combine multiple computing technologies

    • Pure scalar nodes, based on Opteron microprocessors

    • Nodes providing vector, massively multithreaded, and FPGA-based acceleration.

    • Nodes that can adapt their mode of operation to the application.

Cray: The Cascade Approach

  • Scalable, high-bandwidth system

  • Globally addressable memory

  • Heterogeneous processing technologies

  • Fast serial execution

  • Massive multithreading

  • Vector processing and FPGA-based application acceleration.

  • Adaptive supercomputing:

    • The system adapts to the application rather than requiring the programmer to adapt the application to the system.

Cascade Approach

  • Use the Cray T3E™ massively parallel system

  • Use a best-of-class microprocessor

  • Processors directly access global memory with very low overhead and at very high data rates.

  • Hierarchical address translation allows the processors to access very large data sets without suffering from TLB faults

  • AMD's Opteron will be the base processor for Cascade

Cray – Adaptive Supercomputing

  • The system adapts to the application

  • The user logs into a single system, and sees one global file system.

  • The compiler analyzes the code to determine which processing technology best fits the code

  • The scheduling software automatically deploys the code on the appropriate nodes.

Balanced Hardware Design

  • A balanced hardware design

    • Complements processor flops with memory, network and I/O bandwidth

  • Scalable performance

  • Improving programmability and breadth of applicability.

  • Balanced systems also require fewer processors to scale to a given level of performance, reducing failure rates and administrative overhead.

Cray – System Bandwidth Challenge

  • The Cascade program is attacking this problem on two fronts

    • Signalling technology and

    • Network design.

  • Provide truly massive global bandwidth at an affordable cost.

  • A key part of the design is a common, globally addressable memory across the whole machine.

    • Efficient, low-overhead communication.

Cray – System Bandwidth Challenge

  • Accessing remote data is as simple as issuing a load or store instruction, rather than calling a library function to pass messages between processors.

  • Allows many outstanding references to be overlapped with each other and with ongoing computation.
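The contrast between the two access styles can be sketched in Python. In this hedged example, threads and `queue.Queue` stand in for processors and a message-passing library, and a plain list stands in for globally addressable memory; the function names `owner`, `read_by_message` and `read_by_load` are hypothetical.

```python
import queue
import threading


# Message-passing style: the owner of the data must take part in the exchange.
def owner(data, requests, replies):
    while True:
        index = requests.get()
        if index is None:      # sentinel: no more requests
            break
        replies.put(data[index])


def read_by_message(index, requests, replies):
    requests.put(index)        # explicit send
    return replies.get()       # explicit receive


# Global-address-space style: a remote read is an ordinary indexed load.
def read_by_load(global_memory, index):
    return global_memory[index]


remote_data = [10, 20, 30, 40]
requests, replies = queue.Queue(), queue.Queue()
worker = threading.Thread(target=owner, args=(remote_data, requests, replies))
worker.start()
via_message = read_by_message(2, requests, replies)
requests.put(None)
worker.join()
via_load = read_by_load(remote_data, 2)
```

Both paths return the same value, but the message-passing path needs a cooperating owner loop, a request and a reply, while the load path is a single expression.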

Cray Programming Model

  • Support MPI for legacy purposes

  • Unified Parallel C (UPC) and Coarray Fortran (CAF)

    • Simpler and easier to write than MPI

  • Reference memory on remote nodes as easily as referencing memory on the local node

  • Data sharing is much more natural

  • Communication overhead is much lower.

Chapel – The Cray HPCS Language

  • Support for graphs, hash tables, sparse arrays, and iterators.

  • Ability to separate the specification of an algorithm from structural details of the computation including

    • Data layouts

    • Work decomposition and communication.

    • Simplifies the creation of the basic algorithms

    • Allows these structural components to be gradually tuned over time.
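The separation of algorithm from data layout can be sketched in Python (not Chapel). In this illustration the algorithm is written once and a pluggable `owner` function decides which part holds each index; `block_owner`, `cyclic_owner` and `distributed_sum_of_squares` are hypothetical names, and the "parts" are simulated in one process.

```python
def block_owner(i, n, parts):
    """Block distribution: contiguous index ranges per part."""
    size = -(-n // parts)  # ceiling division
    return i // size


def cyclic_owner(i, n, parts):
    """Cyclic distribution: indices dealt out round-robin."""
    return i % parts


def distributed_sum_of_squares(values, parts, owner):
    """Algorithm written once; `owner` decides where each index lives."""
    partial = [0] * parts
    for i, v in enumerate(values):
        partial[owner(i, len(values), parts)] += v * v
    return sum(partial), partial


data = list(range(8))
total_block, per_part_block = distributed_sum_of_squares(data, 4, block_owner)
total_cyclic, per_part_cyclic = distributed_sum_of_squares(data, 4, cyclic_owner)
```

Swapping the distribution changes how the work is spread over the parts but not the result, which is what lets the layout be tuned later without touching the algorithm.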

Cray's Programming Tools

  • Reduce the complexity of working on highly scalable applications.

  • The Cascade debugger solution will

    • Focus on data rather than control

    • Support application porting

    • Allow scaling commensurate with the application

  • Integrated development environment (IDE)

Cascade Performance Analysis Tools

  • Hardware performance counters

  • Software introspection techniques.

  • Present the user with insight, rather than statistics.

  • Act as a parallel programming expert

  • Provide high-level feedback on program behaviour

  • Provide suggestions for program modifications to remove key bottlenecks or otherwise improve performance.

SUN HPCS Example


Evolution of HPCS at SUN

  • Grid:

    • Loosely coupled heterogeneous resources

    • Multiple administrative domains

    • Wide area network

  • Clusters

    • Tightly coupled high performance systems

    • Message passing – MPI

  • Ultrascale

    • Distributed scalable systems

    • High productivity shared memory systems

    • High bandwidth, global address space, unified administration tools

SUN Approach – The Hero System

  • Rich bandwidth

  • Low latencies

  • Very high levels of fault tolerance

  • Highly integrated toolset to scale the program and not the programmers

  • Multithreading technologies ( > 100 concurrent threads)

SUN Approach – The Hero System

  • Globally addressable memory

  • System level and application checkpointing

  • Hardware and software telemetry for dramatically improved fault tolerance.

  • The system appears more like a flat memory system

  • Focus on solving the problem at hand rather than making elaborate efforts to distribute data in a robust manner.
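Application-level checkpointing, one of the items listed above, can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the Hero system's mechanism: the state is a small dictionary, the checkpoint is a JSON file, and `save_checkpoint`, `load_checkpoint` and `run` are hypothetical helpers.

```python
import json
import os
import tempfile


def save_checkpoint(path, state):
    """Write state to a temp file, then rename it into place atomically."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
    os.replace(tmp, path)


def load_checkpoint(path):
    if os.path.exists(path):
        with open(path) as f:
            return json.load(f)
    return {"step": 0, "total": 0}  # fresh start


def run(path, n_steps, fail_at=None, every=10):
    """Accumulate 0 + 1 + ... + (n_steps - 1), checkpointing every `every` steps."""
    state = load_checkpoint(path)
    for step in range(state["step"], n_steps):
        if step == fail_at:
            raise RuntimeError("simulated failure")
        state["total"] += step
        state["step"] = step + 1
        if state["step"] % every == 0:
            save_checkpoint(path, state)
    return state["total"]


ckpt = os.path.join(tempfile.mkdtemp(), "job.ckpt")
try:
    run(ckpt, 100, fail_at=57)      # interrupted job
except RuntimeError:
    pass
resumed_from = load_checkpoint(ckpt)["step"]
result = run(ckpt, 100)             # restart resumes from the checkpoint
```

After the simulated failure the job resumes from the last saved step instead of from scratch, so only the work done since that checkpoint is repeated.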

Definition: Bisection Bandwidth

Split a system into two equal halves such that the number of connections crossing the split is minimal; the bandwidth across that split is the bisection bandwidth

A standard metric for a system’s ability to move data globally

Example: an all-to-all interconnect between 8 cabinets

There are 28 total connections, of which 16 cross the bisection (orange) and 12 do not (blue)

High-bandwidth optical connections are key to meeting the HPCS peta-scale bisection bandwidth target
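The counts in the 8-cabinet example can be checked with a short Python sketch. It assumes a pure all-to-all topology with one link per cabinet pair; `bisection_links` and `bisection_bandwidth` are hypothetical helper names, and the per-link bandwidth is a free parameter that the slide does not give.

```python
from itertools import combinations


def bisection_links(n_nodes):
    """Count links crossing an even split of an all-to-all network."""
    links = list(combinations(range(n_nodes), 2))
    half = n_nodes // 2
    # In an all-to-all network every even split cuts the same number of
    # links, so splitting into nodes [0, half) and [half, n) is minimal.
    crossing = [(a, b) for a, b in links if (a < half) != (b < half)]
    return len(links), len(crossing), len(links) - len(crossing)


def bisection_bandwidth(n_nodes, link_gbps):
    """Bisection bandwidth, given a hypothetical per-link bandwidth."""
    return bisection_links(n_nodes)[1] * link_gbps


total, crossing, internal = bisection_links(8)
```

For 8 cabinets this gives 8·7/2 = 28 links in total, 4·4 = 16 crossing the split and 2·(4·3/2) = 12 inside the halves, matching the figures quoted in the example.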

System Bandwidth Over Time

A giant leap in productivity expected

High Bandwidth Required by HPCS

Radical Changes From Today’s Architecture Necessary

Motivation for Higher Bandwidth

Growing BW demand in HPCS

  • Multicore CPUs: Aggregation of multiple cores is unstoppable and copper interconnects are stressed at very large scale

  • Silicon Photonics is the solution since it brings a potential of unlimited BW on the best medium allowing for large aggregation of multicore CPUs

Growing BW demand in HPCS

  • Clusters are growing in number of nodes and in performance/node

  • Interconnects are the limiting factor in BW, latency, distance

  • Protocols reduce latency & copper increases latency.

  • Silicon Photonics brings high BW and low latency

Growing BW demand in HPCS

  • Storage I/O BW is increasing exponentially due to faster data rates and the parallelism introduced by striping technologies

  • WDM will eventually allow 10Tb of data to be transmitted down a single piece of fiber

  • Silicon Photonics is at the beginning of its life cycle with headroom for explosive BW growth without any increase in latency or reduction in reach

Proximity + CMOS Photonics

Proximity Communication -2

Proximity Communication -3

Proximity Communication

  • Capacitive coupling enables high-speed data communication between neighboring chips without the need for wires of any kind

  • Allows for the alignment of metal plates on one chip with metal plates on a neighboring chip and the transfer of data between them

    • reduced power

    • improves cross-section bandwidth and

    • communication power

Proximity Communication - SUN

  • 3.6 x 4.1 mm test chip

  • 0.35 µm technology

  • 50 µm bit pitch

  • 1.35 Gbps/channel for 16 simultaneous channels

  • < 10^-12 BER @ 1 Gbps

  • 3.6 mW/channel static power

  • 3.9 pJ/bit dynamic power
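The test-chip figures can be combined into aggregate numbers with a little arithmetic. The Python sketch below assumes that all 16 channels run at the full 1.35 Gbps and that the 3.9 pJ/bit dynamic energy is spent on every transmitted bit; the slide states neither, so the totals are illustrative only.

```python
CHANNELS = 16
RATE_GBPS = 1.35           # per channel, from the slide
STATIC_MW = 3.6            # per channel, from the slide
DYNAMIC_PJ_PER_BIT = 3.9   # from the slide

aggregate_gbps = CHANNELS * RATE_GBPS

# pJ/bit * Gbit/s = mW  (1e-12 J/bit * 1e9 bit/s = 1e-3 W)
dynamic_mw_per_channel = DYNAMIC_PJ_PER_BIT * RATE_GBPS
total_mw_per_channel = STATIC_MW + dynamic_mw_per_channel
total_mw = CHANNELS * total_mw_per_channel
```

Under these assumptions the chip moves about 21.6 Gbps in aggregate at roughly 8.9 mW per channel, or about 142 mW for all 16 channels.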

Proximity Communication -4

Proximity Communication -5

Low Cost, Low Power Optics

DWDM CMOS Photonics

CMOS Photonics Module

SUN Programming Model

Simpler Code with High Bandwidth Shared Memory

NAS Parallel Benchmark CG (Conjugate Gradient) Lines of Code

SUN Fortress Language

To Do For Fortran What Java™ Did For C

  • Catch stupid mistakes

  • Extensive libraries

  • Platform independence

  • Security model

  • Type safety

  • Multithreading

  • Dynamic compilation

Object-Based “Smart” Storage

With Object Storage File Systems For Massive Scalability and Extreme Performance

Ultra-scale Computing in 2010

  • Simpler development environments will make HPC more accessible to a diverse range of users

  • Lone researchers and small teams will once again be able to harness the computational power of leadership class systems

  • Many gaps regarding commercial and scientific computing will narrow

Cloud Computing

  • Service computing

  • The net is the computer

  • More than 100 vendors

  • Growing fast

  • Programming environment

Backup Slides


HPCS Technologies

Some Publicly Announced Projects

IBM HPCS PERCS


  • Open source operating systems and hypervisors will provide HPC-oriented

    • Virtualization

    • Security

    • Resource management

    • Affinity control

    • Resource limits

    • Checkpoint-restart and reliability features that will improve the robustness and availability of the system.

MPI Paradigm

  • Writing applications in MPI requires breaking up all the data and computation into a large number of discrete pieces, and then using library code to explicitly bundle up data and pass it between processors in messages whenever processors need to share data.

  • It's a cumbersome affair that distracts scientists from their primary focus.

  • Once an application is written, it's generally a time-consuming process to debug and tune it.

  • Traditional debugging models just don't scale well to thousands or tens of thousands of processors (try opening up 10,000 debugger windows, one for each thread!).

  • Trying to figure out why your application isn't getting the performance you think it should is also exceedingly difficult at large scales.

  • Traditional profiling and even sophisticated statistics-gathering may be insufficient to ascertain why the performance is lagging, much less how to change the code to improve it.

Productivity Challenges

  • The time spent trying to structure an application to fit the attributes of the target machine.

  • If the machine is a cluster with limited interconnect bandwidth

    • the programmer must carefully minimize communication

    • make sure that any sparse data to be communicated is first bundled together into larger messages to reduce communication overheads.
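Why bundling helps can be seen with a simple latency-bandwidth cost model, in which every message pays a fixed start-up latency and the payload shares the link bandwidth. The model and the parameter values below (2 µs latency, 1 GB/s bandwidth, 10,000 eight-byte items) are assumptions chosen for illustration, not measurements of any particular interconnect.

```python
def transfer_time(n_messages, total_bytes, latency_s, bandwidth_bytes_per_s):
    """Latency-bandwidth cost model: each message pays a fixed latency,
    and all bytes share the link bandwidth."""
    return n_messages * latency_s + total_bytes / bandwidth_bytes_per_s


# Hypothetical interconnect parameters, chosen only for illustration.
LATENCY = 2e-6        # 2 microseconds per message
BANDWIDTH = 1e9       # 1 GB/s

n_items, item_bytes = 10_000, 8
unbundled = transfer_time(n_items, n_items * item_bytes, LATENCY, BANDWIDTH)
bundled = transfer_time(1, n_items * item_bytes, LATENCY, BANDWIDTH)
speedup = unbundled / bundled
```

With these numbers the per-message latency dominates the unbundled case, so sending one large message is over two hundred times faster; the exact ratio depends entirely on the assumed parameters.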

Productivity Challenges

  • If the machine uses conventional microprocessors

    • Care must be taken to maximize cache re-use

    • Eliminate global memory references, which tend to stall the processor.

  • If the machine looks like a hammer

    • You'd better make all your codes look like nails!

    • This can lead to "unnatural" algorithms and data structures, which significantly reduce programmer productivity
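Maximizing cache re-use is a typical case of reshaping code to fit the machine. The Python sketch below shows loop blocking (tiling) applied to matrix multiplication. It illustrates only the restructuring: pure Python will not show the cache benefit, and the block size is a hypothetical tuning parameter.

```python
def matmul_naive(a, b):
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_blocked(a, b, block=2):
    """Same arithmetic, visited tile by tile so each tile is re-used
    while it is (on a real machine) still resident in cache."""
    n = len(a)
    c = [[0] * n for _ in range(n)]
    for ii in range(0, n, block):
        for kk in range(0, n, block):
            for jj in range(0, n, block):
                for i in range(ii, min(ii + block, n)):
                    for k in range(kk, min(kk + block, n)):
                        aik = a[i][k]
                        for j in range(jj, min(jj + block, n)):
                            c[i][j] += aik * b[k][j]
    return c


a = [[i + j for j in range(5)] for i in range(5)]
b = [[i * j for j in range(5)] for i in range(5)]
```

The blocked version computes the same product with three extra loop levels and a machine-specific block size, which is the kind of "unnatural" structure the slide warns about.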
