
Introduction toHigh Performance Computing:Parallel Computing, Distributed Computing, Grid Computing and More

Dr. Jay Boisseau

Director, Texas Advanced Computing Center

[email protected]

December 3, 2001

Texas Advanced Computing Center

The University of Texas at Austin


Outline

  • Preface

  • What is High Performance Computing?

  • Parallel Computing

  • Distributed Computing, Grid Computing, and More

  • Future Trends in HPC


Purpose

  • Purpose of this workshop:

    • to educate researchers about the value and impact of high performance computing (HPC) techniques and technologies in conducting computational science and engineering

  • Purpose of this presentation:

    • to educate researchers about the techniques and tools of parallel computing, and to show them the possibilities presented by distributed computing and Grid computing


Goals

  • Goals of this presentation are to help you:

    • understand the ‘big picture’ of high performance computing

    • develop a comprehensive understanding of parallel computing

    • begin to understand how Grid and distributed computing will further enhance computational science capabilities


Content and Context

  • This material is an introduction and an overview

    • It is not a comprehensive HPC survey, so further reading (much more!) is recommended.

  • Presentation is followed by additional speakers with detailed presentations on specific HPC and science topics

  • Together, these presentations will help prepare you to use HPC in your scientific discipline.


Background - me

  • Director of the Texas Advanced Computing Center (TACC) at the University of Texas

  • Formerly at San Diego Supercomputer Center (SDSC), Arctic Region Supercomputing Center

  • 10+ years in HPC

  • Have known Luis for 4 years - plan to develop a strong relationship between TACC and CeCalCULA


Background – TACC

  • Mission:

    • to enhance the academic research capabilities of the University of Texas and its affiliates through the application of advanced computing resources and expertise

  • TACC activities include:

    • Resources

    • Support

    • Development

    • Applied research


TACC Activities

  • TACC resources and support include:

    • HPC systems

    • Scientific visualization resources

    • Data storage/archival systems

  • TACC research and development areas:

    • HPC

    • Scientific Visualization

    • Grid Computing


Current HPC Systems

(Diagram: systems aurora, golden, and azure linked by HiPPI and FDDI through an Ascend router, with a 640 GB archive and 300 GB and 500 GB local disks)

  • CRAY SV1: 16 CPUs, 16 GB memory

  • IBM SP: 64+ processors, 256 MB/processor

  • CRAY T3E: 256+ processors, 128 MB/processor

New HPC Systems

  • Four IBM p690 HPC servers

    • 16 Power4 Processors

      • 1.3 GHz: 5.2 Gflops per proc, 83.2 Gflops per server

    • 16 GB Shared Memory

      • >200 GB/s memory bandwidth!

    • 144 GB Disk

  • 1 TB disk to partition across servers

  • Will configure as single system (1/3 Tflop) with single GPFS system (1 TB) in 2Q02


New HPC Systems

  • IA64 Cluster

    • 20 2-way nodes

    • Itanium (800 MHz) processors

    • 2 GB memory/node

    • 72 GB disk/node

    • Myrinet 2000 switch

    • 180 GB shared disk

  • IA32 Cluster

    • 32 2-way nodes

    • Pentium III (1 GHz) processors

    • 1 GB memory/node

    • 18.2 GB disk/node

    • Myrinet 2000 switch

  • 750 GB IBM GPFS parallel file system for both clusters


World-Class Vislab

  • SGI Onyx2

    • 24 CPUs, 6 Infinite Reality 2 Graphics Pipelines

    • 24 GB Memory, 750 GB Disk

  • Front and Rear Projection Systems

    • 3x1 cylindrically-symmetric Power Wall

    • 5x2 large-screen, 16:9 panel Power Wall

  • Matrix switch between systems, projectors, rooms


More Information

  • URL: www.tacc.utexas.edu

  • E-mail Addresses:

    • General Information: [email protected]

    • Technical assistance: [email protected]

  • Telephone Numbers:

    • Main Office: (512) 475-9411

    • Facsimile transmission: (512) 475-9445

    • Operations Room: (512) 475-9410


Outline

  • Preface

  • What is High Performance Computing?

  • Parallel Computing

  • Distributed Computing, Grid Computing, and More

  • Future Trends in HPC


‘Supercomputing’

  • First HPC systems were vector-based systems (e.g. Cray)

    • named ‘supercomputers’ because they were an order of magnitude more powerful than commercial systems

  • Now, ‘supercomputer’ has little meaning

    • large systems are now just scaled up versions of smaller systems

  • However, ‘high performance computing’ has many meanings


HPC Defined

  • High performance computing:

    • can mean high flop count

      • per processor

      • totaled over many processors working on the same problem

      • totaled over many processors working on related problems

    • can mean faster turnaround time

      • more powerful system

      • scheduled to first available system(s)

      • using multiple systems simultaneously


My Definitions

  • HPC: any computational technique that solves a large problem faster than possible using single, commodity systems

    • Custom-designed, high-performance processors (e.g. Cray, NEC)

    • Parallel computing

    • Distributed computing

    • Grid computing


My Definitions

  • Parallel computing: single systems with many processors working on the same problem

  • Distributed computing: many systems loosely coupled by a scheduler to work on related problems

  • Grid Computing: many systems tightly coupled by software and networks to work together on single problems or on related problems


Importance of HPC

  • HPC has had tremendous impact on all areas of computational science and engineering in academia, government, and industry.

  • Many problems have been solved with HPC techniques that were impossible to solve with individual workstations or personal computers.


Outline

  • Preface

  • What is High Performance Computing?

  • Parallel Computing

  • Distributed Computing, Grid Computing, and More

  • Future Trends in HPC


What is a Parallel Computer?

  • Parallel computing: the use of multiple computers or processors working together on a common task

  • Parallel computer: a computer that contains multiple processors:

    • each processor works on its section of the problem

    • processors are allowed to exchange information with other processors


Parallel vs. Serial Computers

  • Two big advantages of parallel computers:

    • total performance

    • total memory

  • Parallel computers enable us to solve problems that:

    • benefit from, or require, fast solution

    • require large amounts of memory

    • example that requires both: weather forecasting


Parallel vs. Serial Computers

  • Some benefits of parallel computing include:

    • more data points

      • bigger domains

      • better spatial resolution

      • more particles

    • more time steps

      • longer runs

      • better temporal resolution

    • faster execution

      • faster time to solution

      • more solutions in same time

      • larger simulations in real time


Serial Processor Performance

Although Moore’s Law ‘predicts’ that single processor performance doubles every 18 months, eventually physical limits on manufacturing technology will be reached


Types of Parallel Computers

  • The simplest and most useful way to classify modern parallel computers is by their memory model:

    • shared memory

    • distributed memory


Shared vs. Distributed Memory

(Diagram: six processors sharing one memory over a bus vs. six processor/memory pairs connected by a network)

Shared memory - single address space. All processors have access to a pool of shared memory. (Ex: SGI Origin, Sun E10000)

Distributed memory - each processor has its own local memory. Must do message passing to exchange data between processors. (Ex: CRAY T3E, IBM SP, clusters)


Shared Memory: UMA vs. NUMA

Uniform memory access (UMA): Each processor has uniform access to memory. Also known as symmetric multiprocessors, or SMPs (Sun E10000)

(Diagram: UMA - all processors on one bus to a single memory; NUMA - multiple bus+memory units connected by a network)

Non-uniform memory access (NUMA): Time for memory access depends on location of data. Local access is faster than non-local access. Easier to scale than SMPs (SGI Origin)


Distributed Memory: MPPs vs. Clusters

  • Processor-memory nodes are connected by some type of interconnect network

    • Massively Parallel Processor (MPP): tightly integrated, single system image.

    • Cluster: individual computers connected by s/w

(Diagram: nine CPU+MEM nodes connected by an interconnect network)


Processors, Memory, & Networks

  • Both shared and distributed memory systems have:

    • processors: now generally commodity RISC processors

    • memory: now generally commodity DRAM

    • network/interconnect: between the processors and memory (bus, crossbar, fat tree, torus, hypercube, etc.)

  • We will now begin to describe these pieces in detail, starting with definitions of terms.


Processor-Related Terms

Clock period (cp): the minimum time interval between successive actions in the processor. Fixed: depends on design of processor. Measured in nanoseconds (~1-5 for fastest processors). Inverse of frequency (MHz).

Instruction: an action executed by a processor, such as a mathematical operation or a memory operation.

Register: a small, extremely fast location for storing data or instructions in the processor.


Processor-Related Terms

Functional Unit (FU): a hardware element that performs an operation on an operand or pair of operands. Common FUs are ADD, MULT, INV, SQRT, etc.

Pipeline : technique enabling multiple instructions to be overlapped in execution.

Superscalar: multiple instructions are possible per clock period.

Flops: floating point operations per second.


Processor-Related Terms

Cache: fast memory (SRAM) near the processor. Helps keep instructions and data close to functional units so processor can execute more instructions more rapidly.

Translation-Lookaside Buffer (TLB): keeps addresses of pages (block of memory) in main memory that have recently been accessed (a cache for memory addresses)


Memory-Related Terms

SRAM: Static Random Access Memory (RAM). Very fast (~10 nanoseconds), made using the same kind of circuitry as the processors, so speed is comparable.

DRAM: Dynamic RAM. Longer access times (~100 nanoseconds), but hold more bits and are much less expensive (10x cheaper).

Memory hierarchy: the hierarchy of memory in a parallel system, from registers to cache to local memory to remote memory. More later.


Interconnect-Related Terms

  • Latency:

    • Networks: How long does it take to start sending a "message"? Measured in microseconds.

    • Processors: How long does it take to output results of some operations, such as floating point add, divide, etc., which are pipelined?

  • Bandwidth: What data rate can be sustained once the message is started? Measured in Mbytes/sec or Gbytes/sec


Interconnect-Related Terms

Topology: the manner in which the nodes are connected.

  • Best choice would be a fully connected network (every processor to every other). Unfeasible for cost and scaling reasons.

  • Instead, processors are arranged in some variation of a grid, torus, or hypercube.

(Diagrams: 2-d mesh, 2-d torus, 3-d hypercube)


Processor-Memory Problem

  • Processors issue instructions roughly every nanosecond.

  • DRAM can be accessed roughly every 100 nanoseconds (!).

  • DRAM cannot keep processors busy! And the gap is growing:

    • processors getting faster by 60% per year

    • DRAM getting faster by 7% per year (SDRAM and EDO RAM might help, but not enough)


Processor-Memory Performance Gap

(Chart: single-processor ("µProc") performance grows ~60%/yr following "Moore's Law" while DRAM performance grows ~7%/yr over 1980-2000, so the processor-memory performance gap grows ~50%/yr. From D. Patterson, CS252, Spring 1998 ©UCB)


Processor-Memory Performance Gap

  • Problem becomes worse when remote (distributed or NUMA) memory is needed

    • network latency is roughly 1000-10000 nanoseconds (roughly 1-10 microseconds)

    • networks getting faster, but not fast enough

  • Therefore, cache is used in all processors

    • almost as fast as processors (same circuitry)

    • sits between processors and local memory

    • expensive, can only use small amounts

    • must design system to load cache effectively


Processor-Cache-Memory

  • Cache is much smaller than main memory and hence there is mapping of data from main memory to cache.

(Diagram: CPU → Cache → Main Memory)


Memory Hierarchy

(Diagram: CPU → Cache → Local Memory → Remote Memory)


Cache-Related Terms

  • ICACHE : Instruction cache

  • DCACHE (L1) : Data cache closest to registers

  • SCACHE (L2) : Secondary data cache

    • Data from SCACHE has to go through DCACHE to registers

    • SCACHE is larger than DCACHE

    • Not all processors have SCACHE


Cache Benefits

  • Data cache was designed with two key concepts in mind

    • Spatial Locality

      • When an element is referenced its neighbors will be referenced also

      • Cache lines are fetched together

      • Work on consecutive data elements in the same cache line

    • Temporal Locality

      • When an element is referenced, it might be referenced again soon

      • Arrange code so that data in cache is reused often


Direct-Mapped Cache

  • Direct mapped cache: A block from main memory can go in exactly one place in the cache. This is called direct mapped because there is direct mapping from any block address in memory to a single location in the cache.

(Diagram: each block of main memory maps to exactly one location in the cache)


Fully Associative Cache

  • Fully Associative Cache : A block from main memory can be placed in any location in the cache. This is called fully associative because a block in main memory may be associated with any entry in the cache.

(Diagram: a block of main memory may be placed in any location in the cache)


Set Associative Cache

  • Set associative cache : The middle range of designs between direct mapped cache and fully associative cache is called set-associative cache. In a n-way set-associative cache a block from main memory can go into N (N > 1) locations in the cache.

(Diagram: a block of main memory may go to either location of its set in a 2-way set-associative cache)


Cache-Related Terms

Least Recently Used (LRU): Cache replacement strategy for set associative caches. The cache block that is least recently used is replaced with a new block.

Random Replace: Cache replacement strategy for set associative caches. A cache block is randomly replaced.


Example: CRAY T3E Cache

  • The CRAY T3E processors can execute

    • 2 floating point ops (1 add, 1 multiply) and

    • 2 integer/memory ops (includes 2 loads or 1 store)

  • To help keep the processors busy

    • on-chip 8 KB direct-mapped data cache

    • on-chip 8 KB direct-mapped instruction cache

    • on-chip 96 KB 3-way set associative secondary data cache with random replacement.


Putting the Pieces Together

  • Recall:

    • Shared memory architectures:

      • Uniform Memory Access (UMA): Symmetric Multi-Processors (SMP). Ex: Sun E10000

      • Non-Uniform Memory Access (NUMA): Most common are Distributed Shared Memory (DSM), or cc-NUMA (cache coherent NUMA) systems. Ex: SGI Origin 2000

    • Distributed memory architectures:

      • Massively Parallel Processor (MPP): tightly integrated system, single system image. Ex: CRAY T3E, IBM SP

      • Clusters: commodity nodes connected by interconnect. Example: Beowulf clusters.


Symmetric Multiprocessors (SMPs)

  • SMPs connect processors to global shared memory using one of:

    • bus

    • crossbar

  • Provides simple programming model, but has problems:

    • buses can become saturated

    • crossbar size must increase with # processors

  • Problem grows with number of processors, limiting maximum size of SMPs


Shared Memory Programming

  • Programming models are easier since message passing is not necessary. Techniques:

    • autoparallelization via compiler options

    • loop-level parallelism via compiler directives

    • OpenMP

    • pthreads

  • More on programming models later.


Massively Parallel Processors

  • Each processor has its own memory:

    • memory is not shared globally

    • adds another layer to memory hierarchy (remote memory)

  • Processor/memory nodes are connected by interconnect network

    • many possible topologies

    • processors must pass data via messages

    • communication overhead must be minimized


Communications Networks

  • Custom

    • Many vendors have custom interconnects that provide high performance for their MPP system

    • CRAY T3E interconnect is the fastest for MPPs: lowest latency, highest bandwidth

  • Commodity

    • Used in some MPPs and all clusters

    • Myrinet, Gigabit Ethernet, Fast Ethernet, etc.


Types of Interconnects

  • Fully connected

    • not feasible

  • Array and torus

    • Intel Paragon (2D array), CRAY T3E (3D torus)

  • Crossbar

    • IBM SP (8 nodes)

  • Hypercube

    • SGI Origin 2000 (hypercube), Meiko CS-2 (fat tree)

  • Combinations of some of the above

    • IBM SP (crossbar & fully connected for 80 nodes)

    • IBM SP (fat tree for > 80 nodes)


Clusters

  • Similar to MPPs

    • Commodity processors and memory

      • Processor performance must be maximized

    • Memory hierarchy includes remote memory

    • No shared memory--message passing

      • Communication overhead must be minimized

  • Different from MPPs

    • All commodity, including interconnect and OS

    • Multiple independent systems: more robust

    • Separate I/O systems


Cluster Pros and Cons

  • Pros

    • Inexpensive

    • Fastest processors first

    • Potential for true parallel I/O

    • High availability

  • Cons:

    • Less mature software (programming and system)

    • More difficult to manage (changing slowly)

    • Lower performance interconnects: not as scalable to large number (but have almost caught up!)


Distributed Memory Programming

  • Message passing is most efficient

    • MPI

    • MPI-2

    • Active/one-sided messages

      • Vendor: SHMEM (T3E), LAPI (SP)

      • Coming in MPI-2

  • Shared memory models can be implemented in software, but are not as efficient.

  • More on programming models in the next section.


“Distributed Shared Memory”

  • More generally called cc-NUMA (cache coherent NUMA)

  • Consists of m SMPs with n processors in a global address space:

    • Each processor has some local memory (SMP)

    • All processors can access all memory: extra “directory” hardware on each SMP tracks values stored in all SMPs

    • Hardware guarantees cache coherency

    • Access to memory on other SMPs slower (NUMA)


“Distributed Shared Memory”

  • Easier to build because of slower access to remote memory (no expensive bus/crossbar)

  • Similar cache problems

  • Code writers should be aware of data distribution

  • Load balance: Minimize access of “far” memory


DSM Rationale and Realities

  • Rationale: combine the ease of SMP programming with the scalability of MPP programming, at roughly the cost of an MPP

  • Reality: NUMA introduces additional layers in the memory hierarchy relative to SMPs, so scalability is limited if programmed as an SMP

  • Reality: Performance and high scalability require programming to the architecture.


Clustered SMPs

  • Simpler than DSMs:

    • composed of nodes connected by network, like an MPP or cluster

    • each node is an SMP

    • processors on one SMP do not share memory on other SMPs (no directory hardware in SMP nodes)

    • communication between SMP nodes is by message passing

    • Ex: IBM Power3-based SP systems


Clustered SMP Diagram

(Diagram: two SMP nodes, each with four processors sharing a memory over a local bus, connected to each other by a network)


Reasons for Clustered SMPs

  • Natural extension of SMPs and clusters

    • SMPs offer great performance up to their crossbar/bus limit

    • Connecting nodes is how memory and performance are increased beyond SMP levels

    • Can scale to larger number of processors with less scalable interconnect

    • Maximum performance:

      • Optimize at SMP level - no communication overhead

      • Optimize at MPP level - fewer messages necessary for same number of processors


Clustered SMP Drawbacks

  • Clustering SMPs has drawbacks

    • No shared memory access over entire system, unlike DSMs

    • Has other disadvantages of DSMs

      • Extra layer in memory hierarchy

      • Performance requires more effort from programmer than SMPs or MPPs

  • However, clustered SMPs provide a means for obtaining very high performance and scalability


Clustered SMP: NPACI “Blue Horizon”

  • IBM SP system:

    • Power3 processors: good peak performance (~1.5 Gflops)

    • better sustained performance (highly superscalar and pipelined) than for many other processors

    • SMP nodes have 8 Power3 processors

    • System has 144 SMP nodes (1,152 processors total)


Programming Clustered SMPs

  • NSF: Most users use only MPI, even for intra-node messages

  • DoE: Most applications are being developed with MPI (between nodes) and OpenMP (intra-node)

  • MPI+OpenMP programming is more complex, but might yield maximum performance

  • Active messages and pthreads would theoretically give maximum performance


Types of Parallelism

  • Data parallelism: each processor performs the same task on different sets or sub-regions of data

  • Task parallelism: each processor performs a different task

  • Most parallel applications fall somewhere on the continuum between these two extremes.

Task parallelism

Data parallelism


Data vs. Task Parallelism

  • Example of data parallelism:

    • In a bottling plant, we see several ‘processors’, or bottle cappers, applying bottle caps concurrently on rows of bottles.

  • Example of task parallelism;

    • In a restaurant kitchen, we see several chefs, or ‘processors’, working simultaneously on different parts of different meals.

    • A good restaurant kitchen also demonstrates load balancing and synchronization--more on those topics later.


Example: Master-Worker Parallelism

  • A common form of parallelism used in developing applications years ago (especially in PVM) was Master-Worker parallelism:

    • a single processor is responsible for distributing data and collecting results (task parallelism)

    • all other processors perform same task on their portion of data (data parallelism)


Parallel Programming Models

  • The primary programming models in current use are

    • Data parallelism - operations are performed in parallel on collections of data structures. A generalization of array operations.

    • Message passing - processes possess local memory and communicate with other processes by sending and receiving messages.

    • Shared memory - each processor has access to a single shared pool of memory


Parallel Programming Models

  • Most parallelization efforts fall under the following categories.

    • Codes can be parallelized using message-passing libraries such as MPI.

    • Codes can be parallelized using compiler directives such as OpenMP.

    • Codes can be written in new parallel languages.


Programming Models → Architectures

  • Natural mappings

    • data parallel → CM-2 (SIMD machine)

    • message passing → IBM SP (MPP)

    • shared memory → SGI Origin, Sun E10000

  • Implemented mappings

    • HPF (a data parallel language) and MPI (a message passing library) have been implemented on nearly all parallel machines

    • OpenMP (a set of directives, etc. for shared memory programming) has been implemented on most shared memory systems.


SPMD

  • All current machines are MIMD systems (Multiple Instruction, Multiple Data) and are capable of either data parallelism or task parallelism.

  • The primary paradigm for programming parallel machines is the SPMD paradigm: Single Program, Multiple Data

    • each processor runs a copy of same source code

    • enables data parallelism (through data decomposition) and task parallelism (through intrinsic functions that return the processor ID)


OpenMP - Shared Memory Standard

  • OpenMP is a new standard for shared memory programming: SMPs and cc-NUMAs.

    • OpenMP provides a standard set of directives, run-time library routines, and environment variables for parallelizing code under a shared memory model.

    • Very similar to Cray PVP autotasking directives, but with much more functionality. (Cray now supports OpenMP.)

    • See http://www.openmp.org for more information


OpenMP Example

Fortran 77:

      program add_arrays
      parameter (n=1000)
      real x(n),y(n),z(n)
      read(10) x,y,z
      do i=1,n
        x(i) = y(i) + z(i)
      enddo
      ...
      end

Fortran 77 + OpenMP:

      program add_arrays
      parameter (n=1000)
      real x(n),y(n),z(n)
      read(10) x,y,z
!$OMP PARALLEL DO
      do i=1,n
        x(i) = y(i) + z(i)
      enddo
      ...
      end

The !$OMP PARALLEL DO directive specifies that the loop is executed in parallel. Each processor executes a subset of the loop iterations.


MPI - Message Passing Standard

  • MPI has emerged as the standard for message passing in both C and Fortran programs. No longer need to know MPL, PVM, TCGMSG, etc.

  • MPI is both large and small:

    • MPI is large, since it contains 125 functions which give the programmer fine control over communications

    • MPI is small, since message passing programs can be written using a core set of just six functions.


MPI Examples - Send and Receive

MPI messages are two-way: they require a send and a matching receive:

PE 0 calls MPI_SEND to pass the real variable x to PE 1.

PE 1 calls MPI_RECV to receive the real variable y from PE 0

      if (myid.eq.0) then
        call MPI_SEND(x,1,MPI_REAL,1,100,MPI_COMM_WORLD,ierr)
      endif

      if (myid.eq.1) then
        call MPI_RECV(y,1,MPI_REAL,0,100,MPI_COMM_WORLD,status,ierr)
      endif


MPI Example - Global Operations

MPI also has global operations to broadcast and reduce (collect) information

PE 5 broadcasts the single (1) integer value n to all other processors

      call MPI_BCAST(n,1,MPI_INTEGER,5,MPI_COMM_WORLD,ierr)

PE 6 collects the single (1) integer value n from all other processors and puts the sum (MPI_SUM) into allsum

      call MPI_REDUCE(n,allsum,1,MPI_INTEGER,MPI_SUM,6,MPI_COMM_WORLD,ierr)


MPI Implementations

  • MPI is typically implemented on top of the highest performance native message passing library for every distributed memory machine.

  • MPI is a natural model for distributed memory machines (MPPs, clusters)

  • MPI offers higher performance on DSMs beyond the size of an individual SMP

  • MPI is useful between SMPs that are clustered

  • MPI can be implemented on shared memory machines


Extensions to MPI: MPI-2

  • A standard for MPI-2 has been developed which extends the functionality of MPI. New features include:

    • One-sided communications - eliminates the need to post matching sends and receives. Similar in functionality to the SHMEM PUT and GET on the CRAY T3E (most systems have an analogous library)

    • Support for parallel I/O

    • Extended collective operations

    • No full implementation exists yet - it is difficult for vendors to implement


MPI vs. OpenMP

  • There is no single best approach to writing a parallel code. Each has pros and cons:

    • MPI - powerful, general, and universally available message passing library which provides very fine control over communications, but forces the programmer to operate at a relatively low level of abstraction.

    • OpenMP - conceptually simple approach for creating parallel codes on a shared memory machine, but not applicable to distributed memory platforms.


MPI vs. OpenMP

  • MPI is the most general (problem types) and portable (platforms, although not efficient for SMPs)

  • The architecture and the problem type often make the decision for you.


Parallel Libraries

  • Finally, there are parallel mathematics libraries that enable users to write (serial) codes, then call parallel solver routines:

    • ScaLAPACK is for solving dense linear systems of equations, eigenvalue and least squares problems. Also see PLAPACK.

    • PETSc is for solving linear and non-linear partial differential equations (includes various iterative solvers for sparse matrices).

    • Many others: check NETLIB for a complete survey: http://www.netlib.org


Hurdles in Parallel Computing

There are some hurdles in parallel computing:

  • Scalar performance: Fast parallel codes require efficient use of the underlying scalar hardware

  • Parallel algorithms: Not all scalar algorithms parallelize well, may need to rethink problem

    • Communications: Need to minimize the time spent doing communications

    • Load balancing: All processors should do roughly the same amount of work

  • Amdahl’s Law: Fundamental limit on parallel computing


Scalar performance l.jpg
Scalar Performance

  • Underlying every good parallel code is a good scalar code.

  • If a code scales to 256 processors but only gets 1% of peak performance, it is still a bad parallel code.

    • Good news: Everything that you know about serial computing will be useful in parallel computing!

    • Bad news: It is difficult to get good performance out of the processors and memory used in parallel machines. Need to use cache effectively.


Serial performance l.jpg
Serial Performance

In this case, the parallel code achieves perfect scaling, but does not match the performance of the serial code until 32 processors are used


Use cache effectively l.jpg
Use Cache Effectively

A simplified memory hierarchy: the CPU is backed by a small, fast cache, which in turn is backed by the big, slow main memory.

  • The data cache was designed with two key concepts in mind:

  • Spatial locality - the cache is loaded an entire line (4-32 words) at a time, to take advantage of the fact that if a location in memory is required, nearby locations will probably also be required

  • Temporal locality - once a word is loaded into cache, it remains there until the cache line is needed to hold another word of data.
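These two properties are what loop tiling ("cache blocking") exploits. Below is a minimal, illustrative sketch: a blocked matrix multiply, with the block size `bs` chosen arbitrarily. The pure-Python setting is an assumption for clarity only; in practice this optimization is written in C or Fortran, where the tiled loop order actually determines what stays resident in cache.

```python
# Illustrative sketch: blocked (tiled) matrix multiply.
# Tiling reuses each loaded sub-block while it is still "hot" in
# cache (temporal locality) and walks rows contiguously (spatial
# locality). Shown in pure Python only to make the loop order clear.

def matmul_naive(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]

def matmul_blocked(A, B, bs=2):
    n = len(A)
    C = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, bs):          # tile of rows of A/C
        for kk in range(0, n, bs):      # tile of the inner dimension
            for jj in range(0, n, bs):  # tile of columns of B/C
                for i in range(ii, min(ii + bs, n)):
                    for k in range(kk, min(kk + bs, n)):
                        a = A[i][k]     # reused across the whole j loop
                        for j in range(jj, min(jj + bs, n)):
                            C[i][j] += a * B[k][j]
    return C
```

Both routines compute the same product; only the traversal order (and hence the cache behavior on real hardware) differs.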


Non cache issues l.jpg
Non-Cache Issues

  • There are other issues to consider to achieve good serial performance:

    • Strength reductions, e.g., replacement of divisions with multiplications-by-inverse

    • Evaluate and replace common sub-expressions

    • Pushing loops inside subroutines to minimize subroutine call overhead

    • Force function inlining (compiler option)

    • Perform interprocedural analysis to eliminate redundant operations (compiler option)
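As an illustrative sketch of the first item (function names are mine, and an optimizing compiler will often do this for you), here is the division-to-multiplication replacement:

```python
# Sketch: replace a divide inside a loop with a multiply by a
# precomputed reciprocal. Divides are typically far more expensive
# than multiplies on the processors discussed above.

def scale_slow(xs, d):
    return [x / d for x in xs]      # one divide per element

def scale_fast(xs, d):
    inv = 1.0 / d                   # divide once...
    return [x * inv for x in xs]    # ...then multiply many times
```

The two versions can differ in the last bit of the result, which is why compilers usually require an explicit flag before applying this transformation.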


Parallel algorithms l.jpg
Parallel Algorithms

  • The algorithm must be naturally parallel!

    • Certain serial algorithms do not parallelize well. Developing a new parallel algorithm to replace a serial algorithm can be one of the most difficult tasks in parallel computing.

    • Keep in mind that your parallel algorithm may involve additional work or a higher floating point operation count.


Parallel algorithms89 l.jpg
Parallel Algorithms

  • Keep in mind that the algorithm should

    • need the minimum amount of communication (Monte Carlo algorithms are excellent examples)

    • balance the load among the processors equally

  • Fortunately, a lot of research has been done in parallel algorithms, particularly in the area of linear algebra. Don’t reinvent the wheel, take full advantage of the work done by others:

    • use parallel libraries supplied by the vendor whenever possible!

    • use ScaLAPACK, PETSc, etc. when applicable


Load balancing l.jpg

Load Balancing

The figures below show the timeline for parallel codes run on two processors (PE 0 and PE 1). In both cases, the total amount of work done is the same, but in the second case the work is distributed more evenly between the two processors, resulting in a shorter time to solution.

[Figure: two timelines along the time axis t, marking busy time, idle time, and synchronization points for PE 0 and PE 1]
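The effect can be captured with a toy cost model (a sketch; the work values below are invented): with a synchronization point at the end, the time to solution is the busiest processor's time, not the average.

```python
# Toy model of the timelines above: each entry is one PE's busy
# time. At a synchronization point, every PE waits for the busiest
# one, so the time to solution is the maximum (the "makespan").

def time_to_solution(busy_times):
    return max(busy_times)

unbalanced = [9.0, 3.0]   # total work 12.0, poorly distributed
balanced = [6.0, 6.0]     # same total work, evenly distributed
```

Same total work, but the balanced distribution finishes in 6.0 time units instead of 9.0.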


Communications l.jpg
Communications

  • Two key parameters of the communications network are

    • Latency: time required to initiate a message. This is the critical parameter in fine grained codes, which require frequent interprocessor communications. Can be thought of as the time required to send a message of zero length.

    • Bandwidth: steady-state rate at which data can be sent over the network. This is the critical parameter in coarse grained codes, which require infrequent communication of large amounts of data.


Latency and bandwidth example l.jpg
Latency and Bandwidth Example

  • Bucket brigade: the old style of fighting fires in which the townspeople formed a line from the well to the fire and passed buckets of water down the line

    • latency - the delay until the first bucket arrives at the fire

    • bandwidth - the rate at which buckets arrive at the fire


More on communications l.jpg
More on Communications

  • Time spent performing communications is considered overhead. Try to minimize the impact of communications:

    • minimize the effect of latency by combining large numbers of small messages into small numbers of large messages.

    • communications and computation do not have to be done sequentially; you can often overlap communication and computation

Sequential: t = t(comp) + t(comm)

Overlapped: t = t(comp) + t(comm) - [overlapped portion of t(comp) and t(comm)]; with complete overlap, t = max[t(comp), t(comm)]
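This timing model can be sketched directly (idealized: it assumes communication can be hidden completely behind computation):

```python
# Idealized timing model for overlapping communication with
# computation, per the expressions above.

def t_sequential(t_comp, t_comm):
    return t_comp + t_comm

def t_overlapped(t_comp, t_comm):
    # With complete overlap, only the longer activity is visible.
    return max(t_comp, t_comm)
```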


Combining small messages into larger ones l.jpg
Combining Small Messages into Larger Ones

The following examples of “phoning home” illustrate the value of combining many small messages into a single larger one.

Many short calls:

  • dial

  • “Hi mom”

  • hang up

  • dial

  • “How are things?”

  • hang up

  • dial

  • “in the U.S.?”

  • hang up

  • dial ... (at this point many mothers would not pick up the next call)

One long call:

  • dial

  • “Hi mom. How are things in the U.S.? Yak, yak...”

  • hang up

By transmitting a single large message, I only have to pay the price for the dialing latency once. I transmit more information in less time.
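The phone-call example corresponds to a simple latency/bandwidth cost model (a sketch; the latency and bandwidth numbers below are invented, though roughly representative of interconnects of this era):

```python
# Cost model: each message pays the latency once, plus the bytes
# moved at the steady-state bandwidth.

def transfer_time(n_messages, total_bytes, latency_s, bandwidth_Bps):
    return n_messages * latency_s + total_bytes / bandwidth_Bps

# Sending 1 MB as 1000 small messages vs. one large message,
# with 10 microsecond latency and 100 MB/s bandwidth:
many_small = transfer_time(1000, 1.0e6, 10e-6, 100e6)
one_large = transfer_time(1, 1.0e6, 10e-6, 100e6)
```

With these numbers, the 1000 small messages spend as much time paying latency as moving data, doubling the transfer time.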


Overlapping communications and computations l.jpg
Overlapping Communications and Computations

In the following example, a stencil operation is performed on a 10 x 10 array that has been distributed over two processors, PE0 and PE1. Assume periodic boundary conditions.

Stencil operation:

y(i,j) = x(i+1,j) + x(i-1,j) + x(i,j+1) + x(i,j-1)

  • Initiate communications

  • Perform computations on interior elements

  • Wait until communications are finished

  • Perform computations on boundary elements

[Figure: the distributed array - boundary elements require data from the neighboring processor; interior elements do not]
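A 1-D analogue of this schedule, as an illustrative sketch (the halo arguments stand in for the values that would arrive from the neighboring processor during the communication step):

```python
# One processor's share of a 1-D periodic stencil y[i] = x[i-1] + x[i+1].
# Interior points use only local data; the two boundary points need
# "halo" values from the neighbors, so they are computed last.

def stencil_interior(x):
    # Step 2: interior work proceeds while communication is in flight.
    return {i: x[i - 1] + x[i + 1] for i in range(1, len(x) - 1)}

def stencil_boundary(x, left_halo, right_halo):
    # Step 4: boundary points, once the halo values have arrived.
    return {0: left_halo + x[1], len(x) - 1: x[-2] + right_halo}

def stencil(x, left_halo, right_halo):
    y = stencil_interior(x)
    y.update(stencil_boundary(x, left_halo, right_halo))
    return [y[i] for i in range(len(x))]
```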


Amdahl s law l.jpg
Amdahl’s Law

Amdahl’s Law places a strict limit on the speedup that can be realized by using multiple processors. Two equivalent expressions for Amdahl’s Law are given below:

tN = (fp/N + fs)t1 Effect of multiple processors on run time

S = 1/(fs + fp/N) Effect of multiple processors on speedup

Where:

fs = serial fraction of code

fp = parallel fraction of code = 1 - fs

N = number of processors
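The two expressions translate directly into code (a small sketch):

```python
# Amdahl's Law, as given above.

def run_time(t1, fs, N):
    fp = 1.0 - fs                  # parallel fraction
    return (fp / N + fs) * t1      # effect of N processors on run time

def speedup(fs, N):
    fp = 1.0 - fs
    return 1.0 / (fs + fp / N)     # effect of N processors on speedup

# Even a 5% serial fraction caps the speedup at 1/0.05 = 20,
# no matter how many processors are used.
```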


Illustration of amdahl s law l.jpg
Illustration of Amdahl’s Law

It takes only a small fraction of serial content in a code to degrade the parallel performance. It is essential to determine the scaling behavior of your code before doing production runs using large numbers of processors


Amdahl s law vs reality l.jpg
Amdahl’s Law Vs. Reality

Amdahl’s Law provides a theoretical upper limit on parallel speedup assuming that there are no costs for communications. In reality, communications (and I/O) will result in a further degradation of performance.


More on amdahl s law l.jpg
More on Amdahl’s Law

  • Amdahl’s Law can be generalized to any two processes with different speeds

  • Ex.: apply it to f(processor) and f(memory):

    • The growing processor-memory performance gap will undermine our efforts at achieving the maximum possible speedup!


Generalized amdahl s law l.jpg
Generalized Amdahl’s Law

  • Amdahl’s Law can be further generalized to handle an arbitrary number of processes of various speeds. (The total fractions representing each process must still equal 1.)

  • This is a weighted Harmonic mean. Application performance is limited by performance of the slowest component as much as it is determined by the fastest.

Ravg = 1 / [ Σi ( fi / Ri ) ], where fi is the fraction of the work executed at rate Ri and the fi sum to 1
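As a sketch of this weighted harmonic mean in code (the fraction and rate values below are invented):

```python
# Weighted harmonic mean rate: fractions f[i] of the work run at
# rates R[i]. The slow component drags the average down far more
# than the fast component pulls it up.

def r_avg(fractions, rates):
    return 1.0 / sum(f / r for f, r in zip(fractions, rates))

# Half the work at rate 100, half at rate 1:
mixed = r_avg([0.5, 0.5], [100.0, 1.0])   # close to 2, nowhere near 50.5
```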


Gustafson s law l.jpg
Gustafson’s Law

  • Thus, Amdahl’s Law predicts that there is a maximum scalability for an application, determined by its parallel fraction, and this limit is generally not large.

  • There is a way around this: increase the problem size

    • bigger problems mean bigger grids or more particles: bigger arrays

    • number of serial operations generally remains constant; number of parallel operations increases: parallel fraction increases
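The standard statement of Gustafson's scaled speedup is not written out on the slide, so treat the formula below as an addition: if fs is the serial fraction measured on the parallel run and the parallel work grows with N, the scaled speedup is fs + N·(1 − fs), which keeps growing with N instead of saturating.

```python
# Gustafson's scaled speedup: the problem size grows with N, so the
# parallel part grows while the serial part stays fixed.

def scaled_speedup(fs, N):
    return fs + N * (1.0 - fs)
```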


The 1st question to ask yourself before you parallelize your code l.jpg
The 1st Question to Ask Yourself Before You Parallelize Your Code

  • Is it worth my time?

    • Do the CPU requirements justify parallelization?

    • Do I need a parallel machine in order to get enough aggregate memory?

    • Will the code be used just once or will it be a major production code?

  • Your time is valuable, and it can be very time consuming to write, debug, and test a parallel code. The more time you spend writing a parallel code, the less time you have to spend doing your research.


The 2nd question to ask yourself before you parallelize your code l.jpg
The 2nd Question to Ask Yourself Before You Parallelize Your Code

  • How should I decompose my problem?

    • Do the computations consist of a large number of small, independent problems - trajectories, parameter space studies, etc? May want to consider a scheme in which each processor runs the calculation for a different set of data

    • Does each computation have large memory or CPU requirements? Will probably have to break up a single problem across multiple processors


Distributing the data l.jpg
Distributing the Data

  • Decision on how to distribute the data should consider these issues:

    • Load balancing: often implies an equal distribution of data, but more generally means an equal distribution of work

    • Communications: want to minimize the impact of communications, taking into account both size and number of messages

    • Physics: choice of distribution will depend on the processes that are being modeled in each direction.


A data distribution example l.jpg
A Data Distribution Example

[First distribution] A good distribution if the physics of the problem is the same in both directions. Minimizes the amount of data that must be communicated between processors.

[Second distribution] If expensive global operations need to be carried out in the x-direction (ex. FFTs), this is probably a better choice.


A more difficult example l.jpg
A More Difficult Example

Imagine that we are doing a simulation in which more work is required for the grid points covering the shaded object. Neither data distribution from the previous example will result in good load balancing.

May need to consider an irregular grid or a different data structure.


Choosing a resource l.jpg
Choosing a Resource

  • The following factors should be taken into account when choosing a resource:

    • What is the granularity of my code?

    • Are there any special hardware features that I need or can take advantage of?

    • How many processors will the code be run on?

    • What are my memory requirements?

  • By carefully considering these points, you can make the right choice of computational platform.


Choosing a resource granularity l.jpg

Choosing a Resource: Granularity

Granularity is a measure of the amount of work done by each processor between synchronization events.

[Figure: timelines for PE 0 and PE 1 - a low-granularity application synchronizes frequently, a high-granularity application synchronizes infrequently]

Generally, latency is the critical parameter for low-granularity codes, while processor performance is the key factor for high-granularity applications.


Choosing a resource special hardware features l.jpg
Choosing a Resource: Special Hardware Features

  • Various HPC platforms have different hardware features that your code may be able to take advantage of. Examples include:

    • Hardware support for divide and square root operations (IBM SP)

    • Parallel I/O file system (IBM SP)

    • Data streams (CRAY T3E)

    • Control over cache alignment (CRAY T3E)

    • E-registers for by-passing the cache hierarchy (CRAY T3E)


Importance of parallel computing l.jpg
Importance of Parallel Computing

  • High performance computing has become almost synonymous with parallel computing.

  • Parallel computing is necessary to solve big problems (high resolution, lots of timesteps, etc.) in science and engineering.

  • Developing and maintaining efficient, scalable parallel applications is difficult. However, the payoff can be tremendous.


Importance of parallel computing111 l.jpg
Importance of Parallel Computing

  • Before jumping in, think about

    • whether or not your code truly needs to be parallelized

    • how to decompose your problem.

  • Then choose a programming model based on your problem and your available architecture.

  • Take advantage of the resources that are available - compilers, libraries, debuggers, performance analyzers, etc. - to help you write efficient parallel code.


Useful references l.jpg
Useful References

  • Hennessy, J. L. and Patterson, D. A. Computer Architecture: A Quantitative Approach.

  • Patterson, D.A. and Hennessy, J.L., Computer Organization and Design: The Hardware/Software Interface.

  • K. Dowd, High Performance Computing.

  • D. Kuck, High Performance Computing. Oxford U. Press (New York) 1996.

  • D. Culler and J. P. Singh, Parallel Computer Architecture.


Outline113 l.jpg
Outline

  • Preface

  • What is High Performance Computing?

  • Parallel Computing

  • Distributed Computing, Grid Computing, and More

  • Future Trends in HPC


Distributed computing l.jpg
Distributed Computing

  • Concept has been used for two decades

  • Basic idea: run a scheduler across systems to run processes on the least-used systems first

    • Maximize utilization

    • Minimize turnaround time

  • Have to load executables and input files to selected resource

    • Shared file system

    • File transfers upon resource selection
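A toy sketch of that basic idea (the greedy policy and job costs below are my own illustration, not how a production scheduler such as Condor actually works):

```python
# Greedy "least-used system first" placement: each incoming job is
# assigned to the system with the smallest current load, which tends
# to maximize utilization and minimize turnaround time.

def schedule(job_costs, n_systems):
    loads = [0.0] * n_systems
    placement = []
    for cost in job_costs:
        target = loads.index(min(loads))   # least-used system first
        loads[target] += cost
        placement.append(target)
    return placement, loads
```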


Examples of distributed computing l.jpg
Examples of Distributed Computing

  • Workstation farms, Condor flocks, etc.

    • Generally share file system

  • SETI@home, Entropia, etc.

    • Only one source code; central server copies correct binary code and input data to each system

  • Napster, Gnutella: file/data sharing

  • NetSolve

    • Runs numerical kernel on any of multiple independent systems, much like a Grid solution



Distributed vs parallel computing l.jpg
Distributed vs. Parallel Computing

  • Different

    • Distributed computing executes independent (but possibly related) applications on different systems; jobs do not communicate with each other

    • Parallel computing executes a single application across processors, distributing the work and/or data but allowing communication between processes

  • Non-exclusive: can distribute parallel applications to parallel computing systems


Grid computing l.jpg
Grid Computing

  • Enable communities (“virtual organizations”) to share geographically distributed resources as they pursue common goals—in the absence of central control, omniscience, trust relationships.

  • Resources (HPC systems, visualization systems & displays, storage systems, sensors, instruments, people) are integrated via ‘middleware’ to facilitate use of all resources.


Why grids l.jpg
Why Grids?

  • Resources have different functions, but multiple classes of resources are necessary for most interesting problems.

  • Power of any single resource is small compared to aggregations of resources

  • Network connectivity is increasing rapidly in bandwidth and availability

  • Large problems require teamwork and computation


Network bandwidth growth l.jpg
Network Bandwidth Growth

  • Network vs. computer performance

    • Computer speed doubles every 18 months

    • Network speed doubles every 9 months

    • Difference = order of magnitude per 5 years

  • 1986 to 2000

    • Computers: x 500

    • Networks: x 340,000

  • 2001 to 2010

    • Computers: x 60

    • Networks: x 4000

Moore’s Law vs. storage improvements vs. optical improvements. Graph from Scientific American (Jan-2001) by Cleo Vilett, source Vinod Khosla, Kleiner Perkins Caufield & Byers.


Grid possibilities l.jpg
Grid Possibilities

  • A biochemist exploits 10,000 computers to screen 100,000 compounds in an hour

  • 1,000 physicists worldwide pool resources for petaflop analyses of petabytes of data

  • Civil engineers collaborate to design, execute, & analyze shake table experiments

  • Climate scientists visualize, annotate, & analyze terabyte simulation datasets

  • An emergency response team couples real time data, weather model, population data


Some grid usage models l.jpg
Some Grid Usage Models

  • Distributed computing: job scheduling on Grid resources with secure, automated data transfer

  • Workflow: synchronized scheduling and automated data transfer from one system to next in pipeline (e.g. HPC system to visualization lab to storage system)

  • Coupled codes, with pieces running on different systems simultaneously

  • Meta-applications: parallel apps spanning multiple systems


Grid usage models l.jpg
Grid Usage Models

  • Some models are similar to models already being used, but are much simpler due to:

    • single sign-on

    • automatic process scheduling

    • automated data transfers

  • But Grids can encompass new resources like sensors and instruments, so new usage models will arise


Selected major grid projects l.jpg

Selected Major Grid Projects

[Table of selected major Grid projects - contents not preserved in this transcript]

There are also many technology R&D projects: e.g., Globus, Condor, NetSolve, Ninf, NWS, etc.


Example application projects l.jpg
Example Application Projects

  • Earth Systems Grid: environment (US DOE)

  • EU DataGrid: physics, environment, etc. (EU)

  • EuroGrid: various (EU)

  • Fusion Collaboratory (US DOE)

  • GridLab: astrophysics, etc. (EU)

  • Grid Physics Network (US NSF)

  • MetaNEOS: numerical optimization (US NSF)

  • NEESgrid: civil engineering (US NSF)

  • Particle Physics Data Grid (US DOE)


Some grid requirements systems deployment perspective l.jpg

Some Grid Requirements – Systems/Deployment Perspective

  • Identity & authentication

  • Authorization & policy

  • Resource discovery

  • Resource characterization

  • Resource allocation

  • (Co-)reservation, workflow

  • Distributed algorithms

  • Remote data access

  • High-speed data transfer

  • Performance guarantees

  • Monitoring

  • Adaptation

  • Intrusion detection

  • Resource management

  • Accounting & payment

  • Fault management

  • System evolution

  • Etc.


Some grid requirements user perspective l.jpg
Some Grid Requirements – User Perspective

  • Single allocation (or none needed)

  • Single sign-on: authentication to any Grid resources authenticates for all others

  • Single compute space: one scheduler for all Grid resources

  • Single data space: can address files and data from any Grid resources

  • Single development environment: Grid tools and libraries that work on all grid resources


The systems challenges resource sharing mechanisms that l.jpg
The Systems Challenges: Resource Sharing Mechanisms That…

  • Address security and policy concerns of resource owners and users

  • Are flexible enough to deal with many resource types and sharing modalities

  • Scale to large number of resources, many participants, many program components

  • Operate efficiently when dealing with large amounts of data & computation


The security problem l.jpg
The Security Problem

  • Resources being used may be extremely valuable & the problems being solved extremely sensitive

  • Resources are often located in distinct administrative domains

    • Each resource may have own policies & procedures

  • The set of resources used by a single computation may be large, dynamic, and/or unpredictable

    • Not just client/server

  • It must be broadly available & applicable

    • Standard, well-tested, well-understood protocols

    • Integration with wide variety of tools


The resource management problem l.jpg
The Resource Management Problem

  • Enabling secure, controlled remote access to computational resources and management of remote computation

    • Authentication and authorization

    • Resource discovery & characterization

    • Reservation and allocation

    • Computation monitoring and control


Grid systems technologies l.jpg
Grid Systems Technologies

  • Systems and security problems addressed by new protocols & services. E.g., Globus:

    • Grid Security Infrastructure (GSI) for security

    • Globus Metadata Directory Service (MDS) for discovery

    • Globus Resource Allocation Manager (GRAM) protocol as a basic building block

      • Resource brokering & co-allocation services

    • GridFTP for data movement


The programming problem l.jpg
The Programming Problem

  • How does a user develop robust, secure, long-lived applications for dynamic, heterogeneous Grids?

  • Presumably need:

    • Abstractions and models to add to speed/robustness/etc. of development

    • Tools to ease application development and diagnose common problems

    • Code/tool sharing to allow reuse of code components developed by others


Grid programming technologies l.jpg
Grid Programming Technologies

  • “Grid applications” are incredibly diverse (data, collaboration, computing, sensors, …)

    • Seems unlikely there is one solution

  • Most applications have been written “from scratch,” with or without Grid services

  • Application-specific libraries have been shown to provide significant benefits

  • No new language, programming model, etc., has yet emerged that transforms things

    • But certainly still quite possible


Examples of grid programming technologies l.jpg
Examples of Grid Programming Technologies

  • MPICH-G2: Grid-enabled message passing

  • CoG Kits, GridPort: Portal construction, based on N-tier architectures

  • GDMP, Data Grid Tools, SRB: replica management, collection management

  • Condor-G: simple workflow management

  • Legion: object models for Grid computing

  • Cactus: Grid-aware numerical solver framework

    • Note tremendous variety, application focus


Mpich g2 a grid enabled mpi l.jpg
MPICH-G2: A Grid-Enabled MPI

  • A complete implementation of the Message Passing Interface (MPI) for heterogeneous, wide area environments

    • Based on the Argonne MPICH implementation of MPI (Gropp and Lusk)

  • Globus services for authentication, resource allocation, executable staging, output, etc.

  • Programs run in wide area without change!

  • See also: MetaMPI, PACX, STAMPI, MAGPIE

www.globus.org/mpi


Grid events l.jpg
Grid Events

  • Global Grid Forum: working meeting

    • Meets 3 times/year, alternates U.S.-Europe, with July meeting as major event

  • HPDC: major academic conference

    • HPDC-11 in Scotland with GGF-8, July 2002

  • Other meetings include

    • IPDPS, CCGrid, EuroGlobus, Globus Retreats

www.gridforum.org, www.hpdc.org


Useful references140 l.jpg
Useful References

  • Book (Morgan Kaufman)

    • www.mkp.com/grids

  • Perspective on Grids

    • “The Anatomy of the Grid: Enabling Scalable Virtual Organizations”, IJSA, 2001

    • www.globus.org/research/papers/anatomy.pdf

  • All URLs in this section of the presentation, especially:

    • www.gridforum.org, www.grids-center.org, www.globus.org


Outline141 l.jpg
Outline

  • Preface

  • What is High Performance Computing?

  • Parallel Computing

  • Distributed Computing, Grid Computing, and More

  • Future Trends in HPC


Value of understanding future trends l.jpg
Value of Understanding Future Trends

  • Monitoring and understanding future trends in HPC is important:

    • users: applications should be written to be efficient on current and future architectures

    • developers: tools should be written to be efficient on current and future architectures

    • computing centers: system purchases are expensive and should have upgrade paths


The next decade l.jpg
The Next Decade

  • 1980s and 1990s:

    • academic and government requirements strongly influenced parallel computing architectures

    • academic influence was greatest in developing parallel computing software (for science & eng.)

    • commercial influence grew steadily in late 1990s

  • In the next decade:

    • commercialization will become dominant in determining the architecture of systems

    • academic/research innovations will continue to drive the development of the HPC software


Commercialization l.jpg
Commercialization

  • Computing technologies (including HPC) are now propelled by profits, not sustained by subsidies

    • Web servers, databases, transaction processing and especially multimedia applications drive the need for computational performance.

    • Most HPC systems are ‘scaled up’ commercial systems, with relatively little additional hardware and software beyond the commercial versions.

    • It’s not engineering, it’s economics.


Processors and nodes l.jpg
Processors and Nodes

  • Easy predictions:

    • microprocessor performance increases continue at ~60% per year (Moore’s Law) for 5+ years.

    • total migration to 64-bit microprocessors

    • use of even more cache, more memory hierarchy.

    • increased emphasis on SMPs

  • Tougher predictions:

    • resurgence of vectors in microprocessors? Maybe

    • dawn of multithreading in microprocessors? Yes


Building fat nodes smps l.jpg
Building Fat Nodes: SMPs

  • More processors are faster, of course

    • SMPs are simplest form of parallel systems

    • efficient if not limited by memory bus contention: small numbers of processors

  • Commercial market for high performance servers at low cost drives need for SMPs

  • HPC market for highest performance, ease of programming drives development of SMPs


Building fat nodes smps147 l.jpg
Building Fat Nodes: SMPs

  • Trends are to:

    • build bigger SMPs

    • attempt to share memory across SMPs (cc-NUMA)


Resurgence of vectors l.jpg
Resurgence of Vectors

  • Vectors keep functional units busy

    • vector registers are very fast

    • vectors are more efficient for loops of any stride

    • vectors are great for many science & eng. apps

  • Possible resurgence of vectors

    • SGI/Cray has built the SV1ex and is building the SV2

    • NEC continues building (CMOS) parallel-vector, Cray-like systems

    • Microprocessors (Pentium4, G4) have added vector-like functionality for multimedia purposes


Dawn of multithreading l.jpg
Dawn of Multithreading?

  • Memory speed will always be a bottleneck

  • Must overlap computation with memory accesses: tolerate latency

    • requires immense amount of parallelism

    • requires processors with multiple streams and compilers that can define multiple threads



Multithreading l.jpg
Multithreading

  • Tera MTA was first multithreaded HPC system

    • scientific success, production failure

    • MTA-2 will be delivered in a few months.

  • Multithreading will be implemented (in more limited fashion) in commercial processors.


Networks l.jpg
Networks

  • Commercial network bandwidth and latency approaching custom performance.

  • Dramatic performance increases likely

    • “the network is the computer” (Sun slogan)

    • more companies, more competition

    • no severe physical, economic limits

  • Implications of faster networks

    • more clusters

    • collaborative, visual supercomputing

    • Grid computing


Commodity clusters l.jpg
Commodity Clusters

  • Clusters provide some real advantages:

    • computing power: leverage workstations and PCs

    • high availability: replace one at a time

    • inexpensive: leverage existing competitive market

    • simple path to installing a parallel computing system

  • Major disadvantages were robustness of hardware and software, but both have improved

  • NCSA has huge clusters in production based on Pentium III and Itanium.


Clustering smps l.jpg
Clustering SMPs

  • Inevitable (already here!):

    • leverages SMP nodes effectively for the same reasons clusters leverage individual processors

    • Commercial markets drive need for SMPs

  • Combine advantages of SMPs, clusters

    • more powerful nodes through multiprocessing

    • more powerful nodes -> more powerful cluster

    • Interconnect scalability requirements reduced for same number of processors


Continued linux growth in hpc l.jpg
Continued Linux Growth in HPC

  • Linux popularity growing due to price and availability of source code

  • Major players now supporting Linux, esp. IBM

  • Head start on Intel Itanium


Programming tools l.jpg
Programming Tools

  • However, programming tools will continue to lag behind hardware and OS capabilities:

    • Researchers will continue to drive the need for the most powerful tools to create the most efficient applications on the largest systems

    • Such technologies will look more like MPI than the Web… maybe worse due to multi-tiered clusters of SMPs (MPI + OpenMP; Active messages + threads?).

    • Academia will continue to play a large role in HPC software development.


Grid computing157 l.jpg
Grid Computing

  • Parallelism will continue to grow in the form of

    • SMPs

    • clusters

    • Cluster of SMPs (and maybe DSMs)

  • Grids provide the next level

    • connects multiple computers into virtual systems

    • Already here:

      • IBM, other vendors supporting Globus

      • SC2001 dominated by Grid technologies

      • Many major government awards (>$100M in past year)


Emergence of grids l.jpg
Emergence of Grids

  • But Grids enable much more than apps running on multiple computers (which can be achieved with MPI alone)

    • virtual operating system: provides global workspace/address space via a single login

    • automatically manages files, data, accounts, and security issues

    • connects other resources (archival data facilities, instruments, devices) and people (collaborative environments)


Grids are inevitable l.jpg
Grids Are Inevitable

  • Inevitable (at least in HPC):

    • leverages computational power of all available systems

    • manages resources as a single system--easier for users

    • provides most flexible resource selection and management, load sharing

    • researchers’ desire to solve bigger problems will always outpace performance increases of single systems; just as multiple processors are needed, ‘multiple multiprocessors’ will be deemed so


Grid enabled software l.jpg
Grid-Enabled Software

  • Commercial applications on single parallel systems and Grids will require that:

    • underlying architectures must be invisible: no parallel computing expertise required

    • usage must be simple

    • development must not be too difficult

  • Developments in ease-of-use will benefit scientists as users (not as developers)

  • Web-based interfaces: transparent supercomputing (MPIRE, Meta-MEME, etc.).


Grid enabled collaborative and visual supercomputing l.jpg
Grid-Enabled Collaborative and Visual Supercomputing

  • Commercial world demands:

    • multimedia applications

    • real-time data processing

    • online transaction processing

    • rapid prototyping and simulation in engineering, chemistry and biology

    • interactive, remote collaboration

    • 3D graphics, animation and virtual reality visualization


Grid enabled collaborative visual supercomputing l.jpg
Grid-enabled Collaborative, Visual Supercomputing

  • Academic world will leverage resulting Grids linking computing and visualization systems via high-speed networks:

    • collaborative post-processing of data already here

    • simulations will be visualized in 3D, virtual worlds in real-time

    • such simulations can then be ‘steered’

    • multiple scientists can participate in these visual simulations

    • the ‘time to insight’ (SGI slogan) will be reduced
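‘Steering’ means a running simulation consults its parameters each step, so a scientist watching the visualization can change them mid-run. The sketch below is a minimal, single-process illustration; the `Control` class and the `dt` parameter are hypothetical stand-ins for real steering middleware.

```python
# Minimal computational-steering sketch: the loop re-reads a shared
# control object every step, so an external agent (here faked inline)
# can redirect the simulation while it runs.

class Control:
    """Holds steerable parameters; a real system would update these
    from a remote visualization client."""
    def __init__(self, dt):
        self.dt = dt

def simulate(control, steps):
    t = 0.0
    for step in range(steps):
        t += control.dt        # always use the *current* parameter value
        if step == 2:
            control.dt = 0.5   # stand-in for an interactive steering event
    return t

print(simulate(Control(dt=1.0), 5))  # 3 steps at dt=1.0, then 2 at dt=0.5 -> 4.0
```

Because the loop never caches `dt`, the change takes effect on the very next step; that immediacy is what shrinks the ‘time to insight’.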


Web-based Grid Computing

  • Web currently used mostly for content delivery

  • Web servers on HPC systems can execute applications

  • Web servers on Grids can launch applications, move/store/retrieve data, display visualizations, etc.

  • NPACI HotPage already enables single sign-on to NPACI Grid Resources
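A web-to-Grid gateway of the kind described above boils down to translating an HTTP request into a batch-queue submission. This is a hedged sketch, not how HotPage works: the application registry, the paths, and the qsub-style command line are all hypothetical, and the command is built but deliberately not executed.

```python
# Sketch of a web gateway that turns an HTTP request into a batch job.
# Everything named here (registry, paths, queue command) is hypothetical.
from urllib.parse import urlparse, parse_qs

# Registry of applications the gateway is allowed to launch.
APPLICATIONS = {
    "render":  "/opt/apps/render.sh",
    "analyze": "/opt/apps/analyze.sh",
}

def build_submission(url):
    """Translate a request URL into a batch-queue command line."""
    query = parse_qs(urlparse(url).query)
    app = query["app"][0]
    if app not in APPLICATIONS:
        raise ValueError("unknown application: " + app)
    # A real gateway would pass this list to subprocess.run() after
    # authenticating the user and checking allocation/security policy.
    return ["qsub", APPLICATIONS[app]]

print(build_submission("http://hpc.example.edu/run?app=render"))
```

Restricting launches to a fixed registry, rather than running arbitrary request strings, is the essential security choice in any such gateway.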


Summary of Expectations

  • HPC systems will grow in performance but probably change little in design (5-10 years):

    • HPC systems will be larger versions of smaller commercial systems, mostly large SMPs and clusters of inexpensive nodes

    • Some processors will exploit vectors, as well as more/larger caches.

    • Best HPC systems will have been designed ‘top-down’ instead of ‘bottom-up’, but all will have been designed to make the ‘bottom’ profitable.

    • Multithreading is the only likely, near-term major architectural change.


Summary of Expectations (continued)

  • Using HPC systems will change much more:

    • Grid computing will become widespread in HPC and in commercial computing

    • Visual supercomputing and collaborative simulation will be commonplace.

    • WWW interfaces to HPC resources will make transparent supercomputing commonplace.

  • But programming the most powerful resources most effectively will remain difficult.


Caution

  • Change is difficult to predict (and I am an astrophysicist, not an astrologer):

    • The accuracy of linear extrapolations degrades over long time spans (like weather forecasts)

    • Entirely new ideas can change everything:

      • The WWW is an excellent example; Grid computing is probably the next

      • Eventually, something truly different will replace CMOS technology (nanotechnology? molecular computing? DNA computing?)


Final Prediction

“The thing about change is that things will be different afterwards.”

Alan McMahon (Cornell University)

