Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More

Introduction to High Performance Computing: Parallel Computing, Distributed Computing, Grid Computing and More Dr. Jay Boisseau Director, Texas Advanced Computing Center boisseau@tacc.utexas.edu December 3, 2001 Texas Advanced Computing Center The University of Texas at Austin

Outline • Preface • What is High Performance Computing? • Parallel Computing • Distributed Computing, Grid Computing, and More • Future Trends in HPC

Purpose • Purpose of this workshop: • to educate researchers about the value and impact of high performance computing (HPC) techniques and technologies in conducting computational science and engineering • Purpose of this presentation: • to educate researchers about the techniques and tools of parallel computing, and to show them the possibilities presented by distributed computing and Grid computing

Goals • Goals of this presentation are to help you: • understand the ‘big picture’ of high performance computing • develop a comprehensive understanding of parallel computing • begin to understand how Grid and distributed computing will further enhance computational science capabilities

Content and Context • This material is an introduction and an overview • It is not a comprehensive treatment of HPC, so further reading (much more!) is recommended. • This presentation is followed by additional speakers with detailed presentations on specific HPC and science topics • Together, these presentations will help prepare you to use HPC in your scientific discipline.

Background - me • Director of the Texas Advanced Computing Center (TACC) at the University of Texas • Formerly at San Diego Supercomputer Center (SDSC), Arctic Region Supercomputing Center • 10+ years in HPC • Known Luis for 4 years - plan to develop strong relationship between TACC and CeCalCULA

Background – TACC • Mission: • to enhance the academic research capabilities of the University of Texas and its affiliates through the application of advanced computing resources and expertise • TACC activities include: • Resources • Support • Development • Applied research

TACC Activities • TACC resources and support include: • HPC systems • Scientific visualization resources • Data storage/archival systems • TACC research and development areas: • HPC • Scientific Visualization • Grid Computing

Current HPC Systems • CRAY SV1: 16 CPUs, 16 GB memory • IBM SP: 64+ procs, 256 MB/proc • CRAY T3E: 256+ procs, 128 MB/proc • [Diagram labels: aurora, golden, azure; 640 GB, 300 GB, 500 GB; ARCHIVE; HiPPI, Ascend Router, FDDI]

New HPC Systems • Four IBM p690 HPC servers • 16 Power4 Processors • 1.3 GHz: 5.2 Gflops per proc, 83.2 Gflops per server • 16 GB Shared Memory • >200 GB/s memory bandwidth! • 144 GB Disk • 1 TB disk to partition across servers • Will configure as single system (1/3 Tflop) with single GPFS system (1 TB) in 2Q02
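The peak figures on this slide follow from simple arithmetic. A minimal sketch, assuming each Power4 processor completes 4 floating point operations per clock period (the value implied by 5.2 Gflops at 1.3 GHz, not stated on the slide); the variable names are illustrative only.

```python
# Peak-performance arithmetic for the p690 servers described above.
CLOCK_GHZ = 1.3          # from the slide
FLOPS_PER_CYCLE = 4      # assumption: implied by 5.2 Gflops / 1.3 GHz
PROCS_PER_SERVER = 16    # from the slide
SERVERS = 4              # from the slide

gflops_per_proc = CLOCK_GHZ * FLOPS_PER_CYCLE
gflops_per_server = gflops_per_proc * PROCS_PER_SERVER
gflops_total = gflops_per_server * SERVERS

print(f"{gflops_per_proc:.1f} Gflops per processor")
print(f"{gflops_per_server:.1f} Gflops per server")
print(f"{gflops_total:.1f} Gflops total (about 1/3 Tflop)")
```

These are theoretical peak rates; sustained application performance is normally lower.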

New HPC Systems • IA64 Cluster: 20 2-way nodes, Itanium (800 MHz) processors, 2 GB memory/node, 72 GB disk/node, Myrinet 2000 switch, 180 GB shared disk • IA32 Cluster: 32 2-way nodes, Pentium III (1 GHz) processors, 1 GB memory, 18.2 GB disk/node, Myrinet 2000 switch • 750 GB IBM GPFS parallel file system for both clusters

World-Class Vislab • SGI Onyx2 • 24 CPUs, 6 Infinite Reality 2 Graphics Pipelines • 24 GB Memory, 750 GB Disk • Front and Rear Projection Systems • 3x1 cylindrically-symmetric Power Wall • 5x2 large-screen, 16:9 panel Power Wall • Matrix switch between systems, projectors, rooms

More Information • URL: www.tacc.utexas.edu • E-mail Addresses: • General Information: admin@tacc.utexas.edu • Technical assistance: remark@tacc.utexas.edu • Telephone Numbers: • Main Office: (512) 475-9411 • Facsimile transmission: (512) 475-9445 • Operations Room: (512) 475-9410

‘Supercomputing’ • First HPC systems were vector-based systems (e.g. Cray) • named ‘supercomputers’ because they were an order of magnitude more powerful than commercial systems • Now, ‘supercomputer’ has little meaning • large systems are now just scaled up versions of smaller systems • However, ‘high performance computing’ has many meanings

HPC Defined • High performance computing: • can mean high flop count • per processor • totaled over many processors working on the same problem • totaled over many processors working on related problems • can mean faster turnaround time • more powerful system • scheduled to first available system(s) • using multiple systems simultaneously

My Definitions • HPC: any computational technique that solves a large problem faster than possible using single, commodity systems • Custom-designed, high-performance processors (e.g. Cray, NEC) • Parallel computing • Distributed computing • Grid computing

My Definitions • Parallel computing: single systems with many processors working on the same problem • Distributed computing: many systems loosely coupled by a scheduler to work on related problems • Grid Computing: many systems tightly coupled by software and networks to work together on single problems or on related problems

Importance of HPC • HPC has had tremendous impact on all areas of computational science and engineering in academia, government, and industry. • Many problems have been solved with HPC techniques that were impossible to solve with individual workstations or personal computers.

What is a Parallel Computer? • Parallel computing: the use of multiple computers or processors working together on a common task • Parallel computer: a computer that contains multiple processors: • each processor works on its section of the problem • processors are allowed to exchange information with other processors

Parallel vs. Serial Computers • Two big advantages of parallel computers: • total performance • total memory • Parallel computers enable us to solve problems that: • benefit from, or require, fast solution • require large amounts of memory • example that requires both: weather forecasting

Parallel vs. Serial Computers • Some benefits of parallel computing include: • more data points • bigger domains • better spatial resolution • more particles • more time steps • longer runs • better temporal resolution • faster execution • faster time to solution • more solutions in same time • larger simulations in real time

Serial Processor Performance Although Moore’s Law ‘predicts’ that single processor performance doubles every 18 months, eventually physical limits on manufacturing technology will be reached
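To make the doubling rule concrete, here is a small sketch that converts "doubles every 18 months" into an annual growth rate and a cumulative speedup. The function name `speedup_after` is hypothetical, and the calculation assumes smooth exponential growth.

```python
# Convert a doubling time into an annual growth rate.
DOUBLING_TIME_YEARS = 1.5   # 18 months, from the slide

annual_factor = 2 ** (1 / DOUBLING_TIME_YEARS)
annual_growth = annual_factor - 1

def speedup_after(years):
    """Cumulative performance factor after the given number of years."""
    return 2 ** (years / DOUBLING_TIME_YEARS)

print(f"annual growth: {annual_growth:.1%}")
print(f"speedup after 3 years: {speedup_after(3):.1f}x")
```

Doubling every 18 months works out to a little under 60% per year.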

Types of Parallel Computers • The simplest and most useful way to classify modern parallel computers is by their memory model: • shared memory • distributed memory

Shared vs. Distributed Memory • Shared memory: single address space. All processors have access to a pool of shared memory. (Ex: SGI Origin, Sun E10000) [Diagram: processors (P) connected by a bus to a single memory] • Distributed memory: each processor has its own local memory. Processors must use message passing to exchange data. (Ex: CRAY T3E, IBM SP, clusters) [Diagram: processors (P), each with its own memory (M), connected by a network]
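The message-passing idea can be sketched in a few lines. This is only an emulation under stated assumptions: two Python threads stand in for two processors, each given its own slice of the data as "local memory", and a `queue.Queue` stands in for the interconnect. On a real distributed memory system this would be written with a message-passing library such as MPI; the names `worker`, `channel`, and `results` are hypothetical.

```python
import queue
import threading

def worker(rank, local_data, channel, results):
    # Each "processor" works only on the data it owns.
    partial = sum(local_data)
    if rank == 0:
        # Rank 0 receives rank 1's partial sum as a message.
        partial += channel.get()
        results["total"] = partial
    else:
        # Rank 1 sends its partial sum to rank 0.
        channel.put(partial)

channel = queue.Queue()   # stands in for the interconnect network
results = {}
data = list(range(10))

threads = [
    threading.Thread(target=worker, args=(0, data[:5], channel, results)),
    threading.Thread(target=worker, args=(1, data[5:], channel, results)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results["total"])
```

The point of the sketch is that neither worker reads the other's data directly; the only exchange is the explicit send and receive.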

Shared Memory: UMA vs. NUMA • Uniform memory access (UMA): Each processor has uniform access to memory. Also known as symmetric multiprocessors, or SMPs (Ex: Sun E10000) [Diagram: processors (P) connected by a bus to a single memory] • Non-uniform memory access (NUMA): Time for memory access depends on location of data. Local access is faster than non-local access. Easier to scale than SMPs (Ex: SGI Origin) [Diagram: two groups of processors, each on a bus with its own memory, joined by a network]

Distributed Memory: MPPs vs. Clusters • Processor-memory nodes are connected by some type of interconnect network • Massively Parallel Processor (MPP): tightly integrated, single system image. • Cluster: individual computers connected by s/w [Diagram: CPU-memory nodes attached to an interconnect network]

Processors, Memory, & Networks • Both shared and distributed memory systems have: • processors: now generally commodity RISC processors • memory: now generally commodity DRAM • network/interconnect: between the processors and memory (bus, crossbar, fat tree, torus, hypercube, etc.) • We will now begin to describe these pieces in detail, starting with definitions of terms.

Processor-Related Terms Clock period (cp): the minimum time interval between successive actions in the processor. Fixed: depends on design of processor. Measured in nanoseconds (~1-5 for fastest processors). Inverse of frequency (MHz). Instruction: an action executed by a processor, such as a mathematical operation or a memory operation. Register: a small, extremely fast location for storing data or instructions in the processor.

Processor-Related Terms Functional Unit (FU): a hardware element that performs an operation on an operand or pair of operands. Common FUs are ADD, MULT, INV, SQRT, etc. Pipeline: technique enabling multiple instructions to be overlapped in execution. Superscalar: multiple instructions are possible per clock period. Flops: floating point operations per second.

Processor-Related Terms Cache: fast memory (SRAM) near the processor. Helps keep instructions and data close to functional units so the processor can execute more instructions more rapidly. Translation-Lookaside Buffer (TLB): keeps addresses of pages (blocks of memory) in main memory that have recently been accessed (a cache for memory addresses).

Memory-Related Terms SRAM: Static Random Access Memory (RAM). Very fast (~10 nanoseconds), made using the same kind of circuitry as the processors, so speed is comparable. DRAM: Dynamic RAM. Longer access times (~100 nanoseconds), but holds more bits and is much less expensive (10x cheaper). Memory hierarchy: the hierarchy of memory in a parallel system, from registers to cache to local memory to remote memory. More later.

Interconnect-Related Terms • Latency: • Networks: How long does it take to start sending a "message"? Measured in microseconds. • Processors: How long does it take to output results of some operations, such as floating point add, divide etc., which are pipelined? • Bandwidth: What data rate can be sustained once the message is started? Measured in Mbytes/sec or Gbytes/sec
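Latency and bandwidth combine into a common first-order estimate of message time: time = latency + size / bandwidth. The sketch below uses that model; the latency and bandwidth values are hypothetical placeholders, not measurements of any system in this talk, and the function name `transfer_time_us` is illustrative.

```python
def transfer_time_us(message_bytes, latency_us, bandwidth_mbytes_per_s):
    """Estimated time in microseconds to send one message.

    Uses 1 Mbyte/s = 1 byte per microsecond (1 Mbyte taken as 10**6 bytes).
    """
    return latency_us + message_bytes / bandwidth_mbytes_per_s

LATENCY_US = 10.0        # hypothetical network latency
BANDWIDTH_MB_S = 100.0   # hypothetical sustained bandwidth

for size in (8, 1_000, 1_000_000):
    t = transfer_time_us(size, LATENCY_US, BANDWIDTH_MB_S)
    print(f"{size:>9} bytes: {t:10.2f} us")
```

Under this model, small messages are dominated by latency and large messages by bandwidth, which is why combining many small messages into fewer large ones usually pays off.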

Interconnect-Related Terms Topology: the manner in which the nodes are connected. • Best choice would be a fully connected network (every processor to every other). Infeasible for cost and scaling reasons. • Instead, processors are arranged in some variation of a grid, torus, or hypercube. [Diagrams: 2-d mesh, 2-d torus, 3-d hypercube]
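The three topologies named on this slide differ in which nodes count as neighbors. A minimal sketch, assuming nodes are labeled by (x, y) coordinates in the mesh and torus and by integer ids in the hypercube; the function names are hypothetical.

```python
def mesh_neighbors(x, y, nx, ny):
    """Neighbors of node (x, y) in an nx-by-ny 2-d mesh (no wraparound)."""
    candidates = [(x - 1, y), (x + 1, y), (x, y - 1), (x, y + 1)]
    return [(i, j) for i, j in candidates if 0 <= i < nx and 0 <= j < ny]

def torus_neighbors(x, y, nx, ny):
    """Neighbors in a 2-d torus: like a mesh, but the edges wrap around."""
    return [((x - 1) % nx, y), ((x + 1) % nx, y),
            (x, (y - 1) % ny), (x, (y + 1) % ny)]

def hypercube_neighbors(node, dim):
    """Neighbors in a dim-dimensional hypercube: flip one bit of the id."""
    return [node ^ (1 << bit) for bit in range(dim)]

print(mesh_neighbors(0, 0, 4, 4))    # a corner node has only 2 neighbors
print(torus_neighbors(0, 0, 4, 4))   # every torus node has 4 neighbors
print(hypercube_neighbors(5, 3))     # every 3-d hypercube node has 3
```

In each case a node has only a handful of links, in contrast to the fully connected network, where the number of links per node grows with the number of processors.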

Processor-Memory Problem • Processors issue instructions roughly every nanosecond. • DRAM can be accessed roughly every 100 nanoseconds (!). • DRAM cannot keep processors busy! And the gap is growing: • processors getting faster by 60% per year • DRAM getting faster by 7% per year (SDRAM and EDO RAM might help, but not enough)
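The two growth rates on this slide imply how fast the gap itself widens. A short sketch of that arithmetic, assuming both rates compound annually; `gap_after` is a hypothetical helper name.

```python
PROC_GROWTH = 0.60   # processors: 60% faster per year (from the slide)
DRAM_GROWTH = 0.07   # DRAM: 7% faster per year (from the slide)

# The gap is the ratio of the two speeds, so it grows by the ratio
# of the two annual factors.
gap_growth = (1 + PROC_GROWTH) / (1 + DRAM_GROWTH) - 1
print(f"gap grows about {gap_growth:.1%} per year")

def gap_after(years, initial_gap=1.0):
    """Relative processor-to-DRAM speed ratio after the given years."""
    return initial_gap * (1 + gap_growth) ** years

print(f"relative gap after 10 years: {gap_after(10):.0f}x")
```

The ratio 1.60 / 1.07 comes to roughly 1.5, that is, a gap growing by about 50% per year.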

Processor-Memory Performance Gap [Figure: performance (log scale, 1 to 1000) versus year (1980 to 2000). CPU (“Moore’s Law”) improves 60%/yr.; DRAM improves 7%/yr.; the processor-memory performance gap grows 50% / year. From D. Patterson, CS252, Spring 1998 ©UCB]

Processor-Memory Performance Gap • Problem becomes worse when remote (distributed or NUMA) memory is needed • network latency is roughly 1000-10000 nanoseconds (roughly 1-10 microseconds) • networks getting faster, but not fast enough • Therefore, cache is used in all processors • almost as fast as processors (same circuitry) • sits between processors and local memory • expensive, can only use small amounts • must design system to load cache effectively

Processor-Cache-Memory • Cache is much smaller than main memory, and hence there is a mapping of data from main memory to cache. [Diagram: CPU, cache, main memory]

Memory Hierarchy [Diagram: CPU, cache, local memory, remote memory]

Cache-Related Terms • ICACHE : Instruction cache • DCACHE (L1) : Data cache closest to registers • SCACHE (L2) : Secondary data cache • Data from SCACHE has to go through DCACHE to registers • SCACHE is larger than DCACHE • Not all processors have SCACHE

Cache Benefits • Data cache was designed with two key concepts in mind • Spatial Locality • When an element is referenced its neighbors will be referenced also • Cache lines are fetched together • Work on consecutive data elements in the same cache line • Temporal Locality • When an element is referenced, it might be referenced again soon • Arrange code so that data in cache is reused often
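Spatial locality can be illustrated by counting how often an access stream moves to a different cache line. This is a deliberately simplified sketch: it assumes a row-major array, remembers only the most recently used line, and uses hypothetical sizes (64-byte lines, 8-byte elements, a 64 x 64 array). The function name `line_loads` is illustrative.

```python
def line_loads(addresses, line_size):
    """Count how often the access stream switches to a different cache line.

    Simplification: only the most recently used line is remembered.
    """
    loads = 0
    current_line = None
    for addr in addresses:
        line = addr // line_size
        if line != current_line:
            loads += 1
            current_line = line
    return loads

N = 64      # hypothetical N x N array stored in row-major order
ELEM = 8    # bytes per element
LINE = 64   # hypothetical cache line size in bytes

# Byte address of element (i, j) is (i * N + j) * ELEM.
row_order = [(i * N + j) * ELEM for i in range(N) for j in range(N)]
col_order = [(i * N + j) * ELEM for j in range(N) for i in range(N)]

print("row-by-row traversal:", line_loads(row_order, LINE))
print("column-by-column traversal:", line_loads(col_order, LINE))
```

Walking consecutive elements reuses each fetched line for 8 accesses, while the strided traversal lands on a new line every time. A real cache holds many lines, so the actual penalty depends on cache size and associativity.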

Direct-Mapped Cache • Direct mapped cache: A block from main memory can go in exactly one place in the cache. This is called direct mapped because there is direct mapping from any block address in memory to a single location in the cache. [Diagram: cache and main memory]
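A minimal simulation of a direct-mapped cache, assuming the common convention that the slot is the block address modulo the number of cache blocks (the slide does not specify the mapping function). The class name and the access pattern are hypothetical.

```python
class DirectMappedCache:
    """Toy direct-mapped cache that tracks block addresses only."""

    def __init__(self, num_blocks):
        self.num_blocks = num_blocks
        self.slots = [None] * num_blocks   # block address held in each slot
        self.hits = 0
        self.misses = 0

    def access(self, block_address):
        # Assumed mapping: each block has exactly one possible slot.
        slot = block_address % self.num_blocks
        if self.slots[slot] == block_address:
            self.hits += 1
        else:
            self.misses += 1
            self.slots[slot] = block_address   # evict whatever was there

cache = DirectMappedCache(num_blocks=8)
for block in [0, 8, 0, 8, 1, 1]:
    cache.access(block)

print("hits:", cache.hits, "misses:", cache.misses)
```

Blocks 0 and 8 compete for the same slot and keep evicting each other even though the rest of the cache is empty. This kind of conflict is the main weakness of direct mapping.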

Fully Associative Cache • Fully associative cache: A block from main memory can be placed in any location in the cache. This is called fully associative because a block in main memory may be associated with any entry in the cache. [Diagram: cache and main memory]

Set Associative Cache • Set associative cache: The middle range of designs between direct mapped cache and fully associative cache is called set-associative cache. In an N-way set-associative cache, a block from main memory can go into N (N > 1) locations in the cache. [Diagram: 2-way set-associative cache and main memory]

Cache-Related Terms Least Recently Used (LRU): Cache replacement strategy for set associative caches. The cache block that is least recently used is replaced with a new block. Random Replace: Cache replacement strategy for set associative caches. A cache block is randomly replaced.
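The LRU strategy can be sketched with a toy N-way set-associative cache. It assumes the set is chosen as the block address modulo the number of sets (a common convention, not stated on the slides) and uses Python's `collections.OrderedDict` to keep each set ordered from least to most recently used. The class name and the access pattern are hypothetical.

```python
from collections import OrderedDict

class SetAssociativeLRUCache:
    """Toy N-way set-associative cache with LRU replacement."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        # One OrderedDict per set, ordered least to most recently used.
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, block_address):
        entries = self.sets[block_address % self.num_sets]
        if block_address in entries:
            self.hits += 1
            entries.move_to_end(block_address)   # now most recently used
            return
        self.misses += 1
        if len(entries) == self.ways:
            entries.popitem(last=False)          # evict least recently used
        entries[block_address] = True

lru_cache = SetAssociativeLRUCache(num_sets=4, ways=2)
for block in [0, 8, 0, 8, 4, 0]:
    lru_cache.access(block)

print("hits:", lru_cache.hits, "misses:", lru_cache.misses)
```

Blocks 0, 8, and 4 all map to the same set. With two ways, blocks 0 and 8 can coexist, so the repeated accesses hit; bringing in block 4 then evicts whichever of the two was used least recently. Random replacement would instead pick the victim with a random number generator.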

Example: CRAY T3E Cache • The CRAY T3E processors can execute • 2 floating point ops (1 add, 1 multiply) and • 2 integer/memory ops (includes 2 loads or 1 store) • To help keep the processors busy • on-chip 8 KB direct-mapped data cache • on-chip 8 KB direct-mapped instruction cache • on-chip 96 KB 3-way set associative secondary data cache with random replacement.

Putting the Pieces Together • Recall: • Shared memory architectures: • Uniform Memory Access (UMA): Symmetric Multi-Processors (SMP). Ex: Sun E10000 • Non-Uniform Memory Access (NUMA): Most common are Distributed Shared Memory (DSM), or cc-NUMA (cache coherent NUMA) systems. Ex: SGI Origin 2000 • Distributed memory architectures: • Massively Parallel Processor (MPP): tightly integrated system, single system image. Ex: CRAY T3E, IBM SP • Clusters: commodity nodes connected by interconnect. Example: Beowulf clusters.

Symmetric Multiprocessors (SMPs) • SMPs connect processors to global shared memory using one of: • bus • crossbar • Provides simple programming model, but has problems: • buses can become saturated • crossbar size must increase with # processors • Problem grows with number of processors, limiting maximum size of SMPs

Shared Memory Programming • Programming models are easier since message passing is not necessary. Techniques: • autoparallelization via compiler options • loop-level parallelism via compiler directives • OpenMP • pthreads • More on programming models later.
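Loop-level parallelism amounts to dividing a loop's iterations among threads that all read the same shared data. The sketch below shows that decomposition with Python's standard `concurrent.futures.ThreadPoolExecutor`. It is not OpenMP or pthreads, and CPython's global interpreter lock means it shows the structure rather than a real speedup. The helper names `chunk_bounds` and `parallel_sum_of_squares` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def chunk_bounds(n, num_workers):
    """Split range(n) into num_workers contiguous, nearly equal chunks."""
    base, extra = divmod(n, num_workers)
    bounds = []
    start = 0
    for w in range(num_workers):
        stop = start + base + (1 if w < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds

def parallel_sum_of_squares(values, num_workers=4):
    """Sum v*v over values, with each worker handling one chunk."""
    def work(bounds):
        start, stop = bounds
        # Every worker reads the same shared list; no messages are needed.
        return sum(v * v for v in values[start:stop])

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partials = pool.map(work, chunk_bounds(len(values), num_workers))
        return sum(partials)   # combine the per-worker partial results

values = list(range(1000))
print(parallel_sum_of_squares(values))
```

The final `sum(partials)` plays the role of a reduction: each worker produces a private partial result, and the partial results are combined once at the end.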
