  1. CSCI-455/522 Introduction to High Performance Computing Lecture 2

  2. Types of Parallel Computers Two principal types: • Shared memory multiprocessor • Distributed memory multicomputer

  3. Shared vs. Distributed Memory Shared memory - single address space; all processors have access to a pool of shared memory over a common bus. (Ex: SGI Origin, Sun E10000) Distributed memory - each processor has its own local memory; message passing over a network is needed to exchange data between processors. (Ex: CRAY T3E, IBM SP, clusters) (Diagram: processors sharing one memory over a bus vs. processor-memory pairs connected by a network.)

  4. Shared Memory Multiprocessor

  5. Conventional Computer Consists of a processor executing a program stored in a (main) memory: each main memory location is identified by its address. Addresses start at 0 and extend to 2^b - 1 when there are b bits (binary digits) in an address (e.g., b = 32 gives addresses 0 to 2^32 - 1). (Diagram: processor receiving instructions from main memory and exchanging data with it.)

  6. Shared Memory Multiprocessor System Natural way to extend the single-processor model - have multiple processors connected to multiple memory modules, such that each processor can access any memory module: (Diagram: processors and memory modules joined by an interconnection network, forming one address space.)

  7. Simplistic View of a Small Shared Memory Multiprocessor Examples: • Dual Pentiums • Quad Pentiums (Diagram: processors connected to shared memory by a single bus.)

  8. Quad Pentium Shared Memory Multiprocessor (Diagram: four processors, each with its own L1 cache, L2 cache, and bus interface, connected by a processor/memory bus to a memory controller, shared memory, and an I/O interface on the I/O bus.)

  9. Programming Shared Memory Multiprocessors • Threads - the programmer decomposes the program into individual parallel sequences (threads), each able to access variables declared outside the threads. Example: Pthreads • Sequential programming language with preprocessor compiler directives to declare shared variables and specify parallelism. Example: OpenMP - industry standard - needs an OpenMP compiler (short sketches of both approaches appear below)
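As a concrete illustration of the thread approach above, here is a minimal Pthreads sketch in C that sums an array with several threads. The array size, thread count, and names such as sum_slice and partial_sum are illustrative choices, not part of the slides.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    #define N 1000

    /* Variables declared outside the thread functions are visible to all threads. */
    static double a[N];
    static double partial_sum[NTHREADS];

    /* Each thread sums its own contiguous slice of the shared array. */
    static void *sum_slice(void *arg) {
        long id = (long)arg;
        long lo = id * (N / NTHREADS);
        long hi = (id == NTHREADS - 1) ? N : lo + N / NTHREADS;
        double s = 0.0;
        for (long i = lo; i < hi; i++)
            s += a[i];
        partial_sum[id] = s;   /* each thread writes its own slot, so no lock is needed */
        return NULL;
    }

    int main(void) {
        pthread_t tid[NTHREADS];
        for (long i = 0; i < N; i++)
            a[i] = 1.0;
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&tid[t], NULL, sum_slice, (void *)t);
        double total = 0.0;
        for (long t = 0; t < NTHREADS; t++) {
            pthread_join(tid[t], NULL);
            total += partial_sum[t];
        }
        printf("sum = %f\n", total);   /* expect 1000.000000 */
        return 0;
    }

Compile with a pthread-aware invocation, e.g. cc -pthread.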

  10. Sequential programming language with added syntax to declare shared variables and specify parallelism. Example: UPC (Unified Parallel C) - needs a UPC compiler. • Parallel programming language with syntax to express parallelism - compiler creates executable code for each processor (not now common) • Sequential programming language, relying on a parallelizing compiler to convert it into parallel executable code - also not now common
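For contrast with UPC's added syntax, the directive-based approach named on slide 9 (OpenMP) keeps the base language sequential and adds compiler directives. Below is a minimal sketch of the same array sum as before, assuming an OpenMP compiler (e.g. gcc -fopenmp); the sizes and names are again illustrative.

    #include <omp.h>
    #include <stdio.h>

    #define N 1000

    int main(void) {
        static double a[N];
        double sum = 0.0;

        for (int i = 0; i < N; i++)
            a[i] = 1.0;

        /* The pragma is the only parallel syntax: the array is shared,
           the loop index is private, and the reduction clause combines
           the per-thread partial sums. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += a[i];

        printf("sum = %f\n", sum);   /* expect 1000.000000 */
        return 0;
    }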

  11. Shared Memory: UMA vs. NUMA Uniform memory access (UMA): each processor has uniform access to memory. Also known as symmetric multiprocessors, or SMPs (Sun E10000). Non-uniform memory access (NUMA): time for memory access depends on the location of the data; local access is faster than non-local access. Easier to scale than SMPs (SGI Origin). (Diagram: a UMA system with all processors on one bus to a single memory, vs. a NUMA system with several bus-connected processor/memory groups joined by a network.)

  12. Distributed Memory / Message-Passing Multicomputers

  13. Message-Passing Multicomputer Complete computers connected through an interconnection network: (Diagram: computers, each containing a processor and local memory, exchanging messages over the interconnection network.)

  14. Distributed Memory: MPPs vs. Clusters • Processor-memory nodes are connected by some type of interconnect network • Massively Parallel Processor (MPP): tightly integrated, single system image • Cluster: individual computers connected and managed by software (Diagram: CPU/memory nodes attached to an interconnect network.)

  16. Clusters • Similar to MPPs: commodity processors and memory; processor performance must be maximized; memory hierarchy includes remote memory; no shared memory - message passing; communication overhead must be minimized • Different from MPPs: all commodity, including interconnect and OS; multiple independent systems - more robust; separate I/O systems

  17. Interconnection Networks • Limited and exhaustive interconnections • 2- and 3-dimensional meshes • Hypercube (not now common) • Using Switches: • Crossbar • Trees • Multistage interconnection networks

  18. Communications Networks • Custom • Many vendors have custom interconnects that provide high performance for their systems, especially MPPs • CRAY T3E interconnect is the fastest for MPPs: lowest latency, highest bandwidth • Commodity • Used in some MPPs and all clusters • Myrinet, Gigabit Ethernet, Fast Ethernet, etc.

  19. Types of Interconnects • Fully connected • not feasible • Array and torus • Intel Paragon (2D array), CRAY T3E (3D torus) • Crossbar • IBM SP (8 nodes) • Hypercube • SGI Origin 2000 (hypercube), Meiko CS-2 (fat tree) • Combinations of some of the above • IBM SP (crossbar & fully connected for 80 nodes) • IBM SP (fat tree for > 80 nodes)

  20. Two-dimensional Array (Mesh) (Diagram: a grid of computers/processors joined by links.) Also three-dimensional - used in some large high-performance systems.

  21. Three-dimensional Hypercube

  22. Four-dimensional Hypercube Hypercubes were popular in the 1980s - not now
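To make the hypercube structure concrete: in a d-dimensional hypercube with 2^d nodes, two nodes are linked exactly when their binary addresses differ in one bit, so a node's neighbors are found by flipping each address bit in turn. A small C sketch of this addressing rule (the function name is an illustrative choice):

    #include <stdio.h>

    /* Print the neighbors of a node in a d-dimensional hypercube.
       Node addresses run from 0 to 2^d - 1; flipping bit k of the
       address gives the neighbor along dimension k. */
    static void print_neighbors(unsigned node, unsigned d) {
        printf("node %u:", node);
        for (unsigned k = 0; k < d; k++)
            printf(" %u", node ^ (1u << k));
        printf("\n");
    }

    int main(void) {
        const unsigned d = 4;                  /* four-dimensional: 16 nodes */
        for (unsigned node = 0; node < (1u << d); node++)
            print_neighbors(node, d);
        return 0;
    }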

  23. Crossbar Switch (Diagram: processors connected to memories through a grid of switches.)

  24. Tree (Diagram: processors at the leaves, connected by links through switch elements up to a root switch.)

  25. Multistage Interconnection Network. Example: Omega network, built from 2 × 2 switch elements (straight-through or crossover connections). (Diagram: eight inputs 000-111 routed through three stages of 2 × 2 switches to eight outputs 000-111.)
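A common way to set the switches in an Omega network is destination-tag routing: with N = 2^n inputs there are n stages, and at stage s the 2 × 2 switch examines bit (n - 1 - s) of the destination address, taking the upper output if the bit is 0 and the lower output if it is 1. A minimal sketch under that assumption (function and variable names are illustrative):

    #include <stdio.h>

    /* Destination-tag routing in an N-input Omega network (N = 2^n_stages).
       At each stage the switch examines one destination bit, from most
       significant to least significant: 0 -> upper output, 1 -> lower output. */
    static void route(unsigned dest, unsigned n_stages) {
        for (unsigned s = 0; s < n_stages; s++) {
            unsigned bit = (dest >> (n_stages - 1 - s)) & 1u;
            printf("stage %u: %s output\n", s, bit ? "lower" : "upper");
        }
    }

    int main(void) {
        /* Route a message to output 101 (5) in an 8-input, 3-stage network. */
        route(5u, 3u);   /* prints: lower, upper, lower */
        return 0;
    }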

  26. Distributed Shared Memory Making the main memory of a group of interconnected computers look as though it is a single memory with a single address space. Then shared memory programming techniques can be used. (Diagram: computers, each holding part of the shared memory, exchanging messages over the interconnection network.)

  27. Flynn’s Classifications Flynn (1966) created a classification for computers based upon instruction streams and data streams: • Single instruction stream-single data stream (SISD) computer Single-processor computer - a single stream of instructions is generated from the program. Instructions operate upon a single stream of data items.

  28. Single Instruction Stream-Multiple Data Stream (SIMD) Computer • A specially designed computer - a single instruction stream from a single program, but multiple data streams exist. Instructions from the program are broadcast to more than one processor. Each processor executes the same instruction in synchronism, but using different data. • Developed because a number of important applications mostly operate upon arrays of data.

  29. Multiple Instruction Stream-Multiple Data Stream (MIMD) Computer General-purpose multiprocessor system - each processor has a separate program and one instruction stream is generated from each program for each processor. Each instruction operates upon different data. Both the shared memory and the message-passing multiprocessors so far described are in the MIMD classification.

  30. The Banking Analogy • Tellers: Parallel Processors • Customers: tasks • Transactions: operations • Accounts: data

  31. Vector/Array • Each teller/processor gets a very fine-grained task • Use pipeline parallelism • Good for handling batches when operations can be broken down into fine-grained stages

  32. SIMD (Single-Instruction-Multiple-Data) • All processors do the same thing or idle • Phase 1: data partitioning and distribution • Phase 2: data-parallel processing • Efficient for big, regular data sets

  33. Systolic Array • Combination of SIMD and Pipeline parallelism • 2-d array of processors with memory at the boundary • Tighter coordination between processors • Achieve very high speeds by circulating data among processors before returning to memory

  34. MIMD (Multiple-Instruction-Multiple-Data) • Each processor (teller) operates independently • Need synchronization mechanism • by message passing • or mutual exclusion (locks) • Best suited for large-grained problems (coarser-grained than data-flow parallelism)

  35. Networked Computers as a Computing Platform • A network of computers became a very attractive alternative to expensive supercomputers and parallel computer systems for high-performance computing in the early 1990s. • Several early projects. Notable: • Berkeley NOW (network of workstations) project • NASA Beowulf project

  36. Key advantages: • Very high performance workstations and PCs readily available at low cost. • The latest processors can easily be incorporated into the system as they become available. • Existing software can be used or modified.

  37. Software Tools for Clusters • Based upon message-passing parallel programming: • Parallel Virtual Machine (PVM) - developed in the late 1980s. Became very popular. • Message-Passing Interface (MPI) - standard defined in the 1990s. • Both provide a set of user-level libraries for message passing. Use with regular programming languages (C, C++, ...). A minimal MPI sketch follows.
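To illustrate the library style of message passing, here is a minimal MPI sketch in C in which rank 0 sends one integer to rank 1. The tag value and variable names are illustrative; build with an MPI compiler wrapper such as mpicc and run with at least two processes (e.g. mpirun -np 2 ./a.out).

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I? */

        if (rank == 0) {
            value = 42;                                     /* illustrative payload */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
            printf("rank 0 sent %d\n", value);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }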

  38. Beowulf Clusters* • A group of interconnected “commodity” computers achieving high performance at low cost. • Typically uses commodity interconnects - high-speed Ethernet - and the Linux OS. * Beowulf comes from the name given to the NASA Goddard Space Flight Center cluster project.

  39. Cluster Interconnects • Originally Fast Ethernet on low-cost clusters • Gigabit Ethernet - easy upgrade path More Specialized/Higher Performance • Myrinet - 2.4 Gbits/sec - disadvantage: single vendor • cLan • SCI (Scalable Coherent Interface) • QNet • InfiniBand - may be important as InfiniBand interfaces may be integrated on next-generation PCs

  40. Dedicated Cluster with a Master Node (Diagram: a user reaches the master node over the external network; the master node's second Ethernet interface connects through a switch to the compute nodes, with an uplink back to the external network.)
