
Parallel Programming


Presentation Transcript


  1. Parallel Programming
  Techniques for using multiple computers that coordinate to solve problems
  • Categories
    • Hardware parallelism
      • Fine grain parallelism (instruction level)
      • Coarse grain parallelism (block level)
  • Parallel Systems
    • Shared Memory Systems
      • Uniform Memory Access (UMA)
      • Cache Only Memory Access (COMA)
      • Non-Uniform Memory Access (NUMA)
    • Distributed Memory Systems
      • Beowulf Clusters
      • Networks or Clusters of Workstations
      • Computational Grids (heterogeneous and geographically separated)

  2. History
  • Early Systems (1950s and 1960s)
    • Slow, high failure rates, limited operating systems
    • Programmer controlled
    • Poor resource utilization
  • Pre-PC (1960s and 1970s)
    • Medium numbers of systems
    • Operating system development
    • Multi-threading and multi-programming, time sharing
    • Good resource utilization
  • Today
    • Billions of systems
    • Single-user workstations
    • Poor resource utilization
  • Future (parallel resource utilization)

  3. Flynn Categories
  • SISD, SIMD, MISD, MIMD
  • Within MIMD
    • SPMD and MPMD
    • Multiprocessor systems: threads, critical sections
    • Multi-computer systems: message passing (sockets, RMI)
  • Hardware configurations
    • Shared memory multiprocessor: processors P0–P3 all connect to a single memory
    • Distributed memory multi-computer: processors P0–P3 each have their own local memory M0–M3
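To make the multiprocessor side of MIMD concrete, here is a minimal shared-memory sketch (Python is assumed purely for illustration; the slides name no language): several threads update one shared counter, and a lock marks the critical section.

# Shared-memory (multiprocessor) sketch: threads and a critical section.
# Python assumed for illustration only.
import threading

counter = 0                      # shared state visible to all threads
lock = threading.Lock()          # protects the critical section

def worker(iterations):
    global counter
    for _ in range(iterations):
        with lock:               # critical section: one thread at a time
            counter += 1

threads = [threading.Thread(target=worker, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)                   # 40000 with the lock; a data race without it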

  4. Parallel Techniques
  • Peer-to-peer
    • Multiple computers coordinate to run a single application
    • Mechanisms: threads, message passing
  • Client-server
    • A server responds to many clients running many applications
    • Mechanisms: remote method invocation (RMI), sockets
  The focus of this class is peer-to-peer applications
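As a message-passing counterpart to the threaded sketch above (again only a sketch, with Python's multiprocessing standing in for sockets, RMI, or MPI on a real cluster), two peer processes cooperate on one job by exchanging messages instead of sharing memory:

# Message-passing (peer-to-peer) sketch: each peer works on local data
# and sends its partial result as a message.
from multiprocessing import Process, Pipe

def peer(conn, data):
    partial = sum(data)          # work on the peer's own local data
    conn.send(partial)           # message passing: send the partial result
    conn.close()

if __name__ == "__main__":
    numbers = list(range(1, 101))
    half = len(numbers) // 2
    parent_a, child_a = Pipe()
    parent_b, child_b = Pipe()
    Process(target=peer, args=(child_a, numbers[:half])).start()
    Process(target=peer, args=(child_b, numbers[half:])).start()
    total = parent_a.recv() + parent_b.recv()   # combine the two messages
    print(total)                                # 5050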

  5. Parallel Performance Metrics
  • Complexity (Big-Oh notation)
    • f(n) = O(g(n)) if there are constants z, c > 0 such that f(n) ≤ c·g(n) when n > z
  • Speed-up: S(p) = t1 / tp
  • Cost: C(p) = tp · p
  • Efficiency: E(p) = t1 / C(p)
  • Computation-to-communication ratio: tcomp / tcomm
  • Latency and bandwidth analysis
    • tcomm is a function of the data transmitted and the number of messages
    • tp is a function of computation and communication
  • Scalability
    • An imprecise term measuring the ability to add more processors
    • We might say an application scales well to 256 processors
    • Scalability can refer to hardware, software, or both
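A small worked sketch (Python, with invented timings; t1 = 64 s and tp = 10 s on p = 8 processors are assumptions, not measurements) shows how these metrics relate:

# Worked example of the metrics above; substitute measured values in practice.
t1, tp, p = 64.0, 10.0, 8        # sequential time, parallel time, processors

speedup    = t1 / tp             # S(p) = t1 / tp            -> 6.4
cost       = tp * p              # C(p) = tp * p             -> 80.0
efficiency = t1 / cost           # E(p) = t1 / C(p)          -> 0.8

t_comp, t_comm = 9.0, 1.0        # assumed split of the parallel time
ratio = t_comp / t_comm          # computation-to-communication ratio -> 9.0

print(speedup, cost, efficiency, ratio)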

  6. Superlinear Speed-up (S(p) > p)
  Reasons for superlinear speed-up:
  • Non-optimal sequential algorithm
    • Solution: compare to an optimal sequential algorithm
    • Parallel versions are often different from sequential versions
  • Specialized hardware on certain processors
    • Example: a processor has fast graphics but computes slowly
    • NASA superlinear distributed grid application
  • Average case doesn't match a single run
    • Consider a search application (sketched below)
    • What are the speed-up possibilities?
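The search question can be made concrete with a toy sketch (Python assumed, made-up numbers): in a single run, a partitioned parallel search can beat p-fold speed-up whenever the target happens to sit near the front of one worker's block, even though the average case does not.

# Single-run speed-up of a partitioned search (toy illustration).
# Sequential search scans from the left; each of p workers scans its own
# block, and all stop as soon as any worker finds the target.
n, p = 1_000_000, 4
target_index = 750_005           # assumed position of the item being sought
block = n // p

t_seq = target_index + 1                 # comparisons for sequential search
t_par = (target_index % block) + 1       # comparisons by the worker that finds it

print(t_seq / t_par)             # ~125000x for this run: far more than p = 4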

  7. Speed-up Potential
  • Amdahl's "pessimistic" law
    • The fraction of sequential processing (f) is fixed
    • S(p) = ts / (f·ts + (1−f)·ts/p) → 1/f as p → ∞
  • Gustafson's "optimistic" law
    • More data implies the parallel portion grows
    • Assumes more capability leads to more data
    • S(p) = f + (1−f)·p
  • For each law (see the sketch below)
    • What is the best speed-up if f = 0.25?
    • What is the speed-up for 16 processors?
    • Which assumption is valid?
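A quick sketch (Python assumed) evaluates both laws for the questions above, using f = 0.25 and p = 16:

# Amdahl's and Gustafson's laws for f = 0.25, as asked on the slide.
f = 0.25

def amdahl(p):                   # S(p) = 1 / (f + (1 - f)/p), limit 1/f
    return 1.0 / (f + (1.0 - f) / p)

def gustafson(p):                # S(p) = f + (1 - f) * p, grows with p
    return f + (1.0 - f) * p

print(amdahl(16))                # ~3.37
print(1.0 / f)                   # 4.0: best possible under Amdahl
print(gustafson(16))             # 12.25, and unbounded as p grows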

  8. Parallel Programming Examples
  • Embarrassingly parallel applications
    • Google searches employ > 100,000 processors
  • Applications with unacceptable sequential run times
  • Grand Challenge Problems
    • 1991 High Performance Computing Act (Law 102-94): "Fundamental Grand Challenge science and engineering problems with broad economic and/or scientific impact and whose solution can be advanced by applying high performance computing techniques and resources."
    • Promotes terascale-level computation over high-bandwidth wide area computational grids

  9. Grand Challenge Problem Examples
  Associations: Computing Research Association (CRA), National Science Foundation (NSF)
  Partial Grand Challenge problem list:
  • Predict contaminant seepage
  • Predict airborne contaminant effects
  • Gene sequence discovery
  • Short-term weather forecasts
  • Predict long-term global warming
  • Predict earthquakes and volcanoes
  • Predict hurricanes and tornados
  • Automate natural language understanding
  • Computer vision
  • Nanotechnology
  • Computerized reasoning
  • Protein mechanisms
  • Predict asteroid collisions
  • Sub-atomic particle interactions
  • Model biomechanical processes
  • Manufacture new materials
  • Fundamental nature of matter
  • Transportation patterns
  • Computational fluid dynamics

  10. Hardware Configurations
  • Shared memory systems
    • Do not scale to high numbers of processors
  • Distributed memory systems
    • Goal: minimize connections while maintaining speed
    • A network topology identifies its configuration
  • Network topology terminology
    • Latency and bandwidth
    • Total edges and degree
    • Connectivity and bisection width
    • Diameter
    • Dilation

  11. Popular Network Topologies
  • Fully connected and star
  • Line and ring
  • Tree and fat tree
  • Mesh and torus
  • Hypercube
  • Hybrids
    • Pyramid
  • Multi-stage
  • Myrinet
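The questions on the following slides can be explored with a small helper (a sketch, Python assumed): given an adjacency list for any of these topologies, it reports maximum degree, total edges, and diameter by breadth-first search; connectivity and bisection width still need topology-specific reasoning.

# Generic topology metrics via breadth-first search (BFS).
# Works on any adjacency list; a 6-node ring is used as the example.
from collections import deque

def metrics(adj):
    degree = max(len(v) for v in adj.values())
    edges = sum(len(v) for v in adj.values()) // 2
    diameter = 0
    for src in adj:                       # BFS from every node
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    queue.append(w)
        diameter = max(diameter, max(dist.values()))
    return degree, edges, diameter

n = 6                                     # ring of 6 nodes
ring = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}
print(metrics(ring))                      # (2, 6, 3): degree 2, 6 edges, diameter 3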

  12. Fully Connected and Star
  • Degree?
  • Connectivity?
  • Total edges?
  • Bisection width?
  • Diameter?

  13. Line and Ring
  • Degree?
  • Connectivity?
  • Total edges?
  • Bisection width?
  • Diameter?

  14. Tree and Fat Tree
  In a fat tree, the number of edges connecting a node at level k to level k−1 is twice the number of edges connecting a node at level k−1 to level k−2 (links get fatter toward the root)
  • Degree?
  • Connectivity?
  • Total edges?
  • Bisection width?
  • Diameter?

  15. Mesh and Torus
  • Degree?
  • Connectivity?
  • Total edges?
  • Bisection width?
  • Diameter?

  16. Hypercube
  • A hypercube of degree zero is a single node
  • A hypercube of degree d is two hypercubes of degree d−1, with edges connecting the corresponding nodes
  • Degree?
  • Connectivity?
  • Total edges?
  • Bisection width?
  • Diameter?
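The recursive definition above can be sketched directly (Python assumed): a d-dimensional hypercube labels its nodes with d-bit numbers and connects nodes that differ in exactly one bit, which is the same as joining two (d−1)-cubes at corresponding nodes.

# Build a d-dimensional hypercube from the recursive definition:
# nodes are d-bit labels; edges connect labels differing in exactly one bit.
def hypercube(d):
    adj = {node: [] for node in range(2 ** d)}
    for node in adj:
        for bit in range(d):
            adj[node].append(node ^ (1 << bit))   # flip one bit
    return adj

d = 3
cube = hypercube(d)
nodes = len(cube)
edges = sum(len(v) for v in cube.values()) // 2
print(nodes, edges)        # 8 nodes, 12 edges; node degree d, diameter d
# In general: node degree d, diameter d, total edges d * 2**(d-1),
# bisection width 2**(d-1).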

  17. Pyramid
  A hybrid network combining a mesh and a tree
  • Degree?
  • Connectivity?
  • Total edges?
  • Bisection width?
  • Diameter?

  18. Routing Techniques
  • Packet switching
    • A message is divided into pieces ("packets") that are routed separately; the packets are reassembled at the sink
  • Deadlock free
    • Guarantees that multiple messages are never partially routed in such a way that the resources needed to complete their journeys never become available
  • Store and forward
    • The whole message is stored at each node before transmission continues to the next node
  • Cut through
    • The routing algorithm establishes the entire path of transmission before communication begins
  • Wormhole routing
    • Each flit (a few bits) is held at a node until the next node in the route becomes available; a flit and those stored in predecessor nodes move forward as soon as the next node is free
  Question: What topology is Ethernet or Token Ring? (A latency comparison is sketched below.)
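A rough latency model (a sketch with assumed numbers, Python) contrasts store-and-forward with cut-through/wormhole routing: store-and-forward pays the full message transmission time at every hop, while cut-through pays it essentially once plus a small per-hop header delay.

# Rough latency model: store-and-forward vs. cut-through / wormhole routing.
# All numbers are assumptions chosen only to show the shape of the formulas.
message_bits = 1_000_000         # size of the message
flit_bits = 64                   # header/flit size for cut-through routing
bandwidth = 2.0e9                # link bandwidth in bits/second
hops = 5                         # number of links on the path

t_message = message_bits / bandwidth
t_flit = flit_bits / bandwidth

store_and_forward = hops * t_message       # whole message stored at each node
cut_through = hops * t_flit + t_message    # header sets up path, data pipelines

print(store_and_forward, cut_through)      # ~2.5e-3 s vs ~5.0e-4 s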

  19. Myrinet
  • A proprietary interconnection technology (http://www.myri.com)
  • Supports 2.6 Gb/second bandwidth with 5 microsecond latency
  • Employs a dynamic switch-based technology
    • Point-to-point, full-duplex links to processors or to switches
    • Arbitrary parallel topologies using routing chip settings
  • Lightweight transmission protocol with cut-through routing
    • Links thousands of processors without TCP/IP limitations
    • Routing transparently handled by hardware
    • Automatic alternative hardware routing if necessary
    • Transmitters can embed TCP/IP messages to maximize flexibility
    • Gateway devices for wide area heterogeneous networks
  (Figure: rectangles are processors, circles are switches)

  20. Web-based Networks
  • Generally use the TCP/IP protocol
  • The number of hops between nodes is not constant
  • Communication incurs high latencies
    • Nodes are scattered over large geographical distances
  • High bandwidths are possible after a connection is established
    • The slowest link along the path limits speed
  • Resources are highly heterogeneous
  • Security becomes a major concern; proxies are often used
    • Encryption algorithms can require significant overhead
  • Subject to local policies at each node
  • Example: www.globus.org
