Overview of Parallel Algorithms

Contents • 1. Parallel computing models • 2. Basic design techniques for parallel algorithms

Von Neumann Model

Instruction Processing • Fetch instruction from memory • Decode instruction • Evaluate address • Fetch operands from memory • Execute operation • Store result

Parallel Computing Model • Computing model • Bridge between SW and HW • general purpose HW, scalable HW • transportable SW • Abstract architecture for algorithm development • Ex) PRAM, BSP, LogP

Parallel Programming Model • What the programmer uses when coding applications • Specifies communication and synchronization • Communication primitives exposed at user level realize the programming model • Ex) Uniprocessor, Multiprogramming, Data parallel, message-passing, shared-address-space

Aspects of Parallel Processing • [Figure: layered view of a parallel system. (4) Algorithm developer: parallel computing model. (3) Application developer: parallel programming model. (2) System programmer: middleware. (1) Architecture designer: multiprocessors, each a group of processors (P) with a memory, joined by an interconnection network.]

Parallel Computing Models • PRAM • Parallel Random Access Memory • A set of p processors • Global shared memory • Each processor can access any memory location in one time step • Globally synchronized • Executing same program in lockstep

Illustration of PRAM • [Figure: p processors P1, P2, P3, ..., Pp, driven by a common clock (CLK) and connected to a single shared memory] • Each processor has a unique index • Single program executed in MIMD mode

Features • Model architecture • Synchronized RAM with common clock, but not SIMD operation: MIMD • No local memory in each RAM • One global shared memory • single address space architecture • Synchronization, communication, parallelism overhead are zero.

Features (Cont’d) • Operations per step • Read/write a word from/to the memory • Local operation • An instruction could perform the following three operations in one cycle • Fetch one or two words from the memory as operands • Perform an arithmetic/logic operation • Store the result back in memory
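The per-step behavior above can be made concrete with a small simulation. The following is a minimal sketch, not part of the original slides: the function name `pram_sum` and the snapshot-based lockstep are illustrative assumptions. In every step each active processor fetches two words, adds them, and stores one result, so n values are summed in about log2(n) synchronous steps without any two processors touching the same cell.

```python
def pram_sum(values):
    """Simulate a synchronous PRAM tree reduction.

    Returns (total, number_of_lockstep_steps). Hypothetical helper for
    illustration only.
    """
    mem = list(values)   # the global shared memory
    n = len(mem)
    steps = 0
    stride = 1
    while stride < n:
        # All processors read the memory as it was before this step,
        # which models lockstep execution under a common clock.
        snapshot = list(mem)
        for i in range(0, n - stride, 2 * stride):
            # Processor i: fetch two operands, add, store one result.
            mem[i] = snapshot[i] + snapshot[i + stride]
        stride *= 2
        steps += 1
    return (mem[0] if n else 0), steps


total, steps = pram_sum([1, 2, 3, 4, 5, 6, 7, 8])
print(total, steps)
```

Because the model charges nothing for synchronization or memory access, the step count is the whole cost here; the variants on the following slides add the costs this sketch ignores.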

Problems with PRAM • Inaccurate description of real-world parallel systems • Unaccounted costs • Latency, bandwidth, non-local memory access, memory access contention issues, synchronization costs, etc • Algorithms perceived to work well in PRAM may have poor performance in practice

PRAM Variants • Variants arise to model some of these costs • Each introduces some practical aspect of machine • Gives algorithm designer better idea for optimization • Variants can be grouped into 4 categories • Memory access • Synchronization • Latency • Bandwidth

Memory Access • Impractical to have concurrent read and write to same memory location • Contention issues • CRCW PRAM • CPRAM-CRCW (Common PRAM-CRCW): concurrent writes are allowed only if all processors write the same data • PPRAM-CRCW (Priority PRAM-CRCW): only the processor with the highest priority is allowed to write • APRAM-CRCW (Arbitrary PRAM-CRCW): any processor is free to write • EREW or CREW PRAM • QRQW (queue-read, queue-write) • Expensive • Multiple ports required for concurrent access may be prohibitively expensive.
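The three CRCW write rules differ only in how a conflict on one cell is resolved. The sketch below is illustrative and not from the slides: the function name `resolve_crcw_write`, the policy strings, and the convention that a smaller processor id means higher priority are all assumptions.

```python
import random


def resolve_crcw_write(requests, policy, rng=None):
    """Resolve concurrent writes to ONE memory cell in a single step.

    requests: list of (processor_id, value) pairs aimed at the same cell.
    policy:   "common", "priority", or "arbitrary".
    Returns the value that ends up stored in the cell.
    """
    if not requests:
        raise ValueError("no write requests")
    if policy == "common":
        # Common: legal only when every processor writes the same value.
        values = {value for _, value in requests}
        if len(values) != 1:
            raise ValueError("Common CRCW requires identical values")
        return values.pop()
    if policy == "priority":
        # Assumption: the smallest processor id has the highest priority.
        return min(requests, key=lambda r: r[0])[1]
    if policy == "arbitrary":
        # Arbitrary: any one of the competing writes may succeed.
        rng = rng or random.Random()
        return rng.choice(requests)[1]
    raise ValueError(f"unknown policy: {policy}")


print(resolve_crcw_write([(3, 9), (1, 7), (2, 8)], "priority"))
```

An algorithm that is correct under the arbitrary rule must work whichever write wins, which is why the sketch exposes the random source instead of fixing a winner.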

Synchronization • Standard PRAM is globally synchronized • Standard PRAM model does not charge a cost for synchronization • Unrealistic! Synchronization is necessary and expensive in practical parallel systems • Variants model cost of synchronization • APRAM (asynchronous PRAM): each processor has its own local memory, local clock, and local program; there is no global clock, and the processors execute asynchronously; processors communicate through the shared memory; where processors depend on one another, synchronization barriers must be inserted explicitly in the parallel program. • XPRAM (bulk synchronous PRAM, also known as BSP model) • Provides an incentive for algorithm designers to synchronize only when necessary

Latency • Standard PRAM assumes unit-cost for non-local memory access • In practice, non-local memory access has severe effect on performance • PRAM variant • LPRAM (Local-memory PRAM) • A set of nodes each with a processor and a local memory; the nodes can communicate through a globally shared memory. Two types of steps are defined and separately accounted for: computation steps, where each processor performs one operation on local data, and communication steps, where each processor can write, and then read a word from global memory • Charge a cost of L units to access global memory

Synchronization (cont.) • BPRAM (Block-Parallel RAM) • BSP assumes n nodes, each containing a processor and a memory module, interconnected by a communication medium. • A BSP computation is a sequence of phases, called supersteps: in one superstep, each processor can execute operations on data residing in the local memory, send messages and, at the end, execute a global synchronization instruction. A message sent during a superstep becomes • Charge L units to access 1st message and b units for each subsequent contiguous block

Bandwidth • Standard PRAM assumes unlimited bandwidth • In practice, bandwidth is limited • PRAM Variant • DRAM (Distributed Random Access Machine) • 2 level memory hierarchy • Access to global memory is charged a cost based on possible data congestion • PRAM(m) • Global memory segmented into modules • In any given step, only m memory accesses can be serviced

Other Distributed Models • Distributed Memory Model • No global memory • Each processor associated with some local memory • Postal Model • Processor sends request for non-local memory • Instead of stalling, it continues working while data is en-route

Network Models • Focus on impact of topology of communications network • Early focus of parallel computation • Distributed Memory Model? • Cost of remote memory access is a function of both topology and the access pattern • Provides incentives for efficient • Data mappings • Communications routing

Bridging Models • Candidate Type Architecture Model • Finite number of von Neumann computers executing asynchronously • Global controller for synchronization • 2 level memory hierarchy • Provide incentives for reference locality • Assumes • Unlimited Bandwidth • Zero latency • Does not provide incentives for • Reduction of messages • Synchronization avoidance

LogP • Model design strongly influenced by trends in parallel computer design • Product of efforts by diverse groups of researchers from different disciplines • Model of a distributed memory multiprocessor • Processors communicate via point-to-point messages • Attempts to capture the important bottlenecks of parallel machines

LogP • Specifies performance characteristics of communication network. • Provide incentive for clever data placement • Illustrates importance of balanced communication

Parallel Machine Trends • Machine organization for most parallel machines is similar • A collection of complete computers • Microprocessor • Cache memory • Sizable DRAM memory • Connected by robust communications network • Motivated by cost and development of commodity components • No single programming methodology is dominant

Other considerations • Processor Count • Number of nodes relative to • price of most expensive supercomputer / cost of node • Communication Interval lags far behind processor memory bandwidth • Presence of adaptive routing and fault-recovery networking systems • Affects algorithm design • Parallel algorithms developed with large number of data elements per processor • Attempts to exploit network topology or processor count are not very robust

Model Parameters • Latency (L) • Delay incurred in communicating a message from source to destination • Hop count and Hop delay • Communication Overhead (o) • Length of time a processor is engaged in sending or receiving a message • Node overhead for processing a send or receive • Communication bandwidth, expressed as the gap (g) • Minimum time interval between consecutive messages at a processor • Processor count (P) • Number of processors in the machine
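The parameters combine into simple cost estimates. The sketch below is an illustration under stated assumptions, not a formula given on the slides: it assumes short fixed-size messages, an otherwise idle network, and that back-to-back sends from one processor are spaced by the larger of g and o. The function names are hypothetical.

```python
def logp_single_message_time(L, o):
    """One small message: send overhead + network latency + receive overhead."""
    return o + L + o


def logp_stream_time(k, L, o, g):
    """Time until the last of k back-to-back messages from one sender arrives.

    Assumption: consecutive sends are spaced by max(g, o), since the sender
    is busy for o per message and may inject one message every g.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return (k - 1) * max(g, o) + logp_single_message_time(L, o)


# Example with made-up parameter values (in cycles).
print(logp_single_message_time(L=6, o=2))
print(logp_stream_time(k=3, L=6, o=2, g=4))
```

When g exceeds o, sending more messages costs g each rather than o each, which is the model's incentive to keep communication balanced instead of bursting from one node.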

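LogP Model • [Figure: timeline (t) of a message from sender to receiver: overhead o at the sender, latency L in the network, overhead o at the receiver; g is the gap between consecutive messages]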

Bridging Models 2 • Bulk Synchronous Parallel(BSP) • P processors with local memory • Router • Facilities for periodic global synchronization • Every l steps • Models • Bandwidth limitations • Latency • Synchronization costs • Does not model • Communication overhead • Processor topology

BSP Computer • Distributed memory architecture • 3 components • Node • Processor • Local memory • Router (Communication Network) • Point-to-point, message passing (or shared variable) • Barrier synchronizing facility • All or subset

Illustration of BSP • [Figure: nodes, each a processor (P) with a local memory (M), characterized by w; a communication network, characterized by g; and a barrier, characterized by l]

Three Parameters • w parameter • Maximum computation time within each superstep • Computation operation takes at most w cycles. • g parameter • # of cycles for communication of unit message when all processors are involved in communication - network bandwidth • h relation coefficient • Communication operation takes gh cycles. • l parameter • Barrier synchronization takes l cycles.
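Taken together, the parameters give a cost for each superstep. The sketch below is a minimal illustration, assuming that computation, communication, and the barrier do not overlap, so their cycle counts simply add up; the function names are hypothetical.

```python
def superstep_cost(w, h, g, l):
    """Cycles for one superstep: computation w, communication g*h, barrier l."""
    return w + g * h + l


def bsp_program_cost(supersteps, g, l):
    """Total cycles for a program given as a list of (w, h) pairs,
    one pair per superstep, on a machine with parameters g and l."""
    return sum(superstep_cost(w, h, g, l) for w, h in supersteps)


# Example with made-up values: two supersteps on a machine with g=4, l=50.
print(bsp_program_cost([(100, 10), (20, 0)], g=4, l=50))
```

Since l is paid once per superstep regardless of how little work is done, the formula rewards algorithms that use few supersteps with substantial work in each.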

BSP Program • A BSP computation consists of S super steps. • A superstep is a sequence of steps followed by a barrier synchronization. • Superstep • Any remote memory accesses take effect at barrier - loosely synchronous
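The rule that remote accesses take effect only at the barrier can be sketched as a tiny sequential simulation. This is an illustrative toy, not a real BSP library: `bsp_run`, `step_fn`, and the inbox/outbox lists are hypothetical names.

```python
def bsp_run(states, step_fn, num_supersteps):
    """Run num_supersteps supersteps over per-processor states.

    step_fn(pid, state, inbox) -> (new_state, [(dest_pid, message), ...])
    """
    states = list(states)
    p = len(states)
    inboxes = [[] for _ in range(p)]
    for _ in range(num_supersteps):
        outboxes = [[] for _ in range(p)]
        for pid in range(p):
            # Local computation sees only what was delivered at the
            # previous barrier, never messages sent in this superstep.
            states[pid], sends = step_fn(pid, states[pid], inboxes[pid])
            for dest, msg in sends:
                outboxes[dest].append(msg)
        # Barrier: everything sent in this superstep becomes visible now.
        inboxes = outboxes
    return states


def ring_shift(pid, state, inbox, p=3):
    """Adopt the received value, if any, then pass it to the next processor."""
    new_state = inbox[0] if inbox else state
    return new_state, [((pid + 1) % p, new_state)]


print(bsp_run([10, 20, 30], ring_shift, 2))
```

In the ring example the values sent during the first superstep change no state until the second superstep begins, which is the loosely synchronous behavior described above.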

BSP Program • [Figure: processors P1 to P4 over time; Superstep 1 consists of computation, communication, and a barrier, followed by Superstep 2]

Model Survey Summary • No single model is acceptable! • Across models, the same subset of characteristics is the focus of most of them • Computational Parallelism • Communication Latency • Communication Overhead • Communication Bandwidth • Execution Synchronization • Memory Hierarchy • Network Topology

Computational Parallelism • Number of physical processors • Static versus dynamic parallelism • Should number of processors be fixed? • Fault-recovery networks allow for node failure • Many parallel systems allow incremental upgrades by increasing node count

Latency • Fixed message length or variable message length? • Network topology? • Communication Overhead? • Contention based latency? • Memory hierarchy?

Bandwidth • Limited resource • With low latency • Tendency for bandwidth abuse by flooding network

Synchronization • Ability to solve a wide class of problems requires asynchronous parallelism • Synchronization achieved via message passing • Synchronization as a communication cost

Unified Model? • Compelling motivation • Difficult • Parallel machines are complicated • Still evolving • Different users from diverse disciplines • Requires a common set of characteristics derived from needs of different users • Again need for balance between descriptivity and prescriptivity

Algorithms and Concurrency • Introduction to Parallel Algorithms • Tasks and Decomposition • Processes and Mapping • Processes Versus Processors • Decomposition Techniques • Recursive Decomposition • Data Decomposition • Exploratory Decomposition • Hybrid Decomposition • Characteristics of Tasks and Interactions • Task Generation, Granularity, and Context • Characteristics of Task Interactions.

Concurrency and Mapping • Mapping Techniques for Load Balancing • Static and Dynamic Mapping • Methods for Minimizing Interaction Overheads • Maximizing Data Locality • Minimizing Contention and Hot-Spots • Overlapping Communication and Computations • Replication vs. Communication • Group Communications vs. Point-to-Point Communication • Parallel Algorithm Design Models • Data-Parallel, Work-Pool, Task Graph, Master-Slave, Pipeline, and Hybrid Models

Preliminaries: Decomposition, Tasks, and Dependency Graphs • The first step in developing a parallel algorithm is to decompose the problem into tasks that can be executed concurrently • A given problem may be decomposed into tasks in many different ways • Tasks may be of same, different, or even indeterminate sizes • A decomposition can be illustrated in the form of a directed graph with nodes corresponding to tasks and edges indicating that the result of one task is required for processing the next. Such a graph is called a task dependency graph

Example: Multiplying a Dense Matrix with a Vector • Computation of each element of output vector y is independent of other elements. Based on this, a dense matrix-vector product can be decomposed into n tasks. • [Figure: the portion of the matrix and vector accessed by Task 1 is highlighted.] • Observations: • While tasks share data (namely, the vector b), they do not have any control dependencies, i.e., no task needs to wait for the (partial) completion of any other. • All tasks are of the same size in terms of number of operations. • Is this the maximum number of tasks we could decompose this problem into?
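The decomposition into n independent tasks can be sketched as follows. This is a minimal illustration with hypothetical names (`row_task`, `matvec_by_tasks`). A thread pool is used only to show that the tasks may run in any order; it is not a claim about speedup, which standard CPython threads will not deliver for pure-Python arithmetic.

```python
from concurrent.futures import ThreadPoolExecutor


def row_task(row, b):
    """Task i: dot product of row i of A with the shared vector b."""
    return sum(a_ij * b_j for a_ij, b_j in zip(row, b))


def matvec_by_tasks(A, b, workers=4):
    """Compute y = A b with one independent task per output element."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Tasks only read b and write distinct outputs, so no task has
        # to wait for another.
        return list(pool.map(lambda row: row_task(row, b), A))


print(matvec_by_tasks([[1, 2], [3, 4]], [1, 1]))
```

Each task reads all of b but writes only its own element of y, which matches the observation that the tasks share data without having control dependencies.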

Example: Database Query Processing • Consider the execution of the query: MODEL = "CIVIC" AND YEAR = 2001 AND (COLOR = "GREEN" OR COLOR = "WHITE") on the following database:

Example: Database Query Processing The execution of the query can be divided into subtasks in various ways. Each task can be thought of as generating an intermediate table of entries that satisfy a particular clause. Decomposing the given query into a number of tasks. Edges in this graph denote that the output of one task is needed to accomplish the next.

Example: Database Query Processing Note that the same problem can be decomposed into subtasks in other ways as well. An alternate decomposition of the given problem into subtasks, along with their data dependencies. Different task decompositions may lead to significant differences with respect to their eventual parallel performance.

Granularity of Task Decompositions • The number of tasks into which a problem is decomposed determines its granularity. • Decomposition into a large number of tasks results in fine-grained decomposition and that into a small number of tasks results in a coarse grained decomposition. A coarse grained counterpart to the dense matrix-vector product example. Each task in this example corresponds to the computation of three elements of the result vector.
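Granularity can be expressed as the number of result elements assigned to each task. The sketch below is illustrative only: `block_tasks`, `matvec_coarse`, and the `rows_per_task` parameter are assumed names, and the tasks are run sequentially for simplicity. Each task covers a contiguous block of rows; `rows_per_task=1` gives the fine-grained decomposition and `rows_per_task=3` the coarse-grained one described above.

```python
def block_tasks(n, rows_per_task):
    """Split row indices 0..n-1 into contiguous blocks, one block per task."""
    return [range(start, min(start + rows_per_task, n))
            for start in range(0, n, rows_per_task)]


def matvec_coarse(A, b, rows_per_task=3):
    """Compute y = A b, where each task produces a block of y."""
    y = [0] * len(A)
    for block in block_tasks(len(A), rows_per_task):  # one iteration = one task
        for i in block:
            y[i] = sum(a * x for a, x in zip(A[i], b))
    return y


# Nine rows at three rows per task yield three tasks.
print(len(block_tasks(9, 3)))
```

Fewer, larger tasks reduce scheduling and interaction overhead but also lower the number of tasks that can run at once.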

Degree of Concurrency • The number of tasks that can be executed in parallel is the degree of concurrency of a decomposition. • Since the number of tasks that can be executed in parallel may change over program execution, the maximum degree of concurrency is the maximum number of such tasks at any point during execution. What is the maximum degree of concurrency of the database query examples? • The average degree of concurrency is the average number of tasks that can be processed in parallel over the execution of the program. Assuming that each task in the database example takes identical processing time, what is the average degree of concurrency in each decomposition? • The degree of concurrency increases as the decomposition becomes finer in granularity and vice versa.

Critical Path Length • A directed path in the task dependency graph represents a sequence of tasks that must be processed one after the other. • The longest such path determines the shortest time in which the program can be executed in parallel. • The length of the longest path in a task dependency graph is called the critical path length.
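The critical path length can be computed by visiting tasks in topological order. The sketch below is illustrative: the two graphs are hypothetical seven-task examples, not the database decompositions from the slides, and the average degree of concurrency is taken as total work divided by critical path length. It uses `graphlib`, which requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def critical_path_length(deps, cost):
    """deps maps each task to the tasks it depends on; cost maps task -> time."""
    finish = {}
    for task in TopologicalSorter(deps).static_order():
        # A task can start only after all of its prerequisites finish.
        start = max((finish[d] for d in deps.get(task, ())), default=0)
        finish[task] = start + cost[task]
    return max(finish.values())


def average_degree_of_concurrency(deps, cost):
    """Total work divided by the critical path length."""
    return sum(cost.values()) / critical_path_length(deps, cost)


# Hypothetical graphs: four independent leaf tasks combined in two ways.
cost = {f"t{i}": 10 for i in range(1, 8)}
balanced = {"t5": {"t1", "t2"}, "t6": {"t3", "t4"}, "t7": {"t5", "t6"}}
chain = {"t5": {"t1", "t2"}, "t6": {"t5", "t3"}, "t7": {"t6", "t4"}}

print(critical_path_length(balanced, cost))  # 30
print(critical_path_length(chain, cost))     # 40
```

Both graphs perform the same 70 units of work, yet the balanced one has the shorter critical path and therefore the higher average degree of concurrency.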

Critical Path Length Consider the task dependency graphs of the two database query decompositions: What are the critical path lengths for the two task dependency graphs? If each task takes 10 time units, what is the shortest parallel execution time for each decomposition? How many processors are needed in each case to achieve this minimum parallel execution time? What is the maximum degree of concurrency?
