
  1. Parallel Computing 5: Parallel Application Design. Ondřej Jakl, Institute of Geonics, Academy of Sciences of the CR

  2. Outline of the lecture • Task/channel model • Foster’s design methodology • Partitioning • Communication analysis • Agglomeration • Mapping to processors • Examples

  3. Design of parallel algorithms • In general a very creative process • Only methodical frameworks are available • Usually several alternatives have to be considered • The best parallel solution may differ from what the sequential approach suggests

  4. Task/channel model (1) • Introduced in Ian Foster’s Designing and Building Parallel Programs [Foster 1995] • http://www-unix.mcs.anl.gov/dbpp • Represents a parallel computation as a set of tasks • task is a program, its local memory and a collection of I/O ports • task can send local data values to other tasks via output ports • task can receive data values from other tasks via input ports • The tasks may interact with each other by sending messages through channels • channel is a message queue that connects one task’s output port with another task’s input port • a nonblocking asynchronous send and a blocking receive are assumed • An abstraction close to the message passing model

  5. Task/channel model (2) • Figure (after [Quinn 2004]): a task consists of a program with local memory plus input and output ports; a channel connects one task’s output port to another task’s input port • The computation is a directed graph of tasks (vertices) and channels (edges)

  6. Foster’s methodology [Foster 1995] • Design stages: • partitioning into concurrent tasks • communication analysis to coordinate tasks • agglomeration into larger tasks with respect to the target platform • mapping of tasks to processors • stages 1 and 2 work on the conceptual level, stages 3 and 4 are implementation dependent • In practice the stages are often considered simultaneously

  7. Partitioning (decomposition) • Process of dividing the computation and the data into pieces – primitive tasks • Goal: Expose the opportunities for parallel processing • Maximal (fine-grained) decomposition for greater flexibility • Complementary techniques: • domain decomposition (data-centric approach) • functional decomposition (computation-centric approach) • Combinations possible • usual scenario: primary decomposition functional, secondary decomposition domain

  8. Domain (data) decomposition • Primary object of decomposition: processed data • first, the data associated with the problem is divided into pieces • focus on the largest and/or most frequently accessed data • pieces should be of comparable size • next, the computation is partitioned according to the data on which it operates • usually the same code for each task (SPMD – Single Program Multiple Data) • may be non-trivial and may bring up complex mathematical problems • The most often used technique in parallel programming • Figure [Foster 1995]: 3D grid data with one-, two- and three-dimensional decompositions
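
For a 1-D data domain, the pieces of a block decomposition are usually computed directly from the task number; a small C sketch (n and p are example values, not from the slide):

    #include <stdio.h>

    int main(void) {
        int n = 10, p = 3;                        /* data items and tasks (assumed) */
        for (int id = 0; id < p; id++) {
            int lo = (int)((long long)id * n / p);        /* first index owned by task id */
            int hi = (int)((long long)(id + 1) * n / p);  /* one past the last index      */
            printf("task %d owns indices %d..%d (%d items)\n", id, lo, hi - 1, hi - lo);
        }
        return 0;
    }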

  9. Functional (task) decomposition • Figure [Foster 1995]: climate model • Primary object of decomposition: computation • first, the computation is decomposed into disjoint tasks • different codes of the tasks (MPMD – Multiple Program Multiple Data) • methodological benefits: implies program structuring • gives rise to simpler modules with interfaces • cf. object oriented programming, etc. • next, data is partitioned according to the requirements of the tasks • data requirements may be disjoint, or overlap (which implies communication) • Sources of parallelism: • concurrent processing of independent tasks • concurrent processing of a stream of data through pipelining • a stream of data is passed through a succession of tasks, each of which performs some operation on it • MPSD – Multiple Program Single Data • The number of tasks usually does not scale with the problem size – for greater scalability combine with domain decomposition of the subtasks

  10. Good decomposition • More tasks (at least by an order of magnitude) than processors • if not: little flexibility • No redundancy in processing and data • if not: little scalability • Comparable size of tasks • if not: difficult load balancing • Number of tasks proportional to the size of the problem • if not: problems with utilizing additional processors • Alternative partitions available?

  11. Example: PI calculation • Calculation of π by the standard numerical integration formula π = ∫₀¹ 4/(1+x²) dx • Consider numerical integration based on the rectangle method • the integral is approximated by the area of evenly spaced rectangular strips • the height of each strip is the value of the integrated function F(x) = 4/(1+x²) at the midpoint of the strip • Figure: plot of F(x) = 4/(1+x²) over the interval [0.0, 1.0]

  12. PI calculation – sequential algorithm
  Sequential pseudocode:
    set n (number of strips)
    for each strip
      calculate the height y of the strip (rectangle) at its midpoint
      sum y into the result S
    endfor
    multiply S by the width of the strips
    print result
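
A minimal sequential C version of this pseudocode (n and the output format are illustrative choices, not part of the original slides):

    #include <stdio.h>

    int main(void) {
        int n = 1000000;                          /* number of strips (assumed) */
        double width = 1.0 / n, sum = 0.0;

        for (int i = 0; i < n; i++) {
            double x = (i + 0.5) * width;         /* midpoint of strip i      */
            sum += 4.0 / (1.0 + x * x);           /* height of the strip at x */
        }
        printf("pi ~= %.12f\n", sum * width);     /* multiply by strip width  */
        return 0;
    }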

  13. PI calculation – parallel algorithm
  Parallel pseudocode (for the task/channel model):
    if master then
      set n (number of strips)
      send n to the workers
    else  // worker
      receive n from the master
    endif
    for each strip assigned to this task
      calculate the height y of the strip (rectangle) at its midpoint
      sum y into the (partial) result S
    endfor
    if master then
      receive S from all workers
      sum all S and multiply by the width of the strips
      print result
    else  // worker
      send S to the master
    endif
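
One possible MPI realization of this pseudocode (a sketch, not the lecture’s code: a broadcast stands in for the explicit send/receive of n, a reduction collects the partial sums, and strips are assigned cyclically to the tasks):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char *argv[]) {
        int rank, size, n = 1000000;              /* n: number of strips (assumed) */
        double width, partial = 0.0, pi;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);      /* master distributes n */
        width = 1.0 / n;

        for (int i = rank; i < n; i += size) {             /* strips assigned to this task */
            double x = (i + 0.5) * width;
            partial += 4.0 / (1.0 + x * x);
        }

        MPI_Reduce(&partial, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0)                                     /* master prints the result */
            printf("pi ~= %.12f\n", pi * width);

        MPI_Finalize();
        return 0;
    }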

  14. Parallel PI calculation – partitioning • Domain decomposition: • primitive task – calculation of one strip height • Functional decomposition: • manager task controls the computation, worker task(s) perform the main calculation • manager/worker technique (also called control decomposition) • a more or less technical decomposition • A perfectly/embarrassingly parallel problem: the (worker) processes are (almost) independent

  15. Communication analysis • Determination of the communication pattern among the primitive tasks • Goal: Expose the information flow • The tasks generated by partitioning are as a rule not independent – they cooperate by exchanging data • Communication means overhead – minimize! • not included in the sequential algorithm • Efficient communication may be difficult to organize • especially in domain-decomposed problems

  16. Parallel communication • Categorization: • local: communication among a small number of “neighbours” vs. global: many “distant” tasks participate • structured: regular and repeated communication patterns in place and time vs. unstructured: communication networks are arbitrary graphs • static: communication partners do not change over time vs. dynamic: communication depends on the computation history and changes at runtime • synchronous: communication partners cooperate in data transfer operations vs. asynchronous: producers are not able to determine the data requests of consumers • The first item of each pair is to be preferred in parallel programs

  17. Good communication • Preferably no communication involved in the parallel algorithm • if not: overhead decreasing parallel efficiency • Tasks have comparable communication demands • if not: little scalability • Tasks communicate only with a small number of neighbours • if not: loss of parallel efficiency • Communication operations and computation in different tasks can proceed concurrently, i.e. communication and computation can overlap • if not: inefficient and nonscalable algorithm

  18. Example: Jacobi differences • Jacobi finite difference method • Repeated update (in timesteps) of values assigned to the points of a multidimensional grid • In 2-D, the grid point i, j may get in timestep t+1 a value given by the weighted-mean formula below [Foster 1995]
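
In Foster’s 2-D example the weighted mean has the following five-point form (reconstructed here, since the original slide shows it only as an image):

    Xi,j(t+1) = ( 4·Xi,j(t) + Xi-1,j(t) + Xi+1,j(t) + Xi,j-1(t) + Xi,j+1(t) ) / 8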

  19. Jacobi: parallel algorithm • Decomposition (domain): • primitive task – calculation of the weighted mean in one grid point • Parallel code (main loop): • for each timestep t • send Xi,j(t) to each neighbour • receive Xi-1,j(t), Xi+1,j(t), Xi,j-1(t), Xi,j+1(t) from the neighbours • calculate Xi,j(t+1) • endfor • Communication: • communication channels between neighbours • local, structured, static, synchronous
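
After agglomeration into one task per block of rows, one timestep of this loop could be realized in C with MPI roughly as follows (a sketch only: the row-wise distribution, array layout and function name are illustrative assumptions; up/down are the neighbour ranks, or MPI_PROC_NULL at the physical boundary):

    #include <mpi.h>

    /* One task owns rows 1..local_n of a (local_n+2) x N array u; rows 0 and
       local_n+1 are halo copies of the neighbours' boundary rows. */
    void jacobi_step(double *u, double *unew, int local_n, int N,
                     int up, int down, MPI_Comm comm)
    {
        /* send own boundary rows, receive the neighbours' rows into the halos */
        MPI_Sendrecv(&u[1 * N],             N, MPI_DOUBLE, up,   0,
                     &u[(local_n + 1) * N], N, MPI_DOUBLE, down, 0,
                     comm, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&u[local_n * N],       N, MPI_DOUBLE, down, 1,
                     &u[0],                 N, MPI_DOUBLE, up,   1,
                     comm, MPI_STATUS_IGNORE);

        /* weighted-mean update of all interior points owned by this task */
        for (int i = 1; i <= local_n; i++)
            for (int j = 1; j < N - 1; j++)
                unew[i * N + j] = (4.0 * u[i * N + j]
                                   + u[(i - 1) * N + j] + u[(i + 1) * N + j]
                                   + u[i * N + j - 1]  + u[i * N + j + 1]) / 8.0;
    }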

  20. Example: Gauss-Seidel scheme • More efficient in sequential computing • Not easy to parallelize – each update uses values already updated in the same sweep, which introduces data dependences [Foster 1995]

  21. Agglomeration • Process of grouping primitive tasks into larger tasks • Goal: revision of the (abstract, conceptual) partitioning and communication to improve performance • choose granularity appropriate to the target parallel computer • A large number of fine-grained tasks tends to be inefficient because of great • communication cost • task creation cost • the spawn operation is rather expensive (and to simplify programming demands) • Agglomeration increases granularity • potential conflict with retaining flexibility and scalability [next slides] • Closely related to mapping to processors

  22. Agglomeration & granularity • Granularity: a measure characterizing the size and quantity of tasks • Increasing granularity by combining several tasks into larger ones • reduces communication cost • less communication (a) • fewer, but larger messages (b) • reduces task creation cost • fewer processes • Agglomerate tasks that • frequently communicate with each other • increases locality • cannot execute concurrently • Consider also [next slides] • surface-to-volume effects • replication of computation/data • Figures (a), (b) after [Quinn 2004]

  23. Surface-to-volume effects (1) • The communication/computation ratio decreases with increasing granularity: • computation cost is proportional to the “volume” of the subdomain • communication cost is proportional to its “surface” • Agglomeration in all dimensions is most efficient • reduces the surface for a given volume • in practice it is more difficult to code • Difficult with unstructured communication • Ex.: Jacobi finite differences [next slide]
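
A rough check of this effect for an N x N Jacobi grid distributed over p tasks (the numbers only illustrate the trend): with 1-D strips each task updates N²/p points and exchanges 2N boundary values per step, a communication/computation ratio of about 2p/N; with 2-D blocks it still updates N²/p points but exchanges only about 4N/√p values, a ratio of about 4√p/N, which grows much more slowly with p.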

  24. Surface-to-volume effects (2) • Ex.: Jacobi finite differences – agglomeration [Foster 1995] • Figure: no agglomeration vs. 4 x 4 agglomeration

  25. Agglomeration & flexibility • Ability to make use of diverse computing environments • good parallel programs are resilient to changes in processor count • scalability – ability to employ an increasing number of tasks • Too coarse granularity reduces flexibility • Usual practical design: agglomerate one task per processor • can be controlled by a compile-time or runtime parameter • with some MPS (PVM, MPI-2) on-the-fly (dynamic spawn) • But consider also creating more tasks than processors: • when tasks often wait for remote data: several tasks mapped to one processor permit overlapping computation and communication • greater scope for mapping strategies that balance the computational load over the available processors • a rule of thumb: an order of magnitude more tasks • Optimal number of tasks: determined by a combination of analytic modelling and empirical studies

  26. Replicating computation • To reduce communication requirements, the same computation is repeated in several tasks • compute once & distribute vs. compute repeatedly & don’t communicate – a trade-off • Redundant computation pays off when its computational cost is less than the communication cost • moreover it removes dependences • Ex.: summation of n numbers (located on separate processors) with distribution of the result • Without replication: 2(n – 1) steps • (n – 1) additions • the necessary minimum • With replication: (n – 1) steps • n (n – 1) additions • (n – 1)² of them redundant
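
A quick way to verify the counts above (not on the original slide): summing the n values along a line of tasks takes n – 1 additions in n – 1 steps, and distributing the result back takes another n – 1 steps, giving 2(n – 1) steps in total; if instead every task accumulates the complete sum itself, e.g. by circulating the values around a ring, all tasks finish after n – 1 steps, but n(n – 1) additions are performed altogether, of which n(n – 1) – (n – 1) = (n – 1)² are redundant.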

  27. Good agglomeration • Increased locality of communication • Beneficial replication of computation • Replication of data does not compromise scalability • Similar computation and communication costs of the agglomerated tasks • The number of tasks can scale with the problem size • Fewer larger-grained tasks are usually more efficient than many fine-grained ones

  28. Mapping • Process of assigning (agglomerated) tasks to processors for execution • Goal: Maximize processor utilization, minimize interprocessor communication • load balancing • Concerns multicomputers only • multiprocessors: automatic task scheduling • Guidelines to minimize execution time (note that they conflict): • place concurrent tasks on different processors (increase concurrency) • place tasks with frequent communication on the same processor (enhance locality) • Optimal mapping is in general an NP-complete problem • strategies and heuristics for special classes of problems are available

  29. Basic mapping strategies [Quinn 2004]

  30. Load balancing • Mapping strategy with the aim to keep all processors busy during the execution of the parallel program • minimization of the idle time • In a heterogeneous computing environment every parallel application may need (dynamic) load balancing • Static load balancing • performed before the program enters the solution phase • Dynamic load balancing • needed when tasks are created/destroyed at run-time and/or the communication/computation requirements of tasks vary widely • invoked occasionally during the execution of the parallel program • analyses the current computation and rebalances it • may imply significant overhead! • Figure [LLNL 2010]: bad load balancing – processors idle at a barrier

  31. Load-balancing algorithms • Most appropriate for domain decomposed problems • Representative examples [next slides] • recursive bisection • probabilistic methods • local algorithms

  32. Recursive bisection • Recursive cuts into subdomains of nearly equal computational cost while attempting to minimize communication • allows the partitioning algorithm itself to be executed in parallel • Coordinate bisection: • for irregular grids with local communication • cuts into halves based on the physical coordinates of the grid points • simple, but does not take communication into account • unbalanced bisection: does not necessarily divide into halves • to reduce communication • a lot of variants • e.g. recursive graph bisection • Figure [Foster 1995]: irregular grid for a superconductivity simulation
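
A minimal serial sketch of recursive coordinate bisection for a 2-D point set in C (illustrative only: equal point counts stand in for equal computational cost, and communication is ignored, as the slide notes):

    #include <stdlib.h>

    typedef struct { double x, y; int part; } Point;

    static int cmp_x(const void *a, const void *b) {
        double d = ((const Point *)a)->x - ((const Point *)b)->x;
        return (d > 0) - (d < 0);
    }
    static int cmp_y(const void *a, const void *b) {
        double d = ((const Point *)a)->y - ((const Point *)b)->y;
        return (d > 0) - (d < 0);
    }

    /* Split pts[0..n) into 'parts' pieces of nearly equal size by recursive cuts
       along alternating coordinate directions; pieces are labelled starting at 'first'. */
    void coordinate_bisection(Point *pts, int n, int parts, int first, int dim) {
        if (parts <= 1) {
            for (int i = 0; i < n; i++) pts[i].part = first;
            return;
        }
        qsort(pts, n, sizeof(Point), dim == 0 ? cmp_x : cmp_y);
        int left = parts / 2;                            /* pieces below the cut        */
        int cut  = (int)((long long)n * left / parts);   /* proportional share of points */
        coordinate_bisection(pts, cut, left, first, 1 - dim);
        coordinate_bisection(pts + cut, n - cut, parts - left, first + left, 1 - dim);
    }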

  33. Probabilistic methods • Allocate tasks randomly to processors • about the same computational load can be expected for a large number of tasks • typically at least ten times as many tasks as processors are required • Communication is usually not considered • appropriate for tasks with little communication and/or little locality in communication • Simple, low cost, scalable • Variant: cyclic mapping – for spatial locality in load levels • each of p processors is allocated every p-th task • Variant: block-cyclic distributions • blocks of tasks are allocated to processors • Figure: the strips of F(x) = 4/(1+x²) from the PI example on [0.0, 1.0], assigned cyclically to processors #1, #2, #3
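
The mapping variants mentioned above can be written as small owner-computes formulas (a sketch; the block formula assumes the contiguous blocks [ id·n/p, (id+1)·n/p ) with p processors and n tasks, and b is an assumed block size):

    /* Which processor owns task i? (illustrative index formulas) */
    int owner_cyclic(int i, int p)              { return i % p; }
    int owner_block(int i, int n, int p)        { return (int)(((long long)p * (i + 1) - 1) / n); }
    int owner_block_cyclic(int i, int b, int p) { return (i / b) % p; }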

  34. Local algorithms • Compensate for changes in the computational load using only local information obtained from a small number of “neighbouring” tasks • do not require expensive global knowledge of the computational state • If an imbalance exists (threshold), some computational load is transferred to the less loaded neighbour • Simple, but less efficient than global algorithms • slow to adjust to major changes in the load characteristics • Advantageous for dynamic load balancing • Figure [Foster 1995]: local algorithm for a grid problem

  35. Task-scheduling algorithms • Suitable for a pool of independent tasks • they represent stand-alone problems, contain solution code + data • can be conceived as a special kind of data • Often obtained from functional decomposition • many tasks with weak locality • Centralized or distributed variants • Dynamic load balancing by default • Examples: • (hierarchical) manager/worker • decentralized schemes

  36. Manager/worker • Simple task-scheduling scheme • sometimes called “master/slave” • A central manager task is responsible for problem allocation • maintains a pool (queue) of problems • e.g. a search in a particular tree branch • Workers run on separate processors and repeatedly request and solve assigned problems • may also send new problems to the manager • Efficiency: • consider the cost of problem transfer • prefetching, caching applicable • the manager must not become a bottleneck • Hierarchical manager/worker variant • introduces a layer of submanagers, each responsible for a subset of workers • Figure after [Wilkinson 1999]
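
A compact MPI sketch of the basic (non-hierarchical) scheme (illustrative: the tags, the integer work items and the stop signal -1 are assumptions made for the example):

    #include <stdio.h>
    #include <mpi.h>

    #define TAG_REQUEST 1
    #define TAG_WORK    2

    static double solve(int item) { return (double)item * item; }   /* stand-in "problem" */

    int main(int argc, char *argv[]) {
        int rank, size, n_items = 100;            /* size of the problem pool (assumed) */
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        if (rank == 0) {                          /* manager: hand out problems on request */
            int next = 0, stopped = 0;
            while (stopped < size - 1) {
                int dummy, item;
                MPI_Status st;
                MPI_Recv(&dummy, 1, MPI_INT, MPI_ANY_SOURCE, TAG_REQUEST,
                         MPI_COMM_WORLD, &st);
                item = (next < n_items) ? next++ : -1;    /* next problem, or stop signal */
                if (item < 0) stopped++;
                MPI_Send(&item, 1, MPI_INT, st.MPI_SOURCE, TAG_WORK, MPI_COMM_WORLD);
            }
        } else {                                  /* worker: request, solve, repeat */
            for (;;) {
                int dummy = 0, item;
                MPI_Send(&dummy, 1, MPI_INT, 0, TAG_REQUEST, MPI_COMM_WORLD);
                MPI_Recv(&item, 1, MPI_INT, 0, TAG_WORK, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                if (item < 0) break;
                printf("worker %d: item %d -> %g\n", rank, item, solve(item));
            }
        }
        MPI_Finalize();
        return 0;
    }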

  37. Decentralized schemes • Task scheduling without global management • The task pool is a data structure distributed among many processors • The pool is accessed asynchronously by idle workers • various access policies: neighbours, at random, etc. • Termination detection may be difficult

  38. Good mapping • In general: Try to balance conflicting requirements for equitable load distribution and low communication cost • When possible, use static mapping allocating each process to a single processor • Dynamic load balancing / task scheduling can be appropriate when the number or size of tasks is variable or not known until runtime • With centralized load-balancing schemes verify that the manager will not become a bottleneck • Consider implementation cost

  39. Conclusions • Foster’s design methodology is conveniently applicable • it is used in [Quinn 2004] for the design of many parallel programs in MPI (OpenMP) • In practice, all phases are often considered in parallel • In bad practice, the conceptual phases are skipped • machine-dependent design from the very beginning • Some kind of a “life-belt” (“fixed point”) when the development runs into trouble

  40. Further study • [Foster 1995] Designing and Building Parallel Programs • [Quinn 2004] Parallel Programming in C with MPI and OpenMP • Most textbooks contain a chapter like “Principles of parallel algorithm design” • often concentrating on the mapping step

  41. Example: tree search
