Models for Parallel Computers
  • Goals of a machine model
    • accurately reflect constraints of existing machines
    • broad applicability with respect to existing and future machines
    • allows accurate prediction of performance
  • Models being discussed
    • PRAM
    • BSP
    • LogP
Parallel Random Access Machine (PRAM)
  • PRAM
    • S. Fortune, J. Wyllie, Parallelism in Random Access Machines, 10th ACM Symposium on Theory of Computing, pp. 114-118, 1978
    • Abstraction of a synchronized MIMD computer with shared memory
  • Definition of an n-PRAM
    • It is a parallel register machine with n processors P0, ..., Pn-1 and a shared memory.
    • In each step, each processor can work as a separate register machine or can access a cell of the shared memory.
    • The processors work synchronously; one step takes unit time.
PRAM (2): Example
  • Task
    • Given are n variables X0, ..., Xn-1 ∈ {0,1}.
    • Compute the function ORn with

ORn(X0, ..., Xn-1) = 1 iff ∃i ∈ {0, ..., n-1} with Xi = 1

  • Solution 1:
    • Use an n-PRAM.
    • Compute on a tree network in which each inner node performs the operation OR2 on the data sent by its children. The root obtains the result.
    • This solution has runtime O(log n).
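A sequential sketch (mine, not from the slides) of this tree computation, with one loop iteration per synchronous PRAM step:

```python
def tree_or(x):
    """OR_n via a binary tree: in each round, the inner-node processors OR
    the values of their two children; after ceil(log2 n) synchronous rounds
    the root holds the result."""
    values = list(x)
    rounds = 0
    while len(values) > 1:
        # One synchronous step: combine values pairwise.
        values = [values[i] | values[i + 1] if i + 1 < len(values)
                  else values[i]
                  for i in range(0, len(values), 2)]
        rounds += 1
    return values[0], rounds

# With n = 8 inputs, the result is available after log2(8) = 3 rounds:
# tree_or([0, 0, 1, 0, 0, 0, 0, 0]) == (1, 3)
```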
PRAM (3): Example
  • Solution 2:
    • Use an n-PRAM, let processor Pi access variable Xi. Initialize Result := 0.

for all processors Pi pardo

    if Xi = 1 then Result := 1

    • This solution has constant runtime. All processors that write store the same value 1, so a common-CRCW PRAM suffices.
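A minimal sketch of the constant-time solution; the sequential loop is only a stand-in for the single parallel step:

```python
def crcw_or(x):
    result = 0                # shared cell, initialized to 0 beforehand
    # "for all processors Pi pardo": in one parallel step, every Pi
    # with Xi = 1 writes concurrently to the shared cell. All writers
    # store the same value 1, so the common CRCW model suffices.
    for xi in x:              # sequential stand-in for one parallel step
        if xi == 1:
            result = 1
    return result
```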
PRAM (4)
  • Different PRAM models:
    • Exclusive Read Exclusive Write (EREW): It is not possible to read or write a memory cell simultaneously with several processors.
    • Concurrent Read Exclusive Write (CREW): It is only possible to read a cell simultaneously.
    • Concurrent Read Concurrent Write (CRCW): Processors can read and write a cell simultaneously. Concurrent writes require a rule defining which of the writing processors wins:
      • Arbitrary: One processor wins, but it is not known in advance which one wins.
      • Common: All processors must write identical data.
      • Priority: The processor with the largest or lowest index wins.
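A small sketch (the function name and interface are mine) of how the three policies resolve a set of simultaneous writes to one cell:

```python
def resolve_concurrent_write(writes, policy):
    """Resolve simultaneous writes to one memory cell.

    writes: list of (processor_index, value) pairs
    policy: "arbitrary", "common", or "priority" (here: lowest index wins)
    """
    if policy == "arbitrary":
        return writes[0][1]  # some single writer wins; which one is not fixed
    if policy == "common":
        values = {v for _, v in writes}
        if len(values) != 1:
            raise ValueError("common CRCW requires identical values")
        return values.pop()
    if policy == "priority":
        return min(writes)[1]  # the processor with the lowest index wins
    raise ValueError("unknown policy: " + policy)
```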
PRAM (5)
  • A CRCW PRAM can be simulated on an EREW PRAM
    • with delay O(log n); e.g., one step of a 256-processor CRCW PRAM takes 8c EREW steps (log2 256 = 8), where c is a constant factor - the simulation overhead.
PRAM (6)
  • The PRAM model favors maximal parallelism.
  • It does not take aspects of real machines into account. Real machines do not provide:
    • unit-time communication cost without contention
    • synchronous execution; explicit synchronization is therefore required.
  • Hardware implementation of a PRAM at Fachbereich 14, Universität des Saarlandes (Prof. Wolfgang Paul)
    • Fork language (Christoph Kessler, Universität Trier)
Bulk-Synchronous Parallel Model (BSP)
  • Definition BSP machine
    • It consists of processors with local memory connected by a communication network.
    • It performs a sequence of supersteps consisting of three phases:
      • Local computation phase:
        • asynchronous operations
        • extract messages from an input buffer
        • insert messages into an output buffer
      • Communication phase:
        • messages are transferred from output buffer to destination input buffer
      • Barrier synchronization concludes superstep and makes moved data available in the local memories.


  • Execution time of a superstep

w + g·h + l

  • w is the maximum number of local operations performed by any processor
  • h is the maximum number of messages sent or received by any processor (h-relation)
  • g reflects the bandwidth measured in time units per message
  • l is an upper bound on the global synchronization time
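The superstep cost above can be written directly as a helper (a trivial sketch):

```python
def superstep_cost(w, h, g, l):
    """Cost of one BSP superstep: w + g*h + l.

    w: maximum number of local operations on any processor
    h: maximum number of messages sent or received by any processor (h-relation)
    g: time units per message (inverse bandwidth)
    l: upper bound on the global synchronization time
    """
    return w + g * h + l
```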


Other Cost Functions of a Superstep
  • Overlapping communication and computation

max(w, h·g) + l

  • Taking into account the sequence of sending and receiving messages

(hin + hout)·g

  • The cost model does not take into account:
    • splitting communication cost into startup and transfer cost
    • locality: mapping the processes in the right way onto the machine will reduce contention and communication distance.
  • L.G. Valiant: A Bridging Model for Parallel Computation, Communications of the ACM, Vol. 33, No. 8, pp. 103-111, 1990
  • Summary: Broadcasting
    • When N is small (i.e., N < l/(pg - 2g)), use the one-stage broadcast
    • The two-stage broadcast is always better than the tree broadcast when p>4
    • Most communication libraries still implement broadcast by the tree based technique
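Under one plausible cost accounting (my assumption, not stated on the slides: the one-stage broadcast is a single superstep in which the source sends the full N-word message to each of the p-1 others; the two-stage broadcast scatters N/p-word pieces and then all-gathers them in two supersteps), the stated threshold falls out:

```python
def one_stage(N, p, g, l):
    # single superstep; h-relation of the source is N * (p - 1)
    return g * N * (p - 1) + l

def two_stage(N, p, g, l):
    # two supersteps, each with an h-relation of about N * (p - 1) / p
    return 2 * (g * N * (p - 1) / p + l)

# For large p, one_stage < two_stage roughly when N < l / ((p - 2) * g),
# which is the slide's threshold N < l / (pg - 2g).
```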
  • Good BSP design = Balanced design
  • The BSP model describes a special but realistic computational structure
  • It makes it possible to study the influence of synchronization and communication.
  • It allows for asynchronous computation.
  • LogP model takes into account latency, overhead and bandwidth of the communication.
    • David Culler et al., LogP: A Practical Model of Parallel Computation, Communications of the ACM, Vol. 39, No. 11, November 1996
  • A parallel computer is described in the LogP model by the following parameters:
    • L: An upper bound on the latency, or delay, incurred in communicating a message containing a word from its source node to its target node.
    • o: The overhead, defined as the time that a processor is engaged in the transmission or reception of each message. During this time the processor cannot perform other operations.
    • g: The gap, defined as the minimum time interval between consecutive message transmissions or receptions. The reciprocal of g corresponds to the available per-processor communication bandwidth.
    • P: The number of processor/memory modules.
LogP (2)
  • The latency experienced by any message is unpredictable, but it is bounded by L.
  • The network is treated as a pipeline of length L with initiation rate g and a processor overhead of o on each end.
  • The capacity of the network is finite. No more than ⌈L/g⌉ messages can be in transit from a processor at any time.
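These two quantities can be sketched directly from the parameters (function names are mine):

```python
import math

def message_time(L, o):
    # End-to-end time of one message:
    # send overhead + network latency + receive overhead.
    return o + L + o

def network_capacity(L, g):
    # At most ceil(L/g) messages can be in transit from one processor.
    return math.ceil(L / g)

# With the broadcast example's parameters L=6, g=4, o=2:
# message_time(6, 2) == 10 and network_capacity(6, 4) == 2.
```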

[Figure: a message incurs overhead o at the sender, latency L across the interconnection network, and overhead o at the receiver; successive sends are separated by the gap g.]

LogP Implementation of Broadcast

  • Example: P = 8, L = 6, g = 4, o = 2; total execution time: 24
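The example's schedule can be reproduced with a greedy simulation (a sketch under my assumptions: each informed processor keeps forwarding, with successive sends max(g, o) apart; a message started at time t is fully received at t + o + L + o, and the receiver may start forwarding immediately):

```python
import heapq

def logp_broadcast_time(P, L, g, o):
    """Greedy single-item broadcast in the LogP model: at every step,
    the earliest-available informed processor sends to a new one."""
    step = max(g, o)           # minimum spacing between sends of one processor
    ready = [0]                # min-heap of next-available send times
    finish = 0
    informed = 1               # the source already has the datum
    while informed < P:
        t = heapq.heappop(ready)
        arrival = t + o + L + o            # send overhead + latency + receive overhead
        finish = max(finish, arrival)
        informed += 1
        heapq.heappush(ready, t + step)    # the sender can transmit again
        heapq.heappush(ready, arrival)     # the receiver joins the broadcast tree
    return finish

# Slide's example: logp_broadcast_time(8, 6, 4, 2) == 24
```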

LogP (3)
  • The LogP model takes into account latency and overhead for message transmission.
    • The technique of multithreading is often suggested as a way of masking latency. In practice this technique is limited by the available communication bandwidth. This is taken into account by the finite network capacity.
  • It does encourage
    • techniques optimizing the data placement to reduce amount and frequency of communication.
    • latency hiding techniques.
  • It does not model
    • local operations
    • explicit synchronization operations
    • details of the interconnection topology
  • Models highlight different aspects of the machine.
    • PRAM: parallelism
    • BSP: coarse communication model, special programming structure
    • LogP: detailed communication model