
Introduction to Parallel Computing, Fall 2009


Presentation Transcript


  1. Introduction to Parallel Computing, Fall 2009 Sinan Kockara Department of Computer Science UCA

  2. Invalidate Protocol Systems • Snoopy cache systems • Directory-based systems • Distributed directory systems • A distributed directory permits O(p) simultaneous coherence operations, where p is the number of processors • Thus, it is more scalable than snoopy and centralized directory-based systems
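As a rough illustration of how a directory tracks sharers, here is a minimal Python sketch of a single directory entry with one presence bit per processor. The DirectoryEntry class, its method names, and the simplified clean/dirty handling are illustrative assumptions, not part of the slides; in a distributed directory, entries for different blocks live on different nodes, which is why O(p) coherence operations can proceed at once.

```python
# Minimal sketch of a directory entry for directory-based cache coherence.
# The class and method names are illustrative only.

class DirectoryEntry:
    """Tracks which of p processors hold a cached copy of one memory block."""

    def __init__(self, p):
        self.presence = [False] * p   # one presence bit per processor
        self.dirty = False            # True if one cache holds a modified copy

    def read_miss(self, proc):
        # A clean block can simply be shared; a dirty block would first be
        # written back by its owner (write-back is not modeled here).
        self.dirty = False
        self.presence[proc] = True

    def write_miss(self, proc):
        # Invalidate every other copy, then grant exclusive ownership.
        invalidated = [i for i, bit in enumerate(self.presence) if bit and i != proc]
        self.presence = [False] * len(self.presence)
        self.presence[proc] = True
        self.dirty = True
        return invalidated

entry = DirectoryEntry(p=4)
entry.read_miss(0)
entry.read_miss(2)
print(entry.write_miss(1))   # [0, 2] -- the copies that had to be invalidated
```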

  3. Cost of Communication • Determined by the programming model semantics, the network topology, and data handling and routing • Message-passing cost includes: • Time to prepare a message for transmission • Routing cost: time taken by the message to traverse the network to its final destination • Associated software protocols

  4. Principal Parameters for communication latency • Message startup time (tstartup) • Incurred only once per message • Per-hop time (thop) • Node latency • Per-word transfer time (tdata) • If the channel bandwidth is r words per second, then tdata = 1/r
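To make these parameters concrete, here is a minimal sketch of a simple additive cost model. The function name and the numeric values are illustrative assumptions; real networks may combine the terms differently (the next slide contrasts two routing schemes).

```python
# Simple additive model of point-to-point message time (illustrative only):
#   t_comm = t_startup + hops * t_hop + words * t_data, with t_data = 1 / bandwidth.

def message_time(words, hops, t_startup, t_hop, bandwidth_words_per_s):
    t_data = 1.0 / bandwidth_words_per_s     # per-word transfer time
    return t_startup + hops * t_hop + words * t_data

# Example: 1000-word message, 3 hops, 50 us startup, 1 us per hop, 1e8 words/s.
print(message_time(1000, 3, 50e-6, 1e-6, 1e8))   # ~6.3e-05 seconds
```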

  5. Message routing schemes • Store-and-forward routing • Each node stores the entire message and then passes it to the next node • Packet routing • The message is broken into smaller parts (packets) • Provides better utilization of communication resources • Lower overhead from packet loss or errors • Packets may take different paths • Provides better error correction • Because of all these nice features, packet routing is the basis of the Internet • Overhead: each packet must carry its own routing, error-correction, and sequencing information • Cut-through routing • Obtained by optimizing away the per-packet routing, error-correction, and sequencing overhead • The message is partitioned into fixed-size units called flits (flow control digits) • Problem: deadlock may occur
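The difference between store-and-forward and cut-through routing is easy to see in a back-of-the-envelope model. The sketch below assumes that store-and-forward retransmits the whole message at every hop while cut-through pipelines the flits so the payload crosses the path only once; the numeric values are illustrative.

```python
# Compare store-and-forward and cut-through routing times (illustrative model).

def store_and_forward(words, hops, t_startup, t_hop, t_data):
    # The full message is received and retransmitted at each hop.
    return t_startup + hops * (words * t_data + t_hop)

def cut_through(words, hops, t_startup, t_hop, t_data):
    # Flits are pipelined, so the payload crosses the path once.
    return t_startup + hops * t_hop + words * t_data

words, hops = 1000, 8
t_startup, t_hop, t_data = 50e-6, 1e-6, 1e-8
print(store_and_forward(words, hops, t_startup, t_hop, t_data))  # ~1.38e-4 s
print(cut_through(words, hops, t_startup, t_hop, t_data))        # ~6.8e-5 s
```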

  6. Packet Routing

  7. Cut-through routing deadlock example

  8. Granularity for Message passing • Granularity: the size of a process and the number of processes • Coarse granularity • Each process contains a large number of sequential instructions and takes substantial time to execute • Fine granularity • Each process consists of only a few instructions (sometimes one) • Medium granularity • We want to increase granularity (i.e., make it coarser) to reduce tstartup and interprocess communication costs • However, this reduces the amount of parallelism • Thus, a suitable compromise has to be made • Granularity is related to the number of processors being used

  9. Granularity metric • Computation time / communication time = tcomp / tcomm • It is very important to maximize the granularity metric while maintaining sufficient parallelism • In general, we would like to design a parallel program in which it is easy to vary the granularity, i.e., a scalable program design
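As an illustration of how the ratio shrinks as work is split more finely, here is a small sketch; the per-item compute time, message size, and cost constants are made-up assumptions chosen only to show the trend.

```python
# Estimate the granularity metric t_comp / t_comm when n_items work items are
# split among p processes. All constants below are assumed example values.

def granularity_ratio(n_items, p, t_per_item, t_startup, t_per_word, words_exchanged):
    t_comp = (n_items / p) * t_per_item              # local computation per process
    t_comm = t_startup + words_exchanged * t_per_word  # one exchange per process
    return t_comp / t_comm

for p in (2, 8, 32, 128):
    r = granularity_ratio(1_000_000, p, 1e-7, 50e-6, 1e-8, 10_000)
    print(f"p = {p:4d}  t_comp/t_comm = {r:8.1f}")   # ratio drops as p grows
```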

  10. Speedup Factor • A measure of the relative performance between a multiprocessor system and a single-processor system is the speedup factor: S(p) = ts / tp, where ts is the execution time using one processor (best sequential algorithm) and tp is the execution time using a multiprocessor with p processors • Use the best sequential algorithm for the single-processor time; the underlying algorithm for the parallel implementation might be (and usually is) different • The speedup factor can also be cast in terms of computational steps: S(p) = (number of computational steps using one processor) / (number of parallel computational steps with p processors) • Time complexity can also be extended to parallel computations
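A trivial sketch of computing S(p) from timings; the sequential and parallel times below are hypothetical numbers, not measurements from the course.

```python
# Speedup factor S(p) = t_s / t_p, where t_s is the best sequential time and
# t_p is the time on p processors. The timings below are hypothetical.

def speedup(t_sequential, t_parallel):
    return t_sequential / t_parallel

t_s = 120.0                              # seconds, best sequential algorithm
measured = {2: 63.0, 4: 33.0, 8: 19.0}   # hypothetical parallel timings

for p, t_p in measured.items():
    print(f"p = {p}: S(p) = {speedup(t_s, t_p):.2f}  (linear speedup would be {p})")
```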

  11. Maximum Achievable Speedup with a multiprocessor system • The maximum speedup is usually p with p processors (linear speedup), since S(p) = ts / tp ≤ p • E.g., one process is mapped onto each processor of a multiprocessor system consisting of n processors. What is S(n)? • It is possible to get superlinear speedup (greater than p), but usually for a specific reason such as: • Extra memory in the multiprocessor system • A nondeterministic algorithm

  12. Superlinear Speedup example - Searching • (a) Searching each sub-space sequentially: the search space is divided into p sub-spaces, each taking time ts/p to scan; the solution is found a time Δt into sub-space x (x is indeterminate), so the sequential search time is x · (ts/p) + Δt [figure]

  13. (b) Searching each sub-space in parallel: all p sub-spaces are searched simultaneously, so the solution is found after time Δt [figure]

  14. The speed-up is then given by S(p) = (x · (ts/p) + Δt) / Δt

  15. The worst case for the sequential search is when the solution is found in the last sub-space. Then the parallel version offers the greatest benefit: S(p) = (((p - 1)/p) · ts + Δt) / Δt → ∞ as Δt tends to zero

  16. The least advantage for the parallel version is when the solution is found in the first sub-space of the sequential search: S(p) = Δt / Δt = 1 • The actual speed-up depends upon which sub-space holds the solution, but it could be extremely large
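A quick numerical check of the formula from slide 14; the values of ts, p, and Δt below are arbitrary assumptions, chosen only to show the range from no benefit (solution in the first sub-space) to superlinear speedup (solution in a late sub-space).

```python
# Speedup for the search example: the sequential search scans sub-spaces in
# order and finds the solution a time dt into sub-space x (0-based), while the
# parallel search finds it after only dt. All numbers are illustrative.

def search_speedup(x, p, t_s, dt):
    return (x * t_s / p + dt) / dt

t_s, p, dt = 100.0, 10, 0.01
for x in (0, 1, 5, 9):
    print(f"solution in sub-space {x}: S({p}) = {search_speedup(x, p, t_s, dt):.1f}")
# x = 0 gives S(p) = 1 (no benefit); x = p - 1 gives a huge speedup as dt -> 0.
```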

  17. Overheads to Speedup • Several factors limit the speedup: • Periods when some of the processors are idle or not performing useful work • Extra computations that do not exist in the sequential version, e.g., recomputing constants locally • Communication time for sending messages

  18. What is the maximum speedup for a parallel program? • Amdahl’s law (1967) • Constant problem size scaling • Problem size is independent of the number of processors • Gustafson’s law (1988) • Time-constrained scaling • Problem size grows with the number of processors

  19. Amdahl’s law • The fraction of the computation that cannot be divided into concurrent tasks is f • (a) One processor: total time ts, made up of a serial section f · ts and parallelizable sections (1 - f) · ts • (b) Multiple processors (p processors): the parallelizable part takes (1 - f) · ts / p, giving tp = f · ts + (1 - f) · ts / p [figure]

  20. Question • According to the previous slide, what is the maximum speedup? • With Amdahl’s speedup formulation, what happens if the number of processors goes to infinity?

  21. Amdahl’s law: Speedup formulation • The speedup factor is given by: S(p) = ts / (f · ts + (1 - f) · ts / p) = p / (1 + (p - 1) f) • This equation is known as Amdahl’s law
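A minimal sketch evaluating Amdahl's law, including the p → ∞ limit of 1/f asked about on slide 20; the serial fraction used here is an arbitrary example value.

```python
# Amdahl's law: S(p) = p / (1 + (p - 1) * f), where f is the serial fraction.
# As p -> infinity, S(p) -> 1 / f, the maximum achievable speedup.

def amdahl_speedup(p, f):
    return p / (1 + (p - 1) * f)

f = 0.05
for p in (4, 16, 64, 1024):
    print(f"p = {p:5d}: S(p) = {amdahl_speedup(p, f):.2f}")
print(f"limit as p -> infinity: {1 / f:.0f}")   # 20 for f = 5%
```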

  22. Gustafson’s Law • Rather than assuming that the problem size is fixed, assume that the parallel execution time is fixed • Also assumes that increasing the problem size does not increase the serial section of the code • Gustafson’s speedup factor is called the scaled speedup factor

  23. Gustafson’s Scaled Speedup Factor • Let s be the time for executing the serial part of the computation and p the time for executing the parallel part of the computation on a single processor • Suppose we fix the total execution time on a single processor, s + p, as 1, so that s and p are now actual fractions of the total computation and s becomes the same as f in Amdahl’s law (see slide 21) • Then Amdahl’s law becomes: S(n) = (s + p) / (s + p/n) = 1 / (s + p/n)

  24. Gustafson’s Scaled Speedup Factor 2 • If instead the parallel execution time is fixed, the execution time on a single computer will be s + p·n, as the n parallel parts must be executed sequentially • Then: S(n) = (s + p·n) / (s + p) = s + p·n = n - (n - 1) s
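A minimal sketch of the scaled speedup formula; the serial fraction is an arbitrary example value, chosen to match the example on the next slide.

```python
# Gustafson's scaled speedup: with serial fraction s and parallel fraction
# p = 1 - s (measured on the parallel run), S(n) = s + p * n = n - (n - 1) * s.

def gustafson_speedup(n, s):
    return n - (n - 1) * s

s = 0.05
for n in (4, 16, 64, 1024):
    print(f"n = {n:5d}: S(n) = {gustafson_speedup(n, s):.2f}")   # nearly linear
```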

  25. Gustafson’s Speedup Factor Example • Suppose we had a serial section of 5% and 20 processors • What is the speedup according to Gustafson’s law? • What is the speedup according to Amdahl’s law?

  26. Solution • According to Gustafson: S(20) = 20 - (20 - 1) × 0.05 = 19.05 • According to Amdahl: S(20) = 20 / (1 + (20 - 1) × 0.05) = 20 / 1.95 ≈ 10.26

  27. Speedup against number of processors • [Figure: speedup S(p) plotted against the number of processors p (4 to 20) for serial fractions f = 0%, 5%, 10%, and 20%; only f = 0% gives linear speedup]
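The curves in the figure can be reproduced numerically with Amdahl's formula from slide 21; this short sketch simply tabulates the values, with processor counts chosen to mirror the figure's axis.

```python
# Tabulate the figure's curves: Amdahl speedup versus p for several serial
# fractions f (f = 0 reduces to linear speedup, S(p) = p).

def amdahl_speedup(p, f):
    return p / (1 + (p - 1) * f)

for f in (0.0, 0.05, 0.10, 0.20):
    row = ", ".join(f"{amdahl_speedup(p, f):5.2f}" for p in (4, 8, 12, 16, 20))
    print(f"f = {f:4.0%}: {row}")
```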

  28. Parallel program’s Efficiency • Efficiency is defined as E = (execution time using one processor) / (execution time using a multiprocessor × number of processors) = ts / (tp × p) • Efficiency gives the fraction of the time that the processors are being used on the computation • So, what does 100% efficiency mean, and when does it occur?

  29. Parallel Program’s Cost • Cost = (execution time) × (total number of processors), i.e., Cost = tp × p • The parallel execution time is given by tp = ts / S(p) • From the efficiency equation above, Cost = ts / E
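A small sketch tying the last three slides together: speedup, efficiency, and cost computed from the same timings; the numbers below are hypothetical.

```python
# Efficiency E = t_s / (p * t_p) and cost = p * t_p = t_s / E.
# The timings below are hypothetical.

t_s = 120.0                            # sequential execution time (seconds)
runs = {2: 63.0, 8: 19.0, 32: 6.5}     # hypothetical parallel timings t_p

for p, t_p in runs.items():
    speedup = t_s / t_p
    efficiency = speedup / p           # equivalently t_s / (p * t_p)
    cost = p * t_p                     # equals t_s / efficiency
    print(f"p = {p:2d}: S = {speedup:5.2f}, E = {efficiency:5.1%}, cost = {cost:6.1f}")
# 100% efficiency (E = 1) would mean cost equals t_s, i.e. perfectly linear speedup.
```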
