Presentation Transcript


  1. Parallel Computing 6 – Performance Analysis. Ondřej Jakl, Institute of Geonics, Academy of Sciences of the CR

  2. Outline of the lecture • Performance models • Execution time: computation, communication, idle • Experimental studies • Speed, efficiency, cost • Amdahl’s and Gustafson’s law • Scalability – fixed and scaled problem size • Isoefficiency function

  3. Why analysis of (parallel) algorithms • Common pursuit in the design of parallel programs: maximum speed • but in fact tradeoffs between performance, simplicity, portability, user friendliness, etc., and also development / maintenance cost • higher development cost in comparison with sequential software • Mathematical performance models of parallel algorithms can help • predict performance before implementation • will performance improve with an increasing number of processors? • compare design alternatives and make decisions • explain barriers to higher performance of existing codes • guide optimization efforts • i.e. (not unlike a scientific theory) • explain existing observations • predict future behaviour • abstract away unimportant details • tradeoff between simplicity and accuracy • For many common algorithms, performance models can be found in the literature • e.g. [Grama 2003] Introduction to Parallel Computing

  4. Performance models • Performance – a multifaceted issue, with application-dependent importance • Examples of metrics for measuring parallel performance: • execution time • parallel efficiency • memory requirements • throughput and/or latency • scalability • ratio of execution time to system cost • Performance model: mathematical formalization of a given metric • takes into account the parallel application + the target parallel architecture • = parallel system • Ex.: Performance model for the parallel execution time T: T = f (N, P, U, ...), where N – problem size, P – number of processors, U – number of tasks, ... – other hw and sw characteristics depending on the level of detail

  5. Execution time [diagram: process-time bars for processors P1–P4 over time T] • Probably the most important metric, not only in parallel processing • Simple definition: the time elapsed from when the first processor starts executing the (parallel) program to when the last processor completes the execution • Parallel execution time can be divided into computation (comp), communication (comm) and idle (idle) times [next slides] • Execution time T equals the execution time Ti on any (ith) processor: T = Ti = Ti,comp + Ti,comm + Ti,idle or, using the sums of times Tcomp, Tcomm, Tidle over all P processors, T = (Tcomp + Tcomm + Tidle) / P • Assumption: one-to-one task-processor mapping, identical processors ( = processing elements)
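
A minimal sketch (not part of the slides) of the decomposition above, in Python. The per-processor times are made-up numbers chosen so that, with idle time included, every processor's own total equals the overall execution time T:

```python
def execution_time(comp, comm, idle):
    """comp, comm, idle: per-processor time components in seconds."""
    P = len(comp)
    # T from the sums over all P processors: T = (Tcomp + Tcomm + Tidle) / P
    return (sum(comp) + sum(comm) + sum(idle)) / P

# With idle time included, each processor's own total is the same (here 2.5 s),
# which is exactly the overall T returned below.
print(execution_time(comp=[2.0, 1.8, 1.9, 2.1],
                     comm=[0.3, 0.4, 0.3, 0.2],
                     idle=[0.2, 0.3, 0.3, 0.2]))
```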

  6. Process-time diagram, real application generated in XPVM

  7. Computation time • Tcomp – time spent on the proper computation • sequential programs are supposed to run only in Tcomp • Depends on: • the performance characteristics of the processors and their memory systems • the size of the problem N (may be a set of parameters) • the number of processors P • in particular if replication of computation is applied • cannot assume constant computation time when the number of processors varies

  8. Communication time (1) [figure: message transfer time vs. message length, with startup time and bandwidth indicated] • Tcomm – time spent sending and receiving messages • Major component of parallel overhead • Depends on: • the size of the message • the structure of the interconnection system • the mode of the transfer • e.g. store-and-forward, cut-through • Simple (idealized) timing model: Tmsg = ts + tw · L, where ts .. startup time (latency), L .. message length in bytes, tw .. transfer time per data word • bandwidth (throughput): 1/tw, the transfer rate, usually recalculated to bits/sec
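
A small sketch of this idealized model in Python. The parameter values are illustrative assumptions, not measurements, and tw is taken per byte here for simplicity:

```python
def t_msg(L, ts=1e-4, tw=1e-8):
    """Predicted time to transfer a message of L bytes.

    ts .. startup time (latency) in seconds
    tw .. transfer time per byte in seconds (1/tw corresponds to the bandwidth)
    """
    return ts + tw * L

for L in (1, 1_000, 1_000_000):
    # small messages are dominated by ts, large ones by the tw * L term
    print(f"{L:>9} B  ->  {t_msg(L) * 1e6:10.1f} us")
```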

  9. Communication time (2) • Substantial platform-dependent differences in ts, tw – cf. [Foster 1995] • measurements necessary (ping-pong test) • great impact on the parallelization approach • Ex. IBM SP timings: to : tw : ts = 1 : 55 : 8333 • to .. arithmetic operation time • latency dominates with small messages! • Internode versus intranode communication: • location of the communicating tasks: the same vs. different computing nodes • intranode communication is in general considered faster • valid e.g. on Ethernet networks • on supercomputers often quite comparable
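
The ping-pong test mentioned above can be sketched as follows, assuming mpi4py is available; the script name, message sizes and repetition count are arbitrary choices for illustration. Run with two processes, e.g. `mpirun -np 2 python pingpong.py`:

```python
from mpi4py import MPI
import time

comm = MPI.COMM_WORLD
rank = comm.Get_rank()
REPS = 100

for size in (0, 1 << 10, 1 << 20):              # 0 B, 1 KiB, 1 MiB messages
    buf = bytearray(size)
    comm.Barrier()
    t0 = time.perf_counter()
    for _ in range(REPS):
        if rank == 0:
            comm.Send(buf, dest=1); comm.Recv(buf, source=1)
        elif rank == 1:
            comm.Recv(buf, source=0); comm.Send(buf, dest=0)
    t1 = time.perf_counter()
    if rank == 0:
        # one-way time = round-trip time / 2, averaged over REPS repetitions
        print(f"{size:>8} B: {(t1 - t0) / (2 * REPS) * 1e6:8.1f} us")
```

The zero-byte measurement approximates the startup time ts; the slope between the larger sizes approximates tw.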

  10. Real communication timings [figures: measured communication time and bandwidth on the IBM SP]

  11. Idle time • Tidle – time spent waiting for computation and/or data • Another component of parallel overhead • Due to lack of work • uneven distribution of work to processors (load imbalance) • consequence of synchronization and communication • Can be reduced by • load-balancing techniques • overlapping computation and communication • In practice difficult to determine • depends on the order of operations • Often neglected in performance models

  12. Ex.: Timing Jacobi finite differences [figure: N x Z grid, 1-D decomposition into strips of (N/P) x Z points] • 2-D grid of N x Z points, P processors • 1-D decomposition into P subgrids of (N/P) x Z points • Model parameters: tc .. average computation time at a single grid point, ts .. latency, tw .. transfer time per word • Total computation time, summed over all nodes: Tcomp = tc N Z • Total communication time, summed over all P processors: Tcomm = 2 P (ts + Z tw) • Neglecting Tidle (structured, synchronous communication) • Execution time per iteration: T = (Tcomp + Tcomm + Tidle) / P = (tc N Z + 2 P (ts + Z tw) + 0) / P = tc (N / P) Z + 2 (ts + Z tw) ( = Ti,comp + Ti,comm )
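
A sketch of this per-iteration model in Python. The default values of tc, ts, tw are illustrative assumptions, not measured machine constants:

```python
def jacobi_iteration_time(N, Z, P, tc=1e-8, ts=1e-4, tw=1e-8):
    comp = tc * (N / P) * Z        # computation on one (N/P) x Z subgrid
    comm = 2 * (ts + Z * tw)       # exchange of two boundary strips of Z points
    return comp + comm             # Tidle neglected, as on the slide

for P in (1, 4, 16, 64):
    print(f"P={P:3d}  T={jacobi_iteration_time(N=1024, Z=1024, P=P):.6f} s")
```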

  13. Reducing model complexity • Idealized multicomputer • no low-level hardware details, e.g. memory hierarchies, network topologies • Scale analysis • e.g. neglect one-time initialization step of an iterative algorithm • Empirical constants for model calibration • instead of modelling details • Trade-off between model complexity and acceptable accuracy

  14. Experimental studies • Parallel computing is primarily an experimental discipline • Goals of experimental studies: • parameters for performance models (e.g. ts, tw in Tcomm) • comparison of observed and modelled performance • calibration of performance models • Design of experiments – issues: • data to be measured • measurement methods and tools • accuracy and reproducibility (always repeat to verify!) • Results often show considerable variation – possible causes: • a nondeterministic algorithm (e.g. due to random numbers) • timer problems (inaccurate, limited resolution) • startup and shutdown costs (expensive, system dependent) • interference from other programs (even on dedicated processors) • communication contention (e.g. on the Ethernet) • random resource allocation (if processor nodes are not equivalent)
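
A minimal sketch of a repeated timing experiment; the measured kernel here is a placeholder. Repeating the measurement and reporting both the minimum and the median helps to expose the sources of variation listed above:

```python
import statistics
import time

def measure(kernel, reps=10):
    times = []
    for _ in range(reps):
        t0 = time.perf_counter()
        kernel()                       # the code being timed
        times.append(time.perf_counter() - t0)
    return min(times), statistics.median(times)

best, typical = measure(lambda: sum(i * i for i in range(1_000_000)))
print(f"min = {best:.4f} s, median = {typical:.4f} s")
```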

  15. Comparative performance metrics • Execution time not always convenient • varies with problem size • comparison with original sequential code needed • More adequate measures of parallelization quality: • speedup • efficiency • cost • Base for qualitative analysis

  16. Speedup • Quantifies the performance gain achieved by parallelizing a given application over a sequential implementation • Relative speedup on P processors: Sr = T1 / Tp, where T1 .. execution time on one processor • of the parallel program • or of the original sequential program, and Tp .. execution time on P (equal) processors • Absolute speedup on P processors: S = T1 / Tp, where T1 .. execution time of the best-known sequential algorithm, Tp .. see above • S is more objective, Sr is used in practice • Sr more or less indicates scalability • 0 < S <= Sr <= P expected

  17. Superlinear speedup • Theoretically, (absolute) speedup can never exceed the number of processors • otherwise another sequential algorithm could emulate the parallel run in a shorter time • In practice S > P is sometimes observed – superlinear speedup • a “bonus” of the parallelization effort • Reasons: • the sequential algorithm is not optimal • the sequential algorithm is penalized by hardware • e.g. slower access to data (cache effects) • the sequential and parallel algorithms do not perform the same work • e.g. tree search [Grama 2003]

  18. Typical speedup curves [figure: speedup curves of Program 1 and Program 2 compared with linear and superlinear speedup; after [Lin 2009]]

  19. Efficiency • Measures the fraction of time for which a processing element is usefully employed • characterizes the effectiveness with which a program uses the resources of a parallel computer • Relative efficiency on P processors: Er = Sr / P = T1 / (P · Tp), where Sr .. relative speedup • Absolute efficiency on P processors: E = S / P • 0 < E <= Er <= 1

  20. Cost • Characterizes the amount of work performed by the processors when solving the problem • Cost on P processors: C = Tp · P = T1 / E • also called the processor-time product • the cost of a sequential computation is its execution time • Cost-optimal parallel system: the cost of solving a problem on a parallel computer is proportional to (matches) the cost ( = execution time) of the fastest-known sequential algorithm • i.e. efficiency is asymptotically constant, speedup is linear • cost optimality implies very good scalability [further slides]
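
The three comparative metrics of slides 16, 19 and 20 can be computed together from measured execution times; the sample numbers below are made up for illustration:

```python
def metrics(T1, Tp, P):
    S = T1 / Tp        # (relative) speedup, slide 16
    E = S / P          # efficiency, slide 19
    C = Tp * P         # cost (processor-time product), slide 20
    return S, E, C

T1 = 100.0                                  # one-processor time (made up)
for P, Tp in [(2, 52.0), (4, 27.0), (8, 15.0)]:
    S, E, C = metrics(T1, Tp, P)
    print(f"P={P}:  S={S:.2f}  E={E:.2f}  C={C:.0f}")
```

Note how the cost grows with P whenever the efficiency drops below 1.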

  21. Amdahl’s law (1) • Observation: every parallel algorithm has a fraction of operations that must be performed sequentially (the sequential component); that component limits its speedup • Gene Amdahl (1967): If rs (0 < rs <= 1) is the sequential component of the execution time, then the maximal possible speedup achievable on a parallel computer is 1 / rs, no matter how many processors are used • E.g. if 5% of the computation is serial (rs = 0.05), then the maximum speedup is 20

  22. Amdahl’s law (2) • Proof: Let rp be the parallelizable part of the algorithm, i.e. rs + rp = 1. Then Tp, the parallel execution time on P processors, is Tp = rs T1 + (rp T1) / P. Thus, for the speedup Sp on P processors holds Sp = T1 / Tp = 1 / (rs + rp / P) = 1 / (rs + (1 – rs) / P), and Sp → 1 / rs as P → ∞.
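
A one-function sketch of the bound just derived; rs and the processor counts are the example values from the previous slide:

```python
def amdahl_speedup(rs, P):
    # Sp = 1 / (rs + (1 - rs) / P), derived above
    return 1.0 / (rs + (1.0 - rs) / P)

rs = 0.05
for P in (10, 20, 100, 10_000):
    print(f"P={P:6d}  S={amdahl_speedup(rs, P):.2f}")
# The values approach the limit 1 / rs = 20 as P grows.
```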

  23. Amdahl’s law (3) • Had some retarding effect on the development of parallel computing • Practice showed that Amdahl’s reasoning is too pessimistic • greater speedups were encountered than Amdahl’s law predicted • sequential components are usually not inherent – a reformulation of the problem may eliminate the bottleneck • increasing the problem size may decrease the percentage of the sequential part of the algorithm • reflected in the newer Gustafson’s law [next slide] • Amdahl’s law remains relevant when sequential programs are parallelized incrementally / partially • e.g. data-parallel programs with some part not being amenable to a data-parallel formulation

  24. Gustafson(-Barsis)’s law • Observation: a larger multicomputer usually allows larger problems to be solved in reasonable time • John Gustafson (1988): Given a parallel program solving a problem of size N using P processors, let rs denote the sequential component (i.e. (1 – rs) is the parallelizable component). The maximum speedup S achievable by this program is S = P – rs (P – 1). • E.g. if 5% of the computation is sequential (rs = 0.05), then on 20 processors the maximum speedup is 20 – 0.05 · 19 = 19.05 • Amdahl: 10.26 • Gustafson – time-constrained scaling, scaled speedup • the problem size is an increasing function of the processor count • constant parallel execution time, decreasing serial component • Amdahl – constant problem size scaling
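
A side-by-side evaluation of the two laws for the 5% example on this slide:

```python
def amdahl(rs, P):
    return 1.0 / (rs + (1.0 - rs) / P)   # fixed problem size

def gustafson(rs, P):
    return P - rs * (P - 1)              # scaled problem size

rs, P = 0.05, 20
print("Amdahl   :", round(amdahl(rs, P), 2))     # 10.26
print("Gustafson:", round(gustafson(rs, P), 2))  # 19.05
```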

  25. Quantitative analysis • Investigates the adaptability of the parallel system to changes in the computing environment • problem size, number of processors, communication speed, memory size, etc. • Based on substitution of machine-specific numeric values for the various parameters in performance models • caution necessary – performance models are idealizations of complex phenomena • Most interesting: the ability to utilize increasing number of processors • studied in scalability analysis [next slides]

  26. Scalability • Scalability of a parallel system is a measure of its ability to increase performance (speedup) as the number of processors increases • hardware scalability: the parallel computer can incorporate more processors without degrading the communication subsystem • Naively, one would assume that more processors (automatically) improve performance • The definition of a scalable parallel program (system) varies in the literature; often an imprecise formalization • e.g. “a parallel system is scalable if the performance is linearly proportional to the number of processors used”

  27. Fixed problem size (1) • Scalability with fixed problem size: dependence of the parallel system performance (execution time, efficiency) on the changing processor count when the problem size (and other machine parameters) are fixed • Analysis answers questions such as “what is the fastest one can solve the given problem on the given computer?” [figures: T vs. P and E vs. P] • Execution time should actually increase after reaching some maximum number of processors • Efficiency will generally decrease monotonically with increasing processor count

  28. Fixed problem size (2) • Nontrivial parallel algorithm: in reality, for any fixed problem there is an optimum number of processors that minimizes the overall execution time • the computation time component Tcomp decreases • the communication time component Tcomm (+ idle time Tidle) increases • usually an upper limit on the number of processors that can be usefully employed • An execution time model aspiring to performance extrapolation (prediction) accommodates a term with P^x, x > 0 • Choosing the problem size is difficult if the processor range is large • must provide enough data for large-scale computations • data must fit into memory for small-scale computations • Solution: scaling the problem size with the processor count [next slide] [Quinn 2004]
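
A sketch of a fixed-size sweep using the Jacobi model of slide 12, with illustrative constants. Because that simple model has no P^x term, the execution time only levels off rather than turning upward, but the falling efficiency already shows why adding processors stops paying off:

```python
tc, ts, tw = 1e-8, 1e-4, 1e-8       # illustrative constants, not measurements
N = Z = 2048                        # fixed problem size

T1 = tc * N * Z                     # sequential time (no communication)
for P in (1, 16, 64, 256, 1024):
    Tp = tc * (N / P) * Z + 2 * (ts + Z * tw)   # per-iteration model, slide 12
    print(f"P={P:5d}  T={Tp:.5f} s  E={T1 / (P * Tp):.2f}")
# T levels off while efficiency keeps falling: beyond some P, the extra
# processors are no longer usefully employed.
```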

  29. Scaled problem size (1) • Scalability with scaled problem size: dependence of parallel system performance on the number of processors when the problem size is allowed to change • Encouraged by the fact that parallelization is employed not only to solve (fixed-size) problems faster, but also to solve larger problems • typically the problem size is increased when moving to more powerful machines with more processors • with some problems scaling is not possible (e.g. with functional decomposition) • Observations: • efficiency will often increase with increasing problem size and constant processor count • efficiency will generally decrease with increasing processor count [prev. slide]

  30. Scaled problem size (2) [figures: T vs. P and E vs. P for N = 500 and N = 1000] • Larger problems (N) have a higher execution time (T, left) and usually a better efficiency (E, right) on the same number of processors (P) than smaller ones

  31. Isoefficiency metric of scalability [figure: doubled grid of 2N x Z points decomposed among 2P processors into strips of (2N/2P) x Z points] • Of particular interest: how must the amount of computation scale with the number of processors to keep the efficiency constant? • Isoefficiency function ψ(P): gives the growth rate of the problem size N which is necessary to keep E constant with increasing P • does not exist for unscalable parallel systems • From E = T1 / (P · Tp): T1 = E (Tp P) = E (Tcomp + Tcomm + Tidle) • to maintain constant efficiency, the amount of essential computation must increase at the same rate as the overheads • If ψ is O(P), then the parallel system is highly scalable: • the amount of computation needs to increase only linearly with respect to P to keep efficiency constant • Ex. Jacobi finite differences: constant efficiency requires T1 = tc Z N ≥ E (tc Z N + 2 P (ts + Z tw)), which holds for N = O(P); thus the problem is highly scalable
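
A numerical check of this claim for the Jacobi example, again with illustrative constants and an arbitrary scaling factor N = 64 · P:

```python
tc, ts, tw, Z = 1e-8, 1e-4, 1e-8, 1024   # illustrative constants

for P in (4, 16, 64, 256):
    N = 64 * P                            # problem size grown linearly with P
    T1 = tc * N * Z                       # essential (sequential) computation
    overhead = 2 * P * (ts + Z * tw)      # total communication, slide 12
    E = T1 / (T1 + overhead)
    print(f"P={P:4d}  N={N:6d}  E={E:.3f}")
# E stays at about 0.75 for every P: linear growth of N is enough,
# i.e. the isoefficiency function is O(P).
```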

  32. Other evaluation methods • Extrapolation from observations • statements like “speedup of 10.8 on 12 processors with problem size 100” • small number of observations in a multidimensional space • says little about the quality of the parallel system as a whole • Asymptotic analysis • statements like “algorithm requires O(N log N) time on O(N) processors” • deals with large N and P, usually out of scope of practical interest • says nothing about absolute cost • usually assumes idealized machine models (e.g. PRAM) • more important for theory than practice

  33. Conclusions • The lecture provides only a “feel and taste” introduction to the analytical modelling of parallel programs • Good knowledge is required especially when supercomputing is concerned • practical experience from small parallel systems is difficult to extrapolate to large problems targeted at machines with thousands of processors

  34. Further study • Covered to some extent in all textbooks on parallel programming/computing • each attempting its own specific point of view • The most profound coverage can probably be found in [Grama 2003] Introduction to Parallel Computing
