
Computer architecture II




1. Computer architecture II: Introduction

2. Recap
• Parallelization strategies
  • What to partition?
  • Embarrassingly Parallel Computations
  • Divide-and-Conquer
  • Pipelined Computations
• Application examples
• Parallelization steps
• 3 programming models
  • Data parallel
  • Shared memory
  • Message passing

3. 4 Steps in Creating a Parallel Program
• Decomposition of computation into tasks
• Assignment of tasks to processes
• Orchestration of data access, communication and synchronization
• Mapping processes to processors

4. Plan for today: Programming for performance
• Amdahl's law
• Partitioning for performance
  • Addressing decomposition and assignment
• Orchestration for performance

5. Creating a Parallel Program
• Assumption: a sequential algorithm is given
  • Sometimes a very different algorithm is needed, but that is beyond our scope
• Pieces of the job:
  • Identify work that can be done in parallel
  • Partition work and perhaps data among processes
  • Manage data access, communication and synchronization
  • Note: work includes computation, data access and I/O
• Main goal: speedup (plus low programming effort and resource needs)
    Speedup(p) = Performance(p) / Performance(1)
• For a fixed problem:
    Speedup(p) = Time(1) / Time(p)

6. Amdahl's law
• Suppose a fraction f of your application is not parallelizable
• The remaining fraction 1-f is parallelizable on p processors

    Speedup(p) = T_1 / T_p <= T_1 / (f*T_1 + (1-f)*T_1/p) = 1 / (f + (1-f)/p) <= 1/f
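
As a quick numeric illustration (my addition, not part of the original slides), a minimal C sketch that evaluates the bound 1/(f + (1-f)/p); the serial fraction f = 0.05 is chosen arbitrarily:

    #include <stdio.h>

    /* Amdahl bound: Speedup(p) <= 1 / (f + (1-f)/p), where f is the serial fraction. */
    static double amdahl_speedup(double f, int p)
    {
        return 1.0 / (f + (1.0 - f) / p);
    }

    int main(void)
    {
        double f = 0.05;                      /* illustrative serial fraction */
        for (int p = 1; p <= 1024; p *= 4)
            printf("p = %4d  ->  speedup <= %6.2f\n", p, amdahl_speedup(f, p));
        printf("p -> inf  ->  speedup <= %6.2f\n", 1.0 / f);   /* the 1/f ceiling */
        return 0;
    }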

7. Amdahl's Law (for 1024 processors)
See: Gustafson, Montry, Benner, "Development of Parallel Methods for a 1024-Processor Hypercube", SIAM J. Sci. Stat. Comp. 9, No. 4, 1988, pp. 609.

8. Amdahl's law
• But:
  • Many problems can be "embarrassingly" parallelized
    • Ex: image processing, differential equation solvers
  • In some cases the serial fraction does not increase with the problem size
  • Additional speedup can be achieved from additional resources (super-linear speedup due to more memory)

9. Speedup
[Plot: speedup versus number of processors p, with curves for linear speedup, superlinear speedup and sub-linear speedup]

10. Superlinear Speedup?
• Possible causes
  • Algorithm
    • e.g., with optimization problems, throwing many processors at it increases the chances that one will "get lucky" and find the optimum fast
  • Hardware
    • e.g., with many processors, it is possible that the entire application data resides in cache (vs. RAM) or in RAM (vs. disk)

11. Parallel Efficiency
• Eff_p = S_p / p
• Typically <= 1, unless there is superlinear speedup
• Used to measure how well the processors are utilized
• If increasing the number of processes by a factor of 10 increases the speedup only by a factor of 2, perhaps it's not worth it: efficiency drops by a factor of 5
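
A tiny sketch (my addition) that plugs the slide's example into Eff_p = S_p / p; the concrete values (p going from 10 to 100, speedup going from 5 to 10) are assumed for illustration only:

    #include <stdio.h>

    /* Parallel efficiency: Eff_p = S_p / p. */
    static double efficiency(double speedup, int p)
    {
        return speedup / p;
    }

    int main(void)
    {
        /* 10x more processes but only 2x more speedup: efficiency drops by 5x. */
        printf("Eff(p =  10, S =  5) = %.2f\n", efficiency(5.0, 10));    /* 0.50 */
        printf("Eff(p = 100, S = 10) = %.2f\n", efficiency(10.0, 100));  /* 0.10 */
        return 0;
    }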

12. Performance Goal => Speedup
• Architect's goal
  • observe how the program uses the machine and improve the design to enhance performance
• Programmer's goal
  • observe how the program uses the machine and improve the implementation to enhance performance

13. 4 Steps in Creating a Parallel Program
• Decomposition of computation into tasks
• Assignment of tasks to processes
• Orchestration of data access, communication and synchronization
• Mapping processes to processors

14. Partitioning for Performance
• First two phases of the parallelization process: decomposition & assignment
• Goals
  • Balancing the workload and reducing wait time at synchronization points
  • Reducing inherent communication
  • Reducing extra work for determining and managing a good assignment (static versus dynamic)
• Tensions between the 3 goals
  • Maximize load balance => smaller tasks => increased communication
  • No communication (run on 1 processor) => extreme load imbalance (all others idle)
  • Load balance => extra work to compute or manage the partitioning (e.g. dynamic techniques)

15. 1. Load Balance
• Speedup ≤ Sequential Work / (Max Work on any Processor)
• Work: data access, computation
• Not just equal work, but the processors must be busy at the same time
• Ex: Speedup ≤ 1000/400 = 2.5

16. 1. Load balance
• Identify enough concurrency
  • Data and functional parallelism (last class)
• Managing concurrency
  • Task granularity
• Reduce communication and synchronization

17. 1 b) Static versus Dynamic assignment
• Static: it is clear before the program starts who does what
    #pragma omp parallel for schedule(static)
    for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }
• Dynamic
  • External scheduler
  • Self-scheduled: each process picks a chunk of loop iterations and executes it
    #pragma omp parallel for schedule(dynamic,4)
    for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }
  • Dynamic guided self-scheduling: processes first take larger chunks and then progressively reduce the chunk size
    #pragma omp parallel for schedule(guided,4)
    for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }
(A complete, runnable version of these loops is sketched below.)
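
For context, a self-contained program (my addition; array and chunk sizes are illustrative) into which the pragmas above drop directly, assuming compilation with OpenMP enabled (e.g. gcc -fopenmp):

    /* Minimal, self-contained version of the loops above. */
    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    static double a[N], b[N];

    int main(void)
    {
        int i;
        for (i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

        /* Static: iterations are divided among the threads before the loop starts. */
        #pragma omp parallel for schedule(static)
        for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }

        /* Dynamic: an idle thread grabs the next chunk of 4 iterations. */
        #pragma omp parallel for schedule(dynamic,4)
        for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }

        /* Guided: chunks start large and shrink down to 4 iterations. */
        #pragma omp parallel for schedule(guided,4)
        for (i = 0; i < N; i++) { a[i] = a[i] + b[i]; }

        printf("a[N-1] = %.0f\n", a[N - 1]);
        return 0;
    }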

18. Dynamic Tasking with Task Queues
• Centralized queue: simple protocol
  • Problems: communication, synchronization, contention
• Distributed queues: complicated protocol
  • Initial distribution of jobs
    • May cause load imbalance
    • Solution: task stealing (whom to steal from, how many tasks to steal, ...)
  • Termination detection
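
A minimal sketch of the centralized case (my own illustration, not from the slides): the shared counter next plays the role of the queue head, and the atomic update of it is exactly where communication, synchronization and contention concentrate:

    #include <stdio.h>
    #include <omp.h>

    #define NTASKS 1000
    #define CHUNK  4

    static void do_task(int t) { (void)t; /* placeholder for real work */ }

    int main(void)
    {
        int next = 0;   /* head of the centralized queue (a shared counter) */

        #pragma omp parallel
        {
            for (;;) {
                int first;
                /* Grab the next chunk; this atomic update is the contention point. */
                #pragma omp atomic capture
                { first = next; next += CHUNK; }

                if (first >= NTASKS)     /* queue is empty: terminate */
                    break;
                int last = (first + CHUNK > NTASKS) ? NTASKS : first + CHUNK;
                for (int t = first; t < last; t++)
                    do_task(t);
            }
        }
        printf("all %d tasks done\n", NTASKS);
        return 0;
    }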

19. 1.c Task granularity
• Task granularity: amount of work associated with a task
• General rule:
  • Coarse-grained: often worse load balance
  • Fine-grained: better load balance, but more overhead, often more communication and contention
[Figure: coarse-grained versus fine-grained task assignment on Processor 1 and Processor 2]

20. 1.d Reducing Serialization
• Speedup < Sequential Work / Max (Work + Synch Wait Time)
• Synchronization for task assignment may cause serialization (for instance the access to a queue)
[Figure: timelines of Process 1-3 showing work, synchronization points and synchronization wait time]

21. Reducing Serialization
• Event synchronization
  • Reduce the use of conservative synchronization
    • point-to-point synchronization instead of barriers
    • finer granularity of access may reduce the synchronization time
  • But fine-grained synchronization is more difficult to program and means more synchronization operations
• Mutual exclusion
  • Separate locks for separate data
    • e.g. a lock per task in a task queue, not per queue
    • finer grain => less contention/serialization, more space, less reuse
  • Smaller, less frequent critical sections
    • don't do reading/testing in the critical section, only modification
    • e.g. search for the task to dequeue outside the critical section (see the sketch below)
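
A sketch of the last point (my addition; names such as dequeue_ready are hypothetical): the test for a runnable task happens outside the critical section, which only re-checks and claims the task:

    #include <stdio.h>
    #include <omp.h>

    #define NTASKS 64

    static int ready[NTASKS];   /* 1 = task can be run (set before the parallel part) */
    static int taken[NTASKS];   /* 1 = some thread has already claimed the task       */

    /* Return the index of a task we just claimed, or -1 if none is left.
       The cheap test runs OUTSIDE the critical section (it is only a hint);
       the critical section merely re-checks and claims. */
    static int dequeue_ready(void)
    {
        for (int i = 0; i < NTASKS; i++) {
            if (!ready[i] || taken[i])
                continue;                   /* testing outside the critical section */
            int got = 0;
            #pragma omp critical(queue)     /* small critical section: re-check + claim only */
            {
                if (!taken[i]) { taken[i] = 1; got = 1; }
            }
            if (got)
                return i;
        }
        return -1;
    }

    int main(void)
    {
        for (int i = 0; i < NTASKS; i++) ready[i] = 1;

        #pragma omp parallel
        {
            int t;
            while ((t = dequeue_ready()) != -1)
                printf("thread %d runs task %d\n", omp_get_thread_num(), t);
        }
        return 0;
    }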

22. 2. Reducing Inherent Communication
• Speedup < Sequential Work / Max (Work + Synch Wait Time + Comm Cost)
• Communication is expensive!
• Measure: communication-to-computation ratio
• Inherent communication
  • Determined by the assignment of tasks to processes
  • Actual communication may be larger (artifactual)
• One principle: assign tasks that access the same data to the same process
[Figure: timelines of Process 1-3 showing work, communication, synchronization points and synchronization wait time]

23. Domain Decomposition
• Ocean example: communicate with the neighbors, compute in the assigned domain
• Perimeter-to-area communication-to-computation ratio (area-to-volume in 3-d)
  • Depends on n and p: decreases with n, increases with p

24. Domain Decomposition
• Best domain decomposition depends on information requirements
• Block versus strip decomposition
  • Communication/computation: 4*sqrt(p)/n for block, 2*p/n for strip
  • Block is better
  • Application dependent: strip may be better in other cases
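
A small sketch (my addition) that evaluates the two ratios for an n x n grid, assuming a sqrt(p) x sqrt(p) block decomposition versus p horizontal strips; n = 1024 is illustrative (link with -lm):

    #include <stdio.h>
    #include <math.h>

    /* Communication-to-computation ratios from the slide for an n x n grid. */
    static double block_ratio(double n, double p) { return 4.0 * sqrt(p) / n; }  /* sqrt(p) x sqrt(p) blocks */
    static double strip_ratio(double n, double p) { return 2.0 * p / n; }        /* p horizontal strips      */

    int main(void)
    {
        double n = 1024.0;                       /* illustrative grid size */
        for (double p = 4.0; p <= 256.0; p *= 4.0)
            printf("p = %3.0f: block = %.4f, strip = %.4f\n",
                   p, block_ratio(n, p), strip_ratio(n, p));
        return 0;
    }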

25. Finding a Domain Decomposition
GOALS: load balance & low communication
• Static, by inspection
  • Must be predictable; e.g. Ocean
• Static, but not by inspection
  • Input-dependent, requires analyzing the input structure; e.g. sparse matrix computations
• Semi-static (periodic repartitioning)
  • Characteristics change, but slowly; e.g. Barnes-Hut
• Static or semi-static, with dynamic task stealing
  • Initial decomposition, but highly unpredictable; e.g. ray tracing

26. 3. Reducing Extra Work
• Speedup < Sequential Work / Max (Work + Synch Wait Time + Comm Cost + Extra Work)
• Common sources of extra work:
  • Computing a good partition
    • e.g. partitioning in Barnes-Hut
  • Using redundant computation to avoid communication
  • Task, data and process management overhead
    • applications, languages, runtime systems, OS
  • Imposing structure on communication
    • coalescing small messages

27. PART II: Memory-aware optimizations
• So far we have seen the parallel computer as a collection of communicating processors
  • Goals: balance load, reduce inherent communication and extra work
  • We have assumed unlimited memory
• In reality the parallel computer uses a multi-cache, multi-memory system

28. Memory-oriented View
• Multiprocessor as an extended memory hierarchy
• Levels in the extended hierarchy:
  • Registers, caches, local memory, remote memory
  • Glued together by the communication architecture
  • Levels communicate at a certain granularity of data transfer
[Figure: two processors, each with L1/L2/L3 caches and memory, joined by potential interconnects; going down the hierarchy, granularity and access time increase, capacity increases and cost/unit decreases]

29. Memory-oriented view
• Performance depends heavily on the memory hierarchy
• Time spent by a program (usually given in cycles):
    Time_prog = Time_compute + Time_access
• Data access time can be reduced by:
  • Optimizing the machine
    • larger caches
    • lower latency
    • larger bandwidth
  • Optimizing the program
    • temporal and spatial locality

30. Artifactual Communication in the Extended Hierarchy
• Poor allocation of data across distributed memories
  • Data accessed by a node is allocated in the memory of another node, even though it is accessed only by the remote node
• Unnecessary data in a transfer (e.g. the sender sends more than needed)
• Unnecessary transfers due to system granularities (e.g. cache coherence at block granularity)
• Redundant communication of data (e.g. update protocols: data is sent at every modification, but needed only once)
• Finite replication capacity (in cache or main memory): replicated data has to be evicted from memory and brought in again later

31. Replication-induced artifactual communication
• Communication induced by finite capacity is the most fundamental artifact
  • Analogous to the way cache size and miss rate determine memory traffic in uniprocessors
• View as a three-level hierarchy for simplicity
  • Local cache, local memory, remote memory (ignore network topology)
• Classify "misses" in the "cache" at any level as for uniprocessors (the 4 "C"s)
  • compulsory or cold misses (no size effect)
  • capacity misses (size effect): something has to be evicted because of limited space
  • conflict or collision misses (size effect): memory locations that map to the same cache blocks
  • communication or coherence misses (no size effect)

32. Working set
• Working set: the size of data that fits in a certain level of memory
  • 1. A few points for near-neighbor reuse
  • 2. Three sub-rows
  • 3. Whole matrix

    while (!done) do
        diff = 0;
        for i ← 1 to n do
            for j ← 1 to n do
                temp = A[i,j];
                A[i,j] ← 0.2 * (A[i,j] + A[i,j-1] + A[i-1,j] +
                                A[i,j+1] + A[i+1,j]);
                diff += abs(A[i,j] - temp);
            end for
        end for
        if (diff/(n*n) < TOL) then done = 1;
    end while
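
For reference, a plain C sketch of the sweep above (my own translation; the (n+2) x (n+2) row-major layout with a boundary ring is an assumption), with comments marking the three working sets listed on the slide:

    #include <math.h>

    /* C sketch of the solver sweep: 5-point stencil over an (n+2) x (n+2) grid. */
    void solve(double *A, int n, double TOL)
    {
        #define IDX(i, j) ((i) * (n + 2) + (j))
        int done = 0;
        while (!done) {                      /* working set 3: the whole matrix (across sweeps) */
            double diff = 0.0;
            for (int i = 1; i <= n; i++) {   /* working set 2: three sub-rows (i-1, i, i+1)     */
                for (int j = 1; j <= n; j++) {
                    double temp = A[IDX(i, j)];   /* working set 1: a few points, near-neighbor reuse */
                    A[IDX(i, j)] = 0.2 * (A[IDX(i, j)] + A[IDX(i, j - 1)] + A[IDX(i - 1, j)]
                                          + A[IDX(i, j + 1)] + A[IDX(i + 1, j)]);
                    diff += fabs(A[IDX(i, j)] - temp);
                }
            }
            if (diff / ((double)n * n) < TOL) done = 1;
        }
        #undef IDX
    }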

33. Working Set Perspective
[Figure: data traffic versus replication capacity (cache size); the curve drops at the first and second working sets and separates cold-start (compulsory) traffic, inherent communication, other capacity-independent communication, and capacity-generated traffic (including conflicts)]
• Hierarchy of working sets
• At the first-level cache (fully associative, one-word blocks), traffic is inherent to the algorithm
  • working set curve for the program
• Traffic from any type of miss can be local or nonlocal (communication)

34. Orchestration for Performance
• Reducing the amount of communication:
  • Inherent: change the partitioning (seen earlier)
  • Artifactual: exploit spatial and temporal locality in the memory hierarchy
    • Techniques often similar to those on uniprocessors
• Structuring communication to reduce cost

35. Reducing Artifactual Communication
• Message passing model
  • Communication and replication are both explicit
  • Even artifactual communication is in explicit messages
• Shared address space model
  • Communication occurs transparently due to interactions of program and system
  • Used here for the explanation

36. Exploiting Temporal Locality
[Figure: (a) unblocked access pattern in a sweep, (b) blocked access pattern with B = 4]
• Def: reuse of data elements already brought into the cache
• Structure the algorithm so that working sets fit into the cache
  • often the same techniques that reduce inherent communication:
    • assign tasks accessing the same elements to the same processor
    • schedule tasks for data reuse once assigned
• Ocean solver example: blocking (see the sketch below)
  • Each grid element is accessed 5 times
  • The first access brings it into the cache, then it is reused
  • Rewrite the loops
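
One common way to rewrite the loops is 2-d blocking; the sketch below (my addition, building on the slide-32 sweep) processes one B x B block at a time so that its working set stays in cache. The convergence test is omitted and the update order differs slightly from the unblocked sweep:

    #define B 4   /* illustrative block size */

    void solve_blocked(double *A, int n)
    {
        #define IDX(i, j) ((i) * (n + 2) + (j))
        for (int ii = 1; ii <= n; ii += B) {
            for (int jj = 1; jj <= n; jj += B) {
                int imax = (ii + B - 1 > n) ? n : ii + B - 1;
                int jmax = (jj + B - 1 > n) ? n : jj + B - 1;
                /* Finish one B x B block before moving on, so the grid points
                   touched here are reused while they are still in the cache. */
                for (int i = ii; i <= imax; i++)
                    for (int j = jj; j <= jmax; j++)
                        A[IDX(i, j)] = 0.2 * (A[IDX(i, j)] + A[IDX(i, j - 1)] + A[IDX(i - 1, j)]
                                              + A[IDX(i, j + 1)] + A[IDX(i + 1, j)]);
            }
        }
        #undef IDX
    }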

37. Exploiting Spatial Locality
• Def: when a data element is accessed, its neighbors are accessed as well
• Major spatially-related causes of artifactual communication:
  • Conflict misses
  • Data distribution/layout (allocation granularity)
  • Fragmentation (communication granularity)
  • False sharing of data (coherence granularity)
• AVOIDING ARTIFACTUAL COMMUNICATION: keep the data accessed by one processor contiguous
• Fix problems by modifying data structures, or layout/alignment (see the false-sharing sketch below)
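
A classic false-sharing illustration at coherence granularity (my addition; the 64-byte line size is an assumption): per-thread counters packed into one cache line ping-pong between caches, while padding each counter to its own line avoids that:

    #include <stdio.h>
    #include <omp.h>

    #define NTHREADS 4
    #define ITERS    10000000L

    /* Bad: all counters share one cache line -> false sharing on every update. */
    static long packed[NTHREADS];

    /* Better: pad each counter so it occupies its own (assumed 64-byte) line. */
    static struct { long value; char pad[64 - sizeof(long)]; } padded[NTHREADS];

    int main(void)
    {
        #pragma omp parallel num_threads(NTHREADS)
        {
            int id = omp_get_thread_num();
            for (long i = 0; i < ITERS; i++) packed[id]++;        /* contended line */
            for (long i = 0; i < ITERS; i++) padded[id].value++;  /* private line   */
        }
        printf("%ld %ld\n", packed[0], padded[0].value);
        return 0;
    }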

38. Spatial Locality Example: Contiguity in memory layout
[Figure: partitions P0-P8 over the grid; (a) two-dimensional array: pages and cache blocks straddle partition boundaries, making it difficult to distribute memory well; (b) four-dimensional array: pages and cache blocks stay within a partition]
• Repeated sweeps over a 2-d grid, each time adding 1 to the elements
• Use a 4-d grid to achieve spatial locality (line processor x column processor x line index x column index); see the sketch below
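
A sketch (my addition; NP and BS are illustrative) of the 2-d versus 4-d declaration: with the 4-d array, the BS x BS block owned by processor (pi, pj) is one contiguous region of memory:

    /* Illustrative dimensions: NP x NP processors, each owning a BS x BS block. */
    #define NP 2
    #define BS 512

    /* 2-d layout: a processor's block is spread over many pages and cache lines. */
    static double grid2d[NP * BS][NP * BS];

    /* 4-d layout: grid4d[pi][pj] is processor (pi, pj)'s block, contiguous in memory. */
    static double grid4d[NP][NP][BS][BS];

    /* One sweep over a block in the 2-d layout: consecutive rows of the block are
       NP*BS doubles apart, so pages and cache blocks straddle partition boundaries. */
    void sweep_block_2d(int pi, int pj)
    {
        for (int i = 0; i < BS; i++)
            for (int j = 0; j < BS; j++)
                grid2d[pi * BS + i][pj * BS + j] += 1.0;
    }

    /* The same sweep in the 4-d layout: the whole block is one contiguous BS*BS region. */
    void sweep_block_4d(int pi, int pj)
    {
        for (int i = 0; i < BS; i++)
            for (int j = 0; j < BS; j++)
                grid4d[pi][pj][i][j] += 1.0;
    }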

39. Tradeoffs with Inherent Communication
[Figure: good spatial locality on nonlocal accesses at a row-oriented boundary; poor spatial locality on nonlocal accesses at a column-oriented boundary]
• Partitioning the grid solver: blocks versus rows
• Blocks have a spatial locality problem on remote data: when accessing the elements of neighboring processors, whole cache blocks are fetched at the column boundary
• Row-wise partitioning can perform better despite its worse inherent communication-to-computation ratio

40. Example: Performance Impact on the Origin2000
[Figures: the Ocean application and the kernel solver]
