
Computer architecture II



  1. Computer architecture II Introduction Computer Architecture II

  2. Plan for today • Parallelization strategies • What to partition? • Embarrassingly Parallel Computations • Divide-and-Conquer • Pipelined Computations • Application examples • Parallelization steps • 3 programming models • Data parallel • Shared memory • Message passing Computer Architecture II

  3. Parallelization strategies: what is partitioned? • Computation (partitioned in any case) • Uniform (all processors apply the same operation) • Functional (each processor applies a different type of operation) • Mixed • Data • Not always necessary (e.g. computing π) • When data is partitioned, in most cases the computation is partitioned as well (computation accompanies data partitioning) • Locality reasons: a processor works on the data assigned to it Computer Architecture II

  4. Parallelization strategies(Barry Wilkinson) • Embarrassingly Parallel Computations • Divide-and-Conquer • Pipelined Computations Computer Architecture II

  5. Embarrassingly Parallel Computations • A computation that can obviously be divided into a number of completely independent parts, each of which can be executed by a separate process. • No communication or very little communication between processes • Each process can do its tasks without any interaction with other processes • (Figure: input data split among independent processes, each producing its share of the output data) Computer Architecture II

  6. Embarrassingly Parallel Computations examples • Image processing • Monte Carlo methods • Use random selection for independent computation • Compute π • Choose points within the unit square at random • Count in parallel how many lie inside the circle (x² + y² < 1) and how many do not • Area_circle / Area_square = π/4 ≈ (points in circle) / (points in square) Computer Architecture II
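A minimal sequential sketch in C of this Monte Carlo estimate (the sample count and seed are arbitrary choices); a parallel version would give each process its own random stream and sum the per-process hit counts at the end:

```c
/* Minimal sequential sketch of the Monte Carlo pi estimate described above.
   A parallel version would give each process its own seed and sample count,
   then sum the per-process hit counts at the end. */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const long n = 10000000;   /* number of random points (arbitrary choice) */
    long inside = 0;
    srand(42);                 /* seed: arbitrary choice */
    for (long i = 0; i < n; i++) {
        double x = (double)rand() / RAND_MAX;   /* point in the unit square */
        double y = (double)rand() / RAND_MAX;
        if (x * x + y * y < 1.0)                /* inside the quarter circle */
            inside++;
    }
    printf("pi ~= %f\n", 4.0 * (double)inside / (double)n);
    return 0;
}
```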

  7. Divide-and-Conquer • Problem resolution = dividing the problem into several parts, solving them, and combining the results • Examples: • Reduce (MPI example) • merge sort • N-body problem Computer Architecture II

  8. Divide and conquer: merge sort Computer Architecture II
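The figure for this slide did not survive the transcript; for reference, a plain sequential merge sort in C. In a parallel divide-and-conquer version, the two recursive calls can run in separate processes or threads, with the merge acting as the combine step:

```c
/* Plain sequential merge sort for reference; in a parallel divide-and-conquer
   version the two recursive calls can run in separate processes/threads,
   with the merge acting as the "combine" step.
   tmp must be a scratch array of at least hi elements. */
#include <string.h>

static void merge_sort(int *a, int *tmp, int lo, int hi) {   /* sorts a[lo..hi) */
    if (hi - lo < 2) return;
    int mid = (lo + hi) / 2;
    merge_sort(a, tmp, lo, mid);     /* divide: sort left half  */
    merge_sort(a, tmp, mid, hi);     /* divide: sort right half */
    int i = lo, j = mid, k = lo;
    while (i < mid && j < hi)        /* combine: merge the two halves */
        tmp[k++] = (a[i] <= a[j]) ? a[i++] : a[j++];
    while (i < mid) tmp[k++] = a[i++];
    while (j < hi)  tmp[k++] = a[j++];
    memcpy(a + lo, tmp + lo, (hi - lo) * sizeof(int));
}
```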

  9. Pipelined computation • Problem divided into a series of tasks that have to be completed one after the other • Each task executed by a separate process or processor • May perform different tasks • Functional partitioning (like the pipeline of a microprocessor) Computer Architecture II

  10. Pipelined computation example • Sieve of Eratosthenes • Generating series of prime numbers • Count up from 2 • A prime number is kept by the first free processor • A non-prime is dropped Computer Architecture II
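A hedged MPI sketch of this pipeline (the candidate range, sentinel value, and the choice of rank 0 as generator are illustrative; it assumes at least two ranks and enough ranks to hold the primes that are found):

```c
/* Hedged MPI sketch of the pipelined sieve: rank 0 generates candidates,
   each later rank keeps the first number it receives as "its" prime and
   forwards only non-multiples downstream; 0 is used as a termination
   sentinel. Needs at least 2 ranks; a real sieve needs enough stages. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size, n;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {                          /* generator stage */
        for (n = 2; n <= 100; n++)            /* candidate range: assumption */
            MPI_Send(&n, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        n = 0;                                /* sentinel: no more candidates */
        MPI_Send(&n, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else {                                  /* sieve stage */
        int prime = 0;
        while (1) {
            MPI_Recv(&n, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            if (n == 0) break;                /* sentinel: forwarded below */
            if (prime == 0) {                 /* first free processor keeps it */
                prime = n;
                printf("rank %d holds prime %d\n", rank, prime);
            } else if (n % prime != 0 && rank + 1 < size) {
                MPI_Send(&n, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
            }                                 /* multiples are simply dropped */
        }
        if (rank + 1 < size)                  /* pass the sentinel downstream */
            MPI_Send(&n, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
```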

  11. Application examples: Simulating Ocean Currents • Model as several two-dimensional grids • Discretized in space and time • Finer spatial and temporal resolution => greater accuracy • One time step: set up and solve equations • Concurrency • Intra-step and inter-step • Static and regular • (Figures: (a) cross sections; (b) spatial discretization of a cross section) Computer Architecture II

  12. Simulating Galaxy Evolution (N-body problem) • Simulate the interactions of many stars evolving over time • Discretized in time • Computing forces is expensive • O(n²) brute-force approach • Hierarchical methods take advantage of the force law: F = G·m₁·m₂ / r² • Many time-steps, plenty of concurrency across stars within one Computer Architecture II
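The O(n²) brute-force step amounts to summing this pairwise force over all star pairs; a hedged C sketch (2D, with an arbitrary softening term to avoid division by zero), whose outer loop is the natural unit of concurrency across stars:

```c
/* Brute-force O(n^2) force computation implied by F = G*m1*m2/r^2,
   accumulated per star; a hedged 2D sketch in which each process could
   be assigned a block of the outer loop. The softening term is an
   illustrative choice to avoid division by zero. */
#include <math.h>

#define G 6.674e-11

typedef struct { double x, y, m, fx, fy; } Star;

void compute_forces(Star *s, int n) {
    for (int i = 0; i < n; i++) {               /* concurrency across stars */
        s[i].fx = s[i].fy = 0.0;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            double dx = s[j].x - s[i].x, dy = s[j].y - s[i].y;
            double r2 = dx * dx + dy * dy + 1e-9;   /* softening: assumption */
            double f  = G * s[i].m * s[j].m / r2;   /* force magnitude       */
            double r  = sqrt(r2);
            s[i].fx += f * dx / r;                  /* project onto x and y  */
            s[i].fy += f * dy / r;
        }
    }
}
```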

  13. Rendering Scenes by Ray Tracing • Shoot rays into scene through pixels in image plane • Follow their paths • they bounce around as they strike objects • they generate new rays: ray tree per input ray • Result is color, opacity, brightness for 2D pixels • Parallelism across rays Computer Architecture II

  14. Definitions • Task: • Arbitrary piece of work in a parallel computation • Executed sequentially; concurrency is only across tasks • E.g. a particle/cell in Barnes-Hut, a ray or ray group in Raytrace • Fine-grained versus coarse-grained tasks • Process (thread): • Abstract entity that performs the tasks assigned to it • Processes communicate and synchronize to perform their tasks • Processor: • Physical engine on which a process executes • Processes virtualize the machine to the programmer • write the program in terms of processes, then map them to processors Computer Architecture II

  15. 4 Steps in Creating a Parallel Program • Decomposition of computation in tasks • Assignment of tasks to processes • Orchestration of data access, comm, synch. • Mapping processes to processors Computer Architecture II

  16. Decomposition • Identify concurrency and decide level at which to exploit it • Break up computation into tasks to be divided among processes • Tasks may become available dynamically • No. of available tasks may vary with time • Goal: Enough tasks to keep processes busy, but not too many • Number of tasks available at a time is upper bound on achievable speedup Computer Architecture II

  17. Assignment • Specify the mechanism to divide work up among processes • E.g. which process computes forces on which stars, or which rays • Goals • Balance the workload • Reduce communication and synchronization • Low run-time management cost (e.g. management of scheduling queues) • Structured programs can help • Code inspection • Identifying phases • Parallel loops • Static versus dynamic assignment • Decomposition & assignment are the first steps of parallel programming • Usually independent of architecture or programming model • But cost and complexity of using system primitives (e.g. communication, synchronization costs) may affect decisions => may have to revisit them Computer Architecture II

  18. Orchestration • Naming data • Synchronization • Structuring communication • Organizing data structures and scheduling tasks temporally • Goals • Reduce cost of communication and synchronization • Preserve locality of data reference • Schedule tasks to satisfy dependences early Computer Architecture II

  19. Mapping • Two aspects: • Which process runs on which particular processor? • mapping to a network topology • Will multiple processes run on same processor? • Space-sharing • Machine divided into subsets of nodes, only one application runs at a time in a subset • Processes can be pinned to processors, or dynamically migrated by the OS • Time-sharing • Several processes share the same processor • We assume: 1 process = 1 processor Computer Architecture II

  20. High-level goals • Major goal of parallel programs: high performance (speedup) • Low resource usage • Decrease development effort (programming, testing, debugging) • High-productivity computing (trade-off among the three) Computer Architecture II

  21. Parallel Programming Models Computer Architecture II

  22. Random access machine (RAM) • Abstract machine with an unlimited number of registers • Model used in sequential programming • Encourages bad programming practices • Assume unlimited memory • Ignore the cache hierarchy Computer Architecture II

  23. Parallel Random Access Machine (PRAM) • Abstract machine for designing algorithms for parallel computers • Multiple processors may communicate through a shared memory • In a synchronous PRAM, all processors execute in lockstep • The programmer is not concerned with synchronization and communication, but only with concurrency • Depending on the read or write accesses • Exclusive Read Exclusive Write (EREW) • Concurrent Read Exclusive Write (CREW) • Exclusive Read Concurrent Write (ERCW) • Concurrent Read Concurrent Write (CRCW) • Used for architecture-independent parallel algorithm design and analysis • Very similar to the data-parallel model Computer Architecture II

  24. Example: iterative equation solver • Simplified version of a piece of Ocean simulation • Illustrate program in low-level parallel language • C-like pseudocode with simple extensions for parallelism • Expose basic comm. and synch. primitives • State of most real parallel programming today Computer Architecture II

  25. Grid Solver • Gauss-Seidel (near-neighbor) sweeps to convergence • Interior n-by-n points of an (n+2)-by-(n+2) grid updated in each sweep • Updates done in-place in the grid • Difference from the previous value computed • Accumulate partial diffs into a global diff at the end of every sweep • Check if it has converged • to within a tolerance parameter Computer Architecture II

  26. Sequential Version Computer Architecture II
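The code listing for this slide did not survive the transcript; the following is a hedged C sketch of the sequential sweep described on the previous slide (the 0.2 weighting is the near-neighbor average, while TOL and the convergence test are illustrative choices):

```c
/* Hedged sketch of the sequential Gauss-Seidel sweep described above:
   in-place nearest-neighbor updates on the interior n x n points of an
   (n+2) x (n+2) grid, accumulating |new - old| into diff until the
   average change per point falls below a tolerance (TOL: assumption). */
#include <math.h>

#define TOL 1e-3

void solve(double **A, int n) {
    int done = 0;
    while (!done) {
        double diff = 0.0;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= n; j++) {
                double temp = A[i][j];                       /* previous value */
                A[i][j] = 0.2 * (A[i][j] + A[i][j-1] + A[i-1][j]
                                 + A[i][j+1] + A[i+1][j]);   /* in-place update */
                diff += fabs(A[i][j] - temp);                /* partial diff    */
            }
        }
        if (diff / ((double)n * n) < TOL) done = 1;  /* converged within TOL */
    }
}
```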

  27. Decomposition • Identify concurrency: look at loop iterations • Dependence analysis • If not enough concurrency, look for other methods • Not much concurrency here at this level (all loops sequential) • Examine fundamental dependences • Solutions • Retain loop structure, use point-to-point synchronization. Problem: too many synch ops. • Restructure loops. Concurrency along anti-diagonals. Problem: load imbalance, global sync. • Exploit application knowledge Computer Architecture II

  28. Exploit Application Knowledge • Reorder grid traversal: red-black ordering • Different ordering of updates: may converge quicker or slower • Red sweep and black sweep are each fully parallel (Ocean) • Global synch between them (conservative but convenient) • We use a simpler, asynchronous one to illustrate • No red-black; simply ignore dependences within a sweep • The parallel program becomes nondeterministic (result depends on the scheduling order) Computer Architecture II

  29. Decomposition • Decomposition into elements: degree of concurrency n² • Implicit synchronization after the for_all construct • Line 18: if the for_all loop is turned into an ordinary for loop, what happens? Computer Architecture II

  30. Possible Assignments • Static assignment: decomposition into rows • Block assignment of rows: row i is assigned to process ⌊i/(n/p)⌋ (blocks of n/p consecutive rows for p processes) • Cyclic assignment of rows: process i is assigned rows i, i+p, ... • Dynamic assignment • get a row index, process the row, get a new row, ... Computer Architecture II
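The same assignment arithmetic as tiny C helpers (the names are illustrative; the block version assumes n is a multiple of p):

```c
/* Static row-assignment arithmetic for n rows and p processes
   (illustrative helper names; the block version assumes p divides n). */
int block_owner(int i, int n, int p) { return i / (n / p); } /* row i -> process floor(i/(n/p)) */
int cyclic_owner(int i, int p)       { return i % p; }       /* process k gets rows k, k+p, ... */
```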

  31. Orchestration • Here we need to pin down the programming model • Data parallel solver • Shared address space solver • Message passing solver Computer Architecture II

  32. Data parallel solver • Characteristics • Data decomposition needed • All the processes execute the same code • Global variables are shared • Local variables are private • Implicit parallel processes (no need for a proc_id, no need to create threads) • Implicit synchronization (for instance at for_all) • Good for regular distributions Computer Architecture II

  33. Data Parallel Solver Computer Architecture II
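The data-parallel listing is also missing from the transcript; the sketch below is an OpenMP approximation rather than the original for_all notation: the for_all over rows becomes a parallel for with its implicit end-of-loop barrier, and the accumulation into the global diff becomes a reduction. As on slide 28, dependences within a sweep are deliberately ignored (the asynchronous variant):

```c
/* OpenMP approximation of the data-parallel sweep (not the original
   for_all notation): the loop over rows runs in parallel with an implicit
   barrier at its end, and the accumulation into the global diff is a
   reduction. Dependences within a sweep are deliberately ignored. */
#include <math.h>
#include <omp.h>

void sweep(double **A, int n, double *diff) {
    double local = 0.0;
    #pragma omp parallel for reduction(+:local)   /* implicit barrier at end */
    for (int i = 1; i <= n; i++) {
        for (int j = 1; j <= n; j++) {
            double temp = A[i][j];
            A[i][j] = 0.2 * (A[i][j] + A[i][j-1] + A[i-1][j]
                             + A[i][j+1] + A[i+1][j]);
            local += fabs(A[i][j] - temp);        /* per-thread partial diff */
        }
    }
    *diff = local;                                /* reduced global diff */
}
```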

  34. Shared Address Space Solver • Characteristics • Shared variables • Explicit threads/processes (proc_id) • Explicit synchronization • Explicit work assignment controlled by values of variables computed in function of a proc_id Computer Architecture II

  35. Shared Address Space Solver • Assignment controlled by values of variables computed in function of a proc_id Computer Architecture II

  36. Generating Threads Computer Architecture II
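The thread-creation listing is missing; below is a hedged pthread sketch of the usual pattern (NPROCS and the solve signature are assumptions): the main thread creates nprocs-1 workers, each running the solver with its own pid, participates itself as pid 0, and then joins the workers (the WAIT_FOR_END equivalent):

```c
/* Hedged pthread sketch of thread generation: the main thread creates
   nprocs-1 workers, each running solve() with its own pid, works as
   pid 0 itself, then joins the workers (WAIT_FOR_END equivalent).
   NPROCS and the solve signature are assumptions. */
#include <pthread.h>

#define NPROCS 4            /* number of processes: assumption */

void *solve(void *arg);     /* the SAS solver body, parameterized by pid */

void create_threads(void) {
    pthread_t tid[NPROCS];
    long pid;
    for (pid = 1; pid < NPROCS; pid++)           /* CREATE(nprocs-1, solve) */
        pthread_create(&tid[pid], NULL, solve, (void *)pid);
    solve((void *)0);                            /* main thread is pid 0    */
    for (pid = 1; pid < NPROCS; pid++)           /* WAIT_FOR_END            */
        pthread_join(tid[pid], NULL);
}
```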

  37. Assignment Mechanism Computer Architecture II

  38. SAS Program • SPMD: not lockstep as in the data-parallel model; processes do not necessarily execute the same instructions • Assignment controlled by process-specific proc_id variables • done condition evaluated redundantly by all • Code that does the update is identical to the sequential program • Each process has a private mydiff variable • Most interesting special operations are for synchronization • Accumulations into the shared diff have to be mutually exclusive • Barriers Computer Architecture II

  39. Mutual Exclusion • Why is it needed? • Provided by LOCK-UNLOCK around critical section • Set of operations we want to execute atomically • Implementation of LOCK/UNLOCK must guarantee mutual exclusion Computer Architecture II
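A hedged pthread sketch of the LOCK/UNLOCK pattern named above, applied to the solver's accumulation of a private mydiff into the shared diff:

```c
/* LOCK/UNLOCK around the critical section that folds a private mydiff
   into the shared diff; a hedged pthread_mutex sketch of the primitive
   named on the slide. */
#include <pthread.h>

static pthread_mutex_t diff_lock = PTHREAD_MUTEX_INITIALIZER;
static double diff;                    /* shared accumulator */

void add_mydiff(double mydiff) {
    pthread_mutex_lock(&diff_lock);    /* LOCK(diff_lock)   */
    diff += mydiff;                    /* critical section  */
    pthread_mutex_unlock(&diff_lock);  /* UNLOCK(diff_lock) */
}
```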

  40. Global Event Synchronization • BARRIER(nprocs): wait here till nprocs processes get here • Built using lower level primitives • Often used to separate phases of computation • Each of the processes P_1 ... P_nprocs executes: set up eqn system; Barrier(name, nprocs); solve eqn system; Barrier(name, nprocs); apply results; Barrier(name, nprocs) • Conservative form of preserving dependences, but easy to use • WAIT_FOR_END (nprocs-1) Computer Architecture II

  41. Message Passing Grid Solver • Cannot declare A to be global shared array • compose it logically from per-process private arrays • usually allocated in accordance with the assignment of work • process assigned a set of rows allocates them locally • Transfers of entire rows between traversals • Structurally similar to SPMD SAS • Orchestration different • data structures and data access/naming • communication • synchronization • Ghost rows Computer Architecture II

  42. Data Layout and Orchestration • Data partition allocated per processor • Add ghost rows to hold boundary data • Send edges to neighbors • Receive into ghost rows • Compute as in sequential program Computer Architecture II
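A hedged MPI sketch of that boundary exchange (the flattened row-major layout, neighbor arguments, and function name are assumptions): each process owns myrows rows plus two ghost rows, sends its edge rows to its neighbors, and receives into the ghost rows; MPI_PROC_NULL can be passed at the top and bottom of the grid:

```c
/* Hedged MPI sketch of the ghost-row exchange: each process owns myrows
   rows (rows 1..myrows of a flattened row-major block) plus ghost rows
   0 and myrows+1, sends its edge rows to its neighbors and receives
   into the ghost rows. up/down may be MPI_PROC_NULL at the grid ends. */
#include <mpi.h>

void exchange_ghost_rows(double *A, int myrows, int rowlen, int up, int down) {
    /* send first owned row up, receive into bottom ghost row */
    MPI_Sendrecv(&A[1 * rowlen],          rowlen, MPI_DOUBLE, up,   0,
                 &A[(myrows + 1) * rowlen], rowlen, MPI_DOUBLE, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    /* send last owned row down, receive into top ghost row */
    MPI_Sendrecv(&A[myrows * rowlen],     rowlen, MPI_DOUBLE, down, 1,
                 &A[0],                   rowlen, MPI_DOUBLE, up,   1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}
```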

  43. Computer Architecture II

  44. Notes on Message Passing Program • Use of ghost rows • Receive does not transfer data, send does • unlike SAS, which is usually receiver-initiated (a load fetches the data) • Communication done at the beginning of an iteration, so no asynchrony • Communication in whole rows, not element at a time • Core loop similar, but indices/bounds are in local rather than global space • Synchronization through sends and receives • Update of global diff and event synch for the done condition • Could implement locks and barriers with messages • Can use REDUCE and BROADCAST library calls to simplify code Computer Architecture II

  45. Send and Receive Alternatives • Taxonomy: Send/Receive → Synchronous or Asynchronous; Asynchronous → Blocking or Non-blocking • Semantic flavors: based on when control is returned and when data structures or buffers can be reused at either end • Synchronous: • both send/recv return when the data has arrived • Asynchronous blocking: • send returns when the data has been copied out of the user buffer • recv returns when the data has arrived, without acknowledging the sender • Asynchronous non-blocking: • send returns control immediately • recv returns after posting the operation (has to be probed later) Computer Architecture II
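For reference, a hedged mapping of these flavors onto MPI calls (illustrative only; each call shown would need a matching operation on the peer process):

```c
/* Hedged mapping of the send/receive flavors above onto MPI calls:
   MPI_Ssend  - synchronous: returns only once the matching receive has started;
   MPI_Send   - standard/blocking: returns once the user buffer may be reused,
                possibly after an internal copy;
   MPI_Irecv  - non-blocking: returns immediately, completion is checked later
                with MPI_Wait or MPI_Test.
   Illustrative only: each call needs a matching operation on the peer. */
#include <mpi.h>

void send_flavors(double *buf, int n, int dest, int src) {
    MPI_Request req;
    MPI_Ssend(buf, n, MPI_DOUBLE, dest, 0, MPI_COMM_WORLD);       /* synchronous   */
    MPI_Send (buf, n, MPI_DOUBLE, dest, 1, MPI_COMM_WORLD);       /* blocking      */
    MPI_Irecv(buf, n, MPI_DOUBLE, src,  2, MPI_COMM_WORLD, &req); /* non-blocking  */
    MPI_Wait(&req, MPI_STATUS_IGNORE);                            /* probe/complete */
}
```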

  46. Orchestration: comparison SAS and MP Computer Architecture II
