
Decomposition, Locality and (finishing synchronization lecture) Barriers

Presentation Transcript


  1. Decomposition, Locality and (finishing synchronization lecture) Barriers Kathy Yelick yelick@cs.berkeley.edu www.cs.berkeley.edu/~yelick/cs194f07

  2. Lecture Schedule for Next Two Weeks • Fri 9/14 11-12:00 • Mon 9/17 10:30-11:30 Discussion (11:30-12 as needed) • Wed 9/19 10:30-12:00 Decomposition and Locality • Fri 9/21 10:30-12:00 NVIDIA Lecture • Mon 9/24 10:30-12:00 NVIDIA lecture (David Kirk) • Tue 9/25 3:00-4:30 NVIDIA research talk in Woz (optional) • Wed 9/26 10:30-11:30 Discussion • Fri 9/27 11-12 Lecture topic TBD

  3. Outline • Task Decomposition • Review parallel decomposition and task graphs • Styles of decomposition • Extending task graphs with interaction information • Example: Sharks and Fish (Wator) • Parallel vs. sequential versions • Data decomposition • Partitioning rectangular grids (like matrix multiply) • Ghost regions (unlike matrix multiply) • Bulk-synchronous programming • Barrier synchronization

  4. Designing Parallel Algorithms • Parallel software design starts with decomposition • Decomposition Techniques • Recursive Decomposition • Data Decomposition: Input, Output, or Intermediate Data • And others • Characteristics of Tasks and Interactions • Task Generation, Granularity, and Context • Characteristics of Task Interactions

  5. Recursive Decomposition: Example • Consider parallel Quicksort. Once the array is partitioned, each subarray can be processed in parallel. Source: Ananth Grama
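
    As an illustration (not from the slides), a minimal sketch of this recursive decomposition in C with OpenMP tasks; the 1000-element cutoff is an arbitrary choice to keep task granularity reasonable:

    /* Lomuto partition: returns the final position of the pivot a[hi]. */
    static int partition(int *a, int lo, int hi) {
        int pivot = a[hi], i = lo;
        for (int j = lo; j < hi; j++)
            if (a[j] < pivot) { int t = a[i]; a[i] = a[j]; a[j] = t; i++; }
        int t = a[i]; a[i] = a[hi]; a[hi] = t;
        return i;
    }

    /* Recursive decomposition: once partitioned, the two subarrays are
       independent and can be sorted as separate tasks. */
    static void quicksort(int *a, int lo, int hi) {
        if (lo >= hi) return;
        int p = partition(a, lo, hi);
        #pragma omp task shared(a) if (hi - lo > 1000)   /* cutoff controls granularity */
        quicksort(a, lo, p - 1);
        quicksort(a, p + 1, hi);                         /* this task handles the other half */
        #pragma omp taskwait
    }

    /* Call from inside "#pragma omp parallel" followed by "#pragma omp single". */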

  6. Data Decomposition • Identify the data on which computations are performed. • Partition this data across various tasks. • This partitioning induces a decomposition of the computation, often using the following rule: • Owner Computes Rule: the thread assigned to a particular data item is responsible for all computation associated with it. • The owner computes rule is especially common on output data. Why? Source: Ananth Grama

  7. Input Data Decomposition • Recall: applying a function square to the elements of array A, then computing its sum. • Conceptualize the decomposition as a task dependency graph: • A directed graph with • Nodes corresponding to tasks • Edges indicating dependencies: the result of one task is required for processing the next. sqr(A[0]), sqr(A[1]), sqr(A[2]), …, sqr(A[n]) → sum
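
    For concreteness, a small C/OpenMP rendering of this graph (my example, not the slide's): the squarings are the independent nodes and the reduction is the single sink:

    /* Square each element of A, then sum.  Each squaring is an independent
       task; the final sum depends on all of them (the sink of the graph). */
    double square_and_sum(const double *A, int n) {
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; i++)
            sum += A[i] * A[i];
        return sum;
    }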

  8. Output Data Decomposition: Example • Consider matrix multiply with the output partitioned into four blocks, giving Task 1 through Task 4, one per block of the result. Source: Ananth Grama

  9. A Model Problem: Sharks and Fish • Illustration of parallel programming • Original version (discrete event only) proposed by Geoffrey Fox • Called WATOR • Basic idea: sharks and fish living in an ocean • Rules for breeding, eating, and death • Could also add: forces in the ocean, between creatures, etc. • Ocean is toroidal (2D donut)

  10. Serial vs. Parallel • Updating in left-to-right, row-wise order gives a serial algorithm • It cannot be parallelized because of dependencies, so instead we use a “red-black” order: forall black points grid(i,j): update … forall red points grid(i,j): update … • For 3D or a general graph, use graph coloring • Can use repeated Maximal Independent Sets to color • Graph(T) is bipartite => 2-colorable (red and black) • Nodes of each color can be updated simultaneously • Can also use two copies (old and new), but in both cases • Note we have changed the behavior of the original algorithm!
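
    A sketch of the red/black sweep in C with OpenMP, assuming a simple 4-neighbor averaging update as a stand-in for the slide's generic "update":

    /* One red-black sweep over an n x n grid.  Cells with (i+j) % 2 == color
       form an independent set, so each half-sweep parallelizes cleanly. */
    void red_black_sweep(double **grid, int n) {
        for (int color = 0; color < 2; color++) {          /* 0 = red, 1 = black */
            #pragma omp parallel for
            for (int i = 1; i < n - 1; i++)
                for (int j = 1 + (i + 1 + color) % 2; j < n - 1; j += 2)
                    grid[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                                         grid[i][j-1] + grid[i][j+1]);
        }
    }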

  11. Parallelism in Wator • The simulation is synchronous • use two copies of the grid (old and new) • the value of each new grid cell depends only on 9 cells (itself plus 8 neighbors) in the old grid • simulation proceeds in timesteps -- each cell is updated at every step • Easy to parallelize by dividing the physical domain among processors P1…P9: Domain Decomposition • Each processor repeats: compute locally to update local system; barrier(); exchange state info with neighbors; until done simulating • Locality is achieved by using large patches of the ocean • Only boundary values from neighboring patches are needed • How to pick shapes of domains?
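
    Fleshing that loop out as a hedged sketch: the cell type, update_cell(), exchange_ghosts(), and barrier() below are placeholders I introduce, not names from the course:

    typedef int cell_t;                            /* placeholder cell type        */
    cell_t update_cell(cell_t **g, int i, int j);  /* Wator rules: hypothetical    */
    void   exchange_ghosts(cell_t **g);            /* copy boundary values: hypothetical */
    void   barrier(void);                          /* e.g., a centralized barrier  */

    /* Bulk-synchronous time stepping on one processor's patch of the ocean:
       read from old_grid, write new_grid, synchronize, then swap. */
    void simulate(cell_t **old_grid, cell_t **new_grid, int steps,
                  int lo_i, int hi_i, int lo_j, int hi_j) {
        for (int t = 0; t < steps; t++) {
            for (int i = lo_i; i < hi_i; i++)
                for (int j = lo_j; j < hi_j; j++)
                    new_grid[i][j] = update_cell(old_grid, i, j);  /* reads 9 old cells */
            barrier();                       /* every patch finishes step t ...       */
            exchange_ghosts(new_grid);       /* ... before boundary values are shared */
            barrier();
            cell_t **tmp = old_grid; old_grid = new_grid; new_grid = tmp;
        }
    }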

  12. Regular Meshes (e.g., Game of Life) • Suppose the graph is an n x n mesh with connections to NSEW neighbors • Which partition has less communication (less potential false sharing): 2D blocks, with 2*n*(p^(1/2) - 1) edge crossings, or 1D strips, with n*(p-1) edge crossings? • Minimizing communication on a mesh means minimizing the “surface to volume ratio” of the partition (see the worked numbers below)
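
    For concreteness (numbers mine, not the slide's): with n = 1024 and p = 16, square blocks cross 2*1024*(4 - 1) = 6,144 edges per step, while 1D strips cross 1024*15 = 15,360, so the blocked partition cuts communication by more than half.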

  13. Ghost Nodes • Overview of a (possible!) memory hierarchy optimization • Normally done with a 1-wide ghost region • Can you imagine a reason for a wider one? • (Figure: to compute green, copy yellow, compute blue)
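
    One common way to fill a 1-wide ghost region is an MPI halo exchange; the slides don't prescribe MPI, so the up/down neighbor ranks and row-major layout below are my assumptions:

    #include <mpi.h>

    /* Exchange 1-wide ghost rows with the neighbors above and below.
       grid has (local_rows + 2) rows of cols doubles; rows 0 and
       local_rows + 1 are the ghost rows. */
    void exchange_ghost_rows(double *grid, int local_rows, int cols,
                             int up, int down, MPI_Comm comm) {
        /* send my top interior row up, receive my bottom ghost row from below */
        MPI_Sendrecv(&grid[1 * cols],                cols, MPI_DOUBLE, up,   0,
                     &grid[(local_rows + 1) * cols], cols, MPI_DOUBLE, down, 0,
                     comm, MPI_STATUS_IGNORE);
        /* send my bottom interior row down, receive my top ghost row from above */
        MPI_Sendrecv(&grid[local_rows * cols],       cols, MPI_DOUBLE, down, 1,
                     &grid[0],                       cols, MPI_DOUBLE, up,   1,
                     comm, MPI_STATUS_IGNORE);
    }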

  14. Bulk Synchronous Computation • With this decomposition, computation is mostly independent • Need to synchronize between phases • Picking up from last time: barrier synchronization

  15. Barriers • Software algorithms implemented using locks, flags, counters • Hardware barriers • Wired-AND line separate from address/data bus • Set input high on arrival; wait for output to go high before leaving • In practice, multiple wires allow reuse • Useful when barriers are global and very frequent • Difficult to support an arbitrary subset of processors • even harder with multiple processes per processor • Difficult to dynamically change the number and identity of participants • e.g., the latter due to process migration • Not common today on bus-based machines

  16. A Simple Centralized Barrier • Shared counter maintains the number of processes that have arrived: increment on arrival, then check until it equals p

    struct bar_type { int counter; struct lock_type lock; int flag = 0; } bar_name;

    BARRIER (bar_name, p) {
        LOCK(bar_name.lock);
        if (bar_name.counter == 0)
            bar_name.flag = 0;               /* reset if first to reach */
        mycount = ++bar_name.counter;        /* mycount is private */
        UNLOCK(bar_name.lock);
        if (mycount == p) {                  /* last to arrive */
            bar_name.counter = 0;            /* reset for next barrier */
            bar_name.flag = 1;               /* release waiters */
        } else
            while (bar_name.flag == 0) {}    /* busy wait */
    }

    What is the problem if barriers are done back-to-back?

  17. A Working Centralized Barrier • Consecutively entering the same barrier doesn’t work • Must prevent a process from entering until all have left the previous instance • Could use another counter, but that increases latency and contention • Sense reversal: wait for the flag to take a different value in consecutive barriers • Toggle this value only when all processes have reached the barrier

    BARRIER (bar_name, p) {
        local_sense = !(local_sense);        /* toggle private sense variable */
        LOCK(bar_name.lock);
        mycount = ++bar_name.counter;        /* mycount is private */
        if (bar_name.counter == p) {         /* last to arrive */
            UNLOCK(bar_name.lock);
            bar_name.counter = 0;            /* reset for next barrier (safe: no one
                                                enters until the flag flips) */
            bar_name.flag = local_sense;     /* release waiters */
        } else {
            UNLOCK(bar_name.lock);
            while (bar_name.flag != local_sense) {}   /* busy wait */
        }
    }
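
    For reference, a runnable rendering of the same sense-reversal idea in C11 (atomics and a pthread mutex stand in for the slide's flag and LOCK; this is my sketch, not the course's code):

    #include <stdatomic.h>
    #include <pthread.h>

    typedef struct {
        pthread_mutex_t lock;
        int counter;
        atomic_int flag;        /* global sense */
    } barrier_t;                /* initialize: counter = 0, flag = 0, mutex init */

    /* Each thread keeps its own local_sense, initialized to 0. */
    void barrier_wait(barrier_t *b, int p, int *local_sense) {
        *local_sense = !*local_sense;             /* toggle private sense */
        pthread_mutex_lock(&b->lock);
        int mycount = ++b->counter;
        if (mycount == p) {                       /* last to arrive */
            b->counter = 0;                       /* reset for next barrier */
            pthread_mutex_unlock(&b->lock);
            atomic_store(&b->flag, *local_sense); /* release waiters */
        } else {
            pthread_mutex_unlock(&b->lock);
            while (atomic_load(&b->flag) != *local_sense)
                ;                                 /* spin */
        }
    }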

  18. Centralized Barrier Performance • Latency • Centralized has critical path length at least proportional to p • Traffic • About 3p bus transactions • Storage Cost • Very low: centralized counter and flag • Fairness • Same processor should not always be last to exit barrier • No such bias in centralized • Key problems for centralized barrier are latency and traffic • Especially with distributed memory, traffic goes to same node

  19. Improved Barrier Algorithms for a Bus • Separate arrival and exit trees, and use sense reversal • Valuable in a distributed network: communicate along different paths • On a bus, all traffic goes on the same bus, and there is no less total traffic • Higher latency (log p steps of work, and O(p) serialized bus transactions) • Advantage on a bus is the use of ordinary reads/writes instead of locks • Software combining tree • Only k processors access the same location, where k is the degree of the tree

  20. Combining Tree Barrier

  21. Combining Tree Barrier

  22. Tree-Based Barrier Summary

  23. Tree-Based Barrier Cost: Latency, Traffic, Memory

  24. Dissemination Based Barrier

  25. Dissemination Barrier

  26. Dissemination Barrier Cost: Latency, Traffic, Memory

  27. Barrier Performance on SGI Challenge • Centralized does quite well • Will discuss fancier barrier algorithms for distributed machines • Helpful hardware support: piggybacking of read misses on the bus • Also helps for spinning on highly contended locks

  28. Synchronization Summary • Rich interaction of hardware-software tradeoffs • Must evaluate hardware primitives and software algorithms together • primitives determine which algorithms perform well • Evaluation methodology is challenging • Use of delays, microbenchmarks • Should use both microbenchmarks and real workloads • Simple software algorithms with common hardware primitives do well on bus • Will see more sophisticated techniques for distributed machines • Hardware support still subject of debate • Theoretical research argues for swap or compare&swap, not fetch&op • Algorithms that ensure constant-time access, but complex

  29. References • John Mellor-Crummey’s 422 Lecture Notes

  30. Tree-Based Computation • The broadcast and reduction operations in MPI are a good example of tree-based algorithms • For reductions: take n inputs and produce 1 output • For broadcast: take 1 input and produce n outputs • What can we say about such computations in general?

  31. A log n lower bound to compute any function of n variables • Assume we can only use binary operations, one per time unit • After 1 time unit, an output can only depend on two inputs • Use induction to show that after k time units, an output can only depend on 2^k inputs • After log_2 n time units, the output depends on at most n inputs • A binary tree performs such a computation

  32. Broadcasts and Reductions on Trees

  33. Parallel Prefix, or Scan • If “+” is an associative operator, and x[0],…,x[p-1] are the input data, then the parallel prefix operation computes y[j] = x[0] + x[1] + … + x[j] for j = 0,1,…,p-1 • Notation: j:k means x[j]+x[j+1]+…+x[k] (in the figure, blue marks the final values)

  34. Mapping Parallel Prefix onto a Tree - Details • Up-the-tree phase (from leaves to root) 1) Get values L and R from left and right children 2) Save L in a local register Lsave 3) Pass sum L+R to parent By induction, Lsave = sum of all leaves in left subtree • Down-the-tree phase (from root to leaves) 1) Get value S from parent (the root gets 0) 2) Send S to the left child 3) Send S + Lsave to the right child • By induction, S = sum of all leaves to the left of the subtree rooted at the parent
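
    One way to realize these two phases on an array whose length is a power of two is the classic in-place up-sweep/down-sweep (exclusive) scan; the array indexing below is my choice, the slide describes the same flow on an explicit tree:

    /* Exclusive prefix sum of x[0..n-1] in place; n must be a power of two.
       Up-sweep: each internal node accumulates the sum of its subtree
       (its left-subtree sum plays the role of Lsave).  Down-sweep: the root
       receives 0; each node passes S left and S + Lsave right.
       The iterations of each inner loop are independent, which is where
       the parallelism lives. */
    void exclusive_scan(long *x, int n) {
        for (int d = 1; d < n; d <<= 1)           /* up-sweep (reduce) phase */
            for (int i = 2*d - 1; i < n; i += 2*d)
                x[i] += x[i - d];
        x[n - 1] = 0;                             /* root gets 0 */
        for (int d = n >> 1; d >= 1; d >>= 1)     /* down-sweep phase */
            for (int i = 2*d - 1; i < n; i += 2*d) {
                long left = x[i - d];             /* Lsave */
                x[i - d]  = x[i];                 /* S goes to the left child */
                x[i]     += left;                 /* S + Lsave goes to the right child */
            }
    }

    To get the inclusive values y[j] from the slide, add the original x[j] back to each exclusive result.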

  35. E.g., Fibonacci via Matrix Multiply Prefix • Consider computing the Fibonacci numbers: F(n+1) = F(n) + F(n-1) • Each step can be viewed as a matrix multiplication: [ F(n+1) ; F(n) ] = [ 1 1 ; 1 0 ] * [ F(n) ; F(n-1) ] • Can compute all F(n) by matmul_prefix on [ M, M, M, …, M ] with M = [ 1 1 ; 1 0 ], then select the upper left entry. Derived from: Alan Edelman, MIT
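
    A small sequential check of the identity (my example); the same 2-by-2 products are exactly what matmul_prefix would form in parallel:

    /* F(n+1) and F(n) appear in [[1,1],[1,0]]^n.  Multiplying the step
       matrices one by one is the sequential version of the prefix the
       slide describes. */
    long fib(int n) {
        long m[2][2] = {{1, 0}, {0, 1}};           /* identity */
        for (int k = 0; k < n; k++) {              /* m = [[1,1],[1,0]] * m */
            long a = m[0][0], b = m[0][1], c = m[1][0], d = m[1][1];
            m[0][0] = a + c; m[0][1] = b + d;
            m[1][0] = a;     m[1][1] = b;
        }
        return m[1][0];                            /* F(n) */
    }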

  36. Adding two n-bit integers in O(log n) time • Let a = a[n-1]a[n-2]…a[0] and b = b[n-1]b[n-2]…b[0] be two n-bit binary numbers • We want their sum s = a+b = s[n]s[n-1]…s[0] • Challenge: compute all carry bits c[i] in O(log n) time via parallel prefix • Used in all computers to implement addition - carry look-ahead

    c[-1] = 0                                                        … rightmost carry bit
    for i = 0 to n-1
        c[i] = ( (a[i] xor b[i]) and c[i-1] ) or ( a[i] and b[i] )   … next carry bit
        s[i] = ( a[i] xor b[i] ) xor c[i-1]

    for all (0 <= i <= n-1)  p[i] = a[i] xor b[i]                    … propagate bit
    for all (0 <= i <= n-1)  g[i] = a[i] and b[i]                    … generate bit

    c[i] = ( p[i] and c[i-1] ) or g[i], which in matrix form is
    [ c[i] ; 1 ] = [ p[i] g[i] ; 0 1 ] * [ c[i-1] ; 1 ] = C[i] * [ c[i-1] ; 1 ]   … 2-by-2 Boolean matrix multiplication (associative)
                 = C[i] * C[i-1] * … * C[0] * [ 0 ; 1 ]
    … evaluate each P[i] = C[i] * C[i-1] * … * C[0] by parallel prefix
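
    Equivalently, the carries can be computed by scanning (propagate, generate) pairs with an associative combine; the loop below is the sequential version of that scan (names are mine), and replacing it with parallel prefix gives the O(log n) adder:

    /* Carry look-ahead: combine (p, g) pairs with the associative operator
       (p2, g2) o (p1, g1) = (p2 & p1, g2 | (p2 & g1)); the running prefix of
       these pairs gives every carry bit.  P is the composed propagate, kept
       to show the full operator even though only G feeds the output. */
    void carries(const int *a, const int *b, int *c, int n) {
        int P = 1, G = 0;                    /* prefix over bits 0..i, c[-1] = 0 */
        for (int i = 0; i < n; i++) {
            int p = a[i] ^ b[i];             /* propagate */
            int g = a[i] & b[i];             /* generate  */
            G = g | (p & G);                 /* combine with the prefix so far */
            P = p & P;
            c[i] = G;                        /* carry out of bit i */
        }
    }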

  37. Other applications of scans • There are several applications of scans, some more obvious than others • add multi-precision numbers (represented as arrays of numbers) • evaluate recurrences, expressions • solve tridiagonal systems (numerically unstable!) • implement bucket sort and radix sort • dynamically allocate processors • search for regular expressions (e.g., grep) • Names: +\ (APL), cumsum (Matlab), MPI_SCAN • Note: 2n operations are used when only n-1 are needed
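
    As one of the less obvious items on the list, a sketch of how a scan "dynamically allocates" space, here packing selected elements densely (function and variable names are mine):

    #include <stdlib.h>

    /* Stream compaction via scan: pos[i] = number of kept elements before i,
       so each kept element knows its output slot with no serial bottleneck. */
    int compact(const int *in, const int *keep, int *out, int n) {
        int *pos = malloc(n * sizeof *pos);
        int running = 0;
        for (int i = 0; i < n; i++) {   /* exclusive scan of the keep flags    */
            pos[i] = running;           /* (replace with a parallel scan)      */
            running += keep[i];
        }
        for (int i = 0; i < n; i++)     /* every kept element writes independently */
            if (keep[i]) out[pos[i]] = in[i];
        free(pos);
        return running;                 /* number of elements kept */
    }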

  38. Evaluating arbitrary expressions • Let E be an arbitrary expression formed from +, -, *, /, parentheses, and n variables, where each appearance of each variable is counted separately • Can think of E as an arbitrary expression tree with n leaves (the variables) and internal nodes labeled by +, -, * and / • Theorem (Brent): E can be evaluated in O(log n) time, if we reorganize it using the laws of commutativity, associativity and distributivity • Sketch of (modern) proof: evaluate the expression tree E greedily by • collapsing all leaves into their parents at each time step • evaluating all “chains” in E with parallel prefix

  39. Multiplying n-by-n matrices in O(log n) time • For all (1 <= i,j,k <= n): P(i,j,k) = A(i,k) * B(k,j) • cost = 1 time unit, using n^3 processors • For all (1 <= i,j <= n): C(i,j) = sum_{k=1..n} P(i,j,k) • cost = O(log n) time, using a tree with n^3 / 2 processors

  40. Evaluating recurrences • Let x_i = f_i(x_{i-1}), with f_i a rational function and x_0 given • How fast can we compute x_n? • Theorem (Kung): suppose degree(f_i) = d for all i • If d = 1, x_n can be evaluated in O(log n) time using parallel prefix • If d > 1, evaluating x_n takes Ω(n) time, i.e. no speedup is possible • Sketch of proof when d = 1: x_i = f_i(x_{i-1}) = (a_i * x_{i-1} + b_i) / (c_i * x_{i-1} + d_i) can be written as x_i = num_i / den_i = (a_i * num_{i-1} + b_i * den_{i-1}) / (c_i * num_{i-1} + d_i * den_{i-1}), i.e. [ num_i ; den_i ] = [ a_i b_i ; c_i d_i ] * [ num_{i-1} ; den_{i-1} ] = M_i * [ num_{i-1} ; den_{i-1} ] = M_i * M_{i-1} * … * M_1 * [ num_0 ; den_0 ], so we can use parallel prefix with 2-by-2 matrix multiplication • Sketch of proof when d > 1: degree(x_i) as a function of x_0 is d^i, while after k parallel steps degree(anything) <= 2^k, so computing x_i takes Ω(i) steps

  41. Summary • Message passing programming • Maps well to large-scale parallel hardware (clusters) • Most popular programming model for these machines • A few primitives are enough to get started • send/receive or broadcast/reduce plus initialization • More subtle semantics to manage message buffers to avoid copying and speed up communication • Tree-based algorithms • Elegant model that is a key piece of data-parallel programming • Most common are broadcast/reduce • Parallel prefix (aka scan) produces partial answers and can be used for many surprising applications • Some of these are of more theoretical than practical interest

  42. Extra Slides

  43. Inverting dense n-by-n matrices in O(log^2 n) time • Lemma 1: Cayley-Hamilton Theorem • expression for A^{-1} via the characteristic polynomial in A • Lemma 2: Newton’s Identities • triangular system of equations for the coefficients of the characteristic polynomial, with matrix entries the s_k • Lemma 3: s_k = trace(A^k) = sum_{i=1..n} A^k[i,i] = sum_{i=1..n} [λ_i(A)]^k • Csanky’s Algorithm (1976): 1) Compute the powers A^2, A^3, …, A^{n-1} by parallel prefix (cost = O(log^2 n)) 2) Compute the traces s_k = trace(A^k) (cost = O(log n)) 3) Solve the Newton identities for the coefficients of the characteristic polynomial (cost = O(log^2 n)) 4) Evaluate A^{-1} using the Cayley-Hamilton Theorem (cost = O(log n)) • Completely numerically unstable

  44. Summary of tree algorithms • Lots of problems can be done quickly - in theory - using trees • Some algorithms are widely used • broadcasts, reductions, parallel prefix • carry look-ahead addition • Some are of theoretical interest only • Csanky’s method for matrix inversion • Solving general tridiagonals (without pivoting) • Both numerically unstable • Csanky needs too many processors • Embedded in various systems • CM-5 hardware control network • MPI, Split-C, Titanium, NESL, other languages
