300 likes | 438 Views
TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: A A A A A A A A A A. Taming parallelism. Task-parallelism. Message-passing. Data parallelism: Highly coarse-grained (MapReduce) Highly fine-grained (numeric computations on dense arrays)
 
                
                E N D
TexPoint fonts used in EMF. Read the TexPoint manual before you delete this box.: AAAAAAAAAA
Taming parallelism Task-parallelism Message-passing • Data parallelism: • Highly coarse-grained • (MapReduce) • Highly fine-grained • (numeric computations on dense arrays) • Problem-specific methods
Taming parallelism Our target: Data-parallel computations over large, unstructured, shared-memory graphs Unknown granularity High-level correctness as well as efficiency.
Delaunay mesh refinement • Triangulate a given set of points. • Delaunay property: No point is contained within the circumcircle of a triangle. • Quality property: No bad triangles—i.e., triangles with an angle > 120o. • Mesh refinement: Fix bad triangles through an iterative algorithm.
Retriangulation Cavity: all triangles whose circumcircle contains new point. Quality constraint may not hold for all new triangles.
Sequential mesh refinement Mesh m = /* read input mesh */ Worklist wl = new Worklist(m.getBad()); foreach triangle t in wl { Cavity c = new Cavity(t); c.expand(); c.retriangulate(); m.updateMesh(c); wl.add(c.getBad()); } • Cavities are contiguous “regions” in the mesh. • Worst-case cavities can encompass the whole mesh.
Parallelization • Computation over complex, unstructured graphs Mesh = Heap-allocated graph. Nodes = triangles. Edges = adjacency • Atomicity: Cavities must be retriangulated atomically. • Non-overlapping cavities can be processed in parallel. • Seems impossible to handle with static analysis: • Shape of data structure changes greatly over time. • Shape of data structure is highly input-dependent. • Without deep algorithmic knowledge, impossible to say if statically if cavities will overlap. • Lots of recent work, notably by Pingali et al.
List of similar applications • Delaunay mesh refinement, Delaunay triangulation • Agglomerative clustering, ray tracing • Social network maintenance • Minimum spanning tree, Maximum flow • N-body simulation, epidemiological simulation • Sparse matrix-vector multiplication, sparse Cholesky factorization • Belief propagation, survey propagation in Bayesian inference • Iterative dataflow analysis, Petri net simulation • Finite-difference PDE solution
Locality of updates in Chorus Cavity • On a mesh of ~100,000 triangles from Lonestar benchmarks: Average cavity size = 3.75 triangles. • Maximum cavity size = 12 triangles • Average-case locality the essence of parallelism. • Chorus: parallel computation driven by “neighborhoods” in heaps.
Heaps, regions, assemblies • Heap = directed graphNodes = objectsLabeled edges = pointers • Region = induced subgraph • Assembly = region + thread of control Typically speculativeand shortlived.
Programs, assembly classes • Assembly class = set of local variables + set of guarded updates + constructor + public variables. • Program = set of classes • Synchronization happens in guard evaluation. terminated busy executingupdate ready to be preempted or execute next update :: Guard: Update
Guards can merge assemblies :: merge (u.f): S :: merge (u.f) when g: S f u • g is a condition on thelocal variables and owned objects of • gets a bigger region, keeps local state • dies. • must be in ready state while merge happens
Updates can split an assembly split(T) • Split into assemblies of class T. • Other assemblies not affected. • Not a synchronization construct.
Local updates • Attempts to access objects outside region lead to exceptions. x = u.f; x.f= y; f u
Delaunay mesh refinement • Use two assembly classes: Triangle and Cavity. • Cavity = local region in mesh. • Each triangle: • Determines if it is bad (local check). • If so, merges with neighbors to become cavity. • Each cavity: • Determines if it is complete (local check). • If no, merges with a neighbor. • If yes, retriangulates (locally) and splits.
Delaunay mesh refinement: sketch assembly Triangle:: ... action:: merge (v.f, Cavity) when isBad: skip assembly Cavity:: ... action:: merge (v.f) when (not isComplete): ... isComplete: retriangulate(); split(Triangle)
Delaunay mesh refinement: sketch assem Triangle:: ... action:: merge (v.f, Cavity, u) when bad?: skip assem Cavity:: ... action:: merge (v.f) when (not complete?): skip complete?: retriangulate(); split(Triangle) • What happens on a conflict? • Cavity i “absorbed” by cavity j. • Cavity j now has some “unnecessary” triangles. • j will later split.
Boruvka’s algorithm for minimum spanning tree • Assembly = spanning tree • Initially, each assembly hasone node. • As algorithm progresses, trees merge.
Race-freedom • No aliasing, only ownership transfer. • can merge with only when is not in the middle of an update.
Deadlock-freedom • Classic definition: Process P waits for a resource from Q and vice versa. • Deadlock in Chorus: • has a locally enabled merge with • has a locally enabled merge with • No other progress is possible. • But one of the merges can always be carried out. (An assembly can always be killed at its ready state.) u
JChorus 7: assembly Cavity { 8: action { // expand cavity 9: merge(outgoingedges, TriangleObject t): 10: { outgoingedges.remove(t); 11: frontier.add(t); 12: build(); } 13: } 14: Set members; Set border; 15: Queue frontier; // current frontier 16: List outgoingedges; // outgoing edges on which to merge 17: TriangleObject initial; ... • Chorus + sequential Java. • Assembly classes in addition to object classes.
Division-based implementation • Division = set of assemblies mapped to a core. • Local access: Merge-actions within a division Split-actions Local updates • Remote access:Merge-actions issued across divisions • Uses assembly-level locks.
Implementation strategies • Adaptive divisions. Heuristic for reducing the number of remote merges. • During a merge, not only the target assembly, but also assemblies reachable by k pointer indirections, are migrated. • Adaptation heuristic does elementary load balancing. • Union-find data structure to relate objects and assemblies that they belong to • Needed for splits and merges. • Token-passing for deadlock prevention and termination detection.
Experiments: Delaunay refinement from Lonestar benchmarks • Large dataset from Lonestar benchmarks. • 100,364 triangles. • 47,768 initially bad. • 1 to 8 threads. • Competing approaches: • Object-level locking • DSTM (Software transactions)
Locality: mesh snapshots The initial mesh and divisions Mesh after several thousand retriangulations
Related models • Threads + explicit locking: Global heap abstraction, arbitrary aliasing. • Software transactions: Burden of reasoning passed to transaction manager. In most implementations, heap is viewed as global. • Static data partitioning: Unpredictable nature of the computation makes static analysis hard. • Actors: Based on low-level messaging. If sending references, potential of races. If copying triangles, inefficient. • Pingali et al’s Galois: Same problem, but ours is an alternative.
More information Parallel programming with object assemblies. Roberto Lublinerman, SwaratChaudhuri, PavolCerny. OOPSLA 2009. http://www.cse.psu.edu/~swarat/chorus