
Structure-driven Optimizations for Amorphous Data-parallel Programs


Presentation Transcript


  1. Structure-driven Optimizations for Amorphous Data-parallel Programs
     Mario Méndez-Lojo¹, Donald Nguyen¹, Dimitrios Prountzos¹, Xin Sui¹, M. Amber Hassaan¹, Milind Kulkarni², Martin Burtscher¹, Keshav Pingali¹
     ¹The University of Texas at Austin (USA)   ²Purdue University (USA)

  2. Irregular algorithms
  • Operate on pointer-based data structures like graphs
    • mesh refinement, minimum spanning tree, max-flow, …
  • Plenty of available parallelism [Kulkarni et al., PPoPP’09]
  • Baseline Galois system [Kulkarni et al., PLDI’07]
    • uses speculation to exploit this parallelism
    • may have high overheads for some algorithms
  • Solution explored in the paper: exploit algorithmic structure to reduce the overheads of the baseline system
  • We will show:
    • common algorithmic structures
    • optimizations that exploit those structures
    • performance results

  3. Operator formulation of algorithms
  • Algorithm = repeated application of an operator to a graph
  • Active node: a node where computation is needed
  • Activity: the application of the operator to an active node; it can add/remove nodes from the graph
  • Neighborhood: the set of nodes and edges read/written to perform an activity; it can be distinct from the active node’s neighbors in the graph
  • Focus: algorithms in which the order of activities does not matter
  • Amorphous data-parallelism: parallel execution of activities, subject to neighborhood constraints (see the sketch below)
  (figure: graph with active nodes i1–i5 and their neighborhoods highlighted)
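
  To make the operator formulation concrete, here is a minimal sequential sketch of the loop shape it describes; Node and Operator are hypothetical stand-ins, not the actual Galois API.

    import java.util.ArrayDeque;
    import java.util.Deque;

    interface Node {}

    interface Operator {
        // One activity: apply the operator to an active node, reading and
        // writing its neighborhood; newly activated nodes go back on the workset.
        void apply(Node active, Deque<Node> workset);
    }

    class OperatorLoop {
        static void run(Deque<Node> initialActive, Operator op) {
            Deque<Node> workset = new ArrayDeque<>(initialActive);
            while (!workset.isEmpty()) {
                Node active = workset.poll(); // unordered: any active node is a valid choice
                op.apply(active, workset);
            }
        }
    }

  The later slides refine this loop: a parallel implementation must additionally enforce the neighborhood constraints.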

  4. Delaunay mesh refinement
  • Iterative refinement to remove badly shaped triangles:

      add initial bad triangles to workset
      while workset is not empty:
          pick a bad triangle
          find its cavity
          retriangulate the cavity
          add new bad triangles to workset

  • Multiple valid solutions
  • Parallelism (see the sketch below):
    • bad triangles whose cavities do not overlap can be processed in parallel
    • the parallelism depends on runtime values, so compilers cannot find it
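
  A sequential Java sketch of the refinement loop above; Mesh, Triangle, and Cavity are hypothetical interfaces standing in for the real mesh data structures, not the Galois API.

    import java.util.ArrayDeque;
    import java.util.Collection;
    import java.util.Deque;

    interface Triangle { boolean isBad(); }

    interface Cavity {
        void retriangulate();                   // replace the cavity's triangles
        Collection<Triangle> newBadTriangles(); // badly shaped triangles just created
    }

    interface Mesh {
        Collection<Triangle> badTriangles();
        boolean contains(Triangle t);  // an earlier activity may have deleted it
        Cavity cavityOf(Triangle bad); // the neighborhood of this activity
    }

    class Refinement {
        static void refine(Mesh mesh) {
            Deque<Triangle> workset = new ArrayDeque<>(mesh.badTriangles());
            while (!workset.isEmpty()) {
                Triangle bad = workset.poll();
                if (!mesh.contains(bad)) continue; // stale work item
                Cavity cavity = mesh.cavityOf(bad);
                cavity.retriangulate();
                workset.addAll(cavity.newBadTriangles()); // new active nodes
            }
        }
    }

  Two activities conflict exactly when their cavities overlap, which is what the neighborhood constraints of the previous slide enforce.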

  5. Baseline execution model
  • Parallel execution model
    • shared-memory
    • optimistic execution of Galois iterators
  • Implementation
    • threads get active nodes from the workset and apply the operator to them
  • Neighborhood independence
    • each node/edge has an associated token
    • graph operations acquire tokens on the nodes they read or write
    • token owned by another thread → conflict → the activity is rolled back
    • a form of software TLS/TM
  (figure: a program iterating over the workset, executing against a concurrent graph with active nodes i1–i5)
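
  A minimal sketch of the per-element token scheme, using an atomic owner field; the real Galois runtime is more elaborate (it also logs undo actions so that an aborted activity can be rolled back). ConflictException is a hypothetical name.

    import java.util.concurrent.atomic.AtomicReference;

    class ConflictException extends RuntimeException {}

    class Token {
        private final AtomicReference<Thread> owner = new AtomicReference<>();

        // Called on every read/write of the associated node or edge.
        void acquire() {
            Thread me = Thread.currentThread();
            if (owner.get() == me) return;     // reentrant: already owned by this thread
            if (!owner.compareAndSet(null, me))
                throw new ConflictException(); // owned by another thread: abort the activity
        }

        void release() { owner.set(null); }    // on commit or abort
    }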

  6. Sources of overhead
  • Dynamic assignment of work
    • the centralized workset requires synchronization
  • Enforcing neighborhood constraints
    • acquiring/releasing tokens on the neighborhood
  • Copying data for rollbacks
    • whenever an activity modifies a graph element
  • Aborted activities
    • the work is wasted, and the activity must also be rolled back
  (figure: timeline of an activity, showing reads (R), writes (W), and the rollback copies they trigger)

  7. Proposed optimizations
  • The baseline execution model is very general
    • many irregular algorithms do not need its full generality
    • “optimize the common case”
  • Identify general-purpose optimizations and evaluate their performance impact
  • Optimizations
    • cautious
    • one-shot
    • iteration coalescing

  8. Cautious
  • Algorithmic structure: the operator reads all elements of its neighborhood before modifying any of them
    • conflicts are therefore detected before any modification occurs
  • Overheads removed (see the sketch below):
    • Enforcing neighborhood constraints: token acquisition is unnecessary after the first modification
    • Copying data for rollbacks: once writes begin, the activity can no longer abort, so no undo copies are needed
  • Examples: Delaunay refinement, Boruvka minimum spanning tree, etc.
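
  A sketch of executing a cautious operator, reusing Node and ConflictException from the earlier sketches; Neighborhood and TokenedGraph are hypothetical interfaces.

    interface Neighborhood {}

    interface TokenedGraph {
        // Reads (and locks) every element of the neighborhood;
        // may throw ConflictException.
        Neighborhood readNeighborhood(Node active);
        void modify(Neighborhood nbhd); // writes directly, no undo copies
        void releaseAll(Neighborhood nbhd);
    }

    class CautiousExecution {
        static void runActivity(Node active, TokenedGraph g) {
            // Phase 1: read the whole neighborhood. Conflicts can only
            // happen here, before anything has been modified.
            Neighborhood nbhd = g.readNeighborhood(active);
            // Phase 2: the activity already owns every token it will
            // ever need, so it cannot abort; writes need no further
            // token acquisitions and no rollback copies.
            g.modify(nbhd);
            g.releaseAll(nbhd); // commit
        }
    }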

  9. One-shot
  • Algorithmic structure: the neighborhood can be predicted before the activity begins
  • Overheads removed or reduced (see the sketch below):
    • Enforcing neighborhood constraints: token acquisition is only necessary when the activity starts
    • Copying data for rollbacks: none needed
    • Aborted activities: waste little computation
  • Examples: preflow-push, survey propagation, stencil codes like Jacobi, etc.
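
  For a one-shot operator, the neighborhood is a function of the active node alone, so all tokens can be acquired up front. The sketch below reuses the interfaces from the cautious sketch; predictNeighborhood, acquireAll, and apply are hypothetical helpers.

    class OneShotExecution {
        interface PredictingGraph extends TokenedGraph {
            Neighborhood predictNeighborhood(Node active); // needs only the active node
            void acquireAll(Neighborhood nbhd);            // may throw ConflictException
            void apply(Node active, Neighborhood nbhd);    // the operator body
        }

        static void runActivity(Node active, PredictingGraph g) {
            Neighborhood nbhd = g.predictNeighborhood(active);
            g.acquireAll(nbhd);    // an abort here wastes almost no work,
            g.apply(active, nbhd); // and the body runs without conflict
                                   // checks or rollback copies
            g.releaseAll(nbhd);    // commit
        }
    }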

  10. Iteration coalescing
  • Iteration coalescing = data-centric loop chunking
    • place new active nodes in thread-local worksets
    • release tokens only on abort/commit
  • Algorithmic structure:
    • activities generate new active nodes
    • the same token is acquired many times across related activities
  • Benefits:
    • Dynamic assignment of work: less contention on the centralized workset
    • Enforcing neighborhood constraints: locality, since the thread probably owns the token already

  11. Iteration coalescing (continued)
  • Same scheme and benefits as the previous slide (see the sketch below)
  • Drawback:
    • the number of tokens held by each thread increases
    • hence a higher conflict ratio
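
  A sketch of iteration coalescing on top of the previous sketches: newly activated nodes stay on a thread-local workset, and tokens are released only once, when the whole chunk commits or aborts. ChunkingGraph and its commitAll/abortAll methods are hypothetical.

    import java.util.ArrayDeque;
    import java.util.Deque;

    class CoalescedExecution {
        interface ChunkingGraph {
            // Applies the operator and returns the newly activated nodes;
            // tokens acquired along the way are kept, not released.
            Iterable<Node> apply(Node active);
            void commitAll(); // release every token this thread holds
            void abortAll();  // undo the whole chunk, then release
        }

        static void runChunk(Node seed, ChunkingGraph g) {
            Deque<Node> local = new ArrayDeque<>(); // thread-local workset
            local.add(seed);
            try {
                while (!local.isEmpty()) {
                    Node active = local.poll();
                    // Related activities tend to touch the same elements,
                    // so this thread usually owns the tokens already.
                    for (Node n : g.apply(active)) local.add(n);
                }
                g.commitAll();
            } catch (ConflictException e) {
                g.abortAll();
            }
        }
    }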

  12. Evaluation
  • Experiments on Niagara (8 cores, 2 threads/core)
  • Numbers are the average % improvement over the baseline
  • Delaunay refinement
    • cautious → 15%
    • cautious + coalescing → 18%
    • one-shot not applicable
    • max speedup: 8.4x
  • Boruvka (MST)
    • cautious → 22%
    • one-shot → 8%
    • coalescing has no impact
    • max speedup: 2.7x

  13. Evaluation
  • Experiments on Niagara (8 cores, 2 threads/core)
  • Numbers are the average % improvement over the baseline
  • Preflow-push (max flow)
    • cautious → 33%
    • one-shot → 44%
    • one-shot + coalescing → 59%
    • max speedup: 1.12x
  • Survey propagation (SAT)
    • baseline times out
    • one-shot → 28% over cautious
    • max speedup: 1.04x

  14. Conclusions
  • There is structure in irregular algorithms
  • Our optimizations exploit this structure:
    • cautious
    • one-shot
    • iteration coalescing
  • The evaluation confirms the importance of reducing the overheads of speculation
  • Other optimizations waiting to be discovered?

  15. Thank you! धन्यवाद! (Hindi for “thank you”)
      Slides available at http://www.ices.utexas.edu/~marioml
