1 / 95

Program Analysis and Synthesis of Parallel Systems

Program Analysis and Synthesis of Parallel Systems. Roman Manevich Ben-Gurion University. Three papers. A Shape Analysis for Optimizing Parallel Graph Programs [POPL’11] Elixir: a System for Synthesizing Concurrent Graph Programs [OOPSLA’12]

bevis
Download Presentation

Program Analysis and Synthesis of Parallel Systems

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Program Analysis and Synthesis of Parallel Systems Roman Manevich Ben-Gurion University

  2. Three papers A Shape Analysis for Optimizing Parallel Graph Programs[POPL’11] Elixir: a System for Synthesizing Concurrent Graph Programs[OOPSLA’12] Parameterized Verification of Transactional Memories[PLDI’10]

  3. What’s the connection? From analysisto language design A Shape Analysisfor Optimizing ParallelGraph Programs[POPL’11] Elixir: a System for SynthesizingConcurrent Graph Programs[OOPSLA’12] Creates opportunitiesfor more optimizations.Requires other analyses Similarities betweenabstract domains Parameterized Verificationof Transactional Memories[PLDI’10]

  4. What’s the connection? From analysisto language design A Shape Analysisfor Optimizing ParallelGraph Programs[POPL’11] Elixir: a System for SynthesizingConcurrent Graph Programs [OOPSLA’12] Creates opportunitiesfor more optimizations.Requires other analyses Similarities betweenabstract domains Parameterized Verificationof Transactional Memories [PLDI’10]

  5. A Shape Analysis for Optimizing Parallel Graph Programs Roman Manevich2 Kathryn S. McKinley1 Dimitrios Prountzos1 Keshav Pingali1,2 1: Department of Computer Science, The University of Texas at Austin 2: Institute for Computational Engineering and Sciences, The University of Texas at Austin

  6. Motivation • Graph algorithms are ubiquitous • Goal: Compiler analysis for optimizationof parallel graph algorithms Computer Graphics Computational biology Social Networks

  7. Minimum Spanning Tree Problem 7 1 6 c d e f 2 4 4 3 a b g 5

  8. Minimum Spanning Tree Problem 7 1 6 c d e f 2 4 4 3 a b g 5

  9. Boruvka’sMinimum Spanning Tree Algorithm 7 1 6 c d e f lt 2 4 4 3 1 6 a b g d e f 5 7 4 4 a,c b g 3 • Build MST bottom-up • repeat { • pick arbitrary node ‘a’ • merge with lightest neighbor ‘lt’ • add edge ‘a-lt’ to MST • } until graph is a single node

  10. Parallelism in Boruvka 7 1 6 c d e f 2 4 4 3 a b g 5 • Build MST bottom-up • repeat { • pick arbitrary node ‘a’ • merge with lightest neighbor ‘lt’ • add edge ‘a-lt’ to MST • } until graph is a single node

  11. Non-conflicting iterations 7 1 6 c d e f 2 4 4 3 a b g 5 • Build MST bottom-up • repeat { • pick arbitrary node ‘a’ • merge with lightest neighbor ‘lt’ • add edge ‘a-lt’ to MST • } until graph is a single node

  12. Non-conflicting iterations 1 6 f,g d e 7 4 a,c b 3 • Build MST bottom-up • repeat { • pick arbitrary node ‘a’ • merge with lightest neighbor ‘lt’ • add edge ‘a-lt’ to MST • } until graph is a single node

  13. Conflicting iterations 7 1 6 c d e f 2 4 4 3 a b g 5 • Build MST bottom-up • repeat { • pick arbitrary node ‘a’ • merge with lightest neighbor ‘lt’ • add edge ‘a-lt’ to MST • } until graph is a single node

  14. Optimistic parallelization in Galois i2 i1 i3 • Programming model • Client code has sequential semantics • Library of concurrent data structures • Parallel execution model • Thread-level speculation (TLS) • Activities executed speculatively • Conflict detection • Each node/edge has associated exclusive lock • Graph operations acquire locks on read/written nodes/edges • Lock owned by another thread  conflict  iteration rolled back • All locks released at the end • Two main overheads • Locking • Undo actions

  15. Generic optimization structure ProgramAnalyzer AnnotatedProgram Program ProgramTransformer OptimizedProgram

  16. Overheads(I): locking • Optimizations • Redundant locking elimination • Lock removal for iteration private data • Lock removal for lock domination • ACQ(P): set of definitely acquiredlocks per program point P • Given method call M at P: Locks(M)  ACQ(P)  Redundant Locking

  17. Overheads (II): undo actions foreach (Node a : wl) { Set<Node> aNghbrs = g.neighbors(a); Node lt = null; for (Node n : aNghbrs) { minW,lt = minWeightEdge((a,lt), (a,n)); } g.removeEdge(a, lt); Set<Node> ltNghbrs = g.neighbors(lt); for (Node n : ltNghbrs) { Edge e = g.getEdge(lt, n); Weight w = g.getEdgeData(e); Edge an = g.getEdge(a, n); if (an != null) { Weight wan = g.getEdgeData(an); if (wan.compareTo(w) < 0) w = wan; g.setEdgeData(an, w); } else { g.addEdge(a, n, w); } } g.removeNode(lt); mst.add(minW); wl.add(a); } foreach (Node a : wl) { … … } Lockset Grows Failsafe Lockset Stable … Program point Pis failsafe if: Q : Reaches(P,Q)  Locks(Q)  ACQ(P)

  18. Lockset analysis GSet<Node> wl = new GSet<Node>(); wl.addAll(g.getNodes()); GBag<Weight> mst = new GBag<Weight>(); foreach (Node a : wl) { Set<Node> aNghbrs = g.neighbors(a); Node lt = null; for (Node n : aNghbrs) { minW,lt = minWeightEdge((a,lt), (a,n)); } g.removeEdge(a, lt); Set<Node> ltNghbrs = g.neighbors(lt); for (Node n : ltNghbrs) { Edge e = g.getEdge(lt, n); Weight w = g.getEdgeData(e); Edge an = g.getEdge(a, n); if (an != null) { Weight wan = g.getEdgeData(an); if (wan.compareTo(w) < 0) w = wan; g.setEdgeData(an, w); } else { g.addEdge(a, n, w); } } g.removeNode(lt); mst.add(minW); wl.add(a); } • Redundant Locking • Locks(M)  ACQ(P) • Undo elimination • Q : Reaches(P,Q) Locks(Q)  ACQ(P) • Need to compute ACQ(P) : Runtime overhead

  19. The optimization technically • Each graph method m(arg1,…,argk, flag) contains optimization level flag • flag=LOCK – acquire locks • flag=UNDO – log undo (backup) data • flag=LOCK_UNO D – (default) acquire locks and log undo • flag=NONE – no extra work • Example:Edge e = g.getEdge(lt, n, NONE)

  20. Analysis challenges • The usual suspects: • Unbounded Memory  Undecidability • Aliasing, Destructive updates • Specific challenges: • Complex ADTs: unstructured graphs • Heap objects are locked • Adapt abstraction to ADTs • We use Abstract Interpretation [CC’77] • Balance precision and realistic performance

  21. Shape analysis overview Predicate Discovery Graph { @rep nodes @rep edges … } Set { @rep cont … } … … HashMap-Graph Set Spec Shape Analysis Graph Spec Tree-based Set Concrete ADT Implementations in Galois library ADTSpecifications Boruvka.java Optimized Boruvka.java

  22. ADT specification Abstract ADT state by virtual set fields Graph<ND,ED> { @rep set<Node> nodes @rep set<Edge> edges Set<Node> neighbors(Node n); } ... Set<Node> S1 = g.neighbors(n); ... @locks(n + n.rev(src) + n.rev(src).dst + n.rev(dst) + n.rev(dst).src) @op( nghbrs = n.rev(src).dst + n.rev(dst).src , ret = new Set<Node<ND>>(cont=nghbrs) ) Boruvka.java Graph Spec Assumption: Implementation satisfies Spec

  23. Modeling ADTs Graph<ND,ED> { @rep set<Node> nodes @rep set<Edge> edges @locks(n + n.rev(src) + n.rev(src).dst+n.rev(dst) + n.rev(dst).src) @op( nghbrs= n.rev(src).dst+ n.rev(dst).src , ret = new Set<Node<ND>>(cont=nghbrs) ) Set<Node> neighbors(Node n); } c src dst src dst dst src a b Graph Spec

  24. Modeling ADTs Abstract State Graph<ND,ED> { @rep set<Node> nodes @rep set<Edge> edges @locks(n + n.rev(src) + n.rev(src).dst+n.rev(dst) + n.rev(dst).src) @op( nghbrs= n.rev(src).dst+ n.rev(dst).src , ret = new Set<Node<ND>>(cont=nghbrs) ) Set<Node> neighbors(Node n); } nodes edges c src dst ret cont src dst dst src a b Graph Spec nghbrs

  25. Abstraction scheme S1 S2 L(S2.cont) L(S1.cont) (S1 ≠S2) ∧ L(S1.cont) ∧ L(S2.cont) cont cont  • Parameterized by set of LockPaths: L(Path)o . o ∊ Path  Locked(o) • Tracks subset of must-be-locked objects • Abstract domain elements have the form: Aliasing-configs 2LockPaths …

  26. Joining abstract states ( L(y.nd) )  (  L(y.nd) L(x.rev(src)) )  ( () L(x.nd) ) ( L(y.nd) )  ( () L(x.nd) ) ( L(y.nd) )  ( () L(x.nd) ) Aliasing is crucial for precision May-be-locked does not enable our optimizations #Aliasing-configs: small constant (6)

  27. Example invariant in Boruvka GSet<Node> wl = new GSet<Node>(); wl.addAll(g.getNodes()); GBag<Weight> mst = new GBag<Weight>(); foreach (Node a : wl) { Set<Node> aNghbrs = g.neighbors(a); Node lt = null; for (Node n : aNghbrs) { minW,lt = minWeightEdge((a,lt), (a,n)); } g.removeEdge(a, lt); Set<Node> ltNghbrs = g.neighbors(lt); for (Node n : ltNghbrs) { Edge e = g.getEdge(lt, n); Weight w = g.getEdgeData(e); Edge an = g.getEdge(a, n); if (an != null) { Weight wan = g.getEdgeData(an); if (wan.compareTo(w) < 0) w = wan; g.setEdgeData(an, w); } else { g.addEdge(a, n, w); } } g.removeNode(lt); mst.add(minW); wl.add(a); } • The immediate neighbors • of a and lt are locked lt a ( a ≠ lt ) ∧ L(a) ∧ L(a.rev(src)) ∧ L(a.rev(dst)) ∧ L(a.rev(src).dst)∧ L(a.rev(dst).src) ∧ L(lt) ∧ L(lt.rev(dst)) ∧ L(lt.rev(src)) ∧ L(lt.rev(dst).src) ∧ L(lt.rev(src).dst) …..

  28. Heuristics for finding LockPaths S Set<Node> cont Node nd NodeData • Hierarchy Summarization (HS) • x.( fld )* • Type hierarchy graph acyclic  bounded number of paths • Preflow-Push: • L(S.cont) ∧ L(S.cont.nd) • Nodes in set S and their data are locked

  29. Footprint graph heuristic • Footprint Graphs (FG)[Calcagno et al. SAS’07] • All acyclic paths from arguments of ADT method to locked objects • x.( fld | rev(fld) )* • Delaunay Mesh Refinement: L(S.cont) ∧ L(S.cont.rev(src)) ∧ L(S.cont.rev(dst)) ∧ L(S.cont.rev(src).dst) ∧ L(S.cont.rev(dst).src) • Nodes in set S and all of their immediate neighbors are locked • Composition of HS, FG • Preflow-Push: L(a.rev(src).ed) HS FG

  30. Experimental evaluation • Implement on top of TVLA • Encode abstraction by 3-Valued Shape Analysis [SRW TOPLAS’02] • Evaluation on 4 Lonestar Java benchmarks • Inferred all available optimizations • # abstract states practically linear in program size

  31. Impact of optimizations for 8 threads 8-core Intel Xeon @ 3.00 GHz

  32. Note 1 • How to map abstract domain presented so far to TVLA? • Example invariant: (x≠yL(y.nd))  (x=y L(x.nd)) • Unary abstraction predicate x(v) for pointer x • Unary non-abstraction predicateL[x.p] for pointer x and path p • Use partial join • Resulting abstraction similar to the one shown

  33. Note 2 • How to come up with abstraction for similar problems? • Start by constructing a manual proof • Hoare Logic • Examine resulting invariants and generalize into a language of formulas • May need to be further specialized for a given program – interesting problem (machine learning/refinement) • How to get sound transformers?

  34. Note 3 How did we avoid considering all interleavings? Proved non-interference side theorem

  35. Elixir : A System for Synthesizing Concurrent Graph Programs Dimitrios Prountzos1 Roman Manevich2 Keshav Pingali1 1. The University of Texas at Austin 2. Ben-Gurion University of the Negev

  36. Goal • Graph algorithms are ubiquitous Social network analysis, Computer graphics, Machine learning, … • Difficult to parallelize due to their irregular nature • Best algorithm and implementation usually • Platform dependent • Input dependent • Need to easily experiment with different solutions • Focus: Fixed graph structure • Only change labels on nodes and edges • Each activity touches a fixed number of nodes Allow programmer to easilyimplement correct and efficient parallel graph algorithms

  37. Example: Single-Source Shortest-Path S 5 2 A B A 2 1 7 C C 3 4 3 12 D E 2 2 F 9 1 G if dist(A) + WAC < dist(C) dist(C) = dist(A) + WAC • Problem Formulation • Compute shortest distancefrom source node Sto every other node • Many algorithms • Bellman-Ford (1957) • Dijkstra (1959) • Chaotic relaxation (Miranker 1969) • Delta-stepping (Meyer et al. 1998) • Common structure • Each node has label distwith knownshortest distance from S • Key operation • relax-edge(u,v)

  38. Dijkstra’s algorithm <B,5> <C,3> <B,5> <E,6> <B,5> <D,7> S 5 2 A B 5 3 1 7 C 3 4 D E 7 2 2 6 F 9 1 G Scheduling of relaxations: • Use priority queueof nodes, ordered by label dist • Iterate over nodes u in priority order • On each step: relax all neighbors v of u • Apply relax-edgeto all (u,v)

  39. Chaotic relaxation S 5 2 • Scheduling of relaxations: • Use unordered set of edges • Iterate over edges (u,v) in any order • On each step: • Apply relax-edge to edge (u,v) A B 5 1 7 C 3 4 12 D E 2 2 F 9 1 G (C,D) (B,C) (S,A) (C,E)

  40. Insights behind Elixir Parallel Graph Algorithm What should be done How it should be done Operators Schedule Unordered/Ordered algorithms Order activity processing Identify new activities Operator Delta “TAO of parallelism” PLDI 2011 : activity Static Schedule Dynamic Schedule

  41. Insights behind Elixir Parallel Graph Algorithm q = new PrQueue q.enqueue(SRC) while (! q.empty ) { a = q.dequeue for each e = (a,b,w) { if dist(a) + w < dist(b) { dist(b) = dist(a) + w q.enqueue(b) } } } Operators Schedule Order activity processing Identify new activities Static Schedule Dynamic Schedule Dijkstra-style Algorithm

  42. Contributions Parallel Graph Algorithm • Language • Operators/Schedule separation • Allows exploration of implementation space • Operator Delta Inference • Precise Delta required for efficient fixpoint computations • Automatic Parallelization • Inserts synchronization to atomically execute operators • Avoids data-races / deadlocks • Specializes parallelization based on scheduling constraints Operators Schedule Order activity processing Identify new activities Static Schedule Dynamic Schedule Synchronization

  43. SSSP in Elixir Graph [ nodes(node : Node, dist : int) edges(src : Node, dst : Node, wt : int)] Graph type relax = [ nodes(node a, dist ad) nodes(node b, distbd) edges(src a, dst b, wt w)bd> ad + w ] ➔ [ bd = ad + w ] Operator Fixpoint Statement sssp = iterate relax ≫ schedule

  44. Operators Cautiousby construction – easy to generalize Graph [ nodes(node : Node, dist : int) edges(src : Node, dst : Node, wt : int)] relax = [ nodes(node a, dist ad) nodes(node b, distbd) edges(src a, dst b, wt w)bd> ad + w ] ➔ [ bd = ad + w ] Redex pattern Guard Update sssp = iterate relax ≫ schedule ad bd ad ad+w w w a b a b if bd > ad + w

  45. Fixpoint statement Graph [ nodes(node : Node, dist : int) edges(src : Node, dst : Node, wt : int)] relax = [ nodes(node a, dist ad) nodes(node b, distbd) edges(src a, dst b, wt w)bd > ad + w ] ➔ [ bd = ad + w ] sssp = iterate relax ≫ schedule Scheduling expression Apply operator until fixpoint

  46. Scheduling examples q = new PrQueue q.enqueue(SRC) while (! q.empty ) { a = q.dequeue for each e = (a,b,w) { if dist(a) + w < dist(b) { dist(b) = dist(a) + w q.enqueue(b) } } } Graph [ nodes(node : Node, dist : int) edges(src : Node, dst : Node, wt : int)] relax = [ nodes(node a, dist ad) nodes(node b, distbd) edges(src a, dst b, wt w)bd > ad + w ] ➔ [ bd = ad + w ] sssp = iterate relax ≫ schedule Locality enhanced Label-correcting group b ≫unroll 2 ≫approx metric ad Dijkstra-style metric ad ≫group b

  47. Operator Delta Inference Parallel Graph Algorithm Operators Schedule Order activity processing Identify new activities Static Schedule Dynamic Schedule

  48. Identifying the delta of an operator ? b relax1 ? a

  49. Delta Inference Example c relax2 w2 a b w1 SMT Solver relax1 assume(da + w1< db) assume¬(dc + w2 < db) db_post =da + w1 assert¬(dc + w2 < db_post) SMT Solver (c,b) does not become active Query Program

  50. Delta inference example – active Apply relax on all outgoing edges (b,c) such that: dc > db +w2 and c ≄ a relax1 relax2 a b c w1 w2 SMT Solver assume(da + w1< db) assume¬(db+ w2 < dc) db_post =da + w1 assert¬(db_post+ w2< dc) SMT Solver Query Program

More Related