1 / 37

Multithreaded Algorithms for Graph Coloring

Multithreaded Algorithms for Graph Coloring. Alex Pothen Purdue University CSCAPES Institute www.cs.purdue.edu/homes/apothen/ Assefaw Gebremedhin , Mahantesh Halappanavar (PNNL), John Feo (PNNL), Umit Catalyurek (Ohio State) CSC’11 Workshop May 2011. References.

apria
Download Presentation

Multithreaded Algorithms for Graph Coloring

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Multithreaded Algorithms for Graph Coloring Alex Pothen Purdue University CSCAPES Institute www.cs.purdue.edu/homes/apothen/ AssefawGebremedhin, MahanteshHalappanavar (PNNL), John Feo (PNNL), UmitCatalyurek (Ohio State) CSC’11 Workshop May 2011

  2. References • Multithreaded algorithms for graph coloring. Catalyurek, Feo, Gebremedhin, Halappanavar and Pothen, 40 pp., Submitted to Parallel Computing. • New multithreaded ordering and coloring algorithms formulticore architectures. Patwary, Gebremedhin, Pothen, 12pp., EuroPar 2011. • Graph coloring for derivative computation and beyond: Algorithms, software and analysis. Gebremedhin, Nguyen, Pothen, and Patwary, 32 pp., Submitted to TOMS. • Distributed Memory Parallel Algorithms for Matching and Coloring. Catalyurek, Dobrian, Gebremedhin, Halappanavar, and Pothen, 10pp., IPDPS Workshop PCO, 2011.

  3. Algorithm Thread Granularity, Synchronization Concurrency Performance Architecture Graph Latency Tolerance

  4. Outline • The many-core and multi-threaded world • Intel Nehalem • Sun Niagara • Cray XMT • A case study on multithreaded graph coloring • An Iterative and Speculative Coloring Algorithm • A Dataflow algorithm • RMAT graphs: ER, G, and B • Experimental results • Conclusions

  5. Architectural Features • B = max back degree • over entire seq. • B+1 colors suffice • to color G. O(|E|)-time implementations possible for all four

  6. Multithreaded: Iterative Greedy Algorithm Forbidden Colors v v V ci

  7. Multi-threaded: Data Flow Algorithm

  8. Multi-threaded: Data Flow Algorithm ci Forbidden Colors v v V

  9. RMAT Graphs • R-MAT: Recursive MATrix method • Experiments • RMAT-ER (0.25, 0.25, 0.25, 0.25) • RMAT-G (0.45, 0.15, 0.15, 0.25) • RMAT-B (0.55, 0.15, 0.15, 0.15) • Chakrabarti, D. and Faloutsos, C. 2006. Graph mining: Laws, generators, and algorithms. ACM Comput. Surv. 38, 1.

  10. RMAT Graphs

  11. Nehalem: Strong Scaling (Niagara) RMAT-ER RMAT-G RMAT-B

  12. Cray XMT: Strong and Weak Scaling Iter-G Iter-B DF-G DF-B

  13. Comparing Three Platforms a) ER c) Good e) Bad

  14. No. Colors in Parallel Algorithms b) G a) ER c) B

  15. Computing SL Orderings in Parallel: RMAT-G graphs (Nehalem) SL Ordering Relaxed SL Ordering

  16. Our contributions: Multithreaded Coloring • Massive multithreading • Can tolerate memory latency for graphs/sparse matrices • Dataflow algorithms easier to implement than distributed memory versions • Thread concurrency ameliorates lack of caches, and lower clock speeds • Thread parallelism can be exploited at fine grain if supported by lightweight synchronization • Graph structure critically influences performance • Many-core machines • Developed an iterative algorithm for greedy coloring (distance-1 and -2) and ordering algorithms that port to different machines • Simultaneous multithreading can hide latency (X threads on 1 core vs. 1 thread on X cores) • Decomposition into tasks at a finer grain than distributed-memory version, andrelax synchronization to enhance concurrency • Will form nodes of Peta- and Exa-scale machines, so single node performance studies are needed

  17. Multi-threaded Parallelism

  18. Time Figure from Robert Golla, Sun • Memory access times determine performance • By issuing multiple threads, mask memory latency if a ready thread is available when a functional unit becomes free • Interleaved vs. Simultaneous multithreading (IMT or SMT)

  19. Multi-core: Sun Niagara 2 • Two 8-core sockets, • 8 hw threads per core • 1.2 GHz processors linked by 8 x 9 crossbar to L2 cache banks • Simultaneous multithreading • Two threads from a core can be issued in a cycle • Shallow pipeline

  20. Multicore: Intel Nehalem • Two quad-core sockets, 2.5 GHz • Two hyperthreads per core support SMT • Off chip-data latency 106 cycles • Advanced architectural features: • Cache coherence protocol to reduce traffic, loop-stream detection, improved branch prediction, • out-of-order execution

  21. Massive Multithreading: Cray XMT • Latency tolerance via massive multi-threading • Context switch between threads in a single clock cycle • Global address space, hashed to memory banks to reduce hot-spots • No cache or local memory, average latency 600 cycles • Memory request doesn’t stall processor • Other threads work while the request is fulfilled • Light-weight, word-level synchr. (full/empty bits) • Notes: • 500 MHz clock • 128 Hardware thread streams/proc., • Interleaved multithreading

  22. Multithreaded Algorithms for Graph Coloring • We developed two kinds of multithreaded algorithms for graph coloring: • An iterative, coarse-grained method for generic shared-memory architectures • A dataflow algorithm designed for massively multithreaded architectures with hardware support for fine-grain synchronization, such as the Cray XMT • Benchmarked the algorithms on three systems: • Cray XMT, Sun Niagara 2 andIntel Nehalem • Excellent speedup observed on all three platforms

  23. Coloring Algorithms

  24. Greedy coloring algorithms • Distance-k, star, and acyclic coloring are NP-hard • Approximating coloring to within O(n1-e) is NP-hard for any e>0 GREEDY(G=(V,E)) Order the vertices in V for i = 1 to |V| do Determine colors forbidden to vi Assign vi the smallest permissible color end-for • A greedy heuristic usually gives a near-optimal solution • The key is to find good orderingsfor coloring, and many have been developed Ref: Gebremedhin, Tarafdar, Manne, Pothen, SIAM J. Sci. Compt. 29:1042--1072, 2007.

  25. Distance-1Coloring, Greedy Alg. a v a v

  26. Many-core greedy coloring • Given a graph, parallelize greedy coloring on many-core machines such that Speedup is attained, and Number of colors is roughly same as in serial • Difficult task since greedy is inherently sequential, computation small relative to communication, and data accesses are irregular • D1 coloring: Approaches based on Luby’s parallel algorithm for maximal independent set had limited success • Gebremedhin and Manne (2000) developed a parallel greedy coloring algorithm on shared memory machines • Uses speculative coloring to enhance concurrency, randomized partitioning to reduce conflicts, and serial conflict resolution • Number of conflicts bounded, so this approach yields an effective algorithm • Extended to distance-2 coloring by G, M and P (2002) • We adapt this approach to implement the greedy algorithm for many-core computing

  27. Parallel Coloring

  28. Parallel Coloring: Speculation w a v w a v

  29. Experimental results Dataflow Iterative Cray XMT: RMAT-G with 224, …, 227 vertices and 134M, …, 1B edges

  30. Experimental results Niagara 2 Iterative Perf. With doubling threads on a core = Doubling cores!

  31. Experimental results RMAT-G with 224 = 16M vertices and 134M edges RMAT-B, 224 vertices,134M edges All Platforms

  32. Iterative Greedy Coloring: Multithreaded Algorithm Adj(v), color(w), forbidden(v): d(v) reads each forbidden(v): d(v) writes Adj(v), color(w): d(v) reads each

  33. Experimental results RMAT-G with 224 = 16M vertices and 134M edges All Platforms

  34. Tentative Conclusions, Future Work

  35. Future Plans: Multithreaded Coloring • Massive multithreading • Microbechmarking to understand where the cycles go: thread management, data accesses, synchronization, instruction scheduling, function unit limitations… • Develop a performance model of the computation • Experiment with other graph classes • Consider new algorithmic paradigms • Many-core machines • Four items as above • Ordering for coloring: Archetype of a problem for computing a sequential ordering in a parallel environment (MostofaPatwary and AssefawGebremedhin) • Extend to nodes of Peta-scale machines, so single node performance is enhanced, and complete our work on the Blue Gene and the Cray XT5

  36. Thanks • Rob Bisseling, Erik Boman,ÜmitÇatalürek, Karen Devine, Florin Dobrian, John Feo, AssefawGebremedhin,MahanteshHalappanavar, Bruce Hendrickson, Paul Hovland, Gary Kumfert, Fredrik Manne, AliPınar, Sivan Toledo, Jean Utke

  37. Further readingwww.cscapes.org • Gebremedhin and Manne, Scalable parallel graph coloring algorithms,Concurrency: Practice and Experience, 12: 1131-1146, 2000. • Gebremedhin, Manne and Pothen, Parallel distance-k coloring algorithms for numerical optimization, Lecture Notes in Computer Science, 2400: 912-921, 2002. • Bozdag, Gebremedhin, Manne, Boman and Catalyurek. A framework for scalable greedy coloring on distributed-memory parallel computers. J. Parallel Distrib. Comput. 68(4):515-535, 2008. • Catalyurek, Feo, Gebremedhin, Halappanavar and Pothen, Multi-threaded algorithms for graph coloring, Preprint, Aug. 2010.

More Related