Better Speedups for Parallel Max-Flow

Presentation Transcript


1. Better Speedups for Parallel Max-Flow
George C. Caragea, Uzi Vishkin
Dept. of Computer Science, University of Maryland, College Park, USA
June 4th, 2011

2. Experience with an Easy-to-Program Parallel Architecture
• XMT (eXplicit Multi-Threading) platform
  • Design goal: an easy-to-program many-core architecture
  • PRAM-based design, PRAM-On-Chip programming
  • Ease of programming demonstrated by order-of-magnitude ease of teaching/learning
  • 64-processor hardware, compiler, 20+ papers, 9 graduate degrees, 6 US patents
  • Only one previous single-application paper (Dascal et al., 1999)
• Parallel Max-Flow results:
  • [IPDPS 2010]: 2.5x speedup vs. serial, using CUDA
  • [Caragea and Vishkin, SPAA 2011]: up to 108.3x speedup vs. serial, using XMT (a 3-page paper)

3. How to Publish Application Papers on an Easy-to-Program Platform?
• The reward game is skewed:
  • It is easier to publish on "hard-to-program" platforms (remember STI Cell?)
  • Application papers for easy-to-program architectures are considered "boring", even when they show good results
• Recipe for academic publication:
  • Take a simple application (e.g., Breadth-First Search on a graph)
  • Implement it on the latest (difficult-to-program) parallel architecture
  • Discuss the challenges and work-arounds

4. Parallel Programming Today
• Current parallel programming:
  • High-friction navigation, by implementation [walk/crawl]
  • Initial program (1 week), then trial-and-error tuning begins (½ year; architecture dependent)
• PRAM-On-Chip programming:
  • Low-friction navigation: mental design and analysis [fly]
  • No need to crawl: identify the most efficient algorithm, then advance to an efficient implementation

5. PRAM-On-Chip Programming
• A high-school student comparing parallel programming approaches:
  "I was motivated to solve all the XMT programming assignments we got, since I had to cope with solving the algorithmic problems themselves, which I enjoy doing. In contrast, I did not see the point of programming other parallel systems available to us at school, since too much of the programming was effort getting around the way the systems were engineered, and this was not fun."

6. Maximum Flow in Networks
• An extensively studied problem, with numerous algorithms and implementations (for general graphs)
• Application domains: network analysis, airline scheduling, image processing, DNA sequence alignment
• Parallel Max-Flow algorithms and implementations:
  • The paper gives an overview, covering SMPs and GPUs
  • Good speedups vs. serial are difficult to obtain, e.g., 2.5x for a hybrid CPU-GPU solution

7. XMT Max-Flow Parallel Solution
• First stage: identify/design a parallel algorithm
  • [Shiloach, Vishkin 1982] designed an O(n² log n)-time, O(nm)-space PRAM algorithm
  • [Goldberg, Tarjan 1988] introduced distance labels into S-V: the Push-Relabel algorithm, with O(m) space complexity
  • [Anderson, Setubal 1992] observed poor practical performance for G-T and augmented it with an S-V-inspired Global Relabeling heuristic
  • Solution: a hybrid SV-GT PRAM algorithm
• Second stage: write the PRAM-On-Chip implementation (sketched below)
  • Relax PRAM lock-step synchrony by grouping several PRAM steps into one XMT spawn block
  • Insert synchronization points (barriers) where needed for correctness
  • Maintain an active node set instead of polling all graph nodes for work
  • Use hardware-supported atomic operations to simplify reductions
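To make the second-stage items concrete, here is a minimal C11 sketch of one parallel "pulse" of Push-Relabel over an active node set. This is not the paper's XMTC code: the names (Graph, discharge_pulse) are hypothetical, atomic fetch-and-add stands in for XMT's hardware prefix-sum (ps) primitive, and source/sink handling, global relabeling, duplicate suppression in the work list, and termination are all omitted.

```c
#include <stdatomic.h>

typedef struct {
    int n;                     /* number of nodes                         */
    const int *head;           /* CSR row offsets, length n+1             */
    const int *adj;            /* arc targets                             */
    const int *rev;            /* rev[e] = index of the reverse arc of e  */
    atomic_int *cap;           /* residual capacity per arc               */
    atomic_int *excess;       /* excess flow per node                     */
    int *height;               /* distance labels                         */
} Graph;

/* Process every node in `active`, appending newly activated nodes to
 * `next`.  On XMT the outer loop body would be a spawn block with one
 * virtual thread per active node; a plain for-loop keeps the sketch
 * portable C. */
void discharge_pulse(Graph *g, const int *active, int n_active,
                     int *next, atomic_int *next_len)
{
    for (int i = 0; i < n_active; i++) {
        int u = active[i];
        for (int e = g->head[u]; e < g->head[u + 1]; e++) {
            int v = g->adj[e];
            if (g->height[u] != g->height[v] + 1)
                continue;                        /* arc not admissible */
            int delta = atomic_load(&g->excess[u]);
            int c = atomic_load(&g->cap[e]);
            if (c < delta) delta = c;
            if (delta <= 0) continue;
            /* Push delta along (u,v): shrink the forward residual arc,
               credit the reverse arc, and move the excess. */
            atomic_fetch_sub(&g->cap[e], delta);
            atomic_fetch_add(&g->cap[g->rev[e]], delta);
            atomic_fetch_sub(&g->excess[u], delta);
            /* If v had no excess before, it just became active: grab a
               slot in the next work list with fetch-and-add, the role
               the ps (prefix-sum) primitive plays on XMT. */
            if (atomic_fetch_add(&g->excess[v], delta) == 0)
                next[atomic_fetch_add(next_len, 1)] = v;
        }
        if (atomic_load(&g->excess[u]) > 0) {
            /* Simplified relabel: raising the label by 1 when no
               admissible arc remains preserves label validity, at the
               cost of extra pulses. */
            g->height[u]++;
            next[atomic_fetch_add(next_len, 1)] = u;  /* u stays active */
        }
    }
}
```

Each call processes one pulse; the caller swaps `active` and `next` and repeats until the work list is empty, with the barrier between pulses corresponding to the implicit join at the end of an XMT spawn block.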

8. Input Graph Families
• Performance is highly dependent on the structure of the graph
• Graph structures proposed in the DIMACS challenge [DIMACS90] are used, as in virtually every Max-Flow publication (a reader for the DIMACS input format is sketched below)
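For reference, the DIMACS max-flow input format consists of "c" comment lines, one "p max <nodes> <arcs>" problem line, "n <id> s" and "n <id> t" lines naming the source and sink, and "a <u> <v> <cap>" arc lines. Below is a minimal C reader for that format; the Arc type and read_dimacs_max name are made up for illustration, error handling is rudimentary, and a well-formed file (problem line before any arcs) is assumed.

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct { int u, v, cap; } Arc;

/* Read a DIMACS max-flow file into a flat arc array (not the CSR layout
 * a real solver needs).  Returns 0 on success, -1 on a malformed file. */
int read_dimacs_max(FILE *f, Arc **arcs, int *n, int *m, int *s, int *t)
{
    char line[256], kind;
    int u, v, cap, next = 0;
    *arcs = NULL;
    while (fgets(line, sizeof line, f)) {
        switch (line[0]) {
        case 'c':                                 /* comment line        */
            break;
        case 'p':                                 /* "p max <n> <m>"     */
            if (sscanf(line, "p max %d %d", n, m) != 2) return -1;
            *arcs = malloc((size_t)*m * sizeof(Arc));
            break;
        case 'n':                                 /* "n <id> s" / "n <id> t" */
            if (sscanf(line, "n %d %c", &u, &kind) != 2) return -1;
            if (kind == 's') *s = u; else *t = u;
            break;
        case 'a':                                 /* "a <u> <v> <cap>"   */
            if (sscanf(line, "a %d %d %d", &u, &v, &cap) != 3) return -1;
            (*arcs)[next++] = (Arc){ u, v, cap };
            break;
        }
    }
    return next == *m ? 0 : -1;                   /* saw every arc?      */
}
```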

9. Speed-Up Results
• Compared against the "best serial implementation", running on a recent x86 processor [Goldberg 2006]
• Clock-cycle-count speedups (see the helper below) on two XMT configurations:
  • XMT.64: 64-core FPGA prototype
  • XMT.1024: 1024 cores, on the cycle-accurate simulator XMTSim
• Speedups of 1.56x to 108.3x for XMT.1024
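As a note on the metric: a clock-cycle-count speedup divides the cycle count of the serial baseline by that of the parallel run, which is presumably why cycle counts rather than wall-clock times are compared across the FPGA prototype, the simulator, and the x86 baseline. A trivial helper, with a hypothetical name:

```c
/* Hypothetical helper: clock-cycle-count speedup is the ratio of the
 * serial baseline's cycles to the parallel run's cycles. */
double cycle_speedup(unsigned long long serial_cycles,
                     unsigned long long parallel_cycles)
{
    return (double)serial_cycles / (double)parallel_cycles;
}
```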

10. Conclusion
• XMT aims at being an easy-to-program, general-purpose architecture:
  • Performance improvements on hard-to-parallelize applications like Max-Flow
  • Ease of programming, shown by order-of-magnitude improvement in ease of teaching/learning: difficult speedups achieved at a much earlier developmental stage (10th graders in high school versus graduate students)
  • Evidence: UCSB/UMD experiment; middle school, magnet HS, and inner-city HS; freshmen course; UIUC/UMD experiment: J. Sys. & SW 2008, SIGCSE 2010, EduPar 2011
• Current stage of the XMT project: develop more complex applications beyond benchmarks
  • Max-Flow is a step in that direction; more are needed
• Without an easy-to-program many-core architecture, rejection of parallelism by mainstream programmers is all but certain
• Affirmative action: drive more researchers to work on, and seek publications for, easy-to-program architectures; this work should not be dismissed as "too easy"

Thank you!
