Better Speedups for Parallel Max-Flow

Presentation Transcript


1. Better Speedups for Parallel Max-Flow
George C. Caragea, Uzi Vishkin
Dept. of Computer Science, University of Maryland, College Park, USA
June 4th, 2011

2. Experience with an Easy-to-Program Parallel Architecture
• XMT (eXplicit Multi-Threading) platform
  • Design goal: an easy-to-program many-core architecture
  • PRAM-based design, PRAM-On-Chip programming
  • Ease of programming demonstrated by order-of-magnitude ease of teaching/learning
  • 64-processor hardware, compiler, 20+ papers, 9 graduate degrees, 6 US patents
  • Only one previous single-application paper (Dascal et al., 1999)
• Parallel Max-Flow results:
  • [IPDPS 2010]: 2.5x speedup vs. serial, using CUDA
  • [Caragea and Vishkin, SPAA 2011]: up to 108.3x speedup vs. serial, using XMT (a 3-page paper)

3. How to Publish Application Papers on an Easy-to-Program Platform?
• The reward game is skewed:
  • It is easier to publish on "hard-to-program" platforms (remember STI Cell?)
  • Application papers for easy-to-program architectures are considered "boring", even when they show good results
• Recipe for academic publication:
  • Take a simple application (e.g., Breadth-First Search on a graph)
  • Implement it on the latest (difficult-to-program) parallel architecture
  • Discuss the challenges and work-arounds

4. Parallel Programming Today
• Current parallel programming:
  • High-friction navigation, by implementation [walk/crawl]
  • Initial program (1 week), then trial-and-error tuning begins (½ year; architecture dependent)
• PRAM-On-Chip programming:
  • Low-friction navigation: mental design and analysis [fly]
  • No need to crawl: identify the most efficient algorithm, then advance to an efficient implementation

5. PRAM-On-Chip Programming
• A high-school student comparing parallel programming approaches:
  "I was motivated to solve all the XMT programming assignments we got, since I had to cope with solving the algorithmic problems themselves, which I enjoy doing. In contrast, I did not see the point of programming other parallel systems available to us at school, since too much of the programming was effort getting around the way the systems were engineered, and this was not fun."

6. Maximum Flow in Networks
• An extensively studied problem, with numerous algorithms and implementations (for general graphs)
• Application domains: network analysis, airline scheduling, image processing, DNA sequence alignment
• Parallel Max-Flow algorithms and implementations:
  • The paper gives an overview, covering SMPs and GPUs
  • Good speedups vs. serial are difficult to obtain, e.g., 2.5x for a hybrid CPU-GPU solution

7. XMT Max-Flow Parallel Solution
• First stage: identify/design a parallel algorithm
  • [Shiloach, Vishkin 1982] designed an O(n² log n)-time, O(nm)-space PRAM algorithm
  • [Goldberg, Tarjan 1988] introduced distance labels into S-V: the Push-Relabel algorithm, with O(m) space complexity
  • [Anderson, Setubal 1992] observed poor practical performance for G-T and augmented it with an S-V-inspired Global Relabeling heuristic
  • Solution: a hybrid SV-GT PRAM algorithm
• Second stage: write the PRAM-On-Chip implementation (sketched below)
  • Relax PRAM lock-step synchrony by grouping several PRAM steps into one XMT spawn block
  • Insert synchronization points (barriers) where needed for correctness
  • Maintain an active node set instead of polling all graph nodes for work
  • Use hardware-supported atomic operations to simplify reductions
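To make the second-stage items concrete, here is a minimal C11 sketch of one parallel "pulse" of Push-Relabel over an active node set. This is not the paper's XMTC code: the names (Graph, discharge_pulse) are hypothetical, atomic fetch-and-add stands in for XMT's hardware prefix-sum (ps) primitive, and source/sink handling, global relabeling, duplicate suppression in the work list, and termination are all omitted.

```c
#include <stdatomic.h>

typedef struct {
    int n;                     /* number of nodes                         */
    const int *head;           /* CSR row offsets, length n+1             */
    const int *adj;            /* arc targets                             */
    const int *rev;            /* rev[e] = index of the reverse arc of e  */
    atomic_int *cap;           /* residual capacity per arc               */
    atomic_int *excess;       /* excess flow per node                     */
    int *height;               /* distance labels                         */
} Graph;

/* Process every node in `active`, appending newly activated nodes to
 * `next`.  On XMT the outer loop body would be a spawn block with one
 * virtual thread per active node; a plain for-loop keeps the sketch
 * portable C. */
void discharge_pulse(Graph *g, const int *active, int n_active,
                     int *next, atomic_int *next_len)
{
    for (int i = 0; i < n_active; i++) {
        int u = active[i];
        for (int e = g->head[u]; e < g->head[u + 1]; e++) {
            int v = g->adj[e];
            if (g->height[u] != g->height[v] + 1)
                continue;                        /* arc not admissible */
            int delta = atomic_load(&g->excess[u]);
            int c = atomic_load(&g->cap[e]);
            if (c < delta) delta = c;
            if (delta <= 0) continue;
            /* Push delta along (u,v): shrink the forward residual arc,
               credit the reverse arc, and move the excess. */
            atomic_fetch_sub(&g->cap[e], delta);
            atomic_fetch_add(&g->cap[g->rev[e]], delta);
            atomic_fetch_sub(&g->excess[u], delta);
            /* If v had no excess before, it just became active: grab a
               slot in the next work list with fetch-and-add, the role
               the ps (prefix-sum) primitive plays on XMT. */
            if (atomic_fetch_add(&g->excess[v], delta) == 0)
                next[atomic_fetch_add(next_len, 1)] = v;
        }
        if (atomic_load(&g->excess[u]) > 0) {
            /* Simplified relabel: raising the label by 1 when no
               admissible arc remains preserves label validity, at the
               cost of extra pulses. */
            g->height[u]++;
            next[atomic_fetch_add(next_len, 1)] = u;  /* u stays active */
        }
    }
}
```

Each call processes one pulse; the caller swaps `active` and `next` and repeats until the work list is empty, with the barrier between pulses corresponding to the implicit join at the end of an XMT spawn block.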

8. Input Graph Families
• Performance is highly dependent on the structure of the graph
• Graph structures proposed in the DIMACS challenge [DIMACS90] are used, as in virtually every Max-Flow publication (a reader for the DIMACS input format is sketched below)
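For reference, the DIMACS max-flow input format consists of "c" comment lines, one "p max <nodes> <arcs>" problem line, "n <id> s" and "n <id> t" lines naming the source and sink, and "a <u> <v> <cap>" arc lines. Below is a minimal C reader for that format; the Arc type and read_dimacs_max name are made up for illustration, error handling is rudimentary, and a well-formed file (problem line before any arcs) is assumed.

```c
#include <stdio.h>
#include <stdlib.h>

typedef struct { int u, v, cap; } Arc;

/* Read a DIMACS max-flow file into a flat arc array (not the CSR layout
 * a real solver needs).  Returns 0 on success, -1 on a malformed file. */
int read_dimacs_max(FILE *f, Arc **arcs, int *n, int *m, int *s, int *t)
{
    char line[256], kind;
    int u, v, cap, next = 0;
    *arcs = NULL;
    while (fgets(line, sizeof line, f)) {
        switch (line[0]) {
        case 'c':                                 /* comment line        */
            break;
        case 'p':                                 /* "p max <n> <m>"     */
            if (sscanf(line, "p max %d %d", n, m) != 2) return -1;
            *arcs = malloc((size_t)*m * sizeof(Arc));
            break;
        case 'n':                                 /* "n <id> s" / "n <id> t" */
            if (sscanf(line, "n %d %c", &u, &kind) != 2) return -1;
            if (kind == 's') *s = u; else *t = u;
            break;
        case 'a':                                 /* "a <u> <v> <cap>"   */
            if (sscanf(line, "a %d %d %d", &u, &v, &cap) != 3) return -1;
            (*arcs)[next++] = (Arc){ u, v, cap };
            break;
        }
    }
    return next == *m ? 0 : -1;                   /* saw every arc?      */
}
```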

9. Speed-Up Results
• Compared against the "best serial implementation", running on a recent x86 processor [Goldberg 2006]
• Clock-cycle-count speedups (see the helper below) on two XMT configurations:
  • XMT.64: 64-core FPGA prototype
  • XMT.1024: 1024 cores, on the cycle-accurate simulator XMTSim
• Speedups of 1.56x to 108.3x for XMT.1024
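As a note on the metric: a clock-cycle-count speedup divides the cycle count of the serial baseline by that of the parallel run, which is presumably why cycle counts rather than wall-clock times are compared across the FPGA prototype, the simulator, and the x86 baseline. A trivial helper, with a hypothetical name:

```c
/* Hypothetical helper: clock-cycle-count speedup is the ratio of the
 * serial baseline's cycles to the parallel run's cycles. */
double cycle_speedup(unsigned long long serial_cycles,
                     unsigned long long parallel_cycles)
{
    return (double)serial_cycles / (double)parallel_cycles;
}
```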

10. Conclusion
• XMT aims at being an easy-to-program, general-purpose architecture:
  • Performance improvements on hard-to-parallelize applications like Max-Flow
  • Ease of programming, shown by order-of-magnitude improvement in ease of teaching/learning: difficult speedups achieved at a much earlier developmental stage (10th graders in high school versus graduate students)
  • Evidence: UCSB/UMD experiment; middle school, magnet HS, and inner-city HS; freshmen course; UIUC/UMD experiment: J. Sys. & SW 2008, SIGCSE 2010, EduPar 2011
• Current stage of the XMT project: develop more complex applications beyond benchmarks
  • Max-Flow is a step in that direction; more are needed
• Without an easy-to-program many-core architecture, rejection of parallelism by mainstream programmers is all but certain
• Affirmative action: drive more researchers to work on, and seek publications for, easy-to-program architectures; this work should not be dismissed as "too easy"

Thank you!
