A Scalable Heterogeneous Parallelization Framework for Iterative Local Searches

Martin Burtscher¹ and Hassan Rabeti²
¹ Department of Computer Science, Texas State University-San Marcos
² Department of Mathematics, Texas State University-San Marcos

Problem: HPC is Hard to Exploit
• HPC application writers are domain experts
  • They are not typically computer scientists and have little or no formal education in parallel programming
• Parallel programming is difficult and error prone
• Modern HPC systems are complex
  • Consist of interconnected compute nodes with multiple CPUs and one or more GPUs per node
  • Require parallelization at multiple levels (inter-node, intra-node, and accelerator) for best performance

Target Area: Iterative Local Searches
• Important application domain
  • Widely used in engineering & real-time environments
• Examples
  • All sorts of random restart greedy algorithms
  • Ant colony optimization, Monte Carlo, n-opt hill climbing, etc.
• ILS properties
  • Iteratively produce better solutions
  • Can exploit large amounts of parallelism
  • Often have exponential search space

Our Solution: ILCS Framework
• Iterative Local Champion Search (ILCS) framework
  • Supports non-random restart heuristics
    • Genetic algorithms, tabu search, particle swarm optimization, etc.
  • Simplifies implementation of ILS on parallel systems
• Design goal
  • Ease of use and scalability
• Framework benefits
  • Handles threading, communication, locking, resource allocation, heterogeneity, load balance, termination decision, and result recording (checkpointing)

User Interface
• User writes 3 serial C functions and/or 3 single-GPU CUDA functions, with some restrictions:
    size_t CPU_Init(int argc, char *argv[]);
    void CPU_Exec(long seed, void const *champion, void *result);
    void CPU_Output(void const *champion);
• See paper for GPU interface and sample code
• Framework runs Exec (map) functions in parallel
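
To make the three-function interface concrete, here is a minimal sketch for a toy problem. Only the three signatures come from the slide. The `ToyResult` layout, the reading of `CPU_Init`'s return value as the result size in bytes, the "lower cost wins" rule, and the serial `toy_driver` loop standing in for the framework are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical result layout; the slide does not specify one. */
typedef struct {
    double cost;  /* assumption: lower is better */
    long   seed;  /* seed that produced this candidate */
} ToyResult;

/* Assumption: the return value is the size of one result in bytes. */
size_t CPU_Init(int argc, char *argv[])
{
    (void)argc; (void)argv;
    return sizeof(ToyResult);
}

/* Deterministic 64-bit mixer (splitmix64 finalizer). */
static uint64_t mix64(uint64_t x)
{
    x += 0x9E3779B97F4A7C15ULL;
    x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
    x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
    return x ^ (x >> 31);
}

/* Evaluate one seed: derive a candidate x in [0,1) and score it. */
void CPU_Exec(long seed, void const *champion, void *result)
{
    (void)champion;  /* a random-restart search ignores the champion */
    ToyResult r;
    double x = (double)(mix64((uint64_t)seed) >> 11) / 9007199254740992.0;
    r.cost = (x - 0.25) * (x - 0.25);
    r.seed = seed;
    memcpy(result, &r, sizeof r);
}

void CPU_Output(void const *champion)
{
    ToyResult c;
    memcpy(&c, champion, sizeof c);
    printf("best seed %ld, cost %g\n", c.seed, c.cost);
}

/* Hypothetical serial stand-in for the framework: evaluates `count`
   consecutive seeds and keeps the lowest-cost result as champion. */
ToyResult toy_driver(long first_seed, long count)
{
    assert(count > 0);
    ToyResult champion, candidate;
    CPU_Exec(first_seed, NULL, &champion);
    for (long s = first_seed + 1; s < first_seed + count; s++) {
        CPU_Exec(s, &champion, &candidate);
        if (candidate.cost < champion.cost)
            champion = candidate;
    }
    return champion;
}
```

Calling `toy_driver(0, 1000)` and passing the address of the returned struct to `CPU_Output` prints the best seed found. In the real framework the loop, the comparison, and the parallel execution would be handled for the user.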

Internal Operation: Threading
• ILCS master thread starts
• Master forks a worker per core
  • Workers evaluate seeds, record local opt
• Master forks a handler per GPU
  • Handlers launch GPU code, sleep, record result
  • GPU workers evaluate seeds, record local opt
• Master sporadically finds global opt via MPI, sleeps

Internal Operation: Seed Distribution
• Each node gets a chunk of the 64-bit seed range
  • CPUs process the chunk bottom up
  • GPUs process the chunk top down
• E.g., 4 nodes w/ 4 cores (a,b,c,d) and 2 GPUs (1,2)
• Benefits
  • Balanced workload irrespective of number of CPU cores or GPUs (or their relative performance)
  • Users can generate other distributions from seeds
    • Any injective mapping results in no redundant evaluations
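
The seed distribution can be sketched as follows. The slide only states that each node gets a chunk, that CPUs walk it bottom up, and that GPUs walk it top down. The even split, the round-robin interleaving among workers, and all names below (`SeedChunk`, `node_chunk`, `cpu_seed`, `gpu_seed`, `scramble_seed`) are assumptions, not the framework's actual code.

```c
#include <assert.h>
#include <stdint.h>

/* Half-open seed range [lo, hi) owned by one node (hypothetical type). */
typedef struct { uint64_t lo, hi; } SeedChunk;

/* Assumption: the range [0, total) is split into equal contiguous
   chunks, with the last node absorbing any remainder. */
SeedChunk node_chunk(uint64_t total, uint32_t nodes, uint32_t rank)
{
    assert(nodes > 0 && rank < nodes);
    uint64_t size = total / nodes;
    SeedChunk c;
    c.lo = size * rank;
    c.hi = (rank == nodes - 1) ? total : c.lo + size;
    return c;
}

/* i-th seed of CPU worker `w` out of `workers`, counting up from lo. */
uint64_t cpu_seed(SeedChunk c, uint32_t w, uint32_t workers, uint64_t i)
{
    assert(w < workers);
    return c.lo + i * workers + w;
}

/* i-th seed of GPU `g` out of `gpus`, counting down from hi - 1. */
uint64_t gpu_seed(SeedChunk c, uint32_t g, uint32_t gpus, uint64_t i)
{
    assert(g < gpus);
    return c.hi - 1 - (i * gpus + g);
}

/* Example of an injective remapping: multiplying by an odd constant
   is a bijection modulo 2^64, so distinct seeds stay distinct. */
uint64_t scramble_seed(uint64_t seed)
{
    return seed * 0x9E3779B97F4A7C15ULL;
}
```

Because the CPU and GPU cursors approach each other from opposite ends, neither side needs to know how fast the other is. A faster device simply consumes more of the chunk before the two meet, which is how the load stays balanced regardless of relative performance.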

Related Work
• MapReduce/Hadoop/MARS and PADO
  • Their generality and features unnecessary for ILS incur overhead and steepen the learning curve
  • Some do not support accelerators, some require Java
• ILCS framework is optimized for ILS applications
  • Reduction is provided
  • Does not require multiple keys
  • Does not need secondary storage to buffer data
  • Directly supports non-random restart heuristics
  • Allows early termination
  • Works with GPUs and MICs
  • Targets single-node workstations through HPC clusters

Evaluation Methodology
• Three HPC systems (at TACC and NICS)
• Largest tested configuration

Sample ILS Codes
• Traveling Salesman Problem (TSP)
  • Find shortest tour
  • 4 inputs from TSPLIB
  • 2-opt hill climbing
• Finite State Machine (FSM)
  • Find best FSM config to predict hit/miss events
  • 4 sizes (n = 3, 4, 5, 6)
  • Monte Carlo method
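
For readers unfamiliar with 2-opt hill climbing, the following is a minimal serial sketch of one random-restart evaluation: a seed produces a random tour, which is then improved by reversing segments until no 2-opt move shortens it. This is not the authors' TSP code. It uses Manhattan distances on integer coordinates to stay self-contained, and every name in it (`City`, `random_tour`, `two_opt`, `is_permutation`) is hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_CITIES 64

typedef struct { long x, y; } City;

/* Manhattan distance (a simplification chosen for this sketch). */
static long dist(const City *c, int a, int b)
{
    return labs(c[a].x - c[b].x) + labs(c[a].y - c[b].y);
}

long tour_length(const City *c, const int *tour, int n)
{
    long len = 0;
    for (int i = 0; i < n; i++)
        len += dist(c, tour[i], tour[(i + 1) % n]);
    return len;
}

/* xorshift64 generator; the state must be nonzero. */
static uint64_t next_rand(uint64_t *s)
{
    *s ^= *s << 13;
    *s ^= *s >> 7;
    *s ^= *s << 17;
    return *s;
}

/* Random restart: Fisher-Yates shuffle driven by the seed. */
void random_tour(uint64_t seed, int *tour, int n)
{
    assert(n >= 1 && n <= MAX_CITIES);
    uint64_t s = seed ^ 0x9E3779B97F4A7C15ULL;
    if (s == 0) s = 1;
    for (int i = 0; i < n; i++) tour[i] = i;
    for (int i = n - 1; i > 0; i--) {
        int j = (int)(next_rand(&s) % (uint64_t)(i + 1));
        int t = tour[i]; tour[i] = tour[j]; tour[j] = t;
    }
}

/* Apply improving 2-opt moves until none remains; returns the length
   of the resulting locally optimal tour. */
long two_opt(const City *c, int *tour, int n)
{
    int improved = 1;
    while (improved) {
        improved = 0;
        for (int i = 0; i < n - 1; i++) {
            for (int j = i + 2; j < n; j++) {
                if (i == 0 && j == n - 1) continue;  /* adjacent edges */
                int a = tour[i], b = tour[i + 1];
                int d = tour[j], e = tour[(j + 1) % n];
                long delta = dist(c, a, d) + dist(c, b, e)
                           - dist(c, a, b) - dist(c, d, e);
                if (delta < 0) {
                    /* Reverse tour[i+1..j] to swap the two edges. */
                    for (int lo = i + 1, hi = j; lo < hi; lo++, hi--) {
                        int t = tour[lo]; tour[lo] = tour[hi]; tour[hi] = t;
                    }
                    improved = 1;
                }
            }
        }
    }
    return tour_length(c, tour, n);
}

/* Sanity check: every city appears exactly once. */
int is_permutation(const int *tour, int n)
{
    int seen[MAX_CITIES] = {0};
    for (int i = 0; i < n; i++)
        if (tour[i] < 0 || tour[i] >= n || seen[tour[i]]++) return 0;
    return 1;
}
```

In an ILCS-style setup, the body of an Exec function would roughly correspond to calling `random_tour` with the given seed followed by `two_opt`, with the resulting tour and its length written to the result buffer.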

FSM Transitions/Second Evaluated
• 21,532,197,798,304 s⁻¹
• GPU shmem limit
• Ranger uses twice as many cores as Stampede

TSP Tour-Changes/Second Evaluated
• 12,239,050,704,370 s⁻¹
• Based on serial CPU code
• GPU re-computes: O(n) memory
• CPU pre-computes: O(n²) memory
• Each core evals a tour change every 3.6 cycles

TSP Moves/Second/Node Evaluated
• GPUs provide >90% of performance on Keeneland

ILCS Scaling on Ranger (FSM)
• >99% parallel efficiency on 2048 nodes
• Other two systems are similar

ILCS Scaling on Ranger (TSP)
• >95% parallel efficiency on 2048 nodes
• Longer runs are even better

Intra-Node Scaling on Stampede (TSP)
• >98.9% parallel efficiency on 16 threads
• Framework overhead is very small
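
The slides do not say how parallel efficiency was computed. Assuming the common throughput-based definition (measured rate divided by the ideal rate of that many independent baseline units), it can be sketched as below. The function name and parameters are hypothetical.

```c
#include <assert.h>

/* Throughput-based parallel efficiency: the measured parallel rate
   divided by `units` times the rate of a single baseline unit.
   A value of 1.0 means perfect linear scaling. */
double parallel_efficiency(double base_rate, double parallel_rate, int units)
{
    assert(base_rate > 0.0 && units > 0);
    return parallel_rate / (base_rate * (double)units);
}
```

With made-up numbers (not taken from the slides), a single thread evaluating 100 moves per second and 16 threads together evaluating 1584 moves per second would give an efficiency of 1584 / (16 × 100) = 0.99.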

Tour Quality Evolution (Keeneland)
• Quality depends on chance: ILS provides a good solution quickly, then progressively improves it

Tour Quality after 6 Steps (Stampede)
• Larger node counts typically yield better results faster

Summary and Conclusions
• ILCS Framework
  • Automatic parallelization of iterative local searches
  • Provides MPI, OpenMP, and multi-GPU support
  • Checkpoints the current best solution every few seconds
  • Scales very well (decentralized)
• Evaluation
  • 2-opt hill climbing (TSP) and Monte Carlo method (FSM)
  • AMD + Intel CPUs, NVIDIA GPUs, and Intel MICs
• ILCS source code is freely available
  • http://cs.txstate.edu/~burtscher/research/ILCS/
Work supported by NSF, NVIDIA, and Intel; resources provided by TACC and NICS
