
A Scalable Heterogeneous Parallelization Framework for Iterative Local Searches

Martin Burtscher (1) and Hassan Rabeti (2). (1) Department of Computer Science, Texas State University-San Marcos; (2) Department of Mathematics, Texas State University-San Marcos.



Presentation Transcript


  1. A Scalable Heterogeneous Parallelization Framework for Iterative Local Searches
     Martin Burtscher (1) and Hassan Rabeti (2)
     (1) Department of Computer Science, Texas State University-San Marcos
     (2) Department of Mathematics, Texas State University-San Marcos

  2. Problem: HPC is Hard to Exploit
     • HPC application writers are domain experts
       • They are typically not computer scientists and have little or no formal education in parallel programming
       • Parallel programming is difficult and error-prone
     • Modern HPC systems are complex
       • Consist of interconnected compute nodes with multiple CPUs and one or more GPUs per node
       • Require parallelization at multiple levels (inter-node, intra-node, and accelerator) for best performance

  3. Target Area: Iterative Local Searches
     • Important application domain
       • Widely used in engineering & real-time environments
     • Examples
       • All sorts of random-restart greedy algorithms
       • Ant colony optimization, Monte Carlo, n-opt hill climbing, etc.
     • ILS properties
       • Iteratively produce better solutions
       • Can exploit large amounts of parallelism
       • Often have exponential search space

  4. Our Solution: ILCS Framework
     • Iterative Local Champion Search (ILCS) framework
       • Supports non-random restart heuristics
         • Genetic algorithms, tabu search, particle swarm optimization, etc.
       • Simplifies implementation of ILS on parallel systems
     • Design goal
       • Ease of use and scalability
     • Framework benefits
       • Handles threading, communication, locking, resource allocation, heterogeneity, load balance, termination decision, and result recording (checkpointing)

  5. User Interface
     • User writes 3 serial C functions and/or 3 single-GPU CUDA functions with some restrictions
         size_t CPU_Init(int argc, char *argv[]);
         void CPU_Exec(long seed, void const *champion, void *result);
         void CPU_Output(void const *champion);
     • See paper for GPU interface and sample code
     • Framework runs Exec (map) functions in parallel
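A minimal sketch of what the three CPU functions might look like for a toy problem may help make the interface concrete. Only the three signatures come from the slide. The `Result` record, the meaning assigned to the return value of `CPU_Init` (taken here to be the size of one result record), the toy search itself, and the serial driver `run_serial` are all assumptions made for illustration; the paper describes the actual contract and restrictions.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical result record; the layout the real framework expects is
   described in the paper and may differ. Lower cost is better here. */
typedef struct {
    long cost;  /* quality of the local optimum that was found */
    long x;     /* the solution itself */
} Result;

static long target = 42;  /* toy problem input, optionally set by CPU_Init */

/* Read the problem input. ASSUMED to return the size of one result record. */
size_t CPU_Init(int argc, char *argv[]) {
    if (argc > 1) target = strtol(argv[1], NULL, 10);
    return sizeof(Result);
}

/* Evaluate one seed: derive a start point from the seed, then take at most
   10 greedy steps toward the target. The champion is ignored in this sketch;
   a non-random restart heuristic could start from it instead. */
void CPU_Exec(long seed, void const *champion, void *result) {
    (void)champion;
    Result *out = result;
    long x = seed % 1000;
    for (int step = 0; step < 10 && x != target; step++)
        x += (x < target) ? 1 : -1;
    out->x = x;
    out->cost = labs(x - target);
}

/* Print the best solution found so far. */
void CPU_Output(void const *champion) {
    Result const *best = champion;
    printf("best x = %ld, cost = %ld\n", best->x, best->cost);
}

/* Hypothetical serial stand-in for the framework's parallel loop:
   evaluate seeds [first, first + count) and keep the best result. */
Result run_serial(long first, long count) {
    assert(count > 0);
    Result champ = { LONG_MAX, 0 };
    Result cur;
    for (long s = first; s < first + count; s++) {
        CPU_Exec(s, &champ, &cur);
        if (cur.cost < champ.cost) champ = cur;
    }
    return champ;
}
```

Because each `CPU_Exec` call depends only on its seed and a read-only champion, the calls are independent of one another, which is what lets a framework run them in parallel as a map step and reduce the results to a single champion.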

  6. Internal Operation: Threading
     • ILCS master thread starts
     • Master forks a worker per core
       • Workers evaluate seeds, record local opt
     • Master forks a handler per GPU
       • Handlers launch GPU code, sleep, record result
       • GPU workers evaluate seeds, record local opt
     • Master sporadically finds global opt via MPI, sleeps

  7. Internal Operation: Seed Distribution
     • Each node gets a chunk of the 64-bit seed range
       • CPUs process chunk bottom up
       • GPUs process chunk top down
     • E.g., 4 nodes w/ 4 cores (a, b, c, d) and 2 GPUs (1, 2)
     • Benefits
       • Balanced workload irrespective of number of CPU cores or GPUs (or their relative performance)
       • Users can generate other distributions from seeds
         • Any injective mapping results in no redundant evaluations
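The seed-distribution scheme can be sketched as a per-node chunk with two cursors that move toward each other. This is an illustrative model, not the framework's code: the names `SeedChunk`, `chunk_for_node`, `take_cpu_seed`, and `take_gpu_seed` are hypothetical, seeds are modeled as `uint64_t`, and the sketch is single-threaded (a real implementation would need atomic updates or locking, since several workers draw seeds concurrently).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* One node's share of the seed range, [next_cpu, next_gpu). */
typedef struct {
    uint64_t next_cpu;  /* next seed for a CPU worker; moves upward */
    uint64_t next_gpu;  /* one past the next seed for a GPU; moves downward */
} SeedChunk;

/* Split the 64-bit range evenly: node `rank` gets
   [rank * size, (rank + 1) * size), so chunks never overlap. */
SeedChunk chunk_for_node(uint64_t rank, uint64_t nodes) {
    assert(nodes > 0 && rank < nodes);
    uint64_t size = UINT64_MAX / nodes;
    SeedChunk c = { rank * size, (rank + 1) * size };
    return c;
}

/* CPU workers take seeds from the bottom of the chunk. */
bool take_cpu_seed(SeedChunk *c, uint64_t *seed) {
    if (c->next_cpu >= c->next_gpu) return false;  /* chunk exhausted */
    *seed = c->next_cpu++;
    return true;
}

/* GPU handlers take seeds from the top of the chunk. */
bool take_gpu_seed(SeedChunk *c, uint64_t *seed) {
    if (c->next_cpu >= c->next_gpu) return false;  /* chunk exhausted */
    *seed = --c->next_gpu;
    return true;
}
```

Because the two cursors only stop when they meet, whichever device type is faster simply consumes more of the chunk, and no seed is handed out twice. No performance ratio between CPUs and GPUs has to be known in advance.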

  8. Related Work
     • MapReduce/Hadoop/MARS and PADO
       • Their generality and features that ILS does not need incur overhead and steepen the learning curve
       • Some do not support accelerators, some require Java
     • ILCS framework is optimized for ILS applications
       • Reduction is provided, does not require multiple keys, does not need secondary storage to buffer data, directly supports non-random restart heuristics, allows early termination, works with GPUs and MICs, targets single-node workstations through HPC clusters

  9. Evaluation Methodology
     • Three HPC systems (at TACC and NICS)
     • Largest tested configuration
     (Image credit: datacenterknowledge.com)

  10. Sample ILS Codes
      • Traveling Salesman Problem (TSP)
        • Find shortest tour
        • 4 inputs from TSPLIB
        • 2-opt hill climbing
      • Finite State Machine (FSM)
        • Find best FSM config to predict hit/miss events
        • 4 sizes (n = 3, 4, 5, 6)
        • Monte Carlo method
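For readers unfamiliar with 2-opt hill climbing, the following sketch shows the basic move: remove two edges of the tour, reconnect the endpoints the other way by reversing the segment between them, and keep the change whenever the tour gets shorter. It is a generic textbook formulation, not the code evaluated in the talk. The names are hypothetical, moves are applied as soon as an improvement is found, and Manhattan distance is used only to keep the example integer-only, whereas TSPLIB instances define their own distance functions.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int x, y; } City;

/* Manhattan distance, chosen only to keep this sketch integer-only. */
static int dist(const City *c, int a, int b) {
    return abs(c[a].x - c[b].x) + abs(c[a].y - c[b].y);
}

/* Length of the closed tour c[tour[0]] -> ... -> c[tour[n-1]] -> c[tour[0]]. */
int tour_length(const City *c, const int *tour, int n) {
    int len = 0;
    for (int i = 0; i < n; i++)
        len += dist(c, tour[i], tour[(i + 1) % n]);
    return len;
}

/* Apply improving 2-opt moves until none remains (a local optimum).
   Returns the number of moves applied. */
int two_opt(const City *c, int *tour, int n) {
    assert(n >= 4);
    int moves = 0, improved = 1;
    while (improved) {
        improved = 0;
        for (int i = 0; i < n - 2; i++) {
            for (int j = i + 2; j < n; j++) {
                if (i == 0 && j == n - 1) continue;  /* edges share tour[0] */
                int a = tour[i], b = tour[i + 1];
                int d = tour[j], e = tour[(j + 1) % n];
                /* Change in length if edges (a,b),(d,e) become (a,d),(b,e). */
                int delta = dist(c, a, d) + dist(c, b, e)
                          - dist(c, a, b) - dist(c, d, e);
                if (delta < 0) {
                    /* Reverse tour[i+1 .. j] to reconnect the tour. */
                    for (int lo = i + 1, hi = j; lo < hi; lo++, hi--) {
                        int t = tour[lo];
                        tour[lo] = tour[hi];
                        tour[hi] = t;
                    }
                    moves++;
                    improved = 1;
                }
            }
        }
    }
    return moves;
}
```

In an iterative local search, each seed would determine a different starting tour, and the local optimum reached by the climb would be compared against the current champion.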

  11. FSM Transitions/Second Evaluated
      • 21,532,197,798,304 s^-1
      • GPU shmem limit
      • Ranger uses twice as many cores as Stampede

  12. TSP Tour-Changes/Second Evaluated
      • 12,239,050,704,370 s^-1
      • Based on serial CPU code
      • GPU re-computes: O(n) memory
      • CPU pre-computes: O(n^2) memory
      • Each core evals a tour change every 3.6 cycles
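The memory trade-off noted on this slide can be illustrated with two ways of obtaining an inter-city distance: recomputing it from coordinates each time (only the n coordinates are stored) or looking it up in a table filled once up front (n × n entries are stored). The sketch below is an assumption-laden illustration, not the evaluated code: the function names are hypothetical and Manhattan distance stands in for the real distance function.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct { int x, y; } City;

/* O(n) memory: keep only coordinates and recompute distances on demand. */
int dist_recompute(const City *c, int a, int b) {
    return abs(c[a].x - c[b].x) + abs(c[a].y - c[b].y);
}

/* O(n^2) memory: fill an n-by-n table once, then look distances up.
   Returns NULL if allocation fails; the caller frees the table. */
int *dist_precompute(const City *c, int n) {
    assert(n > 0);
    int *m = malloc((size_t)n * (size_t)n * sizeof *m);
    if (m == NULL) return NULL;
    for (int a = 0; a < n; a++)
        for (int b = 0; b < n; b++)
            m[(size_t)a * (size_t)n + (size_t)b] = dist_recompute(c, a, b);
    return m;
}

/* Bytes needed by each strategy for n cities. */
size_t bytes_recompute(size_t n)  { return n * sizeof(City); }
size_t bytes_precompute(size_t n) { return n * n * sizeof(int); }
```

The table grows quadratically with the number of cities while the coordinate array grows linearly, which is one plausible reason to recompute on a device with little fast memory and to precompute on a CPU with large caches and main memory.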

  13. TSP Moves/Second/Node Evaluated
      • GPUs provide >90% of performance on Keeneland

  14. ILCS Scaling on Ranger (FSM)
      • >99% parallel efficiency on 2048 nodes
      • Other two systems are similar

  15. ILCS Scaling on Ranger (TSP)
      • >95% parallel efficiency on 2048 nodes
      • Longer runs are even better

  16. Intra-Node Scaling on Stampede (TSP)
      • >98.9% parallel efficiency on 16 threads
      • Framework overhead is very small
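The parallel-efficiency figures on the scaling slides compare achieved throughput against ideal linear scaling. A common way to compute this from throughput measurements is sketched below; the slides do not state which baseline the authors used, so the choice of baseline here is an assumption, as is the function name.

```c
#include <assert.h>

/* Parallel efficiency from measured throughput (e.g., moves per second):
   the per-unit rate at `units` divided by the per-unit rate at the
   baseline `base_units`. A value of 1.0 means perfect linear scaling.
   `units` can count nodes (inter-node) or threads (intra-node). */
double parallel_efficiency(double base_rate, int base_units,
                           double rate, int units) {
    assert(base_rate > 0.0 && base_units > 0 && units > 0);
    return (rate / units) / (base_rate / base_units);
}
```

For instance, with made-up numbers, a run that achieves 1000 evaluations per second on 1 node and 2,000,000 per second on 2048 nodes has an efficiency of (2,000,000 / 2048) / 1000, or about 97.7%.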

  17. Tour Quality Evolution (Keeneland)
      • Quality depends on chance: ILS provides a good solution quickly, then progressively improves it

  18. Tour Quality after 6 Steps (Stampede)
      • Larger node counts typically yield better results faster

  19. Summary and Conclusions
      • ILCS Framework
        • Automatic parallelization of iterative local searches
        • Provides MPI, OpenMP, and multi-GPU support
        • Checkpoints currently best solution every few seconds
        • Scales very well (decentralized)
      • Evaluation
        • 2-opt hill climbing (TSP) and Monte Carlo method (FSM)
        • AMD + Intel CPUs, NVIDIA GPUs, and Intel MICs
      • ILCS source code is freely available
        • http://cs.txstate.edu/~burtscher/research/ILCS/
      Work supported by NSF, NVIDIA and Intel; resources provided by TACC and NICS
