A framework for efficient parallel linked data structures on distributed-memory systems, with a focus on optimizations and high-level operations.
Global Trees: A Framework for Linked Data Structures on Distributed Memory Parallel Systems • D. Brian Larkins, James Dinan, Sriram Krishnamoorthy, Srinivasan Parthasarathy, Atanas Rountev, P. Sadayappan
Background • Trees and graphs can concisely represent relationships between data • Data sets are becoming increasingly large and can require compute-intensive processing • Developing efficient, memory hierarchy-aware applications is hard
Sample Applications • n-body simulation • Fast Multipole Methods (FMM) • multiresolution analysis • clustering and classification • frequent pattern mining
Key Contributions • Efficient fine-grained data access with a global view of data • Exploit linked structure to provide fast global pointer dereferencing • High-level, locality-aware, parallel operations on linked data structures • Application-driven customization • Empirical validation of the approach
Global Chunk Layer (GCL) • API and run-time library for managing chunks - built on ARMCI • Abstracts common functionality for handling irregular, linked data • Provides a global namespace with access and modification operations • Extensible and highly customizable to maximize functionality and performance
Chunks • A chunk is: • Contiguous memory segment • Globally accessible • Physically local to only one process • Collection of user-defined elements • Unit of data transfer
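The chunk abstraction can be pictured with a small C sketch: a contiguous, globally accessible segment that is physically local to one process and holds a fixed number of user-defined elements. This is only an illustration of the slide's description; the names (gt_chunk, elem_t, CHUNK_ELEMS) and fields are assumptions, not the GCL API.

/* Illustrative model of a chunk, not the real GCL data structure. */
#include <stdlib.h>

#define CHUNK_ELEMS 256            /* elements per chunk (tunable)      */

typedef struct {                   /* user-defined element type         */
    double mass;
    double pos[3];
} elem_t;

typedef struct {
    int    owner_rank;             /* process that physically holds it  */
    int    next_free;              /* next unallocated element slot     */
    elem_t elems[CHUNK_ELEMS];     /* contiguous element storage        */
} gt_chunk;

/* Allocate a chunk owned by the calling process. */
static gt_chunk *chunk_create(int my_rank)
{
    gt_chunk *c = malloc(sizeof *c);
    if (c) { c->owner_rank = my_rank; c->next_free = 0; }
    return c;
}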
Programming Model • SPMD with MIMD-style parallelism • Global pointers permit fine-grained access • Chunks allow coarse-grained data movement • Uses get/compute/put model for globally shared data access • Provides both uniform global view and chunked global view of data
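The get/compute/put pattern for globally shared data might look as follows. Here gt_get and gt_put are illustrative stand-ins (plain memcpy between buffers) for the one-sided transfers that GCL layers over ARMCI; they are not the library's real entry points.

/* Hedged sketch of the get/compute/put access model described above. */
#include <string.h>

typedef struct { double mass; double pos[3]; } node_body;

/* Stand-in for a one-sided get of a shared node into a local buffer. */
static void gt_get(node_body *local, const node_body *remote)
{
    memcpy(local, remote, sizeof *local);
}

/* Stand-in for the matching one-sided put of the updated copy. */
static void gt_put(node_body *remote, const node_body *local)
{
    memcpy(remote, local, sizeof *remote);
}

static void scale_mass(node_body *shared_node, double factor)
{
    node_body buf;
    gt_get(&buf, shared_node);   /* get: copy shared data locally    */
    buf.mass *= factor;          /* compute: operate on local copy   */
    gt_put(shared_node, &buf);   /* put: write the result back       */
}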
Global Pointers • [Diagram: dereferencing the global pointer stored in parent node p to reach child node c; the child's address is computed as c = &p.child[i] + p.child[i].ci + p.child[i].no, e.g. 4252 + (-4252) + 4340]
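The diagram's pointer arithmetic can be sketched in C as a global pointer that names a chunk and a node offset, resolved by adding the chunk's base address to the scaled offset. The type and field names (gt_gptr, chunk_dir, ci, no) are assumptions modeled on the slide, not GT's actual definitions.

/* Illustrative sketch of global pointer dereferencing. */
typedef struct { double body[4]; } tree_node;

typedef struct {
    int ci;                        /* chunk index                   */
    int no;                        /* node offset within the chunk  */
} gt_gptr;

#define MAX_CHUNKS 1024
static tree_node *chunk_dir[MAX_CHUNKS];   /* local chunk directory */

/* Resolve a global pointer whose chunk happens to be locally resident;
 * a remote chunk would first be fetched (or cached) via GCL. */
static tree_node *gt_deref_local(gt_gptr p)
{
    return chunk_dir[p.ci] + p.no;
}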
Global Trees (GT) • Run-time library and API for global-view programming of trees on distributed-memory clusters • Built on the GCL chunk communication framework • High-level tree operations that are parallel and locality-aware • Each process can asynchronously access any portion of the shared tree structure
GT Concepts • Tree Groups • set of global trees • allocations are made from the same chunk pool • Global Node Pointers • Tree Nodes • link structure managed by GT • body is a user-defined structure
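The split between GT-managed link structure and an application-defined body could be modeled roughly as below; MAX_CHILDREN and all struct names are illustrative assumptions rather than the GT definitions.

/* Sketch of the node layout implied by the slide. */
#define MAX_CHILDREN 8

typedef struct { int ci, no; } gt_gptr;      /* global node pointer  */

typedef struct {
    gt_gptr parent;                          /* managed by GT        */
    gt_gptr child[MAX_CHILDREN];             /* managed by GT        */
    int     n_children;
} gt_link;

typedef struct {
    double mass;                             /* user-defined body    */
    double center[3];
} user_body;

typedef struct {
    gt_link   link;                          /* GT-managed links     */
    user_body body;                          /* application payload  */
} gt_node;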
Tree Traversals • GT provides optimized, parallel traversals for common traversal orders • Visitor callbacks are application-defined computations on a single node • GT currently provides top-down, bottom-up, and level-wise traversals
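The visitor-callback shape can be illustrated with a serial, local sketch: the application supplies a function that operates on one node, and the traversal applies it in a given order. GT's actual traversals run in parallel over the distributed tree; the names below are hypothetical.

/* Serial illustration of a top-down (pre-order) visitor traversal. */
typedef struct node {
    double       value;
    struct node *child[2];
} node;

typedef void (*visit_fn)(node *n, void *arg);

/* Visit a node before its children. */
static void traverse_top_down(node *n, visit_fn visit, void *arg)
{
    if (!n) return;
    visit(n, arg);
    for (int i = 0; i < 2; i++)
        traverse_top_down(n->child[i], visit, arg);
}

/* Example visitor: accumulate the sum of node values. */
static void sum_visitor(node *n, void *arg)
{
    *(double *)arg += n->value;
}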
Custom Allocation • No single mapping of data elements to chunks will be optimal • GT/GCL supports custom allocators to improve spatial locality • Allocators can use a hint from the call site and can keep state between calls • Default allocation is local-open
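A custom allocator of the kind described might look like this sketch: it takes a hint from the call site (for example, the parent's chunk) and keeps state between calls (the currently open chunk). The names and bookkeeping are illustrative assumptions, not the GT/GCL allocator interface.

/* Hedged sketch of a locality-improving custom allocator. */
#include <stdlib.h>

#define CHUNK_ELEMS 256

typedef struct {
    int  used;                      /* elements already allocated   */
    char data[CHUNK_ELEMS * 64];    /* element storage (64 B each)  */
} chunk;

static chunk *open_chunk;           /* allocator state across calls */

/* Allocate one element, preferring the hinted chunk (e.g. the parent's)
 * so that linked nodes end up spatially close; otherwise fall back to
 * the open chunk, starting a new one when it fills. */
static void *alloc_near(chunk *hint)
{
    chunk *c = (hint && hint->used < CHUNK_ELEMS) ? hint : open_chunk;
    if (!c || c->used == CHUNK_ELEMS) {
        c = calloc(1, sizeof *c);
        open_chunk = c;
    }
    return c ? &c->data[64 * c->used++] : NULL;
}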
Experimental Results • Evaluate using: • Barnes-Hut from SPLASH-2 • Compression operation from MADNESS • GT compared with: • Intel's Cluster OpenMP and its TreadMarks-based runtime • UPC
Global Pointer Overhead • [Plots: MADNESS compress() and Barnes-Hut]
Chunk Size and Bandwidth • Experiments run on the departmental WCI cluster: 2.33 GHz Intel Xeon, 6 GB RAM, InfiniBand
Impact of Chunk Utilization (Barnes-Hut) • Experiments run on the departmental WCI cluster: 2.33 GHz Intel Xeon, 6 GB RAM, InfiniBand
Barnes-Hut Chunk Size Selection • Barnes-Hut application from the SPLASH-2 suite
Barnes-Hut Scaling • chunk size = 256, bodies = 512K
Related Approaches • Distributed Shared Memory (DSM) • Cluster OpenMP, TreadMarks, Cashmere, Midway • Distributed Shared Objects (DSO) • Charm++, Linda, Orca • Partitioned Global Address Space (PGAS) Languages and Systems • UPC, Titanium, CAF, ARMCI, SHMEM, GASNet • Shared pointer-based data structure support on distributed memory clusters • Parallel Irregular Trees, Olden • HPCS Programming Languages • Chapel, X10
Future Work • Global Graphs • GT data reference locality tools • More applications
Questions? • email: larkins@cse.ohio-state.edu