
Center for Research on Multicore Computing (CRMC)


Presentation Transcript


1. Center for Research on Multicore Computing (CRMC): Overview
Ken Kennedy, Rice University
http://www.cs.rice.edu/~ken/Presentations/CRMC06.pdf

2. CRMC Overview
• Initial Participation from Three Institutions
  • Rice: Ken Kennedy, Keith Cooper, John Mellor-Crummey, Scott Rixner
  • Indiana: Geoffrey Fox, Dennis Gannon
  • Tennessee: Jack Dongarra
• Activities
  • Research and prototype development
  • Community building
  • Workshops and meetings
  • Other outreach components (separately funded)
• Planning and Management
  • Coordinated management and vision-building
  • Model: CRPC

3. Management Strategy
• Pioneered in CRPC and honed in GrADS/VGrADS/LACSI
• Leadership forms a broad vision-building team
  • Problem identification
  • Willingness to redirect research to address new challenges
• Complementary research areas
  • Willingness to look at problems from multiple dimensions
  • Joint projects between sites
• Community-building activities
  • Focused workshops on key topics
    • CRPC: TSPLib, BLACS/ScaLAPACK
    • LACSI: Autotuning
  • Informal standardization
    • CRPC: MPI and HPF
• Annual planning cycle
  • Plan, research, report, evaluate, …

4. Research Areas I
• Compilers and programming tools
  • Tools: performance analysis and prediction (HPCToolkit)
  • Transformations: memory hierarchy and parallelism
  • Automatic tuning strategies
• Programming models and languages
  • High-level languages: Matlab, Python, R, etc.
  • HPCS languages
  • Programming models based on component integration
• Run-time systems
  • Core run-time data movement library
  • Integration with MPI
• Libraries
  • Adaptive, reconfigurable libraries optimized for multicore systems

5. Research Areas II
• Applications for multicore systems
  • Classical parallel/scientific applications
  • Commercial applications, with advice from industrial partners
• Interface between software and architecture
  • Facilities for managing bandwidth (controllable caches, scratch memory)
  • Sample-based profiling facilities
  • Heterogeneous cores
• Fault tolerance
  • Redundant components
  • Diskless checkpointing
• Multicore emulator
  • Research platform for future systems

6. Performance Analysis and Prediction
• HPCToolkit (Mellor-Crummey)
  • Uses sample-based profiling combined with binary analysis to report performance issues (recompilation not required)
  • How can it be extended to the multicore environment?
• Performance prediction (Mellor-Crummey)
  • Currently uses a performance prediction methodology that accurately accounts for the memory hierarchy
    • Reuse-distance histograms built from training data, parameterized by input data size (see the sketch below)
    • Accurately determines the frequency of misses at each reference
  • Extension to shared-cache multicore systems is under way
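To make the reuse-distance idea concrete, here is a minimal, self-contained sketch (not HPCToolkit's implementation) that computes a reuse-distance histogram for a tiny address trace; the trace values and sizes are illustrative. On a fully associative LRU cache of C blocks, an access hits exactly when its reuse distance is less than C, which is what makes these histograms useful for miss prediction.

    /* Sketch: reuse-distance histogram for a small trace of block addresses.
     * The reuse distance of an access is the number of DISTINCT addresses
     * touched since the previous access to the same address.             */
    #include <stdio.h>

    #define TRACE_LEN 8

    int main(void) {
        int trace[TRACE_LEN] = {1, 2, 3, 1, 2, 4, 1, 3};  /* hypothetical trace */
        int hist[TRACE_LEN] = {0};   /* hist[d] = # accesses with reuse distance d */
        int cold = 0;                /* first-time (compulsory) accesses           */

        for (int i = 0; i < TRACE_LEN; i++) {
            int prev = -1;
            for (int j = i - 1; j >= 0; j--)          /* most recent previous use */
                if (trace[j] == trace[i]) { prev = j; break; }
            if (prev < 0) { cold++; continue; }

            int distinct = 0;                         /* distinct addresses between uses */
            for (int j = prev + 1; j < i; j++) {
                int seen = 0;
                for (int k = prev + 1; k < j; k++)
                    if (trace[k] == trace[j]) { seen = 1; break; }
                if (!seen) distinct++;
            }
            hist[distinct]++;
        }

        printf("cold accesses: %d\n", cold);
        for (int d = 0; d < TRACE_LEN; d++)
            if (hist[d]) printf("reuse distance %d: %d accesses\n", d, hist[d]);
        return 0;
    }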

7. Bandwidth Management
• Multicore raises computational power rapidly
  • Bandwidth onto the chip is unlikely to keep up
• Multicore systems will feature shared caches
  • Replaces false sharing with an enhanced probability of conflict misses
• Challenges for effective use of bandwidth
  • Enhancing reuse when multiple processors are using the cache
    • Reorganizing data to increase the density of cache-block use
    • Reorganizing computation to ensure reuse of data by multiple cores (a fusion sketch follows this list)
    • Inter-core pipelining
  • Managing conflict misses, with and without architectural help
    • Without architectural help: data reorganization within pages, and synchronization to minimize conflict misses
      • May require special memory-allocation run-time primitives
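As a small illustration of "reorganizing computation to ensure reuse", the sketch below shows classic loop fusion; the array names and sizes are placeholders. The point is only that the fused form brings each block of a[] and b[] on-chip once instead of twice, cutting bandwidth demand.

    #define N 1000000
    static double a[N], b[N], c[N];

    /* Unfused: a[] streams through the cache twice, and b[] is written and
     * then re-read after it has likely been evicted.                      */
    void unfused(void) {
        for (int i = 0; i < N; i++) b[i] = 2.0 * a[i];
        for (int i = 0; i < N; i++) c[i] = a[i] + b[i];
    }

    /* Fused: a[i] and b[i] are reused while still resident in cache/registers. */
    void fused(void) {
        for (int i = 0; i < N; i++) {
            b[i] = 2.0 * a[i];
            c[i] = a[i] + b[i];
        }
    }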

8. Conflict Misses
• Unfortunate fact: if a scientific calculation is sweeping across strips of more than k arrays on a machine with k-way associativity, and all of those strips overlap in one associativity group, then every access to a location in that group is a miss
• Example: three array strips, labeled 1, 2, and 3, mapping to the same group of a 2-way associative cache. On each outer loop iteration, 1 evicts 2, which evicts 3, which evicts 1, so every access is a miss.
• This limits loop fusion, a profitable reuse strategy (see the sketch below)
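The sketch below reconstructs this scenario under assumed parameters. Blocks whose addresses differ by a multiple of the "way size" (cache size divided by associativity; 64 KB here is a placeholder) fall in the same associativity group, so three strips allocated exactly one way size apart make a[i], b[i], and c[i] contend for the same group on every iteration.

    #include <stdlib.h>
    #include <string.h>

    #define WAY_SIZE (64 * 1024)     /* assumed: cache_size / associativity */

    int main(void) {
        /* One contiguous slab; a, b, c start exactly WAY_SIZE apart, so
         * a[i], b[i], c[i] always map to the same associativity group.   */
        double *slab = aligned_alloc(WAY_SIZE, 3 * (size_t)WAY_SIZE);
        if (!slab) return 1;
        memset(slab, 0, 3 * (size_t)WAY_SIZE);

        double *a = slab;
        double *b = slab + WAY_SIZE / sizeof(double);
        double *c = slab + 2 * WAY_SIZE / sizeof(double);
        int n = WAY_SIZE / sizeof(double);

        for (int i = 0; i < n; i++)   /* fused sweep over 3 conflicting strips  */
            c[i] = a[i] + b[i];       /* in a 2-way cache: every access misses  */

        free(slab);
        return 0;
    }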

9. Controlling Conflicts: An Example
• Cache and page parameters
  • 256K cache, 4-way set associative, 32-byte blocks
    • 1024 associativity groups
  • 64K page
    • 2048 cache blocks
  • Each block in a page maps to a single associativity group
    • 2 different lines in a page map to the same associativity group
• In general (a small sketch of the arithmetic follows this list)
  • Let A = number of associativity groups in the cache
  • Let P = number of cache blocks in a page
  • If P ≥ A, then each block in a page maps to a single associativity group, no matter where the page is loaded
  • If P < A, then a block can map to A/P different associativity groups, depending on where the page is loaded
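A minimal sketch of the arithmetic above. The page and block sizes match the slide; the group count A is passed in as an illustrative value rather than derived from a particular cache. The point is only that when P (blocks per page) is a multiple of A, the group a block maps to depends on its in-page offset alone, not on where the page is loaded.

    #include <stdio.h>

    /* Associativity group for the block at byte offset `off` within a page
     * loaded at physical page frame `frame`.  P = blocks per page.         */
    static unsigned group_of(unsigned frame, unsigned off,
                             unsigned page_bytes, unsigned block_bytes, unsigned A) {
        unsigned P = page_bytes / block_bytes;
        return (frame * P + off / block_bytes) % A;  /* global block number mod A */
    }

    int main(void) {
        unsigned page = 64 * 1024, block = 32;  /* page and block size from the slide */
        unsigned A = 1024;                      /* illustrative group count, A <= P   */
        unsigned off = 5 * block;               /* the 6th block of the page          */

        /* P = 2048 is a multiple of A, so the group is the same for any frame. */
        printf("frame 0 -> group %u, frame 7 -> group %u\n",
               group_of(0, off, page, block, A),
               group_of(7, off, page, block, A));
        return 0;
    }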

10. Questions
• Can we do data allocation precisely within a page so that conflict misses are minimized for a given computation? (A padding sketch follows this list.)
  • Extensive work exists on minimizing self-conflict misses
  • Little work on inter-array conflict minimization
  • No work, to my knowledge, on interprocessor conflict minimization
• Can we synchronize computations so that multiple cores do not interfere with one another?
  • Or even reuse blocks across processors?
• Might it be possible to convince vendors to provide additional features to help control the cache, particularly conflict misses?
  • Allocation of part of the cache as a scratchpad
  • Dynamic modification of the cache mapping
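One hypothetical illustration of intra-page placement for inter-array conflict minimization: if three equally sized strips would otherwise conflict element-for-element (because the way size divides the strip length), inserting one cache block of pad between them shifts each strip's blocks into a neighboring associativity group. The struct layout, block size, and strip length below are assumptions, not a scheme from the talk.

    #include <stdio.h>
    #include <stddef.h>

    #define BLOCK 32        /* assumed cache block size in bytes       */
    #define N     8192      /* strip length; 8192 doubles = 64 KB each */

    /* Without the pads, b and c would start an exact multiple of 64 KB after a;
     * if the way size divides 64 KB, a[i], b[i], c[i] would share a group.
     * Each one-block pad shifts the following strip by one group.            */
    struct padded_strips {
        double a[N];
        char   pad_ab[BLOCK];
        double b[N];
        char   pad_bc[BLOCK];
        double c[N];
    };

    int main(void) {
        printf("offset of b: %zu bytes, offset of c: %zu bytes\n",
               offsetof(struct padded_strips, b),
               offsetof(struct padded_strips, c));
        return 0;
    }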

11. Parallelism
• On a shared-cache multicore chip, running the same program on multiple processors has a major advantage
  • Possibility of reusing cache blocks across processors
  • Some chance of controlling conflict misses
• How can parallelism be found and exploited?
  • Automatic methods on scientific languages (much progress was made in the 90s)
  • Explicit parallel programming and thread-management paradigms
    • Data parallel (HPF, Chapel)
    • Partitioned global address space (Co-Array Fortran, UPC)
    • Lightweight threading (OpenMP, CCR; see the sketch below)
    • Software synchronization primitives
  • Integration of parallel component libraries
    • Telescoping languages
    • Parallel Matlab
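A minimal sketch of the lightweight-threading style listed above, using OpenMP (one of the models the slide names); the kernel and array sizes are illustrative. The statically scheduled loop gives each thread one contiguous chunk of the index space, which is the kind of placement the earlier slides want to exploit on a shared cache.

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000
    static double x[N], y[N];

    int main(void) {
        /* Each thread works on a contiguous chunk; on a shared-cache chip,
         * neighboring chunks can share boundary cache blocks.             */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < N; i++)
            y[i] = 2.0 * x[i] + y[i];

        printf("used up to %d threads\n", omp_get_max_threads());
        return 0;
    }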

12. Automatic Tuning
• Following ATLAS
  • Tuning generalized component libraries in advance
    • For different platforms
    • For different contexts on the same platform
      • May wish to choose a variant that uses a subset of the cache that does not conflict with the calling program
• Extensive work at Rice and Tennessee
  • Heuristic search combined with compiler models cuts tuning time
  • Many transformations: unroll-and-jam, tiling, fusion, etc.
    • These interact with one another
• New challenges for multicore (a tuning-loop sketch follows this list)
  • Tuning on-chip multiprocessors to use the shared (and non-shared) memory hierarchy effectively
  • Management of on-chip parallelism and threading
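The sketch below illustrates only the empirical-search side of ATLAS-style tuning, under simple assumptions: it times a tiled transpose kernel (a stand-in for a library variant) over a handful of candidate tile sizes and keeps the fastest. Real autotuners prune this search with compiler models, as the slide notes.

    #include <stdio.h>
    #include <time.h>

    #define N 512
    static double A[N][N], B[N][N];

    static void transpose_tiled(int tile) {
        for (int ii = 0; ii < N; ii += tile)
            for (int jj = 0; jj < N; jj += tile)
                for (int i = ii; i < ii + tile && i < N; i++)
                    for (int j = jj; j < jj + tile && j < N; j++)
                        B[j][i] = A[i][j];
    }

    int main(void) {
        int candidates[] = {8, 16, 32, 64, 128};   /* illustrative tile sizes */
        int ncand = sizeof candidates / sizeof candidates[0];
        int best = candidates[0];
        double best_t = 1e30;

        for (int k = 0; k < ncand; k++) {
            clock_t t0 = clock();
            transpose_tiled(candidates[k]);
            double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
            printf("tile %3d: %.4f s\n", candidates[k], t);
            if (t < best_t) { best_t = t; best = candidates[k]; }
        }
        printf("selected tile size: %d\n", best);
        return 0;
    }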

13. Other Compiler Challenges
• Multicore chips used in scalable parallel machines
  • Multiple kinds of parallelism: on-chip, within an SMP group, distributed memory
• Heterogeneous multicore chips ("grid on a chip")
  • In the Intel roadmap
  • Challenge: decomposing computations to match the strengths of different cores
    • Static and dynamic strategies may be required
    • Performance models for subcomputations on different cores
• Interaction of heterogeneity and memory hierarchy
  • Staging computations through shared cache
  • Workflow steps running on different cores
• Component-composition programming environments
  • Graphical, or constructed from scripts

14. Compiler Infrastructure
• D System Infrastructure
  • Includes full dependence analysis
  • Support for high-level transformations (register, cache, fusion)
  • Support for parallelism and communication management
  • Originally used for HPF
• Telescoping Languages Infrastructure
  • Constructed for Matlab compilation and component integration
  • Constraint-based type analysis
    • Produces type-jump functions for libraries
  • Variant specialization and selection
  • Applied to the parallel Matlab project
• Both currently distributed under a BSD-style license (no GPL)
• Open64 Compiler Infrastructure
  • GPL license

15. Proposal
• An NSF Center for Research on Multicore Computing
  • Modeled after CRPC
• Core research program
  • Multiple participating institutions
  • Research
    • Compilers and tools
    • Architectural modifications, supported by simulation
    • Run-time systems and communication/synchronization
  • Driven by real applications from the NSF community
• Big community outreach program
  • Specific topical workshops
• Major investment from Intel
  • Coupled with the Multicore Computing Research Program
  • Designed to foster a vibrant community of researchers

16. Leverage
• DOE SciDAC Projects
  • Currently proposed: a major Enabling Technology Center
    • Kennedy: CScADS (includes infrastructure development)
  • Participants in several other relevant SciDAC efforts
    • PERC2, PModels
• LACSI Projects
  • Subject to ASC budget
• Chip Vendors
  • Intel, AMD, IBM (we have relationships with all)
  • Microsoft
• HPCS Collaborations
  • New languages and tools must run on systems using multicore chips
• New NSF Center?
  • Community development, as with CRPC
  • Major contribution from Intel
