
Center for Research on Multicore Computing (CRMC)


Presentation Transcript


1. Center for Research on Multicore Computing (CRMC): Overview
Ken Kennedy, Rice University
http://www.cs.rice.edu/~ken/Presentations/CRMC06.pdf

2. CRMC Overview
• Initial Participation from Three Institutions
  • Rice: Ken Kennedy, Keith Cooper, John Mellor-Crummey, Scott Rixner
  • Indiana: Geoffrey Fox, Dennis Gannon
  • Tennessee: Jack Dongarra
• Activities
  • Research and prototype development
  • Community building
  • Workshops and meetings
  • Other outreach components (separately funded)
• Planning and Management
  • Coordinated management and vision-building
  • Model: CRPC

3. Management Strategy
• Pioneered in CRPC and honed in GrADS/VGrADS/LACSI
• Leadership forms a broad vision-building team
  • Problem identification
  • Willingness to redirect research to address new challenges
• Complementary research areas
  • Willingness to look at problems from multiple dimensions
  • Joint projects between sites
• Community-building activities
  • Focused workshops on key topics
    • CRPC: TSPLib, BLACS/ScaLAPACK
    • LACSI: Autotuning
  • Informal standardization
    • CRPC: MPI and HPF
• Annual planning cycle
  • Plan, research, report, evaluate, …

4. Research Areas I
• Compilers and programming tools
  • Tools: performance analysis and prediction (HPCToolkit)
  • Transformations: memory hierarchy and parallelism
  • Automatic tuning strategies
• Programming models and languages
  • High-level languages: Matlab, Python, R, etc.
  • HPCS languages
  • Programming models based on component integration
• Run-time systems
  • Core run-time data movement library
  • Integration with MPI
• Libraries
  • Adaptive, reconfigurable libraries optimized for multicore systems

5. Research Areas II
• Applications for multicore systems
  • Classical parallel/scientific applications
  • Commercial applications, with advice from industrial partners
• Interface between software and architecture
  • Facilities for managing bandwidth (controllable caches, scratch memory)
  • Sample-based profiling facilities
  • Heterogeneous cores
• Fault tolerance
  • Redundant components
  • Diskless checkpointing
• Multicore emulator
  • Research platform for future systems

6. Performance Analysis and Prediction
• HPCToolkit (Mellor-Crummey)
  • Uses sample-based profiling combined with binary analysis to report performance issues (recompilation not required)
  • How can it be extended to the multicore environment?
• Performance prediction (Mellor-Crummey)
  • Currently uses a performance prediction methodology that accurately accounts for the memory hierarchy
    • Reuse-distance histograms built from training data, parameterized by input data size (see the sketch below)
    • Accurately determines the frequency of misses at each reference
  • Extension to shared-cache multicore systems is under way
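To make the reuse-distance idea concrete, here is a minimal, self-contained sketch (not HPCToolkit's implementation) that computes a reuse-distance histogram for a tiny address trace; the trace values and sizes are illustrative. On a fully associative LRU cache of C blocks, an access hits exactly when its reuse distance is less than C, which is what makes these histograms useful for miss prediction.

    /* Sketch: reuse-distance histogram for a small trace of block addresses.
     * The reuse distance of an access is the number of DISTINCT addresses
     * touched since the previous access to the same address.             */
    #include <stdio.h>

    #define TRACE_LEN 8

    int main(void) {
        int trace[TRACE_LEN] = {1, 2, 3, 1, 2, 4, 1, 3};  /* hypothetical trace */
        int hist[TRACE_LEN] = {0};   /* hist[d] = # accesses with reuse distance d */
        int cold = 0;                /* first-time (compulsory) accesses           */

        for (int i = 0; i < TRACE_LEN; i++) {
            int prev = -1;
            for (int j = i - 1; j >= 0; j--)          /* most recent previous use */
                if (trace[j] == trace[i]) { prev = j; break; }
            if (prev < 0) { cold++; continue; }

            int distinct = 0;                         /* distinct addresses between uses */
            for (int j = prev + 1; j < i; j++) {
                int seen = 0;
                for (int k = prev + 1; k < j; k++)
                    if (trace[k] == trace[j]) { seen = 1; break; }
                if (!seen) distinct++;
            }
            hist[distinct]++;
        }

        printf("cold accesses: %d\n", cold);
        for (int d = 0; d < TRACE_LEN; d++)
            if (hist[d]) printf("reuse distance %d: %d accesses\n", d, hist[d]);
        return 0;
    }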

7. Bandwidth Management
• Multicore raises computational power rapidly
  • Bandwidth onto the chip is unlikely to keep up
• Multicore systems will feature shared caches
  • Replaces false sharing with an enhanced probability of conflict misses
• Challenges for effective use of bandwidth
  • Enhancing reuse when multiple processors are using the cache
    • Reorganizing data to increase the density of cache-block use
    • Reorganizing computation to ensure reuse of data by multiple cores (a fusion sketch follows this list)
    • Inter-core pipelining
  • Managing conflict misses, with and without architectural help
    • Without architectural help: data reorganization within pages, and synchronization to minimize conflict misses
      • May require special memory-allocation run-time primitives
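As a small illustration of "reorganizing computation to ensure reuse", the sketch below shows classic loop fusion; the array names and sizes are placeholders. The point is only that the fused form brings each block of a[] and b[] on-chip once instead of twice, cutting bandwidth demand.

    #define N 1000000
    static double a[N], b[N], c[N];

    /* Unfused: a[] streams through the cache twice, and b[] is written and
     * then re-read after it has likely been evicted.                      */
    void unfused(void) {
        for (int i = 0; i < N; i++) b[i] = 2.0 * a[i];
        for (int i = 0; i < N; i++) c[i] = a[i] + b[i];
    }

    /* Fused: a[i] and b[i] are reused while still resident in cache/registers. */
    void fused(void) {
        for (int i = 0; i < N; i++) {
            b[i] = 2.0 * a[i];
            c[i] = a[i] + b[i];
        }
    }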

8. Conflict Misses
• Unfortunate fact: if a scientific calculation is sweeping across strips of more than k arrays on a machine with k-way associativity, and all of those strips overlap in one associativity group, then every access to a location in that group is a miss
• Example: three array strips, labeled 1, 2, and 3, mapping to the same group of a 2-way associative cache. On each outer loop iteration, 1 evicts 2, which evicts 3, which evicts 1, so every access is a miss.
• This limits loop fusion, a profitable reuse strategy (see the sketch below)
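The sketch below reconstructs this scenario under assumed parameters. Blocks whose addresses differ by a multiple of the "way size" (cache size divided by associativity; 64 KB here is a placeholder) fall in the same associativity group, so three strips allocated exactly one way size apart make a[i], b[i], and c[i] contend for the same group on every iteration.

    #include <stdlib.h>
    #include <string.h>

    #define WAY_SIZE (64 * 1024)     /* assumed: cache_size / associativity */

    int main(void) {
        /* One contiguous slab; a, b, c start exactly WAY_SIZE apart, so
         * a[i], b[i], c[i] always map to the same associativity group.   */
        double *slab = aligned_alloc(WAY_SIZE, 3 * (size_t)WAY_SIZE);
        if (!slab) return 1;
        memset(slab, 0, 3 * (size_t)WAY_SIZE);

        double *a = slab;
        double *b = slab + WAY_SIZE / sizeof(double);
        double *c = slab + 2 * WAY_SIZE / sizeof(double);
        int n = WAY_SIZE / sizeof(double);

        for (int i = 0; i < n; i++)   /* fused sweep over 3 conflicting strips  */
            c[i] = a[i] + b[i];       /* in a 2-way cache: every access misses  */

        free(slab);
        return 0;
    }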

9. Controlling Conflicts: An Example
• Cache and page parameters
  • 256K cache, 4-way set associative, 32-byte blocks
    • 1024 associativity groups
  • 64K page
    • 2048 cache blocks
  • Each block in a page maps to a single associativity group
    • 2 different lines in a page map to the same associativity group
• In general (a small sketch of the arithmetic follows this list)
  • Let A = number of associativity groups in the cache
  • Let P = number of cache blocks in a page
  • If P ≥ A, then each block in a page maps to a single associativity group, no matter where the page is loaded
  • If P < A, then a block can map to A/P different associativity groups, depending on where the page is loaded
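A minimal sketch of the arithmetic above. The page and block sizes match the slide; the group count A is passed in as an illustrative value rather than derived from a particular cache. The point is only that when P (blocks per page) is a multiple of A, the group a block maps to depends on its in-page offset alone, not on where the page is loaded.

    #include <stdio.h>

    /* Associativity group for the block at byte offset `off` within a page
     * loaded at physical page frame `frame`.  P = blocks per page.         */
    static unsigned group_of(unsigned frame, unsigned off,
                             unsigned page_bytes, unsigned block_bytes, unsigned A) {
        unsigned P = page_bytes / block_bytes;
        return (frame * P + off / block_bytes) % A;  /* global block number mod A */
    }

    int main(void) {
        unsigned page = 64 * 1024, block = 32;  /* page and block size from the slide */
        unsigned A = 1024;                      /* illustrative group count, A <= P   */
        unsigned off = 5 * block;               /* the 6th block of the page          */

        /* P = 2048 is a multiple of A, so the group is the same for any frame. */
        printf("frame 0 -> group %u, frame 7 -> group %u\n",
               group_of(0, off, page, block, A),
               group_of(7, off, page, block, A));
        return 0;
    }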

10. Questions
• Can we do data allocation precisely within a page so that conflict misses are minimized for a given computation? (A padding sketch follows this list.)
  • Extensive work exists on minimizing self-conflict misses
  • Little work on inter-array conflict minimization
  • No work, to my knowledge, on interprocessor conflict minimization
• Can we synchronize computations so that multiple cores do not interfere with one another?
  • Or even reuse blocks across processors?
• Might it be possible to convince vendors to provide additional features to help control the cache, particularly conflict misses?
  • Allocation of part of the cache as a scratchpad
  • Dynamic modification of the cache mapping
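One hypothetical illustration of intra-page placement for inter-array conflict minimization: if three equally sized strips would otherwise conflict element-for-element (because the way size divides the strip length), inserting one cache block of pad between them shifts each strip's blocks into a neighboring associativity group. The struct layout, block size, and strip length below are assumptions, not a scheme from the talk.

    #include <stdio.h>
    #include <stddef.h>

    #define BLOCK 32        /* assumed cache block size in bytes       */
    #define N     8192      /* strip length; 8192 doubles = 64 KB each */

    /* Without the pads, b and c would start an exact multiple of 64 KB after a;
     * if the way size divides 64 KB, a[i], b[i], c[i] would share a group.
     * Each one-block pad shifts the following strip by one group.            */
    struct padded_strips {
        double a[N];
        char   pad_ab[BLOCK];
        double b[N];
        char   pad_bc[BLOCK];
        double c[N];
    };

    int main(void) {
        printf("offset of b: %zu bytes, offset of c: %zu bytes\n",
               offsetof(struct padded_strips, b),
               offsetof(struct padded_strips, c));
        return 0;
    }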

11. Parallelism
• On a shared-cache multicore chip, running the same program on multiple processors has a major advantage
  • Possibility of reusing cache blocks across processors
  • Some chance of controlling conflict misses
• How can parallelism be found and exploited?
  • Automatic methods on scientific languages (much progress was made in the 90s)
  • Explicit parallel programming and thread-management paradigms
    • Data parallel (HPF, Chapel)
    • Partitioned global address space (Co-Array Fortran, UPC)
    • Lightweight threading (OpenMP, CCR; see the sketch below)
    • Software synchronization primitives
  • Integration of parallel component libraries
    • Telescoping languages
    • Parallel Matlab
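A minimal sketch of the lightweight-threading style listed above, using OpenMP (one of the models the slide names); the kernel and array sizes are illustrative. The statically scheduled loop gives each thread one contiguous chunk of the index space, which is the kind of placement the earlier slides want to exploit on a shared cache.

    #include <stdio.h>
    #include <omp.h>

    #define N 1000000
    static double x[N], y[N];

    int main(void) {
        /* Each thread works on a contiguous chunk; on a shared-cache chip,
         * neighboring chunks can share boundary cache blocks.             */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < N; i++)
            y[i] = 2.0 * x[i] + y[i];

        printf("used up to %d threads\n", omp_get_max_threads());
        return 0;
    }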

12. Automatic Tuning
• Following ATLAS
  • Tuning generalized component libraries in advance
    • For different platforms
    • For different contexts on the same platform
      • May wish to choose a variant that uses a subset of the cache that does not conflict with the calling program
• Extensive work at Rice and Tennessee
  • Heuristic search combined with compiler models cuts tuning time
  • Many transformations: unroll-and-jam, tiling, fusion, etc.
    • These interact with one another
• New challenges for multicore (a tuning-loop sketch follows this list)
  • Tuning on-chip multiprocessors to use the shared (and non-shared) memory hierarchy effectively
  • Management of on-chip parallelism and threading
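The sketch below illustrates only the empirical-search side of ATLAS-style tuning, under simple assumptions: it times a tiled transpose kernel (a stand-in for a library variant) over a handful of candidate tile sizes and keeps the fastest. Real autotuners prune this search with compiler models, as the slide notes.

    #include <stdio.h>
    #include <time.h>

    #define N 512
    static double A[N][N], B[N][N];

    static void transpose_tiled(int tile) {
        for (int ii = 0; ii < N; ii += tile)
            for (int jj = 0; jj < N; jj += tile)
                for (int i = ii; i < ii + tile && i < N; i++)
                    for (int j = jj; j < jj + tile && j < N; j++)
                        B[j][i] = A[i][j];
    }

    int main(void) {
        int candidates[] = {8, 16, 32, 64, 128};   /* illustrative tile sizes */
        int ncand = sizeof candidates / sizeof candidates[0];
        int best = candidates[0];
        double best_t = 1e30;

        for (int k = 0; k < ncand; k++) {
            clock_t t0 = clock();
            transpose_tiled(candidates[k]);
            double t = (double)(clock() - t0) / CLOCKS_PER_SEC;
            printf("tile %3d: %.4f s\n", candidates[k], t);
            if (t < best_t) { best_t = t; best = candidates[k]; }
        }
        printf("selected tile size: %d\n", best);
        return 0;
    }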

13. Other Compiler Challenges
• Multicore chips used in scalable parallel machines
  • Multiple kinds of parallelism: on-chip, within an SMP group, distributed memory
• Heterogeneous multicore chips ("grid on a chip")
  • In the Intel roadmap
  • Challenge: decomposing computations to match the strengths of different cores
    • Static and dynamic strategies may be required
    • Performance models for subcomputations on different cores
• Interaction of heterogeneity and memory hierarchy
  • Staging computations through shared cache
  • Workflow steps running on different cores
• Component-composition programming environments
  • Graphical, or constructed from scripts

14. Compiler Infrastructure
• D System Infrastructure
  • Includes full dependence analysis
  • Support for high-level transformations (register, cache, fusion)
  • Support for parallelism and communication management
  • Originally used for HPF
• Telescoping Languages Infrastructure
  • Constructed for Matlab compilation and component integration
  • Constraint-based type analysis
    • Produces type-jump functions for libraries
  • Variant specialization and selection
  • Applied to the parallel Matlab project
• Both currently distributed under a BSD-style license (no GPL)
• Open64 Compiler Infrastructure
  • GPL license

15. Proposal
• An NSF Center for Research on Multicore Computing
  • Modeled after CRPC
• Core research program
  • Multiple participating institutions
  • Research
    • Compilers and tools
    • Architectural modifications, supported by simulation
    • Run-time systems and communication/synchronization
  • Driven by real applications from the NSF community
• Big community outreach program
  • Specific topical workshops
• Major investment from Intel
  • Coupled with the Multicore Computing Research Program
  • Designed to foster a vibrant community of researchers

16. Leverage
• DOE SciDAC Projects
  • Currently proposed: a major Enabling Technology Center
    • Kennedy: CScADS (includes infrastructure development)
  • Participants in several other relevant SciDAC efforts
    • PERC2, PModels
• LACSI Projects
  • Subject to ASC budget
• Chip Vendors
  • Intel, AMD, IBM (we have relationships with all)
  • Microsoft
• HPCS Collaborations
  • New languages and tools must run on systems using multicore chips
• New NSF Center?
  • Community development, as with CRPC
  • Major contribution from Intel
