Software Enablement for Multicore Architectures
David Bernstein, Bilha Mendelson
Bernstn@il.ibm.com, bilha@il.ibm.com
Technology Scaling - We've Hit The Wall
[Figure: Relative Device Performance versus Year, 1988-2012. Labeled technologies: Conventional Bulk CMOS, SOI (silicon-on-insulator), High mobility, Double-Gate, followed by "?".]
Has This Ever Happened Before?
[Figure: Watts / cm2 versus year, 1950-2010. Series: Bipolar, CMOS. Labeled points: Vacuum, IBM 360, IBM 370, IBM 3033, IBM 3081, Fujitsu M380, IBM 4381, CDC Cyber 205, IBM 3090, Fujitsu M-780, NTT, IBM 3090S, Fujitsu VP2000, IBM ES9000, Pentium II (DSIP), Merced, Apache, IBM RY4, IBM RY5, IBM RY6, IBM RY7, Pulsar, Pentium 4, IBM GP, followed by "?". Annotation: Start of Water Cooling.]
Source: Bernie Meyerson, IBM
Industry trends
• Sun's 8-core chip: T1 (Niagara)
• Intel Quad-Core
• Cell Broadband Engine
Hierarchy of Modular Building Blocks
• Systems will increasingly need to implement a hybrid execution model
• New programming systems need to reduce the need for programmer awareness of the topology on which their program executes
• The next-gen programming system must support programming simplicity while leveraging the performance of the underlying HW topology
Grid/Cluster, Rack
• Hierarchical SMP servers with non-uniform memory access (NUMA) characteristics
Board
• Homogeneous SMP on board
• 2-128 HW contexts on board
• Main processor(s) with accelerator(s)
• Master-slave relationship between entities
Chip
• Heterogeneous collection of processors on chip
• Heterogeneity at data and control flow level
• Homogeneous SMP on chip
• 2-32 HW contexts on chip
• Various forms of resource sharing
Core
• A core will support multiple HW threads sharing a single cache, exhibiting SMP characteristics
[Diagram labels: High Speed Network, SMP Interconnect, Memory, I/O Attach, Cache, Interconnect Fabric, MemCtrl, Core]
Architecture trends
• Several processor cores on a chip and specialized computing engines
  • XML processing, cryptography, graphics
• Questions:
  • how to interconnect a large number of processor cores
  • how to provide sufficient memory bandwidth
  • how to structure the multilevel caching subsystem
  • how to balance the general-purpose computing resources with specialized processing engines and all the supporting memory, caching and interconnect structure, given a constant power budget
• Software development processes
  • how to program for multicore architectures
  • how to test and evaluate the performance of multithreaded applications
Programming multiprocessor systems
• Two main directions:
  • explicit manual programming
  • exploiting the combination of compiler optimization, build tool chains, and run-time subsystems
• In the HPC and embedded communities:
  • the emphasis was more on explicit manual programming and on special resources used by expert programmers
  • this resulted in numerous home-grown language directives and extensions, internal tools, and obscure run-time systems
  • the results are hardly portable to new generations of hardware
Programming languages
• Very few new languages were invented in the last 2 decades
  • Java - virtual machine, interpreter, JIT, garbage collection, set of libraries, etc.
• Can multicore spur development of a new language/environment for parallelism?
  • map-reduce, Cilk, UPC, X10, and STAPL
  • programmers can provide additional information related to parallelism
• Multicores provide multiple types of parallelism
  • thread-level parallelism (TLP) - coarse-grain
    • OpenMP - standard for shared-memory models
    • MPI - standard for distributed-memory models
    • pthreads, Java threads - used explicitly
    • automatic parallelization optimizations
    • most of the original auto-parallelizing compilers focused on FORTRAN
  • data-level parallelism (DLP) - fine-grain
    • auto-vectorization, auto-simdification
• What about asymmetric multicore architectures (like the Cell processor)?
  • is it possible to have a single-source compilation for multiple ISAs? - initial attempts…
  • how OpenMP can be used for such programs - streaming
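As an illustration of the coarse-grain, map-reduce style of thread-level parallelism listed above, here is a minimal sketch in Python. It is not taken from the talk: the function names (`map_chunk`, `merge`, `word_count`) and the word-count task are assumptions chosen for illustration. In standard CPython, threads mainly show the structure of the decomposition; for CPU-bound work a process pool would normally be substituted.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def map_chunk(lines):
    """Map step: count the words in one independent chunk of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def merge(left, right):
    """Reduce step: combine two partial word counts into a new one."""
    return left + right


def word_count(lines, workers=4):
    """Split the input into chunks, map them on a thread pool, then reduce."""
    chunk_size = max(1, (len(lines) + workers - 1) // workers)
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce(merge, partials, Counter())


if __name__ == "__main__":
    text = ["the cores share a cache", "the cache is shared", "cores cores"]
    print(word_count(text, workers=2).most_common(3))
```

The programmer supplies only the map and reduce functions; chunking and scheduling are left to the run-time, which is the kind of additional parallelism information the slide refers to.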
Performance Analysis Tools
• Profile-based tools - data aggregation
  • FDPR-Pro, Code Analyzer, Diablo
• Performance evaluation is heavily influenced by thread interaction
  • stalls, locks, races, memory thrashing, polluted hardware counters
• Trace-based analysis and visualization
  • introduces timeline views and data to deal with communication issues
  • lack of scalability:
    • traces tend to grow fast, making them difficult to manipulate and visualize
    • in the HPC context: selecting an arbitrary subset of cores/threads and arbitrary time intervals
  • tracing might disturb the program's behavior
  • HPCToolkit, TAU, Paraver, VTune, Code Analyzer, PDT, Trace Analyzer
• Lack of determinism
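The difference between a trace and an aggregated profile, and the idea of selecting a subset of threads and a time interval, can be sketched as follows. This is a hypothetical, simplified trace format invented for illustration (the `Event` record and the `wait_begin`/`wait_end` event kinds are assumptions); it does not reflect the file format of any of the tools named above.

```python
from collections import defaultdict
from typing import NamedTuple


class Event(NamedTuple):
    """One record of a hypothetical trace."""
    time: float   # timestamp in seconds
    thread: str   # thread identifier
    kind: str     # "wait_begin" or "wait_end" (lock wait boundaries)


def select(events, threads=None, start=float("-inf"), end=float("inf")):
    """Keep only events of the chosen threads inside [start, end]."""
    return [
        e for e in events
        if (threads is None or e.thread in threads) and start <= e.time <= end
    ]


def wait_time_per_thread(events):
    """Aggregate the trace into a profile: total lock-wait time per thread."""
    pending = {}                 # thread -> time its current wait began
    totals = defaultdict(float)
    for e in sorted(events, key=lambda e: e.time):
        if e.kind == "wait_begin":
            pending[e.thread] = e.time
        elif e.kind == "wait_end" and e.thread in pending:
            totals[e.thread] += e.time - pending.pop(e.thread)
    return dict(totals)


if __name__ == "__main__":
    trace = [
        Event(0.0, "t1", "wait_begin"), Event(2.0, "t1", "wait_end"),
        Event(1.0, "t2", "wait_begin"), Event(4.0, "t2", "wait_end"),
        Event(5.0, "t1", "wait_begin"), Event(6.0, "t1", "wait_end"),
    ]
    print(wait_time_per_thread(trace))
    print(wait_time_per_thread(select(trace, threads={"t1"}, end=3.0)))
```

The aggregated totals lose the timeline, which is why trace views are needed for communication issues; the `select` step is the cheap way to keep a growing trace manageable.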
Performance tools for multi-core: Cell
Visual Performance Analyzer 5.0, Cell SDK 3.0
• Infrastructure for collecting profiles on several systems
• Infrastructure for using databases for large data sets
• Set of interconnected views
• Cell support
[Diagram labels: LockAnalyzer, PDT, ProfileAnalyzer, CodeAnalyzer, PipelineAnalyzer, TraceAnalyzer]
• Infrastructure for collecting traces on SDK 3.0 libraries
• Analysis of lock usage
• Input for Trace Analyzer
Debugging and testing tools
• Concurrency problems constitute about 10% of the bugs
• Bugs like crashes (races) or freezes (deadlocks) stay in the application, reducing the up-time
• Concurrency testing is done during load testing - very late in the process
• We have been working on a tool-supported methodology
  • try to find the concurrency issues as early as possible:
    • teach how to write concurrent code
      • concurrent bug patterns
      • explain the concurrent programming constructs
      • teach general concurrency design patterns
    • reviews - developed a specialized review technique for concurrent code
    • teach how to do unit testing - developed synchronization coverage
  • ConTest - a tool-supported method for measuring contention
    • make tests more likely to exhibit bugs - by changing the internal timing
  • Tools for pinpointing locations of bugs
    • applicable if we have a test that causes the application to fail some of the time
  • healing bugs so that the impact will not be seen
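The idea of changing the internal timing so that a test is more likely to expose a race can be sketched as below. This is an illustrative Python sketch only and is not ConTest or its API: `SharedCounter`, `sleep_noise` and `run` are hypothetical names, and the "instrumentation point" is placed by hand between the read and the write of a shared variable.

```python
import random
import threading
import time


class SharedCounter:
    """Counter whose increment is a non-atomic read-modify-write."""

    def __init__(self, noise=None, use_lock=False):
        self.value = 0
        self._noise = noise or (lambda: None)
        self._lock = threading.Lock() if use_lock else None

    def _unsafe_increment(self):
        current = self.value       # read
        self._noise()              # instrumentation point: perturb the timing
        self.value = current + 1   # write (may overwrite another thread's update)

    def increment(self):
        if self._lock is not None:
            with self._lock:
                self._unsafe_increment()
        else:
            self._unsafe_increment()


def sleep_noise(probability=0.5, max_delay=0.001, seed=0):
    """Build a noise function that sometimes sleeps for a short random time."""
    rng = random.Random(seed)

    def noise():
        if rng.random() < probability:
            time.sleep(rng.uniform(0, max_delay))

    return noise


def run(counter, threads=4, increments=50):
    """Increment the counter from several threads and return the final value."""
    def work():
        for _ in range(increments):
            counter.increment()

    workers = [threading.Thread(target=work) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return counter.value


if __name__ == "__main__":
    print("racy:  ", run(SharedCounter(noise=sleep_noise())))
    print("locked:", run(SharedCounter(noise=sleep_noise(), use_lock=True)))
```

With the lock, the total always equals threads × increments. Without it, injected delays widen the window between read and write, so lost updates become far more likely in a short unit test; the exact racy result varies from run to run.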
Software trends
• Software enablement system for multicores
  • Various directions for providing solutions
• Active area of research
  • only some early results in the academic and industrial worlds in terms of established standards and technology
  • much more will evolve in the years to come
• Need:
  • programming models and compiler support for multicores
  • performance evaluation tools
  • testing and debugging tools