Roberto Vaccaro & Lorenzo Verdoscia Institute for High Performance Computing and Networking

Roberto Vaccaro & Lorenzo Verdoscia, Institute for High Performance Computing and Networking, National Research Council – Italy, roberto.vaccaro@na.icar.cnr.it
Programming Models and Architectures for ManyCore Systems: Challenges and Opportunities for the next 10 years.
Workshop December 19, Napoli - Italy R. Vaccaro & L. Verdoscia Programming Models and Architectures for……

CNR Bioinformatics
Introduction
■ The computational and storage needs of workloads in several areas, such as life science, are growing exponentially.
■ Overcoming heterogeneity and computing barriers.
■ The scientist should be allowed to look at the data easily, wherever it may be, with sufficient processing power for any desired algorithm to process it.

■ In life science, the scientist's requirements concern a range of different scales, from the local parallel component processor to the global architectural level of the cross-organizational grid.
■ Integrated solutions capable of facing the problems at the different architectural levels are needed.

Figure: levels of the system hierarchy and their interconnects.
Networks: Wide Area Network, Local Area Network, System Level Network, Network on Chip.
Systems: Grid of Clusters, Cluster, Commodity Machine, Microprocessor.
■ ManyCore Chip
■ Photonic Networks for intra-chip, inter-chip, box interconnects
(*) T. Agerwala, M. Gupta, “Systems research challenges: A scale-out perspective”, IBM Journal of Research & Development, Vol. 50, No. 2/3, March/May 2006, pp. 173-180

■ An ensemble of N nodes, each comprising p computing elements.
■ The p elements are tightly bound through shared memory (e.g., SMP, DSM).
■ The N nodes are loosely coupled, i.e., they use distributed memory.
■ p is greater than N.
■ The distinction is which layer gives us the most power through parallelism.


■ Grids are built over wide-area networks and across organisational boundaries.
■ Network latency is not expected to improve (further).
The synchronous approach that currently prevails in distributed programming (using RPC primitives, for example) will have to be replaced with an asynchronous programming approach that is more
- delay-tolerant
- failure-resilient
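The contrast with blocking RPC can be sketched in a few lines of Python using the standard `asyncio` module. This is only an illustration under stated assumptions: `call_remote`, `call_with_retry`, the site names, and the delays are hypothetical stand-ins, and the network is simulated with `asyncio.sleep` rather than a real grid service.

```python
import asyncio


async def call_remote(name: str, delay: float, fail: bool = False) -> str:
    """Hypothetical stand-in for a remote service call over a wide-area network."""
    await asyncio.sleep(delay)  # simulated network latency
    if fail:
        raise ConnectionError(f"{name} unreachable")
    return f"{name}: ok"


async def call_with_retry(name: str, delay: float, failures: int,
                          retries: int = 3, timeout: float = 1.0) -> str:
    """Failure resilience: bounded retries, each attempt with its own timeout.

    `failures` is the number of initial attempts that are made to fail.
    """
    for attempt in range(retries):
        try:
            return await asyncio.wait_for(
                call_remote(name, delay, fail=attempt < failures), timeout)
        except (ConnectionError, asyncio.TimeoutError):
            continue  # try again instead of blocking the whole computation
    return f"{name}: gave up"


async def main() -> list:
    # Delay tolerance: all requests are in flight at once, so the total time
    # tracks the slowest call rather than the sum of all calls.
    return await asyncio.gather(
        call_with_retry("site-a", 0.01, failures=0),
        call_with_retry("site-b", 0.02, failures=1),
        call_with_retry("site-c", 0.01, failures=5),
    )


if __name__ == "__main__":
    print(asyncio.run(main()))
```

In this sketch a site that keeps failing degrades to a "gave up" result instead of stalling the other requests, which is the behaviour a synchronous call chain cannot easily provide.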

■ A first step in that direction:
- peer-to-peer (P2P) architectures
- service-oriented architectures (SOA)
capable of supporting reuse of both functionality and data.
■ Using P2P architectures and protocols it is possible to
- realize distributed systems without any centralized control or hierarchical organisation,
- achieve scalable and reliable location and exchange of scientific data and software in a decentralised manner.
■ Service-oriented architectures (SOA) and the web-service infrastructures that assist in their implementation facilitate reuse of functionality.
(*) G. Kandaswamy et al., “Building Web Services for Scientific Grid Applications”, IBM Journal of Research & Development, Vol. 50, No. 2/3, March/May 2006, pp. 249-260

■ The fundamental primitive of the SOA infrastructure is the ability to locate and invoke a service across machine and organisational boundaries, both synchronously and asynchronously.
■ Computational scientists will be able to flexibly orchestrate SOA services into computational workflows.

■ Appropriate programming language abstractions for science have to be provided.
■ Fortran and the Message Passing Interface (MPI) are no longer appropriate for the architecture described above.
■ By using abstract machines it is possible to mix compilation and interpretation, as well as to integrate code written in different languages seamlessly into an application or service.

A viable approach
■ Define a Multilevel Integrated Programming Model.
■ Explore the management of concurrency in processor design on a range of different scales:
- from instructions to programs,
- from microgrids to global grids.
■ Evaluate the possibility and the ways of implementing an integrated H/W and S/W system capable of giving the right answer in terms of:
- Inter/intra processor latency.
- A more delay-tolerant and failure-resilient programming approach.
- Capability of data and functionality reuse at the global architecture level (distributed, cross-organisational).
- Capability to take advantage of parallel and distributed resources.

Introduction
By Little’s law, the amount of concurrency needed to hide the latency of memory accesses will continue to increase as the gap between memory and processor speed grows. Since the memory latency is improving at a rate of only roughly 6% each year, the gap is projected to continue growing even as the increase in processor speed decreases from the historic rate of about 60% each year to about 20% each year.
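A small worked example may help make the argument concrete. Little's law states that the number of requests in flight equals throughput times latency. The sketch below applies it to memory accesses and compounds the two annual rates quoted above. The function names and the 200-cycle, one-access-per-cycle figures are illustrative assumptions, and the 6% figure is read here as a 6% yearly speed-up of memory.

```python
def required_concurrency(latency_cycles: float, accesses_per_cycle: float) -> float:
    """Little's law: requests in flight = throughput x latency."""
    return accesses_per_cycle * latency_cycles


def gap_growth(years: int, cpu_rate: float = 0.20, mem_rate: float = 0.06) -> float:
    """Factor by which the processor/memory speed gap widens after `years`.

    Assumes both rates compound annually (an interpretation, not a measurement).
    """
    return ((1 + cpu_rate) / (1 + mem_rate)) ** years


if __name__ == "__main__":
    # Illustrative figures only: 200-cycle latency, one access per cycle sustained.
    print("accesses in flight needed:", required_concurrency(200, 1.0))
    for rate in (0.60, 0.20):
        print(f"cpu +{rate:.0%}/year, gap factor after 10 years:",
              round(gap_growth(10, cpu_rate=rate), 2))
```

Because 1.20 / 1.06 is still greater than 1, the gap keeps widening at the lower processor rate, only more slowly, so the concurrency required to hide memory latency keeps rising.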

Computer hardware industry
In 2005 the computer hardware industry made a historic change of direction.
● The major microprocessor companies all announced that future products would be single-chip multiprocessors.
● Future performance improvements would rely on
○ software-specified parallelism rather than
○ additional software-transparent parallelism extracted automatically by the microarchitecture.

■ It is significant that a multibillion-dollar industry has bet its future on solving the general-purpose parallel computing problem, even though so many have previously attempted but failed to provide a satisfactory approach.
■ In order to tackle the parallel processing problem, innovative solutions are urgently needed, which in turn require extensive codevelopment of hardware and software.

■ Advances in integrated circuit technology pose new challenges about how to implement a high-performance application for low power dissipation on processors made up of hundreds of cores running at 200 MHz, rather than on one traditional processor running at 20 GHz.
■ The high-performance and embedded industries are converging.

Multicore or Manycore?
■ Multicore will obviously help multiprogrammed workloads, which contain a mix of independent sequential tasks, but how will individual tasks become faster?
■ Switching from sequential to modestly parallel computing will make programming much more difficult without rewarding this greater effort with a dramatic improvement in power-performance.
■ Multicore is unlikely to be the ideal answer, and sneaking up on the problem of parallelism via multicore solutions is likely to fail.

■ We desperately need a new solution for parallel hardware and software.
■ Compatibility with old binaries and C programs is valuable to industry, and some researchers are trying to help multicore product plans succeed.
■ We have been thinking bolder thoughts. Our aim is to realize thousands of processors on a chip for new applications, and we welcome new programming models and new architectures if they simplify the efficient programming of such highly parallel systems.
■ Rather than multicore, we are focused on “manycore”.

■ Between February 2005 and December 2006 a group of researchers at the University of California at Berkeley from many backgrounds (circuit design, computer architecture, massively parallel computing, computer-aided design, embedded h/w and s/w, programming languages, compilers, scientific programming and numerical analysis) met to discuss parallelism from these many angles.
■ The result of borrowing the good ideas regarding parallelism from different disciplines is the report:
“The Landscape of Parallel Computing Research: A View from Berkeley”
Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, Katherine A. Yelick
Electrical Engineering and Computer Sciences, University of California at Berkeley
Technical Report No. UCB/EECS-2006-183, December 18, 2006
http://www.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-183.html


The Landscape
■ Seven critical questions are used to frame the landscape of parallel computing research:
1. What are the applications?
2. What are common kernels of the applications?
3. What are the hardware building blocks?
4. How to connect them?
5. How to describe applications and kernels?
6. How to program the hardware?
7. How to measure success?
■ The report does not have the answers:
- on some questions, non-conventional and provocative perspectives are offered;
- on others, seemingly obvious but sometimes-neglected perspectives are stated.

Embedded versus High Performance Computing
The two have more in common looking forward than they did in the past:
• Both are concerned with power, whether it is battery life for cell phones or the cost of electricity and cooling in a data center.
• Both are concerned with hardware utilization. Embedded systems are always sensitive to cost, but efficient use of hardware is also required when you spend $10M to $100M for high-end servers.
• As the size of embedded software increases over time, the fraction of hand tuning must be limited and so the importance of software reuse must increase.
• Since both embedded and high-end servers now connect to networks, both need to prevent unwanted accesses and viruses.

■ The biggest difference between the two targets is the traditional emphasis on real-time computing in embedded systems, where the computer and the program need to be just fast enough to meet the deadlines, and there is no benefit to running faster.
■ Running faster is usually valuable in server computing.
■ As server applications become more media-oriented, real time may become more important for server computing as well.

Information Society Technologies (IST) Network of Excellence on High Performance Embedded Architectures and Compilers (HiPEAC)
Mateo Valero (UPC Barcelona), HiPEAC Coordinator, introducing the publication of the first HiPEAC research roadmap (*), wrote:
“From the document it is clear that there are many challenges ahead of us in the design of future high-performance embedded systems. Some of them are familiar such as the memory wall, the power problem, and the interconnection bottleneck. Others are new like the proper support for reconfigurable components, fast simulation techniques for multi-core systems, new programming paradigms for parallel programming.”
(*) K. De Bosschere, W. Luk, X. Martorell, N. Navarro, M. O’Boyle, D. Pnevmatikatos, A. Ramirez, P. Sainrat, A. Seznec, P. Stenström, and O. Temam, “High-Performance Embedded Architecture and Compilation Roadmap”, Transactions on HiPEAC I, Lecture Notes in Computer Science 4050, pp. 5-29, Springer-Verlag, 2007

Parallelism
For at least three decades the promise of parallelism has fascinated researchers.
■ In the past, parallel computing efforts have shown promise and gathered investment, but in the end, uniprocessor computing always prevailed.
■ This time, general-purpose computing is taking an irreversible step toward parallel architectures.
● This shift toward increasing parallelism is not a triumphant stride forward based on breakthroughs in novel software and architectures for parallelism.
● This plunge into parallelism is actually a retreat from even greater challenges that thwart efficient silicon implementation of traditional uniprocessor architectures.

CW in Computer Architecture
Old & New Conventional Wisdom (CW) in Computer Architecture: guiding principles illustrating how everything is changing in computing.
1. Old CW: Power is free, but transistors are expensive.
▪ New CW is the “Power wall”: Power is expensive, but transistors are “free”. That is, we can put more transistors on a chip than we have the power to turn on.
2. Old CW: If you worry about power, the only concern is dynamic power.
▪ New CW: For desktops and servers, static power due to leakage can be 40% of total power.
3. Old CW: Monolithic uniprocessors in silicon are reliable internally, with errors occurring only at the pins.
▪ New CW: As chips drop below 65 nm feature sizes, they will have high soft and hard error rates.
4. Old CW: By building upon prior successes, we can continue to raise the level of abstraction and hence the size of hardware designs.
▪ New CW: Wire delay, noise, cross coupling (capacitive and inductive), manufacturing variability, reliability, clock jitter, design validation, and so on conspire to stretch the development time and cost of large designs at 65 nm or smaller feature sizes.

5. Old CW: Researchers demonstrate new architecture ideas by building chips.
▪ New CW: The cost of masks at 65 nm feature size, the cost of Electronic Computer Aided Design software to design such chips, and the cost of design for GHz clock rates mean researchers can no longer build believable prototypes. Thus, an alternative approach to evaluating architectures must be developed.
6. Old CW: Performance improvements yield both lower latency and higher bandwidth.
▪ New CW: Across many technologies, bandwidth improves by at least the square of the improvement in latency.
7. Old CW: Multiply is slow, but load and store is fast.
▪ New CW is the “Memory wall”: Load and store is slow, but multiply is fast. Modern microprocessors can take 200 clocks to access Dynamic Random Access Memory (DRAM), but even floating-point multiplies may take only four clock cycles.
8. Old CW: We can reveal more instruction-level parallelism (ILP) via compilers and architecture innovation. Examples from the past include branch prediction, out-of-order execution, speculation, and Very Long Instruction Word systems.
▪ New CW is the “ILP wall”: There are diminishing returns on finding more ILP.

9. Old CW: Uniprocessor performance doubles every 18 months.
▪ New CW is Power Wall + Memory Wall + ILP Wall = Brick Wall. In 2006, performance is a factor of three below the traditional doubling every 18 months that we enjoyed between 1986 and 2002. The doubling of uniprocessor performance may now take 5 years.
10. Old CW: Don’t bother parallelizing your application, as you can just wait a little while and run it on a much faster sequential computer.
▪ New CW: It will be a very long wait for a faster sequential computer.
11. Old CW: Increasing clock frequency is the primary method of improving processor performance.
▪ New CW: Increasing parallelism is the primary method of improving processor performance.
12. Old CW: Less than linear scaling for a multiprocessor application is failure.
▪ New CW: Given the switch to parallel computing, any speedup via parallelism is a success.


Uniprocessor Performance (SPECint)
From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006
• Sea change in chip design: multiple “cores” or processors per chip
• VAX: 25%/year 1978 to 1986
• RISC + x86: 52%/year 1986 to 2002
• RISC + x86: ??%/year 2002 to present
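The annual growth rates above can be turned into doubling times with one line of arithmetic. The sketch below assumes simple compound growth; `doubling_time_years` is a name introduced here for illustration, not something taken from the cited book.

```python
import math


def doubling_time_years(annual_growth: float) -> float:
    """Years needed to double performance at a constant compound annual growth rate."""
    return math.log(2) / math.log(1 + annual_growth)


if __name__ == "__main__":
    for label, rate in (("VAX, 1978 to 1986", 0.25),
                        ("RISC + x86, 1986 to 2002", 0.52)):
        years = doubling_time_years(rate)
        print(f"{label}: doubles every {years:.2f} years ({years * 12:.0f} months)")
```

At 52% per year the doubling time is a little under 20 months, close to the "doubles every 18 months" rule of thumb, while 25% per year corresponds to roughly three years.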

The State of Hardware
■ The analysis based on the CW pairs paints a negative picture of the state of hardware.
■ There are compensating positives as well:
● Moore’s Law continues: it will soon be possible to put thousands of simple processors on a single, economical chip;
● Communication between these processors within a chip can have very low latency and very high bandwidth;
● Monolithic manycore microprocessors
- represent a very different design point from traditional multichip multiprocessors,
- provide promise for the development of new architectures and programming models.

Applications and Dwarfs
■ Mining the parallelism experience of the high-performance computing community to see if there are lessons we can learn for a broader view of parallel computing. The hypothesis
● is not that traditional scientific computing is the future of parallel computing,
● is that the body of knowledge created in building programs that run well on massively parallel computers may prove useful in parallelizing future applications.
■ Many of the authors from other areas, such as embedded computing, were surprised at how closely future applications in their domain mapped to problems in scientific computing.
■ The conventional way to guide and evaluate architecture innovation is to study a benchmark suite based on existing programs, such as EEMBC (Embedded Microprocessor Benchmark Consortium), SPEC (Standard Performance Evaluation Corporation), or SPLASH (Stanford Parallel Applications for Shared Memory).

■ It is currently unclear how best to express a parallel computation: a very big obstacle to innovation in parallel computing.
■ It seems unwise to let a set of existing source code drive an investigation into parallel computing.
■ There is a need to find a higher level of abstraction for reasoning about parallel application requirements.
■ The main aim is to delineate application requirements in a manner that is not overly specific to individual applications or to the optimizations used for certain hardware platforms.
■ This makes it possible to draw broader conclusions about hardware requirements.
■ The approach is to define a number of “Dwarfs”, each of which captures a pattern of computation and communication common to a class of important applications.

■ Phil Colella identified seven numerical methods that he believed will be important for science and engineering for at least the next decade.
■ The Seven Dwarfs
● constitute classes where membership in a class is defined by similarity in computation and data movement,
● are specified at a high level of abstraction to allow reasoning about their behavior across a broad range of applications.


Seven Dwarfs, their descriptions, corresponding NAS benchmarks, and example computers.


Extensions to the original Seven Dwarfs.

Recognition, Mining, Synthesis (RMS)
Intel “Era of Tera” Computation Categories
Intel’s RMS and how it maps down to functions that are more primitive. Of the five categories at the top of the figure, Computer Vision is classified as Recognition, Data Mining is Mining, and Rendering, Physical Simulation, and Financial Analytics are Synthesis. [Chen 2006]

Parallel Programming Models
Comparison of 10 current parallel programming models for 5 critical tasks, sorted from most explicit to most implicit. High-performance computing applications [Pancake and Bergmark 1990] and embedded applications [Shah et al 2004a] suggest these tasks must be addressed one way or the other by a programming model:
1) Dividing the application into parallel tasks;
2) Mapping computational tasks to processing elements;
3) Distribution of data to memory elements;
4) Mapping of communication to the interconnection network; and
5) Inter-task synchronization.
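To make the five tasks concrete, here is a minimal sketch showing where each one appears in a small parallel sum written with Python's standard `concurrent.futures` module. The function `parallel_sum` and its chunking scheme are assumptions made for illustration; they are not one of the ten models compared on the slide. In this style tasks 1 and 3 are explicit, while 2, 4 and 5 are largely left to the runtime.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_sum(data: list, workers: int = 4) -> int:
    """Sum `data` by splitting it into chunks processed by a pool of workers."""
    # 1) Divide the application into parallel tasks: one partial sum per chunk.
    size = max(1, (len(data) + workers - 1) // workers)
    # 3) Distribute data to memory elements: each task receives its own slice.
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # 2) Map tasks to processing elements: left implicit to the pool scheduler.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # 4) Communication: partial results travel back through the pool's futures.
        partials = pool.map(sum, chunks)
        # 5) Synchronization: consuming the results blocks until every task is done.
        return sum(partials)


if __name__ == "__main__":
    print(parallel_sum(list(range(1000))))
```

A more explicit model, such as message passing, would require the programmer to write steps 2, 4 and 5 by hand as well.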

Limits of Performance of Dwarfs
Limits to performance of dwarfs, inspired by a suggestion by IBM that a packaging technology could offer virtually infinite memory bandwidth. While the memory wall limited performance for almost half the dwarfs, memory latency is a bigger problem than memory bandwidth.

Transistor Integration Capacity

Pollack’s Rule
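The slide carries only the title, so the following is a sketch of the rule as it is commonly stated: single-core performance grows roughly with the square root of the core's complexity (transistor count or area). The function names and the normalisation to one "big core" are assumptions for illustration.

```python
import math


def core_performance(relative_area: float) -> float:
    """Pollack's rule of thumb: performance scales with sqrt(area or complexity)."""
    return math.sqrt(relative_area)


def aggregate_throughput(num_small_cores: int) -> float:
    """Total throughput when one big core's area is split into equal small cores.

    Assumes perfectly parallel work and ignores interconnect and memory limits.
    """
    return num_small_cores * core_performance(1.0 / num_small_cores)


if __name__ == "__main__":
    for k in (1, 4, 16):
        print(f"{k} core(s): per-core {core_performance(1.0 / k):.2f}, "
              f"total {aggregate_throughput(k):.2f}")
```

Under these assumptions, splitting the area into k cores makes each core slower by a factor of sqrt(k) but raises total throughput by sqrt(k), which is the usual argument for many simple cores on parallel workloads.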

Frequency and Power Consumption

ManyCore System
Illustration of a Many Core System

Amdahl’s Law Limits Parallel Speedup
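As a reminder of what the law says: if a fraction p of the work can be parallelised over n processors, the speedup is 1 / ((1 - p) + p / n), so it can never exceed 1 / (1 - p). The short sketch below evaluates this; the function name and the sample values of p and n are chosen here for illustration only.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Speedup predicted by Amdahl's law for a given parallel fraction."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(f"p = 0.90, n = {n}: speedup {amdahl_speedup(0.90, n):.2f}")
```

With 90% of the work parallel, the speedup stays below 10 however many cores are added, which is why the serial fraction matters so much for manycore chips.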

Core Performances
Performance of Large, Medium, and Small Cores

Fine Grain Power Management

Network Power Estimate
