
The Parallel Computing Landscape: A View from Berkeley 2.0

The Parallel Computing Landscape: A View from Berkeley 2.0. Krste Asanovic, Ras Bodik, Jim Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Edward Lee, Nelson Morgan, George Necula, Dave Patterson, Koushik Sen, John Wawrzynek, David Wessel, and Kathy Yelick. Dagstuhl, September 3, 2007.





Presentation Transcript


  1. The Parallel Computing Landscape: A View from Berkeley 2.0 Krste Asanovic, Ras Bodik, Jim Demmel, Tony Keaveny, Kurt Keutzer, John Kubiatowicz, Edward Lee, Nelson Morgan, George Necula, Dave Patterson, Koushik Sen, John Wawrzynek, David Wessel, and Kathy Yelick Dagstuhl, September 3, 2007

  2. A Parallel Revolution, Ready or Not • Embedded: per-product ASIC to programmable platforms → Multicore chip most competitive path • Amortize design costs + Reduce design risk + Flexible platforms • PC, Server: Power Wall + Memory Wall = Brick Wall • End of the way we’ve scaled uniprocessors for the last 40 years • New Moore’s Law is 2X processors (“cores”) per chip every technology generation, but same clock rate • “This shift toward increasing parallelism is not a triumphant stride forward based on breakthroughs …; instead, this … is actually a retreat from even greater challenges that thwart efficient silicon implementation of traditional solutions.” The Parallel Computing Landscape: A Berkeley View • Sea change for HW & SW industries, since it changes the model of programming and debugging

  3. P.S. Parallel Revolution May Fail • John Hennessy, President, Stanford University, 1/07: “…when we start talking about parallelism and ease of use of truly parallel computers, we're talking about a problem that's as hard as any that computer science has faced. … I would be panicked if I were in industry.” “A Conversation with Hennessy & Patterson,” ACM Queue Magazine, 4:10, 1/07. • 100% failure rate of Parallel Computer Companies • Convex, Encore, MasPar, NCUBE, Kendall Square Research, Sequent, (Silicon Graphics), Transputer, Thinking Machines, … • What if IT goes from a growth industry to a replacement industry? • If SW can’t effectively use 8, 16, 32, ... cores per chip → SW no faster on new computer → Only buy if computer wears out

  4. Need a Fresh Approach to Parallelism • Berkeley researchers from many backgrounds meeting since Feb. 2005 to discuss parallelism • Krste Asanovic, Ras Bodik, Jim Demmel, Kurt Keutzer, John Kubiatowicz, Edward Lee, George Necula, Dave Patterson, Koushik Sen, John Shalf, John Wawrzynek, Kathy Yelick, … • Circuit design, computer architecture, massively parallel computing, computer-aided design, embedded hardware and software, programming languages, compilers, scientific programming, and numerical analysis • Tried to learn from successes in high performance computing (LBNL) and parallel embedded (BWRC) • Led to “Berkeley View” Tech. Report and new Parallel Computing Laboratory (“Par Lab”) • Goal: Productive, Efficient, Correct Programming of 100+ cores & scale as cores double every 2 years (!)

  5. Why Target 100+ Cores? • 5-year research program aims 8+ years out • Multicore: 2X / 2 yrs → ≈ 64 cores in 8 years • Manycore: 8X to 16X multicore • Automatic Parallelization, Thread Level Speculation

  6. “The Datacenter is the Computer” • Building-sized computers: Microsoft, … • “The Laptop/Handheld is the Computer” • ‘08: Dell number of laptops > desktops? ‘06 profit? • 1B cell phones/yr, increasing in function • Otellini demoed “Universal Communicator” • Combination cell phone, PC and video device • Apple iPhone • Re-inventing Client/Server • Laptop/Handheld as future client, Datacenter as future server • Focus on single socket (Laptop/Handheld or Blade)

  7. Try Application-Driven Research? • Conventional Wisdom in CS Research • Users don’t know what they want • Computer Scientists solve individual parallel problems with clever language feature (e.g., futures), new compiler pass, or novel hardware widget (e.g., SIMD) • Approach: Push (foist?) CS nuggets/solutions on users • Problem: Stupid users don’t learn/use proper solution • Another Approach • Work with domain experts developing compelling apps • Provide HW/SW infrastructure necessary to build, compose, and understand parallel software written in multiple languages • Research guided by commonly recurring patterns actually observed while developing compelling apps

  8. 4 Themes of View 2.0/ Par Lab • Applications • Compelling Apps driving the research agenda top-down • Identify Computational Bottlenecks • Breaking through disciplinary boundaries • Developing Parallel Software with Productivity, Efficiency, and Correctness • 2 Layers + C&C Language, Autotuning • OS and Architecture • Composable Primitives, not Pre-Packaged Solutions • Deconstruction, Fast Barrier Synchronization, Partition

  9. Par Lab Research Overview. Goal: easy to write correct programs that run efficiently on manycore. (Software stack diagram:) Applications (Personal Health, Image Retrieval, Hearing/Music, Speech, Parallel Browser) built on Dwarfs; Productivity Layer: Composition & Coordination Language (C&CL), C&CL Compiler/Interpreter, Parallel Libraries, Parallel Frameworks, Static Verification, Type Systems; Efficiency Layer: Efficiency Languages, Autotuners, Schedulers, Communication & Synch. Primitives, Efficiency Language Compilers, Directed Testing, Dynamic Checking, Sketching, Debugging with Replay, Legacy Code; Correctness spans both layers; OS: Legacy OS, OS Libraries & Services, Hypervisor; Arch.: Multicore/GPGPU, RAMP Manycore

  10. Theme 1. Applications. What are the problems? • “Who needs 100 cores to run M/S Word?” • Need compelling apps that use 100s of cores • How did we pick applications? • Enthusiastic expert application partner, leader in field, promise to help design, use, evaluate our technology • Compelling in terms of likely market or social impact, with short term feasibility and longer term potential • Requires significant speed-up, or a smaller, more efficient platform to work as intended • As a whole, applications cover the most important • Platforms (handheld, laptop, games) • Markets (consumer, business, health)

  11. Coronary Artery Disease (before/after images) • Modeling to help patient compliance? • 400k deaths/year, 16M with symptoms, 72M BP • Massively parallel, real-time variations • CFD + FE: solid (non-linear), fluid (Newtonian), pulsatile • Blood pressure, activity, habitus, cholesterol

  12. Content-Based Image Retrieval • Built around key characteristics of personal databases • Very large number of pictures (>5K) • Non-labeled images • Many pictures of few people • Complex pictures including people, events, places, and objects • (Workflow diagram: query by example over an image database of 1000’s of images → similarity metric → candidate results → relevance feedback → final result.)

  13. Compelling Laptop/Handheld Apps • Meeting Diarist • Laptops/Handhelds at meeting coordinate to create speaker-identified, partially transcribed text diary of meeting • Teleconference speaker identifier, speech helper • L/Hs used for teleconference, identifies who is speaking, “closed caption” hint of what is being said

  14. Compelling Laptop/Handheld Apps • Musicians have an insatiable appetite for computation • More channels, instruments, more processing, more interaction! • Latency must be low (5 ms) • Must be reliable (no clicks) • Music Enhancer • Enhanced sound delivery systems for home sound systems using large microphone and speaker arrays • Laptop/Handheld recreate 3D sound over ear buds • Hearing Augmenter • Laptop/Handheld as accelerator for hearing aid • Novel Instrument User Interface • New composition and performance systems beyond keyboards • Input device for Laptop/Handheld • Berkeley Center for New Music and Audio Technology (CNMAT) created a compact loudspeaker array: 10-inch-diameter icosahedron incorporating 120 tweeters.

  15. Parallel Browser • Goal: Desktop quality browsing on handhelds • Enabled by 4G networks, better output devices • Bottlenecks to parallelize • Parsing, Rendering, Scripting • SkipJax • Parallel replacement for JavaScript/AJAX • Based on Brown’s FlapJax

  16. Theme 2. Use computational bottlenecks instead of conventional benchmarks? • How to invent parallel systems of the future when tied to old code, programming models, CPUs of the past? • Look for common patterns of communication and computation • Embedded Computing (42 EEMBC benchmarks) • Desktop/Server Computing (28 SPEC2006) • Data Base / Text Mining Software • Games/Graphics/Vision • Machine Learning • High Performance Computing (Original “7 Dwarfs”) • Result: 13 “Dwarfs”

  17. Dwarf Popularity (Red = Hot, Blue = Cool) • How do compelling apps relate to 13 dwarfs?

  18. What about I/O in Dwarfs? • High speed serial lines → many parallel I/O channels per chip • Storage will be much faster • “Flash is disk, disk is tape” • At least for laptop/handheld • iPod, camera, cellphone, laptop → $ to rapidly improve flash memory • New technologies even more promising: Phase-Change RAM denser, faster write, 100X longer wear-out • Disk → Flash means 10,000x improvement in latency (10 milliseconds to 1 microsecond) • Network will be much faster • 10 Gbit/s copper is standard; 100 Gbit/s is coming • Parallel serial channels to multiply bandwidth • Already need multiple fat cores to handle TCP/IP at 10 Gbit/s • Super 3G → 4G, 100 Mb/s → 1 Gb/s WAN wireless to handset

  19. 4 Valuable Roles of Dwarfs • “Anti-benchmarks” • Dwarfs not tied to code or language artifacts → encourage innovation in algorithms, languages, data structures, and/or hardware • Universal, understandable vocabulary • To talk across disciplinary boundaries • Bootstrapping: Parallelize parallel research • Allow analysis of HW & SW design without waiting years for full apps • Targets for libraries (see later)

  20. Themes 1 and 2 Summary • Application-Driven Research vs. CS Solution-Driven Research • Drill down on (initially) 5 app areas to guide research agenda • Dwarfs to represent broader set of apps to guide research agenda • Dwarfs help break through traditional interfaces • Benchmarking, multidisciplinary conversations, parallelizing parallel research, and target for libraries

  21. Theme 3: Developing Parallel Software (Par Lab software stack diagram, as in slide 9, with the software layers highlighted)

  22. Dwarfs become Libraries or Frameworks • Observation: Dwarfs are of 2 types • Algorithms in the dwarfs can either be implemented as: • Compact parallel computations within a traditional library • Compute/communicate pattern implemented as framework Libraries • Dense matrices • Sparse matrices • Spectral • Combinational • Finite state machines Patterns/Frameworks • MapReduce • Graph traversal, graphical models • Dynamic programming • Backtracking/B&B • N-Body • (Un) Structured Grid • Computations may be viewed at multiple levels: e.g., an FFT library may be built by instantiating a Map-Reduce framework, mapping 1D FFTs and then transposing (generalized reduce)
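The slide’s closing example — building an FFT library by instantiating a Map-Reduce framework, mapping 1D FFTs and then transposing — can be sketched concretely. This toy is illustrative only: `map_rows`, `transpose`, and `dft2d` are hypothetical names, and a naive O(N²) DFT stands in for a tuned FFT kernel.

```python
import cmath

def naive_dft(row):
    """Naive O(N^2) 1D DFT, standing in for a tuned FFT kernel."""
    n = len(row)
    return [sum(row[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def map_rows(kernel, matrix):
    """'Map' phase of the framework: apply a 1D kernel to each row independently."""
    return [kernel(row) for row in matrix]

def transpose(matrix):
    """'Generalized reduce' phase: reshuffle data between the two map phases."""
    return [list(col) for col in zip(*matrix)]

def dft2d(matrix):
    """2D DFT built by instantiating the framework: rows, transpose, rows, transpose."""
    return transpose(map_rows(naive_dft, transpose(map_rows(naive_dft, matrix))))

# 2x2 all-ones input: all energy lands in the DC term out[0][0].
out = dft2d([[1.0, 1.0], [1.0, 1.0]])
```

Because each row transform in the map phase is independent, a framework runtime could execute them in parallel without changing the result.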

  23. Two Layers for Two Types of Programmer • Efficiency Layer (10% of today’s programmers) • Expert programmers build Frameworks & Libraries, Hypervisors, … • “Bare metal” efficiency possible at Efficiency Layer • Productivity Layer (90% of today’s programmers) • Domain experts / naïve programmers productively build parallel apps using frameworks & libraries • Frameworks & libraries composed to form app frameworks • Effective composition techniques allow the efficiency programmers to be highly leveraged → Create language for Composition and Coordination (C&C)

  24. Productivity Layer (Par Lab stack diagram, as in slide 9, with the Productivity Layer highlighted)

  25. C & C Language Requirements Applications specify C&C language requirements: • Constructs for creating application frameworks • Primitive parallelism constructs: • Data parallelism • Divide-and-conquer parallelism • Event driven execution • Constructs for composing frameworks • Frameworks require independence • Independence is proven at instantiation with a variety of techniques • Needs to have low runtime overhead and ability to measure and control performance

  26. Coordination & Composition. Composition is key to software reuse • Solutions exist for libraries with hidden parallelism: • Partitions in OS help runtime composition • Instantiating parallel frameworks is harder: • Framework specifies required independence • E.g., operations in map, div&conq must not interfere • Guaranteed independence through types • Type system extensions (side effects, ownership) • Extra: Image READ Array[double] • Data decomposition may be implicit or explicit: • Partition: Array[T], … → List[Array[T]] (well understood) • Partition: Graph[T], … → List[Graph[T]] • Efficiency layer code has these specifications at interfaces, which are verified, tested, or asserted • Independence is proven by checking side effects and overlap at instantiation
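The Partition: Array[T] → List[Array[T]] decomposition, with independence checked at instantiation, might look like the following sketch (the `partition` and `check_independence` helpers are illustrative, not the actual C&CL mechanism, which would prove this in the type system rather than assert it at runtime):

```python
def partition(xs, nparts):
    """Split a list into nparts contiguous chunks, returned as (start, chunk) pairs."""
    n = len(xs)
    bounds = [n * i // nparts for i in range(nparts + 1)]
    return [(bounds[i], xs[bounds[i]:bounds[i + 1]]) for i in range(nparts)]

def check_independence(parts, total_len):
    """Instantiation-time check: chunks must tile the index space with no overlap."""
    covered = []
    for start, chunk in parts:
        covered.extend(range(start, start + len(chunk)))
    assert sorted(covered) == list(range(total_len)), "chunks overlap or leave gaps"

data = list(range(10))
parts = partition(data, 3)
check_independence(parts, len(data))   # framework may now run chunks in parallel
```

Once disjointness is established, a map or div&conq framework can process the chunks concurrently with no risk of interference.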

  27. Coordination & Composition (examples: DCT, Sparse LU, Delaunay) • Coordination is used to create parallelism: • Support parallelism patterns of applications • Data parallelism (degree = data size, not core count) • May be nested: forall images, forall blocks in image • Divide-and-conquer (parallelism from recursion) • Event-driven: nondeterminism at algorithm level • Branch&Bound dwarf, etc. • Serial semantics with limited nondeterminism • Data parallelism comes from array/aggregate operations and loops without side effects (serial) • Divide-and-conquer parallelism comes from recursive function invocations with non-overlapping side effects (serial) • Event-driven programs are written as guarded atomic commands, which may be implemented as transactions • Discovered parallelism mapped to available resources • Techniques include static scheduling, autotuning, dynamic scheduling, and possibly hints
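The divide-and-conquer case — parallelism from recursive calls with non-overlapping side effects — can be sketched with threads; because the two recursive calls touch disjoint index ranges, the parallel run preserves the serial semantics. A toy sketch (`spawn` and `dc_sum` are hypothetical names; a real framework would use task stealing rather than one thread per fork):

```python
import threading

def spawn(fn, *args):
    """Fork fn(*args) on a new thread; return a thunk that joins and yields the result."""
    box = {}
    t = threading.Thread(target=lambda: box.setdefault("v", fn(*args)))
    t.start()
    return lambda: (t.join(), box["v"])[1]

def dc_sum(xs, lo, hi, cutoff=16):
    """Divide-and-conquer sum: the recursive calls touch disjoint halves of xs,
    so running them in parallel preserves the serial semantics."""
    if hi - lo <= cutoff:
        return sum(xs[lo:hi])                        # small base case runs serially
    mid = (lo + hi) // 2
    get_left = spawn(dc_sum, xs, lo, mid, cutoff)    # fork the left half
    right = dc_sum(xs, mid, hi, cutoff)              # compute the right half here
    return get_left() + right                        # join, then combine

total = dc_sum(list(range(100)), 0, 100)             # → 4950
```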

  28. Efficiency Layer (Par Lab stack diagram, as in slide 9, with the Efficiency Layer highlighted)

  29. Efficiency Layer Ground OS/HW Requirements • Create parallelism, manage resources: affinity, bandwidth • Queries: #cores, #memory spaces, etc. • Latency hiding (prefetch, DMA, …) to enable bandwidth saturation, i.e., SPMV should run at peak bandwidth • Programming in legacy languages with threads • For portability across local RAM vs. cache, global address space with memory hierarchy (diagram: private on-chip memory, shared partitioned on-chip memory, shared off-chip DRAM) • Build by extending UPC

  30. Efficiency Layer: Software RAMP (figures: div&conq with task stealing; general task graph with weights & structure) • Efficiency layer is abstract machine model + selective virtualization • Software RAMP provides add-ons • Schedulers • Add runtime w/ dynamic scheduling for dynamic task tree • Add semi-static schedule for given graph • Memory movement / sharing primitives • Support to manage local RAM • Synchronization primitives • Use of verification and testing at this level

  31. Synthesis via Sketching and Autotuning (figure: Spec: simple implementation (3-loop 3D stencil) → Sketch: optimized skeleton (5 loops, missing some index/bounds) → Optimized code (tiled, prefetched, time-skewed)) • Extensive tuning knobs at efficiency level • Performance feedback from hardware and OS • Sketching: correct by construction • Autotuning: efficient by search • Examples: Spectral (FFTW, SPIRAL), Dense (PHiPAC, Atlas), Sparse (OSKI), Structured grids (Stencils) • Can select from algorithm/data structure changes not producible by compiler transforms • Optimization: 1.5x more entries (zeros) → 1.5x speedup. Compilers won’t do this!
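The autotuning half of this slide reduces to a simple search loop: enumerate candidate settings of the tuning knobs, time each on the target machine, and keep the fastest. A minimal sketch (`blocked_sum` and `autotune` are hypothetical stand-ins; real autotuners like PHiPAC, ATLAS, and OSKI generate and benchmark whole code variants, not just parameter values):

```python
import time

def blocked_sum(xs, block):
    """One tuning knob: traversal block size."""
    total = 0
    for i in range(0, len(xs), block):
        total += sum(xs[i:i + block])
    return total

def autotune(fn, knob_values, *args, trials=3):
    """Time each candidate knob setting on this machine; return the fastest one."""
    best_knob, best_time = None, float("inf")
    for knob in knob_values:
        t0 = time.perf_counter()
        for _ in range(trials):
            fn(*args, knob)
        elapsed = time.perf_counter() - t0
        if elapsed < best_time:
            best_knob, best_time = knob, elapsed
    return best_knob

data = list(range(100000))
best = autotune(blocked_sum, [64, 256, 1024, 4096], data)
```

The search is typically run once at install time or first use, and the winning configuration is cached for the life of the machine.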

  32. Autotuning Sparse Matrices • Sparse Matrix-Vector Multiply (SPMV) tuning steps: • Register Block to compress matrix data structure (choose r1xr2) • Cache Block so corresponding vectors fit in local memory (c1xc2) • Parallelize by dividing matrix evenly (p1xp2) • Prefetch for some distance d • Machine-specific code for SSE, etc. Autotuning is more important than parallelism on these machines!
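For reference, the kernel being tuned — sparse matrix-vector multiply over a compressed sparse row (CSR) structure — looks like this before any blocking is applied (plain CSR; the register-blocked variant would store dense r1×r2 tiles in place of single entries):

```python
def spmv_csr(vals, cols, row_ptr, x):
    """y = A*x with A in CSR form: vals/cols hold the nonzeros row by row,
    and row_ptr[r]:row_ptr[r+1] delimits row r's slice of vals/cols."""
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += vals[k] * x[cols[k]]
        y.append(acc)
    return y

# A = [[2, 0, 1],
#      [0, 3, 0]]  stored as three nonzeros
vals = [2.0, 1.0, 3.0]
cols = [0, 2, 1]
row_ptr = [0, 2, 3]
y = spmv_csr(vals, cols, row_ptr, [1.0, 1.0, 1.0])   # → [3.0, 3.0]
```

The tuning steps on the slide then transform this loop nest: register blocking compresses the index arrays, cache blocking keeps vector pieces resident, and the row ranges are divided among cores.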

  33. Theme 3 Summary Two Layers for two types of programmers • Productivity layer: safety • Coordination and Composition language has serial semantics with controlled nondeterminism • Frameworks and libraries have enforced contracts, leverage expert programmers by reusing library/framework code • Efficiency layer: expressiveness and portability • Base is abstract machine model • Software RAMP for selective virtualization (add-ons) • Sketching and autotuning for library/framework development

  34. Theme 4: OS and Architecture (Par Lab stack diagram, as in slide 9, with the OS and Arch. layers highlighted)

  35. Why change OS and Architecture? Old world: OS is the only parallel code, user jobs are sequential. New world: Many user applications are highly parallel • Efficiency layer wants to control own scheduling of threads onto cores (e.g. select non-virtualized or virtualized) • Legacy OS + Virtual Machine layers make this intractable: VMM scheduler + Guest OS scheduler + Application scheduler? • Parallel applications need efficient inter-thread communication and synchronization operations • Legacy architectures have poor inter-thread communication (cache-coherent shared memory) and unwieldy synchronization (spin locks) • Managing data movement important for efficiency layer, must be able to saturate memory bandwidth with independent requests • Legacy OS + caches/prefetching make bad decisions about where data should be relative to accessing thread within application

  36. Par Lab OS and Architecture Theme Deconstruct OS and architectures into software-composable primitives not pre-packaged solutions • Very thin hypervisor layer providing hierarchical partitions • Protection only, no scheduling/multiplexing in hypervisor. Traditional OS functionalities in user libraries and service partitions. • Efficiency layer takes over thread scheduling in partition • Low overhead hardware synchronization primitives • Barriers+signals, atomic memory operations (fetch+op), user-level active messages • Memory read/write set tracking to support hybrid hardware/software coherence, transactional memory (TM), race detection, checkpointing • Efficiency layer uses and exports customized synchronization facilities • Configurable memory hierarchy • Private and/or shared caches, local RAM, DMA engine • Efficiency layer controls whether and how to cache data, and controls data movement around system
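The barrier-from-fetch+op primitive mentioned above can be sketched in software as a sense-reversing barrier (illustrative only: a lock-protected counter emulates what the proposed hardware fetch+op would do in a single instruction):

```python
import threading

class FetchAddBarrier:
    """Sense-reversing barrier built on an atomic fetch-and-add counter."""
    def __init__(self, nthreads):
        self.n = nthreads
        self.count = 0
        self.sense = False
        self.cv = threading.Condition()

    def wait(self):
        with self.cv:
            my_sense = not self.sense
            self.count += 1                 # fetch+add (atomic under the lock)
            if self.count == self.n:        # last arriver releases everyone
                self.count = 0
                self.sense = my_sense
                self.cv.notify_all()
            else:
                while self.sense != my_sense:
                    self.cv.wait()

N = 4
barrier = FetchAddBarrier(N)
phase_log = []
log_lock = threading.Lock()

def worker(tid):
    for phase in range(3):
        with log_lock:
            phase_log.append(phase)
        barrier.wait()                      # nobody enters phase p+1 early

threads = [threading.Thread(target=worker, args=(t,)) for t in range(N)]
for t in threads: t.start()
for t in threads: t.join()
```

Flipping the sense each round lets the same counter be reused across barrier episodes without a second synchronization.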

  37. Partition-Based Deconstructed OS (figure: hierarchical partitions — a root partition hosting a legacy OS with device driver and processes; a web browser partition with plug-ins; a Singularity partition; a music framework partition with audio-unit UI and audio units) • Hypervisor is a very thin software layer over the hardware providing only a resource protection mechanism, no resource multiplexing • Traditional OS functionality provided as libraries (cf. Exokernel) and service partitions (e.g., device drivers) • Hypervisor allows partition to subdivide into hierarchical child partitions, where parent retains control over children. Motivating examples: • Protecting browser from third-party plugins • Predictable real-time for audio processing units in music application • Support legacy parallel libraries/frameworks with runtime in own partition • Trusted OS or VM (e.g., Singularity or JavaVM) can avoid mandatory hardware protection checking within own partition

  38. Hardware Partitioning Support (figure: manycore chip with CPUs and private L1s, banked shared L2 over an interconnect, and a DRAM & I/O interconnect to DRAM channels; two partitions communicate through protected shared memory) • Partition can contain own cores, L1 and L2 $/RAM, DRAM, and interconnect bandwidth allocation • Inter-partition communication through protected shared memory • “Un-virtual” partitions give efficiency layer complete “bare metal” control of hardware • Per-partition performance counters give feedback to runtimes/autotuners • Benefits: • Efficiency (fewer layers, custom OS) • Enables new exposed HW primitives • Performance isolation/predictability • Robustness to faults/errors • Security

  39. Customized Parallel Processors (figure: 16x16 array of cores with partition boundaries; only cores shown — separately sizable allocation for other resources, e.g., L2 memory) • Efficiency layer software uses hardware synchronization and communication primitives to glue cores together into a customized application “processor” • Partition always swapped in en masse • Software protocols can support dynamic partition resizing • If “Processors are the new Transistors” then “Partitions are the new Processors”

  40. RAMP Manycore Prototype • Separate multi-university RAMP project building FPGA emulation infrastructure • BEE3 boards with Chuck Thacker/Microsoft • Expect to fit hundreds of 64-bit cores with full instrumentation in one rack • Run at ~100MHz, fast enough for application software development • Flexible cycle-accurate timing models • What if DRAM latency 100 cycles? 200? 1000? • What if barrier takes 5 cycles? 20? 50? • “Tapeout” every day, to incorporate feedback from application and software layers • Rapidly distribute hardware ideas to larger community • RAMP Blue, July 2007 • 1000+ RISC cores @90MHz • Works! Runs UPC version of NAS parallel benchmarks.

  41. Summary: A Berkeley View 2.0. Easy to write correct programs that run efficiently on manycore • Our industry has bet its future on parallelism (!) • Recruit best minds to help? • Try Apps-Driven vs. CS Solution-Driven Research • Dwarfs as lingua franca, anti-benchmarks, … • Efficiency layer for ≈ 10% of today’s programmers • Productivity layer for ≈ 90% of today’s programmers • C&C language to help compose and coordinate • Autotuners vs. Parallelizing Compilers • OS & HW: Composable Primitives vs. Solutions (Par Lab stack diagram, as in slide 9)

  42. Acknowledgments • Berkeley View material comes from discussions with: • Profs Krste Asanovic, Ras Bodik, Jim Demmel, Kurt Keutzer, John Kubiatowicz, John Wawrzynek, Kathy Yelick • UCB grad students Bryan Catanzaro, Jike Chong, Joe Gebis, William Plishker, and Sam Williams • LBNL: Parry Husbands, Bill Kramer, Lenny Oliker, John Shalf • See view.eecs.berkeley.edu • RAMP based on work of RAMP Developers: • Krste Asanovic (Berkeley), Derek Chiou (Texas), James Hoe (CMU), Christos Kozyrakis (Stanford), Shih-Lien Lu (Intel), Mark Oskin (Washington), David Patterson (Berkeley, Co-PI), and John Wawrzynek (Berkeley, PI) • See ramp.eecs.berkeley.edu
