
Exascale: Why It is Different


Presentation Transcript


  1. Exascale: Why It is Different. Barbara Chapman, University of Houston. P2S2 Workshop at ICPP, Taipei, Taiwan, September 13, 2011. High Performance Computing and Tools Group, http://www.cs.uh.edu/~hpctools

  2. Agenda • What is Exascale? • Hardware Revolution • Programming at Exascale • Runtime Support • We Need Tools

  3. Top 10 Supercomputers (June 2011)

  4. Petascale is a Global Reality • K computer • 68,544 SPARC64 VIIIfx processors, Tofu interconnect, Linux-based enhanced OS, produced by Fujitsu • Tianhe-1A • 7,168 Fermi GPUs and 14,336 CPUs; it would require more than 50,000 CPUs and twice as much floor space to deliver the same performance using CPUs alone • Jaguar • 224,256 x86-based AMD Opteron processor cores; each compute node features two Opterons with 12 cores and 16GB of shared memory • Nebulae • 4,640 Nvidia Tesla GPUs and 9,280 Intel X5650-based CPUs • Tsubame • 4,200 GPUs

  5. Exascale Systems: The Race is On • Town Hall Meetings April‐June 2007 • Scientific Grand Challenges Workshops November 2008 – October 2009 • Climate Science (11/08) • High Energy Physics (12/08) • Nuclear Physics (1/09) • Fusion Energy (3/09) • Nuclear Energy (5/09) (with NE) • Biology (8/09) • Material Science and Chemistry (8/09) • National Security (10/09) (with NNSA) • Cross-cutting workshops • Architecture and Technology (12/09) • Architecture, Applied Mathematics and Computer Science (2/10) • Meetings with industry (8/09, 11/09) • External Panels • ASCAC Exascale Charge • Trivelpiece Panel • Note: exascale means a peak performance of 10^18 floating point operations per second

  6. DOE Exascale Exploration: Findings • Exascale is essential in important areas • Climate, combustion, nuclear reactors, fusion, stockpile stewardship, materials, astrophysics,… • Electricity production, alternative fuels, fuel efficiency • Systems need to be both usable and affordable • Must deliver exascale level of performance to applications • Power budget cannot exceed today’s petascale power consumption by more than a factor of 3 • Applications need to exhibit strong scaling, or weak scaling that does not increase memory requirements • Need to understand R&D implications (HW and SW) What is the roadmap?

  7. IESP: International Exascale Software • International effort to specify a research agenda that will lead to exascale capabilities • Academia, labs, agencies, industry • Requires open international collaboration • Significant contribution of open-source software • Revolution vs. evolution? • Produced a detailed roadmap; focused meetings held to determine R&D needs

  8. IESP: Exascale Systems Given budget constraints, current predictions focus on two alternative designs: • Huge number of lightweight processors, e.g. 1 million chips at 1,000 cores/chip = 1 billion threads of execution • Hybrid processors, e.g. 1.0 GHz processors with 10,000 FPUs/socket and 100,000 sockets/system = 1 billion threads of execution. See http://www.exascale.org/. Platforms are expected around 2018.

  9. Exascale: Anticipated Architectural Changes • Massive (ca. 4K) increase in concurrency • Mostly within compute node • Balance between compute power and memory changes significantly • 500x compute power and 30x memory of 2PF HW • Memory access time lags further behind

  10. But Wait, There’s More… • Need breakthrough in power efficiency • Impact on all of HW, cooling, system software • Power-awareness in applications and runtime • Need new design of I/O systems • Reduced I/O bandwidth & file system scaling • Must improve system resilience, manage component failure • Checkpoint/restart won’t work

  11. DOE’s Co-design Model • Initial set of projects funded in key areas • Integrate decisions and explore trade-offs • Across hardware and software • Simultaneous redesign of architecture and the application software • Requires significant interactions, incl. with hardware vendors • Including accurate simulators • Programming model and system software will be key to success

  12. Hot Off The Press: Exaflop supercomputer receives full funding from Senate appropriators (September 11, 2011, 11:45pm ET, by David Perera) • An Energy Department effort to create a supercomputer three orders of magnitude more powerful than today's most powerful computer--an exascale computer--would receive $126 million during the coming federal fiscal year under a Senate Appropriations Committee markup of the DOE spending bill. • The Senate Appropriations Committee voted Sept. 7 to approve the amount as part of the fiscal 2012 energy and water appropriations bill; fiscal 2011 ends on Sept. 30. The $126 million is the same as requested earlier this year in the White House budget request. • In a report accompanying the bill, Senate appropriators say Energy plans to deploy the first exascale system in 2018, despite challenges.

  13. Agenda • What is Exascale? • Hardware Revolution • Programming at Exascale • Runtime Support • We Need Tools

  14. Heterogeneous High-Performance System. Each node has multiple CPU cores, and some of the nodes are equipped with additional computational accelerators, such as GPUs. (Source: www.olcf.ornl.gov/wp-content/uploads/.../Exascale-ASCR-Analysis.pdf)

  15. Many More Cores • Biggest change is within the node • Some full-featured cores • Many low-power cores • Technology rapidly evolving, will be integrated • Easiest way to get power efficiency and high performance • Specialized cores • Global memory • Low amount of memory per core • Coherency domains, networks on chip [Figure: example heterogeneous SoC block diagram: ARM Cortex-A8 CPU, C64x+ DSP and video accelerators (3525/3530 only), POWERVR SGX graphics (3515/3530 only), display subsystem, camera I/F, L3/L4 interconnect, system peripherals, connectivity, serial interfaces, program/data storage]

  16. Example: Nvidia Hardware • Echelon – “Radical and rapid evolution of GPUs for exascale systems” (SC10) – expected in 2013-2014 • ~1024 stream cores • ~8 latency optimized CPU cores on a single chip • ~20TFLOPS in a shared memory system • 25x performance over Fermi • Plan is to integrate in exascale chip

  17. Top 10 Energy-efficient Supercomputers (June 2011)

  18. Processor Architecture and Performance (1993-2011)

  19. Agenda • What is Exascale? • Hardware Revolution • Programming at Exascale • Runtime Support • We Need Tools

  20. Programming Challenges • Scale and structure of parallelism • Hierarchical parallelism • Many more cores, more nodes in clusters • Heterogeneous nodes • New methods and models to match architecture • Exploit intranode concurrency, heterogeneity • Adapt to reduced memory size; drastically reduce data motion • Resilience in algorithm; fault tolerance in application • Need to run multi-executable jobs • Coupled computations; Postprocessing of data while in memory • Uncertainty quantification to clarify applicability of results

  21. Programming Models? Today's Scenario

    // Run one OpenMP thread per device per MPI node
    #pragma omp parallel num_threads(devCount)
    if (initDevice()) {
      // Block and grid dimensions
      dim3 dimBlock(12,12);
      kernel<<<1,dimBlock>>>();
      cudaThreadExit();
    } else {
      printf("Device error on %s\n", processor_name);
    }
    MPI_Finalize();
    return 0;
  }

  (Source: www.cse.buffalo.edu/faculty/miller/Courses/CSE710/heavner.pdf)

  22. Exascale Programming Models • Programming models are the biggest “worry factor” for application developers • MPI-everywhere model no longer viable • Hybrid MPI+OpenMP already in use in HPC (a minimal sketch follows below) • Need to explore new approaches, adapt existing APIs • Exascale models and their implementation must take account of: • Scale of parallelism, levels of parallelism • Potential coherency domains, heterogeneity • Need to reduce power consumption • Resource allocation and management • Legacy code, libraries; interoperability • Resilience in algorithms, alternatives to checkpointing. Work on programming models must begin now!
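
A minimal sketch of that hybrid style, for reference: one MPI rank per node, with OpenMP threads sharing the rank's memory inside the node. The code is illustrative only; the printf body stands in for real node-local work.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
      int provided, rank;
      // Request thread support so OpenMP threads can coexist with MPI calls
      MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      // Node-level parallelism: threads of this rank share its memory
      #pragma omp parallel
      {
        printf("rank %d: thread %d of %d\n",
               rank, omp_get_thread_num(), omp_get_num_threads());
      }

      MPI_Finalize();
      return 0;
    }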

  23. Programming Model Major Requirements • Portable expression of scalable parallelism • Across exascale platforms and intermediate systems • Uniformity • One model across variety of resources • Across node, across machine? • Locality • For performance and power • At all levels of hierarchy • Asynchrony • Minimize delays • But trade-off with locality [Figure: node with generic and specialized cores connected by control and data transfers]

  24. Delivering The Programming Model • Between nodes, MPI with enhancements might work • Needs more help for fault tolerance and improved scalability • Within nodes, too many models, no complete solution • OpenMP, PGAS, CUDA and OpenCL are all potential starting points • In a layered system, migrate code to the most appropriate level, develop at the most suitable level • Incremental path for existing codes • Timing of programming model delivery is critical • Must be in place when machines arrive • Needed earlier for development of new codes

  25. DOE Workshop’s Reverse Timeline

  26. A Layered Programming Approach • Applications: computational science, data informatics, information technology • Very-high level: efficient, deterministic, declarative, restrictive-expressiveness languages (DSLs, maybe?) • High level: parallel programming languages (OpenMP, PGAS, APGAS) • Low-level: APIs (MPI, pthreads, OpenCL, Verilog) • Very low-level: machine code, assembly • Heterogeneous hardware

  27. OpenMP Evolution Toward Exascale • OpenMP language committee is actively working toward the expression of locality and heterogeneity • And to improve task model to enhance asynchrony • How to identify code that should run on a certain kind of core? • How to share data between host cores and other devices? • How to minimize data motion? • How to support diversity of cores? [Figure: node with generic and specialized cores connected by control and data transfers]

  28. OpenMP 4.0 Attempts To Target Range of Acceleration Configurations • Dedicated hardware for specific function(s) • Attached to a master processor • Multiple types or levels of parallelism: process level, thread level, ILP/SIMD • May not support a full C/C++ or Fortran compiler • May lack stack or interrupts, may limit control flow, types [Figure: example configurations: master processors attached to a massively parallel accelerator, an accelerator with a nonstandard programming model, and arrays of DSPs and accelerators]

  29. Increase Locality, Reduce Power

    void foo(double A[], double B[], double C[], int nrows, int ncols) {
      #pragma omp data_region acc_copyout(C), host_shared(A,B)
      {
        #pragma omp acc_region
        for (int i=0; i < nrows; ++i)
          for (int j=0; j < ncols; j += NLANES)
            for (int k=0; k < NLANES; ++k) {
              int index = (i * ncols) + j + k;
              C[index] = A[index] + B[index];
            } // end accelerator region
        print2d(A,nrows,ncols);
        print2d(B,nrows,ncols);
        Transpose(C); // calls function w/ another accelerator construct
      } // end data_region
      print2d(C, nrows, ncols);
    }

    void Transpose(double X[], int nrows, int ncols) {
      #pragma omp acc_region acc_copy(X), acc_present(X)
      { … }
    }
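
For comparison, a minimal sketch of the same vector addition written with the target and map directives that were eventually standardized in OpenMP 4.0, rather than the acc_region/data_region syntax shown above; the flattened loop and the function name are illustrative assumptions.

    void vec_add(double A[], double B[], double C[], int nrows, int ncols) {
      int n = nrows * ncols;
      // Copy A and B to the accelerator; copy C back when the region ends
      #pragma omp target map(to: A[0:n], B[0:n]) map(from: C[0:n])
      #pragma omp parallel for
      for (int i = 0; i < n; ++i)
        C[i] = A[i] + B[i];
    }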

  30. OpenMP Locality Research Locations := Affinity Regions • Coordinate data layout, work • Collection of locations represent execution environment • Map data, threads to a location; distribute data across locations • Align computations with data’s location, or map them • Location inherited unless task explicitly migrated

  31. Research on Locality in OpenMP • Implementation in OpenUH compiler shows good performance • Eliminates unnecessary thread migrations • Maximizes locality of data operations • Affinity support for heterogeneous systems

    int main(int argc, char * argv[]) {
      double A[N];
      …
      #pragma omp distribute(BLOCK: A) location(0:1)
      …
      #pragma omp parallel for OnLoc(A[i])
      for (i=0; i<N; i++) {
        foo(A[i]);
      }
      …
    }

  32. Enabling Asynchronous Computation • Directed acyclic graph (DAG): each node represents a task and the edges represent inter-task dependencies • A task can begin execution only if all its predecessors have completed execution • Should the user express this directly? • Compiler can generate tasks and graph (at least partially) to enhance performance • What is the “right” size of task?

  33. OpenMP Research: Enhancing Tasks for Asynchrony • Implicitly create DAG by specifying data input and output related to tasks • A task that produces data may need to wait for any previous child task that reads or writes the same locations • A task that consumes data may need to wait for any previous child task that writes the same locations • Task weights (priorities) • Groups of tasks and synchronization on groups
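
A minimal sketch of this implicit-DAG idea, using the task dependence clauses that later entered OpenMP 4.0; produce, consume_a and consume_b are placeholder functions, not part of the slides.

    void produce(double *x);
    void consume_a(double x);
    void consume_b(double x);

    void pipeline(void) {
      double x;  // shared datum whose reads/writes define the dependences
      #pragma omp parallel
      #pragma omp single
      {
        #pragma omp task depend(out: x)  // producer: writes x
        produce(&x);

        #pragma omp task depend(in: x)   // consumers: read x and may
        consume_a(x);                    // run concurrently with each other

        #pragma omp task depend(in: x)
        consume_b(x);
      }  // implicit barrier: all tasks complete here
    }

The runtime derives the graph from the in/out annotations: produce must finish before either consumer starts, while the two consumers are unordered with respect to each other.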

  34. Asynchronous OpenMP Execution T.-H. Weng, B. Chapman: Implementing OpenMP Using Dataflow Execution Model for Data Locality and Efficient Parallel Execution. Proc. HIPS-7, 2002

  35. OpenUH “All Task” Execution Model • Replace fork-join with data flow execution model • Reduce synchronization, increase concurrency • Compiler transforms OpenMP code to a collection of tasks and a task graph. • Task graph represents dependences between tasks • Array region analysis; data read/write requests for the data the task needs. • Map tasks to compute resources through task characterization using cost modeling • Enable locality-aware scheduling and load-balancing

  36. Agenda • What is Exascale? • Hardware Revolution • Programming at Exascale • Runtime Support • We Need Tools

  37. Runtime Is Critical for Performance • Runtime support must • Adapt workload and data to environment • Respond to changes caused by application characteristics, power, faults, system noise • Dynamically and continuously • Provide feedback on application behavior • Uniform support for multiple programming models on heterogeneous platforms • Facilitate interoperability, (dynamic) mapping

  38. Lessons Learned from Prior Work • Only a tight integration of application-provided meta-data and architecture description can let the runtime system take appropriate decisions • Good analysis of thread affinity and data locality • Task reordering • H/W-aware selective information gathering

  39. [Figure: OpenUH/Open64 compiler infrastructure with dynamic feedback for compiler optimizations. Source code with OpenMP directives passes through the front ends (C/C++, Fortran 90, OpenMP), IPA (Inter-Procedural Analyzer), OMP_PRELOWER (OpenMP preprocessing), LNO (Loop Nest Optimizer), LOWER_MP (OpenMP transformation), and WOPT (global scalar optimizer); code is emitted either through WHIRL2C/WHIRL2F (IR-to-source) and a native compiler, or through CG (Itanium, Opteron, Pentium). Object files are linked with the OpenMP runtime library into executables. Static feedback information and runtime feedback information (e.g. free compute resources) flow back to guide optimizations.]

  40. Compiler’s Runtime Must Adapt • Light-weight performance data collection • Dynamic optimization • Interoperate with external tools and schedulers • Need very low-level interfaces to facilitate its implementation [Figure: OpenMP application and OpenMP runtime library interacting with a collector tool through event registration and event callbacks] Note: the Multicore Association is defining interfaces
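
A deliberately hypothetical sketch of the kind of low-level collector interface meant here: a tool registers a callback with the OpenMP runtime and is notified on events such as the start of a parallel region. Every name below (omp_collector_register, omp_event_t, and so on) is invented for illustration and does not correspond to a shipped API.

    #include <stdio.h>

    // Hypothetical collector interface: all names are illustrative only
    typedef enum { OMP_EVENT_PARALLEL_BEGIN, OMP_EVENT_PARALLEL_END } omp_event_t;
    typedef void (*omp_event_cb)(omp_event_t ev, int thread_id);

    // Assumed to be exported by the (hypothetical) runtime library
    extern int omp_collector_register(omp_event_t ev, omp_event_cb cb);

    // Light-weight callback: no allocation, no locking, just a trace record
    static void on_parallel_begin(omp_event_t ev, int thread_id) {
      fprintf(stderr, "parallel region begins on thread %d\n", thread_id);
    }

    // Called once when the tool is loaded alongside the application
    void tool_init(void) {
      omp_collector_register(OMP_EVENT_PARALLEL_BEGIN, on_parallel_begin);
    }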

  41. Agenda • What is Exascale? • Hardware Revolution • Programming at Exascale • Runtime Support • We Need Tools

  42. Interactions Across System Stack • More interactions needed to share information • To support application development and tuning • Increase execution efficiency • Runtime needs architectural information, smart monitoring • Application developer needs feedback • So does dynamic optimizer [Figure: instrumentation workflow: IPA inlining analysis / selective instrumentation, source-to-source transformations, instrumentation phase, optimization logs] Oscar Hernandez, Haoqiang Jin, Barbara Chapman. Compiler Support for Efficient Instrumentation. In Parallel Computing: Architectures, Algorithms and Applications, C. Bischof, M. Bücker, P. Gibbon, G.R. Joubert, T. Lippert, B. Mohr, F. Peters (Eds.), NIC Series, Vol. 38, ISBN 978-3-9810843-4-4, pp. 661-668, 2007.

  43. Automating the Tuning Process K. Huck, O. Hernandez, V. Bui, S. Chandrasekaran, B. Chapman, A. Malony, L. McInnes, B. Norris. Capturing Performance Knowledge for Automated Analysis. Supercomputing 2008

  44. Code Migration Tools • From structural information to sophisticated adaptation support • Changes in data structures, large-grain and fine-grained control flow • Variety of transformations for kernel granularity and memory optimization [Figure: code similarity view; red indicates high similarity, blue low similarity]

  45. Summary • Projected exascale hardware requires us to rethink programming model and its execution support • Intra-node concurrency is fine-grained, heterogeneous • Memory is scarce and power is expensive • Data locality has never been so important • Opportunity for new programming models • Runtime continues to grow in importance for ensuring good performance • Need to migrate large apps: Where are the tools?
