Perspective on Extreme Scale Computing in China



  1. Perspective on Extreme Scale Computing in China Depei Qian Sino-German Joint Software Institute (JSI) Beihang University Co-design 2013, Guilin, Oct. 29, 2013

  2. Outline • Related R&D programs in China • HPC system development • Application service environment • Applications

  3. Related R&D programs in China

  4. HPC-related R&D Under NSFC • NSFC • Key initiative “Basic algorithms for high performance scientific computing and computable modeling” • 2011-2018 • 180 million RMB • Basic algorithms and their highly efficient implementation • Computable modeling • Verification by solving domain problems

  5. HPC-related R&D Under 863 program • 3 Key projects in the last 12 years • High performance computer and core software (2002-2005) • High productivity computer and Grid service environment (2006-2010) • High productivity computer and application environment (2011-2016) • 3 Major projects • Multicore/many-core programming support (2012-2015) • High performance parallel algorithms and parallel coupler development for earth systems study (2010-2013) • HPC software support for earth system modeling (2010-2013)

  6. HPC-related R&D Under 973 program • 973 program • High performance scientific computing • Large scale scientific computing • Aggregation and coordination mechanisms in virtual computing environment • Highly efficient and trustworthy virtual computing environment

  7. There is no national long-term R&D program on extreme scale computing • Coordination between different programs needed

  8. Shift of 863 program emphasis • 1987: Intelligent computers, following the 5th generation computer program in Japan • 1990: from intelligent computers to high performance parallel computers • 1999: from individual HPC system to the national HPC environment • 2006: from high performance computers to high productivity computers

  9. History of HPC development under 863 program • 1990: parallel computers identified as priority topic of the 863 program • National Intelligent Computer R&D Center established • 1993: Dawning 1, 640MIPS, SMP • 1995: Dawning 1000, 2.5GFlops, MPP • Dawning company established in 1995 • 1996: Dawning 1000A, cluster system • First product-oriented system of Dawning • 1998: Dawning 2000, 100GFlops, cluster

  10. History of HPC development under 863 program • 2000: Dawning 3000, 400GFlops, cluster • First system commercialized • 2002: Lenovo DeepComp 1800, 1TFlops, cluster • Lenovo entered the HPC market • 2003: Lenovo DeepComp 6800, 5.3TFlops, cluster • 2004: Dawning 4000A, 11.2TFlops

  11. History of HPC development under 863 program • 2008: • Lenovo DeepComp 7000 • 150TFlops, heterogeneous cluster • Dawning 5000A • 230TFlops, cluster • 2010: • Dawning 6000 • 3PFlops, heterogeneous system, CPU+GPU • TH-1A • 4.7PFlops, heterogeneous CPU+GPU • 2011: • Sunway BlueLight • 1PFlops + 100TFlops • Based on domestic processors • 2013: • TH-2 • Heterogeneous system with CPU+MIC

  12. 863 key projects on HPC and Grid: 2002-2010 • “High performance computer and core software” • 4-year project, May 2002 to Dec. 2005 • 100 million Yuan funding from the MOST • More than 2× associated funding from local governments, application organizations, and industry • Major outcomes: China National Grid (CNGrid) • “High productivity Computer and Grid Service Environment” • Period: 2006-2010 (extended to now) • 940 million Yuan from the MOST and more than 1B Yuan in matching funds from other sources

  13. Current 863 key project • “High productivity computer and application environment” • 2011-2015 (2016) • 1.3B Yuan investment secured • Develop leading-level high performance computers • Transform CNGrid into an application service environment • Develop parallel applications in selected areas

  14. Projects launched • The first round of projects launched in 2011 • High productivity computer (1) • 100PF by the end of 2015 • HPC applications (6) • Fusion simulation • Simulation for aircraft design • Drug discovery • Digital media • Structural mechanics for large machinery • Simulation of electro-magnetic environment • Parallel programming framework (1) • Application service environment will be supported in the second round • Emphasis on application service support • Technologies for new mode of operation

  15. HPC system development

  16. Major challenges • Power consumption • Performance obtained by the applications • Programmability • Resilience • Major obstacles • Memory wall • Power wall • I/O wall • …

  17. Power consumption • The limiting factor in implementing extreme scale computers • Impossible to increase performance by expanding system scale only • Cooling of the system is difficult and affects reliability of the system • Energy cost is a heavy burden and prevents acceptance of extreme scale computers by end users

  18. Performance obtained by applications • Systems installed at general purpose computing centers • Serving a large population of users • supporting a wide range of applications • LinPack is not everything • Need to be efficient for both general-purpose and special-purpose computing • Need to support both computing-intensive and data-intensive applications

  19. Programmability • Must handle • Concurrency/locality • Heterogeneity of the system • Porting of legacy programs • Lower the skill requirements for application developers

  20. Resilience • Very short MTBF for extreme scale systems • Long-time continuous operation • System must self-heal/recover from hardware faults/failures • System must detect and tolerate errors in software

  21. Constrained design principle • We must set strict constraints on the implementation of extreme scale systems • Power efficiency • 5GF/W in 2015 • 50GF/W or better before 2020 • System scale • <100,000 processors • <200 cabinets • Cost • <300 million dollars (or <2B Yuan) • We can only design and implement an extreme scale system within those constraints
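
As a rough sanity check (my own arithmetic, not from the slides), these efficiency targets map directly onto machine-room power budgets: the 2015 target keeps a 100PFlops system near 20MW, and only the 2020-class target keeps an exaflops-class system in the same envelope.

```latex
% Power draw = peak performance / power efficiency
P_{100\,\mathrm{PF}} = \frac{100 \times 10^{15}\,\mathrm{Flops}}
                            {5 \times 10^{9}\,\mathrm{Flops/W}} = 20\,\mathrm{MW},
\qquad
P_{1\,\mathrm{EF}} = \frac{10^{18}\,\mathrm{Flops}}
                          {50 \times 10^{9}\,\mathrm{Flops/W}} = 20\,\mathrm{MW}
```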

  22. How to address the challenges? • Architectural support • Technology innovation • Hardware and software coordination

  23. Architectural support • Using the most appropriate architecture to achieve the goal • Making trade-offs between performance, power consumption, programmability, resilience, and cost • Hybrid architecture (TH-1A & TH-2) • General purpose + high density computing (GPU or MIC) • HPP architecture (Dawning 6000/Loongson) • Enable different processors to co-exist • Support global address space • Multiple levels of parallelism • Multi-conformation and multi-scale adaptive architecture (SW/BL) • Cluster implemented with Intel processors for supporting commercial software • Homogeneous system implemented with domestic multicore processors for computing-intensive applications • Support parallelism at different levels
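
All three architectural lines expose at least two levels of parallelism: across nodes and within a node. As a generic illustration of that point (standard MPI and OpenMP APIs, not code from any of the machines named above), a minimal hybrid skeleton in C looks like this:

```c
/* Minimal hybrid MPI + OpenMP sketch: node-level parallelism via MPI ranks,
 * core-level parallelism via OpenMP threads inside each rank. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;
    /* Request thread support so OpenMP regions may coexist with MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double local_sum = 0.0, global_sum = 0.0;

    /* Core-level parallelism: each thread works on part of this rank's data. */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < 1000000; i++)
        local_sum += 1.0 / (1.0 + i + rank);

    /* Node-level parallelism: combine partial results across ranks. */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("ranks=%d threads/rank=%d sum=%f\n",
               nranks, omp_get_max_threads(), global_sum);

    MPI_Finalize();
    return 0;
}
```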

  24. Classification of current major architectures • Classifying architectures using “homogeneity/heterogeneity” and “CPU only/CPU+Accelerator” • Homo-/Hetero refers to the ISA

  25. Comparison of different architectures

  26. TH-1A architecture • Hybrid system architecture • Computing sub-system • Service sub-system • Communication networks • Storage sub-system • Monitoring and diagnosis sub-system • (Slide diagram: CPU+GPU compute nodes and operation nodes of the service sub-system, connected by the communication sub-system to an MDS/OSS storage sub-system, with the monitoring and diagnosis sub-system alongside)

  27. Dawning/Loongson HPP (Hyper Parallel Processing) architecture • Hyper node composed of AMD and Loongson processors • Separation of OS & application processors • Multiple interconnects • H/W global synchronization

  28. Sunway BlueLight Architecture

  29. Technology innovations • Innovation at different levels • Device • Component • System • New processor architectures • Heterogeneous many-core, accelerators, re-configurable • Addressing the memory wall • New memory devices • 3D stacking • New cache architectures • High performance interconnect • All-optical networks • Silicon photonics • High density system design • Low power design

  30. SW1600 processor features • A general-purpose multi-core processor • Power efficient: achieves 2.0GFlops/W • Next-generation processor is under development

  31. FT-1500 CPU • SPARC V9, 16 cores, 4 SIMD • 40nm, 1.8GHz • Performance: 144GFlops • Typical power: ~65W

  32. Heterogeneous compute node (TH-2) • 2 Intel Ivy Bridge CPUs + 3 Intel Xeon Phi coprocessors • Similar ISA, different ALU • 16 registered ECC DDR3 DIMMs, 64GB • 3 PCI-E 3.0 x16 links to the Xeon Phis • Peak performance: 3.432TFlops • (Slide diagram: dual Gigabit LAN and PDP comm. ports, CPUs coupled by QPI, PCH attached via DMI, CPLD/IPMB board management, GDDR5 memory on each Xeon Phi)
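
The 3.432TFlops figure is consistent with the parts commonly reported for TH-2 nodes (Xeon E5-2692 v2 at 2.2GHz and Xeon Phi 31S1P), which the slide itself does not name, so the breakdown below should be read as a plausibility check rather than official data:

```latex
% Per-node double-precision peak, assuming 2x Xeon E5-2692 v2
% (12 cores, 2.2 GHz, 8 flops/cycle) and 3x Xeon Phi 31S1P
% (57 cores, 1.1 GHz, 16 flops/cycle):
2 \times (12 \times 2.2 \times 8) + 3 \times (57 \times 1.1 \times 16)
  = 422.4 + 3009.6 = 3432\,\mathrm{GFlops} = 3.432\,\mathrm{TFlops}
```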

  33. Interconnection network (TH-2) • Fat-tree topology using 13 top-level switches, each with 576 ports • Optical-electronic hybrid transport technology • Proprietary network protocol

  34. Interconnection network(TH-2) • High radix router ASIC: NRC • Feature size: 90nm • Die size: 17.16mm x 17.16mm • Package: FC-PBGA • 2577 pins • Throughput of single NRC: 2.56Tbps • Network interface ASIC: NIC • Same Feature size and package • Die size: 10.76mm x 10.76mm • 675 pins, PCI-E G2 16X

  35. High density system design (SW/BL) • Computing node: basic element, one processor + memory • Node complex: high density assembly, 2 computing nodes + network interface • Supernode: 256 nodes (processors), tightly coupled interconnect • Cabinet: 1024 computing nodes (4 supernodes) • System: composed of multiple cabinets

  36. Low power design • Low power design at different levels • Low power processors • Low power interconnect • Highly efficient cooling • Highly efficient power supply • Low power management • Fine-grain real-time power consumption monitoring • System status sensing • Multi-layer power consumption control • Low power programming • Should these become default system tools, like debugging and tuning? • Code power consumption modeling • Sampling code power consumption in the same way as code performance • Feedback to programming
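
The idea of sampling code power consumption the way one samples code performance can be sketched as follows. read_power_watts() is a hypothetical stand-in for whatever node-level power sensor the system exposes (the slides do not name an interface), so treat this as an illustration only:

```c
/* Sketch: sample a code region's power the way one samples its performance.
 * read_power_watts() is a HYPOTHETICAL sensor hook; on a real system it
 * would wrap the platform's power-monitoring interface. */
#include <stdio.h>
#include <time.h>

static double read_power_watts(void)
{
    return 250.0;   /* placeholder value; replace with a real sensor read */
}

static double now_sec(void)
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec + t.tv_nsec * 1e-9;
}

/* Example code region to be profiled. */
static void compute_region(void)
{
    volatile double s = 0.0;
    for (int i = 1; i < 1000000; i++) s += 1.0 / i;
}

int main(void)
{
    const int runs = 100;
    double power_sum = 0.0;
    double t0 = now_sec();

    for (int i = 0; i < runs; i++) {
        compute_region();
        power_sum += read_power_watts();   /* sample power alongside the work */
    }

    double seconds = now_sec() - t0;
    double avg_w   = power_sum / runs;
    /* Energy estimate fed back to the programmer, like a performance profile. */
    printf("time %.3fs, avg power %.1fW, energy %.1fJ\n",
           seconds, avg_w, avg_w * seconds);
    return 0;
}
```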

  37. Power supply (SW/BL) • DC UPS • Conversion efficiency 77% • Highly reliable • Associated power monitoring

  38. Efficient Cooling (TH-2) • Close-coupled chilled water cooling • Customized Liquid Cooling Unit • High Cooling Capacity: 80kW • Use city cooling system to supply cooling water to LCUs

  39. Efficient Cooling (SW/BL) • Water cooling to the board (node complex) • Energy-saving • Environment-friendly • High room temperature • Low noise

  40. HW/SW coordination • Using a combination of hardware and software technologies to address the technical issues • Achieving performance while maintaining flexibility • Compilation support • Parallel programming framework • Performance tools • HW/SW coordinated reliability measures • User-level checkpointing • Redundancy-based reliability measures
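
As a concrete, if simplified, illustration of user-level checkpointing (a generic sketch, not the checkpoint facility of any system named here), the application itself periodically serializes the state it cares about and restarts from the last complete snapshot:

```c
/* Minimal user-level checkpoint/restart sketch: the application decides what
 * state matters and writes it atomically (write to temp file, then rename). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N 1000000
#define CKPT "state.ckpt"

typedef struct { long step; double data[N]; } app_state;

static int save_checkpoint(const app_state *s)
{
    FILE *f = fopen(CKPT ".tmp", "wb");
    if (!f) return -1;
    size_t ok = fwrite(s, sizeof(*s), 1, f);
    fclose(f);
    if (ok != 1) return -1;
    return rename(CKPT ".tmp", CKPT);   /* atomic replace of the old snapshot */
}

static int load_checkpoint(app_state *s)
{
    FILE *f = fopen(CKPT, "rb");
    if (!f) return -1;                  /* no checkpoint: start from scratch */
    size_t ok = fread(s, sizeof(*s), 1, f);
    fclose(f);
    return ok == 1 ? 0 : -1;
}

int main(void)
{
    static app_state s;                 /* static: too large for the stack */
    if (load_checkpoint(&s) != 0)
        memset(&s, 0, sizeof(s));       /* fresh start */

    for (; s.step < 10000; s.step++) {
        for (int i = 0; i < N; i++)     /* the "real" computation */
            s.data[i] += 1e-6 * s.step;
        if (s.step % 100 == 0 && save_checkpoint(&s) != 0)
            fprintf(stderr, "checkpoint failed at step %ld\n", s.step);
    }
    printf("done at step %ld\n", s.step);
    return 0;
}
```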

  41. Software stack of TH-2

  42. Compiler for many-core • Features • Support C, Fortran and SIMD extensions • Libc for the computing kernels • Support for the storage hierarchy • Programming model for many-core acceleration • Collaborative cache data prefetch • Instruction prefetch optimization • Static/dynamic instruction scheduling optimization
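
The slides do not show the compiler's own directives; as a generic illustration of the same ideas (a vectorizable loop plus explicit cache data prefetch), here is how software prefetch is typically expressed in C, using GCC's __builtin_prefetch as a stand-in for the vendor compiler's support:

```c
/* Generic example of the ideas on this slide: a SIMD-friendly loop with
 * explicit software data prefetch. */
#include <stddef.h>

#define PREFETCH_DIST 64   /* elements ahead; tune to the memory latency */

void scaled_add(double *restrict y, const double *restrict x,
                double a, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DIST < n) {
            /* Hint: bring future operands into cache before they are needed. */
            __builtin_prefetch(&x[i + PREFETCH_DIST], 0 /* read  */, 1);
            __builtin_prefetch(&y[i + PREFETCH_DIST], 1 /* write */, 1);
        }
        y[i] += a * x[i];   /* simple, dependence-free: easy for SIMD codegen */
    }
}
```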

  43. Basic math lib for many-core • Basic math lib based on many-core structure • Basic function lib • SIMD extended function lib • Fortran function lib • Technical features • Standard function call interface • Customized optimization • Support accuracy analysis

  44. Parallel OS • Technical features • Unified architecture for heterogeneous many-cores • Low overhead virtualization • Highly efficient resource management

  45. Parallel application development platform • Covering program development, testing, tuning, parallelization and code translation • Collaborative tuning framework • Tools for parallelism analysis and parallelization • Integrated translation tools for multiple source codes

  46. Parallel programming framework • Hide the complexity of programming millions of cores • Integrate highly efficient implementations of fast parallel algorithms • Provide efficient data structures and solver libraries • Support software engineering concepts for code extensibility
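
To show what "hiding the complexity" typically means in practice, here is a deliberately hypothetical framework-style interface in C (all names invented for illustration; this is not the API of JASMIN or the other infrastructures listed on slide 48): the user writes only a per-patch numerical kernel, while the framework owns the mesh, the data layout and the parallel loop.

```c
/* Hypothetical framework-style interface (names invented for illustration):
 * the user supplies a per-patch kernel; the framework hides how patches are
 * distributed and scheduled across the machine. */
#include <stddef.h>

typedef struct { double *u, *rhs; size_t ncells; } patch_t;

/* User-supplied numerical kernel: operates on one patch, no parallelism here. */
typedef void (*cell_kernel)(patch_t *p, double dt);

/* Framework-provided driver: a real framework would split patches across
 * MPI ranks, threads and accelerators; this sketch uses only OpenMP. */
void framework_for_each_patch(patch_t *patches, size_t npatches,
                              cell_kernel kernel, double dt)
{
    #pragma omp parallel for schedule(dynamic)
    for (size_t i = 0; i < npatches; i++)
        kernel(&patches[i], dt);
}

/* Example user kernel: explicit update on one patch. */
static void explicit_update(patch_t *p, double dt)
{
    for (size_t c = 0; c < p->ncells; c++)
        p->u[c] += dt * p->rhs[c];
}
```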

  47. Applications • (Slide diagram: application stack, with domain applications such as materials, climate and nuclear energy on top, HPC application infrastructure/middleware in the middle, and peta-scale to 100PFlops supercomputers at the bottom) • Program wall: think parallel, write sequential

  48. Infrastructure: four types of computing • JASMIN (J Adaptive Structured Meshes applications INfrastructure): parallel adaptive structured-mesh support software framework (structured mesh) • PHG (Parallel Hierarchical Grid infrastructure): parallel adaptive finite element computing software platform (finite element) • JAUMIN (J Adaptive Unstructured Meshes applications INfrastructure): parallel adaptive unstructured-mesh support software framework (unstructured mesh) • JCOGIN (J mesh-free COmbinatory Geometry INfrastructure): parallel 3D mesh-free combinatorial geometry computing support software framework (combinatory geometry)

  49. Reliability design • High-quality components, strict screening tests • Water cooling to prolong the lifetime of components • High density assembly to reduce wire length and improve data transfer reliability • Multiple error correction codes to deal with transient errors • Redundant design for memory, computing nodes, networks, I/O, power supply, and water cooling

  50. Hardware monitoring (SW/BL) • Basis for reliability, availability, and maintainability of the system • Monitoring of major components • Maintenance • Diagnosis • Dedicated management network
