Perspective on Extreme Scale Computing in China



  1. Perspective on Extreme Scale Computing in China Depei Qian Sino-German Joint Software Institute (JSI) Beihang University Co-design 2013, Guilin, Oct. 29, 2013

  2. Outline • Related R&D programs in China • HPC system development • Application service environment • Applications

  3. Related R&D programs in China

  4. HPC-related R&D Under NSFC • NSFC • Key initiative “Basic algorithms for high performance scientific computing and computable modeling” • 2011-2018 • 180 million RMB • Basic algorithms and their highly efficient implementation • Computable modeling • Verification by solving domain problems

  5. HPC-related R&D Under 863 program • 3 Key projects in the last 12 years • High performance computer and core software (2002-2005) • High productivity computer and Grid service environment (2006-2010) • High productivity computer and application environment (2011-2016) • 3 Major projects • Multicore/many-core programming support (2012-2015) • High performance parallel algorithms and parallel coupler development for earth systems study (2010-2013) • HPC software support for earth system modeling (2010-2013)

  6. HPC-related R&D Under 973 program • 973 program • High performance scientific computing • Large scale scientific computing • Aggregation and coordination mechanisms in virtual computing environment • Highly efficient and trustworthy virtual computing environment

  7. There is no national long-term R&D program on extreme scale computing • Coordination between different programs needed

  8. Shift of 863 program emphasis • 1987: Intelligent computers, following the 5th generation computer program in Japan • 1990: from intelligent computers to high performance parallel computers • 1999: from individual HPC system to the national HPC environment • 2006: from high performance computers to high productivity computers

  9. History of HPC development under 863 program • 1990: parallel computers identified as priority topic of the 863 program • National Intelligent Computer R&D Center established • 1993: Dawning 1, 640MIPS, SMP • 1995: Dawning 1000, 2.5GFlops, MPP • Dawning company established in 1995 • 1996: Dawning 1000A, cluster system • First product-oriented system of Dawning • 1998: Dawning 2000, 100GFlops, cluster

  10. History of HPC development under 863 program • 2000: Dawning 3000, 400GFlops, cluster • First system commercialized • 2002: Lenovo DeepComp 1800, 1TFlops, cluster • Lenovo entered the HPC market • 2003: Lenovo DeepComp 6800, 5.3TFlops, cluster • 2004: Dawning 4000A, 11.2TFlops

  11. History of HPC development under 863 program • 2008: • Lenovo DeepComp 7000 • 150TFlops, heterogeneous cluster • Dawning 5000A • 230TFlops, cluster • 2010: • Dawning 6000 • 3PFlops, heterogeneous system, CPU+GPU • TH-1A • 4.7PFlops, heterogeneous CPU+GPU • 2011: • Sunway BlueLight • 1PFlops + 100TFlops • Based on domestic processors • 2013: • TH-2 • Heterogeneous system with CPU+MIC

  12. 863 key projects on HPC and Grid: 2002-2010 • “High performance computer and core software” • 4-year project, May 2002 to Dec. 2005 • 100 million Yuan funding from the MOST • More than 2× associated funding from local governments, application organizations, and industry • Major outcomes: China National Grid (CNGrid) • “High productivity Computer and Grid Service Environment” • Period: 2006-2010 (extended to now) • 940 million Yuan from the MOST and more than 1B Yuan in matching funds from other sources

  13. Current 863 key project • “High productivity computer and application environment” • 2011-2015 (2016) • 1.3B Yuan investment secured • Develop leading-level high performance computers • Transform CNGrid into an application service environment • Develop parallel applications in selected areas

  14. Projects launched • The first round of projects launched in 2011 • High productivity computer (1) • 100PF by the end of 2015 • HPC applications (6) • Fusion simulation • Simulation for aircraft design • Drug discovery • Digital media • Structural mechanics for large machinery • Simulation of electro-magnetic environment • Parallel programming framework (1) • Application service environment will be supported in the second round • Emphasis on application service support • Technologies for new mode of operation

  15. HPC system development

  16. Major challenges • Power consumption • Performance obtained by the applications • Programmability • Resilience • Major obstacles • Memory wall • Power wall • I/O wall • …

  17. Power consumption • The limiting factor in implementing extreme scale computers • Impossible to increase performance by expanding system scale only • Cooling of the system is difficult and affects reliability of the system • Energy cost is a heavy burden and prevents acceptance of extreme scale computers by end users

  18. Performance obtained by applications • Systems installed at general purpose computing centers • Serving a large population of users • supporting a wide range of applications • LinPack is not everything • Need to be efficient for both general-purpose and special-purpose computing • Need to support both computing-intensive and data-intensive applications

  19. Programmability • Must handle • Concurrency/locality • Heterogeneity of the system • Porting of legacy programs • Lower the skill requirements for application developers

  20. Resilience • Very short MTBF for extreme scale systems • Long-time continuous operation • System must self-heal/recover from hardware faults/failures • System must detect and tolerate errors in software

  21. Constrained design principle • We must set strict constraints on the implementation of extreme scale systems • Power efficiency • 5GF/W in 2015 • 50GF/W or better before 2020 • System scale • <100,000 processors • <200 cabinets • Cost • <300 million dollars (or <2B Yuan) • We can only design and implement an extreme scale system within those constraints
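
As a rough sanity check (my own arithmetic, not from the slides), these efficiency targets map directly onto machine-room power budgets: the 2015 target keeps a 100PFlops system near 20MW, and only the 2020-class target keeps an exaflops-class system in the same envelope.

```latex
% Power draw = peak performance / power efficiency
P_{100\,\mathrm{PF}} = \frac{100 \times 10^{15}\,\mathrm{Flops}}
                            {5 \times 10^{9}\,\mathrm{Flops/W}} = 20\,\mathrm{MW},
\qquad
P_{1\,\mathrm{EF}} = \frac{10^{18}\,\mathrm{Flops}}
                          {50 \times 10^{9}\,\mathrm{Flops/W}} = 20\,\mathrm{MW}
```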

  22. How to address the challenges? • Architectural support • Technology innovation • Hardware and software coordination

  23. Architectural support • Using the most appropriate architecture to achieve the goal • Making trade-offs between performance, power consumption, programmability, resilience, and cost • Hybrid architecture (TH-1A & TH-2) • General purpose + high density computing (GPU or MIC) • HPP architecture (Dawning 6000/Loongson) • Enable different processors to co-exist • Support global address space • Multiple levels of parallelism • Multi-conformation and multi-scale adaptive architecture (SW/BL) • Cluster implemented with Intel processors for supporting commercial software • Homogeneous system implemented with domestic multicore processors for computing-intensive applications • Support parallelism at different levels
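
All three architectural lines expose at least two levels of parallelism: across nodes and within a node. As a generic illustration of that point (standard MPI and OpenMP APIs, not code from any of the machines named above), a minimal hybrid skeleton in C looks like this:

```c
/* Minimal hybrid MPI + OpenMP sketch: node-level parallelism via MPI ranks,
 * core-level parallelism via OpenMP threads inside each rank. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;
    /* Request thread support so OpenMP regions may coexist with MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double local_sum = 0.0, global_sum = 0.0;

    /* Core-level parallelism: each thread works on part of this rank's data. */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < 1000000; i++)
        local_sum += 1.0 / (1.0 + i + rank);

    /* Node-level parallelism: combine partial results across ranks. */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("ranks=%d threads/rank=%d sum=%f\n",
               nranks, omp_get_max_threads(), global_sum);

    MPI_Finalize();
    return 0;
}
```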

  24. Classification of current major architectures • Classifying architectures using “homogeneity/heterogeneity” and “CPU only/CPU+Accelerator” • Homo-/Hetero refers to the ISA

  25. Comparison of different architectures

  26. TH-1A architecture • Hybrid system architecture • Computing sub-system • Service sub-system • Communication networks • Storage sub-system • Monitoring and diagnosis sub-system • (Slide diagram: CPU+GPU compute nodes and operation nodes of the service sub-system, connected by the communication sub-system to an MDS/OSS storage sub-system, with the monitoring and diagnosis sub-system alongside)

  27. Dawning/Loongson HPP (Hyper Parallel Processing) architecture • Hyper node composed of AMD and Loongson processors • Separation of OS & application processors • Multiple interconnects • H/W global synchronization

  28. Sunway BlueLight Architecture

  29. Technology innovations • Innovation at different levels • Device • Component • System • New processor architectures • Heterogeneous many-core, accelerators, re-configurable • Addressing the memory wall • New memory devices • 3D stacking • New cache architectures • High performance interconnect • All-optical networks • Silicon photonics • High density system design • Low power design

  30. SW1600 processor features • A general-purpose multi-core processor • Power efficient: achieves 2.0GFlops/W • Next-generation processor is under development

  31. FT-1500 CPU • SPARC V9, 16 cores, 4 SIMD • 40nm, 1.8GHz • Performance: 144GFlops • Typical power: ~65W

  32. Heterogeneous compute node (TH-2) • 2 Intel Ivy Bridge CPUs + 3 Intel Xeon Phi coprocessors • Similar ISA, different ALU • 16 registered ECC DDR3 DIMMs, 64GB • 3 PCI-E 3.0 x16 links to the Xeon Phis • Peak performance: 3.432TFlops • (Slide diagram: dual Gigabit LAN and PDP comm. ports, CPUs coupled by QPI, PCH attached via DMI, CPLD/IPMB board management, GDDR5 memory on each Xeon Phi)
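
The 3.432TFlops figure is consistent with the parts commonly reported for TH-2 nodes (Xeon E5-2692 v2 at 2.2GHz and Xeon Phi 31S1P), which the slide itself does not name, so the breakdown below should be read as a plausibility check rather than official data:

```latex
% Per-node double-precision peak, assuming 2x Xeon E5-2692 v2
% (12 cores, 2.2 GHz, 8 flops/cycle) and 3x Xeon Phi 31S1P
% (57 cores, 1.1 GHz, 16 flops/cycle):
2 \times (12 \times 2.2 \times 8) + 3 \times (57 \times 1.1 \times 16)
  = 422.4 + 3009.6 = 3432\,\mathrm{GFlops} = 3.432\,\mathrm{TFlops}
```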

  33. Interconnection network (TH-2) • Fat-tree topology using 13 top-level switches, each with 576 ports • Optical-electronic hybrid transport technology • Proprietary network protocol

  34. Interconnection network(TH-2) • High radix router ASIC: NRC • Feature size: 90nm • Die size: 17.16mm x 17.16mm • Package: FC-PBGA • 2577 pins • Throughput of single NRC: 2.56Tbps • Network interface ASIC: NIC • Same Feature size and package • Die size: 10.76mm x 10.76mm • 675 pins, PCI-E G2 16X

  35. High density system design (SW/BL) • Computing node: basic element, one processor + memory • Node complex: high density assembly, 2 computing nodes + network interface • Supernode: 256 nodes (processors), tightly coupled interconnect • Cabinet: 1024 computing nodes (4 supernodes) • System: composed of multiple cabinets

  36. Low power design • Low power design at different levels • Low power processors • Low power interconnect • Highly efficient cooling • Highly efficient power supply • Low power management • Fine-grain real-time power consumption monitoring • System status sensing • Multi-layer power consumption control • Low power programming • Should these become default system tools, like debugging and tuning? • Code power consumption modeling • Sampling code power consumption in the same way as code performance • Feedback to programming
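
The idea of sampling code power consumption the way one samples code performance can be sketched as follows. read_power_watts() is a hypothetical stand-in for whatever node-level power sensor the system exposes (the slides do not name an interface), so treat this as an illustration only:

```c
/* Sketch: sample a code region's power the way one samples its performance.
 * read_power_watts() is a HYPOTHETICAL sensor hook; on a real system it
 * would wrap the platform's power-monitoring interface. */
#include <stdio.h>
#include <time.h>

static double read_power_watts(void)
{
    return 250.0;   /* placeholder value; replace with a real sensor read */
}

static double now_sec(void)
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec + t.tv_nsec * 1e-9;
}

/* Example code region to be profiled. */
static void compute_region(void)
{
    volatile double s = 0.0;
    for (int i = 1; i < 1000000; i++) s += 1.0 / i;
}

int main(void)
{
    const int runs = 100;
    double power_sum = 0.0;
    double t0 = now_sec();

    for (int i = 0; i < runs; i++) {
        compute_region();
        power_sum += read_power_watts();   /* sample power alongside the work */
    }

    double seconds = now_sec() - t0;
    double avg_w   = power_sum / runs;
    /* Energy estimate fed back to the programmer, like a performance profile. */
    printf("time %.3fs, avg power %.1fW, energy %.1fJ\n",
           seconds, avg_w, avg_w * seconds);
    return 0;
}
```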

  37. Power supply (SW/BL) • DC UPS • Conversion efficiency 77% • Highly reliable • Associated power monitoring

  38. Efficient Cooling (TH-2) • Close-coupled chilled water cooling • Customized Liquid Cooling Unit • High Cooling Capacity: 80kW • Use city cooling system to supply cooling water to LCUs

  39. Efficient Cooling (SW/BL) • Water cooling to the board (node complex) • Energy-saving • Environment-friendly • High room temperature • Low noise

  40. HW/SW coordination • Using a combination of hardware and software technologies to address the technical issues • Achieving performance while maintaining flexibility • Compilation support • Parallel programming framework • Performance tools • HW/SW coordinated reliability measures • User-level checkpointing • Redundancy-based reliability measures
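
As a concrete, if simplified, illustration of user-level checkpointing (a generic sketch, not the checkpoint facility of any system named here), the application itself periodically serializes the state it cares about and restarts from the last complete snapshot:

```c
/* Minimal user-level checkpoint/restart sketch: the application decides what
 * state matters and writes it atomically (write to temp file, then rename). */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N 1000000
#define CKPT "state.ckpt"

typedef struct { long step; double data[N]; } app_state;

static int save_checkpoint(const app_state *s)
{
    FILE *f = fopen(CKPT ".tmp", "wb");
    if (!f) return -1;
    size_t ok = fwrite(s, sizeof(*s), 1, f);
    fclose(f);
    if (ok != 1) return -1;
    return rename(CKPT ".tmp", CKPT);   /* atomic replace of the old snapshot */
}

static int load_checkpoint(app_state *s)
{
    FILE *f = fopen(CKPT, "rb");
    if (!f) return -1;                  /* no checkpoint: start from scratch */
    size_t ok = fread(s, sizeof(*s), 1, f);
    fclose(f);
    return ok == 1 ? 0 : -1;
}

int main(void)
{
    static app_state s;                 /* static: too large for the stack */
    if (load_checkpoint(&s) != 0)
        memset(&s, 0, sizeof(s));       /* fresh start */

    for (; s.step < 10000; s.step++) {
        for (int i = 0; i < N; i++)     /* the "real" computation */
            s.data[i] += 1e-6 * s.step;
        if (s.step % 100 == 0 && save_checkpoint(&s) != 0)
            fprintf(stderr, "checkpoint failed at step %ld\n", s.step);
    }
    printf("done at step %ld\n", s.step);
    return 0;
}
```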

  41. Software stack of TH-2

  42. Compiler for many-core • Features • Support C, Fortran and SIMD extensions • Libc for the computing kernels • Support for the storage hierarchy • Programming model for many-core acceleration • Collaborative cache data prefetch • Instruction prefetch optimization • Static/dynamic instruction scheduling optimization
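
The slides do not show the compiler's own directives; as a generic illustration of the same ideas (a vectorizable loop plus explicit cache data prefetch), here is how software prefetch is typically expressed in C, using GCC's __builtin_prefetch as a stand-in for the vendor compiler's support:

```c
/* Generic example of the ideas on this slide: a SIMD-friendly loop with
 * explicit software data prefetch. */
#include <stddef.h>

#define PREFETCH_DIST 64   /* elements ahead; tune to the memory latency */

void scaled_add(double *restrict y, const double *restrict x,
                double a, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_DIST < n) {
            /* Hint: bring future operands into cache before they are needed. */
            __builtin_prefetch(&x[i + PREFETCH_DIST], 0 /* read  */, 1);
            __builtin_prefetch(&y[i + PREFETCH_DIST], 1 /* write */, 1);
        }
        y[i] += a * x[i];   /* simple, dependence-free: easy for SIMD codegen */
    }
}
```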

  43. Basic math lib for many-core • Basic math lib based on many-core structure • Basic function lib • SIMD extended function lib • Fortran function lib • Technical features • Standard function call interface • Customized optimization • Support accuracy analysis

  44. Parallel OS • Technical features • Unified architecture for heterogeneous many-cores • Low overhead virtualization • Highly efficient resource management

  45. Parallel application development platform • Covering program development, testing, tuning, parallelization and code translation • Collaborative tuning framework • Tools for parallelism analysis and parallelization • Integrated translation tools for multiple source codes

  46. Parallel programming framework • Hide the complexity of programming millions of cores • Integrate highly efficient implementations of fast parallel algorithms • Provide efficient data structures and solver libraries • Support software engineering concepts for code extensibility
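
To show what "hiding the complexity" typically means in practice, here is a deliberately hypothetical framework-style interface in C (all names invented for illustration; this is not the API of JASMIN or the other infrastructures listed on slide 48): the user writes only a per-patch numerical kernel, while the framework owns the mesh, the data layout and the parallel loop.

```c
/* Hypothetical framework-style interface (names invented for illustration):
 * the user supplies a per-patch kernel; the framework hides how patches are
 * distributed and scheduled across the machine. */
#include <stddef.h>

typedef struct { double *u, *rhs; size_t ncells; } patch_t;

/* User-supplied numerical kernel: operates on one patch, no parallelism here. */
typedef void (*cell_kernel)(patch_t *p, double dt);

/* Framework-provided driver: a real framework would split patches across
 * MPI ranks, threads and accelerators; this sketch uses only OpenMP. */
void framework_for_each_patch(patch_t *patches, size_t npatches,
                              cell_kernel kernel, double dt)
{
    #pragma omp parallel for schedule(dynamic)
    for (size_t i = 0; i < npatches; i++)
        kernel(&patches[i], dt);
}

/* Example user kernel: explicit update on one patch. */
static void explicit_update(patch_t *p, double dt)
{
    for (size_t c = 0; c < p->ncells; c++)
        p->u[c] += dt * p->rhs[c];
}
```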

  47. Applications • (Slide diagram: application stack, with domain applications such as materials, climate and nuclear energy on top, HPC application infrastructure/middleware in the middle, and peta-scale to 100PFlops supercomputers at the bottom) • Program wall: think parallel, write sequential

  48. Infrastructure: four types of computing • JASMIN (J Adaptive Structured Meshes applications INfrastructure): parallel adaptive structured-mesh support software framework (structured mesh) • PHG (Parallel Hierarchical Grid infrastructure): parallel adaptive finite element computing software platform (finite element) • JAUMIN (J Adaptive Unstructured Meshes applications INfrastructure): parallel adaptive unstructured-mesh support software framework (unstructured mesh) • JCOGIN (J mesh-free COmbinatory Geometry INfrastructure): parallel 3D mesh-free combinatorial geometry computing support software framework (combinatory geometry)

  49. Reliability design • High-quality components, strict screening tests • Water cooling to prolong the lifetime of components • High density assembly to reduce wire length and improve data transfer reliability • Multiple error correction codes to deal with transient errors • Redundant design for memory, computing nodes, networks, I/O, power supply, and water cooling

  50. Hardware monitoring (SW/BL) • Basis for reliability, availability, and maintainability of the system • Monitoring of major components • Maintenance • Diagnosis • Dedicated management network
