
Overview of Extreme-Scale Software Research in China

Overview of Extreme-Scale Software Research in China. Depei Qian Sino-German Joint Software Institute (JSI) Beihang University China-USA Computer Software Workshop Sep. 27, 2011. Outline. Related R&D efforts in China Algorithms and Computational Methods HPC and e-Infrastructure


Presentation Transcript


  1. Overview of Extreme-Scale Software Research in China Depei Qian Sino-German Joint Software Institute (JSI) Beihang University China-USA Computer Software Workshop Sep. 27, 2011

  2. Outline • Related R&D efforts in China • Algorithms and Computational Methods • HPC and e-Infrastructure • Parallel programming frameworks • Programming heterogeneous systems • Advanced compiler technology • Tools • Domain specific programming support

  3. Related R&D efforts in China • NSFC • Basic algorithms and computable modeling for high performance scientific computing • Network based research environment • Many-core parallel programming • 863 program • High productivity computer and Grid service environment • Multicore/many-core programming support • HPC software for earth system modeling • 973 program • Parallel algorithms for large scale scientific computing • Virtual computing environment

  4. Algorithms and Computational Methods

  5. NSFC’s Key Initiative on Algorithm and Modeling • Basic algorithms and computable modeling for high performance scientific computing • 8-year, launched in 2011 • 180 million Yuan funding • Focused on • Novel computational methods and basic parallel algorithms • Computable modeling for selected domains • Implementation and verification of parallel algorithms by simulation

  6. HPC & e-Infrastructure

  7. 863's key projects on HPC and Grid • “High productivity Computer and Grid Service Environment” • Period: 2006-2010 • 940 million Yuan from the MOST and more than 1B Yuan matching money from other sources • Major R&D activities • Developing PFlops computers • Building up a grid service environment (CNGrid) • Developing Grid and HPC applications in selected areas

  8. CNGrid GOS Architecture (layers) • Grid Portal, Gsh+CLI, GSML Workshop and Grid Apps • Core, System and App Level Services • Axis Handlers for Message Level Security • Tomcat (5.0.28) + Axis (1.2 rc2) • J2SE (1.4.2_07, 1.5.0_07) • OS (Linux/Unix/Windows) • PC Server (Grid Server)

  9. Abstractions • Grid community: Agora (persistent information storage and organization) • Grid process: Grip (runtime control)

  10. CNGrid GOS deployment • CNGrid GOS deployed on 11 sites and some application Grids • Supports heterogeneous HPCs: Galaxy, Dawning, DeepComp • Supports multiple platforms: Unix, Linux, Windows • Uses public network connections, with only the HTTP port enabled • Flexible client • Web browser • Special client • GSML client

  11. • Tsinghua University: 1.33TFlops, 158TB storage, 29 applications, 100+ users, IPv4/v6 access • CNIC: 150TFlops, 1.4PB storage, 30 applications, 269 users all over the country, IPv4/v6 access • IAPCM: 1TFlops, 4.9TB storage, 10 applications, 138 users, IPv4/v6 access • Shandong University: 10TFlops, 18TB storage, 7 applications, 60+ users, IPv4/v6 access • GSCC: 40TFlops, 40TB storage, 6 applications, 45 users, IPv4/v6 access • SSC: 200TFlops, 600TB storage, 15 applications, 286 users, IPv4/v6 access • XJTU: 4TFlops, 25TB storage, 14 applications, 120+ users, IPv4/v6 access • USTC: 1TFlops, 15TB storage, 18 applications, 60+ users, IPv4/v6 access • HUST: 1.7TFlops, 15TB storage, IPv4/v6 access • SIAT: 10TFlops, 17.6TB storage, IPv4/v6 access • HKU: 20TFlops, 80+ users, IPv4/v6 access

  12. CNGrid: resources • 11 sites • >450TFlops • 2900TB storage • Three PF-scale sites will be integrated into CNGrid soon

  13. CNGrid: services and users • 230 services • >1400 users • China Commercial Aircraft Corp • Bao Steel • automobile • institutes of CAS • universities • ……

  14. CNGrid: applications • Supporting >700 projects • 973, 863, NSFC, CAS Innovative, and Engineering projects

  15. Parallel programming frameworks

  16. Jasmin: A parallel programming framework • [Figure labels: Applications Codes; Data Dependency, Communications, Load Balancing, Data Structures; Parallel Computing Models; Models, Stencils, Algorithms; Library; Computers; arrows: extract, form, support, promote, separate (special vs. common)] • Also supported by the 973 and 863 projects

  17. Basic ideas • Hide the complexity of programming millions of cores • Integrate the efficient implementations of parallel fast numerical algorithms • Provide efficient data structures and solver libraries • Support software engineering for code extensibility

  18. Basic Ideas • [Figure: application codes scale up using infrastructures, from serial programming on a personal computer, to a TeraFlops cluster, to a PetaFlops MPP]

  19. JASMIN: J parallel Adaptive Structured Mesh INfrastructure • http://www.iapcm.ac.cn/jasmin, 2010SR050446 • 2003-now • [Figure labels: Structured Grid, Unstructured Grid, Particle Simulation; Inertial Confinement Fusion, Global Climate Modeling, CFD, Material Simulations, ……]

  20. JASMIN V. 2.0 • User provides: physics, parameters, numerical methods, expert experiences, special algorithms, etc. • User interfaces: component-based parallel programming models (C++ classes) • Numerical algorithms: geometry, fast solvers, mature numerical methods, time integrators, etc. • HPC implementations (thousands of CPUs): data structures, parallelization, load balancing, adaptivity, visualization, restart, memory, etc. • Architecture: multilayered, modularized, object-oriented • Codes: C++/C/F90/F77 + MPI/OpenMP, 500,000 lines • Installation: personal computers, cluster, MPP

  21. Numerical simulations on TianHe-1A • Simulation duration: several hours to tens of hours

  22. Programming heterogeneous systems

  23. GPU programming support • Source to source translation • Runtime optimization • Mixed programming model for multi-GPU systems

  24. S2S translation for GPU • A source-to-source translator, GPU-S2S, for GPU programming • Facilitates the development of parallel programs on GPU by combining automatic mapping and static compilation

  25. S2S translation for GPU (cont'd) • Insert directives into the source program • Guide implicit calls of CUDA runtime libraries • Enable the user to control the mapping from the homogeneous CPU platform to the GPU's streaming platform • Optimization based on runtime profiling • Take full advantage of the GPU according to the application characteristics by collecting dynamic runtime information

  26. The GPU-S2S architecture

  27. Program translation by GPU-S2S

  28. Runtime optimization based on profiling • First level profiling (function level) • Second level profiling (memory access and kernel improvement) • Third level profiling (data partition)

  29. First level profiling • Identify computing kernels • Instrument the scan source code, get the execution time of every function, and identify the computing kernel
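
A minimal Python sketch of function-level profiling as described on this slide: every function is wrapped so its execution time is accumulated, and the function with the largest total is reported as the computing kernel. The decorator `instrument`, the registry `elapsed`, and the two sample functions are hypothetical names chosen for illustration; GPU-S2S itself instruments the source program and its mechanism is not shown in the slides.

```python
import time
from collections import defaultdict
from functools import wraps

# Accumulated wall-clock time per instrumented function (hypothetical registry).
elapsed = defaultdict(float)

def instrument(func):
    """Wrap func so that every call adds its (inclusive) run time to `elapsed`."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return func(*args, **kwargs)
        finally:
            elapsed[func.__name__] += time.perf_counter() - start
    return wrapper

@instrument
def load_input(n):
    # Cheap setup step.
    return list(range(n))

@instrument
def scan(values):
    # Prefix sum: the expensive loop we expect to be the computing kernel.
    out, total = [], 0
    for v in values:
        total += v
        out.append(total)
    return out

def hottest_function():
    """Return the name of the function with the largest accumulated time."""
    return max(elapsed, key=elapsed.get)

data = load_input(1000)
for _ in range(200):
    scan(data)
kernel = hottest_function()
print("computing kernel candidate:", kernel)
```

The times are inclusive, so a function that calls other instrumented functions is charged for them as well; a real tool would usually separate inclusive and exclusive time.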

  30. Second level profiling • Identify the memory access pattern and improve the kernels • Instrument the computing kernels, extract and analyze the profile information, optimize according to the features of the application, and finally generate the CUDA code with optimized kernels

  31. Third level profiling • Optimization by improving data partition • Get copy time and computing time by instrumentation • Compute the number of streams and the data size of each stream • Generate the optimized CUDA code with streams
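
One way to turn measured copy and compute times into a stream count is sketched below. It assumes a simple two-stage pipeline model that is not taken from the slides: the data is split into equal chunks, the copy of one chunk overlaps the kernel of the previous chunk, and each stream adds a fixed launch overhead. The function names and the overhead value are hypothetical.

```python
def pipelined_time(t_copy, t_kernel, n_streams, launch_overhead):
    """Estimated run time when copy and compute are split over n_streams chunks.

    Assumed model: a two-stage pipeline (copy, then kernel) where stages of
    different chunks overlap, plus a fixed per-stream overhead.
    """
    c = t_copy / n_streams      # copy time of one chunk
    k = t_kernel / n_streams    # kernel time of one chunk
    return c + k + (n_streams - 1) * max(c, k) + n_streams * launch_overhead

def choose_streams(t_copy, t_kernel, total_bytes,
                   launch_overhead=0.001, max_streams=16):
    """Pick the stream count with the lowest estimated time and the chunk size."""
    best = min(
        range(1, max_streams + 1),
        key=lambda n: pipelined_time(t_copy, t_kernel, n, launch_overhead),
    )
    chunk_bytes = -(-total_bytes // best)  # ceiling division
    return best, chunk_bytes

# Example with made-up measurements: 40 ms of copying, 60 ms of computing.
n, chunk = choose_streams(0.040, 0.060, total_bytes=64 * 1024 * 1024)
print(n, chunk)
```

With one stream the model reduces to copy time plus kernel time; more streams hide the shorter stage behind the longer one until the per-stream overhead dominates.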

  32. Matrix multiplication • The CUDA code with three-level profiling optimization achieves a 31% improvement over the CUDA code with only memory access optimization, and a 91% improvement over the CUDA code using only global memory for computing. • [Figure: performance comparison before and after profiling] • [Figure: execution performance comparison on different platforms]

  33. FFT (1048576 points) • The CUDA code after three-level profiling optimization achieves a 38% improvement over the CUDA code with memory access optimization, and a 77% improvement over the CUDA code using only global memory for computing. • [Figure: performance comparison before and after profiling] • [Figure: execution performance comparison on different platforms]

  34. Programming Multi-GPU systems • The memory of the CPU+GPU system is both distributed and shared, so it is feasible to use the MPI and PGAS programming models for this new kind of system. • MPI and PGAS: using message passing or shared data for communication between parallel tasks or GPUs

  35. Mixed Programming Model • NVIDIA GPU: CUDA • Traditional programming model: MPI/UPC • Combined: MPI+CUDA / UPC+CUDA • [Figure: CUDA program execution]

  36. MPI+CUDA experiment • Platform • 2 NF5588 servers, each equipped with 1 Xeon CPU (2.27GHz), 12GB main memory, and 2 NVIDIA Tesla C1060 GPUs (GT200 architecture, 4GB device memory) • 1Gbit Ethernet • RedHat Linux 5.3 • CUDA Toolkit 2.3 and CUDA SDK • OpenMPI 1.3 • Berkeley UPC 2.1

  37. MPI+CUDA experiment (cont'd) • Matrix multiplication program • Using block matrix multiply for UPC programming • Data spread on each UPC thread • The computing kernel carries out the multiplication of two blocks at one time, implemented in CUDA • The total execution time: Tsum = Tcom + Tcuda = Tcom + Tcopy + Tkernel • Tcom: UPC thread communication time • Tcuda: CUDA program execution time • Tcopy: data transmission time between host and device • Tkernel: GPU computing time
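
The block decomposition and the time model on this slide can be sketched in plain Python. This is an illustration only: `multiply_block` stands in for the CUDA kernel, the round-robin assignment of output blocks to threads is an assumption (the slides do not give the actual UPC data layout), and all function names are hypothetical.

```python
def multiply_block(A, B, C, i0, j0, k0, bs):
    """Accumulate the product of one bs x bs block of A and one of B into C.

    Stand-in for the kernel that multiplies two blocks at a time.
    """
    for i in range(i0, i0 + bs):
        for j in range(j0, j0 + bs):
            acc = 0
            for k in range(k0, k0 + bs):
                acc += A[i][k] * B[k][j]
            C[i][j] += acc

def blocked_matmul(A, B, bs, n_threads):
    """Multiply square matrices block by block.

    Returns the product and the number of kernel calls charged to each
    thread, with output blocks assigned round-robin (an assumed layout).
    """
    n = len(A)
    if n % bs != 0:
        raise ValueError("matrix size must be a multiple of the block size")
    blocks = n // bs
    C = [[0] * n for _ in range(n)]
    kernel_calls = [0] * n_threads
    for bi in range(blocks):
        for bj in range(blocks):
            owner = (bi * blocks + bj) % n_threads
            for bk in range(blocks):
                multiply_block(A, B, C, bi * bs, bj * bs, bk * bs, bs)
                kernel_calls[owner] += 1
    return C, kernel_calls

def total_time(t_com, t_copy, t_kernel):
    """Tsum = Tcom + Tcuda, where Tcuda = Tcopy + Tkernel."""
    t_cuda = t_copy + t_kernel
    return t_com + t_cuda

A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
B = [[1, 0, 2, 0], [0, 1, 0, 2], [3, 0, 1, 0], [0, 3, 0, 1]]
C, calls = blocked_matmul(A, B, bs=2, n_threads=2)
print(C, calls)
```

Because Tcom and Tcopy do not shrink as fast as Tkernel when work is spread over more GPUs, the model also suggests why small matrices can run slower on two GPUs than on one.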

  38. MPI+CUDA experiment (cont'd) • For 4094*4096, the speedup of 1 MPI+CUDA task (using 1 GPU for computing) is 184x over the case with 8 MPI tasks. • For small-scale data, such as 256 and 512, the execution time using 2 GPUs is even longer than using 1 GPU: the computing scale is too small, so the communication between the two tasks outweighs the reduction in computing time. • [Figure legend: 2 servers, at most 8 MPI tasks; 1 server with 2 GPUs]

  39. PKU Manycore Software Research Group • Software tool development for GPU clusters • Unified multicore/manycore/clustering programming • Resilience technology for very-large GPU clusters • Software porting service • Joint project, <3k-line Code, supporting Tianhe • Advanced training program

  40. PKU-Tianhe Turbulence Simulation • Reaches a scale 43 times higher than that of the Earth Simulator • 7168 nodes / 14336 CPUs / 7168 GPUs • FFT speed: 1.6X of Jaguar • Proof of feasibility of GPU speed up for large scale systems • [Figure labels: PKUFFT (using GPUs), MKL (not using GPUs), Jaguar]

  41. Advanced Compiler Technology

  42. Advanced Compiler Technology (ACT) Group at the ICT, CAS • ACT's current research • Parallel programming languages and models • Optimized compilers and tools for HPC (Dawning) and multi-core processors (Loongson) • Will lead the new multicore/many-core programming support project

  43. PTA: Process-based TAsk parallel programming model • New process-based task construct • With properties of isolation, atomicity and deterministic submission • Annotate a loop into two parts, prologue and task segment • #pragma pta parallel [clauses] • #pragma pta task • #pragma pta propagate (varlist) • Suitable for expressing coarse-grained, irregular parallelism on loops • Implementation and performance • PTA compiler, runtime system and assistant tool (helps in writing correct programs) • Speedup: 4.62 to 43.98 (average 27.58 on 48 cores); 3.08 to 7.83 (average 6.72 on 8 cores) • Code changes are within 10 lines, much smaller than OpenMP

  44. Hierarchical UPC • UPC-H: A Parallel Programming Model for Deep Parallel Hierarchies • Provide multi-level data distribution • Implicit and explicit hierarchical loop parallelism • Hybrid execution model: SPMD with fork-join • Multi-dimensional data distribution and super-pipelining • Implementations on CUDA clusters and Dawning 6000 cluster • Based on Berkeley UPC • Enhanced optimizations such as localization and communication optimization • Support SIMD intrinsics • CUDA cluster: 72% of the hand-tuned version's performance, code reduction to 68% • Multi-core cluster: better process mapping and cache reuse than UPC
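
To make "multi-level data distribution" concrete, the sketch below maps a global array index to a (node, thread) owner using two nested block distributions. This is a simplified illustration, not UPC-H syntax or its actual layout rules; the function names and the contiguous-block scheme are assumptions.

```python
def block_bounds(n, parts, p):
    """Half-open range [lo, hi) of part p when n items are split into
    `parts` near-equal contiguous blocks (earlier parts get the remainder)."""
    base, extra = divmod(n, parts)
    lo = p * base + min(p, extra)
    hi = lo + base + (1 if p < extra else 0)
    return lo, hi

def owner(i, n, nodes, threads_per_node):
    """Return (node, thread) owning global index i under a two-level split:
    first across nodes, then across the threads inside the owning node."""
    for node in range(nodes):
        lo, hi = block_bounds(n, nodes, node)
        if lo <= i < hi:
            for t in range(threads_per_node):
                tlo, thi = block_bounds(hi - lo, threads_per_node, t)
                if tlo <= i - lo < thi:
                    return node, t
    raise IndexError(i)

# 10 elements over 2 nodes with 2 threads each.
layout = [owner(i, 10, nodes=2, threads_per_node=2) for i in range(10)]
print(layout)
```

Keeping neighbouring elements on the same node first, and on the same thread second, is what lets a hierarchical model exploit locality at each level of the machine.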

  45. OpenMP and Runtime Support for Heterogeneous Platforms • Heterogeneous platforms consisting of CPUs and GPUs • Multiple GPUs or CPU-GPU cooperation bring extra data transfer, hurting the performance gain • Programmers need a unified data management system • OpenMP extension • Specify partitioning ratio to optimize data transfer globally • Specify heterogeneous blocking sizes to reduce false sharing among computing devices • Runtime support • DSM system based on the blocking size specified • Intelligent runtime prefetching with the help of compiler analysis • Implementation and results • On OpenUH compiler • Gains 1.6X speedup through prefetching on NPB/SP (class C)
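
As an illustration of what a partitioning ratio does, the sketch below splits a loop's iteration space between a GPU and a CPU. The rule for deriving the ratio from device throughput is an assumption made for this example; the slides do not describe how the OpenMP extension chooses or expresses the ratio, and the function names are hypothetical.

```python
def balanced_ratio(cpu_rate, gpu_rate):
    """Fraction of iterations to give the GPU so both devices finish together,
    assuming each device processes iterations at a constant rate."""
    return gpu_rate / (cpu_rate + gpu_rate)

def split_iterations(n, gpu_ratio):
    """Split iterations 0..n-1 into a GPU range and a CPU range."""
    gpu_n = round(n * gpu_ratio)
    return range(0, gpu_n), range(gpu_n, n)

# Made-up rates: the GPU handles three times as many iterations per second.
ratio = balanced_ratio(cpu_rate=1.0, gpu_rate=3.0)
gpu_part, cpu_part = split_iterations(1000, ratio)
print(ratio, len(gpu_part), len(cpu_part))
```

A fixed split like this keeps each device working on its own contiguous region, which is also what limits the data that has to move between host and device memory.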

  46. Analyzers based on Compiling Techniques for MPI programs • Communication slicing and process mapping tool • Compiler part • PDG graph building and slicing generation • Iteration set transformation for approximation • Optimized mapping tool • Weighted graph, hardware characteristics • Graph partitioning and feedback-based evaluation • Memory bandwidth measuring tool for MPI programs • Detect the burst of bandwidth requirements • Enhance the performance of MPI error checking • Redundant error checking removal by dynamically turning on/off the global error checking • With the help of compiler analysis on communicators • Integrated with a model checking tool (ISP) and a runtime checking tool (MARMOT)
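
The idea behind mapping processes with a weighted communication graph can be sketched as follows. This is a brute-force illustration for a handful of processes and two nodes, assuming an even process count; the actual tool's partitioning algorithm and its feedback step are not described in the slides, and all names here are hypothetical.

```python
from itertools import combinations

def cut_weight(comm, group):
    """Total traffic between processes inside `group` and those outside it.

    `comm` maps a pair of process ranks (a, b) to their communication volume.
    """
    inside = set(group)
    return sum(w for (a, b), w in comm.items() if (a in inside) != (b in inside))

def best_two_node_mapping(comm, n_procs):
    """Try every balanced split of n_procs ranks over two nodes and keep the
    one with the least inter-node traffic (n_procs is assumed to be even)."""
    half = n_procs // 2
    # Fixing rank 0 on the first node avoids testing mirror-image splits.
    candidates = (g for g in combinations(range(n_procs), half) if 0 in g)
    first = min(candidates, key=lambda g: cut_weight(comm, g))
    second = tuple(p for p in range(n_procs) if p not in first)
    return first, second, cut_weight(comm, first)

# Made-up traffic: ranks 0-2 and 1-3 exchange most of the data.
comm = {(0, 1): 1, (0, 2): 10, (1, 3): 10, (2, 3): 1}
node_a, node_b, traffic = best_two_node_mapping(comm, 4)
print(node_a, node_b, traffic)
```

Exhaustive search is only practical for tiny cases; larger jobs need a graph partitioning heuristic, with measured run time fed back to refine the edge weights.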

  47. LoongCC: An Optimizing Compiler for Loongson Multicore Processors • Based on Open64-4.2 and supporting C/C++/Fortran • Open source at http://svn.open64.net/svnroot/open64/trunk/ • Powerful optimizer and analyzer with better performances • SIMD intrinsic support • Memory locality optimization • Data layout optimization • Data prefetching • Load/store grouping for 128-bit memory access instructions • Integrated with Aggressive Auto Parallelization Optimization (AAPO) module • Dynamic privatization • Parallel model with dynamic alias optimization • Array reduction optimization

  48. Tools

  49. Testing and evaluation of HPC systems • A center led by Tsinghua University (Prof. Wenguang Chen) • Developing accurate and efficient testing and evaluation tools • Developing benchmarks for HPC evaluation • Provide services to HPC developers and users

  50. LSP3AS: large-scale parallel program performance analysis system • Designed for performance tuning on peta-scale HPC systems • Method: • Source code is instrumented • Instrumented code is executed, generating profiling and tracing data files • The profiling and tracing data is analyzed and a visualization report is generated • Instrumentation: based on TAU from the University of Oregon
