
Efforts on Programming Environment and Tools in China's High-tech R&D Program

  1. Efforts on Programming Environment and Tools in China's High-tech R&D Program Depei Qian Sino-German Joint Software Institute (JSI), Beihang University Email: depeiq@buaa.edu.cn August 1, 2011, CScADS tools workshop

  2. China’s High-tech Program. The National High-tech R&D Program (863 Program) was proposed by four senior Chinese scientists and approved by former leader Mr. Deng Xiaoping in March 1986. It is one of the most important national science and technology R&D programs in China, and is now a regular national R&D program planned in 5-year terms; the term just finished is the 11th five-year plan.

  3. 863 key projects on HPC and Grid. “High performance computer and core software”: a 4-year project, May 2002 to Dec. 2005, with 100 million Yuan funding from the MOST and more than 2× that in associated funding from local governments, application organizations, and industry; outcome: China National Grid (CNGrid). “High productivity Computer and Grid Service Environment”: period 2006-2010, with 940 million Yuan from the MOST and more than 1 billion Yuan in matching funds from other sources.

  4. HPC development (2006-2010). First phase: developing two 100TFlops machines, Dawning 5000A for the Shanghai Supercomputer Center (SSC) and Lenovo DeepComp 7000 for the Supercomputing Center of CAS. Second phase: three 1000TFlops machines, Tianhe-1A (CPU+GPU, NUDT/Tianjin Supercomputing Center), Dawning 6000 (CPU+GPU, ICT/Dawning/South China Supercomputing Center, Shenzhen), and Sunway (CPU-only, Jiangnan/Shandong Supercomputing Center).

  5. CNGrid development: 11 sites. CNIC, CAS (Beijing, major site); Shanghai Supercomputer Center (Shanghai, major site); Tsinghua University (Beijing); Institute of Applied Physics and Computational Mathematics (Beijing); University of Science and Technology of China (Hefei, Anhui); Xi’an Jiaotong University (Xi’an, Shaanxi); Shenzhen Institute of Advanced Technology (Shenzhen, Guangdong); Hong Kong University (Hong Kong); Shandong University (Jinan, Shandong); Huazhong University of Science and Technology (Wuhan, Hubei); Gansu Provincial Computing Center. The CNGrid Operation Center is based at CNIC, CAS.

  6. CNGrid GOS Architecture (software stack, top to bottom): Grid Portal, Gsh+CLI, GSML Workshop and Grid Apps; Core, System and App Level Services; Axis Handlers for Message Level Security; Tomcat (5.0.28) + Axis (1.2 rc2); J2SE (1.4.2_07, 1.5.0_07); OS (Linux/Unix/Windows); PC Server (Grid Server).

  7. JASMIN: a parallel programming framework. Contact: Prof. Zeyao Mo, IAPCM, Beijing, Email: zeyao_mo@iapcm.ac.cn

  8. Basic ideas. [Diagram] From application codes, extract data dependencies, communications, and parallel computing models; separate the special parts (models, stencils, algorithms, libraries) from the common parts (data structures, load balancing, parallel middleware for scientific computing) to form an infrastructure that sits between the application codes and the computers.

  9. Basic ideas • Hides parallel programming using millions of cores and the hierarchy of parallel computers; • Integrates efficient implementations of parallel fast numerical algorithms; • Provides efficient data structures and solver libraries; • Supports software engineering for code extensibility.

  10. Basic Ideas. [Diagram] Scaling path for application codes: serial programming on a personal computer → TeraFlops cluster → PetaFlops MPP, scaling up by using the infrastructure.

  11. JASMIN (J parallel Adaptive Structured Mesh INfrastructure), http://www.iapcm.ac.cn/jasmin, software copyright 2010SR050446, developed from 2003 to now. Built around structured grids, particle simulation, and unstructured grids; application areas include inertial confinement fusion, global climate modeling, CFD, material simulations, and more.

  12. JASMIN V. 2.0. The user provides: physics, parameters, numerical methods, expert experience, special algorithms, etc. User interfaces: component-based parallel programming models (C++ classes). Numerical algorithms: geometry, fast solvers, mature numerical methods, time integrators, etc. HPC implementations (thousands of CPUs): data structures, parallelization, load balancing, adaptivity, visualization, restart, memory, etc. Architecture: multilayered, modularized, object-oriented. Codes: C++/C/F90/F77 + MPI/OpenMP, 500,000 lines. Installation: personal computers, clusters, MPPs. (An illustrative component sketch follows below.)
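
To make the component-based style concrete, here is a minimal sketch of a user-written numerical component operating on a framework-owned mesh patch. All class and method names are hypothetical illustrations of the style described on this slide; they are not JASMIN's actual C++ interface.

    // Illustrative only: a hypothetical component-based stencil update in the
    // style described above; these are NOT JASMIN's real class names.
    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // A patch of a structured mesh; ownership, ghost cells, load balancing
    // and parallelization would belong to the framework, not the user.
    struct Patch {
        std::size_t nx, ny;
        std::vector<double> u;                       // field values, row-major
        double& at(std::size_t i, std::size_t j) { return u[j * nx + i]; }
    };

    // User-written numerical component: only the physics and the stencil.
    class DiffusionComponent {
    public:
        explicit DiffusionComponent(double coeff) : c_(coeff) {}
        void computeOnPatch(Patch& p, double dt) const {
            std::vector<double> next(p.u);
            for (std::size_t j = 1; j + 1 < p.ny; ++j)
                for (std::size_t i = 1; i + 1 < p.nx; ++i)
                    next[j * p.nx + i] = p.at(i, j) + c_ * dt *
                        (p.at(i - 1, j) + p.at(i + 1, j) +
                         p.at(i, j - 1) + p.at(i, j + 1) - 4.0 * p.at(i, j));
            p.u.swap(next);
        }
    private:
        double c_;
    };

    int main() {
        Patch p{32, 32, std::vector<double>(32 * 32, 1.0)};
        p.at(16, 16) = 100.0;                        // a hot spot to diffuse
        DiffusionComponent diffusion(0.1);
        for (int step = 0; step < 10; ++step) diffusion.computeOnPatch(p, 0.1);
        std::printf("u(16,16) after 10 steps: %f\n", p.at(16, 16));
        return 0;
    }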

  13. Mesh types supported by JASMIN. [Figure]

  14. Simulation cycle. Inertial Confinement Fusion, 2004-now: 13 ICF application codes and 46 researchers, developing concurrently from different combinations of numerical methods, physical parameters, and expert experience. JASMIN • hides parallel computing and adaptive implementations using tens of thousands of CPU cores; • provides efficient data structures, algorithms and solvers; • supports software engineering for code extensibility.

  15. Numerical simulations on TianHe-1A. Simulation duration: several hours to tens of hours.

  16. Scale up by a factor of 1000

  17. GPU programming support and performance optimization Contact: Prof. Xiaoshe Dong, Xi’an Jiaotong University Email: xsdong@xjtu.edu.cn

  18. GPU program optimization • Three approaches for GPU program optimization • memory-access level • kernel-speedup level • data-partition level

  19. Approaches for GPU program optimization. [Diagram] Memory-access level; kernel-speedup level; data-partition level.

  20. Source-to-source translation for GPU. Developed a source-to-source translator, GPU-S2S, for GPUs, which facilitates the development of parallel programs on GPUs by combining automatic mapping and static compilation.

  21. Source-to-source translation for GPU • Insert directives into the source program to guide implicit calling of CUDA runtime libraries and to enable the user to control the mapping of compute-intensive applications from the homogeneous CPU platform to the GPU’s streaming platform (an illustrative directive sketch follows below) • Optimization based on runtime profiling: take full advantage of the GPU according to the characteristics of the application by collecting runtime dynamic information.
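
For a concrete picture, the fragment below shows the kind of directive-annotated host loop such a translator could accept. The pragma spelling is a made-up placeholder; the slides do not give GPU-S2S's actual directive syntax.

    /* Illustrative only: a directive-annotated loop of the kind a
     * source-to-source GPU translator might accept.  "#pragma gpus2s"
     * is a hypothetical placeholder, not GPU-S2S's documented syntax;
     * an ordinary C compiler simply ignores the unknown pragma. */
    #include <stdio.h>

    #define N 1024

    int main(void)
    {
        static float a[N], b[N], c[N];
        for (int i = 0; i < N; ++i) { a[i] = i; b[i] = 2.0f * i; }

        /* Hypothetical directive: map this compute-intensive loop onto the
         * GPU; the translator would generate the CUDA allocation, copy and
         * kernel-launch calls implicitly. */
        #pragma gpus2s kernel copyin(a, b) copyout(c)
        for (int i = 0; i < N; ++i)
            c[i] = a[i] + b[i];

        printf("c[N-1] = %f\n", c[N - 1]);
        return 0;
    }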

  22. The GPU-S2S architecture

  23. Program translation by GPU-S2S

  24. Runtime optimization based on profiling: first-level profiling (function level); second-level profiling (memory access and kernel improvement); third-level profiling (data partition).

  25. First level profiling. Before translation, the source code is scanned; instrumentation is inserted before and after each function to measure its execution time, and the computing kernel is finally identified. (A minimal timing sketch follows below.)
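
A minimal sketch of this function-level timing, assuming plain wall-clock instrumentation; it illustrates the idea only and is not the instrumentation code GPU-S2S actually generates.

    /* Minimal sketch of first-level (function-level) instrumentation,
     * assuming wall-clock timing around each candidate function; the
     * function name is a placeholder, not generated GPU-S2S code. */
    #include <stdio.h>
    #include <time.h>

    static double now_sec(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    void candidate_kernel(float *x, int n)           /* function under test */
    {
        for (int i = 0; i < n; ++i) x[i] = x[i] * 2.0f + 1.0f;
    }

    int main(void)
    {
        enum { N = 1 << 20 };
        static float x[N];
        double t0 = now_sec();                       /* probe inserted before the call */
        candidate_kernel(x, N);
        double t1 = now_sec();                       /* probe inserted after the call  */
        printf("candidate_kernel: %.6f s\n", t1 - t0);
        return 0;
    }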

  26. Second level profiling. GPU-S2S scans the code and inserts instrumentation at the corresponding places in the computing kernel, extracts the profile information, analyzes the code, and performs optimizations; according to the features of the application it expands templates and finally generates CUDA code with an optimized kernel. Using shared memory is a general approach; the shared-memory template contains 13 parameters, and different parameter values give different performance. (A generic shared-memory tiling kernel is sketched below.)
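
As a concrete example of the shared-memory optimization targeted at this level, here is a generic tiled matrix-multiplication kernel. The 16×16 tile size is just one possible setting of the kind of parameter such a template exposes; the code is an illustration, not GPU-S2S output.

    // Generic shared-memory tiled matrix multiply (C = A * B, square N x N
    // matrices, N assumed to be a multiple of TILE).  Illustrates the kind
    // of kernel a shared-memory template produces; not actual GPU-S2S output.
    #define TILE 16

    __global__ void matmul_tiled(const float *A, const float *B, float *C, int N)
    {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];

        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        for (int t = 0; t < N / TILE; ++t) {
            // Stage one tile of A and one tile of B into fast shared memory.
            As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
            __syncthreads();

            for (int k = 0; k < TILE; ++k)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        C[row * N + col] = acc;
    }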

  27. Third level profiling. GPU-S2S scans the code, finds the computing kernel and its copy functions, and inserts instrumentation at the corresponding places to measure copy time and computing time; from these times it computes the number of streams and the data size of each stream, and finally generates optimized CUDA code with streams. (A minimal stream-overlap sketch follows below.)
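
A minimal sketch of the copy/compute overlap that stream partitioning enables, assuming a simple element-wise kernel, pinned host memory, and an evenly divisible workload; the chunk count stands in for the profiler-derived value, and the code is not GPU-S2S output.

    // Illustrative stream overlap: split the input into CHUNKS pieces and
    // overlap host<->device copies with kernel execution.  Assumes h was
    // allocated with cudaMallocHost (pinned) and n % CHUNKS == 0.
    #include <cuda_runtime.h>

    __global__ void scale(float *d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;
    }

    void run_with_streams(float *h, float *d, int n)
    {
        const int CHUNKS = 4;                 // placeholder for the profiled value
        int chunk = n / CHUNKS;
        cudaStream_t s[CHUNKS];
        for (int c = 0; c < CHUNKS; ++c) cudaStreamCreate(&s[c]);

        for (int c = 0; c < CHUNKS; ++c) {
            int off = c * chunk;
            cudaMemcpyAsync(d + off, h + off, chunk * sizeof(float),
                            cudaMemcpyHostToDevice, s[c]);
            scale<<<(chunk + 255) / 256, 256, 0, s[c]>>>(d + off, chunk);
            cudaMemcpyAsync(h + off, d + off, chunk * sizeof(float),
                            cudaMemcpyDeviceToHost, s[c]);
        }
        for (int c = 0; c < CHUNKS; ++c) {
            cudaStreamSynchronize(s[c]);
            cudaStreamDestroy(s[c]);
        }
    }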

  28. Verification and experiment. Experiment platform: a server with a 4-core Xeon CPU, 12GB memory, and an NVIDIA Tesla C1060; Red Hat Enterprise Server 5.3 operating system; CUDA version 2.3. Test examples: matrix multiplication and the fast Fourier transform (FFT).

  29. Matrix multiplication: performance comparison before and after profiling. The CUDA code with three-level profiling optimization achieves a 31% improvement over the CUDA code with only memory-access optimization, and a 91% improvement over the CUDA code using only global memory for computing. [Chart] Execution performance comparison on different platforms.

  30. FFT (1048576 points): performance comparison before and after profiling. The CUDA code after three-level profiling optimization achieves a 38% improvement over the CUDA code with memory-access optimization, and a 77% improvement over the CUDA code using only global memory for computing. [Chart] FFT (1048576 points) execution performance comparison on different platforms.

  31. Programming multi-GPU systems. The traditional programming models, MPI and PGAS, are not directly suitable for the new CPU+GPU platform, and legacy applications cannot exploit the power of GPUs. Programming model for the CPU-GPU architecture: combine the traditional programming model with a GPU-specific programming model to form a mixed programming model, which gives better performance on the CPU-GPU architecture and makes more efficient use of the computing power.

  32. Programming multi-GPU systems. The memory of a CPU+GPU system is both distributed and shared, so it is feasible to use the MPI and PGAS programming models for this new kind of system, using message passing or shared data for communication between parallel tasks or GPUs.

  33. Mixed Programming Model: NVIDIA GPU — CUDA; traditional programming model — MPI/UPC; combined as MPI+CUDA or UPC+CUDA. [Figure] CUDA program execution.

  34. Mixed Programming Model. The primary control of an application is implemented with the MPI or UPC programming model; the computing kernels are implemented in CUDA, using the GPU to accelerate the computation. Optimizing the computing kernel makes better use of the GPUs, and using GPU-S2S to generate the computing-kernel program hides the CPU+GPU heterogeneity from the user and improves the portability of the application. [Figure] Compiling process. (A minimal MPI+CUDA skeleton follows below.)
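
A minimal sketch of this division of labor, with MPI as the primary control and a trivial CUDA kernel as the computing part; the kernel and the final reduction are placeholders, not the experiment code described on the following slides.

    // Minimal MPI+CUDA skeleton: MPI provides the primary control and data
    // distribution, CUDA accelerates the per-rank computing kernel.
    // Illustrative only; not the experiment code from these slides.
    #include <mpi.h>
    #include <cuda_runtime.h>
    #include <stdio.h>

    __global__ void scale_kernel(float *d, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) d[i] *= 2.0f;
    }

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int n_local = 1 << 20;            // elements owned by this rank
        float *h = new float[n_local];
        for (int i = 0; i < n_local; ++i) h[i] = (float)rank;

        // Computing kernel on the GPU: copy in, launch, copy out.
        float *d;
        cudaMalloc(&d, n_local * sizeof(float));
        cudaMemcpy(d, h, n_local * sizeof(float), cudaMemcpyHostToDevice);
        scale_kernel<<<(n_local + 255) / 256, 256>>>(d, n_local);
        cudaMemcpy(h, d, n_local * sizeof(float), cudaMemcpyDeviceToHost);
        cudaFree(d);

        // Primary control: MPI combines the per-rank partial results.
        float local_sum = 0.0f, global_sum = 0.0f;
        for (int i = 0; i < n_local; ++i) local_sum += h[i];
        MPI_Reduce(&local_sum, &global_sum, 1, MPI_FLOAT, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("sum = %f\n", global_sum);

        delete[] h;
        MPI_Finalize();
        return 0;
    }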

  35. MPI+CUDA experiment. Platform: 2 NF5588 servers, each equipped with 1 Xeon CPU (2.27GHz), 12GB main memory, and 2 NVIDIA Tesla C1060 GPUs (GT200 architecture, 4GB device memory); 1Gbit Ethernet; RedHat Linux 5.3; CUDA Toolkit 2.3 and CUDA SDK; OpenMPI 1.3; Berkeley UPC 2.1.

  36. MPI+CUDA experiment (cont’) • Matrix multiplication program • Block matrix multiplication is used for the UPC programming, with the data spread across the UPC threads. • The computing kernel multiplies two blocks at a time and is implemented in CUDA. • The total execution time: Tsum = Tcom + Tcuda = Tcom + Tcopy + Tkernel, where Tcom is the UPC thread communication time, Tcuda the CUDA program execution time, Tcopy the data transfer time between host and device, and Tkernel the GPU computing time.

  37. MPI+CUDA experiment (cont’). For 4094*4096 matrices, the speedup of 1 MPI+CUDA task (using 1 GPU for computing) is 184x over the case with 8 MPI tasks. For small-scale data, such as 256 or 512, the execution time with 2 GPUs is even longer than with 1 GPU: the computing scale is too small, and the communication between the two tasks overwhelms the reduction in computing time. [Charts] 2 servers with at most 8 MPI tasks; 1 server with 2 GPUs.

  38. MPI+CUDA experiment (cont’). Matrix size 8192*8192: Tcuda is reduced as the number of tasks increases, but the Tsum of 4 tasks is larger than that of 2. Matrix size 16384*16384: the reason is that the latency of the Ethernet between the 2 servers is much higher than the latency of the bus inside one server. If the computing scale is larger, or a faster network (e.g. InfiniBand) is used, multi-node multi-GPU will still improve the performance of the application.

  39. Programming Support and Compilers Contact: Prof. Xiaobing Feng, ICT, CAS, Beijing fxb@ict.ac.cn

  40. Advanced Compiler Technology (ACT) Group at the ICT, CAS • The Institute of Computing Technology (ICT), founded in 1956, is the first and leading institute on computing technology in China • ACT was founded in the early 1960s and has over 40 years of experience with compilers • Compilers for most of the mainframes developed in China • Compiler and binary translation tools for Loongson processors • Parallel compilers and tools for the Dawning series (SMP/MPP/cluster)

  41. Advanced Compiler Technology (ACT) Group at the ICT, CAS. ACT’s current research: parallel programming languages and models; optimizing compilers and tools for HPC (Dawning) and multi-core processors (Loongson).

  42. Advanced Compiler Technology (ACT) Group at the ICT, CAS • PTA model (Process-based TAsk parallel programming model) • A new process-based task construct with the properties of isolation, atomicity and deterministic submission • Annotates a loop into two parts, a prologue and a task segment • #pragma pta parallel [clauses] • #pragma pta task • #pragma pta propagate (varlist) • Suitable for expressing coarse-grained, irregular parallelism on loops (a hypothetical usage sketch follows below) • Implementation and performance • PTA compiler, runtime system and assistant tool (helps write correct programs) • Speedup: 4.62 to 43.98 (average 27.58 on 48 cores); 3.08 to 7.83 (average 6.72 on 8 cores) • Code changes are within 10 lines, much smaller than with OpenMP
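
A sketch of how such an annotated loop might look, using only the three directives named above; the clause usage and placement are assumptions made for illustration, since the slide does not spell out the full PTA syntax or semantics.

    /* Hypothetical PTA-annotated loop, based only on the directives listed
     * above (#pragma pta parallel / task / propagate); clause details and
     * placement are assumed.  A plain C compiler ignores the unknown
     * pragmas and runs the loop serially. */
    #include <stdio.h>

    #define N 1024

    int main(void)
    {
        static double a[N], sum = 0.0;

        #pragma pta parallel                  /* split the loop into tasks */
        for (int i = 0; i < N; ++i) {
            /* prologue part: cheap setup, assumed to run in the parent */
            double x = i * 0.5;

            #pragma pta task                  /* coarse-grained, isolated task */
            {
                a[i] = x * x;                 /* heavy, irregular work goes here */
            }

            #pragma pta propagate(sum)        /* deterministically submit updates */
            sum += a[i];
        }

        printf("sum = %f\n", sum);
        return 0;
    }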

  43. UPC-H: a parallel programming model for deep parallel hierarchies (hierarchical UPC). Provides multi-level data distribution; implicit and explicit hierarchical loop parallelism; a hybrid execution model (SPMD with fork-join); multi-dimensional data distribution and super-pipelining. Implemented on CUDA clusters and the Dawning 6000 cluster, based on Berkeley UPC, with enhanced optimizations such as localization and communication optimization, and support for SIMD intrinsics. On a CUDA cluster it reaches 72% of the hand-tuned version’s performance with code reduced to 68%; on a multi-core cluster it achieves better process mapping and cache reuse than UPC.

  44. OpenMP and Runtime Support for Heterogeneous Platforms. Heterogeneous platforms consist of CPUs and GPUs; multiple GPUs, or CPU-GPU cooperation, bring extra data transfers that hurt the performance gain, so programmers need a unified data management system. OpenMP extension: specify a partitioning ratio to optimize data transfer globally; specify heterogeneous blocking sizes to reduce false sharing among computing devices. Runtime support: a DSM system based on the specified blocking size; intelligent runtime prefetching with the help of compiler analysis. Implementation and results: built on the OpenUH compiler; gains a 1.6X speedup through prefetching on NPB/SP (class C). (A hypothetical directive sketch follows below.)
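
To make the extension idea concrete, the fragment below invents placeholder clauses (device_ratio, blocksize) for the partitioning ratio and heterogeneous blocking sizes described above. These are not the actual directives of this OpenMP extension, so the hypothetical clause is kept in a comment above a standard OpenMP loop.

    /* Hypothetical illustration of the OpenMP extension idea: "device_ratio"
     * and "blocksize" are invented placeholder clauses, not the real syntax. */
    #include <stdio.h>

    #define N 4096

    int main(void)
    {
        static float a[N], b[N];
        for (int i = 0; i < N; ++i) b[i] = (float)i;

        /* Hypothetical extended directive: split iterations 30%/70% between
         * CPU and GPU and use different blocking sizes per device to reduce
         * false sharing (illustrative only):
         *   #pragma omp parallel for device_ratio(cpu:0.3, gpu:0.7) \
         *                            blocksize(cpu:64, gpu:256)              */
        #pragma omp parallel for
        for (int i = 0; i < N; ++i)
            a[i] = 2.0f * b[i];

        printf("a[N-1] = %f\n", a[N - 1]);
        return 0;
    }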

  45. Analyzers based on Compiling Techniques for MPI programs. Communication slicing and process mapping tool: the compiler part performs PDG graph building and slice generation, with iteration-set transformation for approximation; the optimized mapping tool uses a weighted graph and hardware characteristics, with graph partitioning and feedback-based evaluation. Memory bandwidth measuring tool for MPI programs: detects bursts of bandwidth requirements. Enhancing the performance of MPI error checking: redundant error checks are removed by dynamically turning the global error checking on and off, with the help of compiler analysis on communicators; integrated with a model checking tool (ISP) and a runtime checking tool (MARMOT).

  46. LoongCC: An Optimizing Compiler for Loongson Multicore Processors • Based on Open64-4.2 and supporting C/C++/Fortran • Open source at http://svn.open64.net/svnroot/open64/trunk/ • Powerful optimizer and analyzer with better performance • SIMD intrinsic support • Memory locality optimization • Data layout optimization • Data prefetching • Load/store grouping for 128-bit memory access instructions • Integrated with an Aggressive Auto Parallelization Optimization (AAPO) module • Dynamic privatization • Parallel model with dynamic alias optimization • Array reduction optimization

  47. DigitalBridge: A Binary Translation System for Loongson Multicore Processors • Fully utilizes hardware features of Loongson CPUs • Handles return instructions with a shadow stack • Handles Eflag operations with a flag pattern • Emulates the X86 FPU with local FP registers • Combination of static and dynamic translation • Handles indirect-jump tables • Handles misaligned data accesses with dynamic profiling and an exception handler • Improves data locality by pool allocation • Stack variable promotion

  48. Software Tools for High Performance Computing Contact: Prof. Yi Liu, JSI, Beihang University, Email: yi.liu@jsi.buaa.edu.cn

  49. LSP3AS: large-scale parallel program performance analysis system • Designed for performance tuning on peta-scale HPC systems • The method is conventional: • Source code is instrumented by inserting specified function calls (a minimal sketch follows below) • The instrumented code is executed while performance data are collected, generating profiling and tracing data files • The profiling and tracing data are analyzed and a visualization report is generated • Instrumentation: based on TAU from the University of Oregon
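
A minimal sketch of this kind of call-site instrumentation: enter/exit probes around a region of interest write timestamped events to a per-process trace file. The probe and file names are illustrative assumptions, not TAU's or LSP3AS's actual APIs.

    /* Minimal sketch of call-site instrumentation of the kind described
     * above: enter/exit probes write timestamped events to a per-process
     * trace file.  Illustrative only; not TAU's or LSP3AS's real API. */
    #include <stdio.h>
    #include <time.h>

    static FILE *trace;

    static void probe(const char *name, const char *what)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        fprintf(trace, "%ld.%09ld %s %s\n", (long)ts.tv_sec, ts.tv_nsec, what, name);
    }

    static void solver_step(double *u, int n)        /* region of interest */
    {
        for (int i = 0; i < n; ++i) u[i] *= 0.5;
    }

    int main(void)
    {
        static double u[1 << 16];
        trace = fopen("trace.rank0.txt", "w");
        if (!trace) return 1;

        probe("solver_step", "ENTER");               /* inserted by the instrumenter */
        solver_step(u, 1 << 16);
        probe("solver_step", "EXIT");                /* inserted by the instrumenter */

        fclose(trace);
        return 0;
    }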

  50. LSP3AS: large-scale parallel program performance analysis system • With roughly ten thousand nodes in a peta-scale system, massive performance data will be generated, transmitted and stored • Scalable structure for performance data collection • Distributed data collection and transmission: eliminates bottlenecks in the network and in data processing • Dynamic compensation algorithm: reduces the influence of the performance data volume • Efficient data transmission: uses Remote Direct Memory Access (RDMA) to achieve high bandwidth and low latency
