
CINECA: the Italian HPC infrastructure and its evolution in the European scenario


Presentation Transcript


  1. CINECA: the Italian HPC infrastructure and its evolution in the European scenario Giovanni Erbacci, Supercomputing, Applications and Innovation Department, CINECA, Italy g.erbacci@cineca.it

  2. Agenda • CINECA: the Italian HPC Infrastructure • CINECA and the European HPC Infrastructure • Evolution: Parallel Programming Trends in Extremely Scalable Architectures

  3. Agenda • CINECA: the Italian HPC Infrastructure • CINECA and the European HPC Infrastructure • Evolution: Parallel Programming Trends in Extremely Scalable Architectures

  4. CINECA CINECA is a non-profit consortium made up of 50 Italian universities, the National Institute of Oceanography and Experimental Geophysics (OGS), the CNR (National Research Council), and the Ministry of Education, University and Research (MIUR). CINECA is the largest Italian computing centre and one of the most important worldwide. The HPC department manages the HPC infrastructure, provides support and HPC resources to Italian and European researchers, and promotes technology transfer initiatives for industry.

  5. The Story
  1969: CDC 6600 - 1st system for scientific computing
  1975: CDC 7600 - 1st supercomputer
  1985: Cray X-MP / 4 8 - 1st vector supercomputer
  1989: Cray Y-MP / 4 64
  1993: Cray C-90 / 2 128
  1994: Cray T3D 64 - 1st parallel supercomputer
  1995: Cray T3D 128
  1998: Cray T3E 256 - 1st MPP supercomputer
  2002: IBM SP4 512 - 1 Teraflops
  2005: IBM SP5 512
  2006: IBM BCX - 10 Teraflops
  2009: IBM SP6 - 100 Teraflops
  2012: IBM BG/Q - > 1 Petaflops

  6. CINECA and Top 500

  7. Development trend (chart: growth of CINECA peak performance over time, from 1 GFlops up to more than 2 PFlops)

  8. HPC Infrastructure for Scientific Computing

  9. Visualisation system • Visualisation and computer graphics • Virtual Theater • 6 BARCO SIM5 video-projectors • Audio surround system • Cylindrical screen 9.4 x 2.7 m, 120° angle • Workstations with NVIDIA cards • RVN nodes on the PLX system

  10. Storage Infrastructure

  11. SP Power6 @ CINECA - 168 compute nodes IBM p575 Power6 (4.7 GHz) - 5376 compute cores (32 cores/node) - 128 GByte RAM/node (21 TByte RAM in total) - IB x4 DDR (double data rate) interconnect - Peak performance 101 TFlops - Rmax 76.41 TFlops - Efficiency (workload) 75.83% - No. 116 in Top500 (June 2011) - 2 login nodes IBM p560 - 21 I/O + service nodes IBM p520 - 1.2 PByte storage: 500 TByte high-performance working area + 700 TByte data repository

  12. BGP @ CINECA Model: IBM BlueGene/P Architecture: MPP Processor type: IBM PowerPC 0.85 GHz Compute nodes: 1024 (quad core, 4096 cores total) RAM: 4 GB/compute node (4096 GB total) Internal network: IBM 3D Torus OS: Linux (login nodes), CNK (compute nodes) Peak performance: 14.0 TFlop/s

  13. PLX @ CINECA IBM dx360M3 server - compute node: 2 x Intel Westmere 6-core X5645 processors, 2.40 GHz, 12 MB cache, DDR3 1333 MHz, 80 W; 48 GB RAM on 12 x 4 GB DDR3 1333 MHz DIMMs; 1 x 250 GB SATA HDD; 1 x QDR Infiniband card, 40 Gb/s; 2 x NVIDIA M2070 (M2070Q on 10 nodes). Peak performance 32 TFlops (3288 cores at 2.40 GHz). Peak performance 565 TFlops single precision or 283 TFlops double precision (548 NVIDIA M2070). No. 54 in Top500 (June 2011)

  14. Science @ CINECA
  Scientific areas: Chemistry, Physics, Life Science, Engineering, Astronomy, Geophysics, Climate, Cultural Heritage
  National institutions: INFM-CNR, SISSA, INAF, INSTM, OGS, INGV, ICTP; plus academic institutions
  Main activities: Molecular dynamics, Material science simulations, Cosmology simulations, Genomic analysis, Geophysics simulations, Fluid dynamics simulations, Engineering applications, Application code development/parallelization/optimization, Help desk and advanced user support, Consultancy for scientific software, Consultancy and research activity support, Scientific visualization support

  15. The HPC Model at CINECA From agreements with national institutions to a National HPC Agency in a European context: - Big Science, complex problems - Support for advanced computational science projects - HPC support for computational sciences at national and European level - CINECA calls for advanced national computational projects ISCRA - Italian SuperComputing Resource Allocation • http://iscra.cineca.it • Objective: support large-scale, computationally intensive projects that would not be possible or productive without terascale, and in future petascale, computing. • Class A: Large Projects (> 300,000 CPU hours per project): two calls per year • Class B: Standard Projects: two calls per year • Class C: Test and Development Projects (< 40,000 CPU hours per project): continuous submission scheme; proposals reviewed 4 times per year

  16. ISCRA: Italian SuperComputing Resource Allocation - National scientific committee - Blind national peer-review system - Allocation procedure Resources: SP6, 80 TFlops (5376 cores), No. 116 in Top500 (June 2011); BGP, 17 TFlops (4096 cores); PLX, 142 TFlops (3288 cores + 548 NVIDIA M2070), No. 54 in Top500 (June 2011) iscra.cineca.it

  17. CINECA and Industry CINECA provides HPC services to industry: • ENI (geophysics) • BMW Oracle (America's Cup, CFD and structural analysis) • ARPA (weather forecast, meteo-climatology) • Dompé (pharmaceuticals) CINECA hosts the ENI HPC system: HP ProLiant SL390s G7, Xeon 6-core X5650, Infiniband, HP Linux cluster, 15360 cores, No. 60 in Top500 (June 2011), 163.43 TFlops peak, 131.2 TFlops Linpack

  18. CINECA Summer schools

  19. Agenda • CINECA: the Italian HPC Infrastructure • CINECA and the European HPC Infrastructure • Evolution: Parallel Programming Trends in Extremely Scalable Architectures

  20. PRACE: the European HPC ecosystem • PRACE Research Infrastructure (www.prace-ri.eu): the top level of the European HPC ecosystem, organised as a pyramid: Tier-0 European systems, Tier-1 national systems, Tier-2 local systems • CINECA: - represents Italy in PRACE - is a hosting member in PRACE - Tier-1 system: > 5% of PLX + SP6 - Tier-0 system in 2012: BG/Q, 2 PFlop/s - involved in PRACE 1IP and 2IP - PRACE 2IP prototype EoI • Goal: creation of a European HPC ecosystem involving all stakeholders: HPC service providers on all tiers, scientific and industrial user communities, and the European HPC hardware and software industry

  21. HPC-Europa 2: Providing access to HPC resources HPC-Europa 2 is a consortium of seven European HPC infrastructures offering integrated provision of advanced computational services to the European research community: transnational access to some of the most powerful HPC facilities in Europe, and opportunities to collaborate with scientists working in related fields at a relevant local research institute. • http://www.hpc-europa.eu/ • HPC-Europa 2: 2009-2012 (FP7-INFRASTRUCTURES-2008-1)

  22. Verce MMM@HPC VMUST Vph-op Europlanet Plug-it HPC-Europa HPCworld EUDAT EESI DEISA PRACE Deep Montblanc EMI

  23. Agenda • CINECA: the Italian HPC Infrastructure • CINECA and the European HPC Infrastructure • Evolution: Parallel Programming Trends in Extremely Scalable Architectures

  24. BG/Q in CINECA • The Power A2 core has a 64-bit instruction set, like other commercial Power-based processors sold by IBM since 1995 but unlike the prior 32-bit PowerPC chips used in the BlueGene/L and BlueGene/P supercomputers. • The A2 core has four threads and uses in-order dispatch, execution, and completion, instead of the out-of-order execution common in many RISC processor designs. • The A2 core has 16 KB of L1 data cache and another 16 KB of L1 instruction cache. • Each core also includes a quad-pumped double-precision floating point unit with four pipelines, which can execute scalar floating point instructions, four-wide SIMD instructions, or two-wide complex arithmetic SIMD instructions. • 16-core chip @ 1.6 GHz; a crossbar switch links the cores and the L2 cache memory together. • 5D torus interconnect

  25. HPC Evolution Moore's law is holding in the number of transistors: - Transistors on an ASIC still double every 18 months at constant cost - 15 years of exponential clock rate growth have ended Moore's law reinterpreted: - Performance improvements now come from the increase in the number of cores on a processor (ASIC) - The number of cores per chip doubles every 18 months instead of the clock rate - 64-512 threads per node will become visible soon (From Herb Sutter <hsutter@microsoft.com>)

  26. Top500 Paradigm change in HPC... What about applications? The next HPC systems will have in the order of 500,000 cores.

  27. The real HPC crisis is with software A supercomputer application and its software are usually much longer-lived than the hardware: - Hardware life is typically four to five years at most - Fortran and C are still the main programming models Programming is stuck: - Arguably it hasn't changed much since the 70s Software is a major cost component of modern technologies: - The tradition in HPC system procurement is to assume that the software is free It's time for a change: - Complexity is rising dramatically - Challenges for the applications on Petaflop systems - Improvement of existing codes will become complex and partly impossible - The use of O(100K) cores implies a dramatic optimization effort - New paradigms such as the support of a hundred threads in one node imply new parallelization strategies - Implementation of new parallel programming methods in existing large applications does not always have a promising perspective There is a need for new community codes

  28. Roadmap to Exascale (architectural trends)

  29. What about parallel applications? • In a massively parallel context, an upper limit for the scalability of parallel applications is determined by the fraction of the overall execution time spent in non-scalable operations (Amdahl's law). The maximum speedup tends to 1 / (1 - P), where P is the parallel fraction. Example: 1,000,000 cores with P = 0.999999, i.e. a serial fraction of 0.000001.
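Spelling out the arithmetic behind these numbers, using the standard Amdahl formulation with N processors (the finite-N form is added here for clarity, it is not on the original slide):

$$S(N) = \frac{1}{(1-P) + \frac{P}{N}}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1-P}$$

With P = 0.999999 and N = 10^6:

$$S(10^6) = \frac{1}{10^{-6} + 0.999999 \times 10^{-6}} \approx 5 \times 10^5$$

so one million cores deliver only about half of the ideal speedup, and no core count can push the speedup beyond 1/(1-P) = 10^6.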

  30. Programming Models • Message Passing (MPI) • Shared Memory (OpenMP) • Partitioned Global Address Space Programming (PGAS) Languages • UPC, Coarray Fortran, Titanium • Next Generation Programming Languages and Models • Chapel, X10, Fortress • Languages and Paradigm for Hardware Accelerators • CUDA, OpenCL • Hybrid: MPI + OpenMP + CUDA/OpenCL

  31. Trends (diagram: evolution of applications and programming models - scalar and vector applications; distributed memory and shared memory; MPP systems with message passing (MPI); multi-core nodes (OpenMP); accelerators (GPGPU, FPGA) with CUDA, OpenCL; hybrid codes)

  32. Message passing: domain decomposition (diagram: several nodes, each with its own CPU and local memory, connected by an internal high-performance network; the domain is partitioned across the nodes)
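A minimal sketch of this model, assuming a 1D domain split into contiguous slices (the size NGLOBAL and the initialisation are invented for illustration, not taken from the slides):

/* Minimal sketch of message passing with domain decomposition (hypothetical example). */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define NGLOBAL 1200   /* size of the global 1D domain (made-up value, assumed divisible by #ranks) */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each process owns one contiguous slice of the domain in its own local memory. */
    int nlocal = NGLOBAL / size;
    int first  = rank * nlocal;          /* global index of the first local cell */

    double *u = malloc(nlocal * sizeof(double));
    for (int i = 0; i < nlocal; i++)
        u[i] = (double)(first + i);      /* initialise with the global index */

    printf("rank %d of %d owns cells [%d, %d)\n", rank, size, first, first + nlocal);

    free(u);
    MPI_Finalize();
    return 0;
}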

  33. Ghost cells - data exchange (diagram: two processors, each holding a sub-domain with cells indexed (i, j); ghost cells along the sub-domain boundaries are exchanged between processors at every update)
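A sketch of the ghost-cell exchange for a 1D decomposition (illustrative only; NLOCAL is a made-up size): each rank sends its outermost interior cells to its neighbours and receives theirs into ghost cells.

/* Ghost-cell (halo) exchange for a 1D domain, one layer per side (sketch). */
#include <mpi.h>

#define NLOCAL 100               /* interior cells per rank (made-up value) */

int main(int argc, char **argv)
{
    int rank, size;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* u[0] and u[NLOCAL+1] are ghost cells; u[1..NLOCAL] is the interior. */
    double u[NLOCAL + 2];
    for (int i = 1; i <= NLOCAL; i++) u[i] = rank;

    int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
    int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    /* Send rightmost interior cell to the right neighbour,
       receive the left neighbour's cell into the left ghost cell. */
    MPI_Sendrecv(&u[NLOCAL], 1, MPI_DOUBLE, right, 0,
                 &u[0],      1, MPI_DOUBLE, left,  0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* Send leftmost interior cell to the left neighbour,
       receive the right neighbour's cell into the right ghost cell. */
    MPI_Sendrecv(&u[1],          1, MPI_DOUBLE, left,  1,
                 &u[NLOCAL + 1], 1, MPI_DOUBLE, right, 1,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);

    /* The ghost cells can now be used to update the boundary points. */
    MPI_Finalize();
    return 0;
}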

  34. Message Passing: MPI Main characteristics: • Library • Coarse grain • Inter-node parallelization (few real alternatives) • Domain partition • Distributed memory • Almost all HPC parallel applications Open issues: • Latency • OS jitter • Scalability

  35. Shared memory (diagram: a single node in which threads 0-3, each running on its own CPU core, share the same memory)

  36. Shared Memory: OpenMP Main characteristics: • Compiler directives • Medium grain • Intra-node parallelization (pthreads) • Loop or iteration partition • Shared memory • Many HPC applications Open issues: • Thread creation overhead • Memory/core affinity • Interface with MPI
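A minimal C illustration of the directive-based, loop-partition model summarised above (not from the original deck; the array names and size are invented):

/* OpenMP loop partition sketch: iterations are split among the threads of one node. */
#include <omp.h>
#include <stdio.h>

#define N 1000000   /* made-up array size */

int main(void)
{
    static double a[N], b[N], c[N];

    #pragma omp parallel for              /* compiler directive: partition the loop */
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("ran with up to %d threads\n", omp_get_max_threads());
    return 0;
}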

  37. OpenMP (pseudocode from the original slide; "1DFFT along z" stands for a call to a 1D FFT routine along that direction)

  !$omp parallel do
  do i = 1, nsl
     call 1DFFT along z ( f [ offset( threadid ) ] )
  end do
  !$omp end parallel do

  call fw_scatter ( . . . )

  !$omp parallel
  do i = 1, nzl
     !$omp parallel do
     do j = 1, Nx
        call 1DFFT along y ( f [ offset( threadid ) ] )
     end do
     !$omp parallel do
     do j = 1, Ny
        call 1DFFT along x ( f [ offset( threadid ) ] )
     end do
  end do
  !$omp end parallel

  38. Accelerator/GPGPU (diagram: element-wise sum of 1D arrays offloaded to the GPU)

  39. CUDA sample

  // CPU version: a sequential loop over all elements
  void CPUCode(int* input1, int* input2, int* output, int length)
  {
      for (int i = 0; i < length; ++i) {
          output[i] = input1[i] + input2[i];
      }
  }

  // GPU version: each thread executes one loop iteration
  __global__ void GPUCode(int* input1, int* input2, int* output, int length)
  {
      int idx = blockDim.x * blockIdx.x + threadIdx.x;
      if (idx < length) {
          output[idx] = input1[idx] + input2[idx];
      }
  }

  Each thread executes one loop iteration.

  40. CUDA / OpenCL Main characteristics: • Ad-hoc compiler • Fine grain • Offload parallelization (GPU) • Single-iteration parallelization • Ad-hoc memory • Few HPC applications Open issues: • Memory copy • Standards • Tools • Integration with other languages
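To make the "memory copy" open issue concrete, here is a hedged host-side sketch of how the GPUCode kernel from the previous slide might be launched; the 256-thread block is an arbitrary choice and the kernel is assumed to be defined in the same source file.

// Host-side launch of GPUCode from slide 39 (illustrative sketch, not from the deck).
#include <cuda_runtime.h>

void runOnGPU(int* h_in1, int* h_in2, int* h_out, int length)
{
    size_t bytes = length * sizeof(int);
    int *d_in1, *d_in2, *d_out;

    // Data must be copied explicitly between host and device memory:
    cudaMalloc(&d_in1, bytes);
    cudaMalloc(&d_in2, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in1, h_in1, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_in2, h_in2, bytes, cudaMemcpyHostToDevice);

    // One thread per element, 256 threads per block (arbitrary choice).
    int threads = 256;
    int blocks  = (length + threads - 1) / threads;
    GPUCode<<<blocks, threads>>>(d_in1, d_in2, d_out, length);

    // Copy the result back to the host, then release device memory.
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in1); cudaFree(d_in2); cudaFree(d_out);
}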

  41. Hybrid (MPI + OpenMP + CUDA + ... + Python) • Takes the best of all models • Exploits the memory hierarchy • Many HPC applications are adopting this model • Mainly due to developer inertia: it is hard to rewrite millions of source lines

  42. Hybrid parallel programming Python: ensemble simulations MPI: domain partition OpenMP: external loop partition CUDA: inner-loop iterations assigned to GPU threads Example: Quantum ESPRESSO http://www.qe-forge.org/
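A minimal sketch of the MPI + OpenMP layering described above (a generic illustration, not Quantum ESPRESSO code): MPI partitions the work across ranks, while OpenMP partitions the local loop across the threads of each rank.

/* Hybrid MPI + OpenMP sketch: domain partition across ranks,
   loop partition across the threads inside each rank. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define NLOCAL 100000            /* cells per rank (made-up value) */

int main(int argc, char **argv)
{
    int provided, rank;
    /* Ask for an MPI library that tolerates OpenMP threads. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    static double u[NLOCAL];
    double local_sum = 0.0, global_sum = 0.0;

    /* OpenMP: the local loop is split among the threads of this rank. */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = 0; i < NLOCAL; i++) {
        u[i] = rank + 1.0;
        local_sum += u[i];
    }

    /* MPI: combine the per-rank results across the whole machine. */
    MPI_Reduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("global sum = %f (threads per rank: %d)\n",
               global_sum, omp_get_max_threads());

    MPI_Finalize();
    return 0;
}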

  43. Storage I/O • The I/O subsystem is not keeping pace with the CPU • Checkpointing will not be possible • Reduce I/O • On-the-fly analysis and statistics • Disk only for archiving • Scratch on non-volatile memory ("close to RAM")

  44. Conclusion Parallel programming trends in extremely scalable architectures: • Exploit millions of ALUs • Hybrid hardware • Hybrid codes • Memory hierarchy • Flops/Watt (more than Flops/s) • I/O subsystem • Non-volatile memory • Fault tolerance!
