High Performance Computing and Networking: Status & Trends. H. Leroy - 2005
Thanks:
• Franck Capello (Grid 5000)
• Dominique Lavenier (R-Disk, Remix)
• François Bodin
And some SuperComputing Conference tutorials (http://supercomputing.org/)
Table of contents
• Computer architecture (memory refresher)
• HPCN trends
• Grid computing
• Grid 5000
• Parallel programming models
Refresher. What is at stake with parallelism:
• Provide significant computing power
• Implement increasingly complex numerical simulation models, for example with coupled models
• Have extensible, low-cost machines
Performance needs, for:
• Nuclear engineering
• Aerodynamics
• Meteorology
• Imaging (image processing and synthesis, virtual reality)
• Oil exploration
• Simulation before manufacturing
• New problems in physics, biology, ...
The Teraflop objective (figure). Source: J. Normand, CEA, 1995
Michael Flynn's taxonomy (1972). MIMD memory organisations: UMA, NUMA, COMA, CC-NUMA, NORMA
MIMD: SMP (Shared Memory multiprocessors)
MIMD: Distributed Memory multiprocessors
Memory Hierarchy and the process working set
• DO I=1,N
    S=S+A(I)
  ENDDO
  Temporal locality on S (reused at every iteration), spatial locality on A (consecutive elements).
• DO I=1,N
    A(I)=B(I)
  ENDDO
  Spatial locality on A and B (consecutive elements).
Caches (data & instruction caches), TLB
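As an aside (not from the slides), a minimal C sketch of why spatial locality matters: traversing a 2D array in row-major order touches consecutive addresses and reuses each cache line, while column-major traversal jumps a whole row between accesses. The array size and timing method are arbitrary choices for illustration.

#include <stdio.h>
#include <time.h>

#define N 2048
static double a[N][N];   /* ~32 MB, zero-initialised */

int main(void) {
    double sum = 0.0;
    clock_t t0, t1;

    /* Row-major traversal: consecutive memory addresses,
       good spatial locality (each cache line is fully used). */
    t0 = clock();
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += a[i][j];
    t1 = clock();
    printf("row-major:    %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    /* Column-major traversal: stride of N doubles between accesses,
       poor spatial locality (roughly one cache line fetched per element). */
    t0 = clock();
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += a[i][j];
    t1 = clock();
    printf("column-major: %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);

    printf("checksum %f\n", sum);  /* keep the compiler from removing the loops */
    return 0;
}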
Caches: L1 cache, L2 cache, and now also an L3 cache!
Top 500, June 2005 (Rmax and Rpeak in GFlops):
1. LLNL, IBM eServer Blue Gene: 65536 procs, Rmax 136800, Rpeak 183500
2. IBM research center, IBM eServer Blue Gene: 40960 procs, Rmax 91290, Rpeak 114688
3. NASA, SGI Altix (Voltaire Infiniband): 10160 procs, Rmax 51870, Rpeak 60960
4. NEC Earth Simulator: 5120 procs, Rmax 35860, Rpeak 40960
5. Barcelona, MareNostrum IBM JS20 cluster (PPC 970, Myrinet): 4800 procs, Rmax 27910, Rpeak 42144
Top 500: architecture trends. Share of clusters:
06/2001: 33/500 (6.6%)
11/2001: 43/500 (8.6%)
06/2002: 80/500 (16%)
11/2002: 93/500 (18.6%)
06/2003: 149/500 (29.8%)
11/2003: 208/500 (41.6%)
06/2004: 291/500 (58.2%)
11/2004: 294/500 (58.8%)
06/2005: 304/500 (60.8%)
Top 500: performance trend (Rmax of the #1 system): 11/2004: 70.72 TF; 06/2005: 136.80 TF
Applications: Astrophysics N-body simulation (extreme.indiana.edu/gc3/); Tokamak fluid dynamics (http://www.acl.lanl.gov/Grand-Chal/Tok/gallery.html)
Applications: www.llnl.gov/CASC/iscr/biocomp/; Winter temperatures and CO2 dispersion (www.llnl.gov/CASC/climate)
Applications: Virtual prototyping (www.irisa.fr/ProHPC)
Remember Beowulf Clusters?
• "Do-It-Yourself Supercomputers" (Science 1996)
• Built around:
  - Pile Of PCs (POP)
  - Dedicated high-speed LAN
  - Free Unix: Linux
  - Free and COTS parallel programming and performance tools
• COTS hardware permits rapid development and technology tracking
Example Beowulf clusters:
• theHIVE (NASA): http://newton.gsfc.nasa.gov/thehive/ [NASA01]
• Avalon (LANL): http://cnls.lanl.gov/avalon/ [LANL01]
• MAPS (GMU): http://maps.scs.gmu.edu/ [GMU01]
• Crystal (Inria Rocquencourt)
Reconfigurable Computers: "The microchip that rewires itself", Scientific American, June 1997 [Sci97]:
• Computers that modify their hardware circuits as they operate are opening a new era in computer design. Because they can filter data rapidly, they excel at pattern recognition, image processing and encryption.
• Reconfigurable computer architectures are based on FPGAs (Field Programmable Gate Arrays).
Microprocessor and FPGA performance increases (figure), with conservative estimates for FPGAs; performance = number of gates x clock rate. Source: [SRC02]
Reconfigurable Computing Clusters
• Beowulf-style clusters
• COTS reconfigurable boards as accelerators at each node
• Some parallel programming and execution model tools
Delivered and Benchmarked • 48 nodes • 2U, back-to-back (net 1U/node) • 96 FPGAs • Annapolis Micro • Xilinx Virtex II • 34 Tera-Ops • In use today • All commodity parts
Annapolis Micro WILDSTAR / WILDFIRE boards. Example 2: Extended JMS (http://www.gwu.edu/~hpc/lsf/, http://ece.gmu.edu/lucite/)
Massively Parallel Reconfigurable Systems
• Massively parallel systems with large numbers of reconfigurable processors and microprocessors
• Everything can be configured; things to configure include:
  - Processing
  - Network
• Everything can be reconfigured over and over at run time (run-time reconfiguration) to suit the underlying applications
• Can be easily programmed by application scientists, at least in the same way as conventional parallel computers
Vision for reconfigurable supercomputers (figure): microprocessors (P) and FPGAs, each with their own memory, all connected through shared memory and/or a NIC.
Current reconfigurable architecture (figure): a microprocessor system (P + memory) and a separate reconfigurable system (FPGA + memory), connected through I/O interfaces.
Cray XD1 System (figure): multiple chassis connected to the RapidArray fabric, with direct-connect or fat-tree topologies. Source: [Cray, MAPLD04]
XD1 Chassis (OctigaBay 12K), front and rear views. Source: [Cray, MAPLD04]
• Six two-way Opteron blades
• Six FPGA modules
• Six SATA hard drives
• Four 133 MHz PCI-X slots
• 0.5 Tb/s switch
• 12 x 2 GB/s ports to fabric
Application Acceleration FPGA (figure: application accelerator attached to the RapidArray switch through the RAP; six of these configurations per chassis). Source: [Cray, MAPLD04]
• High-bandwidth connection to fabric and Opteron
• Fine-grained integration of FPGA logic and software
• Well suited for: searching, sorting, signal processing, audio/video/image manipulation, encryption, error correction, coding/decoding, packet processing, random number generation
Application Acceleration Co-Processor (figure): the AMD Opteron is connected over 3.2 GB/s HyperTransport links; the application acceleration FPGA (Xilinx Virtex II Pro) has local QDR SRAM and reaches the Cray RapidArray interconnect through the RAP at 2 GB/s. Source: [Cray, MAPLD04]
Application Acceleration Interface (figure: user logic surrounded by a RapidArray Transport core and QDR RAM interface cores). Source: [Cray, MAPLD04]
• XC2VP30 running at 200 MHz
• 4 QDR II RAMs with over 400 HSTL-I I/Os at 200 MHz DDR (400 MTransfers/s)
• 16-bit RapidArray I/F at 400 MHz DDR (800 MTransfers/s)
• The QDR and RapidArray interfaces take up <20% of the XC2VP30; the rest is available for user applications
FPGA Linux API. Source: [Cray, MAPLD04]
Administration commands:
• fpga_open – allocate and open FPGA
• fpga_close – close allocated FPGA
• fpga_load – load binary into FPGA
Control commands:
• fpga_start – start FPGA (release from reset)
• fpga_stop – stop FPGA
Status commands:
• fpga_status – get status of FPGA
Data commands:
• fpga_put – put data to FPGA
• fpga_get – get data from FPGA
Interrupt/blocking commands:
• fpga_intwait – block the process, wait for an FPGA interrupt
The programmer sees a get/put and message-passing programming model.
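A sketch (not taken from the Cray material) of how this get/put model is typically used from C. The slide only names the commands, so the prototypes, header, device name and bitstream path below are assumptions for illustration; the real signatures come from the vendor library, which this would be linked against.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical prototypes, modelled on the command list above. */
int fpga_open(const char *device);                 /* allocate and open fpga */
int fpga_load(int fd, const char *bitstream_path); /* load binary into fpga  */
int fpga_put(int fd, const void *buf, size_t len); /* put data to fpga       */
int fpga_start(int fd);                            /* release from reset     */
int fpga_intwait(int fd);                          /* block until interrupt  */
int fpga_get(int fd, void *buf, size_t len);       /* get data from fpga     */
int fpga_close(int fd);

int main(void) {
    double in[1024] = {0}, out[1024];

    int fd = fpga_open("/dev/ufp0");      /* device name is an assumption */
    if (fd < 0) { perror("fpga_open"); return EXIT_FAILURE; }

    fpga_load(fd, "accelerator.bin");     /* configure the user logic     */
    fpga_put(fd, in, sizeof in);          /* push the input data          */
    fpga_start(fd);                       /* run the design               */
    fpga_intwait(fd);                     /* block until it signals done  */
    fpga_get(fd, out, sizeof out);        /* pull the results back        */
    fpga_close(fd);

    printf("first result: %f\n", out[0]);
    return EXIT_SUCCESS;
}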
Cray XD1 interconnect performance (figures): MPI bandwidth comparison and MPI latency comparison. Source: [Cray, MAPLD04]
SGI Altix 3000: distributed shared memory (figure: message passing over a commodity bus vs. distributed shared memory). Source: [SGI, MAPLD04]
FPGA products in development for the SGI Altix 3000 family and others:
• Up to 256 Itanium 2 processors with 64-bit Linux, DSM
• SGI NUMAlink™ GSM interconnect fabric (up to 256 devices)
• Programming model to be determined
The 3 single-paradigm architectures. Source: [SGI, MAPLD04]
• Scalar: Intel Itanium, SGI MIPS, IBM Power, Sun SPARC, HP PA, AMD Opteron
• Vector: Cray X1, NEC SX
• App-specific: graphics (GPU), signals (DSP), programmable (FPGA), other ASICs
Multi-paradigm computing (SGI UltraViolet, figure): scalar, vector, FPGA, DSP, graphics and reconfigurable nodes plus I/O, attached to a scalable shared memory that is globally addressable, has thousands of ports, and is flat, high-bandwidth, flexible and configurable. Terascale-to-petascale data sets: bring the function to the data. Source: [SGI, MAPLD04]
Source: [SGI, MAPLD04]
• Performance: direct connection to NUMAlink4, 6.4 GB/s per connection
• Fast system-level reprogramming: FPGA load at memory speeds
• Atomic memory operations: same set as the system CPUs
• Hardware barriers: dynamic load balancing
• Scalability: configurations up to 8191 NUMA/FPGA nodes
GenBank growth (figures): number of base pairs of sequence in GenBank release 142 for selected organisms; growth of GenBank in billions of base pairs from release 3 (April 1994) to the current release, 142. GenBank doubles every 12 months, versus Moore's law: doubling every 18 months.
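A short worked comparison (not on the slide) of the two growth rates, assuming clean exponentials with the doubling times quoted above:

\[
\text{data}(t) = D_0 \, 2^{t/12}, \qquad
\text{transistors}(t) = M_0 \, 2^{t/18}, \qquad
\frac{\text{data}(t)/D_0}{\text{transistors}(t)/M_0} = 2^{t/12 - t/18} = 2^{t/36}
\]

with t in months, so the gap between sequence data and single-chip compute doubles roughly every 3 years (a factor of about 2 after 3 years, about 4 after 6 years), which is one motivation for the reconfigurable accelerators above.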
Cluster interconnects:

Interconnect    Latency        Measured bandwidth
Ethernet Gb/s   28 ... 70 μs   100 MB/s
Myrinet         4.7 μs         500 MB/s
Quadrics        1.1 μs         950 MB/s
Infiniband      3.7 μs         1500 MB/s (11/2004)

Infiniband: goal of 500 ns latency; already has RDMA, atomicity.
Parallel file systems: GFS, GPFS, Lustre, PVFS, QFS, ...
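Latency and bandwidth figures like those above are typically measured with an MPI ping-pong test between two nodes; a minimal C sketch (not from the slides) follows. The message size, repetition count and the halving of the round-trip time are the usual conventions, not values taken from the source.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

/* Ping-pong: rank 0 sends a buffer to rank 1 and waits for it to come back.
 * Half the average round-trip time approximates one-way latency; for large
 * messages, bytes / one-way time approximates bandwidth. */
int main(int argc, char **argv) {
    int rank, size, reps = 1000;
    size_t bytes = (argc > 1) ? (size_t)atol(argv[1]) : 1;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    if (size < 2) {
        if (rank == 0) fprintf(stderr, "run with at least 2 ranks\n");
        MPI_Abort(MPI_COMM_WORLD, 1);
    }

    char *buf = calloc(bytes, 1);
    MPI_Barrier(MPI_COMM_WORLD);
    double t0 = MPI_Wtime();
    for (int i = 0; i < reps; i++) {
        if (rank == 0) {
            MPI_Send(buf, (int)bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, (int)bytes, MPI_CHAR, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        } else if (rank == 1) {
            MPI_Recv(buf, (int)bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            MPI_Send(buf, (int)bytes, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
        }
    }
    double one_way = (MPI_Wtime() - t0) / (2.0 * reps);
    if (rank == 0)
        printf("%zu bytes: latency %.2f us, bandwidth %.1f MB/s\n",
               bytes, one_way * 1e6, (double)bytes / one_way / 1e6);

    free(buf);
    MPI_Finalize();
    return 0;
}

Run with two ranks on two different nodes, e.g. mpirun -np 2 ./pingpong 1 for latency and mpirun -np 2 ./pingpong 1048576 for bandwidth (the executable name is ours).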