Programming the IBM Power3 SP Eric Aubanel Advanced Computational Research Laboratory Faculty of Computer Science, UNB
Advanced Computational Research Laboratory • High Performance Computational Problem-Solving and Visualization Environment • Computational Experiments in multiple disciplines: CS, Science and Eng. • 16-Processor IBM SP3 • Member of C3.ca Association, Inc. (http://www.c3.ca)
Advanced Computational Research Laboratory www.cs.unb.ca/acrl • Virendra Bhavsar, Director • Eric Aubanel, Research Associate & Scientific Computing Support • Sean Seeley, System Administrator
Programming the IBM Power3 SP • History and future of POWER chip • Uni-processor optimization • Description of ACRL’s IBM SP • Parallel Processing • MPI • OpenMP • Hybrid MPI/OpenMP • MPI-I/O (one slide)
POWER chip: 1990 to 2003 1990 • Performance Optimized with Enhanced RISC • Reduced Instruction Set Computer • Superscalar: combined floating point multiply-add (FMA) unit which allowed peak MFLOPS rate = 2 x MHz • Initially: 25 MHz (50 MFLOPS) and 64 KB data cache
POWER chip: 1990 to 2003 1991: SP1 • IBM’s first SP (scalable power parallel) • Rack of standalone POWER processors (62.5 MHz) connected by internal switch network • Parallel Environment & system software
POWER chip: 1990 to 2003 1993: POWER2 • 2 FMAs • Increased data cache size • 66.5 MHz (254 MFLOPS) • Improved instruction set (incl. hardware square root) • SP2: POWER2 + higher bandwidth switch for larger systems
POWER chip: 1990 to 2003 • 1993: PowerPC - adds SMP support • 1996: P2SC (POWER2 Super Chip): clock speeds up to 160 MHz
POWER chip: 1990 to 2003 Feb. ‘99: POWER3 • Combined P2SC & POWERPC • 64 bit architecture • Initially 2-way SMP, 200 MHz • Cache improvement, including L2 cache of 1-16 MB • Instruction & data prefetch
POWER3+ chip: Feb. 2000
• Winterhawk II: 375 MHz, 4-way SMP; 2 MULT/ADD units (1500 MFLOPS per processor); 64 KB Level 1 cache (5 nsec, 3.2 GB/sec); 8 MB Level 2 cache (45 nsec, 6.4 GB/sec); 1.6 GB/sec memory bandwidth; 6 GFLOPS per node
• Nighthawk II: 375 MHz, 16-way SMP; 2 MULT/ADD units (1500 MFLOPS per processor); 64 KB Level 1 cache (5 nsec, 3.2 GB/sec); 8 MB Level 2 cache (45 nsec, 6.4 GB/sec); 14 GB/sec memory bandwidth; 24 GFLOPS per node
The Clustered SMP • ACRL's SP: four 4-way SMPs • Each node has its own copy of the O/S • Processors on the same node are closer than those on different nodes
Power4 - 32 way • Logical UMA • SP High Node • L3 cache shared between all processors on node - 32 MB • Up to 32 GB main memory • Each processor: 1.1 GHz • 140 Gflops total peak
Going to NUMA • NUMA up to 256 processors: 1.1 Teraflops peak
Programming the IBM Power3 SP • History and future of POWER chip • Uni-processor optimization • Description of ACRL’s IBM SP • Parallel Processing • MPI • OpenMP • Hybrid MPI/OpenMP • MPI-I/O (one slide)
Uni-processor Optimization • Compiler options: start with -O3 -qstrict, then try -O3 -qarch=pwr3 • Cache re-use • Take advantage of the superscalar architecture: give enough operations per load/store (see the loop sketch below) • Use ESSL: its routines already exploit the architecture fully
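As an illustration of the cache and superscalar points (a minimal sketch, not taken from the original slides; the array size is arbitrary): the inner loop runs down a column so consecutive iterations touch consecutive memory, and unrolling the outer loop by two performs two FMAs per pass over y, improving the ratio of floating-point operations to loads and stores.

      program reuse
      implicit none
      integer, parameter :: n = 512
      real*8 a(n,n), x(n), y(n)
      integer i, j
      a = 1.0d0
      x = 2.0d0
      y = 0.0d0
      ! Fortran is column-major: stride-1 access down each column re-uses
      ! cache lines.  Unrolling j by 2 gives two FMAs per load/store of y(i).
      do j = 1, n, 2
         do i = 1, n
            y(i) = y(i) + a(i,j)*x(j) + a(i,j+1)*x(j+1)
         end do
      end do
      print *, y(1)
      end program reuse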
Cache • 128-byte cache line • L1 cache: 64 KB, 128-way set-associative • L2 cache: 8 MB total (4 x 2 MB), 4-way set-associative
How to Monitor Performance? • IBM’s hardware monitor: HPMCOUNT • Uses hardware counters on chip • Cache & TLB misses, fp ops, load-stores, … • Beta version • Available soon on ACRL’s SP
HPMCOUNT sample output

      real*8 a(256,256), b(256,256), c(256,256)
      common a, b, c
      do j = 1, 256
         do i = 1, 256
            a(i,j) = b(i,j) + c(i,j)
         end do
      end do
      end

PM_TLB_MISS (TLB misses)             : 66543
Average number of loads per TLB miss : 5.916
Total loads and stores               : 0.525 M
Instructions per load/store          : 2.749
Cycles per instruction               : 2.378
Instructions per cycle               : 0.420
Total floating point operations      : 0.066 M
Hardware float point rate            : 2.749 Mflop/sec
HPMCOUNT sample output (padded arrays)

      real*8 a(257,256), b(257,256), c(257,256)
      common a, b, c
      do j = 1, 256
         do i = 1, 257
            a(i,j) = b(i,j) + c(i,j)
         end do
      end do
      end

PM_TLB_MISS (TLB misses)             : 1634
Average number of loads per TLB miss : 241.876
Total loads and stores               : 0.527 M
Instructions per load/store          : 2.749
Cycles per instruction               : 1.271
Instructions per cycle               : 0.787
Total floating point operations      : 0.066 M
Hardware float point rate            : 3.525 Mflop/sec

Padding the leading dimension from 256 to 257 breaks the power-of-two spacing between a, b and c, so their corresponding elements no longer map to the same TLB and cache sets: TLB misses drop by a factor of about 40 and the flop rate improves.
ESSL • Linear algebra, Fourier & related transforms, sorting, interpolation, quadrature, random numbers • Fast! • 560x560 real*8 matrix multiply • Hand coding: 19 Mflops • dgemm: 1.2 GFlops • Parallel (threaded and distributed) versions
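As a small sketch (not from the original slides), the 560x560 multiply above could be handed to ESSL's dgemm, which follows the standard BLAS interface; the array names and values here are arbitrary, and on the SP the program would typically be linked with -lessl.

      program matmul_essl
      implicit none
      integer, parameter :: n = 560
      real*8 a(n,n), b(n,n), c(n,n)
      a = 1.0d0
      b = 2.0d0
      c = 0.0d0
      ! C = 1.0*A*B + 0.0*C, all leading dimensions equal to n
      call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
      print *, c(1,1)
      end program matmul_essl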
Programming the IBM Power3 SP • History and future of POWER chip • Uni-processor optimization • Description of ACRL’s IBM SP • Parallel Processing • MPI • OpenMP • Hybrid MPI/OpenMP • MPI-I/O (one slide)
ACRL's IBM SP • 4 Winterhawk II nodes (16 processors) • Each node has: 1 GB RAM, 9 GB (mirrored) disk, switch adapter • High Performance Switch • Gigabit Ethernet (1 node) • Control workstation • Disk: SSA tower with 6 x 18.2 GB disks
IBM Power3 SP Switch • Bidirectional multistage interconnection network (MIN) • 300 MB/sec bi-directional bandwidth • 1.2 µsec latency
General Parallel File System • (Diagram) Each node runs the application on top of a GPFS layer and RVSD/VSD, connected by the SP Switch: nodes 2-4 act as GPFS clients, node 1 as the GPFS/VSD server
ACRL Software • Operating System: AIX 4.3.3 • Compilers • IBM XL Fortran 7.1 (HPF not yet installed) • VisualAge C for AIX, Version 5.0.1.0 • VisualAge C++ Professional for AIX, Version 5.0.0.0 • IBM Visual Age Java - not yet installed • Job Scheduler: Loadleveler 2.2 • Parallel Programming Tools • IBM Parallel Environment 3.1: MPI, MPI-2 parallel I/O • Numerical Libraries: ESSL (v. 3.2) and Parallel ESSL (v. 2.2 ) • Visualization: OpenDX (not yet installed) • E-Commerce software (not yet installed)
Programming the IBM Power3 SP • History and future of POWER chip • Uni-processor optimization • Description of ACRL’s IBM SP • Parallel Processing • MPI • OpenMP • Hybrid MPI/OpenMP • MPI-I/O (one slide)
Why Parallel Computing? • Solve large problems in reasonable time • Many algorithms are inherently parallel: image processing, Monte Carlo, simulations (e.g. CFD) • High performance computers have parallel architectures • Commercial off-the-shelf (COTS) components: Beowulf clusters, SMP nodes • Improvements in network technology
NRL Layered Ocean Model on the Naval Research Laboratory's IBM Winterhawk II SP
Parallel Computational Models • Data Parallelism • Parallel program looks like serial program • parallelism in the data • Vector processors • HPF
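As a rough sketch of the data-parallel style (not from the original slides; the names and sizes are illustrative): the program reads like serial array code, and HPF directives, which an ordinary Fortran compiler treats as comments, say how the data are spread across processors.

      real a(1000), b(1000)
!HPF$ DISTRIBUTE a(BLOCK)
!HPF$ DISTRIBUTE b(BLOCK)
      b = 1.0
      a = 2.0 * b      ! whole-array operation; the compiler partitions the work
      end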
Parallel Computational Models • Message Passing (MPI) • Processes have only local memory but can communicate with other processes by sending & receiving messages • Data transfer between processes requires operations to be performed by both processes (Send on one, Receive on the other) • Communication network not part of computational model (hypercube, torus, …)
Parallel Computational Models • Shared Memory (threads): all threads share one address space • P(osix)threads • OpenMP: higher level standard
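A minimal shared-memory sketch (an illustration, not from the slides): the OpenMP directive splits the loop iterations among threads that all see the same array. With XL Fortran such a program would normally be built with the thread-safe compiler (xlf_r) and -qsmp=omp.

      program omp_sketch
      implicit none
      integer, parameter :: n = 1000
      real*8 a(n)
      integer i
!$OMP PARALLEL DO
      do i = 1, n
         a(i) = i * 2.0d0    ! iterations divided among threads sharing a()
      end do
!$OMP END PARALLEL DO
      print *, a(n)
      end program omp_sketch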
Parallel Computational Models • Remote Memory Operations • “One-sided” communication • MPI-2, IBM’s LAPI • One process can access the memory of another without the other’s participation, but does so explicitly, not the same way it accesses local memory Get Put
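A minimal one-sided sketch (not from the slides), assuming an MPI-2 implementation with one-sided support and two processes: each process exposes a buffer as a "window", and process 0 reads process 1's buffer with MPI_GET while process 1 makes no matching call; the fences mark the access epoch.

      program onesided
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 10
      real*8 buf(n), local(n)
      integer win, ierr, myid
      integer(kind=MPI_ADDRESS_KIND) :: winsize, disp
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      buf = myid
      local = -1.0d0
      winsize = n * 8
      ! Expose buf as a window other processes may access directly
      call MPI_WIN_CREATE(buf, winsize, 8, MPI_INFO_NULL, &
                          MPI_COMM_WORLD, win, ierr)
      call MPI_WIN_FENCE(0, win, ierr)
      if (myid .eq. 0) then
         disp = 0
         call MPI_GET(local, n, MPI_DOUBLE_PRECISION, 1, disp, &
                      n, MPI_DOUBLE_PRECISION, win, ierr)
      end if
      call MPI_WIN_FENCE(0, win, ierr)
      if (myid .eq. 0) print *, 'rank 0 got', local(1)
      call MPI_WIN_FREE(win, ierr)
      call MPI_FINALIZE(ierr)
      end program onesided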
Parallel Computational Models • Combined: Message Passing & Threads (several SMP nodes, each with its own address space and several threads/processes, connected by a network) • Driven by clusters of SMPs • Leads to software complexity!
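A minimal hybrid sketch (an illustration, not from the slides): one MPI process per SMP node, with OpenMP threads sharing the loop inside each process; all MPI calls stay outside the parallel region, which keeps the threading simple.

      program hybrid
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 1000
      real*8 a(n)
      integer i, myid, ierr
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      ! MPI distributes work across nodes; OpenMP splits the loop on each node
!$OMP PARALLEL DO
      do i = 1, n
         a(i) = myid + i
      end do
!$OMP END PARALLEL DO
      print *, 'rank', myid, 'a(n) =', a(n)
      call MPI_FINALIZE(ierr)
      end program hybrid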
Programming the IBM Power3 SP • History and future of POWER chip • Uni-processor optimization • Description of ACRL’s IBM SP • Parallel Processing • MPI • OpenMP • Hybrid MPI/OpenMP • MPI-I/O (one slide)
Message Passing Interface • MPI 1.0 standard in 1994 • MPI 1.1 in 1995 - IBM support • MPI 2.0 in 1997 • Includes 1.1 but adds new features • MPI-IO • One-sided communication • Dynamic processes
Advantages of MPI • Universality • Expressivity • Well suited to formulating a parallel algorithm • Ease of debugging • Memory is local • Performance • Explicit association of data with process allows good use of cache
MPI Functionality • Several modes of point-to-point message passing • blocking (e.g. MPI_SEND) • non-blocking (e.g. MPI_ISEND) • synchronous (e.g. MPI_SSEND) • buffered (e.g. MPI_BSEND) • Collective communication and synchronization • e.g. MPI_REDUCE, MPI_BARRIER • User-defined datatypes • Logically distinct communicator spaces • Application-level or virtual topologies
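A small sketch (not from the slides) of the non-blocking calls listed above, written for two processes: MPI_IRECV and MPI_ISEND return immediately, useful computation could be overlapped, and the MPI_WAIT calls complete the transfers, so the exchange cannot deadlock regardless of message size.

      program nonblock
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 100
      real*8 a(n), b(n)
      integer myid, other, ierr
      integer sreq, rreq, status(MPI_STATUS_SIZE)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      other = mod(myid + 1, 2)
      a = myid
      ! Post receive and send without waiting for either to complete
      call MPI_IRECV(b, n, MPI_DOUBLE_PRECISION, other, other, &
                     MPI_COMM_WORLD, rreq, ierr)
      call MPI_ISEND(a, n, MPI_DOUBLE_PRECISION, other, myid, &
                     MPI_COMM_WORLD, sreq, ierr)
      ! ... useful computation could go here ...
      call MPI_WAIT(rreq, status, ierr)
      call MPI_WAIT(sreq, status, ierr)
      print *, 'rank', myid, 'received', b(1)
      call MPI_FINALIZE(ierr)
      end program nonblock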
Simple MPI Example • (Diagram) Two processes, My_Id = 0 and My_Id = 1: process 0 prints "This is from MPI process number 0", the other prints "This is from MPI processes other than 0"
Simple MPI Example

      Program Trivial
      implicit none
      include "mpif.h"      ! MPI header file
      integer My_Id, Numb_of_Procs, Ierr
      call MPI_INIT ( ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, ierr )
      call MPI_COMM_SIZE ( MPI_COMM_WORLD, Numb_of_Procs, ierr )
      print *, ' My_id, numb_of_procs = ', My_Id, Numb_of_Procs
      if ( My_Id .eq. 0 ) then
         print *, ' This is from MPI process number ', My_Id
      else
         print *, ' This is from MPI processes other than 0 ', My_Id
      end if
      call MPI_FINALIZE ( ierr )   ! bad things happen if you forget ierr
      stop
      end
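As a general usage note (not prescribed by the original slides): under IBM Parallel Environment a program like this would typically be compiled with the MPI wrapper mpxlf and launched under POE, for example "poe ./trivial -procs 2"; the exact invocation depends on the site's LoadLeveler configuration.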
MPI Example with send/recv • (Diagram) Processes My_Id = 0 and My_Id = 1 each send to and receive from the other
MPI Example with send/recv

      Program Simple
      implicit none
      Include "mpif.h"
      Integer My_Id, Other_Id, Nx, Ierr
      Integer Status ( MPI_STATUS_SIZE )
      Parameter ( Nx = 100 )
      Real A ( Nx ), B ( Nx )
      call MPI_INIT ( Ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, Ierr )
      Other_Id = Mod ( My_Id + 1, 2 )
      A = My_Id
      call MPI_SEND ( A, Nx, MPI_REAL, Other_Id, My_Id, MPI_COMM_WORLD, Ierr )
      call MPI_RECV ( B, Nx, MPI_REAL, Other_Id, Other_Id, MPI_COMM_WORLD, Status, Ierr )
      call MPI_FINALIZE ( Ierr )
      stop
      end
What Will Happen?

/* Processor 0 */
...
MPI_Send(sendbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD);
printf("Posting receive now ...\n");
MPI_Recv(recvbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD, status);

/* Processor 1 */
...
MPI_Send(sendbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD);
printf("Posting receive now ...\n");
MPI_Recv(recvbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD, status);
MPI Message Passing Modes • Ready mode: ready protocol • Standard mode: eager protocol for messages <= eager limit, rendezvous protocol for messages > eager limit • Synchronous mode: rendezvous protocol • Buffered mode: buffered protocol • Default eager limit on the SP is 4 KB (can be raised to 64 KB)
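In the example above, both processes send before receiving, so the exchange only completes while the messages fit under the eager limit; above it, standard-mode sends fall back to rendezvous and both processes block. One deadlock-free alternative is MPI_SENDRECV, shown in this sketch (not from the slides, written for two processes with a message size chosen to exceed any eager limit):

      program safe_exchange
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 20000      ! 160 KB of real*8, above the eager limit
      real*8 a(n), b(n)
      integer myid, other, ierr, status(MPI_STATUS_SIZE)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      other = mod(myid + 1, 2)
      a = myid
      ! Combined send+receive: MPI orders the transfers so neither side deadlocks
      call MPI_SENDRECV(a, n, MPI_DOUBLE_PRECISION, other, 0, &
                        b, n, MPI_DOUBLE_PRECISION, other, 0, &
                        MPI_COMM_WORLD, status, ierr)
      print *, 'rank', myid, 'received', b(1)
      call MPI_FINALIZE(ierr)
      end program safe_exchange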
MPI Performance Visualization • ParaGraph • Developed by University of Illinois • Graphical display system for visualizing behaviour and performance of MPI programs