Programming the IBM Power3 SP Eric Aubanel Advanced Computational Research Laboratory Faculty of Computer Science, UNB
Advanced Computational Research Laboratory • High Performance Computational Problem-Solving and Visualization Environment • Computational Experiments in multiple disciplines: CS, Science and Eng. • 16-Processor IBM SP3 • Member of C3.ca Association, Inc. (http://www.c3.ca)
Advanced Computational Research Laboratory www.cs.unb.ca/acrl • Virendra Bhavsar, Director • Eric Aubanel, Research Associate & Scientific Computing Support • Sean Seeley, System Administrator
Programming the IBM Power3 SP • History and future of POWER chip • Uni-processor optimization • Description of ACRL’s IBM SP • Parallel Processing • MPI • OpenMP • Hybrid MPI/OpenMP • MPI-I/O (one slide)
POWER chip: 1990 to 2003 1990 • Performance Optimized with Enhanced RISC • Reduced Instruction Set Computer • Superscalar: combined floating point multiply-add (FMA) unit which allowed peak MFLOPS rate = 2 x MHz • Initially: 25 MHz (50 MFLOPS) and 64 KB data cache
POWER chip: 1990 to 2003 1991: SP1 • IBM’s first SP (scalable power parallel) • Rack of standalone POWER processors (62.5 MHz) connected by internal switch network • Parallel Environment & system software
POWER chip: 1990 to 2003 1993: POWER2 • 2 FMAs • Increased data cache size • 66.5 MHz (254 MFLOPS) • Improved instruction set (incl. hardware square root) • SP2: POWER2 + higher bandwidth switch for larger systems
POWER chip: 1990 to 2003 • 1993: PowerPC - adds SMP support • 1996: P2SC (POWER2 Super Chip): clock speeds up to 160 MHz
POWER chip: 1990 to 2003 Feb. ‘99: POWER3 • Combined P2SC & POWERPC • 64 bit architecture • Initially 2-way SMP, 200 MHz • Cache improvement, including L2 cache of 1-16 MB • Instruction & data prefetch
POWER3+ chip: Feb. 2000
• Winterhawk II: 375 MHz, 4-way SMP; 2 MULT/ADD units (1500 MFLOPS per processor); 64 KB Level 1 cache (5 nsec, 3.2 GB/sec); 8 MB Level 2 cache (45 nsec, 6.4 GB/sec); 1.6 GB/sec memory bandwidth; 6 GFLOPS per node
• Nighthawk II: 375 MHz, 16-way SMP; 2 MULT/ADD units (1500 MFLOPS per processor); 64 KB Level 1 cache (5 nsec, 3.2 GB/sec); 8 MB Level 2 cache (45 nsec, 6.4 GB/sec); 14 GB/sec memory bandwidth; 24 GFLOPS per node
The Clustered SMP • ACRL's SP: four 4-way SMPs • Each node has its own copy of the O/S • Processors on the same node are closer than those on different nodes
Power4 - 32 way • Logical UMA • SP High Node • L3 cache shared between all processors on node - 32 MB • Up to 32 GB main memory • Each processor: 1.1 GHz • 140 Gflops total peak
Going to NUMA • NUMA up to 256 processors: 1.1 Teraflops peak
Programming the IBM Power3 SP • History and future of POWER chip • Uni-processor optimization • Description of ACRL’s IBM SP • Parallel Processing • MPI • OpenMP • Hybrid MPI/OpenMP • MPI-I/O (one slide)
Uni-processor Optimization • Compiler options: start with -O3 -qstrict, then try -O3 -qarch=pwr3 • Cache re-use • Take advantage of the superscalar architecture: give enough operations per load/store (see the loop sketch below) • Use ESSL: its routines already exploit the architecture fully
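As an illustration of the cache and superscalar points (a minimal sketch, not taken from the original slides; the array size is arbitrary): the inner loop runs down a column so consecutive iterations touch consecutive memory, and unrolling the outer loop by two performs two FMAs per pass over y, improving the ratio of floating-point operations to loads and stores.

      program reuse
      implicit none
      integer, parameter :: n = 512
      real*8 a(n,n), x(n), y(n)
      integer i, j
      a = 1.0d0
      x = 2.0d0
      y = 0.0d0
      ! Fortran is column-major: stride-1 access down each column re-uses
      ! cache lines.  Unrolling j by 2 gives two FMAs per load/store of y(i).
      do j = 1, n, 2
         do i = 1, n
            y(i) = y(i) + a(i,j)*x(j) + a(i,j+1)*x(j+1)
         end do
      end do
      print *, y(1)
      end program reuse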
Cache • 128-byte cache line • L1 cache: 64 KB, 128-way set-associative • L2 cache: 8 MB total (4 x 2 MB), 4-way set-associative
How to Monitor Performance? • IBM’s hardware monitor: HPMCOUNT • Uses hardware counters on chip • Cache & TLB misses, fp ops, load-stores, … • Beta version • Available soon on ACRL’s SP
HPMCOUNT sample output

      real*8 a(256,256), b(256,256), c(256,256)
      common a, b, c
      do j = 1, 256
         do i = 1, 256
            a(i,j) = b(i,j) + c(i,j)
         end do
      end do
      end

PM_TLB_MISS (TLB misses)             : 66543
Average number of loads per TLB miss : 5.916
Total loads and stores               : 0.525 M
Instructions per load/store          : 2.749
Cycles per instruction               : 2.378
Instructions per cycle               : 0.420
Total floating point operations      : 0.066 M
Hardware float point rate            : 2.749 Mflop/sec
HPMCOUNT sample output (padded arrays)

      real*8 a(257,256), b(257,256), c(257,256)
      common a, b, c
      do j = 1, 256
         do i = 1, 257
            a(i,j) = b(i,j) + c(i,j)
         end do
      end do
      end

PM_TLB_MISS (TLB misses)             : 1634
Average number of loads per TLB miss : 241.876
Total loads and stores               : 0.527 M
Instructions per load/store          : 2.749
Cycles per instruction               : 1.271
Instructions per cycle               : 0.787
Total floating point operations      : 0.066 M
Hardware float point rate            : 3.525 Mflop/sec

Padding the leading dimension from 256 to 257 breaks the power-of-two spacing between a, b and c, so their corresponding elements no longer map to the same TLB and cache sets: TLB misses drop by a factor of about 40 and the flop rate improves.
ESSL • Linear algebra, Fourier & related transforms, sorting, interpolation, quadrature, random numbers • Fast! • 560x560 real*8 matrix multiply • Hand coding: 19 Mflops • dgemm: 1.2 GFlops • Parallel (threaded and distributed) versions
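As a small sketch (not from the original slides), the 560x560 multiply above could be handed to ESSL's dgemm, which follows the standard BLAS interface; the array names and values here are arbitrary, and on the SP the program would typically be linked with -lessl.

      program matmul_essl
      implicit none
      integer, parameter :: n = 560
      real*8 a(n,n), b(n,n), c(n,n)
      a = 1.0d0
      b = 2.0d0
      c = 0.0d0
      ! C = 1.0*A*B + 0.0*C, all leading dimensions equal to n
      call dgemm('N', 'N', n, n, n, 1.0d0, a, n, b, n, 0.0d0, c, n)
      print *, c(1,1)
      end program matmul_essl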
Programming the IBM Power3 SP • History and future of POWER chip • Uni-processor optimization • Description of ACRL’s IBM SP • Parallel Processing • MPI • OpenMP • Hybrid MPI/OpenMP • MPI-I/O (one slide)
ACRL's IBM SP • 4 Winterhawk II nodes (16 processors) • Each node has: 1 GB RAM, 9 GB (mirrored) disk, switch adapter • High Performance Switch • Gigabit Ethernet (1 node) • Control workstation • Disk: SSA tower with 6 x 18.2 GB disks
IBM Power3 SP Switch • Bidirectional multistage interconnection network (MIN) • 300 MB/sec bi-directional bandwidth • 1.2 µsec latency
General Parallel File System • (Diagram) Each node runs the application on top of a GPFS layer and RVSD/VSD, connected by the SP Switch: nodes 2-4 act as GPFS clients, node 1 as the GPFS/VSD server
ACRL Software • Operating System: AIX 4.3.3 • Compilers • IBM XL Fortran 7.1 (HPF not yet installed) • VisualAge C for AIX, Version 5.0.1.0 • VisualAge C++ Professional for AIX, Version 5.0.0.0 • IBM Visual Age Java - not yet installed • Job Scheduler: Loadleveler 2.2 • Parallel Programming Tools • IBM Parallel Environment 3.1: MPI, MPI-2 parallel I/O • Numerical Libraries: ESSL (v. 3.2) and Parallel ESSL (v. 2.2 ) • Visualization: OpenDX (not yet installed) • E-Commerce software (not yet installed)
Programming the IBM Power3 SP • History and future of POWER chip • Uni-processor optimization • Description of ACRL’s IBM SP • Parallel Processing • MPI • OpenMP • Hybrid MPI/OpenMP • MPI-I/O (one slide)
Why Parallel Computing? • Solve large problems in reasonable time • Many algorithms are inherently parallel: image processing, Monte Carlo, simulations (e.g. CFD) • High performance computers have parallel architectures • Commercial off-the-shelf (COTS) components: Beowulf clusters, SMP nodes • Improvements in network technology
NRL Layered Ocean Model on the Naval Research Laboratory's IBM Winterhawk II SP
Parallel Computational Models • Data Parallelism • Parallel program looks like serial program • parallelism in the data • Vector processors • HPF
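As a rough sketch of the data-parallel style (not from the original slides; the names and sizes are illustrative): the program reads like serial array code, and HPF directives, which an ordinary Fortran compiler treats as comments, say how the data are spread across processors.

      real a(1000), b(1000)
!HPF$ DISTRIBUTE a(BLOCK)
!HPF$ DISTRIBUTE b(BLOCK)
      b = 1.0
      a = 2.0 * b      ! whole-array operation; the compiler partitions the work
      end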
Parallel Computational Models • Message Passing (MPI) • Processes have only local memory but can communicate with other processes by sending & receiving messages • Data transfer between processes requires operations to be performed by both processes (Send on one, Receive on the other) • Communication network not part of computational model (hypercube, torus, …)
Parallel Computational Models • Shared Memory (threads): all threads share one address space • P(osix)threads • OpenMP: higher level standard
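A minimal shared-memory sketch (an illustration, not from the slides): the OpenMP directive splits the loop iterations among threads that all see the same array. With XL Fortran such a program would normally be built with the thread-safe compiler (xlf_r) and -qsmp=omp.

      program omp_sketch
      implicit none
      integer, parameter :: n = 1000
      real*8 a(n)
      integer i
!$OMP PARALLEL DO
      do i = 1, n
         a(i) = i * 2.0d0    ! iterations divided among threads sharing a()
      end do
!$OMP END PARALLEL DO
      print *, a(n)
      end program omp_sketch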
Parallel Computational Models • Remote Memory Operations • “One-sided” communication • MPI-2, IBM’s LAPI • One process can access the memory of another without the other’s participation, but does so explicitly, not the same way it accesses local memory Get Put
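A minimal one-sided sketch (not from the slides), assuming an MPI-2 implementation with one-sided support and two processes: each process exposes a buffer as a "window", and process 0 reads process 1's buffer with MPI_GET while process 1 makes no matching call; the fences mark the access epoch.

      program onesided
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 10
      real*8 buf(n), local(n)
      integer win, ierr, myid
      integer(kind=MPI_ADDRESS_KIND) :: winsize, disp
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      buf = myid
      local = -1.0d0
      winsize = n * 8
      ! Expose buf as a window other processes may access directly
      call MPI_WIN_CREATE(buf, winsize, 8, MPI_INFO_NULL, &
                          MPI_COMM_WORLD, win, ierr)
      call MPI_WIN_FENCE(0, win, ierr)
      if (myid .eq. 0) then
         disp = 0
         call MPI_GET(local, n, MPI_DOUBLE_PRECISION, 1, disp, &
                      n, MPI_DOUBLE_PRECISION, win, ierr)
      end if
      call MPI_WIN_FENCE(0, win, ierr)
      if (myid .eq. 0) print *, 'rank 0 got', local(1)
      call MPI_WIN_FREE(win, ierr)
      call MPI_FINALIZE(ierr)
      end program onesided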
Parallel Computational Models • Combined: Message Passing & Threads (several SMP nodes, each with its own address space and several threads/processes, connected by a network) • Driven by clusters of SMPs • Leads to software complexity!
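A minimal hybrid sketch (an illustration, not from the slides): one MPI process per SMP node, with OpenMP threads sharing the loop inside each process; all MPI calls stay outside the parallel region, which keeps the threading simple.

      program hybrid
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 1000
      real*8 a(n)
      integer i, myid, ierr
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      ! MPI distributes work across nodes; OpenMP splits the loop on each node
!$OMP PARALLEL DO
      do i = 1, n
         a(i) = myid + i
      end do
!$OMP END PARALLEL DO
      print *, 'rank', myid, 'a(n) =', a(n)
      call MPI_FINALIZE(ierr)
      end program hybrid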
Programming the IBM Power3 SP • History and future of POWER chip • Uni-processor optimization • Description of ACRL’s IBM SP • Parallel Processing • MPI • OpenMP • Hybrid MPI/OpenMP • MPI-I/O (one slide)
Message Passing Interface • MPI 1.0 standard in 1994 • MPI 1.1 in 1995 - IBM support • MPI 2.0 in 1997 • Includes 1.1 but adds new features • MPI-IO • One-sided communication • Dynamic processes
Advantages of MPI • Universality • Expressivity • Well suited to formulating a parallel algorithm • Ease of debugging • Memory is local • Performance • Explicit association of data with process allows good use of cache
MPI Functionality • Several modes of point-to-point message passing • blocking (e.g. MPI_SEND) • non-blocking (e.g. MPI_ISEND) • synchronous (e.g. MPI_SSEND) • buffered (e.g. MPI_BSEND) • Collective communication and synchronization • e.g. MPI_REDUCE, MPI_BARRIER • User-defined datatypes • Logically distinct communicator spaces • Application-level or virtual topologies
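A small sketch (not from the slides) of the non-blocking calls listed above, written for two processes: MPI_IRECV and MPI_ISEND return immediately, useful computation could be overlapped, and the MPI_WAIT calls complete the transfers, so the exchange cannot deadlock regardless of message size.

      program nonblock
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 100
      real*8 a(n), b(n)
      integer myid, other, ierr
      integer sreq, rreq, status(MPI_STATUS_SIZE)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      other = mod(myid + 1, 2)
      a = myid
      ! Post receive and send without waiting for either to complete
      call MPI_IRECV(b, n, MPI_DOUBLE_PRECISION, other, other, &
                     MPI_COMM_WORLD, rreq, ierr)
      call MPI_ISEND(a, n, MPI_DOUBLE_PRECISION, other, myid, &
                     MPI_COMM_WORLD, sreq, ierr)
      ! ... useful computation could go here ...
      call MPI_WAIT(rreq, status, ierr)
      call MPI_WAIT(sreq, status, ierr)
      print *, 'rank', myid, 'received', b(1)
      call MPI_FINALIZE(ierr)
      end program nonblock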
Simple MPI Example • (Diagram) Two processes, My_Id = 0 and My_Id = 1: process 0 prints "This is from MPI process number 0", the other prints "This is from MPI processes other than 0"
Simple MPI Example

      Program Trivial
      implicit none
      include "mpif.h"      ! MPI header file
      integer My_Id, Numb_of_Procs, Ierr
      call MPI_INIT ( ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, ierr )
      call MPI_COMM_SIZE ( MPI_COMM_WORLD, Numb_of_Procs, ierr )
      print *, ' My_id, numb_of_procs = ', My_Id, Numb_of_Procs
      if ( My_Id .eq. 0 ) then
         print *, ' This is from MPI process number ', My_Id
      else
         print *, ' This is from MPI processes other than 0 ', My_Id
      end if
      call MPI_FINALIZE ( ierr )   ! bad things happen if you forget ierr
      stop
      end
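As a general usage note (not prescribed by the original slides): under IBM Parallel Environment a program like this would typically be compiled with the MPI wrapper mpxlf and launched under POE, for example "poe ./trivial -procs 2"; the exact invocation depends on the site's LoadLeveler configuration.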
MPI Example with send/recv • (Diagram) Processes My_Id = 0 and My_Id = 1 each send to and receive from the other
MPI Example with send/recv

      Program Simple
      implicit none
      Include "mpif.h"
      Integer My_Id, Other_Id, Nx, Ierr
      Integer Status ( MPI_STATUS_SIZE )
      Parameter ( Nx = 100 )
      Real A ( Nx ), B ( Nx )
      call MPI_INIT ( Ierr )
      call MPI_COMM_RANK ( MPI_COMM_WORLD, My_Id, Ierr )
      Other_Id = Mod ( My_Id + 1, 2 )
      A = My_Id
      call MPI_SEND ( A, Nx, MPI_REAL, Other_Id, My_Id, MPI_COMM_WORLD, Ierr )
      call MPI_RECV ( B, Nx, MPI_REAL, Other_Id, Other_Id, MPI_COMM_WORLD, Status, Ierr )
      call MPI_FINALIZE ( Ierr )
      stop
      end
What Will Happen?

/* Processor 0 */
...
MPI_Send(sendbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD);
printf("Posting receive now ...\n");
MPI_Recv(recvbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD, status);

/* Processor 1 */
...
MPI_Send(sendbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD);
printf("Posting receive now ...\n");
MPI_Recv(recvbuf, bufsize, MPI_CHAR, partner, tag, MPI_COMM_WORLD, status);
MPI Message Passing Modes • Ready mode: ready protocol • Standard mode: eager protocol for messages <= eager limit, rendezvous protocol for messages > eager limit • Synchronous mode: rendezvous protocol • Buffered mode: buffered protocol • Default eager limit on the SP is 4 KB (can be raised to 64 KB)
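In the example above, both processes send before receiving, so the exchange only completes while the messages fit under the eager limit; above it, standard-mode sends fall back to rendezvous and both processes block. One deadlock-free alternative is MPI_SENDRECV, shown in this sketch (not from the slides, written for two processes with a message size chosen to exceed any eager limit):

      program safe_exchange
      implicit none
      include 'mpif.h'
      integer, parameter :: n = 20000      ! 160 KB of real*8, above the eager limit
      real*8 a(n), b(n)
      integer myid, other, ierr, status(MPI_STATUS_SIZE)
      call MPI_INIT(ierr)
      call MPI_COMM_RANK(MPI_COMM_WORLD, myid, ierr)
      other = mod(myid + 1, 2)
      a = myid
      ! Combined send+receive: MPI orders the transfers so neither side deadlocks
      call MPI_SENDRECV(a, n, MPI_DOUBLE_PRECISION, other, 0, &
                        b, n, MPI_DOUBLE_PRECISION, other, 0, &
                        MPI_COMM_WORLD, status, ierr)
      print *, 'rank', myid, 'received', b(1)
      call MPI_FINALIZE(ierr)
      end program safe_exchange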
MPI Performance Visualization • ParaGraph • Developed by University of Illinois • Graphical display system for visualizing behaviour and performance of MPI programs