TACC/NPACI IBM Regatta-HPC (Power4) Overview

Chona Guiang, Kent Milfeld, Avi Purkayastha and Jay Boisseau
August 21, 2002
Texas Advanced Computing Center, The University of Texas at Austin

Background: TACC as an NPACI Resource Partner
• The Texas Advanced Computing Center (TACC) has provided HPC resources and services to UT-Austin for 16 years
• TACC has been a leading NPACI resource partner since October 1997 and has provided Cray T3E, Cray SV1, IBM SP2, and now IBM Regatta cycles to NPACI users
• TACC resources are available via the usual NPACI allocations procedures
• TACC will offer HPC (and SciViz and Grid) training this fall in Austin (the Live Music Capital of the World)

Outline
• Architecture and System Configuration
• Regatta Programming Environment
• Power4 Code Optimization

Architecture and System Configuration

Architecture & System Configuration Outline
• Processor Features (chip, core, cache/memory)
• Node Design (Multi-Chip Modules, MCM)
• MCM Memory Access (remote/local)

IBM Microprocessor Family
[Figure: IBM microprocessor family tree. Labels: Power (32-bit), P2SC, Power3 (64-bit), Power4 (64-bit, SOI, copper, 1+ GHz); PowerPC (32-bit) 60x chip series; PowerPC (64-bit) RS64 chip series.]

Power4 Processor Features
• 64-bit Architecture
• Superscalar, Dynamic Scheduling
• Speculative Superscalar
• Out-of-Order execution, In-Order completion
• “8 Instruction Fetch”, but instructions are grouped for execution
• Sustains five issues per clock and 1 branch, with up to 215 instructions in flight
• 2 LSU, 2 FXU, 2 FPU, 1 BXU, 1 CRLXU
• 8 Prefetching Streams

Processor Features (cont.)
• 80 General Purpose Registers, 72 Float Registers
• Rename registers for pipelining
• Aggressive Branch Prediction
• 4KB or 16MB Page Sizes
• 3-Level Cache
• 1024-entry TLB
• Hardware Performance Counters

Processor Features: FPU
• 2 Floating Point Multiply/Add (FMA) Units
  • 4 Flops/CP
  • 6 CP FMA Pipeline
• 128-bit Intermediate Results (no rounding, default)
• IEEE Arithmetic
• 32 Floating Point Registers + 40 rename regs
• Hardware Square Root 38 CPs, Divide 32 CPs
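The FPU figures above imply a peak floating-point rate once a clock rate is chosen. The sketch below works through that arithmetic, assuming the 1.3 GHz compute-node clock and the 4 x 16-way configuration quoted elsewhere in this deck; the function names are illustrative, and the "independent FMAs" figure is simply units times pipeline depth, not a measured requirement.

```python
# Peak-rate arithmetic from the FPU slide (a sketch, not a benchmark).
FMA_UNITS = 2          # from the slide: 2 FMA units per processor
FLOPS_PER_FMA = 2      # one multiply plus one add
FMA_PIPELINE_CP = 6    # from the slide: 6 CP FMA pipeline
CLOCK_MHZ = 1300       # assumption: 1.3 GHz compute-node processors
PROCS_PER_NODE = 16    # assumption: one 16-way compute node
NODES = 4              # assumption: four compute nodes


def flops_per_cp():
    """Floating-point operations completed per clock period at peak."""
    return FMA_UNITS * FLOPS_PER_FMA


def peak_mflops(procs=1):
    """Theoretical peak in MFLOPS for the given number of processors."""
    return flops_per_cp() * CLOCK_MHZ * procs


def independent_fmas_needed():
    """Independent FMAs that must be in flight to keep both pipelines full."""
    return FMA_UNITS * FMA_PIPELINE_CP


if __name__ == "__main__":
    print("flops per CP:", flops_per_cp())
    print("peak per processor (MFLOPS):", peak_mflops())
    print("peak per node (MFLOPS):", peak_mflops(PROCS_PER_NODE))
    print("peak, all compute nodes (MFLOPS):", peak_mflops(PROCS_PER_NODE * NODES))
    print("independent FMAs to fill pipelines:", independent_fmas_needed())
```

The last number is why dependent chains of floating-point operations hurt on a deep pipeline: a loop needs roughly that many independent multiply/adds available to approach the peak rate.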

Power4 Packaging: 2 Cores/Chip

Processor Features: Cache
• L1
  • 32KB data, 2-way assoc. (write through)
  • 64KB instruction, direct mapped
• L2
  • 1.44MB (unified), 8-way assoc. (write-in)
• L3
  • 32MB, 8-way assoc.
• 128/128/4x128 Byte Lines for L1/L2/L3

Processor Features: Cache/Memory
[Figure: memory hierarchy diagram. Levels shown: registers, L1 data (32KB), L1 instruction (64KB), L2 (1.4MB), L3 (32MB), memory (8GB/MCM, 13.86GB/sec). Bandwidth labels on the paths: 2 W/CP*, 4 W/CP, 0.87 W/CP. Latencies: ~4 CP, ~14 CP, ~100 CP, and ~250 CP to memory. Line size L1/L2/L3 = 16/16/4x16 W.]
W = word (64 bit); Int = integer (64 bit); CP = clock period
* 2 reads, 1 read & 1 write, or 1 write
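The hierarchy figures are quoted in clock periods and 64-bit words, so converting them to nanoseconds and bytes takes only a clock rate. The sketch below assumes the 1.3 GHz compute-node clock, and it assumes the latencies pair with the levels in increasing order (L1, L2, L3, memory); that pairing is read off the flattened figure and is not stated explicitly.

```python
# Unit conversions for the cache/memory slide (a sketch under stated assumptions).
WORD_BYTES = 8      # from the slide: W is a 64-bit word
CLOCK_GHZ = 1.3     # assumption: 1.3 GHz compute-node processors

# Assumption: latencies matched to levels in increasing order.
LATENCY_CP = {"L1": 4, "L2": 14, "L3": 100, "memory": 250}

# From the slide: line size L1/L2/L3 = 16/16/4x16 words.
LINE_WORDS = {"L1": 16, "L2": 16, "L3": 4 * 16}


def latency_ns(level):
    """Approximate latency in nanoseconds (clock periods / GHz)."""
    return LATENCY_CP[level] / CLOCK_GHZ


def line_bytes(level):
    """Cache line size in bytes."""
    return LINE_WORDS[level] * WORD_BYTES


def bandwidth_gb_per_s(words_per_cp):
    """Convert a words-per-clock-period rate to GB/sec (10**9 bytes)."""
    return words_per_cp * WORD_BYTES * CLOCK_GHZ


if __name__ == "__main__":
    for level in LATENCY_CP:
        print(f"{level}: ~{latency_ns(level):.1f} ns")
    for level in LINE_WORDS:
        print(f"{level} line: {line_bytes(level)} bytes")
    print(f"2 W/CP is about {bandwidth_gb_per_s(2):.1f} GB/sec")
```

The line sizes come out as 128, 128 and 4x128 bytes, which agrees with the byte figures on the cache slide.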

Processor Features: Memory Fabric
[Figure: Power4 chip block diagram. Two processor cores (each with Ifetch, Store and Load paths) connect through the CIU switch to three L2 cache blocks. The fabric controller connects to the chip-chip fabric (2:1), the MCM-MCM links (2:1), the GX bus (n:1) through the GX controller, and the L3/memory bus (3:1) through the L3 directory, L3 controller and memory controller. Other blocks: trace & debug, SP controller, BIST engine, POR sequencer, performance monitor, error detection & logging. Path widths shown range from 4B to 32B.]

Processor Features: Costs of New Features
• Increased FPU & pipeline depth (dependencies hurt, uses more registers)
• Reduced L1 cache size
• Higher latency on higher level caches

Processor Features: Relative Performance
[Chart: performance factor]

Power4 Multi-Chip Module (MCM)
• 4-way SMP on Multi-Chip Module (MCM)
• >41.6 GB/sec chip-to-chip interconnect & MCM-MCM
• Logically shared L2 and L3’s in MCM
• Distributed Switch Design (on chip) features
  • Low latency of a bus-based system
  • High bandwidth of a switch-based system
• Fast I/O Interface (GX bus)
• Dual-plane Switch: two independent switch fabrics; each node has two adapters, one to each fabric.
• Point-to-Point Bandwidth ~350MB/sec; 14 usec latency.
• MPI on-node (shared memory) Bandwidth ~1.5GB/sec
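The quoted point-to-point bandwidth and latency can be combined into a rough estimate of message cost. The sketch below assumes a simple linear model (time = latency + size / bandwidth), which the slides do not state, and takes 1 MB as 10**6 bytes; the function names are illustrative.

```python
# Linear latency/bandwidth model for switch messages (an assumed model).
LATENCY_S = 14e-6        # from the slide: 14 usec latency
BANDWIDTH_BPS = 350e6    # from the slide: ~350MB/sec, taking MB = 10**6 bytes


def transfer_time(nbytes, latency=LATENCY_S, bandwidth=BANDWIDTH_BPS):
    """Estimated seconds to deliver one message of nbytes."""
    return latency + nbytes / bandwidth


def effective_bandwidth(nbytes, latency=LATENCY_S, bandwidth=BANDWIDTH_BPS):
    """Bytes per second actually achieved for a message of nbytes."""
    return nbytes / transfer_time(nbytes, latency, bandwidth)


def half_performance_size(latency=LATENCY_S, bandwidth=BANDWIDTH_BPS):
    """Message size at which half of the asymptotic bandwidth is reached."""
    return latency * bandwidth


if __name__ == "__main__":
    print(f"half-performance size: {half_performance_size():.0f} bytes")
    for size in (8, 1024, 65536, 1048576):
        mb_per_s = effective_bandwidth(size) / 1e6
        print(f"{size:>8} bytes: {mb_per_s:7.1f} MB/sec")
```

Under this model, messages much smaller than a few kilobytes are dominated by latency, which is the usual argument for aggregating small messages between nodes.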

Power4 MCM
Four POWER4 chips assembled onto a Multi-Chip Module (MCM) (left) to create a 4-way SMP building block for the Regatta HPC configuration. The die of a single chip is magnified on the right: 170 million transistors.

Power4 MCM 125 watts / die x 4 HOT!!!

Power4 Node: Multiple MCMs
[Figure: several MCMs, each with attached memory (MEM).]

Power4 Node: Network of Buses
[Figure: one MCM is a 4-way unit with 8GB memory: cpus with L1 and L2, L3 caches, a memory bus and an IO bus. Four MCMs joined by inter-MCM memory paths form a 16-way node with 32GB memory.]

MCM Memory Access: Local
[Figure: one MCM showing cores (C), L2 caches, IOCCs, four 32MB L3 caches with L3 controller/directory, and four 2GB memory banks.]

MCM Memory Access
[Figure: the same components (cores, L2, IOCC, L3 ctl/dir, L3, memory) repeated across all MCMs of a node.]

TACC/NPACI Regatta-HPC Systems
• 4 16-way IBM p690-HPC compute nodes
  • 64 IBM Power4 1.3 GHz procs
  • 32 GB Memory/Compute Node = 128 GB
  • 1 TB disk (3/4 TB in GPFS)
• 1 8-way SMP IBM p690 interactive/service node
  • 8 1.1 GHz procs
  • 5 procs for logins and interactive, 3 procs for GPFS
• High-speed dual-plane IBM SP Switch2 (Colony)
  • Interconnects compute nodes, interactive node & GPFS nodes
• IP name: longhorn.tacc.utexas.edu
• User guide: www.tacc.utexas.edu/resources/user_guides/regatta/

TACC Regatta-HPC Systems
[Figure: system diagram. Five IBM p690 nodes on the IBM Switch2: one with 8 CPUs and 8GB, four with 16 CPUs and 32GB each; four 36GB /scratch file systems; 1/4TB /home; /archive; 3/4TB /work on GPFS (Switch2, SSA); GigE at 125MB/sec.]
• 72 Processors
• 4 16-way SMPs, 32 GB memory/node
• 1 8-way SMP, 8 GB memory
• 1.3 GHz Power4 processors
• IBM Switch2

Programming Environment

Programming Environment Outline
• System Software/Environment
• File Systems
• Compilers
• Compiler Options
• POE (Parallel Operating Environment)
• LoadLeveler Batch Facility
• Debugger

System Software
• System
  • AIX 5.1L (32- or 64-bit kernel)
• Compilers
  • IBM XL Fortran 7.1.1, Visual Age C/C++ 5.0.2
• Batch System
  • LoadLeveler
• Parallel Processing Support
  • PE (PSSP)
  • MPI, OpenMP and Pthreads

System Environment
• Access:
  • ssh longhorn.tacc.utexas.edu
• Login:
  • The system provides basic shell scripts (.login, .cshrc, etc.) for your new account; these should not be altered.
  • Users can create their own scripts, such as .cshrc_user, which supplement the user’s environment. Please see the README file in your directory.
• Modules – set up the programming environment
  • % module available – lists the modules that are loaded by default at login.
  • % module load <application-name> – loads a new application environment.
  • % module swap <application-name> <application-name>old – allows users to swap different versions of applications.
  • % module help – for further information.

File Systems
[Figure: file systems as seen from a Regatta Power4 application node (16 CPUs): Home, Scratch, Work/GPFS and Archive, labeled by mount type (NFS, local) and by visibility (node only, all nodes, all TACC machines).]

File System Limit & Lifetime Table

File System   Environment Variable   User Access Limit   Lifetime
Home          $HOME                  ~50 MB              Project
Work/GPFS     $WORK/$GPFS            Variable            4 Days
Archive       $ARCHIVE               Unlimited           Project
Scratch       $SCRATCH               ~100 GB             Job Duration

(Use cd, cdw, cdg, cda, cds to change directory to $HOME, $WORK, $GPFS, $ARCHIVE, and $SCRATCH, respectively.)

Programming Models Supported
• Serial
  • Login node only
• Shared memory
  • OpenMP, Pthreads
  • Within a compute node: up to 16 procs
• Distributed memory
  • MPI
  • Within and between compute nodes: up to 64 procs
• Mixed mode, e.g. MPI + OpenMP

Compilers

Serial Code:
Example           Type   Compiler           Suffix
xlc prog.c        C      Visual Age Comp.   .c
xlC prog.C        C++    Visual Age Comp.   .C, .i
xlf prog.f        F77    xlf                .f
xlf90 prog.f      F90    xlf90              .f, .F

Parallel (MPI) Code:
Example           Type   Compiler           Suffix
mpcc prog.c       C      VAC                .c
mpCC prog.C       C++    VAC                .C
mpxlf prog.f      F77    xlf                .f
mpxlf90 prog.f    F90    xlf90              .f

For SMP codes (OpenMP or Pthreads), use the “_r” extension on the compiler name.

Target Machine Optimization Options
• ARCH
  • Restricts the compiler to generating a subset of the Power or PowerPC instruction set
  • Specified as -qarch=isa, where isa is one of:
    • com (default): code can run on any RS/6000 => -qtune=pwr2
    • auto: code may take advantage of instructions available only on the compiling machine
    • ppc: code follows PowerPC architecture => -qtune=604 (32 bit) or -qtune=pwr3 (64 bit)
    • pwr4: code can run on any Power4 => -qtune=pwr4
    • Lots of others: pwr, pwr2, pwr3, 604

Target Machine Optimization Options (cont.)
• TUNE: Bias optimization toward execution on a given machine
  • This option affects only performance, not correctness
  • Specified as -qtune=machine, where machine is one of auto, 604, pwr2, pwr3, pwr4, rs64c, etc.
  • -qtune=auto generates code that is automatically tuned for the compiling machine.
• CACHE: Defines a specific cache/memory geometry
  • Specified as -qcache=level=n:cache_spec, where cache_spec includes:
    • type=i|d|c: cache type (instruction/data/combined)
    • line=lsz:size=sz:assoc=as: line size, cache size and set associativity
    • cost=c: cost (in cpu cycles) of a miss
  • Example: -qcache=level=1:type=d:size=32:line=128:assoc=2:cost=11
  • Mainly useful when using -qhot or -qsmp
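The -qcache suboptions above are easy to mistype, so a small helper that assembles the string from named values can be handy in a build script. This is only a sketch: qcache_option is a hypothetical helper, not part of any IBM tool, and it does nothing beyond formatting the suboptions named on the slide.

```python
# Hypothetical helper that formats a -qcache option from named fields.
def qcache_option(level, cache_type, size_kb, line_bytes, assoc, cost_cycles):
    """Build a -qcache string using the suboptions listed on the slide."""
    if cache_type not in ("i", "d", "c"):
        raise ValueError("cache_type must be 'i', 'd' or 'c'")
    fields = [
        ("level", level),
        ("type", cache_type),
        ("size", size_kb),       # cache size, as in the slide's size=32 for the 32KB L1
        ("line", line_bytes),    # line size in bytes
        ("assoc", assoc),        # set associativity
        ("cost", cost_cycles),   # miss cost in cpu cycles
    ]
    return "-qcache=" + ":".join(f"{name}={value}" for name, value in fields)


if __name__ == "__main__":
    # Reproduces the L1 data cache example given on the slide.
    print(qcache_option(1, "d", 32, 128, 2, 11))
```

The call in the usage section regenerates the slide's own example, which makes it a convenient check that the field order and separators are right.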

Program Behavior Options
• STRICT:
  • Specified as -q[no]strict; the default is -qstrict with optimization levels 0 and 2, but -qnostrict with levels 3, 4 and 5.
  • nostrict allows the compiler to reorder floating point computations and potentially excepting instructions
• SAVE:
  • Specified as -q[no]save; the default is -qsave.
  • nosave sets the storage class of local variables to automatic; otherwise it is static.
  • -qnosave should be used in conjunction with -qsmp=omp so that variables in parallel sections are stored as automatic.

Using -qsmp
• -qsmp=noauto should be used with -qsmp=omp if strict OMP extensions are to be followed.
• Test programs using optimization, and preferably using -qhot, in a single-threaded manner before using -qsmp, since -qsmp=auto implies -qhot but -qsmp=omp does not.
• Always use the “_r” (reentrant) compiler invocations when using -qsmp.
• Do not set the PARTHDS or OMP_NUM_THREADS environment variables unless you wish to use fewer than the available processors.
• If using a node or server in dedicated mode, consider setting the SPINS and YIELDS environment variables to 0.

Link Loader Options
• bmaxdata:
  • Specified as -bmaxdata:<bytes>; the default is 256 MB total, including heap, stack and static data, in Segment 2.
  • Sets memory limits up to 2 GB, for heap and static data only, in 32-bit mode, for shared data starting from Segment 3.
  • More memory is available in 64-bit mode (-q64), but performance and migration issues (in C/C++) need to be resolved.
• bmaxstack:
  • Specified as -bmaxstack:<bytes>; the default is 256 MB total with heap and static data in Segment 3.
  • If the -bmaxdata option is used, then 256 MB is available for the stack alone in 32-bit mode; more is not available in this mode.
  • In 64-bit mode, more stack space is available, with the usual concerns.
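Since the 32-bit limits above are expressed in 256 MB segments up to a 2 GB ceiling, the byte value for -bmaxdata can be derived from the heap size a program needs. The sketch below rounds the request up to whole 256 MB segments; that rounding is a choice made for this example, and bmaxdata_flag is a hypothetical helper rather than an AIX utility.

```python
# Hypothetical helper: pick a -bmaxdata value for a 32-bit program.
SEGMENT_BYTES = 256 * 1024 ** 2   # 256 MB segment size
MAX_SEGMENTS_32BIT = 8            # 8 x 256 MB = the 2 GB limit from the slide


def bmaxdata_flag(heap_bytes):
    """Return a -bmaxdata flag covering heap_bytes, in whole segments."""
    if heap_bytes <= 0:
        raise ValueError("heap_bytes must be positive")
    segments = -(-heap_bytes // SEGMENT_BYTES)   # ceiling division
    if segments > MAX_SEGMENTS_32BIT:
        raise ValueError("more than 2 GB requested; 64-bit mode (-q64) is needed")
    return f"-bmaxdata:0x{segments * SEGMENT_BYTES:08X}"


if __name__ == "__main__":
    for megabytes in (300, 1024, 2048):
        print(megabytes, "MB ->", bmaxdata_flag(megabytes * 1024 ** 2))
```

A request above 2 GB raises an error instead of returning a flag, mirroring the slide's point that larger data areas require 64-bit mode.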

Link Loader Options (cont.)
• Loadmap:
  • Specified as -bloadmap:[no]file_name; saves a log of linker actions and messages in file_name.
  • The listing is useful for debugging purposes.
• Debugging Mode:
  • Specified as -g; generates symbol and line number information for general debugging purposes
• Profiling:
  • Specified as -p or -pg; generates monitoring information for runtime profiles
• List:
  • Specified as -qlist; output goes to a .list file.

Porting Issues for 32-bit to 64-bit Applications
• Note that -q64 implies creation of 64-bit object files, but only if the kernel allows 64-bit addressing. Currently the kernel is 32-bit and only allows 64-bit program calls.
• Because the size of long differs between 32-bit and 64-bit applications, interchangeable use of ints and longs will cause problems. The same is true of ints and pointers.
• Some general solutions:
  • Use the -qwarn64 option, which gives truncation warnings.
  • Avoid casting pointers to ints and vice versa.
  • Convert longs to ints if the range of values the variable will assume falls within the range of values for int.
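The truncation problem can be illustrated without a compiler by modelling what happens when a 64-bit value is stored in a 32-bit int. The sketch below assumes the usual 64-bit data model in which int stays 32 bits while long and pointers become 64 bits; the helper names and the sample address are made up for illustration.

```python
# Model of storing 64-bit values in a 32-bit int (illustrative only).
INT32_MIN = -2 ** 31
INT32_MAX = 2 ** 31 - 1


def fits_in_int32(value):
    """True if value can be held in a 32-bit int without loss."""
    return INT32_MIN <= value <= INT32_MAX


def truncate_to_int32(value):
    """Keep the low 32 bits and reinterpret them as a signed int."""
    low = value & 0xFFFFFFFF
    return low - 2 ** 32 if low >= 2 ** 31 else low


if __name__ == "__main__":
    pointer = 0x0000000110000A40      # hypothetical 64-bit address
    as_int = truncate_to_int32(pointer)
    print(f"pointer 0x{pointer:016X} stored in an int becomes 0x{as_int:08X}")

    for value in (100000, 2 ** 31, 5000000000):
        status = "safe as int" if fits_in_int32(value) else "needs long"
        print(value, "->", status, "| truncated:", truncate_to_int32(value))
```

Values that pass the range check are the ones the last bullet describes as safe to convert from long to int; the others silently change when truncated.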

POE: Parallel Interactive Execution
• Use to execute parallel jobs (n tasks).
• Jobs can be submitted interactively or to batch queues using poe.
• Invokes LoadLeveler (batch scheduling software) in either case.
• Options: command line arguments or environment variables (adapter specification, MPI parameters, number of nodes, …)
• Execution:
  • poe_command executable options
  • mp-compiled_executable options

POE: Parallel Interactive Execution (cont.)
• Example 1
  • mpxlf90 prog.f
  • poe a.out -shared_memory=true -nodes 1 -tasks_per_node 2 -rmpool 1
• Example 2
  • mpcc prog.c
  • setenv MP_INFO_LEVEL 2
  • setenv MP_SHARED_MEMORY true
  • a.out -nodes 1 -tasks_per_node 2 -rmpool 1

LoadLeveler Batch Facility
• Used to execute a batch parallel job
• POE options: use environment variables for LoadLeveler scheduling
  • Adapter Specification
  • MPI parameters
  • Number of Nodes
  • Class (Priority)
  • Consumable Resources

Parallel Batch Execution
MPI example across nodes
  #!/bin/csh
  …
  # @ resources = ConsumableCpus(1) ConsumableMemory(1500mb)
  # @ network.MPI = csss,not_shared,us
  # @ node = 2
  # @ tasks_per_node = 16
  # @ class = normal
  # @ queue
  setenv MP_SHARED_MEMORY true
  poe a.out
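Before submitting a script like the MPI example, it can help to check what the keywords add up to on each node. The sketch below assumes that the resources line is counted per task and that a compute node offers 16 CPUs and 32 GB (taken as 32 x 1024 MB); both assumptions should be checked against the LoadLeveler configuration, and the function names are illustrative.

```python
# Rough footprint check for a LoadLeveler job (assumptions noted above).
NODE_CPUS = 16             # assumption: one 16-way compute node
NODE_MEM_MB = 32 * 1024    # assumption: 32 GB per compute node


def job_footprint(nodes, tasks_per_node, cpus_per_task, mem_mb_per_task):
    """Total tasks plus per-node CPU and memory demand."""
    return {
        "total_tasks": nodes * tasks_per_node,
        "cpus_per_node": tasks_per_node * cpus_per_task,
        "mem_mb_per_node": tasks_per_node * mem_mb_per_task,
    }


def fits_on_node(footprint, node_cpus=NODE_CPUS, node_mem_mb=NODE_MEM_MB):
    """True if the per-node demand is within one node's CPUs and memory."""
    return (footprint["cpus_per_node"] <= node_cpus
            and footprint["mem_mb_per_node"] <= node_mem_mb)


if __name__ == "__main__":
    # Values from the MPI example: node = 2, tasks_per_node = 16,
    # ConsumableCpus(1), ConsumableMemory(1500mb).
    demand = job_footprint(2, 16, 1, 1500)
    print(demand)
    print("fits on a compute node:", fits_on_node(demand))
```

With the example's values the job runs 32 MPI tasks and asks each node for 16 CPUs and 24000 MB, which fits within a 32 GB node under these assumptions.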

Parallel Batch Execution (cont.)
OMP example on a node
  #!/bin/csh
  :
  #@ resources = ConsumableCpus(16), ConsumableMemory(1gb)
  #@ node = 1
  #@ tasks_per_node = 16
  #@ class = normal
  #@ queue
  a.out

Monitoring Jobs Using LoadLeveler
• llsubmit job – submits the “job” script to the batch queue and returns a job id/number.
  % llsubmit job
  llsubmit: The job "longhorn1.tacc.utexas.edu.2453" has been submitted.
• llq – monitors the queue
• llcancel 2453 – deletes my job from the queue
• llhold longhorn1.tacc.utexas.edu.2453 – places a hold on the job with that identifier
• llhold -r longhorn1.tacc.utexas.edu.2453 – releases the hold on the above job

Libraries
• ESSL (Engineering & Science Subroutine Lib.) v3.3
  • BLAS, Linear Algebra Solvers, FFTs, etc.
  • Sorting, Searching, Quadratures, Interpolation
  • -lessl or -lesslsmp or -lessl_r
• PESSL (Parallel ESSL) v2.3
  • Parallel BLAS, Subset of Level 2 & 3
  • Subset of ScaLAPACK (Dense, Banded, Sparse)
  • Eigensystem Solver and Singular Value Analysis
  • FFTs (2D and 3D)
  • -lpessl or -lpesslsmp or -lpessl_r

Libraries (cont.)
• MASS (Mathematical Acceleration SubSystem) v3.0
  • Subset of Fortran Intrinsics (C & Fortran callable)
  • Small Accuracy Loss
  • Faster; Vector Version Requires “vector expression”
  • 1.2 to 2 X faster for mass, 2 to 5 X for vector mass
  • -L/usr/local/apps/mass -lmass or -lmassvp4

Debugger
• dbx – command line symbolic debugger
• To use dbx, compile/link the program with the -g flag
  % xlf -g prog.f -o prog
• Run the program within the debugger to investigate
  % dbx prog
• Some basic commands for investigating the program
  (dbx) list line1
  (dbx) step n (steps through n lines of the program)
  (dbx) next n (advances n lines, stepping over function calls)
  (dbx) stop ([Variable] [at Line | in Function] [if Condition])
  (dbx) stop if y == x (break point condition)
  (dbx) stop in foo (break point inside function)
  (dbx) cont (continues execution until the next stopping point or finish)
  (dbx) status (lists all break points)
  (dbx) print x,y (prints the values of variables at the current point of execution)

Power4 Code Optimization

Power4 Code Optimization Outline
• General Optimization Advice
• Compiler Optimization
• Performance Libraries
• Tuning for the Power4 System
  • Memory subsystem
  • Floating point
  • I/O
