Cluster Computing

Kick-start seminar

16 December, 2009

High Performance Cluster Computing Centre (HPCCC)

Faculty of Science

Hong Kong Baptist University

Outline

Overview of Cluster Hardware and Software

Basic Login and Running Programs in a Job Queuing System

Introduction to Parallelism

Why Parallelism & Cluster Parallelism

Message Passing Interface

Parallel Program Examples

Policy for using sciblade.sci.hkbu.edu.hk

Contact

http://www.sci.hkbu.edu.hk/hpccc2009/sciblade

2

Cluster Hardware

This 256-node PC cluster (sciblade) consists of:

Master node x 2

IO nodes x 3 (storage)

Compute nodes x 256

Blade Chassis x 16

Management network

Interconnect fabric

1U console & KVM switch

Emerson Liebert Nxa 120k VA UPS

4

Sciblade Cluster

A 256-node cluster supported by funding from the RGC

5

Hardware Configuration

Master Node

Dell PE1950, 2x Xeon E5450 3.0GHz (Quad Core)

16GB RAM, 73GB x 2 SAS drive

IO nodes (Storage)

Dell PE2950, 2x Xeon E5450 3.0GHz (Quad Core)

16GB RAM, 73GB x 2 SAS drive

3TB storage Dell PE MD3000

Compute nodes x 256, each with:

Dell PE M600 blade server w/ Infiniband network

2x Xeon E5430 2.66GHz (Quad Core)

16GB RAM, 73GB SAS drive

6

Hardware Configuration

Blade Chassis x 16

Dell PE M1000e

Each hosts 16 blade servers

Management Network

Dell PowerConnect 6248 (Gigabit Ethernet) x 6

Interconnect fabric

Qlogic SilverStorm 9120 switch

Console and KVM switch

Dell AS-180 KVM

Dell 17FP Rack console

Emerson Liebert Nxa 120kVA UPS

7

Software List

Operating System

ROCKS 5.1 Cluster OS

CentOS 5.3 kernel 2.6.18

Job Management System

Portable Batch System

MAUI scheduler

Compilers, Languages

Intel Fortran/C/C++ Compiler for Linux V11

GNU 4.1.2/4.4.0 Fortran/C/C++ Compiler

8

Software List

Message Passing Interface (MPI) Libraries

MVAPICH 1.1

MVAPICH2 1.2

Open MPI 1.3.2

Mathematical libraries

ATLAS 3.8.3

FFTW 2.1.5/3.2.1

SPRNG 2.0a (C/Fortran)

9

Software List
  • Molecular Dynamics & Quantum Chemistry
    • Gromacs 4.0.5
    • Gamess 2009R1
    • Namd 2.7b1
  • Third-party Applications
    • GROMACS, NAMD
    • MATLAB 2008b
    • TAU 2.18.2, VisIt 1.11.2
    • VMD 1.8.6, Xmgrace 5.1.22
    • etc.

10

Software List

Queuing system

Torque/PBS

Maui scheduler

Editors

vi

emacs

11

Hostnames

Master node

External : sciblade.sci.hkbu.edu.hk

Internal : frontend-0

IO nodes (storage)

pvfs2-io-0-0, pvfs2-io-0-1, pvfs-io-0-2

Compute nodes

compute-0-0.local, …, compute-0-255.local

12

Basic login

Remote login to the master node

Terminal login

using secure shell

ssh -l username sciblade.sci.hkbu.edu.hk

Graphical login

PuTTY & vncviewer, e.g.

[username@sciblade]$ vncserver

New 'sciblade.sci.hkbu.edu.hk:3 (username)' desktop is sciblade.sci.hkbu.edu.hk:3

This means that your session will run on display 3.

14

Graphical login

Using PuTTY to set up a secure connection: Host Name = sciblade.sci.hkbu.edu.hk

15

Graphical login (con’t)

ssh protocol version

16

Graphical login (con’t)

Port = 5900 + display number (i.e. 3 in this case)

17

Graphical login (con’t)

Next, click Open and log in to sciblade.

Finally, run VNC Viewer on your PC and enter "localhost:3" (3 is the display number).

You should terminate your VNC session after you have finished your work. To terminate the VNC session running on sciblade, run the command

[username@sciblade]$ vncserver -kill :3

18

Linux commands

Both the master and compute nodes are installed with Linux.

Frequently used Linux commands on the PC cluster: http://www.sci.hkbu.edu.hk/hpccc2009/sciblade/faq_sciblade.php

19

ROCKS specific commands

ROCKS provides the following commands for running programs on all compute nodes, e.g.

cluster-fork

Run a program on all compute nodes

cluster-fork ps

Check user processes on each compute node

cluster-kill

Kill user processes on all nodes at once

tentakel

Similar to cluster-fork, but runs faster

20

Ganglia
  • Web-based management and monitoring
  • http://sciblade.sci.hkbu.edu.hk/ganglia

21

Why Parallelism – Passively

Suppose you are using the most efficient algorithm with an optimal implementation, but the program still takes too long or does not even fit into your machine's memory.

Parallelization is then the last resort.

23

Why Parallelism – Initiative

Faster

Finish the work earlier

Same work in shorter time

Do more work

More work in the same time

Most importantly, you want to predict the result before the event occurs

24

Examples

Many scientific and engineering problems require enormous computational power.

The following are a few such fields:

Quantum chemistry, statistical mechanics, and relativistic physics

Cosmology and astrophysics

Computational fluid dynamics and turbulence

Material design and superconductivity

Biology, pharmacology, genome sequencing, genetic engineering, protein folding, enzyme activity, and cell modeling

Medicine, and modeling of human organs and bones

Global weather and environmental modeling

Machine Vision

25

Parallelism

The upper bound on the computing power that can be obtained from a single processor is limited by the fastest processor available at any given time.

The upper bound for the computing power available can be dramatically increased by integrating a set of processors together.

Synchronization and exchange of partial results among processors are therefore unavoidable.

26

Multiprocessing Clustering

Parallel Computer Architecture:

Distributed Memory: Cluster

Shared Memory: Symmetric multiprocessors (SMP)

[Figure: block diagrams of the two architectures. In the distributed-memory cluster, each node has its own CPU and local memory (LM), and nodes exchange data streams (DS) over an interconnecting network. In the shared-memory SMP, processing units (PU) with their control units (CU) receive instruction streams (IS) and data streams (DS) from a single shared memory.]

27

Clustering: Pros and Cons

Advantages

Memory scalable to number of processors.

∴ Increasing the number of processors increases the total memory size and bandwidth as well.

Each processor can rapidly access its own memory without interference

Disadvantages

Difficult to map existing data structures to this memory organization

User is responsible for sending and receiving data among processors

28

Parallel Programming Paradigm
  • Multithreading (shared memory only)
    • OpenMP
  • Message Passing (shared memory or distributed memory)
    • MPI (Message Passing Interface)
    • PVM (Parallel Virtual Machine)

31

Distributed Memory

Programmer's view:

Several CPUs

Several block of memory

Several threads of action

Parallelization

Done by hand

Example

MPI

[Figure: a serial program (parts P1, P2, P3) decomposed across processes 0, 1 and 2, which exchange data via message passing over the interconnection network, shown along a time axis.]

32

Message Passing Model

[Figure: as on the previous slide, a serial process (P1, P2, P3) is split across processes 0, 1 and 2 that exchange data by message passing as time progresses.]

Process

A process is a set of executable instructions (a program) that runs on a processor. Message passing systems generally associate only one process with each processor, so the terms "process" and "processor" are used interchangeably.

Message Passing

The method by which data from one processor's memory is copied to the memory of another processor.

33
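To make the model concrete, here is a minimal sketch (not part of the seminar handout) that copies one integer from the memory of process 0 to the memory of process 1 with MPI_Send and MPI_Recv; run it with at least two processes.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, value;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;    /* data living in process 0's memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* copy the data into process 1's memory */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
        printf("Process 1 received %d from process 0\n", value);
    }

    MPI_Finalize();
    return 0;
}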

MPI

MPI is a library, not a language, for parallel programming

An MPI implementation consists of

a subroutine library with all MPI functions

include files for the calling application program

some startup script (usually called mpirun, but not standardized)

Include the header file mpi.h (or its equivalent) in the source code

Libraries available for all major imperative languages (C, C++, Fortran …)

35

General MPI Program Structure

/* MPI include file */
#include <mpi.h>

int main(int argc, char *argv[])
{
    /* variable declarations */
    int np, rank, ierr;

    /* Initialize MPI environment */
    ierr = MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &np);

    /* Do work and make message passing calls */

    /* Terminate MPI environment */
    ierr = MPI_Finalize();
    return 0;
}

36

Sample Program: Hello World!

In this modified version of the "Hello World" program, each processor prints its rank as well as the total number of processors in the communicator MPI_COMM_WORLD.

Notes:

Makes use of the pre-defined communicator MPI_COMM_WORLD.

Not testing for error status of routines!

37

Sample Program: Hello World!

#include <stdio.h>
#include "mpi.h"    // MPI header file

int main(int argc, char **argv)
{
    int nproc, myrank, ierr;

    ierr = MPI_Init(&argc, &argv);    // MPI initialization

    // Get number of MPI processes
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    // Get process id for this processor
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    printf("Hello World!! I'm process %d of %d\n", myrank, nproc);

    ierr = MPI_Finalize();    // Terminate all MPI processes
    return 0;
}

38

Performance

When we write a parallel program, it is important to identify the fraction of the program that can be parallelized and to maximize it.

The goals are:

load balance

memory usage balance

minimize communication overhead

reduce sequential bottlenecks

scalability

39

Compiling & Running MPI Programs
  • Using mvapich 1.1
  • Set the path; at the command prompt, type:

export PATH=/u1/local/mvapich1/bin:$PATH

(uncomment this line in .bashrc)

  • Compile using mpicc, mpiCC, mpif77 or mpif90, e.g.

mpicc -o cpi cpi.c

  • Prepare a hostfile (e.g. machines) listing the compute nodes, one per line:

compute-0-0
compute-0-1
compute-0-2
compute-0-3

  • Run the program on a number of compute nodes:

mpirun -np 4 -machinefile machines ./cpi

40

Compiling & Running MPI Programs
  • Using mvapich2 1.2
  • Prepare .mpd.conf and .mpd.passwd, saved in your home directory, containing:

MPD_SECRETWORD=gde1234-3

(you may set your own secret word)

  • Set the environment for mvapich2 1.2:

export MPD_BIN=/u1/local/mvapich2
export PATH=$MPD_BIN:$PATH

(uncomment these lines in .bashrc)

  • Compile using mpicc, mpiCC, mpif77 or mpif90, e.g.

mpicc -o cpi cpi.c

  • Prepare a hostfile (e.g. machines) with one hostname per line, as in the previous section

41

Compiling & Running MPI Programs
  • Boot the MPD ring with mpdboot and the hostfile:

mpdboot -n 4 -f machines

  • Run the program with a number of processes:

mpiexec -np 4 ./cpi

  • Remember to clean up after running jobs with mpdallexit:

mpdallexit

42

Compiling & Running MPI Programs
  • Using openmpi 1.3.2
  • Set the environment for openmpi:

export LD_LIBRARY_PATH=/u1/local/openmpi/lib:$LD_LIBRARY_PATH
export PATH=/u1/local/openmpi/bin:$PATH

(uncomment these lines in .bashrc)

  • Compile using mpicc, mpiCC, mpif77 or mpif90, e.g.

mpicc -o cpi cpi.c

  • Prepare a hostfile (e.g. machines) with one hostname per line, as in the previous section
  • Run the program on a number of compute nodes:

mpirun -np 4 -machinefile machines ./cpi

43

Submit parallel jobs to the Torque batch queue
    • Prepare a job script (name it runcpi)
    • For a program compiled with mvapich 1.1
    • For a program compiled with mvapich2 1.2
    • For a program compiled with openmpi
    • For GROMACS

(refer to your handout for the detailed scripts)

  • Submit the script (e.g. gromacs.pbs) with:

qsub gromacs.pbs

44

Example 1: Matrix-vector Multiplication

The figure below demonstrates schematically how a matrix-vector multiplication, A=B*C, can be decomposed into four independent computations involving a scalar multiplying a column vector.

This approach is different from that which is usually taught in a linear algebra course because this decomposition lends itself better to parallelization.

These computations are independent and require no communication; communication is what usually reduces the performance of parallel code.

46

Example 1: Matrix-vector Multiplication (Column wise)

Figure 1: Schematic of the parallel decomposition for matrix-vector multiplication, A=B*C. The vector A is depicted in yellow. The matrix B and vector C are depicted in multiple colors representing the portions (columns and elements, respectively) assigned to each processor.

47

Example 1: MV Multiplication

[Figure: A = B*C computed in parallel. Each process P0 to P3 multiplies its block of columns of B by the corresponding elements of C, and the partial result vectors are combined with a reduction (SUM).]

48
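The seminar's own source for this example is not reproduced in the transcript; the following is a minimal sketch of the column-wise decomposition and reduction described above, assuming the matrix size N is divisible by the number of MPI processes and filling B and C with test values on every process.

#include <stdio.h>
#include <mpi.h>

#define N 8    /* matrix dimension (chosen for illustration) */

int main(int argc, char *argv[])
{
    double B[N][N], C[N], partial[N], A[N];
    int np, rank, cols, first, i, j;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Fill B and C with test values on every process */
    for (i = 0; i < N; i++) {
        C[i] = 1.0;
        for (j = 0; j < N; j++)
            B[i][j] = i + j;
    }

    /* Each process handles its own block of columns: partial += C[j]*B[:,j] */
    cols = N / np;
    first = rank * cols;
    for (i = 0; i < N; i++) {
        partial[i] = 0.0;
        for (j = first; j < first + cols; j++)
            partial[i] += B[i][j] * C[j];
    }

    /* Combine the partial vectors on process 0 (the reduction step) */
    MPI_Reduce(partial, A, N, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        for (i = 0; i < N; i++)
            printf("A[%d] = %g\n", i, A[i]);

    MPI_Finalize();
    return 0;
}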

Example 2: Quick Sort

The quick sort is an in-place, divide-and-conquer, massively recursive sort.

The efficiency of the algorithm is strongly affected by which element is chosen as the pivot point.

The worst-case efficiency of the quick sort, O(n²), occurs when the list is already sorted and the left-most element is chosen as the pivot.

If the data to be sorted isn't random, randomly choosing a pivot point is recommended. As long as the pivot point is chosen randomly, the quick sort has an expected complexity of O(n log n).

Pros: Extremely fast.

Cons: Very complex algorithm, massively recursive.

49
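The course's sorting/qsort.c is not included in the transcript; below is a minimal serial sketch of a quick sort with a randomly chosen pivot, as recommended above (Lomuto-style partitioning, hypothetical test data).

#include <stdio.h>
#include <stdlib.h>

/* In-place quick sort of a[lo..hi] using a randomly chosen pivot */
static void quicksort(int *a, int lo, int hi)
{
    int p, pivot, i, j, tmp;

    if (lo >= hi)
        return;

    /* pick a random pivot and move it to the end of the range */
    p = lo + rand() % (hi - lo + 1);
    tmp = a[p]; a[p] = a[hi]; a[hi] = tmp;

    /* partition: elements smaller than the pivot go to the front */
    pivot = a[hi];
    i = lo;
    for (j = lo; j < hi; j++) {
        if (a[j] < pivot) {
            tmp = a[i]; a[i] = a[j]; a[j] = tmp;
            i++;
        }
    }
    tmp = a[i]; a[i] = a[hi]; a[hi] = tmp;   /* place the pivot */

    quicksort(a, lo, i - 1);                 /* recurse on both halves */
    quicksort(a, i + 1, hi);
}

int main(void)
{
    int a[] = { 5, 2, 9, 1, 7, 3, 8, 4 };
    int n = sizeof a / sizeof a[0];
    int i;

    quicksort(a, 0, n - 1);
    for (i = 0; i < n; i++)
        printf("%d ", a[i]);
    printf("\n");
    return 0;
}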

Example 3: Bubble Sort

The bubble sort is the oldest and simplest sort in use. Unfortunately, it's also the slowest.

The bubble sort works by comparing each item in the list with the item next to it, and swapping them if required.

The algorithm repeats this process until it makes a pass all the way through the list without swapping any items (in other words, all items are in the correct order).

This causes larger values to "bubble" to the end of the list while smaller values "sink" towards the beginning of the list.

51
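Again, the course's sorting/bubblesort.c is not reproduced here; the sketch below illustrates the pass-until-no-swaps logic just described, on hypothetical test data.

#include <stdio.h>

/* Bubble sort: repeatedly swap adjacent out-of-order items until a clean pass */
static void bubblesort(int *a, int n)
{
    int i, swapped, tmp;

    do {
        swapped = 0;
        for (i = 0; i < n - 1; i++) {
            if (a[i] > a[i + 1]) {    /* swap adjacent items out of order */
                tmp = a[i]; a[i] = a[i + 1]; a[i + 1] = tmp;
                swapped = 1;
            }
        }
    } while (swapped);                /* stop after a pass with no swaps */
}

int main(void)
{
    int a[] = { 5, 2, 9, 1, 7, 3 };
    int n = sizeof a / sizeof a[0];
    int i;

    bubblesort(a, n);
    for (i = 0; i < n; i++)
        printf("%d ", a[i]);
    printf("\n");
    return 0;
}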

Monte Carlo Integration

"Hit and miss" integration

The integration scheme is to take a large number of random points and count the number that fall under f(x) to estimate the area

53

SPRNG (Scalable Parallel Random Number Generators )

Monte Carlo Integration to Estimate Pi

for (i = 0; i < n; i++) {
    x = sprng();
    y = sprng();
    if ((x*x + y*y) < 1)
        in++;
}
pi = ((double)4.0)*in/n;

SPRNG is used to generate different sets of random numbers on different compute nodes while running in parallel

54
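For a self-contained picture of the whole computation, here is a minimal MPI sketch of the same estimate. It uses drand48 seeded per rank as a stand-in for SPRNG (which needs its own headers and initialization), so it illustrates the method rather than the centre's SPRNG example.

#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    long n = 1000000, i, in = 0, total_in = 0;
    int np, rank;
    double x, y, pi;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &np);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* stand-in for SPRNG: give each rank a different random stream */
    srand48(12345 + rank);

    /* each rank throws n darts at the unit square */
    for (i = 0; i < n; i++) {
        x = drand48();
        y = drand48();
        if (x * x + y * y < 1.0)
            in++;
    }

    /* combine the hit counts from all ranks on process 0 */
    MPI_Reduce(&in, &total_in, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0) {
        pi = 4.0 * (double)total_in / ((double)n * np);
        printf("Estimated pi = %f\n", pi);
    }

    MPI_Finalize();
    return 0;
}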


Example 1: Ring

ring/ring.c
ring/Makefile
ring/machines

Compile the program by the command: make

Run the program in parallel by

mpirun -np 4 -machinefile machines ./ring < in

[Figure: four processes (0, 1, 2, 3) passing messages around a ring.]

Example 2: Prime

prime/prime.c
prime/prime.f90
prime/primeParallel.c
prime/Makefile
prime/machines

Compile by the command: make

Run the serial program by: ./prime or ./primef

Run the parallel program by

mpirun -np 4 -machinefile machines ./primeParallel

Example 3: Sorting

sorting/qsort.c
sorting/bubblesort.c
sorting/script.sh
sorting/qsort
sorting/bubblesort

Submit the job to the PBS queuing system by

qsub script.sh

Example 4: mcPi

mcPi/mcPi.c
mcPi/mc-Pi-mpi.c
mcPi/Makefile
mcPi/QmcPi.pbs

Compile by the command: make

Run the serial program by: ./mcPi ##

Submit the job to the PBS queuing system by

qsub QmcPi.pbs

55

Policy
  • Every user shall apply for his/her own computer user account to log in to the master node of the PC cluster, sciblade.sci.hkbu.edu.hk.
  • Users must not share their account and password with other users.
  • Every user must submit jobs to the PC cluster from the master node via the PBS job queuing system. Automatic dispatching of jobs using scripts or robots is not allowed.
  • Users are not allowed to log in to the compute nodes.
  • Foreground jobs on the PC cluster are restricted to program testing, and their duration should not exceed 1 minute of CPU time per job.
Policy (continued)
  • Any background jobs run on the master node or compute nodes are strictly prohibited and will be killed without prior notice.
  • The current restrictions of the job queuing system are as follows:
    • The maximum number of running jobs in the job queue is 8.
    • The maximum total number of CPU cores in use at any one time cannot exceed 1024.
  • These restrictions will be reviewed from time to time as the number of users and the computational needs grow.
Contacts
  • Discussion mailing list
    • hpccc_hkbu@googlegroups.com
  • Technical Support
    • hpccc@sci.hkbu.edu.hk