
Unix Clusters



  1. Guntis Barzdins Girts Folkmanis Leo Truksans Unix Clusters

  2. Homework #3 • Task 1 (max 8 points) • A. Set up an environment for compiling and running MPI parallel programs on a UNIX base (see the lecture slides; note that on MacOS X, MPI is partially built in). • B. Write a small MPI program that performs CPU-intensive computations. For example, write a program that finds an ASCII string of fewer than 6 characters whose MD5 is 66d9978935150b34b9dc0741bc642be2, or that solves some other task of at least similar complexity. Write the program so that the number of parallel processes can be varied. • C. Run the MPI program with different numbers of parallel processes on a computer with at least 2 physical processors, or on a cluster. Compare the program's execution time using different numbers of processors (at least 1 CPU and 2 CPUs). • Task 2 (together with Task 1, max 9 points) • Install (including basic configuration) some network server in a UNIX environment (mail server, web portal, WebMail, DNS server, proxy server, application access via an inetd server, etc.). It is sufficient if the application works from localhost. • Task 3 (any subtask, together with Tasks 1 and 2, max 10 points) • A. Run your own MPI cluster on several physical computers, or run the task in the EGEE Grid or another public cluster environment, • B. Use the graphics card's parallel processors for fast execution of large computations, • C. Implement in MPI some practically useful task that cannot simply be split into N independent subtasks and therefore requires intensive MPI communication between the parallel processes. • D. Some other non-trivial extension of Task 1 or 2 • E. Compare parallel execution efficiency in MPI, Pthreads, and OpenMP environments.
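
For Task 1B, a minimal sketch of how the candidate space can be divided among a variable number of MPI processes. This is an illustration only, not part of the assignment text: the CHARSET, MAXLEN and the matches_target() placeholder are assumptions, and matches_target() stands in for a real MD5 comparison (e.g. via a crypto library).

/* Sketch: brute-force search split across ranks by striding through the
 * candidate index space. matches_target() is a hypothetical placeholder. */
#include <stdio.h>
#include <string.h>
#include <mpi.h>

#define CHARSET "abcdefghijklmnopqrstuvwxyz0123456789"  /* assumed alphabet */
#define MAXLEN  5                       /* strings of fewer than 6 symbols */

static int matches_target(const char *candidate)
{
    /* placeholder: compute MD5(candidate) and compare with the target hash */
    return 0;
}

/* Decode candidate number 'index' (bijective base-N) into a string. */
static int decode(unsigned long long index, char *out)
{
    int base = (int)strlen(CHARSET);
    int len = 0;
    while (index > 0 && len < MAXLEN) {
        out[len++] = CHARSET[(index - 1) % base];
        index = (index - 1) / base;
    }
    out[len] = '\0';
    return len;
}

int main(int argc, char **argv)
{
    int rank, size, k;
    unsigned long long i, total = 0, mult, base = strlen(CHARSET);
    char candidate[MAXLEN + 1];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    for (k = 1, mult = base; k <= MAXLEN; k++) {   /* count all strings of length 1..MAXLEN */
        total += mult;
        mult *= base;
    }

    /* Each rank checks candidates rank+1, rank+1+size, rank+1+2*size, ... */
    for (i = 1 + rank; i <= total; i += size) {
        decode(i, candidate);
        if (matches_target(candidate))
            printf("Rank %d found: %s\n", rank, candidate);
    }

    MPI_Finalize();
    return 0;
}

The stride-by-rank partition means the same binary works for any -np value; only the set of indices each process examines changes. A real solution would also broadcast a "found" flag so the other ranks can stop early, and would time the run with MPI_Wtime() for part C.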

  3. Moore’s Law - Density

  4. Moore's Law and Performance The performance of computers is determined by architecture and clock speed. Clock speed doubles over a 3-year period due to on-chip scaling laws. Processors using identical or similar architectures gain performance directly as a function of Moore's Law. Improvements in internal architecture can yield gains beyond those predicted by Moore's Law alone.

  5. Moore’s Law Data

  6. Future of Moore’s Law • Short-Term (1-5 years) • Will operate (due to prototypes in lab) • Fabrication cost will go up rapidly • Medium-Term (5-15 years) • Exponential growth rate will likely slow • Trillion-dollar industry is motivated • Long-Term (>15 years) • May need new technology (chemical or quantum) • We can do better (e.g., human brain) • I would not close the patent office

  7. Different kinds of Clusters High Performance Computing (HPC) Cluster Load Balancing (LB) Cluster High Availability (HA) Cluster

  8. High Performance Computing (HPC) Cluster (Beowulf Grid) Started in 1994, when Donald Becker of NASA assembled the world's first cluster with 16 DX4 PCs and 10 Mb/s Ethernet. Also called a Beowulf cluster. Built from commodity off-the-shelf hardware. Applications include data mining, simulations, parallel processing, weather modelling, computer graphics rendering, etc.

  9. Software Building clusters is straightforward, but managing their software can be complex. Oscar (Open Source Cluster Application Resources) Scyld (scyld.com) – from the scientists of NASA, a commercial "true Beowulf in a box" Beowulf operating system Rocks OpenSCE WareWulf Clustermatic Condor -- grid UniCore -- grid gLite -- grid

  10. Examples of Beowulf cluster Scyld Cluster O.S. by Donald Becker http://www.scyld.com ROCKS from NPACI http://www.rocksclusters.org OSCAR from open cluster group http://oscar.sourceforge.net OpenSCE from Thailand http://www.opensce.org

  11. Cluster Sizing Rule of Thumb System software (Linux, MPI, filesystems, etc.) scales from 64 nodes to at most 2048 nodes for most HPC applications. Limiting factors: max socket connections, direct-access message tag lists & buffers, NFS / storage system clients, debugging, etc. It is probably hard to rewrite MPI and all Linux system software for O(100,000)-node clusters.

  12. HPC Clusters

  13. OSCAR http://oscar.openclustergroup.org/ Cluster on a CD – automates the cluster install process. Wizard driven. Can be installed on any Linux system that supports RPMs. Components: open source with a BSD-style license.

  14. Rocks Award-winning open source high-performance Linux cluster solution. The current release of NPACI Rocks is 3.3.0. Rocks is built on top of RedHat Linux releases. Two types of nodes: Frontend (two Ethernet interfaces, lots of disk space) and Compute (a disk drive for caching the base operating environment, i.e. the OS and libraries). Rocks uses an SQL database to store global configuration variables.

  15. Rocks physical structure

  16. Rocks frontend installation 3 CDs – Rocks Base CD, HPC Roll CD and Kernel Roll CD; the Base CD is bootable. User-friendly wizard-mode installation asks about cluster information, local hardware, and both Ethernet interfaces.

  17. Rocks frontend installation

  18. Rocks compute nodes Installation: log in to the frontend as root and run insert-ethers, a program which captures compute node DHCP requests and puts their information into the Rocks MySQL database.

  19. Rocks compute nodes Use install CD to boot the compute nodes Insert-ethers also authenticates compute nodes. When insert-ethers is running, the frontend will accept new nodes into the cluster based on their presence on the local network. Insert-ethers must continue to run until a new node requests its kickstart file, and will ask you to wait until that event occurs. When insert-ethers is off, no unknown node may obtain a kickstart file.

  20. Rocks compute nodes Monitor the installation; when it finishes, do the next node. Nodes are divided into cabinets. It is possible to install compute nodes of a different architecture than the frontend.

  21. Rocks computing Mpirun on Rocks clusters is used to launch jobs that are linked with the Ethernet device for MPICH. "mpirun" is a shell script that attempts to hide from the user the differences in starting jobs for various devices. On workstation clusters, you must supply a file that lists the different machines that mpirun can use to run remote jobs. MPICH is an implementation of MPI, the standard for message-passing libraries.

  22. Rocks computing example High-Performance Linpack (HPL) is a software package that solves a (random) dense linear system in double precision (64-bit) arithmetic on distributed-memory computers. Launch HPL on two processors: Create a file in your home directory named machines, and put two entries in it, such as: compute-0-0 compute-0-1 Download the two-processor HPL configuration file and save it as HPL.dat in your home directory. Now launch the job from the frontend: $ /opt/mpich/gnu/bin/mpirun -nolocal -np 2 -machinefile machines /opt/hpl/gnu/bin/xhpl

  23. Rocks cluster-fork Runs the same standard Unix command on the different nodes. By default, cluster-fork uses a simple series of ssh connections to launch the task serially on every compute node in the cluster. My processes on all nodes: $ cluster-fork ps -U$USER Hostnames of all nodes: $ cluster-fork hostname

  24. Rocks cluster-fork again Often you wish to name the nodes your job is started on $ cluster-fork --query="select name from nodes where name like 'compute-1-%'" [cmd] Or use --nodes=compute-0-%d:0-2

  25. Monitoring Rocks A set of web pages to monitor activities and configuration. The Apache web server accepts access from the internal network only. From outside, viewing the pages means sending the browser screen over a secure, encrypted SSH channel: ssh to the frontend, start Mozilla there with mozilla --no-remote, and browse http://localhost. Access from the public network (not recommended) requires modifying iptables.

  26. Rocks monitoring pages Should look like this:

  27. More monitoring through web Access through web pages includes PHPMyAdmin for SQL server TOP command for cluster Graphical monitoring of cluster Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids.

  28. Ganglia looks like this:

  29. Default services 411 Secure Information Service – distributes files to the nodes (password changes, login files); runs from cron every hour. DNS for local communication. Postfix mail software.

  30. MPI MPI is a software system that allows you to write message-passing parallel programs, in Fortran and C, that run on a cluster. MPI (Message Passing Interface) is a de facto standard for portable message-passing parallel programs, standardized by the MPI Forum and available on all massively parallel supercomputers.

  31. Parallel programming Mind that memory is distributed – each node has its own memory space. Decomposition – divide large problems into smaller ones. Use mpi.h for C programs.
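
A minimal sketch of such a decomposition (illustrative, not from the slides): a global sum over 1..N is split into one block per rank, each block living in that rank's own memory, and the partial results are combined with a single collective call.

/* Sketch: block decomposition of a sum over 1..N across MPI ranks. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    const long N = 100000000L;         /* size of the global problem */
    long i, lo, hi, chunk;
    double local = 0.0, global = 0.0;
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Decomposition: split the index range into one block per process. */
    chunk = (N + size - 1) / size;
    lo = (long)rank * chunk + 1;
    hi = lo + chunk - 1;
    if (hi > N) hi = N;

    for (i = lo; i <= hi; i++)
        local += (double)i;            /* CPU work on this rank's block only */

    /* Combine the distributed partial sums on rank 0. */
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum(1..%ld) = %.0f\n", N, global);

    MPI_Finalize();
    return 0;
}

Because each rank only ever touches its own block, no data needs to be exchanged until the final MPI_Reduce.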

  32. Message passing A message-passing program consists of multiple instances of a serial program that communicate by library calls. These calls may be roughly divided into four classes: calls used to initialise, manage and finally terminate communication; calls used to communicate between pairs of processors; calls that perform operations among a group of processors; calls used to create arbitrary data types.
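
As a hedged illustration of the four classes (not from the slides), the fragment below uses one representative call from each: MPI_Init/MPI_Finalize (class 1), MPI_Send/MPI_Recv (class 2), MPI_Bcast (class 3), and MPI_Type_contiguous (class 4).

/* Sketch: one representative MPI call per class listed above. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size, token = 0;
    double triple[3] = {0.0, 0.0, 0.0};
    MPI_Datatype vec3;
    MPI_Status status;

    MPI_Init(&argc, &argv);                      /* class 1: initialise */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* class 4: create an arbitrary (derived) data type of 3 doubles */
    MPI_Type_contiguous(3, MPI_DOUBLE, &vec3);
    MPI_Type_commit(&vec3);

    /* class 3: an operation among the whole group of processors */
    if (rank == 0) { triple[0] = 1.0; triple[1] = 2.0; triple[2] = 3.0; }
    MPI_Bcast(triple, 1, vec3, 0, MPI_COMM_WORLD);

    /* class 2: pairwise communication between two processors */
    if (size > 1) {
        if (rank == 0) {
            token = 42;
            MPI_Send(&token, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&token, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &status);
            printf("rank 1 received token %d and triple[2]=%.1f\n",
                   token, triple[2]);
        }
    }

    MPI_Type_free(&vec3);
    MPI_Finalize();                              /* class 1: terminate */
    return 0;
}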

  33. Helloworld.c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello world! I am %d of %d\n", rank, size);
    MPI_Finalize();
    return(0);
}

  34. Communication
/*
 * The root node sends out a message to the next node in the ring and
 * each node then passes the message along to the next node. The root
 * node times how long it takes for the message to get back to it.
 */
#include <stdio.h>   /* for input/output */
#include <mpi.h>     /* for mpi routines */

#define BUFSIZE 64   /* The size of the message being passed */

int main(int argc, char **argv)
{
    double start, finish;
    int my_rank;          /* the rank of this process */
    int n_processes;      /* the total number of processes */
    char buf[BUFSIZE];    /* a buffer for the message */
    int tag = 0;          /* not important here */
    MPI_Status status;    /* not important here */

    MPI_Init(&argc, &argv);                       /* Initializing mpi */
    MPI_Comm_size(MPI_COMM_WORLD, &n_processes);  /* Getting # of processes */
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);      /* Getting my rank */

  35. Communication again
    /*
     * If this process is the root process, send a message to the next node
     * and wait to receive one from the last node. Time how long it takes
     * for the message to get around the ring. If this process is not the
     * root node, wait to receive a message from the previous node and
     * then send it to the next node.
     */
    start = MPI_Wtime();
    printf("Hello world! I am %d of %d\n", my_rank, n_processes);
    if (my_rank == 0) {
        /* send to the next node */
        MPI_Send(buf, BUFSIZE, MPI_CHAR, my_rank+1, tag, MPI_COMM_WORLD);
        /* receive from the last node */
        MPI_Recv(buf, BUFSIZE, MPI_CHAR, n_processes-1, tag, MPI_COMM_WORLD, &status);
    }

  36. Even more of communication
    if (my_rank != 0) {
        /* receive from the previous node */
        MPI_Recv(buf, BUFSIZE, MPI_CHAR, my_rank-1, tag, MPI_COMM_WORLD, &status);
        /* send to the next node */
        MPI_Send(buf, BUFSIZE, MPI_CHAR, (my_rank+1)%n_processes, tag, MPI_COMM_WORLD);
    }
    finish = MPI_Wtime();
    MPI_Finalize();   /* I’m done with mpi stuff */

    /* Print out the results. */
    if (my_rank == 0) {
        printf("Total time used was %f seconds\n", finish-start);
    }
    return 0;
}

  37. Compiling
Compile code using mpicc – the MPI C compiler:
/u1/local/mpich-pgi/bin/mpicc -o helloworld2 helloworld2.c
Run using mpirun.

  38. How to run MPI
[guntisb@zars mpi]$ ls -l
total 392
-rw-rw-r-- 1 guntisb guntisb    122 Apr 28 07:08 Makefile
-rw-rw-r-- 1 guntisb guntisb     13 May 17 14:33 mfile
-rw-rw-r-- 1 guntisb guntisb    344 May 12 09:28 mpi.jdl
-rw-rw-r-- 1 guntisb guntisb   2508 Apr 28 07:08 mpi.sh
-rwxrwxr-x 1 guntisb guntisb 331899 May 17 14:48 passtonext
-rw-rw-r-- 1 guntisb guntisb   3408 Apr 28 07:08 passtonext.c
-rw-rw-r-- 1 guntisb guntisb   2132 May 17 14:48 passtonext.o
[guntisb@zars mpi]$ more mfile
localhost:4
[guntisb@zars mpi]$
[guntisb@zars mpi]$ make
mpicc passtonext.c -o passtonext -lmpich -lm
[guntisb@zars mpi]$ mpirun -np 2 -machinefile mfile passtonext
guntisb@localhost's password:
Nodename=zars.latnet.lv Rank=0 Size=2
INFO: zars.latnet.lv (0 of 2) sent 73 value to 1 of 2
INFO: zars.latnet.lv (0 of 2) received 74 value from 1 of 2
Nodename=zars.latnet.lv Rank=1 Size=2
INFO: zars.latnet.lv (1 of 2) received 73 value from 0 of 2
INFO: zars.latnet.lv (1 of 2) sent 73+1=74 value to 0 of 2
[guntisb@zars mpi]$
Thanks to Jānis Tragheims!!

  39. HPC Cluster and parallel computing applications Message Passing Interface MPICH (http://www-unix.mcs.anl.gov/mpi/mpich/) LAM/MPI (http://lam-mpi.org) Mathematical fftw (fast Fourier transform) pblas (parallel basic linear algebra software) atlas (a collection of mathematical libraries) sprng (scalable parallel random number generator) MPITB -- MPI toolbox for MATLAB Quantum Chemistry software gaussian, qchem Molecular Dynamics solvers NAMD, gromacs, gamess Weather modelling MM5 (http://www.mmm.ucar.edu/mm5/mm5-home.html)

  40. The Success of MPI Applications Most recent Gordon Bell prize winners use MPI 26TF Climate simulation on Earth Simulator, 16TF DNS Libraries Growing collection of powerful software components MPI programs with no MPI calls (all in libraries) Tools Performance tracing (Vampir, Jumpshot, etc.) Debugging (Totalview, etc.) Intel MPI: http://www.intel.com/cd/software/products/asmo-na/eng/308295.htm Results Papers: http://www.mcs.anl.gov/mpi/papers Beowulf Ubiquitous parallel computing Grids MPICH-G2 (MPICH over Globus, http://www3.niu.edu/mpi)

  41. POSIX Thread (pthread) • The POSIX thread libraries are a standards-based thread API for C/C++. They allow one to spawn a new concurrent flow of control. They are most effective on multi-processor or multi-core systems, where the flow can be scheduled to run on another processor, thus gaining speed through parallel or distributed processing. Threads require less overhead than "forking" or spawning a new process because the system does not initialize a new virtual memory space and environment for the process.

  42. Pthreads example
#include <pthread.h>
#include <stdio.h>

void *entry_point(void *arg)
{
    printf("Hello world!\n");
    return NULL;
}

int main(int argc, char **argv)
{
    pthread_t thr;

    if (pthread_create(&thr, NULL, &entry_point, NULL)) {
        printf("Could not create thread\n");
        return -1;
    }
    if (pthread_join(thr, NULL)) {
        printf("Could not join thread\n");
        return -1;
    }
    return 0;
}
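
The example above starts a single thread and passes it no data; a slightly fuller sketch (illustrative, not from the slides) gives each thread its own argument and joins them all. Compile with cc -pthread.

/* Sketch: several threads, each given its own argument via pthread_create. */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static void *worker(void *arg)
{
    int id = *(int *)arg;              /* each thread reads its own slot */
    printf("worker %d running\n", id);
    return NULL;
}

int main(void)
{
    pthread_t thr[NTHREADS];
    int ids[NTHREADS];
    int i;

    for (i = 0; i < NTHREADS; i++) {
        ids[i] = i;                    /* one slot per thread, no sharing */
        if (pthread_create(&thr[i], NULL, worker, &ids[i]) != 0) {
            printf("Could not create thread %d\n", i);
            return -1;
        }
    }
    for (i = 0; i < NTHREADS; i++)
        pthread_join(thr[i], NULL);    /* wait for all workers to finish */

    return 0;
}

Passing &ids[i] rather than &i avoids a race where the loop variable changes before a thread reads it.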

  43. OpenMP Shared-memory parallel programming

  44. OMP fragments
omp_set_dynamic(0);
omp_set_num_threads(16);
#pragma omp parallel shared(x, npoints) private(iam, ipoints)
{
    if (omp_get_num_threads() != 16)
        abort();
    iam = omp_get_thread_num();
    ipoints = npoints/16;
    do_by_16(x, iam, ipoints);
}

  45. Full OpenMP program
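
The transcript does not contain the program from this slide; as a stand-in, here is a minimal self-contained OpenMP program in the same spirit as the fragments above (an assumed example, not the original slide content). Compile with cc -fopenmp.

/* Sketch: a complete OpenMP program that sums an array in parallel. */
#include <stdio.h>
#include <omp.h>

#define NPOINTS 1000000

int main(void)
{
    static double x[NPOINTS];
    double sum = 0.0;
    int i;

    for (i = 0; i < NPOINTS; i++)      /* serial initialisation */
        x[i] = 1.0;

    /* Threads share x; the reduction gives each thread a private
     * partial sum that OpenMP combines at the end of the loop. */
    #pragma omp parallel for shared(x) reduction(+:sum)
    for (i = 0; i < NPOINTS; i++)
        sum += x[i];

    printf("max threads: %d, sum = %.0f\n", omp_get_max_threads(), sum);
    return 0;
}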

  46. Multi-CPU Servers

  47. AMD Opteron 800 HPC Processing Node HPC Strengths • Flat SMP-like memory model: • All four processors reside within the same 2^48 memory map • Expandable to 8P NUMA • Glue-less coherent multi-processing: • Low latency and high bandwidth, ~1600M T/sec (6.4 GB/s) • 32GB of high-B/W external memory bus (>5.3 GB/sec.) • Native high-B/W memory-mapped I/O (>25 Gbits/sec.)
