Parallel Programming Concepts and the Computer Center's High-Performance Computing Environment

  1. Parallel Programming Concepts and the Computer Center's HPC Environment — National Taiwan University Computer Center, Operations Management Division, Programmer 張傑生

  2. Outline • Parallel programming concepts • Introduction • OpenMP • MPI • The Computer Center's HPC environment

  3. Advantages of Parallel Programming • Need to solve larger problems • more memory intensive • more computation • more data intensive • Parallel programming provides • more CPU resources • more memory resources • the ability to solve problems that were not possible with a serial program • the ability to solve problems more quickly

  4. Parallel Computer Architectures • Two basic architectures • Shared memory computer • multiple processors • share a global memory space • processors can efficiently exchange/share data • Distributed memory (e.g., Beowulf cluster) • collection of serial computers (nodes) • each node uses its own local memory • nodes work together to solve a problem • nodes communicate via messages • nodes are networked together • The latest parallel computers use a mixed shared/distributed memory architecture • nodes have more than one processor • dual/quad-core processors

  5. Shared Memory Computer • Bottleneck • memory

  6. Distributed Memory • Bottleneck • network

  7. Parallel vs. Distributed Computing • Data dependency • Data exchange • Clock/execution synchronization • Parallel computation is usually confined to a single host or a cluster • to avoid data-exchange latency • to avoid being held back by slower hosts

  8. How to Set Up a Parallel Programming Environment • Compiler, library • GCC, Intel compiler • support both MPI and OpenMP • MPICH • Existing equipment is sufficient • for learning • for verifying program correctness • performance is not the concern

  9. How to Parallelize a Program • Only the user understands the program's bottlenecks • algorithm • function, loop, I/O • First confirm the correctness of the serial code • prepare multiple sets of test data to verify the parallel results later • beware of floating-point precision issues • Consider modifying the algorithm • splitting the computation, splitting the data • Evaluate the approach • MPI, OpenMP

  10. OpenMP • OpenMP: An application programming interface (API) for parallel programming on multiprocessors • Compiler directives • Library of support functions • OpenMP works in conjunction with Fortran, C, or C++

  11. Shared-memory Model • Processors interact and synchronize with each other through shared variables.

  12. Fork/Join Parallelism • Initially only the master thread is active • Master thread executes sequential code • Fork: master thread creates or awakens additional threads to execute parallel code • Join: at the end of the parallel code, the created threads die or are suspended

  13. Parallel for Loops • C programs often express data-parallel operations as for loops:

     for (i = first; i < size; i += prime)
         marked[i] = 1;

  • OpenMP makes it easy to indicate when the iterations of a loop may execute in parallel • Compiler takes care of generating code that forks/joins threads and allocates the iterations to threads

  14. Pragmas • Pragma: a compiler directive in C or C++ • Stands for “pragmatic information” • A way for the programmer to communicate with the compiler • Compiler free to ignore pragmas • Syntax: • #pragma omp <rest of pragma>

  15. Hello World!

     #include <omp.h>
     #include <stdio.h>

     int main (int argc, char *argv[])
     {
         int id, nthreads;

     #pragma omp parallel private(id)
         {
             id = omp_get_thread_num();
             printf("Hello World from thread %d\n", id);
     #pragma omp barrier
             if ( id == 0 ) {
                 nthreads = omp_get_num_threads();
                 printf("There are %d threads\n", nthreads);
             }
         }
         return 0;
     }

  16. Shared and Private Variables
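
  The original slide presents shared and private variables as a figure. As a minimal sketch (the variable names below are illustrative, not taken from the slide), the shared and private clauses control whether all threads see one common copy of a variable or each thread gets its own:

     #include <omp.h>
     #include <stdio.h>

     int main(void)
     {
         int n = 8;     /* shared: a single copy visible to every thread */
         int tmp;       /* private: each thread works on its own copy */

     #pragma omp parallel for shared(n) private(tmp)
         for (int i = 0; i < n; i++) {
             tmp = i * i;                      /* safe: tmp is private to this thread */
             printf("i=%d tmp=%d\n", i, tmp);
         }
         return 0;
     }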

  17. Parallel for Pragma • Format:

     #pragma omp parallel for
     for (i = 0; i < N; i++)
         a[i] = b[i] + c[i];

  18. Reductions • Reductions are so common that OpenMP provides support for them • May add reduction clause to parallel for pragma • Specify reduction operation and reduction variable • OpenMP takes care of storing partial results in private variables and combining partial results after the loop

  19. Reductions Example

     #include <stdio.h>
     #include <omp.h>

     #define N 10000              /* size of a */
     void calculate(double *);    /* function that calculates the elements of a */

     int main(void)
     {
         long w;
         int i;
         double sum;
         double a[N];

         calculate(a);
         sum = 0.0;
         /* forks off the threads and starts the work-sharing construct */
     #pragma omp parallel for private(w) reduction(+:sum) schedule(static,1)
         for (i = 0; i < N; i++) {
             w = i * i;
             sum = sum + w * a[i];
         }
         printf("\n %lf", sum);
         return 0;
     }

  20. Functional Parallelism Example

     v = alpha();
     w = beta();
     x = gamma(v, w);
     y = delta();
     printf("%6.2f\n", epsilon(x,y));

  May execute alpha, beta, and delta in parallel

  21. Example of parallel sections

     #pragma omp parallel sections
     {
     #pragma omp section   /* Optional */
         v = alpha();
     #pragma omp section
         w = beta();
     #pragma omp section
         y = delta();
     }
     x = gamma(v, w);
     printf("%6.2f\n", epsilon(x,y));

  22. Incremental Parallelization • Sequential program a special case of a shared-memory parallel program • Parallel shared-memory programs may only have a single parallel loop • Incremental parallelization: process of converting a sequential program to a parallel program a little bit at a time
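
  A minimal sketch of parallelizing "a little bit at a time" (the array and function names are illustrative): profiling might single out one hot loop, and adding a single pragma to that loop parallelizes it while the rest of the program stays sequential:

     #include <stdio.h>

     #define N 1000

     static double f(double x) { return x * x; }

     int main(void)
     {
         double x[N], y[N];
         int i;

         for (i = 0; i < N; i++)      /* still sequential */
             x[i] = (double)i;

     #pragma omp parallel for         /* the one incremental change: parallelize the hot loop */
         for (i = 0; i < N; i++)
             y[i] = f(x[i]);

         printf("y[N-1] = %f\n", y[N - 1]);
         return 0;
     }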

  23. Pros and Cons of OpenMP • Pros • Simple: no need to deal with message passing as MPI does • Data layout and decomposition are handled automatically by directives • Incremental parallelism: you can work on one portion of the program at a time; no dramatic change to the code is needed • Unified code for both serial and parallel applications: OpenMP constructs are treated as comments when sequential compilers are used • Original (serial) code statements need not, in general, be modified when parallelized with OpenMP, which reduces the chance of inadvertently introducing bugs • Cons • Currently runs efficiently only on shared-memory multiprocessor platforms • Requires a compiler that supports OpenMP • Scalability is limited by the memory architecture • Reliable error handling is missing • Lacks fine-grained mechanisms to control thread-to-processor mapping • Synchronization among a subset of threads is not allowed

  24. MPI • Message Passing Interface • A standard message passing library for parallel computers • MPI was designed for high performance on both massively parallel machines and on workstation clusters. • SPMD programming model • Single Program Multiple Data (SPMD) • A single program running on different sets of data.

  25. Hello World!

     #include <stdio.h>
     #include <mpi.h>

     int main (int argc, char *argv[])
     {
         int rank, size;

         MPI_Init (&argc, &argv);                /* starts MPI */
         MPI_Comm_rank (MPI_COMM_WORLD, &rank);  /* get current process id */
         MPI_Comm_size (MPI_COMM_WORLD, &size);  /* get number of processes */
         printf( "Hello world from process %d of %d\n", rank, size );
         MPI_Finalize();
         return 0;
     }

  26. Output

     andes:~> mpirun -np 4 hello_world_c
     Hello world from process 1 of 4
     Hello world from process 2 of 4
     Hello world from process 3 of 4
     Hello world from process 0 of 4

  The order of the output lines may differ between runs.

  27. Initialization and Clean-up • MPI_Init • Initialize the MPI execution environment. • The first MPI routine in your program. • The argc and argv parameters are from the standard C command line interface. • MPI_Finalize • Terminate MPI execution environment • The last statement in your program

  28. Configuration • MPI_Comm_size • Tells the number of processes in the system. • MPI_COMM_WORLD refers to all the processes in the system. • MPI_Comm_rank • Tells the rank of the calling process.

  29. Communication • Point to point • MPI_Send, MPI_Recv • MPI_Send ((void *)&data, icount, DATA_TYPE, idest, itag, MPI_COMM_WORLD); • data: the start of the data to send; it may be a scalar or an array • icount: the number of elements to send; when icount is greater than one, data must be an array • DATA_TYPE: the datatype of the data being sent • idest: the CPU id of the receiving process • itag: the tag attached to the data being sent • Envelope • sender CPU id • receiver CPU id • message tag • communicator • Collective • Broadcast • Multicast
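
  The slide lists only the MPI_Send side. As a minimal sketch (the buffer, count, tag, and helper function below are illustrative), the matching MPI_Recv on the destination process must specify a compatible envelope — source, tag, and communicator:

     #include <mpi.h>

     /* Called after MPI_Init; rank comes from MPI_Comm_rank. */
     void send_and_receive(int rank)
     {
         double data[100];
         int icount = 100, itag = 10, i;
         MPI_Status istat;

         if (rank == 0) {
             for (i = 0; i < icount; i++)
                 data[i] = (double)i;
             /* rank 0 sends 100 doubles with tag 10 to rank 1 */
             MPI_Send((void *)&data[0], icount, MPI_DOUBLE, 1, itag, MPI_COMM_WORLD);
         } else if (rank == 1) {
             /* rank 1 receives them; source 0 and tag 10 match the envelope */
             MPI_Recv((void *)&data[0], icount, MPI_DOUBLE, 0, itag, MPI_COMM_WORLD, &istat);
         }
     }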

  30. Serial Code • compute a[i] and accumulate the sum

     for (i = 0; i < n; i++) {
         a[i] = b[i] + c[i] * d[i];
         suma += a[i];
     }
     printf( "sum of A=%f\n", suma);

  31. MPI code • the computation is split, but the data is not

  32. Distributing the data

     if ( myid==0) {
         for (idest = 1; idest < nproc; idest++) {
             istart1 = gstart[idest];
             icount1 = gcount[idest];
             itag = 10;
             MPI_Send ((void *)&b[istart1], icount1, MPI_DOUBLE, idest, itag, comm);
             itag = 20;
             MPI_Send ((void *)&c[istart1], icount1, MPI_DOUBLE, idest, itag, comm);
             itag = 30;
             MPI_Send ((void *)&d[istart1], icount1, MPI_DOUBLE, idest, itag, comm);
         }
     } else {
         icount = gcount[myid];
         isrc = 0;
         itag = 10;
         MPI_Recv ((void *)&b[istart], icount, MPI_DOUBLE, isrc, itag, comm, istat);
         itag = 20;
         MPI_Recv ((void *)&c[istart], icount, MPI_DOUBLE, isrc, itag, comm, istat);
         itag = 30;
         MPI_Recv ((void *)&d[istart], icount, MPI_DOUBLE, isrc, itag, comm, istat);
     }

  33. Computing and sending back the results

     for (i = istart; i <= iend; i++) {
         a[i] = b[i] + c[i] * d[i];
     }
     itag = 110;
     if (myid > 0) {
         icount = gcount[myid];
         idest = 0;
         MPI_Send((void *)&a[istart], icount, MPI_DOUBLE, idest, itag, comm);
     } else {
         for ( isrc = 1; isrc < nproc; isrc++ ) {
             icount1 = gcount[isrc];
             istart1 = gstart[isrc];
             MPI_Recv((void *)&a[istart1], icount1, MPI_DOUBLE, isrc, itag, comm, istat);
         }
     }

  34. Printing the result

     if (myid == 0) {
         suma = 0.0;
         for (i = 0; i < n; i++)
             suma += a[i];
         printf( "sum of A=%f\n", suma);
     }

  35. Deadlock Problem

     CPU 0: SEND(1), RECV(1)
     CPU 1: SEND(0), RECV(0)

  Solutions: • handle the placement of the SEND/RECV calls carefully • non-blocking functions: MPI_ISEND, MPI_IRECV • use MPI_SENDRECV
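
  A minimal sketch of the MPI_SENDRECV approach (the partner rank, tag, and buffers are illustrative): the combined call posts the send and the receive together, so two processes exchanging data cannot block each other the way paired blocking SEND/RECV calls can:

     #include <mpi.h>

     /* Each process exchanges one double with a partner process. */
     void swap_with_partner(int partner, double *out, double *in)
     {
         MPI_Status istat;

         MPI_Sendrecv((void *)out, 1, MPI_DOUBLE, partner, 100,
                      (void *)in,  1, MPI_DOUBLE, partner, 100,
                      MPI_COMM_WORLD, &istat);
     }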

  36. MPI_Scatter MPI_Gather

  37. MPI_Scatter • MPI_Scatter ((void *)&t, n, MPI_DOUBLE, (void *)&b, n, MPI_DOUBLE, iroot, comm); • each chunk of data must be of equal length; the arguments, in order, are • t: the start of the array to be sent • n: the number of elements sent to each CPU • MPI_DOUBLE: the datatype of the data being sent • b: the start of the buffer where the received data is stored; if n is greater than one, b must be an array • n: the number of elements to receive • MPI_DOUBLE: the datatype of the received data • iroot: the CPU id of the sending process

  38. MPI_Reduce MPI_Allreduce

  39. MPI_Reduce • MPI_Reduce ((void *)&suma, (void *)&sumall, count, MPI_DOUBLE, MPI_SUM, iroot, comm); • suma: the variable to be operated on (accumulated) • sumall: where the result of the operation is stored (the sum of suma across all CPUs) • count: the number of elements to be operated on (accumulated) • MPI_DOUBLE: the datatype of suma and sumall • MPI_SUM: the reduction operation • iroot: the CPU id on which the result is stored

  40. MPI_Bcast
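
  The original slide shows MPI_Bcast as a diagram. As a minimal sketch (the buffer and helper function are illustrative), MPI_Bcast copies a buffer from the root process to every process in the communicator, so only the root needs to hold valid data beforehand:

     #include <mpi.h>

     /* Called after MPI_Init. Before the call only iroot holds valid data in t;
        afterwards every process in MPI_COMM_WORLD has the same n values. */
     void broadcast_input(double *t, int n, int iroot)
     {
         MPI_Bcast((void *)t, n, MPI_DOUBLE, iroot, MPI_COMM_WORLD);
     }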

  41. MPI code • both the computation and the data are split • the number of elements must be evenly divisible

     n = total/nproc;
     iroot = 0;
     MPI_Scatter((void *)&t, n, MPI_DOUBLE, (void *)&b, n, MPI_DOUBLE, iroot, comm);
     MPI_Scatter((void *)&t, n, MPI_DOUBLE, (void *)&c, n, MPI_DOUBLE, iroot, comm);
     MPI_Scatter((void *)&t, n, MPI_DOUBLE, (void *)&d, n, MPI_DOUBLE, iroot, comm);

  42. MPI code

     suma = 0.0;
     for (i = istart; i <= iend; i++) {
         a[i] = b[i] + c[i]*d[i];
         suma = suma + a[i];
     }
     idest = 0;
     MPI_Gather((void *)&a, n, MPI_DOUBLE, (void *)&t, n, MPI_DOUBLE, idest, comm);
     MPI_Reduce((void *)&suma, (void *)&sumall, 1, MPI_DOUBLE, MPI_SUM, idest, comm);

  43. MPI code

     if (myid == 0) {
         printf( "sum of A=%f\n", sumall);
     }

  44. Shared-memory Model vs. Message-passing Model (#1) • Shared-memory model • The number of active threads is one at the start and finish of the program and changes dynamically during execution • Message-passing model • All processes are active throughout the execution of the program

  45. Shared-memory Model vs.Message-passing Model (#2) • Shared-memory model • Execute and profile sequential program • Incrementally make it parallel • Stop when further effort not warranted • Message-passing model • Sequential-to-parallel transformation requires major effort • Transformation done in one giant step rather than many tiny steps

  46. The Computer Center's HPC Equipment • Installed: 2003/11 • Compute nodes: 50 • Nexcom Blade Server • dual Intel Xeon 2.0GHz • 1GB memory • Performance • Rpeak: 400 GFlops • Rmax: 200 GFlops • Planned to be repurposed for education and training

  47. The Computer Center's HPC Equipment • Installed: 2005/05 • Compute nodes: 78 • IBM Blade Server • dual Intel Xeon 3.2GHz • 5GB memory • Performance: • Rpeak: 998 GFlops • Rmax: 500 GFlops • Currently a main service platform • Suitable for: • serial jobs (non-parallel programs) • programs already parallelized with MPI

  48. The Computer Center's HPC Equipment • Installed: 2006/11 • Compute node: • IBM p595 • 64 × Power5 1.9GHz CPUs • 256GB memory • AIX 5.3 • Performance: • Rpeak: 486 GFlops • Rmax: 421 GFlops • Currently a main service platform • Suitable for: • programs already parallelized with OpenMP • programs with large memory requirements
