
Portable Multi-Level Parallel Programming



Presentation Transcript


  1. Portable Multi-Level Parallel Programming. March 4th 2008, Simula, Oslo. Gerhard Zumbusch, Institut für Angewandte Mathematik, Friedrich-Schiller-Universität Jena

  2. Parallel Programming?
  • "Applications will increasingly need to be concurrent if they want to fully exploit continuing exponential CPU throughput gains."
  • "Therefore single-threaded programs are likely not to get faster any more, except for benefits from further cache size growth (…)."
  • "Finally, programming languages and systems will increasingly be forced to deal well with concurrency."
  Herb Sutter: "The free lunch is over" (Dr. Dobb's Journal 30(3), 2005)
  • or: massively parallel in capability computing

  3. Parallel Programming for free: Instruction Parallelism
  • Memory layout: data close to registers
  • Programming model: use an optimising compiler; loop unrolling, instruction re-ordering if needed
  • Sequential code
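  To make the "free" instruction-level parallelism concrete, here is a minimal sketch (mine, not from the slides) of the stencil/reduction loop used throughout the talk, once in plain form and once manually unrolled by four, the kind of transformation an optimising compiler applies on its own:

  static inline double sqr(double v) { return v * v; }

  // Baseline sequential loop (cf. slide 4); the compiler may unroll and re-order it.
  void relax(double *x, const double *y, int n, double &e) {
      for (int i = 1; i < n; i++) {
          x[i] += (y[i + 1] + y[i - 1]) * .5;
          e += sqr(y[i]);
      }
  }

  // Manually unrolled by four: more independent instructions in flight,
  // which is the instruction-level parallelism the slide refers to.
  void relax_unrolled4(double *x, const double *y, int n, double &e) {
      int i = 1;
      for (; i + 3 < n; i += 4) {
          x[i]     += (y[i + 1] + y[i - 1]) * .5;
          x[i + 1] += (y[i + 2] + y[i])     * .5;
          x[i + 2] += (y[i + 3] + y[i + 1]) * .5;
          x[i + 3] += (y[i + 4] + y[i + 2]) * .5;
          e += sqr(y[i]) + sqr(y[i + 1]) + sqr(y[i + 2]) + sqr(y[i + 3]);
      }
      for (; i < n; i++) {                      // scalar remainder
          x[i] += (y[i + 1] + y[i - 1]) * .5;
          e += sqr(y[i]);
      }
  }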

  4. Data-Parallel Programming
  Programming models:
  • Vector instructions
  • Thread-parallel
  • MPI message passing
  • Cell with DMA block transfers
  • Mixed and hybrid models
  • OpenMP and HPF parallel loops

  for (int i=1; i<n; i++) {
    x(i) += ( y(i+1) + y(i-1) )*.5;
    e += sqr( y(i) );
  }
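  Since the slide lists OpenMP among the programming models, a minimal OpenMP rendering of the same loop (my sketch, not from the slides; compile with -fopenmp) would be:

  static inline double sqr(double v) { return v * v; }

  void relax_omp(double *x, const double *y, int n, double &e) {
      double e_local = 0.0;
      // Parallel loop with a sum reduction, as in the OpenMP parallel-loop model.
      #pragma omp parallel for reduction(+:e_local)
      for (int i = 1; i < n; i++) {
          x[i] += (y[i + 1] + y[i - 1]) * .5;
          e_local += sqr(y[i]);
      }
      e += e_local;
  }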

  5. Data-Parallel Programming: Vector Processor
  • Examples: SSE, AltiVec extensions
  • Data layout: contiguous data blocks
  • Programming models: optimising compiler; special instructions (intrinsics)

  SSE:
  …
  float half = .5;
  _mm_store_ps(&x[i], _mm_mul_ps(
      _mm_load1_ps(&half),
      _mm_add_ps(_mm_loadu_ps(&y[i+1]), _mm_loadu_ps(&y[i-1]))));
  …

  AltiVec:
  …
  float *y0 = &y[i+1], *y1 = &y[i-1];
  vec_st(vec_madd(
      vec_splats(.5f),
      vec_add(vec_perm(vec_ld(0,y0), vec_ld(16,y0), vec_lvsl(0,y0)),
              vec_perm(vec_ld(0,y1), vec_ld(16,y1), vec_lvsl(0,y1))),
      vec_splats(0.f)),
    0, &x[i]);
  …
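  For completeness, a self-contained variant of the SSE fragment (my sketch: it keeps the += of the scalar loop, uses unaligned stores instead of assuming alignment, and adds a scalar remainder; single-precision data assumed):

  #include <xmmintrin.h>   // SSE intrinsics

  // Vectorised x[i] += (y[i+1] + y[i-1]) * 0.5f, four floats at a time.
  void relax_sse(float *x, const float *y, int n) {
      const __m128 half = _mm_set1_ps(0.5f);
      int i = 1;
      for (; i + 3 < n; i += 4) {
          __m128 s = _mm_mul_ps(half,
                     _mm_add_ps(_mm_loadu_ps(&y[i + 1]),
                                _mm_loadu_ps(&y[i - 1])));
          _mm_storeu_ps(&x[i], _mm_add_ps(_mm_loadu_ps(&x[i]), s));
      }
      for (; i < n; i++)                       // scalar remainder
          x[i] += (y[i + 1] + y[i - 1]) * 0.5f;
  }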

  6. Data-Parallel Programming: Symmetric Multi-Processing
  • Data layout: read shared data, write private data
  • Programming models:
    • threads (Pthreads, Java threads, Win threads, …)
    • lazy evaluation (Concur, Cilk)
    • for-loops (OpenMP, Fortran arrays)

  thread:
  void *sub1(void *arg) {
    ...
    double e_local = 0;
    for (int i=n_local0; i<n_local1; i++) {
      x(i) += ( y(i+1) + y(i-1) )*.5;
      e_local += sqr( y(i) );
    }
    vec->e = e_local;
  }

  main:
  for (int p=0; p<p_threads; p++)
    pthread_create(&threads[p], threadAttr, sub1, (void *)vec[p]);
  for (int p=0; p<p_threads; p++) {
    pthread_join(threads[p], NULL);
    e += vec[p]->e;
  }

  7. Data-Parallel Programming: Distributed Memory
  • Data layout: local memory only, manage data transfer explicitly
  • Programming models:
    • message passing (MPI-1)
    • (Fortran arrays)
    • SGI shmem
    • MPI-2, BSP, UPC, X10, …

  proc p:
  if (p_left) MPI_Send(&y(n_local0), 1, MPI_DOUBLE, p_left, ...);
  if (p_right) {
    MPI_Recv(&y(n_local1+1), 1, MPI_DOUBLE, p_right, ...);
    MPI_Send(&y(n_local1), 1, MPI_DOUBLE, p_right, ...);
  }
  if (p_left) MPI_Recv(&y(n_local0-1), 1, MPI_DOUBLE, p_left, ...);
  double e_local = 0;
  for (int i=n_local0; i<n_local1; i++) {
    x(i) += ( y(i+1) + y(i-1) )*.5;
    e_local += sqr( y(i) );
  }
  MPI_Allreduce(&e_local, &e, 1, MPI_DOUBLE, MPI_SUM, ...);

  8. Multi-Core Processor: IBM/Sony Cell BE (8+1 processor cores)

  9. Data-Parallel Programming: Cell Broadband Architecture
  • Data layout:
    • global memory
    • parts as copies in local SPU memory (256 kB), user-controlled DMA block transfers
  • Programming model:
    • special library calls
    • Fortran arrays (?)

  SPU:
  int main(unsigned long long id, addr64 argp, addr64 envp) {
    mfc_get(...);
    mfc_read_tag_status_all();
    ...
    double e_local = 0;
    for (int i=n_local0; i<n_local1; i++) {
      x(i) += ( y(i+1) + y(i-1) )*.5;
      e_local += sqr( y(i) );
    }
    mfc_put(...);
    mfc_read_tag_status_all();
  }

  CPU thread:
  void *sub1(void *arg) {
    spe_context_run(...);
  }

  CPU main:
  for (int p=0; p<spe; p++) {
    spe_context_create(...);
    spe_program_load(...);
    pthread_create(...);
  }
  for (int p=0; p<spe; p++) {
    pthread_join(...);
    spe_context_destroy(...);
    e += vec[p]->e;
  }

  10. Automatic Code Generation
  • application + library + application-specific language extensions
  • data dependence analysis: modified g++ 4.2, analysis based on the Tree SSA representation
  • detected dependences: load y(i-1), y(i), y(i+1); store x(i); reduce (add) e
  • code generation targets: sequential code, vectorised code, thread-parallel code, MPI-parallel code, MPI+thread-parallel code, parallel+vector Cell processor code

  Grid1 *g = new Grid1(0, n+1);
  Grid1IteratorSub it(1, n, g);
  DistArray<double> x(g), y(g);
  double e = 0;
  ForEach(int i, it,
    x(i) += ( y(i+1) + y(i-1) )*.5;
    e += sqr( y(i) );
  )
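  To illustrate what the generator might emit, the following is a hypothetical sketch (mine, not the author's actual output) of an MPI target for this ForEach, combining the detected dependences with the ghost-exchange pattern of slide 7. It assumes the local cells are indices n_local0..n_local1 inclusive with one ghost cell on each side, and that p_left/p_right are neighbour ranks or MPI_PROC_NULL at the domain boundary:

  #include <mpi.h>

  static inline double sqr(double v) { return v * v; }

  // y(i-1)/y(i+1) are read across the sub-grid boundary, so ghost values are
  // exchanged first; the '+=' reduction on e becomes an Allreduce.
  double foreach_mpi(double *x, double *y, int n_local0, int n_local1,
                     int p_left, int p_right, MPI_Comm comm) {
      MPI_Status st;
      // Exchange ghost cells with both neighbours (Sendrecv avoids deadlock).
      MPI_Sendrecv(&y[n_local0], 1, MPI_DOUBLE, p_left, 0,
                   &y[n_local1 + 1], 1, MPI_DOUBLE, p_right, 0, comm, &st);
      MPI_Sendrecv(&y[n_local1], 1, MPI_DOUBLE, p_right, 1,
                   &y[n_local0 - 1], 1, MPI_DOUBLE, p_left, 1, comm, &st);

      double e_local = 0.0, e = 0.0;          // owner computes on the local sub-grid
      for (int i = n_local0; i <= n_local1; i++) {
          x[i] += (y[i + 1] + y[i - 1]) * .5;
          e_local += sqr(y[i]);
      }
      MPI_Allreduce(&e_local, &e, 1, MPI_DOUBLE, MPI_SUM, comm);
      return e;
  }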

  11. Distributed-Grid Summary
  int n = 64;
  Grid1 *g = new Grid1(0, n+1);
  Grid1IteratorSub it(1, n, g);
  DistArray1<double> x(g), y(g);
  double e = 0.;
  ForEach(int i, it,
    ‘x(i) += ( y(i+1) + y(i-1) )*.5; e += sqr( x(i) ); ’
  )
  • auto-detected dependence: y(i+1), y(i-1)
  • array references: local node read/write, neighbour nodes read only
  • shared memory: schedule sub-grids
  • distributed memory: owner computes local sub-grid, exchange ghost data (message passing)

  12. Relaxation Scheme
  • Solve a linear equation system iteratively
  • n data items
  • O(n) sequential arithmetic operations per iteration
  • Parallel version: (block) Jacobi iteration
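  As an illustration (my sketch, not from the slides), a damped Jacobi sweep for the 1D model problem -u'' = f on a uniform grid, which is the kind of O(n) relaxation the slide refers to; each point uses only old values, so the sweep parallelises trivially:

  #include <vector>

  // One damped Jacobi sweep; u and f have n+1 points, h is the grid spacing,
  // boundary values u[0] and u[n] stay fixed.
  void jacobi_sweep(std::vector<double> &u, const std::vector<double> &f,
                    double h, double omega = 2.0 / 3.0) {
      const int n = static_cast<int>(u.size()) - 1;
      std::vector<double> u_new(u);
      for (int i = 1; i < n; i++)              // reads old values only
          u_new[i] = 0.5 * (u[i - 1] + u[i + 1] + h * h * f[i]);
      for (int i = 1; i < n; i++)              // damped update
          u[i] = (1.0 - omega) * u[i] + omega * u_new[i];
  }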

  13. Multigrid Relaxation
  • Solve a linear equation system iteratively
  • O(n) arithmetic operations per iteration
  • Constant number of iterations
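  A compact recursive V-cycle in the same 1D setting (again my sketch; the talk's actual solver is a 3D finite-difference multigrid, see slide 15), reusing the jacobi_sweep sketch from slide 12 above:

  // Recursive V(1,1)-cycle sketch for -u'' = f on nested 1D grids.
  // n = u.size()-1 must be a power of two; the coarsest grid is solved by extra smoothing.
  void vcycle(std::vector<double> &u, const std::vector<double> &f, double h) {
      const int n = static_cast<int>(u.size()) - 1;
      if (n <= 2) {                            // coarsest grid: relax until converged
          for (int k = 0; k < 10; k++) jacobi_sweep(u, f, h);
          return;
      }
      jacobi_sweep(u, f, h);                   // pre-smoothing

      std::vector<double> r(n + 1, 0.0);       // residual r = f - A u
      for (int i = 1; i < n; i++)
          r[i] = f[i] + (u[i - 1] - 2.0 * u[i] + u[i + 1]) / (h * h);
      std::vector<double> rc(n / 2 + 1, 0.0), ec(n / 2 + 1, 0.0);
      for (int i = 1; i < n / 2; i++)          // full-weighting restriction
          rc[i] = 0.25 * r[2 * i - 1] + 0.5 * r[2 * i] + 0.25 * r[2 * i + 1];

      vcycle(ec, rc, 2.0 * h);                 // coarse-grid correction

      for (int i = 1; i < n / 2; i++) {        // linear interpolation back and correct
          u[2 * i]     += ec[i];
          u[2 * i - 1] += 0.5 * (ec[i - 1] + ec[i]);
      }
      u[n - 1] += 0.5 * ec[n / 2 - 1];         // ec boundary values are zero
      jacobi_sweep(u, f, h);                   // post-smoothing
  }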

  14. Static Communication Pattern
  • mapping fine -> coarse grid
  • grid: memory alignment
  • data dependence: z(2*i-1), x(i+1)

  int n = 64;
  Grid1 *g = new Grid1(0, n+1);
  Grid1 *gf = new Grid1(0, 2*n+1, g, &f);
  DistArray1<double> x(g);
  DistArray1<double> z(gf);
  Grid1IteratorSub it(1, n, g);
  ForEach(int i, it,
    ‘x(i) = z(2*i)*.5 + ( z(2*i-1) + z(2*i+1) )*.25; ’
  )
  ForEach(int i, it,
    ‘z(2*i) = x(i); z(2*i+1) = ( x(i) + x(i+1) )*.5; ’
  )

  15. Multigrid: MPI+Pthreads
  3D multigrid, finite differences, structured nested grids, V(0,1) cycle (~Fapin, NAS benchmarks).
  4 × AMD dual-core Opteron 1.8 GHz, g++ 64 bit, Linux, Pthreads and/or MPICH.

  16. Multigrid: MPI+Pthreads, 1 or 2 processes per cluster node (MPI or Pthreads)
  3D multigrid, finite differences, uniform grid, 513³ grid points (NAS Fapin).
  Intel dual-core cluster, 1 Gbit/s Ethernet, g++ 64 bit, Linux, MPICH (and Pthreads).

  17. Multigrid Solver on the Cell Processor: single/double buffering, scalar/vector code
  3D multigrid, finite differences, uniform grid, 129³ grid points (NAS Fapin).
  Sony PlayStation 3, Linux, xlC 8.2.

  18. Particle Simulation with Tree Code

  19. Tree Code: Top Down
  • shared memory: coarse tree sequential, one thread per sub-tree
  • distributed memory: replicated coarse tree, distributed fine sub-trees

  TopDownIterator<tree> down(root);
  ForEach(tree *b, down,
    ‘ for (int i=0; i<4; i++)
        if (b->child(i)) b->child(i)->l += b->l; ’
  )
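  The snippets on slides 19-21 assume roughly the following node layout (my reconstruction from the fields they reference; names and scalar types are guesses, and in the real 2D fast multipole solver m and l would be arrays of complex Laurent coefficients rather than single doubles):

  #include <list>

  // Hypothetical quadtree node matching the fields used in slides 19-21.
  struct tree {
      tree *child[4];            // four children of a 2D quadtree node (may be NULL)
      double m;                  // multipole coefficient, accumulated bottom-up (slide 20)
      double l;                  // local expansion, propagated top-down (slide 19)
      double x;                  // node centre, used in the interaction kernel (slide 21)
      double x0, x1;             // node extent, tested by fetch() for neighbourhood (slide 21)
      std::list<tree*> inter;    // interaction list of geometric neighbours
  };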

  20. Tree Code: Bottom Up
  • data dependence analysis: load child[], load child[]->m, load/store this->m
  • data exchange: variable m

  BottomUpIterator<tree> up(root);
  ForEach(tree *b, up,
    ‘ for (int i=0; i<4; i++)
        if (b->child[i]) b->m += b->child[i]->m; ’
  )

  21. Tree Code: Local Neighbours
  • data dependence analysis: ->x, ->m
  • distributed memory: additional data exchange, sub-trees within a geometrical neighbourhood
  • Require/fetch collects a super-set of all possible neighbours

  Require( list<tree*> inter, fetch );
  double x0, x1;
  int fetch(tree *b) {
    return (x0==b->x1) || (x1==b->x0);
  }

  TopDownIterator<tree> down(root);
  ForEach(tree *b, down,
    ‘ for (list<tree*>::const_iterator i = b->inter.begin(); i != b->inter.end(); i++)
        b->l += log(abs(b->x - (*i)->x)) * (*i)->m; ’
  )

  22. Tree Code: MPI+Pthreads
  2D adaptive fast multipole method, 20 complex Laurent-series coefficients, 2·10⁶ particles (~Splash-2).
  4 × AMD dual-core Opteron 1.8 GHz, g++ 64 bit, Linux, Pthreads and/or MPICH.

  23. Tree Code: MPI+Pthreads, 2 processes per cluster node (MPI or Pthreads)
  2D adaptive fast multipole method, 20 complex Laurent-series coefficients (~Splash-2).
  Intel dual-core cluster, 1 Gbit/s Ethernet, g++ 64 bit, Linux, MPICH (and Pthreads).

  24. Conclusion
  • Tree & grid iterators for numerical codes: code annotation & library → data dependence analysis at compile time → automatic parallelization
  • For shared-memory, distributed-memory, multi-core and mixed parallel target architectures
  • Domain-specific parallel programming styles vs. parallel libraries vs. parallel languages?
