
CEES 157




  1. CEES 157

  2. CEES 157 2006
     • 18 registered students
     • Efficient serial programming
     • Parallel programming
     • Seminars by Geoscience HPC users
     • Projects on optimizing/parallelizing code

  3. Parallel programming
     • Introduction to OpenMP
       • Loops, parallel environment
     • Introduction to MPI
       • Parallel environment, passing data, comparisons to OpenMP
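The course material for these units is not reproduced in the transcript. As a rough illustration of the two programming models, here is a minimal hybrid OpenMP + MPI sketch (my own example, not course code), assuming an MPI library and an OpenMP-capable compiler:

       /* Each rank sums part of a series with an OpenMP-parallel loop,
          then MPI_Reduce combines the per-rank partial sums on rank 0. */
       #include <mpi.h>
       #include <omp.h>
       #include <stdio.h>

       int main(int argc, char **argv) {
         int rank, size;
         MPI_Init(&argc, &argv);
         MPI_Comm_rank(MPI_COMM_WORLD, &rank);
         MPI_Comm_size(MPI_COMM_WORLD, &size);

         const long n = 10000000;
         double local = 0.0, global = 0.0;

         /* OpenMP: split this rank's iterations across threads */
         #pragma omp parallel for reduction(+:local)
         for (long i = rank; i < n; i += size)   /* MPI: interleave work across ranks */
           local += 1.0 / ((double)i + 1.0);

         /* MPI: combine the per-rank partial sums */
         MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
         if (rank == 0)
           printf("sum = %f (%d ranks, up to %d threads per rank)\n",
                  global, size, omp_get_max_threads());

         MPI_Finalize();
         return 0;
       }

Something like "mpicc -fopenmp example.c" to build and "mpirun -np 4 ./a.out" to run is the usual pattern, though the exact commands depend on the MPI installation.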

  4. Projects
     • 8 projects
     • 5-7 were successes
     • Projects were concentrated in imaging and fluid flow
     • Main benefit was writing more efficient serial code and introducing OpenMP to their code

  5. Loop invariant

     Old code:
       for(iy=0;iy<ny;iy++){
         for(ix=0;ix<nx;ix++){
           phsft=-signn*dz*(w/v_z[i2(iy,ix,nx)]-w_v0_z);
           wld_z[i2(iy,ix,nx)]=wld_z[i2(iy,ix,nx)]*cmplx(cos(phsft),sin(phsft));
         }
       }

     New code:
       signdzw=-signn*dz*w;
       signdzwv0z=-signn*dz*w_v0_z;
       for(iy=0;iy<ny;iy++){
         for(ix=0;ix<nx;ix++){
           phsft=signdzw/v_z[iy][ix]-signdzwv0z;
           wld_z[iy][ix]=wld_z[iy][ix]*cmplx(cos(phsft),sin(phsft));
         }
       }

  6. Loop invariant
       for(ix=1;ix<nx-1;ix++){
         url=u[ix-1];
         urr=u[ix+1];
         ror =cmplx(tr+b*cb[ix]*cb[ix],  a*ca[ix]);
         rorr=cmplx(tr+b*cb[ix]*cb[ix+1],a*ca[ix]);
         rorl=cmplx(tr+b*cb[ix]*cb[ix-1],a*ca[ix]);
         v[ix]=u[ix]-2.0*ror*u[ix]+rorr*urr+rorl*url;
         rol=conjg(ror);
         ra[ix]=1.0-2.0*rol;
         rb[ix]  =cmplx(tr+b*cb[ix]*cb[ix+1],-a*ca[ix]);
         rc[ix-1]=cmplx(tr+b*cb[ix]*cb[ix-1],-a*ca[ix]);
       }

  7. Loop invariant
       for(ix=1;ix<nx-1;ix++){
         tmpbcb=b*cb[ix];    /* common factors computed once per iteration */
         tmpaca=a*ca[ix];
         url=u[ix-1];
         urr=u[ix+1];
         ror =cmplx(tr+tmpbcb*cb[ix],  tmpaca);
         rorr=cmplx(tr+tmpbcb*cb[ix+1],tmpaca);
         rorl=cmplx(tr+tmpbcb*cb[ix-1],tmpaca);
         v[ix]=u[ix]-2.0*ror*u[ix]+rorr*urr+rorl*url;
         rol=conjg(ror);
         ra[ix]=1.0-2.0*rol;
         rb[ix]  =conjg(rorr);
         rc[ix-1]=conjg(rorl);
       }

  8. Functional Inlining
       for(ix=1;ix<nx-1;ix++){
         tmpbcb=b*cb[ix];
         tmpaca=a*ca[ix];
         url=u[ix-1];
         urr=u[ix+1];
         __real__ ror =tr+tmpbcb*cb[ix];    __imag__ ror =tmpaca;
         __real__ rorr=tr+tmpbcb*cb[ix+1];  __imag__ rorr=tmpaca;
         __real__ rorl=tr+tmpbcb*cb[ix-1];  __imag__ rorl=tmpaca;
         v[ix]=u[ix]-2.0*ror*u[ix]+rorr*urr+rorl*url;
         __real__ rol= __real__ ror;        __imag__ rol=-__imag__ ror;
         ra[ix]=1.0-2.0*rol;
         __real__ rb[ix]  = __real__ rorr;  __imag__ rb[ix]  =-__imag__ rorr;
         __real__ rc[ix-1]= __real__ rorl;  __imag__ rc[ix-1]=-__imag__ rorl;
       }
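__real__ and __imag__ are GCC extensions that read or write the parts of a complex lvalue directly; that is what lets the loop above fill ror, rb and rc without calling the cmplx()/conjg() helpers. A small standalone sketch of the same idea (my own, not course code):

       /* GCC complex extensions: set/negate the parts of a complex variable
          directly instead of constructing it through a helper function. */
       #include <complex.h>
       #include <stdio.h>

       int main(void) {
         double complex z, zc;

         /* equivalent to z = cmplx(3.0, 4.0) in the slides' notation */
         __real__ z = 3.0;
         __imag__ z = 4.0;

         /* manual conjugate, as on slide 8: copy real part, negate imaginary part */
         __real__ zc =  __real__ z;
         __imag__ zc = -__imag__ z;

         printf("z  = %g%+gi\n", creal(z),  cimag(z));
         printf("zc = %g%+gi\n", creal(zc), cimag(zc));
         return 0;
       }

The gain is avoiding the call overhead of what are presumably hand-rolled cmplx()/conjg() helpers; a modern compiler with inlining enabled may do much of this on its own.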

  9. Vectorizing
       for(iy=0;iy<ny;iy++){
         for(ix=0;ix<nx;ix++){
           phsft=-signn*dz*(w/v_z[i2(iy,ix,nx)]-w_v0_z);
           wld_z[i2(iy,ix,nx)]=wld_z[i2(iy,ix,nx)]*cmplx(cos(phsft),sin(phsft));
         }
       }

  10. Vectorizing
       signdzw=-signn*dz*w;
       signdzwv0z=-signn*dz*w_v0_z;
       for(iy=0;iy<ny;iy++){
         for(ix=0;ix<nx;ix++){
           phin[ix]=signdzw/v_z[iy][ix]-signdzwv0z;
         }
         vsCos(nx,phin,cosout);
         vsSin(nx,phin,sinout);
         for(ix=0;ix<nx;ix++){
           __real__ tmp_wld[iy][ix]=cosout[ix];
           __imag__ tmp_wld[iy][ix]=sinout[ix];
         }
         for(ix=0;ix<nx;ix++)
           wld_z[iy][ix]*=tmp_wld[iy][ix];
       }
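vsCos and vsSin here look like Intel MKL Vector Math routines: one call evaluates cos (or sin) over a whole array instead of nx scalar libm calls inside the loop. A minimal sketch of the pattern, assuming MKL is available and linked (apply_phase and its argument list are my own framing; phin, cosout and sinout follow the slide):

       /* Vectorized phase application for one row of the wavefield. */
       #include <mkl.h>
       #include <complex.h>
       #include <stdlib.h>

       void apply_phase(float complex *wld, const float *v, int nx,
                        float signdzw, float signdzwv0z) {
         float *phin   = malloc(nx * sizeof *phin);
         float *cosout = malloc(nx * sizeof *cosout);
         float *sinout = malloc(nx * sizeof *sinout);

         for (int ix = 0; ix < nx; ix++)
           phin[ix] = signdzw / v[ix] - signdzwv0z;

         vsCos(nx, phin, cosout);   /* cosout[ix] = cos(phin[ix]) */
         vsSin(nx, phin, sinout);   /* sinout[ix] = sin(phin[ix]) */

         for (int ix = 0; ix < nx; ix++)
           wld[ix] *= cosout[ix] + I * sinout[ix];

         free(phin); free(cosout); free(sinout);
       }

If the MKL version at hand has the combined v?SinCos (or v?CIS) routines, those could merge the two transcendental passes into one, though I have not verified which version the class used.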

  11. Computation time
       Old code: 654 seconds
       New code: 220 seconds (~3x speedup)

  12. Implementation choices
     • Restrictions:
       • Cartesian grids
       • Diagonal permeability tensors
       • Low-order accurate MFEM
     • Large memory need => OpenMP
     • Reservoir simulation models > 10e6 elements
     • On Cartesian grids, MFEM solves for 3*Nx*Ny + Nx + Ny unknowns

  13. Parallelized code
     • Initializations
     • Several large matrices initialized (for grid, system, …)

       #pragma omp parallel for private(i,j)
       for(i=0; i < (3*nx*ny+nx+ny); i++) {
         for(j=0; j < (3*nx*ny+nx+ny); j++) {
           tangent[i][j] = 0.;
         }
       }
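A variation worth noting (my suggestion, not from the slides): because both loops run over the same large bound, OpenMP's collapse clause can fuse them into a single parallel iteration space, which can improve load balance when the outer loop alone does not provide enough parallelism:

       /* Hypothetical variant of the initialization loop using collapse(2);
          requires OpenMP 3.0+.  'tangent' and 'n' stand in for the slide's
          matrix and its dimension 3*nx*ny+nx+ny. */
       #include <omp.h>

       void zero_tangent(double **tangent, int n) {
         #pragma omp parallel for collapse(2)
         for (int i = 0; i < n; i++)
           for (int j = 0; j < n; j++)
             tangent[i][j] = 0.;
       }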

  14. Parallelized code
     • Grid generation
     • Grid data structures created in loops

       #pragma omp parallel for private(i,j)
       for(j=0; j<ny; j++)      //Loop in x direction
       {
         for(i=0; i<nx; i++)    //Loop in y direction
         {
           blah... blah...
         }
       }

  15. Parallelized code
     • Loop over elements
     • Builds tangent, residual and solution vector
     • Many calls to shape function routines

       #pragma omp parallel for private(et,i_e,i_e_2)
       for (et=0; et<n_dof_i; et++)        //Loop over elements
       {
         for(i_e=0; i_e<4; i_e++)          //Double loop over edges
         {
           for(i_e_2=0; i_e_2<4; i_e_2++)
           {
             blah… blah…
           }
         }
       }
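One caveat the slide does not show (my addition, since the transcript does not reveal how the course code stores its arrays): if two elements scheduled on different threads contribute to the same global tangent/residual entries, those updates race. A minimal sketch of one common guard, with hypothetical names:

       #include <omp.h>

       /* Hypothetical helper: accumulate one element's contribution into a
          shared residual vector; the atomic protects concurrent updates to
          the same entry.  'residual', 'row' and 'contrib' are illustrative. */
       static void add_to_residual(double *residual, int row, double contrib) {
         #pragma omp atomic
         residual[row] += contrib;
       }

If each element et owns disjoint rows, the private(et,i_e,i_e_2) clause shown on the slide is already sufficient and no atomic is needed.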

  16. Results

  17. Validation
     • 70x70 grid
     • 14840 unknowns (= 3*70*70 + 70 + 70)
     • Tangent: 220,225,600 terms (= 14840^2)

  18. Performance

  19. Old Pseudo-code
       For all Shots                 !! Parallelizable
         For all Frequencies
           For all Depths
             For all References
               U=U*SSF
               FFT(U)
               U=U*PHS
               IFFT(U)
               Interpolation
             end References
           end Depths
         end Frequencies
       end Shots

  20. Old Pseudo-code
       For all Shots (~thousands)
         For all Frequencies (~50-100)
           For all Depths (~500)
             For all References (~4-10)
               U=U*SSF
               FFT(U)
               U=U*PHS
               IFFT(U)
               Interpolation
             end References
           end Depths
         end Frequencies
       end Shots
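To put rough numbers on those counts (mid-range values of my own choosing, not from the slides): 1000 shots x 50 frequencies x 500 depths x 5 references is about 1.25e8 executions of the SSF/FFT/PHS/IFFT body, which is why the per-call savings in the following slides add up to minutes of run time.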

  21. Unnecessary complex math
       !! --------------------------------------------------------------
       !! SSF correction subroutine
       subroutine wemig_ssf(dax,w,ir,allfields,allrefs)
         complex :: dax(:,:),ikz2d(:,:)
         real    :: w,dir,allfields(:,:,:),allrefs(:,:)
         integer :: iz,ir
         ikz2d = (cmplx(0.,1.)*allrefs(ir,3)*( allfields(:,:,3) - allrefs(ir,3) )) + &
                 ( allrefs(ir,4) * w**2 * ( allfields(:,:,4) - allrefs(ir,4) )) / &
                 sqrt( allrefs(ir,4)**2 * w**2 - allrefs(ir,10)**2 )
         dax = dax * cexp( cmplx( -dz*abs(aimag(ikz2d)), -dz*real(ikz2d) ) )
       end subroutine wemig_ssf
       !! --------------------------------------------------------------


  23. New Routine - 2.8x faster
       !! --------------------------------------------------------------
       !! SSF correction subroutine
       subroutine wemig_ssf(dax,w,ir,allfields,allrefs)
         complex :: dax(:,:)
         real    :: w,dir,allfields(:,:,:),allrefs(:,:),ikz2d(:,:)
         integer :: iz,ir
         ikz2d = -dz * ( allrefs(ir,4) * w**2 * ( allfields(:,:,4) - allrefs(ir,4) )) / &
                 sqrt( allrefs(ir,4)**2 * w**2 - allrefs(ir,10)**2 )
         dax = dax * cmplx( cos(ikz2d), sin(ikz2d) )
       end subroutine wemig_ssf
       !! --------------------------------------------------------------
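The C-level version of the same trick (my own sketch, not course code): when the exponent is purely imaginary, the general complex exponential can be replaced by an explicit cos/sin pair, avoiding the exp() of the real part that cexp must also evaluate. The new Fortran routine additionally drops the attenuation term built from abs(aimag(ikz2d)), which is a separate simplification from the cexp-to-cos/sin substitution.

       /* Phase rotation of a wavefield sample: both functions are
          mathematically equivalent for real phi, but the second avoids the
          general-purpose complex exponential. */
       #include <complex.h>
       #include <math.h>

       float complex rotate_cexp(float complex dax, float phi) {
         return dax * cexpf(I * phi);               /* general complex exp */
       }

       float complex rotate_cos_sin(float complex dax, float phi) {
         return dax * (cosf(phi) + I * sinf(phi));  /* explicit cos/sin */
       }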


  25. Old Pseudo-code / New Pseudo-code

       Old:
         For all Shots               !! Parallelizable
           For all Frequencies
             For all Depths
               For all References
                 SSF
                 FFT
                 PHS
                 FFT
                 Interpolation
               end References
             end Depths
           end Frequencies
         end Shots

       New:
         For all Shots               !! Parallelizable
           For all Frequencies
             For all Depths
               SSF
               FFT
               For all References
                 PHS
                 FFT
                 Interpolation
               end References
             end Depths
           end Frequencies
         end Shots

  26. New Reorganization - 2.3x faster
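Counting operations per depth step helps explain the 2.3x: the old ordering runs a forward and an inverse transform inside the reference loop (2 x Nref FFTs, plus Nref SSF applications), while the new ordering does SSF and one FFT per depth and then one transform per reference (Nref + 1 FFTs, one SSF). With 4-10 references per depth that removes roughly 35-45% of the FFTs and all but one of the SSF applications.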

  27. Serial Optimization Test Results

     Original Code:
       Freq Loop   ncalls=     48   time= 713.116 s   per= 14.857 s
       PHS tag     ncalls= 205824   time= 451.979 s   per=  2.196 ms
       SSF tag     ncalls= 205824   time= 215.695 s   per=  1.048 ms
       REF tag     ncalls= 205824   time=  27.058 s   per=  0.132 ms

     Subroutine Improvement:
       Freq Loop   ncalls=     48   time= 574.065 s   per= 11.959 s
       PHS tag     ncalls= 205824   time= 453.302 s   per=  2.202 ms
       SSF tag     ncalls= 205824   time=  75.414 s   per=  0.366 ms
       REF tag     ncalls= 205824   time=  27.964 s   per=  0.136 ms

     Reduce # of FFTs:
       Freq Loop   ncalls=     71   time= 472.296 s   per=  6.65 s
       PHS tag     ncalls= 286343   time=  92.600 s   per=  1.46 ms
       SSF tag     ncalls=  36281   time=  16.435 s   per=  0.380 ms
       REF tag     ncalls= 286343   time=  35.492 s   per=  0.123 ms
       (69.5% of the time is in FFTs: 4.62 s per frequency)
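Reading the table against the earlier claims (the arithmetic is mine, the numbers are from the table): the subroutine improvement cuts the SSF cost from 1.048 ms to 0.366 ms per call, about 2.9x, which matches the "2.8x faster" routine; and even after reducing the number of FFTs, roughly 69.5% of the time, about 4.62 s per frequency, is still spent in FFTs.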


  29. CEES 157 2007
     • More discussion on writing/designing efficient codes
     • Introduction to programming tools
       • SCons, make, svn
     • No seminars
     • Expansion of OpenMP
