
Introduction to Scientific Computing on the IBM SP and Regatta

Doug Sondak (sondak@bu.edu). Topics: friendly-user access to the Regatta, hardware, batch queues (LSF), compilers, libraries, MPI, OpenMP, debuggers, and profilers and hardware counters.


Presentation Transcript


  1. Introduction to Scientific Computing on the IBM SP and Regatta Doug Sondak sondak@bu.edu

  2. Outline • Friendly Users (access to Regatta) • hardware • batch queues (LSF) • compilers • libraries • MPI • OpenMP • debuggers • profilers and hardware counters

  3. Friendly Users

  4. Friendly Users • Regatta • not presently open to general user community • will be open to a small number of “friendly users” to help us make sure everything’s working ok

  5. Friendly Users (cont’d) • Friendly-user rules 1. We expect the friendly-user period to last 4-6 weeks 2. No charge for CPU time! 3. Must have “mature” code • code must currently run (we don’t want to test how well the Regatta runs emacs!) • serial or parallel

  6. Friendly Users (3) • Friendly-user rules (cont’d) 4. We want feedback! • What did you encounter that prevented porting your code from being a “plug and play” operation? • If it was not obvious to you, it was not obvious to some other users! 5. Timings are required for your code • use time command • report wall-clock time • web-based form for reporting results

  7. Friendly Users (4) • Friendly-user application and report form: • first go to SP/Regatta repository: http://scv.bu.edu/SCV/IBMSP/ • click on Friendly Users link at bottom of menu on left-hand side of page • timings required for the Regatta and either the O2k or SP (both would be great!)

  8. Hardware

  9. Hal (SP) • Power3 processors • 375 MHz • 4 nodes • 16 processors each • shared memory on each node • 8GB memory per node • presently can use up to 16 procs.

  10. Hal (cont’d) • L1 cache • 64 KB • 128-byte line • 128-way set associative • L2 cache • 4 MB • 128-byte line • direct-mapped (“1-way” set assoc.)
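The cache figures above determine how many sets each cache has, and therefore which addresses compete for the same slot. A minimal sketch of that arithmetic in C follows. The function names are hypothetical, and the model assumes a plain modulo mapping from address to set, which may not match how the real hardware indexes its caches.

```c
#include <stddef.h>

/* Number of sets in a cache: capacity / (line size * associativity).
   A direct-mapped cache is the ways == 1 case. */
size_t cache_sets(size_t capacity_bytes, size_t line_bytes, size_t ways)
{
    return capacity_bytes / (line_bytes * ways);
}

/* Set that a byte address falls into, assuming a simple modulo mapping
   (an illustrative assumption, not a statement about Power3 internals). */
size_t cache_set_index(size_t address, size_t line_bytes, size_t num_sets)
{
    return (address / line_bytes) % num_sets;
}
```

With Hal's numbers, `cache_sets(64 * 1024, 128, 128)` gives 4 sets for L1 and `cache_sets(4 * 1024 * 1024, 128, 1)` gives 32768 sets for L2. Under this model, two addresses exactly 4 MB apart map to the same L2 set, and because L2 is direct-mapped they would evict each other.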

  11. Twister (Regatta) • Power4 processors • 1.3 GHz • 2 CPUs per chip (interesting!) • 3 nodes • 32 processors each • shared memory on each node • 32GB memory per node • presently can use up to 32 procs.

  12. Twister (cont’d) • L1 cache • 32 KB per proc. (64 KB per chip) • 128-byte line • 2-way set associative

  13. Twister (3) • L2 cache • 1.41 MB • shared by both procs. on a chip • 128-byte line • 4-to-8 way set associative • unified • data, instructions, page table entries

  14. Twister (4) • L3 cache • 128 MB • off-chip • shared by 8 procs. • 512-byte “blocks” • coherence maintained at 128-bytes • 8-way set associative

  15. Batch Queues

  16. Batch Queues • LSF batch system • bqueues for list of queues

          QUEUE_NAME  PRIO  STATUS       MAX  JL/U  JL/P  JL/H  NJOBS  PEND  RUN  SUSP
          p4-mp32     10    Open:Active  1    1     -     -     0      0     0    0
          p4-mp16     9     Open:Active  2    1     -     -     0      0     0    0
          p4-short    8     Open:Active  2    1     -     -     0      0     0    0
          p4-long     7     Open:Active  16   5     -     -     0      0     0    0
          sp-mp16     6     Open:Active  2    1     -     1     2      1     1    0
          sp-mp8      5     Open:Active  2    1     -     -     1      0     1    0
          sp-long     4     Open:Active  8    2     -     -     20     12    8    0
          sp-short    3     Open:Active  2    1     -     -     0      0     0    0
          graveyard   2     Open:Inact   -    -     -     -     0      0     0    0
          donotuse    1     Open:Active  -    -     -     -     0      0     0    0

  17. Batch Queues (cont’d) • p4 queues are on the Regatta • sp queues are on the SP (surprise!) • “long” and “short” queues are serial • for details see http://scv.bu.edu/SCV/scf-techsumm.html • will not include Regatta info. until it’s open to all users • bsub to submit job • bjobs to monitor job

  18. Compilers

  19. Compiler Names • AIX uses different compiler names to perform some tasks which are handled by compiler flags on many other systems • parallel compiler names differ for SMP, message-passing, and combined parallelization methods

  20. Compilers (cont’d)

                      Serial  MPI      OpenMP   Mixed
          Fortran 77  xlf     mpxlf    xlf_r    mpxlf_r
          Fortran 90  xlf90   mpxlf90  xlf90_r  mpxlf90_r
          Fortran 95  xlf95   mpxlf95  xlf95_r  mpxlf95_r
          C           cc      mpcc     cc_r     mpcc_r
                      xlc     mpxlc    xlc_r    mpxlc_r
          C++         xlC     mpCC     xlC_r    mpCC_r

      gcc and g++ are also available

  21. Compilers (3) • xlc default flags • -qalias=ansi • optimizer assumes that pointers can only point to an object of the same type (potentially better optimization) • -qlanglvl=ansi • ANSI C • -qro • string literals (e.g., char *p = "mystring";) placed in “read-only” memory (text segment); cannot be modified

  22. Compilers (4) • xlc default flags (cont’d) -qroconst • constants placed in read-only memory
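Code that reads an object through a pointer of a different type can break when the optimizer makes the -qalias=ansi assumption described above. A minimal sketch of one common way to avoid that in C is shown here; the function name is hypothetical, and the example assumes a 4-byte IEEE 754 `float` and a compiler that provides `<stdint.h>`.

```c
#include <stdint.h>
#include <string.h>

/* Risky under type-based alias analysis: the float object is read through
   a uint32_t lvalue, which the optimizer may assume never happens.

       uint32_t bits = *(uint32_t *)&value;
*/

/* Safer: copy the bytes. memcpy is allowed to access any object,
   so no aliasing assumption is violated. */
uint32_t float_bits(float value)
{
    uint32_t bits;
    memcpy(&bits, &value, sizeof bits);
    return bits;
}
```

If a code base relies on casts like the commented-out line, compiling with cc (or passing -qalias=extended explicitly) is the conservative choice until the casts are removed.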

  23. Compilers (5) • cc default flags • -qalias=extended • optimizer assumes that pointers may point to any object whose address is taken, regardless of type (potentially weaker optimization) • -qlanglvl=extended • extended (not ANSI) C • “compatibility with the RT compiler and classic language levels” • -qnoro • string literals (e.g., char *p = "mystring";) can be modified • may use more memory than -qro

  24. Compilers (6) • cc default flags (cont’d) • -qnoroconst • constants not placed in read-only memory
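Because xlc and cc differ on whether string literals are writable, code that modifies a literal in place behaves differently depending on which compiler name is used. The sketch below shows one way to write C that does not depend on -qro or -qnoro: treat the literal as constant and modify a copy. The function names are hypothetical.

```c
#include <stddef.h>
#include <string.h>

/* Read-only use: return the literal through a const pointer so that an
   accidental write is rejected at compile time. */
const char *fixed_label(void)
{
    return "mystring";
}

/* Writable use: copy the literal into caller-owned storage and change
   the copy. Returns 0 on success, -1 if the buffer is missing or too small. */
int capitalized_label(char *buffer, size_t size)
{
    const char *source = fixed_label();
    size_t needed = strlen(source) + 1;   /* include the terminating NUL */

    if (buffer == NULL || size < needed)
        return -1;
    memcpy(buffer, source, needed);
    buffer[0] = 'M';                      /* modifies the copy, not the literal */
    return 0;
}
```

A caller would declare something like `char buf[16];` and pass `buf` and `sizeof buf`; the literal itself is never written, so the result is the same under either default.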

  25. Default Fortran Suffixes

          Compiler  Default suffix
          xlf       .f
          xlf90     .f
          f90       .f90
          xlf95     .f
          f95       .f
          mpxlf     .f
          mpxlf90   .f90
          mpxlf95   .f

      Same except for suffix

  26. Compiler flags • Specify source file suffix -qsuffix=f=f90 (lets you use xlf90 with .f90 suffix) • 64-bit • -q64 • use if you need more than 2GB

  27. flags cont’d • Presently a foible on twister (Regatta) • if compiling with -q64 and using MPI, must compile with mp…_r compiler, even if you’re not using SMP parallelization

  28. flags (3) • IBM optimization levels • -O (basic optimization) • -O2 (same as -O) • -O3 (more aggressive optimization) • -O4 (even more aggressive optimization; optimize for current architecture; IPA) • -O5 (aggressive IPA)

  29. flags (4) • If using -O3 or below, can optimize for local hardware (done automatically for -O4 and -O5): • -qarch=auto (optimize for resident architecture) • -qtune=auto (optimize for resident processor) • -qcache=auto (optimize for resident cache)

  30. flags (5) • If you’re using IPA and you get warnings about partition sizes, try -qipa=partition=large • default data segment limit 256MB • data segment contains static, common, and allocatable variables and arrays • can increase limit to a maximum of 2GB with 32-bit compilation -bmaxdata:0x80000000 • can use more than 2GB data with -q64
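The three size ranges on this slide can be turned into a quick check of which build a given data set calls for. The sketch below is only that arithmetic: the enum and function names are hypothetical, the thresholds are the 256MB and 2GB figures quoted above, and it ignores everything else that shares the data segment.

```c
/* Limits taken from the slide. */
#define DEFAULT_DATA_LIMIT 0x10000000ULL  /* 256 MB default data segment */
#define MAX_32BIT_DATA     0x80000000ULL  /* 2 GB, the -bmaxdata maximum  */

enum data_build {
    DATA_DEFAULT,   /* fits under the default limit            */
    DATA_BMAXDATA,  /* 32-bit build with a raised -bmaxdata    */
    DATA_Q64        /* needs a 64-bit build (-q64)             */
};

/* Classify a static/common/allocatable data size in bytes. */
enum data_build classify_data_size(unsigned long long data_bytes)
{
    if (data_bytes <= DEFAULT_DATA_LIMIT)
        return DATA_DEFAULT;
    if (data_bytes <= MAX_32BIT_DATA)
        return DATA_BMAXDATA;
    return DATA_Q64;
}
```

For example, a 10000 x 10000 array of 8-byte reals is 800,000,000 bytes, which exceeds 256MB but fits under 2GB, so it falls in the -bmaxdata range; a 20000 x 20000 array is 3,200,000,000 bytes and needs -q64.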

  31. flags (6) • -O5 does not include function inlining • function inlining flags: • -Q (compiler decides what functions to inline) • -Q+func1:func2 (only inline specified functions) • -Q -Q-func1:func2 (let compiler decide, but do not inline specified functions)

  32. Libraries

  33. Scientific Libraries • Contain • Linear Algebra Subprograms • Matrix Operations • Linear Algebraic Equations • Eigensystem Analysis • Fourier Transforms, Convolutions and Correlations, and Related Computations • Sorting and Searching • Interpolation • Numerical Quadrature • Random Number Generation

  34. Scientific Libs. Cont’d • Documentation - go to IBM Repository: http://scv.bu.edu/SCV/IBMSP/ • click on Libraries • ESSLSMP • for use with “SMP processors” (that’s us) • some serial, some parallel • parallel versions use multiple threads • thread safe; serial versions may be called within multithreaded regions (or on single thread) • link with -lesslsmp

  35. Scientific Libs. (3) • PESSLSMP • message-passing (MPI, PVM) -lpesslsmp -lesslsmp -lblacssmp

  36. Fast Math • MASS library • Mathematical Acceleration SubSystem • faster versions of some Fortran intrinsic functions • sqrt, rsqrt, exp, log, sin, cos, tan, atan, atan2, sinh, cosh, tanh, dnint, x**y • work with Fortran or C • differ from standard functions in last bit (at most)

  37. Fast Math (cont’d) • simply link to mass library: • Fortran: -lmass • C: -lmass -lm • sample approx. speedups: exp 2.4; log 1.6; sin 2.2; complex atan 4.7

  38. Fast Math (3) • Vector routines offer even more speedup, but require minor code changes • link to -lmassv • subroutine calls • prefix name with vs for 4-byte reals (single precision) and v for 8-byte reals (double precision)

  39. Fast Math (4) • example: single-precision exponential call vsexp(y,x,n) • x is the input vector of length n • y is the output vector of length n • sample speedups (single / double): exp 9.7 / 6.7; log 12.3 / 10.4; sin 10.0 / 9.8; complex atan 16.7 / 16.5

  40. Fast Math (5) • For details see the following file on hal: file:/usr/lpp/mass/MASS.readme

  41. MPI

  42. MPI • MPI works differently on IBM than on other systems • first compile code using compiler with mp prefix, e.g., mpcc • this automatically links to MPI libraries; do not use -lmpi

  43. POE • Parallel Operating Environment • controls parallel operation, including running MPI code

  44. Running MPI Code • Do not use mpirun! • poe mycode -procs 4 • file re-direction: poe mycode < myin > myout -procs 4 • note: no quotes • a useful flag: -labelio yes labels output with process number (0, 1, 2, …) • also setenv MP_LABELIO yes

  45. OpenMP

  46. SMP Compilation • OpenMP • append _r to compiler name • use -qsmp=omp flag • SGI: f77 -mp mycode.f • IBM: xlf_r -qsmp=omp mycode.f • Automatic parallelization • SGI: f77 -apo mycode.f • IBM: xlf_r -qsmp mycode.f

  47. SMP Compilation cont’d • Listing files for auto-parallelization SGI: f77 -apo list mycode.f IBM: xlf_r -qsmp -qreport=smplist mycode.f

  48. SMP Environment • Per-thread stack limit • default 4MB • can be increased by using environment variable setenv XLSMPOPTS $XLSMPOPTS\:stack=size where size is the new size limit in bytes

  49. Running SMP • Running is the same as on other systems, e.g.,

          #!/bin/tcsh
          setenv OMP_NUM_THREADS 4
          mycode < myin > myout
          exit
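For a concrete program to run this way, here is a minimal OpenMP sketch in C. The function names are hypothetical. The `_OPENMP` guards are a standard idiom that lets the same source build as a serial code when OpenMP is not enabled; with OpenMP enabled (for instance with an `_r` compiler name and -qsmp=omp, following the compiler table earlier), the loop is shared among the threads requested through OMP_NUM_THREADS.

```c
#ifdef _OPENMP
#include <omp.h>   /* declares omp_get_max_threads() */
#endif

/* Number of threads a parallel region may use; 1 in a serial build. */
int available_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

/* Sum of i*i for i = 1..n. The reduction clause gives each thread a
   private partial sum and combines them at the end of the loop. */
double sum_of_squares(int n)
{
    double total = 0.0;
    int i;

#pragma omp parallel for reduction(+:total)
    for (i = 1; i <= n; i++)
        total += (double)i * (double)i;

    return total;
}
```

Without OpenMP the pragma is ignored and the loop runs on one thread, so the function returns the same value either way; that makes it a convenient check that a threaded build is producing correct results.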

  50. OpenMP functions • On IBM, must declare OpenMP Fortran functions integer OMP_GET_NUM_THREADS (not necessary on SGI)
